
Application of Simhash Algorithm Based on Bucket Index in Deduplication of Privacy Data

  • Beijing Institute of Technology
  • Ltd.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large language models (LLMs) underpin many natural language processing tasks. An LLM is a statistical model that predicts the probability of the next word. Modern neural LLMs have a very large number of parameters and are trained on massive data sets. This scale improves their ability to generate fluent natural language and lets them be applied to a wide range of other tasks without updating parameters. However, the large amount of private information in these data sets makes it easier for the models to memorize and reproduce it, resulting in privacy leakage. The traditional SimHash algorithm performs well at retrieving similar information in a data set, but on large-scale data sets it suffers from long retrieval times and low efficiency. This paper proposes a SimHash algorithm based on bucket indexing. Constructing bucket indexes over SimHash values reduces the number of unnecessary SimHash comparisons and increases the probability of retrieving similar text. Experimental results show that the method improves the efficiency of retrieving similar private information in large-scale data sets without affecting model performance.
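The abstract does not specify how the buckets are built, so the following is only a minimal sketch of one common way to index SimHash fingerprints, not the paper's method. It assumes 64-bit fingerprints split into 4 bands of 16 bits, unweighted whitespace tokens hashed with MD5, and a Hamming-distance threshold of 3; the names `BucketIndex`, `band_keys`, and `MAX_DISTANCE` are hypothetical. By the pigeonhole principle, two fingerprints that differ in at most 3 bits must agree exactly on at least one of the 4 bands, so only documents sharing a bucket need a full comparison.

```python
import hashlib
from collections import defaultdict

BITS = 64                   # fingerprint width (assumption)
BANDS = 4                   # number of bucket bands (assumption)
BAND_BITS = BITS // BANDS   # 16 bits per band
MAX_DISTANCE = 3            # near-duplicate threshold; must be < BANDS


def simhash(text: str) -> int:
    """Unweighted 64-bit SimHash over lowercase whitespace tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the vote for that bit is positive.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def band_keys(fp: int):
    """Yield one (band number, band value) bucket key per band."""
    mask = (1 << BAND_BITS) - 1
    for band in range(BANDS):
        yield band, (fp >> (band * BAND_BITS)) & mask


class BucketIndex:
    """Hypothetical bucket index: maps each band key to the documents holding it."""

    def __init__(self):
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def add(self, doc_id: str, text: str) -> None:
        fp = simhash(text)
        self.fingerprints[doc_id] = fp
        for key in band_keys(fp):
            self.buckets[key].add(doc_id)

    def query(self, text: str) -> list:
        """Return ids of indexed documents within MAX_DISTANCE bits of the text."""
        fp = simhash(text)
        candidates = set()
        for key in band_keys(fp):
            candidates |= self.buckets.get(key, set())
        # Only candidates sharing a bucket get the full Hamming comparison.
        return sorted(d for d in candidates
                      if hamming(fp, self.fingerprints[d]) <= MAX_DISTANCE)
```

With this layout a query touches at most 4 buckets instead of scanning every stored fingerprint. The trade-off is that each fingerprint is stored 4 times, and the band count has to grow if a larger distance threshold is needed.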

Original language: English
Title of host publication: 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6-9
Number of pages: 4
ISBN (Electronic): 9798350389579
DOIs
Publication status: Published - 2025
Event: 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025 - Shenyang, China
Duration: 17 Mar 2025 - 19 Mar 2025

Publication series

Name: 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025

Conference

Conference: 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025
Country/Territory: China
City: Shenyang
Period: 17/03/25 - 19/03/25

Keywords

  • Large Language Model
  • Privacy Deduplication
  • Privacy Leakage
  • SimHash
