TY - GEN
T1 - Application of Simhash Algorithm Based on Bucket Index in Deduplication of Privacy Data
AU - Xiahui, Zheng
AU - Zhongying, Niu
AU - Licheng, Wang
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Large language models (LLMs) are the basis of many natural language processing tasks. An LLM is a statistical model that predicts the probability of the next word. Large models based on modern neural networks have very large numbers of parameters and are trained on massive data sets. This scale improves their ability to generate fluent natural language and lets them be applied to other tasks without parameter updates. However, the large amount of private information in the data set makes it easier for large models to memorize and reproduce that information, which leads to privacy leakage. The traditional SimHash algorithm performs well at retrieving similar information in a data set, but on large-scale data sets it suffers from long retrieval times and low efficiency. This paper proposes a SimHash algorithm based on bucket indexing. Constructing bucket indexes over SimHash values reduces the number of unnecessary SimHash value comparisons and increases the probability of retrieving similar text. Experimental results show that this method improves the retrieval efficiency of similar private information in large-scale data sets without affecting model performance.
KW - Large Language Model
KW - Privacy Deduplication
KW - Privacy Leakage
KW - SimHash
UR - https://www.scopus.com/pages/publications/105007714023
U2 - 10.1109/EESPE63401.2025.10987512
DO - 10.1109/EESPE63401.2025.10987512
M3 - Conference contribution
AN - SCOPUS:105007714023
T3 - 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025
SP - 6
EP - 9
BT - 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Electronics, Energy Systems and Power Engineering, EESPE 2025
Y2 - 17 March 2025 through 19 March 2025
ER -