TY - GEN
T1 - TokenFree
T2 - 2024 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2024
AU - Yan, Ruiyi
AU - Song, Tian
AU - Yang, Yating
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Since tokenization serves as a fundamental preprocessing step in numerous language models, tokens naturally constitute the basic embedding units for generative linguistic steganography. However, tokenization-based methods face challenges including limited embedding capacity and possible segmentation ambiguity. Although character-level (one tokenization-free type) linguistic steganographic approaches exist, they face the problem of generating unknown or out-of-vocabulary words, potentially compromising steganographic imperceptibility. In this paper, we focus on both the embedding capacity and the imperceptibility of tokenization-free linguistic steganography. First, we suggest that unknown words mainly result from low-entropy distributions and rigid coding rules used in candidate pools, so we propose an entropy-based selection approach to flexibly construct candidate pools. Further, we present a lexical emphasis approach that prioritizes characters within candidate pools capable of forming in-vocabulary words. Experiments show that, across a range of high embedding rates, our approaches achieve considerably higher imperceptibility and text fluency, increase anti-steganalysis capacity by 14.4% on average, and in particular reduce the out-of-vocabulary rate by 88.7% on average, compared with existing state-of-the-art character-level steganographic methods.
AB - Since tokenization serves as a fundamental preprocessing step in numerous language models, tokens naturally constitute the basic embedding units for generative linguistic steganography. However, tokenization-based methods face challenges including limited embedding capacity and possible segmentation ambiguity. Although character-level (one tokenization-free type) linguistic steganographic approaches exist, they face the problem of generating unknown or out-of-vocabulary words, potentially compromising steganographic imperceptibility. In this paper, we focus on both the embedding capacity and the imperceptibility of tokenization-free linguistic steganography. First, we suggest that unknown words mainly result from low-entropy distributions and rigid coding rules used in candidate pools, so we propose an entropy-based selection approach to flexibly construct candidate pools. Further, we present a lexical emphasis approach that prioritizes characters within candidate pools capable of forming in-vocabulary words. Experiments show that, across a range of high embedding rates, our approaches achieve considerably higher imperceptibility and text fluency, increase anti-steganalysis capacity by 14.4% on average, and in particular reduce the out-of-vocabulary rate by 88.7% on average, compared with existing state-of-the-art character-level steganographic methods.
UR - https://www.scopus.com/pages/publications/85217834170
U2 - 10.1109/SMC54092.2024.10831652
DO - 10.1109/SMC54092.2024.10831652
M3 - Conference contribution
AN - SCOPUS:85217834170
T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
SP - 449
EP - 455
BT - 2024 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 October 2024 through 10 October 2024
ER -