TY - GEN
T1 - An Improved Topic Extraction Method Based on Word Frequency Information Entropy for Multilingual Topic Attentional Division
AU - Yuan, Yue
AU - Zhang, Huaping
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In the contemporary era of ubiquitous global information dissemination, a myriad of news articles are generated worldwide on a daily basis. The topics that capture the attention of different countries diverge due to variances in culture, values, and other influential factors. Analyzing these discrepancies in topic preferences across languages within specific timeframes holds paramount importance for comprehensively understanding and delineating the nuances of diverse national cultures. This paper proposes a novel statistical analysis methodology for extracting multi-language news topic keywords, leveraging the concept of word frequency information entropy. Our approach facilitates the identification of shared topics across different languages, as well as language-specific concerns, within extensive news datasets. Furthermore, we address a prevalent challenge encountered in existing topic modeling methodologies, namely output redundancy. Through the aggregation of synonymous terms, we effectively alleviate redundancy, thereby enhancing the quality of extracted topic keywords. Experimental evaluations are conducted on a meticulously collected multinational news dataset, wherein we assess the effectiveness of our approach in partitioning common and language-specific focus topics across multiple languages, while also quantifying the efficacy of redundancy elimination.
KW - BERTopic
KW - Entropy
KW - mT5
KW - Topic Attentional Division
UR - http://www.scopus.com/inward/record.url?scp=85211495197&partnerID=8YFLogxK
U2 - 10.1109/ICSP62122.2024.10743506
DO - 10.1109/ICSP62122.2024.10743506
M3 - Conference contribution
AN - SCOPUS:85211495197
T3 - 2024 9th International Conference on Intelligent Computing and Signal Processing, ICSP 2024
SP - 675
EP - 681
BT - 2024 9th International Conference on Intelligent Computing and Signal Processing, ICSP 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th International Conference on Intelligent Computing and Signal Processing, ICSP 2024
Y2 - 19 April 2024 through 21 April 2024
ER -