TY - GEN
T1 - SynBERT
T2 - 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022
AU - Zeng, Lingze
AU - Yao, Chang
AU - Zhang, Meihui
AU - Xie, Zhongle
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to Chinese, because taking segmented words as the input unit leads to serious out-of-vocabulary (OOV) problems. Additionally, large-scale raw clinical text is hard to obtain in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes each term's embedding from its characters' embeddings to address the OOV problem. SynBERT also uses a BERT model pre-trained on a large-scale public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as K-means, L2C, and SynSetMine.
AB - Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to Chinese, because taking segmented words as the input unit leads to serious out-of-vocabulary (OOV) problems. Additionally, large-scale raw clinical text is hard to obtain in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes each term's embedding from its characters' embeddings to address the OOV problem. SynBERT also uses a BERT model pre-trained on a large-scale public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as K-means, L2C, and SynSetMine.
KW - Data mining
KW - Information extraction
KW - Information retrieval
UR - http://www.scopus.com/inward/record.url?scp=85151138868&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-25158-0_25
DO - 10.1007/978-3-031-25158-0_25
M3 - Conference contribution
AN - SCOPUS:85151138868
SN - 9783031251573
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 331
EP - 344
BT - Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings
A2 - Li, Bohan
A2 - Tao, Chuanqi
A2 - Yue, Lin
A2 - Han, Xuming
A2 - Calvanese, Diego
A2 - Amagasa, Toshiyuki
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 25 November 2022 through 27 November 2022
ER -