SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT

Lingze Zeng, Chang Yao*, Meihui Zhang, Zhongle Xie

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to Chinese, where text must first be segmented into words and taking words as the input unit leads to serious Out-of-Vocabulary (OOV) problems. Additionally, it is hard to obtain large-scale raw clinical text in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes a term's embedding from the embeddings of its characters to address the OOV problems. SynBERT also uses a BERT model pre-trained on a large public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as Kmeans, L2C, and SynSetMine.
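As an illustration only, the two ideas in the abstract (character-composed term embeddings and a set-level classifier that drives clustering) might be sketched as follows. Everything here is an assumption made for the sketch: one-hot character vectors stand in for BERT character embeddings, a cosine threshold stands in for the trained binary classifier, and a single greedy pass stands in for the two-phase clustering algorithm, whose details the abstract does not give. The function names are hypothetical.

```python
import math
from collections import Counter


def term_embedding(term):
    """Compose a term vector as the mean of its character vectors.

    Assumption: each character is a sparse one-hot vector. This is a toy
    stand-in for pre-trained BERT character embeddings.
    """
    counts = Counter(term)
    return {ch: n / len(term) for ch, n in counts.items()}


def set_embedding(terms):
    """Embed a set of terms as the mean of its term vectors."""
    total = Counter()
    for term in terms:
        for ch, value in term_embedding(term).items():
            total[ch] += value
    return {ch: value / len(terms) for ch, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(ch, 0.0) for ch, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def same_set_score(set_a, set_b):
    """Hypothetical stand-in for the binary classifier over two term sets."""
    return cosine(set_embedding(set_a), set_embedding(set_b))


def greedy_cluster(terms, threshold=0.5):
    """Single greedy pass: join the best-scoring cluster or open a new one."""
    clusters = []
    for term in terms:
        scored = [(same_set_score(c, [term]), i) for i, c in enumerate(clusters)]
        best_score, best_index = max(scored, default=(0.0, -1))
        if best_index >= 0 and best_score >= threshold:
            clusters[best_index].append(term)
        else:
            clusters.append([term])
    return clusters


if __name__ == "__main__":
    sample_terms = ["心肌梗死", "糖尿病", "心肌梗塞", "2型糖尿病"]
    for cluster in greedy_cluster(sample_terms):
        print(cluster)
```

Because every term is built from its characters, a term never seen before still receives a vector, which is the property the abstract relies on to avoid OOV failures. Character overlap is of course not synonymy; in this toy setting the four sample terms simply group by shared characters, whereas a trained classifier would be expected to capture meaning.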

Original language: English
Title of host publication: Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings
Editors: Bohan Li, Chuanqi Tao, Lin Yue, Xuming Han, Diego Calvanese, Toshiyuki Amagasa
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 331-344
Number of pages: 14
ISBN (Print): 9783031251573
Publication status: Published - 2023
Event: 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022 - Nanjing, China
Duration: 25 Nov 2022 to 27 Nov 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13421 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022
Country/Territory: China
City: Nanjing
Period: 25/11/22 to 27/11/22

Keywords

  • Data mining
  • Information extraction
  • Information retrieval
