SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT

Lingze Zeng; Chang Yao; Meihui Zhang; Zhongle Xie

doi:10.1007/978-3-031-25158-0_25

Lingze Zeng, Chang Yao^*, Meihui Zhang, Zhongle Xie

^*Corresponding author for this work

School of Computer Science and Technology

Zhejiang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to segmented Chinese text, because taking words as the input unit leads to serious Out-of-Vocabulary (OOV) problems. Additionally, large-scale raw clinical text is hard to obtain in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes each term’s embedding from its characters’ embeddings to address the OOV problem. SynBERT uses a BERT model pre-trained on a large public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as K-means, L2C, and SynSetMine.
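The two ideas named in the abstract, building a term embedding from character embeddings and using a set-pair classifier to drive clustering, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The mean pooling, the cosine-threshold stand-in for the binary classifier, the single greedy pass (in place of the two-phase algorithm, whose phases the abstract does not describe), the toy three-dimensional character vectors, and all function names are assumptions made for illustration only; in SynBERT the character embeddings would come from a pre-trained BERT model.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Hypothetical toy character vectors; a real system would take these from BERT.
CHAR_VECTORS: Dict[str, Vector] = {
    "胃": [1.0, 0.0, 0.0],  # stomach
    "头": [0.0, 1.0, 0.0],  # head
    "痛": [0.0, 0.0, 1.0],  # pain
    "疼": [0.0, 0.0, 1.0],  # pain (variant character)
}


def compose_term_embedding(term: str, char_vectors: Dict[str, Vector], dim: int) -> Vector:
    """Mean-pool character vectors (assumed pooling); unknown characters count as zeros."""
    total = [0.0] * dim
    for ch in term:
        vec = char_vectors.get(ch, [0.0] * dim)
        for i in range(dim):
            total[i] += vec[i]
    n = max(len(term), 1)
    return [x / n for x in total]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def make_threshold_classifier(
    char_vectors: Dict[str, Vector], dim: int, threshold: float
) -> Callable[[List[str], List[str]], bool]:
    """Stand-in for the learned binary classifier over two term sets."""

    def set_embedding(term_set: List[str]) -> Vector:
        vecs = [compose_term_embedding(t, char_vectors, dim) for t in term_set]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def can_merge(set_a: List[str], set_b: List[str]) -> bool:
        return cosine(set_embedding(set_a), set_embedding(set_b)) >= threshold

    return can_merge


def greedy_cluster(
    terms: Sequence[str], can_merge: Callable[[List[str], List[str]], bool]
) -> List[List[str]]:
    """Single greedy pass: add each term to the first cluster the classifier accepts."""
    clusters: List[List[str]] = []
    for term in terms:
        for cluster in clusters:
            if can_merge(cluster, [term]):
                cluster.append(term)
                break
        else:
            clusters.append([term])  # no existing cluster accepted the term
    return clusters


can_merge = make_threshold_classifier(CHAR_VECTORS, dim=3, threshold=0.9)

if __name__ == "__main__":
    print(greedy_cluster(["胃痛", "头痛", "胃疼", "头疼"], can_merge))
```

Because every term is represented through its characters, a term that never appeared as a whole word still receives an embedding, which is the property the abstract relies on to avoid OOV problems.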

Original language	English
Title of host publication	Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings
Editors	Bohan Li, Chuanqi Tao, Lin Yue, Xuming Han, Diego Calvanese, Toshiyuki Amagasa
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	331-344
Number of pages	14
ISBN (Print)	9783031251573
DOIs	https://doi.org/10.1007/978-3-031-25158-0_25
Publication status	Published - 2023
Event	6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022 - Nanjing, China; Duration: 25 Nov 2022 → 27 Nov 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13421 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022
Country/Territory	China
City	Nanjing
Period	25/11/22 → 27/11/22

Keywords

Data mining
Information extraction
Information retrieval

Cite this

Zeng, L., Yao, C., Zhang, M., & Xie, Z. (2023). SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT. In B. Li, C. Tao, L. Yue, X. Han, D. Calvanese, & T. Amagasa (Eds.), Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings (pp. 331-344). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13421 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-25158-0_25

Zeng, Lingze ; Yao, Chang ; Zhang, Meihui et al. / SynBERT : Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT. Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings. editor / Bohan Li ; Chuanqi Tao ; Lin Yue ; Xuming Han ; Diego Calvanese ; Toshiyuki Amagasa. Springer Science and Business Media Deutschland GmbH, 2023. pp. 331-344 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{2578b6f8166f428899c509ae1ac0a7c7,

title = "SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT",

abstract = "Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to segmented Chinese text, because taking words as the input unit leads to serious Out-of-Vocabulary (OOV) problems. Additionally, large-scale raw clinical text is hard to obtain in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes each term{\textquoteright}s embedding from its characters{\textquoteright} embeddings to address the OOV problem. SynBERT uses a BERT model pre-trained on a large public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as K-means, L2C, and SynSetMine.",

keywords = "Data mining, Information extraction, Information retrieval",

author = "Lingze Zeng and Chang Yao and Meihui Zhang and Zhongle Xie",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022 ; Conference date: 25-11-2022 Through 27-11-2022",

year = "2023",

doi = "10.1007/978-3-031-25158-0_25",

language = "English",

isbn = "9783031251573",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "331--344",

editor = "Bohan Li and Chuanqi Tao and Lin Yue and Xuming Han and Diego Calvanese and Toshiyuki Amagasa",

booktitle = "Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings",

address = "Germany",

}

Zeng, L, Yao, C, Zhang, M & Xie, Z 2023, SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT. in B Li, C Tao, L Yue, X Han, D Calvanese & T Amagasa (eds), Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13421 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 331-344, 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022, Nanjing, China, 25/11/22. https://doi.org/10.1007/978-3-031-25158-0_25

SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT. / Zeng, Lingze; Yao, Chang; Zhang, Meihui et al.
Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings. ed. / Bohan Li; Chuanqi Tao; Lin Yue; Xuming Han; Diego Calvanese; Toshiyuki Amagasa. Springer Science and Business Media Deutschland GmbH, 2023. p. 331-344 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13421 LNCS).

TY - GEN

T1 - SynBERT

T2 - 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022

AU - Zeng, Lingze

AU - Yao, Chang

AU - Zhang, Meihui

AU - Xie, Zhongle

PY - 2023

Y1 - 2023

N2 - Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to segmented Chinese text, because taking words as the input unit leads to serious Out-of-Vocabulary (OOV) problems. Additionally, large-scale raw clinical text is hard to obtain in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes each term’s embedding from its characters’ embeddings to address the OOV problem. SynBERT uses a BERT model pre-trained on a large public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as K-means, L2C, and SynSetMine.

AB - Discovering medical synonym sets (i.e., sets of terms referring to a similar medical concept) is an important real-world task that can benefit many downstream applications, such as medical information retrieval systems and clinical decision support systems. Recent synonym discovery methods take words as the input unit and leverage raw text as contextual information. However, they are ill-suited to segmented Chinese text, because taking words as the input unit leads to serious Out-of-Vocabulary (OOV) problems. Additionally, large-scale raw clinical text is hard to obtain in the medical domain because of privacy and security concerns. Therefore, we define a new task, discovering Chinese synonyms from Privacy-Constrain terms (i.e., only terms, without a raw corpus), and propose a framework, SynBERT, to solve it. SynBERT consists of a binary classifier, which infers whether two term sets can form a synonym set, and a two-phase clustering algorithm, which applies the classifier to cluster the given terms into different synonym sets. In particular, SynBERT composes each term’s embedding from its characters’ embeddings to address the OOV problem. SynBERT uses a BERT model pre-trained on a large public corpus to eliminate the need for raw context information. According to our experiments, SynBERT outperforms baseline methods such as K-means, L2C, and SynSetMine.

KW - Data mining

KW - Information extraction

KW - Information retrieval

UR - http://www.scopus.com/inward/record.url?scp=85151138868&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-25158-0_25

DO - 10.1007/978-3-031-25158-0_25

M3 - Conference contribution

AN - SCOPUS:85151138868

SN - 9783031251573

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 331

EP - 344

BT - Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings

A2 - Li, Bohan

A2 - Tao, Chuanqi

A2 - Yue, Lin

A2 - Han, Xuming

A2 - Calvanese, Diego

A2 - Amagasa, Toshiyuki

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 25 November 2022 through 27 November 2022

ER -

Zeng L, Yao C, Zhang M, Xie Z. SynBERT: Chinese Synonym Discovery on Privacy-Constrain Medical Terms with Pre-trained BERT. In Li B, Tao C, Yue L, Han X, Calvanese D, Amagasa T, editors, Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings. Springer Science and Business Media Deutschland GmbH. 2023. p. 331-344. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-25158-0_25