TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing

Mucheng Ren; Heyan Huang; Yuxiang Zhou; Qianwen Cao; Yuan Bu; Yang Gao

School of Computer Science

Research output: Contribution to conference › Paper › peer-review

4 Citations (Scopus)

Abstract

Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms hidden in the clinical record written in free text. Prior studies have shown that this system can be informationized and intelligentized with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets are of neither sufficient quality nor sufficient quantity to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and we introduce the first public large-scale benchmark for SD, called TCM-SD. Our benchmark contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.

Original language	English
Pages	908-920
Number of pages	13
Publication status	Published - 2022
Event	21st Chinese National Conference on Computational Linguistics, CCL 2022 - Nanchang, China. Duration: 14 Oct 2022 → 16 Oct 2022

Conference

Conference	21st Chinese National Conference on Computational Linguistics, CCL 2022
Country/Territory	China
City	Nanchang
Period	14/10/22 → 16/10/22

Cite this

Ren, M., Huang, H., Zhou, Y., Cao, Q., Bu, Y., & Gao, Y. (2022). TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing. 908-920. Paper presented at the 21st Chinese National Conference on Computational Linguistics, CCL 2022, Nanchang, China.

@conference{9c419e2b7e3e449089975553cf51a1b4,
title = "TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing",
abstract = "Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms hidden in the clinical record written in free text. Prior studies have shown that this system can be informationized and intelligentized with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets are of neither sufficient quality nor sufficient quantity to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and we introduce the first public large-scale benchmark for SD, called TCM-SD. Our benchmark contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.",
author = "Mucheng Ren and Heyan Huang and Yuxiang Zhou and Qianwen Cao and Yuan Bu and Yang Gao",
note = "Publisher Copyright: {\textcopyright} 2022 China National Conference on Computational Linguistics. Published under Creative Commons Attribution 4.0 International License.; 21st Chinese National Conference on Computational Linguistics, CCL 2022; Conference date: 14-10-2022 through 16-10-2022",
year = "2022",
language = "English",
pages = "908--920",
}

TY - CONF

T1 - TCM-SD

T2 - 21st Chinese National Conference on Computational Linguistics, CCL 2022

AU - Ren, Mucheng

AU - Huang, Heyan

AU - Zhou, Yuxiang

AU - Cao, Qianwen

AU - Bu, Yuan

AU - Gao, Yang

PY - 2022

Y1 - 2022

AB - Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms hidden in the clinical record written in free text. Prior studies have shown that this system can be informationized and intelligentized with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets are of neither sufficient quality nor sufficient quantity to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and we introduce the first public large-scale benchmark for SD, called TCM-SD. Our benchmark contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.

UR - http://www.scopus.com/inward/record.url?scp=85146367612&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85146367612

SP - 908

EP - 920

Y2 - 14 October 2022 through 16 October 2022

ER -
