TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing

Mucheng Ren; Heyan Huang; Yuxiang Zhou; Qianwen Cao; Yuan Bu; Yang Gao

doi:10.1007/978-3-031-18315-7_16

Corresponding author: Heyan Huang

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms hidden in clinical records written in free text. Prior studies have shown that this system can be digitized and made intelligent with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets are of neither sufficient quality nor sufficient quantity to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and we introduce the first public large-scale benchmark for SD, called TCM-SD. Our benchmark contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.

Original language	English
Title of host publication	Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings
Editors	Maosong Sun, Yang Liu, Wanxiang Che, Yang Feng, Xipeng Qiu, Gaoqi Rao, Yubo Chen
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	247-263
Number of pages	17
ISBN (Print)	9783031183140
DOI	https://doi.org/10.1007/978-3-031-18315-7_16
Publication status	Published - 2022
Event	21st China National Conference on Computational Linguistics, CCL 2022 - Nanchang, China. Duration: 14 Oct 2022 → 16 Oct 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13603 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	21st China National Conference on Computational Linguistics, CCL 2022
Country/Territory	China
City	Nanchang
Period	14/10/22 → 16/10/22


Cite this

Ren, M., Huang, H., Zhou, Y., Cao, Q., Bu, Y., & Gao, Y. (2022). TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing. In M. Sun, Y. Liu, W. Che, Y. Feng, X. Qiu, G. Rao, & Y. Chen (Eds.), Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings (pp. 247-263). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13603 LNAI). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-18315-7_16

Ren, Mucheng ; Huang, Heyan ; Zhou, Yuxiang et al. / TCM-SD : A Benchmark for Probing Syndrome Differentiation via Natural Language Processing. Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings. editor / Maosong Sun ; Yang Liu ; Wanxiang Che ; Yang Feng ; Xipeng Qiu ; Gaoqi Rao ; Yubo Chen. Springer Science and Business Media Deutschland GmbH, 2022. pp. 247-263 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{41205571ffc1440092af7def63693806,

title = "TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing",

abstract = "Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient{\textquoteright}s symptoms hidden in the clinical record written in free text. Prior studies have shown that this system can be informationized and intelligentized with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets are not of sufficient quality nor quantity to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system—syndrome differentiation (SD)—and we introduce the first public large-scale benchmark for SD, called TCM-SD. Our benchmark contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and prove the potential of domain-specific pre-trained language model. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.",

keywords = "Bioinformatics, Natural language processing, Traditional Chinese medicine",

author = "Mucheng Ren and Heyan Huang and Yuxiang Zhou and Qianwen Cao and Yuan Bu and Yang Gao",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 21st China National Conference on Computational Linguistics, CCL 2022 ; Conference date: 14-10-2022 Through 16-10-2022",

year = "2022",

doi = "10.1007/978-3-031-18315-7_16",

language = "English",

isbn = "9783031183140",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "247--263",

editor = "Maosong Sun and Yang Liu and Wanxiang Che and Yang Feng and Xipeng Qiu and Gaoqi Rao and Yubo Chen",

booktitle = "Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings",

address = "Germany",

}

Ren, M, Huang, H, Zhou, Y, Cao, Q, Bu, Y & Gao, Y 2022, TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing. in M Sun, Y Liu, W Che, Y Feng, X Qiu, G Rao & Y Chen (eds), Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13603 LNAI, Springer Science and Business Media Deutschland GmbH, pp. 247-263, 21st China National Conference on Computational Linguistics, CCL 2022, Nanchang, China, 14/10/22. https://doi.org/10.1007/978-3-031-18315-7_16

TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing. / Ren, Mucheng; Huang, Heyan; Zhou, Yuxiang et al.
Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings. ed. / Maosong Sun; Yang Liu; Wanxiang Che; Yang Feng; Xipeng Qiu; Gaoqi Rao; Yubo Chen. Springer Science and Business Media Deutschland GmbH, 2022. p. 247-263 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13603 LNAI).

TY - GEN

T1 - TCM-SD

T2 - 21st China National Conference on Computational Linguistics, CCL 2022

AU - Ren, Mucheng

AU - Huang, Heyan

AU - Zhou, Yuxiang

AU - Cao, Qianwen

AU - Bu, Yuan

AU - Gao, Yang

PY - 2022

Y1 - 2022

AB - Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient’s symptoms hidden in the clinical record written in free text. Prior studies have shown that this system can be informationized and intelligentized with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets are not of sufficient quality nor quantity to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system—syndrome differentiation (SD)—and we introduce the first public large-scale benchmark for SD, called TCM-SD. Our benchmark contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and prove the potential of domain-specific pre-trained language model. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.

KW - Bioinformatics

KW - Natural language processing

KW - Traditional Chinese medicine

UR - http://www.scopus.com/inward/record.url?scp=85141744375&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-18315-7_16

DO - 10.1007/978-3-031-18315-7_16

M3 - Conference contribution

AN - SCOPUS:85141744375

SN - 9783031183140

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 247

EP - 263

BT - Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings

A2 - Sun, Maosong

A2 - Liu, Yang

A2 - Che, Wanxiang

A2 - Feng, Yang

A2 - Qiu, Xipeng

A2 - Rao, Gaoqi

A2 - Chen, Yubo

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 14 October 2022 through 16 October 2022

ER -

Ren M, Huang H, Zhou Y, Cao Q, Bu Y, Gao Y. TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing. In Sun M, Liu Y, Che W, Feng Y, Qiu X, Rao G, Chen Y, editors, Chinese Computational Linguistics - 21st China National Conference, CCL 2022, Proceedings. Springer Science and Business Media Deutschland GmbH. 2022. p. 247-263. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-18315-7_16
