Double-target self-supervised clustering with multi-feature fusion for medical question texts

Xifeng Shen; Yuanyuan Sun; Chunxia Zhang; Cheng Yang; Yi Qin; Weining Zhang; Jiale Nan; Meiling Che; Dongping Gao

doi:10.7717/peerj-cs.2075

Xifeng Shen, Yuanyuan Sun, Chunxia Zhang, Cheng Yang, Yi Qin, Weining Zhang, Jiale Nan, Meiling Che, Dongping Gao^*

^*Corresponding author for this work

School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Background. To make the question text represent more information and to construct an end-to-end text clustering model, we propose a double-target self-supervised clustering method with multi-feature fusion (MF-DSC) for texts that describe medical questions. Because medical question-and-answer data are unstructured, short, and irregular in language use, the features extracted by a single model cannot fully characterize the text content. Methods. First, word weights were obtained from term frequency, and word vectors were generated from lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the deep learning model. A self-attention mechanism was also introduced to calculate the weight of each word in the question text, i.e., the interactions between words. To learn to fuse cross-document topic features and build an end-to-end text clustering model, two target functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped to learn a representation that facilitates text clustering. Finally, we conducted comparison experiments with five other models to verify the effectiveness of MF-DSC. Results. MF-DSC outperformed the other models in normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC), and F1, with scores of 0.4346, 0.4934, 0.8649, and 0.5737, respectively.
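
As a rough illustration of the fusion step described in the abstract (term-frequency weights combined with word vectors), the following minimal Python sketch may help. It is not the authors' implementation: the abstract does not specify the weighting formula or the embedding model, so this sketch assumes relative term frequency within a single question as the weight and uses made-up three-dimensional embeddings. The names `tf_weights`, `weighted_word_vectors`, and `EMBEDDINGS` are hypothetical.

```python
from collections import Counter


def tf_weights(tokens):
    """Relative term frequency of each distinct token in one question.

    Assumption: the weight is count / question length; the paper may use
    a different term-frequency-based scheme.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}


def weighted_word_vectors(tokens, embeddings):
    """Scale each token's embedding by its term-frequency weight."""
    weights = tf_weights(tokens)
    return [[weights[tok] * x for x in embeddings[tok]] for tok in tokens]


# Toy, made-up embeddings standing in for real lexical-semantic vectors.
EMBEDDINGS = {
    "chest": [1.0, 0.0, 2.0],
    "pain": [0.0, 4.0, 2.0],
    "cause": [2.0, 2.0, 0.0],
}

question = ["chest", "pain", "pain", "cause"]
fused = weighted_word_vectors(question, EMBEDDINGS)

for token, vector in zip(question, fused):
    print(token, vector)
```

In this sketch, a token that occurs more often in the question ("pain") keeps more of its embedding magnitude than tokens that occur once, which is one simple way to let frequency information modulate the semantic vectors before they are passed to a downstream model.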

Original language	English
Article number	e2075
Journal	PeerJ Computer Science
Volume	10
DOI	https://doi.org/10.7717/peerj-cs.2075
Publication status	Published - 2024

Access to document

10.7717/peerj-cs.2075

Cite this

Shen, X., Sun, Y., Zhang, C., Yang, C., Qin, Y., Zhang, W., Nan, J., Che, M., & Gao, D. (2024). Double-target self-supervised clustering with multi-feature fusion for medical question texts. PeerJ Computer Science, 10, Article e2075. https://doi.org/10.7717/peerj-cs.2075

@article{be55a251bed841feae00c78f94cbb092,
title = "Double-target self-supervised clustering with multi-feature fusion for medical question texts",
abstract = "Background. To make the question text represent more information and to construct an end-to-end text clustering model, we propose a double-target self-supervised clustering method with multi-feature fusion (MF-DSC) for texts that describe medical questions. Because medical question-and-answer data are unstructured, short, and irregular in language use, the features extracted by a single model cannot fully characterize the text content. Methods. First, word weights were obtained from term frequency, and word vectors were generated from lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the deep learning model. A self-attention mechanism was also introduced to calculate the weight of each word in the question text, i.e., the interactions between words. To learn to fuse cross-document topic features and build an end-to-end text clustering model, two target functions, $L_{cluster}$ and $L_{topic}$, were constructed and integrated into a unified clustering framework, which also helped to learn a representation that facilitates text clustering. Finally, we conducted comparison experiments with five other models to verify the effectiveness of MF-DSC. Results. MF-DSC outperformed the other models in normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC), and F1, with scores of 0.4346, 0.4934, 0.8649, and 0.5737, respectively.",
keywords = "Clustering, Medical question text, Multi-feature fusion, Self-supervised",
author = "Xifeng Shen and Yuanyuan Sun and Chunxia Zhang and Cheng Yang and Yi Qin and Weining Zhang and Jiale Nan and Meiling Che and Dongping Gao",
year = "2024",
doi = "10.7717/peerj-cs.2075",
language = "English",
volume = "10",
journal = "PeerJ Computer Science",
issn = "2376-5992",
publisher = "PeerJ Inc.",
}

TY  - JOUR
T1  - Double-target self-supervised clustering with multi-feature fusion for medical question texts
AU  - Shen, Xifeng
AU  - Sun, Yuanyuan
AU  - Zhang, Chunxia
AU  - Yang, Cheng
AU  - Qin, Yi
AU  - Zhang, Weining
AU  - Nan, Jiale
AU  - Che, Meiling
AU  - Gao, Dongping
PY  - 2024
Y1  - 2024
AB  - Background. To make the question text represent more information and to construct an end-to-end text clustering model, we propose a double-target self-supervised clustering method with multi-feature fusion (MF-DSC) for texts that describe medical questions. Because medical question-and-answer data are unstructured, short, and irregular in language use, the features extracted by a single model cannot fully characterize the text content. Methods. First, word weights were obtained from term frequency, and word vectors were generated from lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the deep learning model. A self-attention mechanism was also introduced to calculate the weight of each word in the question text, i.e., the interactions between words. To learn to fuse cross-document topic features and build an end-to-end text clustering model, two target functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped to learn a representation that facilitates text clustering. Finally, we conducted comparison experiments with five other models to verify the effectiveness of MF-DSC. Results. MF-DSC outperformed the other models in normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC), and F1, with scores of 0.4346, 0.4934, 0.8649, and 0.5737, respectively.
KW  - Clustering
KW  - Medical question text
KW  - Multi-feature fusion
KW  - Self-supervised
UR  - http://www.scopus.com/inward/record.url?scp=85199277718&partnerID=8YFLogxK
U2  - 10.7717/peerj-cs.2075
DO  - 10.7717/peerj-cs.2075
M3  - Article
AN  - SCOPUS:85199277718
SN  - 2376-5992
VL  - 10
JO  - PeerJ Computer Science
JF  - PeerJ Computer Science
M1  - e2075
ER  -
