TY - JOUR
T1 - Double-target self-supervised clustering with multi-feature fusion for medical question texts
AU - Shen, Xifeng
AU - Sun, Yuanyuan
AU - Zhang, Chunxia
AU - Yang, Cheng
AU - Qin, Yi
AU - Zhang, Weining
AU - Nan, Jiale
AU - Che, Meiling
AU - Gao, Dongping
N1 - Publisher Copyright: © 2024 Shen et al. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Background. To enrich the information represented in question texts and to construct an end-to-end text clustering model, we propose a double-target self-supervised clustering model with multi-feature fusion (MF-DSC) for texts that describe questions related to the medical field. Since medical question-and-answer data are unstructured texts characterized by short length and irregular language use, the features extracted by a single model cannot fully characterize the text content. Methods. First, word weights were obtained based on term frequency, and word vectors were generated according to lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the deep learning model. Meanwhile, a self-attention mechanism was introduced to calculate the weight of each word in the question text, i.e., the interactions between words. To learn to fuse cross-document topic features and build an end-to-end text clustering model, two target functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped to learn a clustering-friendly representation. We then conducted comparison experiments with five other models to verify the effectiveness of MF-DSC. Results. MF-DSC outperformed the other models in normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC) and F1, with scores of 0.4346, 0.4934, 0.8649 and 0.5737, respectively.
AB - Background. To enrich the information represented in question texts and to construct an end-to-end text clustering model, we propose a double-target self-supervised clustering model with multi-feature fusion (MF-DSC) for texts that describe questions related to the medical field. Since medical question-and-answer data are unstructured texts characterized by short length and irregular language use, the features extracted by a single model cannot fully characterize the text content. Methods. First, word weights were obtained based on term frequency, and word vectors were generated according to lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the deep learning model. Meanwhile, a self-attention mechanism was introduced to calculate the weight of each word in the question text, i.e., the interactions between words. To learn to fuse cross-document topic features and build an end-to-end text clustering model, two target functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped to learn a clustering-friendly representation. We then conducted comparison experiments with five other models to verify the effectiveness of MF-DSC. Results. MF-DSC outperformed the other models in normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC) and F1, with scores of 0.4346, 0.4934, 0.8649 and 0.5737, respectively.
KW - Clustering
KW - Medical question text
KW - Multi-feature fusion
KW - Self-supervised
UR - http://www.scopus.com/inward/record.url?scp=85199277718&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.2075
DO - 10.7717/peerj-cs.2075
M3 - Article
AN - SCOPUS:85199277718
SN - 2376-5992
VL - 10
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e2075
ER -