Abstract
Proper noun recognition is a sub-task of named entity recognition, yet few methods have been developed specifically for Chinese. One reason is that most existing deep clustering methods rely on manually labeled training sets, which makes the learning process time-consuming. In addition, because specialized domains are broad and large in scale and Chinese text lacks explicit word boundaries, recognizing Chinese specialized terms in unstructured text remains challenging. In this paper, we design an unsupervised method to improve Chinese proper noun recognition. The method first performs Chinese word segmentation, then applies an improved BERT-based word representation method to obtain word vectors, and finally uses an autoencoder-based deep clustering method to extract proper nouns from books. Comparison experiments on a public dataset and on our own selected professional book data show that our method improves both accuracy and F1 score.
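To make the final clustering step more concrete, the following is a minimal sketch of grouping word vectors with a Gaussian mixture model (GMM, one of the listed keywords). It is not the paper's implementation: the abstract describes an autoencoder-based deep clustering method on BERT-derived vectors, whereas this sketch fits a plain spherical GMM with EM directly on hand-made 2-D toy vectors. The function name `fit_gmm`, the example words, and all vector values are assumptions introduced purely for illustration.

```python
import math

def fit_gmm(points, k=2, iters=50):
    """Fit a spherical Gaussian mixture with EM and return hard cluster labels."""
    n, dim = len(points), len(points[0])
    # Deterministic initialisation: take k points spread across the input order.
    means = [list(points[(j * n) // k]) for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for numerical stability.
        resp = []
        for p in points:
            logs = []
            for j in range(k):
                sq = sum((p[d] - means[j][d]) ** 2 for d in range(dim))
                logs.append(
                    math.log(weights[j])
                    - 0.5 * dim * math.log(2 * math.pi * variances[j])
                    - sq / (2 * variances[j])
                )
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: update weights, means, and per-component variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = max(nj / n, 1e-12)
            means[j] = [
                sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                for d in range(dim)
            ]
            sq = sum(
                r[j] * sum((p[d] - means[j][d]) ** 2 for d in range(dim))
                for r, p in zip(resp, points)
            )
            variances[j] = max(sq / (nj * dim), 1e-6)
    # Assign each point to the component with the highest responsibility.
    return [max(range(k), key=lambda j: r[j]) for r in resp]

# Toy stand-ins for word vectors (assumption: real vectors would come from a
# BERT-based encoder after word segmentation and would have far more dimensions).
words = ["卷积神经网络", "支持向量机", "梯度下降", "我们", "然后", "因此"]
vectors = [
    (0.9, 1.0), (1.1, 0.9), (1.0, 1.2),        # term-like words
    (-1.0, -0.9), (-0.8, -1.1), (-1.2, -1.0),  # common words
]

labels = fit_gmm(vectors, k=2)
for word, label in zip(words, labels):
    print(word, label)
```

In a fuller pipeline along the lines the abstract describes, the toy vectors would be replaced by learned representations, and one of the resulting clusters would be interpreted as the proper-noun group; how that cluster is selected is not specified in the abstract.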
Original language | English |
---|---|
Pages (from-to) | 57-62 |
Number of pages | 6 |
Journal | Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE |
Volume | 2023-July |
Publication status | Published - 2023 |
Event | 35th International Conference on Software Engineering and Knowledge Engineering, SEKE 2023 - Hybrid, San Francisco, United States. Duration: 1 Jul 2023 → 10 Jul 2023 |
Keywords
- BERT
- Deep clustering
- GMM
- Proper noun recognition