The discovery of natural typing annotations: User-produced potential Chinese word delimiters

Dakui Zhang; Yu Mao; Yang Liu; Hanshi Wang; Chuyuan Wei; Shiping Tang

doi:10.3115/v1/p15-2109

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Human-labeled corpora are indispensable for training supervised word segmenters. However, labeling a corpus manually is time-consuming and labor-intensive. When typing Chinese text with Pinyin, people usually need to press the space bar or a numeric key to choose among homophonous words, and each such keystroke can be viewed as a cue for segmentation. We argue that this process can be used to build a labeled corpus in a more natural way. In this paper, we therefore investigate Natural Typing Annotations (NTAs), which are potential word delimiters produced by users while typing Chinese. A detailed analysis of over three hundred user-produced texts containing NTAs reveals that high-quality NTAs mostly agree with gold segmentation and, consequently, can be used to improve the performance of a supervised word segmentation model on out-of-domain text. Experiments show that a classification model combined with a voting mechanism can reliably identify high-quality NTA texts, which are a more readily available labeled corpus. Furthermore, NTAs might be particularly useful for handling out-of-vocabulary (OOV) words such as proper names and neologisms.
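The claim that NTAs "mostly agree with gold segmentation" can be made concrete with a small sketch. It is an illustration only, not the authors' procedure. It assumes each chunk a user commits with the space bar or a numeric key is available as a string, and it scores the share of NTA boundaries that fall on gold word boundaries. The function names `boundaries` and `nta_agreement` and the example sentence are hypothetical.

```python
from typing import List, Set


def boundaries(chunks: List[str]) -> Set[int]:
    """Character offsets where one chunk ends and the next begins."""
    cuts: Set[int] = set()
    pos = 0
    for chunk in chunks[:-1]:  # the end of the text is not a delimiter
        pos += len(chunk)
        cuts.add(pos)
    return cuts


def nta_agreement(nta_chunks: List[str], gold_words: List[str]) -> float:
    """Fraction of NTA boundaries that coincide with gold word boundaries."""
    if "".join(nta_chunks) != "".join(gold_words):
        raise ValueError("both segmentations must cover the same text")
    nta = boundaries(nta_chunks)
    gold = boundaries(gold_words)
    # A text typed as a single chunk has no NTA boundary to get wrong.
    return len(nta & gold) / len(nta) if nta else 1.0


# Hypothetical example: the user commits four chunks, the gold standard has five words.
typed = ["我们", "喜欢", "自然语言", "处理"]
gold = ["我们", "喜欢", "自然", "语言", "处理"]
score = nta_agreement(typed, gold)
```

A committed chunk may span several gold words, as "自然语言" does here, so the sketch measures only how many NTA boundaries are correct and does not penalize gold boundaries that the user never typed.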

Original language	English
Title of host publication	ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference
Publisher	Association for Computational Linguistics (ACL)
Pages	662-667
Number of pages	6
ISBN (Electronic)	9781941643730
DOIs	https://doi.org/10.3115/v1/p15-2109
Publication status	Published - 2015
Event	53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015 - Beijing, China
Duration	26 Jul 2015 → 31 Jul 2015

Publication series

Name	ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference
Volume	2

Conference

Conference	53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015
Country/Territory	China
City	Beijing
Period	26/07/15 → 31/07/15

Cite this

Zhang, D., Mao, Y., Liu, Y., Wang, H., Wei, C., & Tang, S. (2015). The discovery of natural typing annotations: User-produced potential Chinese word delimiters. In ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference (pp. 662-667). (ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference; Vol. 2). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/p15-2109

@inproceedings{7cab0d32abf14eb88941524f93f83b4d,

title = "The discovery of natural typing annotations: User-produced potential Chinese word delimiters",

abstract = "Human-labeled corpora are indispensable for training supervised word segmenters. However, labeling a corpus manually is time-consuming and labor-intensive. When typing Chinese text with Pinyin, people usually need to press the space bar or a numeric key to choose among homophonous words, and each such keystroke can be viewed as a cue for segmentation. We argue that this process can be used to build a labeled corpus in a more natural way. In this paper, we therefore investigate Natural Typing Annotations (NTAs), which are potential word delimiters produced by users while typing Chinese. A detailed analysis of over three hundred user-produced texts containing NTAs reveals that high-quality NTAs mostly agree with gold segmentation and, consequently, can be used to improve the performance of a supervised word segmentation model on out-of-domain text. Experiments show that a classification model combined with a voting mechanism can reliably identify high-quality NTA texts, which are a more readily available labeled corpus. Furthermore, NTAs might be particularly useful for handling out-of-vocabulary (OOV) words such as proper names and neologisms.",

author = "Dakui Zhang and Yu Mao and Yang Liu and Hanshi Wang and Chuyuan Wei and Shiping Tang",

note = "Publisher Copyright: {\textcopyright} 2015 Association for Computational Linguistics.; 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015 ; Conference date: 26-07-2015 Through 31-07-2015",

year = "2015",

doi = "10.3115/v1/p15-2109",

language = "English",

series = "ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference",

publisher = "Association for Computational Linguistics (ACL)",

pages = "662--667",

booktitle = "ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference",

address = "United States",

}

TY - GEN

T1 - The discovery of natural typing annotations

T2 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015

AU - Zhang, Dakui

AU - Mao, Yu

AU - Liu, Yang

AU - Wang, Hanshi

AU - Wei, Chuyuan

AU - Tang, Shiping

PY - 2015

Y1 - 2015

AB - Human-labeled corpora are indispensable for training supervised word segmenters. However, labeling a corpus manually is time-consuming and labor-intensive. When typing Chinese text with Pinyin, people usually need to press the space bar or a numeric key to choose among homophonous words, and each such keystroke can be viewed as a cue for segmentation. We argue that this process can be used to build a labeled corpus in a more natural way. In this paper, we therefore investigate Natural Typing Annotations (NTAs), which are potential word delimiters produced by users while typing Chinese. A detailed analysis of over three hundred user-produced texts containing NTAs reveals that high-quality NTAs mostly agree with gold segmentation and, consequently, can be used to improve the performance of a supervised word segmentation model on out-of-domain text. Experiments show that a classification model combined with a voting mechanism can reliably identify high-quality NTA texts, which are a more readily available labeled corpus. Furthermore, NTAs might be particularly useful for handling out-of-vocabulary (OOV) words such as proper names and neologisms.

UR - http://www.scopus.com/inward/record.url?scp=84944038481&partnerID=8YFLogxK

U2 - 10.3115/v1/p15-2109

DO - 10.3115/v1/p15-2109

M3 - Conference contribution

AN - SCOPUS:84944038481

T3 - ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference

SP - 662

EP - 667

BT - ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference

PB - Association for Computational Linguistics (ACL)

Y2 - 26 July 2015 through 31 July 2015

ER -
