TY - JOUR
T1 - A new unsupervised approach to word segmentation
AU - Wang, Hanshi
AU - Zhu, Jian
AU - Tang, Shiping
AU - Fan, Xiaozhong
PY - 2011/9
Y1 - 2011/9
N2 - This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and uncertainty of character sequence co-occurrence in corpora are considered as statistical evidence supporting goodness measurement. Additionally, the statistical data of character sequences with various lengths become comparable with each other by using a simple process called Balancing. In Selection, a local maximum strategy is adopted without thresholds, and the strategy can be implemented with dynamic programming. In Adjustment, a part of the statistical data is updated to improve successive results. In our experiment, ESA was evaluated on the SIGHAN Bakeoff-2 data set. The results suggest that ESA is effective on Chinese corpora. It is noteworthy that the F-measures of the results are basically monotone increasing and can rapidly converge to relatively high values. Furthermore, empirical formulae based on the results can be used to predict the parameter in ESA to avoid parameter estimation that is usually time-consuming.
UR - http://www.scopus.com/inward/record.url?scp=80052197996&partnerID=8YFLogxK
U2 - 10.1162/COLI_a_00058
DO - 10.1162/COLI_a_00058
M3 - Article
AN - SCOPUS:80052197996
SN - 0891-2017
VL - 37
SP - 421
EP - 454
JO - Computational Linguistics
JF - Computational Linguistics
IS - 3
ER -