A new unsupervised approach to word segmentation

Hanshi Wang; Jian Zhu; Shiping Tang; Xiaozhong Fan

doi:10.1162/COLI_a_00058

Hanshi Wang^*, Jian Zhu, Shiping Tang, Xiaozhong Fan

^*Corresponding author for this work

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

22 Citations (Scopus)

Abstract

This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and uncertainty of character sequence co-occurrence in corpora are considered as statistical evidence supporting goodness measurement. Additionally, the statistical data of character sequences with various lengths become comparable with each other by using a simple process called Balancing. In Selection, a local maximum strategy is adopted without thresholds, and the strategy can be implemented with dynamic programming. In Adjustment, a part of the statistical data is updated to improve successive results. In our experiment, ESA was evaluated on the SIGHAN Bakeoff-2 data set. The results suggest that ESA is effective on Chinese corpora. It is noteworthy that the F-measures of the results are basically monotone increasing and can rapidly converge to relatively high values. Furthermore, empirical formulae based on the results can be used to predict the parameter in ESA to avoid parameter estimation that is usually time-consuming.
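
The abstract notes that the Selection phase can be implemented with dynamic programming. The sketch below is a minimal illustration of that general idea, not the ESA algorithm itself: it picks the split of a string that maximizes a summed score. The function names `segment` and `toy_goodness`, the `max_len` parameter, and the scores in `TOY` are all hypothetical and invented for this example; the paper's actual goodness measure, Balancing step, and Adjustment step are not reproduced here.

```python
from typing import Callable, List


def segment(text: str, goodness: Callable[[str], float], max_len: int = 4) -> List[str]:
    """Return the split of `text` with the highest summed goodness score."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        # Try every candidate last piece text[start:end] up to max_len characters.
        for start in range(max(0, end - max_len), end):
            score = best[start] + goodness(text[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Toy scores, invented for illustration only (not taken from the paper).
TOY = {"北京": 2.0, "理工": 2.0, "大学": 2.0, "理工大学": 3.0}


def toy_goodness(s: str) -> float:
    # Unknown single characters are neutral; unknown longer strings are penalized.
    return TOY.get(s, 0.0 if len(s) == 1 else -1.0)


print(segment("北京理工大学", toy_goodness))
```

In an unsupervised setting the scores would come from corpus statistics rather than a hand-written table; the table here only keeps the example self-contained.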

Original language	English
Pages (from-to)	421-454
Number of pages	34
Journal	Computational Linguistics
Volume	37
Issue number	3
DOI	https://doi.org/10.1162/COLI_a_00058
Publication status	Published - Sept 2011

Cite this

@article{3c344bcfa06943ca8c5a628d4c42093f,

title = "A new unsupervised approach to word segmentation",

abstract = "This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and uncertainty of character sequence co-occurrence in corpora are considered as statistical evidence supporting goodness measurement. Additionally, the statistical data of character sequences with various lengths become comparable with each other by using a simple process called Balancing. In Selection, a local maximum strategy is adopted without thresholds, and the strategy can be implemented with dynamic programming. In Adjustment, a part of the statistical data is updated to improve successive results. In our experiment, ESA was evaluated on the SIGHAN Bakeoff-2 data set. The results suggest that ESA is effective on Chinese corpora. It is noteworthy that the F-measures of the results are basically monotone increasing and can rapidly converge to relatively high values. Furthermore, empirical formulae based on the results can be used to predict the parameter in ESA to avoid parameter estimation that is usually time-consuming.",

author = "Hanshi Wang and Jian Zhu and Shiping Tang and Xiaozhong Fan",

year = "2011",

month = sep,

doi = "10.1162/COLI_a_00058",

language = "English",

volume = "37",

pages = "421--454",

journal = "Computational Linguistics",

issn = "0891-2017",

publisher = "MIT Press",

number = "3",

}

TY - JOUR

T1 - A new unsupervised approach to word segmentation

AU - Wang, Hanshi

AU - Zhu, Jian

AU - Tang, Shiping

AU - Fan, Xiaozhong

PY - 2011/9

Y1 - 2011/9

AB - This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and uncertainty of character sequence co-occurrence in corpora are considered as statistical evidence supporting goodness measurement. Additionally, the statistical data of character sequences with various lengths become comparable with each other by using a simple process called Balancing. In Selection, a local maximum strategy is adopted without thresholds, and the strategy can be implemented with dynamic programming. In Adjustment, a part of the statistical data is updated to improve successive results. In our experiment, ESA was evaluated on the SIGHAN Bakeoff-2 data set. The results suggest that ESA is effective on Chinese corpora. It is noteworthy that the F-measures of the results are basically monotone increasing and can rapidly converge to relatively high values. Furthermore, empirical formulae based on the results can be used to predict the parameter in ESA to avoid parameter estimation that is usually time-consuming.

UR - http://www.scopus.com/inward/record.url?scp=80052197996&partnerID=8YFLogxK

U2 - 10.1162/COLI_a_00058

DO - 10.1162/COLI_a_00058

M3 - Article

AN - SCOPUS:80052197996

SN - 0891-2017

VL - 37

SP - 421

EP - 454

JO - Computational Linguistics

JF - Computational Linguistics

IS - 3

ER -
