The state of the art and difficulties in automatic Chinese word segmentation

Chun Xia Zhang; Tian Yong Hao

Corresponding author: Chun Xia Zhang

Research output: Contribution to journal › Article › peer-review

28 Citations (Scopus)

Abstract

Automatic Chinese word segmentation is a fundamental research problem for Chinese information processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, and natural language understanding. Although it has been investigated for more than twenty years, it remains a bottleneck for Chinese information processing. We give a detailed analysis of the state of the art in automatic Chinese word segmentation, build a formal model of word segmentation, and discuss the factors affecting word segmentation, the two major difficulties in word segmentation, and their resolutions. Finally, we point out the existing problems, especially those in word segmentation evaluation, as well as the research problems that remain to be resolved.

Original language	English
Pages (from-to)	138-143+147
Journal	Xitong Fangzhen Xuebao / Journal of System Simulation
Volume	17
Issue number	1
Publication status	Published - Jan 2005
Externally published	Yes

Keywords

Automatic Chinese word segmentation
Formal model
Unknown words
Word segmentation evaluation

Cite this

@article{30ec5d1b7c8e4375a6e8555ed70aa513,
  title = "The state of the art and difficulties in automatic Chinese word segmentation",
  abstract = "Automatic Chinese word segmentation is a basic research issue on Chinese information processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, natural language understanding, and so on. Though it has been investigated for more than twenty years, it is still a bottleneck for Chinese information processing. We give a detailed analysis of the state of the art in automatic Chinese word segmentation, build a formal model of word segmentation, discuss factors affecting word segmentation and the two great difficulties in word segmentation and their resolutions, and finally, point out the existing problems, especially those on the word segmentation evaluation, as well as the research problems to be resolved.",
  keywords = "Automatic Chinese word segmentation, Formal model, Unknown words, Word segmentation evaluation",
  author = "Zhang, {Chun Xia} and Hao, {Tian Yong}",
  year = "2005",
  month = jan,
  language = "English",
  volume = "17",
  pages = "138--143+147",
  journal = "Xitong Fangzhen Xuebao / Journal of System Simulation",
  issn = "1004-731X",
  publisher = "Acta Simulata Systematica Sinica",
  number = "1",
}

TY  - JOUR
T1  - The state of the art and difficulties in automatic Chinese word segmentation
AU  - Zhang, Chun Xia
AU  - Hao, Tian Yong
PY  - 2005/1
Y1  - 2005/1
N2  - Automatic Chinese word segmentation is a basic research issue on Chinese information processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, natural language understanding, and so on. Though it has been investigated for more than twenty years, it is still a bottleneck for Chinese information processing. We give a detailed analysis of the state of the art in automatic Chinese word segmentation, build a formal model of word segmentation, discuss factors affecting word segmentation and the two great difficulties in word segmentation and their resolutions, and finally, point out the existing problems, especially those on the word segmentation evaluation, as well as the research problems to be resolved.
AB  - Automatic Chinese word segmentation is a basic research issue on Chinese information processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, natural language understanding, and so on. Though it has been investigated for more than twenty years, it is still a bottleneck for Chinese information processing. We give a detailed analysis of the state of the art in automatic Chinese word segmentation, build a formal model of word segmentation, discuss factors affecting word segmentation and the two great difficulties in word segmentation and their resolutions, and finally, point out the existing problems, especially those on the word segmentation evaluation, as well as the research problems to be resolved.
KW  - Automatic Chinese word segmentation
KW  - Formal model
KW  - Unknown words
KW  - Word segmentation evaluation
UR  - http://www.scopus.com/inward/record.url?scp=13944265944&partnerID=8YFLogxK
M3  - Article
AN  - SCOPUS:13944265944
SN  - 1004-731X
VL  - 17
SP  - 138-143+147
JO  - Xitong Fangzhen Xuebao / Journal of System Simulation
JF  - Xitong Fangzhen Xuebao / Journal of System Simulation
IS  - 1
ER  - 
