Which performs better for new word detection, character based or Chinese Word Segmentation based?

Haijun Zhang, Shumin Shi

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Citations (Scopus)

Abstract

This paper proposes a novel method for evaluating the performance of New Word Detection (NWD) based on repeats extraction. For small-scale corpora, we employ the Conditional Random Field (CRF) as a statistical framework to estimate the effects of different NWD strategies. For large-scale corpora, the supply of annotated data is limited, so comparative experiments cannot be used for evaluation. Accordingly, this paper proposes a pragmatic quantitative model to analyze and estimate the performance of NWD in all cases, especially for large-scale corpora. The experimental results and the conclusions drawn from the quantitative model corroborate each other. Based on analysis of the experimental data and the quantitative model, we reach a reliable conclusion about the effects of the two strategies (character based and Chinese Word Segmentation based) on Chinese NWD, which can guide follow-up studies in Chinese new word detection.

Original language: English
Title of host publication: Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014
Editors: Rafael E. Banchs, Minghui Dong, Yanfeng Lu, Bali Ranaivo-Malancon
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 10-14
Number of pages: 5
ISBN (Electronic): 9781479953301
DOIs
Publication status: Published - 3 Dec 2014
Event: International Conference on Asian Language Processing 2014, IALP 2014 - Kuching, Malaysia
Duration: 20 Oct 2014 to 22 Oct 2014

Publication series

Name: Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014

Conference

Conference: International Conference on Asian Language Processing 2014, IALP 2014
Country/Territory: Malaysia
City: Kuching
Period: 20/10/14 to 22/10/14

Keywords

  • CRF
  • Character Based
  • Chinese Word Segmentation
  • New Words Detection
  • Repeats

