TY - GEN
T1 - Active learning for cross language text categorization
AU - Liu, Yue
AU - Dai, Lin
AU - Zhou, Weitao
AU - Huang, Heyan
PY - 2012
Y1 - 2012
N2 - Cross Language Text Categorization (CLTC) is the task of assigning class labels to documents written in a target language (e.g. Chinese) while the system is trained using labeled examples in a source language (e.g. English). With the technique of CLTC, we can build classifiers for multiple languages employing the existing training data in only one language, therefore avoid the cost of preparing training data for each individual language. One challenge for CLTC is the culture differences between languages, which causes the classifier trained on the source language doesn't perform well on the target language. In this paper, we propose an active learning algorithm for CLTC, which takes full advantage of both labeled data in the source language and unlabeled data in the target language. The classifier first learns the classification knowledge from the source language, and then learns the cultural dependent knowledge from the target language. In addition, we extend our algorithm to double viewed form by considering the source and target language as two views of the classification problem. Experiments show that our algorithm can effectively improve the cross language classification performance.
AB - Cross Language Text Categorization (CLTC) is the task of assigning class labels to documents written in a target language (e.g. Chinese) while the system is trained using labeled examples in a source language (e.g. English). With the technique of CLTC, we can build classifiers for multiple languages employing the existing training data in only one language, therefore avoid the cost of preparing training data for each individual language. One challenge for CLTC is the culture differences between languages, which causes the classifier trained on the source language doesn't perform well on the target language. In this paper, we propose an active learning algorithm for CLTC, which takes full advantage of both labeled data in the source language and unlabeled data in the target language. The classifier first learns the classification knowledge from the source language, and then learns the cultural dependent knowledge from the target language. In addition, we extend our algorithm to double viewed form by considering the source and target language as two views of the classification problem. Experiments show that our algorithm can effectively improve the cross language classification performance.
KW - Active Learning
KW - Cross Language Text Categorization
UR - http://www.scopus.com/inward/record.url?scp=84861421272&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-30217-6_17
DO - 10.1007/978-3-642-30217-6_17
M3 - Conference contribution
AN - SCOPUS:84861421272
SN - 9783642302169
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 195
EP - 206
BT - Advances in Knowledge Discovery and Data Mining - 16th Pacific-Asia Conference, PAKDD 2012, Proceedings
T2 - 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2012
Y2 - 29 May 2012 through 1 June 2012
ER -