TY - JOUR
T1 - Mining pure high-order word associations via information geometry for information retrieval
AU - Hou, Yuexian
AU - Zhao, Xiaozhao
AU - Song, Dawei
AU - Li, Wenjie
PY - 2013/7
Y1 - 2013/7
N2 - The classical bag-of-words models for information retrieval (IR) fail to capture contextual associations between words. In this article, we investigate pure high-order dependence among a number of words forming an inseparable semantic entity, that is, high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns would lead to a better representation of documents and to novel retrieval models. Specifically, we give two formal definitions of pure dependence: unconditional pure dependence (UPD) and conditional pure dependence (CPD). Deciding UPD and CPD exactly, however, is NP-hard in general. We hence derive and prove sufficient criteria that entail UPD and CPD within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification, and text retrieval without and with query expansion.
AB - The classical bag-of-words models for information retrieval (IR) fail to capture contextual associations between words. In this article, we investigate pure high-order dependence among a number of words forming an inseparable semantic entity, that is, high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns would lead to a better representation of documents and to novel retrieval models. Specifically, we give two formal definitions of pure dependence: unconditional pure dependence (UPD) and conditional pure dependence (CPD). Deciding UPD and CPD exactly, however, is NP-hard in general. We hence derive and prove sufficient criteria that entail UPD and CPD within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification, and text retrieval without and with query expansion.
KW - Information geometry
KW - Pure high-order dependence
KW - Text classification
KW - Text retrieval
KW - Word association
UR - http://www.scopus.com/inward/record.url?scp=84894562039&partnerID=8YFLogxK
U2 - 10.1145/2493175.2493177
DO - 10.1145/2493175.2493177
M3 - Article
AN - SCOPUS:84894562039
SN - 1046-8188
VL - 31
JO - ACM Transactions on Information Systems
JF - ACM Transactions on Information Systems
IS - 3
ER -