TY - GEN
T1 - Pure high-order word dependence mining via information geometry
AU - Hou, Yuexian
AU - He, Liang
AU - Zhao, Xiaozhao
AU - Song, Dawei
PY - 2011
Y1 - 2011
N2 - The classical bag-of-words models fail to capture contextual associations between words. We propose to investigate the "high-order pure dependence" among a number of words forming a semantic entity, i.e., the high-order dependence that cannot be reduced to the random coincidence of lower-order dependence. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). Deciding UPD or CPD, however, is an NP-hard problem. We hence prove a series of sufficient criteria that entail UPD and CPD, within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from constant n-grams and Apriori association rule mining, our IG-based methods have proved mathematically more rigorous and empirically more effective.
AB - The classical bag-of-words models fail to capture contextual associations between words. We propose to investigate the "high-order pure dependence" among a number of words forming a semantic entity, i.e., the high-order dependence that cannot be reduced to the random coincidence of lower-order dependence. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). Deciding UPD or CPD, however, is an NP-hard problem. We hence prove a series of sufficient criteria that entail UPD and CPD, within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from constant n-grams and Apriori association rule mining, our IG-based methods have proved mathematically more rigorous and empirically more effective.
KW - High-order Pure Dependence
KW - Information Geometry
KW - Language Model
KW - Log likelihood Ratio Test
KW - Query Expansion
KW - Word Association
UR - http://www.scopus.com/inward/record.url?scp=80053020523&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23318-0_8
DO - 10.1007/978-3-642-23318-0_8
M3 - Conference contribution
AN - SCOPUS:80053020523
SN - 9783642233173
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 64
EP - 76
BT - Advances in Information Retrieval Theory - Third International Conference, ICTIR 2011, Proceedings
T2 - 3rd International Conference on the Theory of Information Retrieval, ICTIR 2011
Y2 - 12 September 2011 through 14 September 2011
ER -