TY - JOUR
T1 - Generalized analysis of a distribution separation method
AU - Zhang, Peng
AU - Yu, Qian
AU - Hou, Yuexian
AU - Song, Dawei
AU - Li, Jingfei
AU - Hu, Bin
N1 - Publisher Copyright:
© 2016 by the authors.
PY - 2016/4/1
Y1 - 2016/4/1
N2 - Separating two probability distributions from a mixture model made up of a combination of the two is essential to a wide range of applications. For example, in information retrieval (IR), there often exists a mixture distribution consisting of a relevance distribution that we need to estimate and an irrelevance distribution that we hope to remove. Recently, a distribution separation method (DSM) was proposed to approximate the relevance distribution by separating a seed irrelevance distribution from the mixture distribution. It was successfully applied to an IR task, namely pseudo-relevance feedback (PRF), where the query expansion model is often a mixture term distribution. Although initially developed in the context of IR, DSM is in fact a general mathematical formulation for probability distribution separation. It is therefore important to further generalize its basic analysis and to explore its connections to other related methods. In this article, we first extend DSM's theoretical analysis, which was originally based on the Pearson correlation coefficient, to entropy-related measures, including the KL-divergence (Kullback-Leibler divergence), the symmetrized KL-divergence and the JS-divergence (Jensen-Shannon divergence). Second, we investigate the distribution separation idea in a well-known method, namely the mixture model feedback (MMF) approach. We prove that MMF also complies with the linear combination assumption, so that DSM's linear separation algorithm can largely simplify the EM algorithm in MMF. These theoretical analyses, together with further empirical evaluation results, demonstrate the advantages of our DSM approach.
AB - Separating two probability distributions from a mixture model made up of a combination of the two is essential to a wide range of applications. For example, in information retrieval (IR), there often exists a mixture distribution consisting of a relevance distribution that we need to estimate and an irrelevance distribution that we hope to remove. Recently, a distribution separation method (DSM) was proposed to approximate the relevance distribution by separating a seed irrelevance distribution from the mixture distribution. It was successfully applied to an IR task, namely pseudo-relevance feedback (PRF), where the query expansion model is often a mixture term distribution. Although initially developed in the context of IR, DSM is in fact a general mathematical formulation for probability distribution separation. It is therefore important to further generalize its basic analysis and to explore its connections to other related methods. In this article, we first extend DSM's theoretical analysis, which was originally based on the Pearson correlation coefficient, to entropy-related measures, including the KL-divergence (Kullback-Leibler divergence), the symmetrized KL-divergence and the JS-divergence (Jensen-Shannon divergence). Second, we investigate the distribution separation idea in a well-known method, namely the mixture model feedback (MMF) approach. We prove that MMF also complies with the linear combination assumption, so that DSM's linear separation algorithm can largely simplify the EM algorithm in MMF. These theoretical analyses, together with further empirical evaluation results, demonstrate the advantages of our DSM approach.
KW - Distribution separation
KW - Information retrieval
KW - KL-divergence
KW - Mixture model
UR - http://www.scopus.com/inward/record.url?scp=84964523881&partnerID=8YFLogxK
U2 - 10.3390/e18040105
DO - 10.3390/e18040105
M3 - Article
AN - SCOPUS:84964523881
SN - 1099-4300
VL - 18
JO - Entropy
JF - Entropy
IS - 4
M1 - 105
ER -