A distribution separation method using irrelevance feedback data for information retrieval

Peng Zhang; Qian Yu; Yuexian Hou; Dawei Song; Jingfei Li; Bin Hu

doi:10.1145/2994608

A distribution separation method using irrelevance feedback data for information retrieval

Peng Zhang, Qian Yu, Yuexian Hou, Dawei Song^*, Jingfei Li, Bin Hu

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

In many research and application areas, such as information retrieval and machine learning, we often encounter dealing with a probability distribution that is mixed by one distribution that is relevant to our task in hand and the other that is irrelevant and that we want to get rid of. Thus, it is an essential problem to separate the irrelevant distribution from the mixture distribution. This article is focused on the application in Information Retrieval, where relevance feedback is a widely used technique to build a refined query model based on a set of feedback documents. However, in practice, the relevance feedback set, even provided by users explicitly or implicitly, is often a mixture of relevant and irrelevant documents. Consequently, the resultant query model (typically a term distribution) is often a mixture rather than a true relevance term distribution, leading to a negative impact on the retrieval performance. To tackle this problem, we recently proposed a Distribution SeparationMethod (DSM), which aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture one. While it achieved a promising performance in an empirical evaluation with simulated explicit irrelevance feedback data, it has not been deployed in the scenario where one should automatically obtain the irrelevance feedback data. In this article, we propose a substantial extension of the basic DSM from two perspectives: developing a further regularization framework and deploying DSM in the automatic irrelevance feedback scenario. Specifically, in order to avoid the output distribution of DSM drifting away from the true relevance distribution when the quality of seed irrelevant distribution (as the input to DSM) is not guaranteed, we propose a DSM regularization framework to constrain the estimation for the relevance distribution. This regularization framework includes three algorithms, each corresponding to a regularization strategy incorporated in the objective function of DSM. In addition, we exploit DSM in automatic (i.e., pseudo) irrelevance feedback, by automatically detecting the seed irrelevant documents via three different document reranking methods. We have carried out extensive experiments based on various TREC datasets, in order to systematically evaluate the proposed methods. The experimental results demonstrate the effectiveness of our proposed approaches in comparison with various strong baselines.

Original language	English
Article number	46
Journal	ACM Transactions on Intelligent Systems and Technology
Volume	8
Issue number	3
DOIs	https://doi.org/10.1145/2994608
Publication status	Published - Jan 2017
Externally published	Yes

Keywords

Distribution separation method
Irrelevant term distribution
Mixture distribution
Regularization
Relevance feedback

Access to Document

10.1145/2994608

Cite this

@article{f6a191ac1b5b4c22b79271879c61f999,

title = "A distribution separation method using irrelevance feedback data for information retrieval",

abstract = "In many research and application areas, such as information retrieval and machine learning, we often encounter dealing with a probability distribution that is mixed by one distribution that is relevant to our task in hand and the other that is irrelevant and that we want to get rid of. Thus, it is an essential problem to separate the irrelevant distribution from the mixture distribution. This article is focused on the application in Information Retrieval, where relevance feedback is a widely used technique to build a refined query model based on a set of feedback documents. However, in practice, the relevance feedback set, even provided by users explicitly or implicitly, is often a mixture of relevant and irrelevant documents. Consequently, the resultant query model (typically a term distribution) is often a mixture rather than a true relevance term distribution, leading to a negative impact on the retrieval performance. To tackle this problem, we recently proposed a Distribution SeparationMethod (DSM), which aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture one. While it achieved a promising performance in an empirical evaluation with simulated explicit irrelevance feedback data, it has not been deployed in the scenario where one should automatically obtain the irrelevance feedback data. In this article, we propose a substantial extension of the basic DSM from two perspectives: developing a further regularization framework and deploying DSM in the automatic irrelevance feedback scenario. Specifically, in order to avoid the output distribution of DSM drifting away from the true relevance distribution when the quality of seed irrelevant distribution (as the input to DSM) is not guaranteed, we propose a DSM regularization framework to constrain the estimation for the relevance distribution. This regularization framework includes three algorithms, each corresponding to a regularization strategy incorporated in the objective function of DSM. In addition, we exploit DSM in automatic (i.e., pseudo) irrelevance feedback, by automatically detecting the seed irrelevant documents via three different document reranking methods. We have carried out extensive experiments based on various TREC datasets, in order to systematically evaluate the proposed methods. The experimental results demonstrate the effectiveness of our proposed approaches in comparison with various strong baselines.",

keywords = "Distribution separation method, Irrelevant term distribution, Mixture distribution, Regularization, Relevance feedback",

author = "Peng Zhang and Qian Yu and Yuexian Hou and Dawei Song and Jingfei Li and Bin Hu",

note = "Publisher Copyright: {\textcopyright} 2017 ACM.",

year = "2017",

month = jan,

doi = "10.1145/2994608",

language = "English",

volume = "8",

journal = "ACM Transactions on Intelligent Systems and Technology",

issn = "2157-6904",

publisher = "Association for Computing Machinery (ACM)",

number = "3",

}

TY - JOUR

T1 - A distribution separation method using irrelevance feedback data for information retrieval

AU - Zhang, Peng

AU - Yu, Qian

AU - Hou, Yuexian

AU - Song, Dawei

AU - Li, Jingfei

AU - Hu, Bin

PY - 2017/1

Y1 - 2017/1

N2 - In many research and application areas, such as information retrieval and machine learning, we often encounter dealing with a probability distribution that is mixed by one distribution that is relevant to our task in hand and the other that is irrelevant and that we want to get rid of. Thus, it is an essential problem to separate the irrelevant distribution from the mixture distribution. This article is focused on the application in Information Retrieval, where relevance feedback is a widely used technique to build a refined query model based on a set of feedback documents. However, in practice, the relevance feedback set, even provided by users explicitly or implicitly, is often a mixture of relevant and irrelevant documents. Consequently, the resultant query model (typically a term distribution) is often a mixture rather than a true relevance term distribution, leading to a negative impact on the retrieval performance. To tackle this problem, we recently proposed a Distribution SeparationMethod (DSM), which aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture one. While it achieved a promising performance in an empirical evaluation with simulated explicit irrelevance feedback data, it has not been deployed in the scenario where one should automatically obtain the irrelevance feedback data. In this article, we propose a substantial extension of the basic DSM from two perspectives: developing a further regularization framework and deploying DSM in the automatic irrelevance feedback scenario. Specifically, in order to avoid the output distribution of DSM drifting away from the true relevance distribution when the quality of seed irrelevant distribution (as the input to DSM) is not guaranteed, we propose a DSM regularization framework to constrain the estimation for the relevance distribution. This regularization framework includes three algorithms, each corresponding to a regularization strategy incorporated in the objective function of DSM. In addition, we exploit DSM in automatic (i.e., pseudo) irrelevance feedback, by automatically detecting the seed irrelevant documents via three different document reranking methods. We have carried out extensive experiments based on various TREC datasets, in order to systematically evaluate the proposed methods. The experimental results demonstrate the effectiveness of our proposed approaches in comparison with various strong baselines.

AB - In many research and application areas, such as information retrieval and machine learning, we often encounter dealing with a probability distribution that is mixed by one distribution that is relevant to our task in hand and the other that is irrelevant and that we want to get rid of. Thus, it is an essential problem to separate the irrelevant distribution from the mixture distribution. This article is focused on the application in Information Retrieval, where relevance feedback is a widely used technique to build a refined query model based on a set of feedback documents. However, in practice, the relevance feedback set, even provided by users explicitly or implicitly, is often a mixture of relevant and irrelevant documents. Consequently, the resultant query model (typically a term distribution) is often a mixture rather than a true relevance term distribution, leading to a negative impact on the retrieval performance. To tackle this problem, we recently proposed a Distribution SeparationMethod (DSM), which aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture one. While it achieved a promising performance in an empirical evaluation with simulated explicit irrelevance feedback data, it has not been deployed in the scenario where one should automatically obtain the irrelevance feedback data. In this article, we propose a substantial extension of the basic DSM from two perspectives: developing a further regularization framework and deploying DSM in the automatic irrelevance feedback scenario. Specifically, in order to avoid the output distribution of DSM drifting away from the true relevance distribution when the quality of seed irrelevant distribution (as the input to DSM) is not guaranteed, we propose a DSM regularization framework to constrain the estimation for the relevance distribution. This regularization framework includes three algorithms, each corresponding to a regularization strategy incorporated in the objective function of DSM. In addition, we exploit DSM in automatic (i.e., pseudo) irrelevance feedback, by automatically detecting the seed irrelevant documents via three different document reranking methods. We have carried out extensive experiments based on various TREC datasets, in order to systematically evaluate the proposed methods. The experimental results demonstrate the effectiveness of our proposed approaches in comparison with various strong baselines.

KW - Distribution separation method

KW - Irrelevant term distribution

KW - Mixture distribution

KW - Regularization

KW - Relevance feedback

UR - http://www.scopus.com/inward/record.url?scp=85011382356&partnerID=8YFLogxK

U2 - 10.1145/2994608

DO - 10.1145/2994608

M3 - Article

AN - SCOPUS:85011382356

SN - 2157-6904

VL - 8

JO - ACM Transactions on Intelligent Systems and Technology

JF - ACM Transactions on Intelligent Systems and Technology

IS - 3

M1 - 46

ER -

A distribution separation method using irrelevance feedback data for information retrieval

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this