Approximating true relevance distribution from a mixture model based on irrelevance data

Peng Zhang*, Yuexian Hou, Dawei Song

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

13 Citations (Scopus)

Abstract

Pseudo relevance feedback (PRF), which has been widely applied in information retrieval (IR), aims to derive a distribution from the top n pseudo-relevant documents D. However, these documents are often a mixture of relevant and irrelevant documents. As a result, the derived distribution is actually a mixture model, which has long limited the performance of PRF. This is particularly the case for difficult queries, where the truly relevant documents in D are very sparse. In this situation, it is often easier to identify a small number of seed irrelevant documents, which can form a seed irrelevance distribution. A fundamental and challenging problem then arises: based solely on the mixed distribution and a seed irrelevance distribution, how can an optimal approximation of the true relevance distribution be generated automatically? In this paper, we propose a novel distribution separation model (DSM) to tackle this problem. Theoretical justifications of the proposed algorithm are given. Evaluation results from our extensive simulated experiments on several large-scale TREC data sets demonstrate the effectiveness of our method, which outperforms a well-respected PRF model, the Relevance Model (RM), as well as the use of RM on D with the seed negative documents directly removed.
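To make the separation problem concrete, here is a minimal sketch in Python. It is not the DSM algorithm from the paper, which the abstract does not spell out. It assumes the mixed distribution is a linear combination of a relevance and an irrelevance distribution over a shared vocabulary, and that the mixing weight is already known. The function name `separate_relevance` and the parameter `lam` are hypothetical, introduced only for illustration.

```python
import numpy as np


def separate_relevance(mixture, irrelevant, lam):
    """Hypothetical helper, not the paper's DSM.

    Assumes mixture = lam * relevant + (1 - lam) * irrelevant, with all
    three being term distributions over the same vocabulary, and solves
    for `relevant` given a known mixing weight `lam`.
    """
    mixture = np.asarray(mixture, dtype=float)
    irrelevant = np.asarray(irrelevant, dtype=float)
    if mixture.shape != irrelevant.shape:
        raise ValueError("distributions must share the same vocabulary")
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")

    # Subtract the irrelevant component and rescale by the relevant weight.
    raw = (mixture - (1.0 - lam) * irrelevant) / lam

    # A wrong lam or a noisy seed distribution can push entries below zero,
    # so clip and renormalise to obtain a valid probability distribution.
    clipped = np.clip(raw, 0.0, None)
    total = clipped.sum()
    if total == 0.0:
        raise ValueError("nothing left after removing the irrelevant part")
    return clipped / total


# Toy vocabulary of four terms: build a mixture, then recover the relevant part.
true_relevant = np.array([0.7, 0.2, 0.1, 0.0])
seed_irrelevant = np.array([0.0, 0.1, 0.3, 0.6])
mixed = 0.4 * true_relevant + 0.6 * seed_irrelevant
estimate = separate_relevance(mixed, seed_irrelevant, lam=0.4)
```

In practice the mixing weight is unknown and the seed irrelevance distribution covers only part of the irrelevant documents, which is exactly what makes the problem posed in the abstract hard; the sketch only shows the idealised case.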

Original language: English
Title of host publication: Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009
Pages: 107-114
Number of pages: 8
Publication status: Published - 2009
Externally published: Yes
Event: 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009 - Boston, MA, United States
Duration: 19 Jul 2009 - 23 Jul 2009

Publication series

Name: Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009

Conference

Conference: 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009
Country/Territory: United States
City: Boston, MA
Period: 19/07/09 - 23/07/09

Keywords

  • Distribution separation model
  • Irrelevant data
  • Pseudo-relevance feedback
  • True relevance distribution
