Large-Scale Multiple Hypothesis Testing in Information Retrieval: Towards a new approach to Document Ranking

Miles Efron

ASIS&T 2008 Annual Meeting (AM08 2008)
Columbus, Ohio, October 24-29, 2008


Information retrieval (IR) may be considered an instance of a common modern statistical problem: a massive simultaneous hypothesis test. Such problems arise often in biostatistics, where plentiful data must be winnowed to name a small number of potentially “interesting” cases. For instance, DNA microarray analysis requires researchers to filter thousands of genes, searching for those implicated in a particular condition. This paper describes a novel approach to IR based on the notion of simultaneous hypothesis testing. Here a test is performed on each document, and the null hypothesis is that the document is non-relevant. After a mathematical derivation of the proposed model, we compare its effectiveness on three standard data sets against two baseline IR systems: a vector space model and a language modeling-based system. These preliminary experiments suggest that the hypothesis testing approach to IR is not only philosophically appealing but also performs at the state of the art in effectiveness.
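
To make the framing concrete, the following is a minimal sketch of one way a per-document test could be set up. It is not the model derived in the paper, which this abstract does not spell out. It assumes each document already has a numeric retrieval score, that a normal null distribution fitted to all scores is acceptable for illustration, and that the Benjamini-Hochberg procedure is used to control the false discovery rate. The function names `upper_tail_p_values` and `benjamini_hochberg` and the level `q` are hypothetical choices made for this example.

```python
import math


def upper_tail_p_values(scores):
    """One-sided p-values for each score under a normal null.

    Assumption: the null (non-relevant) distribution is approximated by a
    normal fitted to the mean and sample standard deviation of all scores.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sd = math.sqrt(var)
    # P(Z >= z) for a standard normal, written with the complementary
    # error function.
    return [0.5 * math.erfc((s - mean) / (sd * math.sqrt(2))) for s in scores]


def benjamini_hochberg(p_values, q=0.1):
    """Indices whose null hypothesis is rejected at FDR level q.

    Indices are returned in order of increasing p-value, which also
    serves as a ranking of the flagged documents.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under the BH line.
        if p_values[i] <= q * rank / m:
            cutoff = rank
    return order[:cutoff]


def flag_documents(scores, q=0.1):
    """Documents (by index) for which "non-relevant" is rejected."""
    return benjamini_hochberg(upper_tail_p_values(scores), q)
```

Under these assumptions, calling `flag_documents(scores)` on a list of per-document scores returns the indices of documents whose scores are too extreme to be explained by the fitted null, most significant first. A real system would need a better-justified null distribution than a single normal fitted to the pooled scores.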