Large-Scale Multiple Hypothesis Testing in Information Retrieval: Towards a new approach to Document Ranking
Miles Efron
ASIS&T 2008 Annual Meeting (AM08 2008)
Columbus, Ohio, October 24-29, 2008
Summary
Information retrieval (IR) may be considered an instance of a common modern statistical problem: a massive simultaneous hypothesis test. Such problems arise often in biostatistics, where plentiful data must be winnowed to name a small number of potentially “interesting” cases. For instance, DNA microarray analysis requires researchers to filter thousands of genes, searching for those implicated in a particular condition. This paper describes a novel approach to IR based on the notion of simultaneous hypothesis testing. Here a test is performed on each document, and the null hypothesis is that the document is nonrelevant. After a mathematical derivation of the proposed model, we compare its effectiveness on three standard data sets against two baseline IR systems: a vector space model and a language modeling-based system. These preliminary experiments suggest that the hypothesis testing approach to IR is not only philosophically appealing but also operates at the state of the art in effectiveness.
