TREC: Improving Information Access through Evaluation
by Ellen M. Voorhees

Ellen M. Voorhees can be reached at National Institute of Standards and Technology, Gaithersburg, MD 20899; email: ellen.voorhees at nist.gov

“If you can not measure it, you can not improve it.” – Lord Kelvin

Evaluation
is a fundamental component of the scientific method: researchers
form a hypothesis, construct an experiment that tests the hypothesis
and then assess the extent to which the experimental results support
the hypothesis. A very common type of experiment is a comparative
one in which the hypothesis asserts that Method 1 is a more
effective solution than Method 2, and the experiment compares
the performance of the two methods on a common set of problems. The
set of sample problems together with the evaluation measures used to
assess the quality of the methods’ output form a benchmark task.
Information retrieval researchers have used test collections,
a form of benchmark task, ever since Cyril Cleverdon and his
colleagues created the first test collection for the Cranfield tests
in the 1960s. Many experiments followed in the subsequent two
decades, and several other test collections were built. Yet by 1990
there was growing dissatisfaction with the methodology. While some
research groups did use the same test collections, there was no
concerted effort to work with the same data, to use the same
evaluation measures or to compare results across systems to
consolidate findings. The available test collections were so small
– the largest of the generally available collections contained
about 12,000 documents and fewer than 100 queries – that operators
of commercial retrieval systems were unconvinced that the techniques
developed using test collections would scale to their much larger
document sets. Even some experimenters were questioning whether test
collections had outlived their usefulness.
At this time, NIST was asked to build a large test collection
for use in evaluating text retrieval technology developed as part of
the Defense Advanced Research Projects Agency’s TIPSTER project.
NIST proposed that instead of simply building a single large test
collection, it organize a workshop that would both build a
collection and investigate the larger issues surrounding test
collection use. This was the genesis of the Text REtrieval
Conference (TREC).
The first TREC workshop was held in November 1992, and a workshop
has been held annually ever since. The cumulative effort
represented by TREC is significant. Approximately 250 distinct
groups representing more than 20 different countries have
participated in at least one TREC, thousands of individual retrieval
experiments have been performed and hundreds of papers have been
published in the TREC proceedings. TREC’s impact on information
retrieval research has been equally significant. A variety of large
test collections have been built for both traditional ad hoc
retrieval and new tasks such as cross-language retrieval, speech
retrieval and question answering. TREC has standardized the
evaluation methodology used to assess the quality of retrieval
results and, through the large repository of retrieval runs,
demonstrated both the validity and efficacy of the methodology. The
workshops themselves have provided a forum for researchers to meet,
facilitating technology transfer and discussions on best practices.
Most importantly, retrieval effectiveness has doubled since TREC
began.
This article provides a brief introduction to TREC. After an
initial section that describes how TREC operates, the article
summarizes the impact TREC has had in the areas of retrieval system
effectiveness, retrieval system evaluation and support of new
retrieval tasks.

TREC Mechanics
TREC is sponsored by the U.S. National Institute of Standards
and Technology (NIST) with some support from the U.S. Department of
Defense. Participants in TREC are retrieval research groups drawn
from the academic, commercial and government sectors. TREC assumes
the Cranfield paradigm of retrieval system evaluation, which is
based on the abstraction of a test collection: a set of documents, a
set of information needs that TREC calls topics and a set of
relevance judgments that say which documents should be retrieved for
which topics.
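To make the abstraction concrete, the sketch below shows one possible rendering of a test collection in Python. It is purely illustrative: the identifiers and example documents are invented, and the structures shown are not official TREC file formats.

    # A minimal, hypothetical rendering of the test collection abstraction:
    # documents, topics and relevance judgments (often called qrels).
    documents = {                       # doc_id -> document text
        "DOC-001": "Tire recycling plants report rising revenues ...",
        "DOC-002": "City council debates landfill fees ...",
    }
    topics = {                          # topic_id -> natural language statement
        "301": "What are the economic benefits of recycling tires?",
    }
    qrels = {                           # (topic_id, doc_id) -> 1 if relevant, else 0
        ("301", "DOC-001"): 1,
        ("301", "DOC-002"): 0,
    }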
For each TREC, NIST supplies a common set of test documents
and a set of 50 topic statements. The format of the topic statements
has varied over the years, but generally consists of at least a
brief natural language statement of the information desired (e.g., What
are the economic benefits of recycling tires?). Participants
use their systems to run the topics against the document collection
and return to NIST a list of the top-ranked documents for each
topic.
Since TREC document sets contain an average of about 800,000
documents, they are too large for each document to be judged for
each topic. Instead, a technique called pooling
is used to produce a sample of the documents to be judged. For each
topic, the pool consists of the union of the top 100 documents
across all runs submitted to that TREC. Because different systems
tend to retrieve some of the same documents in their top 100, this
process produces pools of approximately 1500 documents per topic. A
human assessor then examines each document in the pool and judges
whether it is relevant to the topic.
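A minimal sketch of the pooling computation follows, under the assumption that each submitted run is represented simply as a ranked list of document identifiers per topic; this simplified run structure is an illustration, not the TREC submission format.

    # Illustrative pooling: union of the top-k documents from every submitted run.
    # runs: run_name -> {topic_id -> ranked list of doc_ids, best first}
    def build_pools(runs, pool_depth=100):
        pools = {}                               # topic_id -> set of doc_ids to judge
        for ranking_by_topic in runs.values():
            for topic_id, ranked_docs in ranking_by_topic.items():
                pools.setdefault(topic_id, set()).update(ranked_docs[:pool_depth])
        return pools

    # Overlap between runs keeps each pool well below
    # (number of runs) * pool_depth documents.
    runs = {
        "runA": {"301": ["d1", "d2", "d3"]},
        "runB": {"301": ["d2", "d3", "d4"]},
    }
    print(build_pools(runs, pool_depth=3))       # topic 301's pool: d1, d2, d3, d4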
Once all the relevance judgments for all of the topics in the
test set are complete, NIST evaluates the retrieval runs on the
basis of the relevance judgments and returns the evaluation results
to the participants. A TREC cycle ends with the workshop itself, a
forum in which participants share their experiences.
The first two TRECs had two tasks, the ad hoc task and the
routing task. The ad hoc task is the prototypical retrieval task
such as a researcher doing a literature search in a library. In this
environment, the system knows the set of documents to be searched
(the library’s holdings), but cannot anticipate the particular
topic that will be investigated. In contrast, the routing task
assumes the topics are static but need to be matched to a stream of
new documents. The routing task is similar to the task performed by
a news-clipping service or a library’s profiling system.
Starting in TREC-3, additional tasks, called tracks, were
added to TREC. The tracks serve several purposes. First, tracks act
as incubators for new research areas. The first running of a track
often defines what the problem really
is, and a track creates the necessary infrastructure such as test
collections and evaluation methodology to support research on its
task. The tracks also demonstrate the robustness of core retrieval
technology in that the same techniques are frequently appropriate
for a variety of tasks. Finally, the tracks make TREC attractive to
a broader community by providing tasks that match the research
interests of more groups.

Evaluation Methodology
As mentioned, the original catalyst for TREC was the request
to create a large test collection, but that goal was broadened to
standardizing and validating evaluation methodology for retrieval
from realistically sized collections. A standard evaluation
methodology allows results to be compared across systems –
important not so there can be winners of retrieval competitions, but
because it facilitates the consolidation of a wider variety of
results than any one research group can tackle. TREC has succeeded
in standardizing and validating the use of test collections as a
research tool for ad hoc retrieval and has extended the use of test
collections to other tasks. This section summarizes the support for
this claim by examining three areas: the test collections, the
trec_eval suite of evaluation measures and two experiments that
confirm the reliability of comparing retrieval effectiveness using
test collections.
Through the pooling process described above, TREC has created
a set of test collections for the English ad hoc task. In the
aggregate, the collections include five disks of documents, each
containing approximately one gigabyte of English text (largely news
articles but also including some government documents and some
abstracts of scientific papers) and nine sets of 50 topics. Each
topic has a set of manual relevance judgments for the corresponding
document set. The collections are publicly available (see http://trec.nist.gov/data.html)
and are now the collections of choice for most researchers working
in basic retrieval technologies.
The addition of tracks to TREC allowed the creation of test
collections for other tasks, as well. Collections have been made for
languages other than English, media other than text and tasks that
range from factoid question answering to text categorization. In
each case the test collections have been integral to progress on the
task.
Scoring the quality of a retrieval result given the system
output from a test collection has been standardized by the trec_eval
program written by Chris Buckley (see http://trec.nist.gov/trec_eval).
The trec_eval program provides a common implementation of more than
100 evaluation measures, ensuring that issues such as interpolation
are handled consistently. A much smaller set of measures has emerged as
the de facto standard by which retrieval effectiveness is
characterized. These measures include the recall-precision graph,
mean average precision and precision at ten retrieved documents.
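As a rough illustration of how two of these measures are defined, the sketch below computes average precision and precision at a cutoff for a single topic from a ranked list and a set of relevance judgments. It is a simplified sketch, not the trec_eval implementation itself.

    # Simplified versions of two standard measures for one topic.
    # ranked_docs: list of doc_ids in rank order; relevant: set of relevant doc_ids.
    def average_precision(ranked_docs, relevant):
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank     # precision at each relevant document
        return precision_sum / len(relevant) if relevant else 0.0

    def precision_at_k(ranked_docs, relevant, k=10):
        return sum(1 for d in ranked_docs[:k] if d in relevant) / k

    ranking = ["d3", "d7", "d1", "d9", "d2"]
    judged_relevant = {"d3", "d1", "d8"}
    print(average_precision(ranking, judged_relevant))    # approximately 0.56
    print(precision_at_k(ranking, judged_relevant, k=5))  # 0.4

Mean average precision is simply the mean of these per-topic average precision values over all topics in the test set.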
One objection to test collections that dates back to the
Cranfield tests is the use of relevance judgments as the basis for
evaluation. Relevance is known to be very idiosyncratic, and critics
question how an evaluation methodology can be based on such an
unstable foundation. An experiment using the TREC-4 and TREC-6
retrieval results investigated the effect of changing relevance
assessors on system comparisons. The experiment demonstrated that
the absolute scores for evaluation measures did change when
different relevance assessors were used, but the relative scores
between runs did not change. That is, if system A evaluated as
better than system B using one set of judgments, then system A
almost always evaluated as better than system B using a second
set of judgments (the exception being runs that scored so similarly
to one another that they should be deemed equivalent). This
stability held for different
evaluation measures and for different kinds of assessors and was
independent of whether a judgment was based on a single assessor’s
opinion or was the consensus opinion of a majority of assessors.
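One common way to quantify how well two sets of judgments agree on the relative ordering of systems is a rank correlation such as Kendall's tau. The sketch below is illustrative only; the system names and scores are invented, and this is not the code or data of the experiment described above.

    # Kendall's tau between two orderings of the same systems, each ordering
    # induced by evaluation scores computed with a different set of judgments.
    from itertools import combinations

    def kendall_tau(scores_a, scores_b):
        systems = list(scores_a)
        concordant = discordant = 0
        for s1, s2 in combinations(systems, 2):
            diff_a = scores_a[s1] - scores_a[s2]
            diff_b = scores_b[s1] - scores_b[s2]
            if diff_a * diff_b > 0:
                concordant += 1          # both judgment sets order the pair the same way
            elif diff_a * diff_b < 0:
                discordant += 1          # the two judgment sets disagree on this pair
        n_pairs = len(systems) * (len(systems) - 1) / 2
        return (concordant - discordant) / n_pairs

    map_with_assessor_set_1 = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22}
    map_with_assessor_set_2 = {"sysA": 0.27, "sysB": 0.25, "sysC": 0.20}
    print(kendall_tau(map_with_assessor_set_1, map_with_assessor_set_2))  # 1.0

A tau value near 1 indicates that the two sets of judgments rank the systems in essentially the same order, even if the absolute scores differ.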
The use of pooling, in which only some documents are judged for a
topic and all unjudged documents are treated as not relevant, was
another source of concern. Critics feared that runs that did not
contribute to the pool would be unfairly penalized in the evaluation
because those runs would contain highly ranked unjudged documents.
Examination of larger pools did confirm one aspect of the critics’
fears – there are unjudged documents remaining in the collections
that would have been judged relevant had they made it into the
pools. Further, the quality of the final test collection does depend on
the diversity of the runs that contribute to the pools and the
number of documents selected from each run. As an extreme example,
pools created from only the top-ranking document from each of 30
runs do not form a good test collection. But tests showed that the
TREC collections are not biased against runs that did not contribute
to the pools. In these tests, the documents uniquely retrieved by a
run were treated as not relevant when that run was evaluated. The difference in the
evaluation results for runs evaluated both with and without their
own uniquely retrieved relevant documents was smaller than the
difference produced by changing relevance assessors.

Scaling Retrieval Systems
When TREC began there was real doubt as to whether the
statistical systems that had been developed in the research labs (as
opposed to the operational systems that used Boolean searches on
manually indexed collections) could effectively retrieve documents
from “large” collections. TREC has shown not only that the
retrieval engines of the early 1990s did scale to large collections,
but that those engines have improved since then. This effectiveness
has been demonstrated both in the laboratory on TREC test
collections and by today’s operational systems that incorporate
the techniques. Further, the techniques are routinely used on
collections far larger than what was considered large in 1992. Web
search engines are a prime example of the power of the statistical
techniques. The ability of search engines to point users to the
information they seek has been fundamental to the success of the
Web.
Improvement in retrieval effectiveness cannot be determined
simply by looking at TREC scores from year to year. It is invalid to
compare the results from one year of TREC to the results of another
year since any differences are likely to be caused by the different
test collections in use. However, developers of the SMART retrieval
system kept a frozen copy of the system they used to participate in
each of the eight TREC ad hoc tasks. After every TREC, they ran each
system on each test collection. For every test collection, the later
versions of the SMART system were much more effective than the
earlier versions, with the later scores approximately twice those of
the earlier ones (see Figure 1).
While these scores are evidence for only one
system, the SMART system results consistently tracked with the
other systems’ results in each TREC, and thus the SMART results
can be considered representative of the state of the art. The
improvement was evident for all evaluation scores that were
examined, including mean average precision and precision and recall
at various cut-off levels.
Figure 1. Mean average precision for the SMART System by year and task
The TREC results suggest the increased effectiveness arises
from improvements in two general areas – query expansion and
indexing of full-text documents. Getting a better query from the
user is one of the best ways to improve retrieval performance, but
is frequently not a realistic option. Query expansion is a method
that can improve the query without relying on the user to supply it.
The most common way of expanding the query is to use pseudo-relevance
feedback, which involves the following steps (a sketch of these steps follows the list):
1. Retrieve a first set of documents using the original query.
2. Assume the first X documents are relevant.
3. Perform relevance feedback with that set of documents to create a new query (usually including both new terms and refined query weights).
4. Return the results of searching with the new query to the user.
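The skeleton below shows one way those steps fit together. The search and reweight_query functions are hypothetical stand-ins for a particular system's own retrieval and relevance-feedback components, not any standard API.

    # Skeleton of pseudo-relevance feedback. 'search' and 'reweight_query'
    # are placeholders for a system's own retrieval and feedback components.
    def pseudo_relevance_feedback(query, search, reweight_query,
                                  assumed_relevant=10, final_k=1000):
        # 1. Retrieve a first set of documents using the original query.
        initial_ranking = search(query, k=final_k)
        # 2. Assume the first X documents are relevant.
        feedback_docs = initial_ranking[:assumed_relevant]
        # 3. Create a new query from those documents (new terms, refined weights).
        expanded_query = reweight_query(query, feedback_docs)
        # 4. Return the results of searching with the new query.
        return search(expanded_query, k=final_k)

In practice, the expanded query usually retains the original query terms with adjusted weights and adds the most highly weighted terms drawn from the assumed-relevant documents.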
The TREC collections purposely contain full-text documents of
greatly varying lengths, in sharp contrast to the test collections
that predate TREC. Results from the first several TRECs show that
these full-text documents posed challenges for the retrieval systems
of that era, but researchers were soon able to overcome the
challenges. The following best practices for document indexing
emerged as a result:
· Tokenization that regularizes word forms is generally helpful. The most common form of regularization is stemming, but normalizing proper nouns to a standard format can also be helpful.
· Simple phrasing techniques are generally helpful. The most helpful part of phrasing is the identification of common collocations that are then treated as a single unit. More elaborate schemes have shown little benefit.
· Appropriate weighting of terms is critical. The best weighting schemes reflect the discrimination power of a term in the corpus and control for document length. There are several different weighting schemes that achieve these goals and are equivalently effective; one illustrative formulation is sketched after this list. Language modeling techniques are not only effective but also provide a theoretical justification for the weights assigned.
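To make the weighting idea concrete, the sketch below computes a simple term weight that combines a length-normalized term frequency with an inverse document frequency component. It is merely one illustrative formulation of the properties described above, not the scheme used by any particular TREC system.

    import math

    # Simple tf-idf style weight with document-length normalization.
    # term_freq: occurrences of the term in the document
    # doc_len:   number of terms in the document; avg_doc_len: collection average
    # doc_freq:  number of documents containing the term; num_docs: collection size
    def term_weight(term_freq, doc_len, avg_doc_len, doc_freq, num_docs):
        if term_freq == 0:
            return 0.0
        tf = term_freq / (term_freq + doc_len / avg_doc_len)   # length-normalized tf
        idf = math.log(num_docs / doc_freq)                    # discrimination power
        return tf * idf

    # A rare term in a short document outweighs a common term in a long one.
    print(term_weight(3, 200, 400, 50, 800_000))
    print(term_weight(3, 2000, 400, 200_000, 800_000))

Here the logarithmic factor reflects a term's discrimination power (rare terms receive higher weights), while dividing the term frequency by the relative document length keeps long documents from being favored simply because they contain more words.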
The two gigabytes of text that was considered massive in 1992
is modest when compared to the amount of text some commercial
retrieval systems search today. While TREC has some collections that
are somewhat bigger than two gigabytes – the Web track used an
18-gigabyte extract of the Web, for example – there is once again
doubt whether research retrieval systems and the test collection
methodology can scale to collections another three orders of
magnitude larger. The terabyte track, introduced in TREC 2004, was
created to examine these questions.

Incubator for New Retrieval Tasks
The TREC track structure enables TREC to extend the test
collection paradigm to new tasks. Several of the TREC tracks have
been the first large-scale evaluations in that area. In these cases,
the track has established a research community and created the first
specialized test collections to support the research area. A few
times, a track has spun off from TREC and established its own
evaluation conference. The Cross-Language Evaluation Forum (CLEF,
see http://clef.iei.pi.cnr.it)
and TRECVid workshops (http://www.itl.nist.gov/iaui/894.02/projects/trecvid/)
are examples of this. Other conferences such as NTCIR (http://research.nii.ac.jp/ntcir)
and INitiative for the Evaluation of XML Retrieval (INEX, http://inex.is.informatik.uni-duisburg.de:2004)
were not direct spin-offs from TREC, but were inspired by TREC and
extend the methodology to still other areas.
The set of tracks run in any particular TREC depends on the
interests of the participants and sponsors, as well as on the
suitability of the problem to the TREC environment. The decision of
which tracks to include is made by the TREC program committee, a
group of academic, industrial and government researchers who have
the responsibility for oversight of TREC. Tracks are discontinued
when the goals of the track are met, or when there are diminishing
returns on what can be learned about the area in TREC. Some tracks
run for many years but change focus in different years.
Figure 2 shows the set of tracks that were run in the
different years of TREC and groups the tracks by the aspects that
differentiate them from one another. The aspects listed on the left
of the figure show the breadth of the problems that TREC has
addressed, while the individual tracks listed on the right show the
progression of tasks within the given problem area.
Figure 2. TREC tracks by year
Space limitations prohibit going into the details of all of
these tracks, and the interested reader is referred to the track
overview papers in the TREC proceedings (available at http://trec.nist.gov/pubs.html).
Instead, a few tracks that established new research communities are
highlighted.

Cross-Language Retrieval
One of
the first tracks to be introduced into TREC was the Spanish track.
The task in the Spanish track was a basic ad hoc retrieval task,
except the topics and documents were written in Spanish rather than
English. The track was discontinued when the results demonstrated
that retrieval systems could retrieve Spanish documents as
effectively as English documents. Another single language track,
this time using Chinese as a language with a very different
structure than that of English, was introduced next. Again, systems
were able to effectively retrieve Chinese documents using Chinese
topics.
There are a variety of naturally occurring document
collections, such as the Web, that contain documents written in
different languages that a user would like to search using a single
query. A cross-language retrieval system uses topics written in one
language to retrieve documents written in one of a variety of
languages. The first cross-language track was introduced in TREC-6
to address this problem. The TREC-6 track used a document collection
consisting of the French and German documents from the Swiss news
agency Schweizerische Depeschen Agentur plus English documents from the
Associated Press of the same time period. Topics were generated in
English and then translated into French, German, Spanish and Dutch.
Participants searched for documents in one target language using
topics written in a different language. Later versions of the track
had participants search for documents using topics in one language
against the entire combined document collection. Still later
versions of the track used more disparate languages (English topics
against either Chinese or Arabic document sets).
The TREC cross-language tracks built the first large-scale
test collections to support cross-language retrieval research and
helped establish the cross-language retrieval research community.
The track demonstrated that cross-language retrieval can be more
effective than the corresponding monolingual retrieval due to the
expansion that results from translating the query. TREC no longer
contains tasks involving languages other than English because there
are now other venues for this research. The NTCIR and CLEF
evaluations mentioned earlier offer a range of retrieval tasks with
a multilingual focus.

Spoken Document Retrieval
The
dual goals of the spoken document retrieval track were to foster
research on content-based access to recordings of speech and to
bring together the speech recognition and retrieval communities. The
track ran for four years, from TREC-6 through TREC-9, and explored
the feasibility of retrieving speech documents by using the output
of an automatic speech recognizer.
The documents used in the track were stories from audio news
broadcasts; the broadcasts were manually segmented into their component stories.
Several other forms of the content, including manual transcripts and
transcripts produced from a baseline recognizer, were also made
available to track participants. The different versions of the
broadcasts made it possible for participants to explore the effects
of varying amounts of errors in the text – from (assumed to be) no
errors for the manual transcripts through varying degrees of
recognition errors associated with the baseline and participants’
recognizer transcripts – on retrieval performance.
Over the course of the track, researchers developed systems
that achieved retrieval effectiveness on automatically recognized
transcripts comparable to that achieved on the human-produced
reference transcripts, and demonstrated that their technology is
robust across a wide range of recognition accuracies.
The worst effect of automatic recognition was out-of-vocabulary (OOV)
words. Participants compensated for OOV words by using adaptive
language models to limit the number of OOV words encountered and by
expanding the recognized text with related clean texts to include OOV
words in the documents. The track also contributed to the
development of techniques for near-real-time recognition of open
vocabulary speech under a variety of non-ideal conditions including
spontaneous speech, non-native speakers and background noise.

Video Retrieval
After the success of the spoken document retrieval track, TREC
introduced a video track to foster research on content-based access
to digital video recordings. A document in the track is defined as a
video shot. Tasks have included shot boundary detection, feature
detection (where features are high-level semantic constructs such as
“people running” or “fire”) and an ad hoc search task where
the topic is expressed as a textual information need possibly
including a still or video image as an example. The test set of
videos has been derived from a variety of sources including
broadcast news, training videos and recordings of scientific talks.
Because there is no obvious analog to words for video
retrieval, there has been little overlap in the techniques used to
retrieve text documents versus video documents. Yet interest in the
problem of video retrieval is high and increasing among researchers,
content providers and potential users of the technology. To allow
greater room for expansion than would be possible as a TREC track,
the video track was spun off from TREC as the separate TRECVid
workshop in 2003. TRECVid continues to date, with approximately 60
participating groups in TRECVid 2004.

Question Answering
While a list of on-topic documents is undoubtedly useful, even that
can be more information than a user wants to examine. The TREC
question answering track was introduced in 1999 to focus attention
on the problem of returning exactly the answer in response to a
question. The initial question answering tracks focused on factoid
questions such as “Where
is the Taj Mahal?” Later tracks have incorporated more
difficult question types, such as list questions (a question whose
answer is a distinct set of instances of the type requested, such as
“What actors have played
Tevye in Fiddler on the
Roof?”) and
definition/biographical questions (such as “What
is a golden parachute?” or “Who
is Vlad the Impaler?”).
The question answering track was the first large-scale
evaluation of open-domain question answering systems, and it has
brought the benefits of test collection evaluation observed in other
parts of TREC to bear on the question answering task. The track
established a common task for the retrieval and natural language
processing research communities, sparking a renaissance in question
answering research. This wave of research has produced significant
progress in automatic natural language understanding as researchers
have successfully incorporated sophisticated language processing
into their question answering systems.

Conclusion
Evaluating competing technologies on a common problem set is
a powerful way to improve the state of the art and hasten technology
transfer. TREC has been able to build on the text-retrieval
field’s tradition of experimentation to significantly improve
retrieval effectiveness and extend the experimentation to new
sub-problems. By defining a common set of tasks, TREC focuses
retrieval research on problems that have a significant impact
throughout the community. The conference itself provides a forum in
which researchers can efficiently learn from one another and thus
facilitates technology transfer. TREC also provides a forum in which
methodological issues can be raised and discussed, resulting in
improved text retrieval research. More information regarding TREC can be found on the TREC website – http://trec.nist.gov – and in the book TREC: Experiment and Evaluation in Information Retrieval, recently published by MIT Press.
Copyright © 2005, American Society for Information Science and Technology