Bulletin, February/March 2007

Special Section

TRECVid – Video Evaluation

by Alan Smeaton

Alan Smeaton is professor of computing at Dublin City University. He has been a coordinator of TRECVid since it started in 2001. He can be reached at Alan.Smeaton<at>DCU.ie

Evaluation has always been a hugely important aspect of information retrieval (IR) research. Even in the earliest days of our field, when the pioneers were wrestling with Boolean queries that were run against document titles and abstracts, measuring the effectiveness of new IR techniques was ingrained into the psyche of the information retrieval discipline. This experience manifests itself in what has become known as the Cranfield model for information retrieval evaluation, derived from work at the Cranfield Institute many decades ago. This model incorporates a test collection of documents, a set of test queries and, for each query, a set of judgments about whether each document in the collection is relevant or not relevant to that query. Working within such an environment, IR researchers developed measures for evaluating retrieval effectiveness, such as precision and recall, as well as methods to compute these measures over standard normalized points and to average them over sets of queries.
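As a rough illustration of the Cranfield-style measures just described, the sketch below computes precision and recall for one ranked result list and then averages over a set of queries. The function names, document identifiers and judgments are invented for the example, and average precision is used here as one common way of averaging; the text above does not prescribe a particular averaging method.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k documents of a ranked list."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, judgments):
    """Average the per-query average precision over all judged queries."""
    scores = [average_precision(runs[q], judgments[q]) for q in judgments]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical test collection: two queries with their relevance judgments.
runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d4", "d9"]}
judgments = {"q1": {"d1", "d2"}, "q2": {"d5"}}

print(precision_recall(runs["q1"], judgments["q1"], k=2))
print(mean_average_precision(runs, judgments))
```

The relevance judgments play the role of the "answer key" in the Cranfield model: the same judgments can be reused to score any number of competing systems.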

As the amount of information available for searching increased through the late 1980s, IR research was in a quandary, with no large-scale unified datasets easily available to IR researchers. In 1992 the annual series of TREC (Text REtrieval Conference) events commenced, initially led by Donna Harman at the National Institute of Standards and Technology, and this benchmarking activity for IR tasks has continued annually since then. When it launched, TREC provided IR researchers with a large test collection, uniform scoring procedures and a forum for organizations interested in comparing their results. This framework allowed the effectiveness and scalability of new IR approaches to be tested and measured against each other, which has contributed greatly to advancing the field. Now, in 2006, TREC has over 100 participating groups each year and continues to have a major impact on the development of information retrieval techniques.

Throughout its 15 years of existence TREC has continuously diversified to include retrieval from many different types of text data, including multilingual text, text derived from an OCR process, Web documents and text from specialist domains such as legal text, blog text and text from scientific articles on genomics. TREC has also diversified in the type of IR search task to be investigated, including question-answering, ad hoc retrieval, filtering and the retrieval of novel documents. In 2001 TREC launched a track on retrieval from digital video, and in 2003 this track separated from the main TREC activity. It is known as TRECVid, and this article gives a brief overview of it.

Video IR
When we index text documents in a conventional IR system in order to support searching, the process is usually to identify word strings from the text, perhaps to apply some stopword removal and word stemming, and finally to add entries to an inverted file and any other relevant data structures. Indexing and searching video in a TRECVid-like scenario is far more complex because video is a continuous medium that has multiple modalities for retrieval. In a contemporary video IR system, the first step is to structure the video into non-overlapping segments, usually corresponding to camera shots. Then, for each shot, we identify a single frame of the video, called a keyframe, which we use as a shot representative or surrogate. We then apply various automatic semantic feature detectors to the shot or keyframe to detect things such as people, sky, water, vegetation or the outdoors. At this stage we also identify any text that is associated with the shot. This may be a transcript of what is spoken, taken from closed captions or from automatic speech recognition, or it may be on-screen text, such as captions on news programs, recovered by applying optical character recognition to the video.
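The text-indexing steps described above (tokenizing, stopword removal, stemming, inverted file) can be sketched in a few lines, here applied to the text associated with each shot. This is a toy sketch under stated assumptions: the stopword list, the crude suffix stripper standing in for a real stemmer, the shot identifiers and the transcripts are all invented for illustration.

```python
import re
from collections import defaultdict

# Illustrative stopword list only; real systems use much longer lists.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into word strings, drop stopwords, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def build_index(shots):
    """Inverted file: each term maps to the set of shots whose text contains it."""
    index = defaultdict(set)
    for shot_id, text in shots.items():
        for term in tokenize(text):
            index[term].add(shot_id)
    return index


def search(index, query):
    """Return the shots that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical transcripts attached to three shots.
shots = {
    "shot_001": "The president is landing in the capital",
    "shot_002": "Planes landing at the airport",
    "shot_003": "Weather report for the capital",
}
index = build_index(shots)
print(search(index, "landing planes"))
print(search(index, "capital"))
```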

Once all this analysis is completed we can then support a variety of modalities for video (shot) retrieval. For example, a user could specify a text string that is matched against the spoken dialogue or against the video OCR in order to identify relevant shots. A user could also choose a subset of semantic features to filter a collection of shots to leave only those that contain vegetation or are outdoors, depending on what they are searching for. Alternatively, a user may have located a (still) image that is visually very similar to the video shots they are looking for. A video IR system should then be able to support visual query matching where one or more example frames or images are matched against shot keyframes from the video archive, based on similarity using color (local and global), texture, shapes, edges or other visual features. Such matching means that given an information need for a shot of, say, an airplane in flight, a user may locate sample images from some outside resource, and these images may then be used as seed queries against video keyframes from the archive. The final component of a contemporary video IR system is the browsing component. Video is inherently browsable, and people have the capability to very quickly browse video using fast-forward or skimming or various sophisticated keyframe browsing techniques.
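To make the idea of visual query matching concrete, the following minimal sketch ranks keyframes against an example image using a global color histogram and histogram intersection. This is only one simple choice among the visual features mentioned above, not a description of any particular TRECVid system; the pixel values, shot identifiers and the number of histogram bins are assumptions made for the example.

```python
def color_histogram(pixels, bins=4):
    """Normalized global RGB histogram; pixels is a list of (r, g, b) in 0..255."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Quantize each channel into `bins` levels, then combine into one bin index.
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1.0
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_keyframes(query_pixels, keyframes, bins=4):
    """Order shot identifiers by decreasing similarity of their keyframe to the query."""
    q = color_histogram(query_pixels, bins)
    scored = [
        (histogram_intersection(q, color_histogram(px, bins)), shot_id)
        for shot_id, px in keyframes.items()
    ]
    return [shot_id for _, shot_id in sorted(scored, reverse=True)]


# Hypothetical data: a mostly blue example image and two tiny synthetic keyframes.
query = [(20, 50, 210)] * 100
keyframes = {
    "shot_sky": [(30, 60, 220)] * 90 + [(250, 250, 250)] * 10,
    "shot_grass": [(20, 200, 30)] * 100,
}
print(rank_keyframes(query, keyframes))
```

A global histogram ignores where colors occur in the frame, which is why systems typically combine it with local color, texture or edge features.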

When we put together all these components, we have a complex approach to video analysis, indexing and retrieval that has been shown in TRECVid to be of great use to searchers when the collection of video is of the order of some hundreds of hours of content. For example, the TRECVid campaign in 2006 had 160 hours of TV news recorded in November and December 2005 from TV channels in Arabic, Chinese and English as the test dataset for searching purposes.

Video Tasks for Evaluation
Because video analysis, indexing and subsequent retrieval involves so many individual components, TRECVid has chosen to benchmark the effectiveness of a number of independent contributors to the overall process, as follows:

  • Shot boundary detection is the process of structuring post-produced video into individual component shots. TRECVid has been benchmarking the effectiveness and the efficiency of this operation for six years. Each year we choose about five or six hours of video and invite participants not only to detect but also to characterize the shot bounds into hard cuts or gradual transitions. Shot bounds in the dataset are then manually determined, and submitted runs are measured against this standard. 
  • Accurate detection of high-level semantic concepts represents an important goal in video analysis, and this technology has also been benchmarked in TRECVid for the last six years. In 2006 we selected 39 semantic concepts and requested participating groups to build detectors for all 39. Submitted results were then pooled to eliminate duplicates, and shots from the pool were then manually assessed for 20 of these 39, and results representing the performance of the participants’ submissions were computed.
  • Video retrieval in TRECVid is done in terms of video shot retrieval. Once again, participants submit their top-ranked shots in response to each of 24 topics or query definitions each year, and there are three approaches to retrieval that we evaluate. These are automatic retrieval where the query definition is used verbatim to retrieve shots by the participants’ systems, manual retrieval where a user is allowed to reformulate the topic definition but not allowed any further interaction, and interactive retrieval where a user is given a fixed time limit, up to 15 minutes, to complete the search using whatever resources the system can offer.
  • The final task we support in TRECVid is based on using a collection of unstructured video that has not been post-produced – namely rushes video, raw footage gathered by camera crew – which forms the base material that is then edited into programs for transmission. The task has been for participants to build systems that can ingest video and detect redundancy in a way that supports efficient searching. This task is not formally evaluated in TRECVid.
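The first task above, measuring submitted shot boundaries against a manually determined reference, could be scored along the following lines. This is a simplified sketch for hard cuts only: the frame tolerance and the greedy one-to-one matching rule are assumptions made for the example and are not taken from the official TRECVid evaluation procedure.

```python
def score_cuts(detected, reference, tolerance=2):
    """Match each detected cut to at most one unused reference cut within
    `tolerance` frames, then report (precision, recall)."""
    unmatched = sorted(reference)
    hits = 0
    for frame in sorted(detected):
        match = next((r for r in unmatched if abs(r - frame) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)  # each reference cut can be credited only once
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical frame numbers of hard cuts.
reference = [120, 348, 610, 902]
detected = [121, 350, 500, 903, 1200]
print(score_cuts(detected, reference))
```

Gradual transitions span a range of frames rather than a single frame, so scoring them would need an overlap criterion instead of a point tolerance.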

The protocol that TRECVid follows for all these tasks is that we specify the task, participants build the systems and submit the results, the organizers pool the submissions, manually assess them and compute performance figures. We then gather for a workshop to share results and share details of the various approaches. Participants also complete a preliminary paper for the workshop, which is subsequently updated and published.
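The pooling step in this protocol can be sketched as follows. The run names, shot identifiers, pool depth and the rule that unjudged shots count as not relevant are assumptions made for the illustration.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` shots from every submitted run, duplicates removed."""
    pool = set()
    for ranked_shots in runs.values():
        pool.update(ranked_shots[:depth])
    return pool


def precision_at(ranked_shots, judged_relevant, k):
    """Precision at rank k; shots never judged relevant count as not relevant."""
    return sum(1 for s in ranked_shots[:k] if s in judged_relevant) / k


# Hypothetical submissions for one topic from two groups.
runs = {
    "groupA_run1": ["s12", "s07", "s33", "s02"],
    "groupA_run2": ["s07", "s12", "s45", "s33"],
    "groupB_run1": ["s90", "s12", "s02", "s51"],
}
pool = build_pool(runs, depth=3)
print(sorted(pool))  # only these shots are shown to the human assessors

# Suppose the assessors mark two pooled shots as relevant.
judged_relevant = {"s12", "s45"}
for name, ranked in runs.items():
    print(name, precision_at(ranked, judged_relevant, k=3))
```

Pooling keeps the manual assessment effort proportional to the number of distinct shots submitted near the top of the rankings rather than to the size of the whole collection.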

TRECVid Participants
TRECVid participants represent a broad spectrum of sizes and experience in the video search domain. In 2006 over 70 groups signed up to take part in the evaluation cycle, and 54 groups completed one or more tasks and submitted results for evaluation. These groups range from the larger teams like IBM Research, Carnegie Mellon University and the National University of Singapore to smaller one- or two-person teams like CLIPS-IMAG in Grenoble, France, and the University of Sao Paulo in Brazil. The participants are mostly from single sites, but also include collaborations (for example, University of Sheffield and Glasgow University, Johns Hopkins University and Imperial College London) as well as larger multi-site collaborative teams like the K-Space and COST292 European consortia. The participation is truly global in scope with nearly 20 countries represented and the majority of participants coming from outside North America. Based on the numbers of authors of the workshop papers, we estimate that about 380 researchers worldwide participated in TRECVid in some way in 2006. That number makes TRECVid a very sizeable activity.

Why do groups participate in TRECVid? None of the participants are funded by TRECVid or its sponsors to take part, so the costs of participation in terms of labor, workshop travel, equipment and similar expenses are all borne by the participating groups themselves or come from outside funding. Thus from a value-for-money point of view, TRECVid represents an extremely worthwhile contribution to the field. Groups typically take part in TRECVid in order to measure some component of their own video IR systems. Typically there is some component or parameter of their system that has not yet been tuned, and TRECVid allows groups to submit variations of their own systems for formal evaluation and feedback. That facility means that most groups are competing against themselves rather than against each other, and the different runs from each group for the different TRECVid tasks usually reflect the knob-twisting or parameter-setting variations that groups engage in.

TRECVid Results and Contributions
Performance of groups in shot boundary detection is impressively good and has improved throughout the last six years. Many groups can now achieve over 90% precision and recall for hard cuts and well over 70% precision and recall for gradual transitions in less than real time on standard PCs. Some groups who work only on compressed video can achieve performance 20 times faster than real time, again on standard PCs. Shot boundary detection may appear to be a solved problem but novel approaches continue to emerge, and the task provides an excellent entry point for groups new to the field.

It has proven notoriously difficult to achieve good performance for semantic concepts. Some concepts are easier to detect than others, but none has been detected with better than about 50% accuracy. While this level of performance is initially disappointing, it is believed that the way forward is to have feature detectors work together rather than continuing their independent efforts. For example, a detector for beach scenes should leverage the positive outputs of detectors for outdoors, water and possibly people and should use the negative outputs of detectors for such concepts as indoors, buildings or car. Nevertheless, despite the poor performance in absolute terms, it has been shown repeatedly by several TRECVid participants that the quality of feature detection is already good enough to assist video retrieval.
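One simple way to let detectors work together, in the spirit of the beach example, is a weighted combination of their confidence scores passed through a logistic function. The sketch below is illustrative only: the weights, the bias and the detector outputs are invented, and in practice such parameters would be learned from annotated training data rather than set by hand.

```python
import math

# Hypothetical weights: positive for supporting concepts, negative for conflicting ones.
BEACH_WEIGHTS = {
    "outdoors": 1.5,
    "water": 2.0,
    "people": 0.5,
    "indoors": -2.0,
    "buildings": -1.0,
    "car": -1.0,
}
BEACH_BIAS = -1.5  # hypothetical offset so that weak evidence yields a low score


def fused_beach_score(detector_outputs, weights=BEACH_WEIGHTS, bias=BEACH_BIAS):
    """Combine per-concept confidences (0..1) into a single beach confidence (0..1)."""
    z = bias + sum(w * detector_outputs.get(c, 0.0) for c, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical detector confidences for two shots.
seaside = {"outdoors": 0.9, "water": 0.8, "people": 0.6,
           "indoors": 0.05, "buildings": 0.1, "car": 0.0}
office = {"outdoors": 0.1, "water": 0.0, "people": 0.7,
          "indoors": 0.95, "buildings": 0.6, "car": 0.0}

print(fused_beach_score(seaside))
print(fused_beach_score(office))
```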

The quality of video retrieval is such that many groups, at least a couple of dozen, can now build and deploy systems for video shot retrieval from hundreds of hours of content. Most systems use text wherever it is present, and all use some combinations of keyframe retrieval and semantic concept features. Where retrieval systems do differ is in their browsing and presentation interfaces, and here we have a rich repertoire of designs in which searching and browsing are intertwined.

Finally, one of the most notable aspects of TRECVid is that a large amount of free and open exchange of information, results and output among the participants is occurring. The shared resources include common shot bounds, common keyframes, shared results of TRECVid feature detectors, shared results of other non-TRECVid features, text from speech recognition and machine translation, and manual annotations. The open exchange of these data and others, in a common MPEG-7 format, means that groups new to the TRECVid activity can build upon the work of others and can help advance the field much more quickly.

TRECVid Plans 
TRECVid will run again in 2007, and for this iteration we have a richer set of data to use than heretofore. We have about 120 hours of fresh rushes video supplied to us by the BBC. We have another 400 hours of daytime TV/magazine-type content from the Netherlands Institute for Sound and Vision (Beeld en Geluid), and we are likely to get another 100 hours of rushes material from TeleMadrid. Given that securing access to video data and managing the intellectual property ownership issues for that data is actually the most difficult part of running TRECVid, we are in a good position for 2007 so far as data are concerned, and we can turn our attention to what we want to achieve. We also have the large archive of broadcast TV news material gathered over the last three years.

Many research groups have invested a lot into being able to analyze, index and support search on broadcast TV news and are loath to give up on this genre of video and simply move on to something else, so it is possible that we will retain some task on broadcast TV news. What is pretty certain is that we will continue search tasks, we will continue tasks which include semantic concept detection, and we will continue to evaluate approaches to shot boundary detection.

For further reading, readers are invited to check the TRECVid website at www-nlpir.nist.gov/projects/trecvid/ which has details of current and past TRECVid campaigns, including all past workshop papers.