Bulletin, February/March 2007


Special Section


Large-Scale Evaluation of Cross-Language Image Retrieval Systems

by Paul Clough 

Paul Clough is a lecturer in the Information Studies Department at the University of Sheffield. He is part of the information retrieval (IR) group and mainly specializes in multimedia IR (MMIR), geographic IR (GIR), cross-language IR (CLIR) and using natural language processing (NLP) within IR applications. He was co-founder of the CLEF cross-language image retrieval evaluation campaign called ImageCLEF. He can be reached at p.d.clough<at>sheffield.ac.uk. 

Evaluating the performance of information retrieval (IR) systems is an important part of the development process. For example, it is necessary to establish to what extent the system being developed meets the needs of end users, to show the effects of changes to the underlying system or its functionality on retrieval performance and to enable quantitative comparison between different systems and approaches. However, although most agree that evaluation is important in IR, much debate exists on exactly how this evaluation should be carried out. In this article we describe ImageCLEF, the cross-language image retrieval track of the Cross-Language Evaluation Forum (CLEF): a large-scale evaluation campaign with the aims of providing resources to the research community and of creating an environment in which to carry out and discuss the evaluation of image retrieval systems. 

Evaluation
Evaluation of retrieval systems tends to focus on either the system or the user. Saracevic (1995) distinguishes six levels of evaluation for information systems, including IR systems: 

  • engineering level, 
  • input level, 
  • processing level, 
  • output level, 
  • use and user level and 
  • social level. 

For many years IR evaluation has tended to focus on the first three levels, predominantly through the use of standardized benchmarks (or test/reference collections) in a laboratory-style setting. 

The design of a standardized resource for evaluating document retrieval systems was first proposed over 40 years ago (most notably in the Cranfield I and II projects) and has since underpinned major IR evaluation campaigns around the world, such as TREC (Text REtrieval Conference) in the US, CLEF in Europe and NTCIR in Asia. The main components of a typical IR test collection are a set of documents representative of a real-world scenario; statements of information needs (topics), expressed as narrative text or a set of keywords, that represent realistic requests; and a set of relevance judgments indicating which documents in the collection are relevant to each topic. 
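
As a minimal illustration of these three components, the Python sketch below represents them as simple in-memory structures. Everything in it (document identifiers, captions, topics and judgments) is invented, and the flat dictionary layout is an assumption for illustration rather than the format of any particular test collection.

    # Minimal sketch of a test collection: documents, topics and
    # relevance judgments (qrels). All content here is invented.
    documents = {                          # documents (here, image captions)
        "doc_01": "Fishing boats in the harbour",
        "doc_02": "Portrait of a lighthouse keeper, 1900",
    }
    topics = {                             # statements of information needs
        "topic_01": "pictures of fishing boats",
    }
    qrels = {                              # judgments: 1 = relevant, 0 = not
        "topic_01": {"doc_01": 1, "doc_02": 0},
    }

    def relevant_documents(topic_id):
        """Documents judged relevant to a topic, used to score system runs."""
        return {doc for doc, judged in qrels[topic_id].items() if judged}

    print(relevant_documents("topic_01"))  # -> {'doc_01'}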

The last component listed is one of the most contentious issues in IR itself since the whole notion of relevance is dependent upon many factors, including a user’s task, context, experience, domain knowledge and more. In standard IR, evaluation typically concerns itself with topical (or thematic) relevance: a document is judged relevant to a query if it can be said to be on (or about) the same topic. However, task and context are also very important, and in the case of image retrieval other factors which are unrelated to topic, such as visual features, also affect relevance (for example, whether objects mentioned in the query appear in the foreground or background). 

However, IR systems are increasingly used in an interactive way within a social context, and this trend drives the need for user-centered evaluation to address performance at the latter three levels (output, use and user, and social). User-centered evaluation is important because it assesses the overall success of a retrieval system (as determined by end users of the system). Such assessment takes into account factors other than just system performance, for example, the design of the user interface and system speed. A number of researchers have highlighted the advantages of user-centered evaluation, particularly in image retrieval systems.

Over the years the creation of standard test environments has proven invaluable for the design and evaluation of practical retrieval systems, both within and outside a competitive environment. This type of evaluation has not escaped criticism, for example over whether the performance of a system on a benchmark reflects how that system will perform in an operational setting, but it cannot be denied that this kind of organized, large-scale evaluation has done the field tremendous good. In the field of multimedia retrieval it has increasingly been accepted that systematic and standardized evaluation is needed to help advance the field. Indeed, several researchers have called for the creation of standard benchmarks for image retrieval and for large-scale evaluation campaigns similar to TREC to foster collaboration among members of the visual search community. 

ImageCLEF: Building a Test Collection for Multilingual Image Retrieval
Since 2003 my colleagues and I have been organizing an evaluation campaign for image retrieval (entirely on a voluntary basis) called ImageCLEF, which is attached to the Cross-Language Evaluation Forum (CLEF). CLEF, a spin-off from the TREC evaluation program run by the U.S. National Institute of Standards and Technology (NIST), focuses on multilingual information retrieval and has been running independently since 2000. 

At the same time the image retrieval community was calling for a standardized benchmark, the CLEF community was looking for new avenues of research to complement the existing multilingual document retrieval tasks being offered to participants. Image retrieval was seen as a natural extension of existing CLEF tasks, given the language neutrality of visual media and motivated by the desire to enable a global, multilingual community of users to access a growing amount of multimedia information. Strategically, by attaching ourselves to CLEF we were immediately part of an established benchmarking event with a growing number of participants. The main goals of ImageCLEF were (and still are) to provide a test collection to the research community at large, to run an event that engages researchers from both academia and industry, to create a forum in which to discuss and share knowledge and to provide the resources needed for systematic evaluation of image retrieval systems. One of the greatest obstacles to creating a test collection for use within benchmarking events is securing a suitable collection of images for which copyright permission has been obtained. Copyright has been a major factor influencing the datasets used in the ImageCLEF campaigns (although where possible we have attempted to model realistic tasks and scenarios).

ImageCLEF ran as an experimental track in CLEF 2003, where the focus was on multilingual access to cultural heritage using a collection of around 28,000 historic photographs from St. Andrews University Library (Scotland). We provided 50 queries in the form of short verbal descriptions, longer narratives and example relevant images (as judged by the topic creator). The topics of the queries were based on discussions with library staff about the types of requests commonly encountered from people searching for images from the archive. We also analyzed log files for a two-year period from an online text-based retrieval system used at St. Andrews University Library. Topics were translated into several languages by native speakers. 

To participate, groups had to translate the queries into English (the language of the image collection) to allow standard IR techniques to be used to find relevant images. Four groups participated in ImageCLEF 2003; their approaches focused very much on text and made no use of visual features. Submissions from the groups were used to generate image pools, which were judged for relevance on a ternary scheme: images were judged relevant, partially relevant or non-relevant. The person who created the topics judged the relevance of all images in the document pools to provide a “gold standard” judgment. In addition, we obtained a second set of judgments from a variety of volunteers. To reduce the subjectivity of relevance assessments, the set of relevant images was created from only those judged relevant by the topic creator and at least one other judge. Participating groups experimented with different translation strategies, different methods for expanding the initial query with additional terms and various indexing strategies, such as reducing words to their root forms. Despite low participation, we learned a great deal from organizing such an event and decided to run it again in 2004. 
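
As a rough sketch of how such a relevance set can be derived, the Python fragment below assumes the ternary judgments are coded 0 (non-relevant), 1 (partially relevant) and 2 (relevant) and held in dictionaries keyed by image identifier. The identifiers, judgments and coding are invented for illustration and are not the actual ImageCLEF data format.

    # Hypothetical ternary judgments for one topic:
    # 0 = non-relevant, 1 = partially relevant, 2 = relevant.
    creator_judgments = {"img_001": 2, "img_002": 1, "img_003": 0, "img_004": 2}
    other_judgments = [
        {"img_001": 2, "img_002": 0, "img_004": 0},   # volunteer judge A
        {"img_001": 1, "img_004": 2},                 # volunteer judge B
    ]

    def agreed_relevant_set(creator, others, threshold=2):
        """Keep images judged relevant by the topic creator AND by at
        least one other judge (the set of relevant images used for evaluation)."""
        return {
            image_id
            for image_id, judgment in creator.items()
            if judgment >= threshold
            and any(judge.get(image_id, 0) >= threshold for judge in others)
        }

    print(agreed_relevant_set(creator_judgments, other_judgments))
    # -> {'img_001', 'img_004'}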

Since 2004 the focus has shifted toward combining visual and multilingual textual features for multimodal, multilingual retrieval of images from medical and more general photographic collections. In 2004 the goal was to motivate participation from groups with a greater interest in visual information retrieval. A medical multilingual retrieval task was added to complement the existing retrieval task based on the St. Andrews historic collection of photographs (as described in the accompanying article in this issue of the Bulletin by William Hersh). The medical task has proven to be a valuable domain for ImageCLEF, and participation increased in 2004: a total of 17 research groups submitted results to one or more ImageCLEF tasks (12 groups to the non-medical task and 11 to the medical task). The medical task offered a very different challenge for researchers and attracted groups from medical institutions, thereby widening the audience of ImageCLEF. 

In addition to the medical task, we also added an interactive image retrieval task with the goal of investigating more user-centered issues of cross-language image retrieval, such as interaction with the user during query formulation, query refinement and results presentation. Participants were provided with a framework in which to run interactive experiments but had to design a system and recruit users to take part. The task was based on the St. Andrews collection, and users had to perform a target search: given an image, find it again. Only two groups participated in this task, reflecting a common difficulty in attracting participation for interactive experiments. Overall there was a stronger tendency for participants to use both visual and textual features in their submissions, and each domain exhibited particular problems.

In 2005 ImageCLEF ran a multilingual retrieval task with the St. Andrews collection for the last time and continued with the medical task, this time enlarging the dataset to over 50,000 images drawn from four different collections (provided in a unified format, but with differing metadata schemas to contend with). In addition to offering standard ad hoc retrieval tasks and an interactive task, we also included a new automatic image annotation task for medical images based on a set of fully classified radiographs from the RWTH Aachen University Hospital (9,000 used for training and 1,000 for testing). A classification scheme called the IRMA code, a multi-axial code with 116 classes (available in German and English), was used to categorize the images. The goal of this task was to test the state of the art in automatic annotation algorithms (and to attract more participants from the pattern recognition community). Image annotation can be seen as a precursor to retrieval in which textual metadata is generated automatically for use in standard text IR. Overall, 24 research groups from a variety of backgrounds and 14 different countries participated in 2005. There were 11 groups in the non-medical ad hoc task, 18 in the medical ad hoc task, 12 in the automatic medical image annotation task and two in the interactive task. 
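
To make the idea of annotation as a precursor to retrieval concrete, the sketch below appends a classifier's predicted label to an image's caption so that an ordinary text retrieval engine could index and retrieve the image. The classifier here is a toy lookup table, and the image identifiers and labels are invented; this illustrates the general idea only, not the IRMA data or any participant's system.

    # Toy stand-in for a trained annotation classifier: a lookup table
    # mapping (invented) image identifiers to (invented) class labels.
    toy_classifier = {
        "img_100.png": "x-ray, chest, frontal",
        "img_101.png": "x-ray, hand, oblique",
    }

    def annotate_for_indexing(image_id, caption, classifier=toy_classifier):
        """Append the predicted class label to the caption so a standard
        text retrieval engine can find the image by its visual content."""
        label = classifier.get(image_id, "unknown")
        return {"image": image_id, "text": (caption + " " + label).strip()}

    print(annotate_for_indexing("img_100.png", ""))
    # -> {'image': 'img_100.png', 'text': 'x-ray, chest, frontal'}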

In 2006 ImageCLEF continued to grow and maintained the four tasks run in 2005. An additional automatic image annotation task was included, this time focused on more general images (a dataset of objects provided by an industrial partner, LTUtech). The ad hoc medical retrieval task remained similar to 2005. The non-medical ad hoc retrieval task made use of a new image collection, the IAPR TC-12 benchmark: a set of 20,000 color photographs of a more general nature than the St. Andrews collection, with semi-structured captions in English and German. The majority of images were provided by a travel company, and the scenario modeled was that of customers requesting images taken by the travel guides who accompany them on their trips (and take many photographs). 

For this task, topics were generated based on a number of parameters, including an analysis of a log file from a system currently used by tour guides to access the photo collection, the distinction between pictures of general and specific objects and a measure of topic difficulty based on a linguistic analysis of the queries. The interactive image retrieval task was included in the CLEF interactive track (iCLEF), and its goal was to study multilingual searching in Flickr, a large-scale, Web-based social network for managing and sharing images (over five million publicly accessible photos). Participants could create their own interfaces based on the Flickr API (application programming interface) and compare these with a baseline system. We offered a range of tasks, with the emphasis for evaluation being measures of user satisfaction. Participation was high again in 2006, with 12 groups in the medical ad hoc task, 12 in the medical annotation task, 12 in the non-medical retrieval task, three in the general annotation task and three in the interactive image retrieval task. 

Discussion and Conclusions
We believe that ImageCLEF has continued to break down the barriers between research interests and real-world needs by running evaluation tasks modeled on scenarios found in multimedia use today. The image retrieval community was calling for resources similar to those that TREC provided for the document retrieval domain, and we have begun to explore some of the issues involved in providing such resources within the context of multilingual multimedia retrieval. The main findings from ImageCLEF so far include the following: 

  • A combination of visual and textual features generally improves retrieval effectiveness (see the fusion sketch following this list)
  • Visual features often work well for more “visual” queries
  • Multilingual image retrieval is as effective as its monolingual counterpart
  • Feedback can help improve retrieval effectiveness
  • Visual retrieval works well in constrained domains (for example, for medical radiographs). 
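
One simple way to realize the first finding above is late fusion: the scores from a text run and a visual run are normalized and linearly combined into a single ranking. The Python sketch below illustrates this with invented scores and an arbitrarily chosen weight; it is not the method used by any particular ImageCLEF participant.

    # Minimal late-fusion sketch: min-max normalize each run's scores,
    # then combine them with weight alpha (the value 0.7 is arbitrary).
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    def fuse(text_scores, visual_scores, alpha=0.7):
        text_n, visual_n = normalize(text_scores), normalize(visual_scores)
        docs = set(text_n) | set(visual_n)
        fused = {
            doc: alpha * text_n.get(doc, 0.0) + (1 - alpha) * visual_n.get(doc, 0.0)
            for doc in docs
        }
        return sorted(fused.items(), key=lambda item: item[1], reverse=True)

    # Invented example scores for three images from two separate runs:
    text_run = {"img_1": 12.3, "img_2": 8.1, "img_3": 2.0}
    visual_run = {"img_2": 0.91, "img_3": 0.75, "img_1": 0.10}
    print(fuse(text_run, visual_run))   # img_2 ranked first, then img_1, img_3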

We have used datasets that are publicly available (under an agreed license), and we believe that this approach will help to further image retrieval research. The ImageCLEF test collection provides a unique contribution to publicly available test collections and complements existing evaluation resources.

However, there are still many issues to address with regard to evaluation and we have by no means been able to provide a “silver bullet” solution yet. There is still a tension between running system-centered and user-centered evaluation on a large scale for image retrieval (see, for example, Forsyth (2001)). Most image retrieval in practice is interactive and therefore this mode has to be addressed as a priority in future ImageCLEF events. We have attempted to run interactive tasks, but these events have not been well attended. This lack of participation is not just a problem with image retrieval, but is actually an issue with IR evaluation in general. 

The following are among the specific activities we intend to pursue in forthcoming events:

  • Investigate which measures best reflect the performance of a system for image retrieval. For example, are users really concerned with recall, or do they only concern themselves with the first page of results? (A sketch contrasting the two follows this list.)
  • Include alternative criteria for evaluating systems in the results, for example, the time taken for a system to produce results
  • Investigate further what types of searches users typically perform in the domains explored to create realistic topics
  • Consider the evaluation of browsing – a very important strategy for image retrieval
  • Continue to add to the datasets and domains tested thus far
  • Attempt to better understand relevance in image retrieval, because it is not the same as for document retrieval. We want to find out which factors affect image relevance judgments and assess how objectively the relevance of images can be judged.
  • Investigate the usefulness and role of test collections in image retrieval evaluation, especially with regard to users 
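
To make the first question above concrete, the sketch below contrasts a first-page measure (precision at rank 10) with recall, which requires knowing the complete set of relevant images. The ranked list and relevance data are invented for illustration.

    # Contrast between a first-page measure (precision at k) and recall.
    def precision_at_k(ranked, relevant, k=10):
        return sum(1 for doc in ranked[:k] if doc in relevant) / k

    def recall(ranked, relevant):
        return sum(1 for doc in ranked if doc in relevant) / len(relevant)

    # Invented example: a run returns 100 images; 20 images are relevant
    # in total, of which 4 appear somewhere in the ranking.
    ranked_list = ["img_%d" % i for i in range(1, 101)]
    relevant_set = {"img_2", "img_5", "img_7", "img_40"} | {"rel_%d" % i for i in range(16)}
    print(precision_at_k(ranked_list, relevant_set))   # 0.3 (3 of the first 10)
    print(recall(ranked_list, relevant_set))           # 0.2 (4 of 20 relevant found)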


Only by pursuing these questions can we start to address some of the concerns expressed by researchers such as Saracevic. 

Acknowledgements
ImageCLEF is a team effort and would not run without the valuable help and support of many individuals. Among them are Henning Müller, Michael Grubinger, Thomas Deselaers, William Hersh, Allan Hanbury, Thomas Lehmann, Mark Sanderson and Carol Peters (from CLEF). We also thank all the providers of the data used in ImageCLEF over the years. 

For Further Reading
Dunlop, M. (2000). Reflections on Mira: Interactive evaluation in information retrieval. Journal of the American Society for Information Science, 51(14), 1269-1274.

Forsyth, D.A. (2001). Benchmarks for storage and retrieval in multimedia databases. Proceedings of SPIE International Society for Optical Engineering, 4676, 240-247.

Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In E. A. Fox, P. Ingwersen, & R. Fidel (Eds.), Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), Seattle, Washington, United States, July 9-13, 1995 (pp. 138-146). New York: ACM Press.