Bulletin, February/March 2007

Special Section

Image Retrieval
Benchmarking Visual Information Indexing and Retrieval Systems

by Abebe Rorissa, Guest Editor

Abebe Rorissa is assistant professor in the Department of Information Studies at the University at Albany, State University of New York, Draper Hall, Room 113, 135 Western Avenue, Albany, NY 12222. He can be reached by email at arorissa<at>albany.edu or by phone at 518-442-5123.

What do cameras, Hollywood, Flickr, YouTube, magnetic resonance imaging (MRI) and computed tomography (CT) scans have in common? Among other things, they are tools, services or places for the creation, production, organization, management and sharing of images and/or videos. Information sources are becoming increasingly multimedia in nature. For the sake of brevity and to delimit the scope, this article will focus only on visual information, more specifically the indexing and retrieval of images and videos. As you may be aware from current technology events, news and controversies, Flickr (www.flickr.com) is a popular digital photo storage and sharing website and service, while YouTube (www.youtube.com), recently acquired (November 13, 2006) by Google, Inc., is perhaps the most popular free video sharing website.

Even though the exact amount of visual information is difficult to determine, there are an estimated 10 to 20 billion images and videos on the World Wide Web alone. This estimate does not include those in individual and private stock photo/video collections. However, only a small fraction of the large number of images and videos is indexed for effective and efficient retrieval.

What is it about anything visual that attracts people of all ages? Some researchers tell us that it is because our visual sense is the most powerful and highly developed of all the human senses. Here in the United States, the day after the Thanksgiving holiday (a great holiday where families get together and give thanks for their blessings, not to mention feast on some great food) is the unofficial opening day of the holiday shopping season. Shopping centers and stores open as early as midnight and usually offer sizable discounts on their merchandise. People form lines in front of stores and, as I joined them, I could not help but notice that the longest lines were at stores such as Best Buy and Circuit City where the hottest items were photo and video cameras, camera phones, plasma/high-definition TVs, DVD players/recorders and video game equipment – all appliances for image or video creation, sharing or viewing.

There is no doubt about the value of visual information for human communication. Images, such as paintings, were created and shared long before humans invented written language. Even in this electronic age where we can email or send a text message via mobile phones, words do not quite describe our vacation in Tahiti the way pictures and video clips do. Leonardo da Vinci’s Mona Lisa is indescribable, showing why a picture is worth at least a thousand words. It also illustrates why, even if you can use a thousand words, it is still difficult to generate the ones that adequately annotate, describe or index many images. This leads to the question: What is the current status of image and video indexing and retrieval?

Image Indexing and Retrieval: The Issues
Three brief examples will highlight persistent obstacles to overcome in indexing and retrieving images. First, there have been efforts to automate the process. Recently, I came across a BBC headline that read “System 'spots multimedia content'” in which the author describes a system that can automatically recognize the content in images and videos and assign labels. This is great news. However, a few paragraphs later, the author informs us that the system is not only a prototype but is also limited to certain scenes, such as a beach or a tennis match. While great strides have been made by designers of image indexing and retrieval systems, what the BBC article described is typical of current visual information retrieval systems.

Second, the human ability to create images is exploding. Before we had digital cameras, as soon as we came back from a vacation trip, we rushed to a photo printing store and then stuck the resulting pictures in a photo album. Now that digital cameras are cheaper, we take hundreds of photos and dump them on our computer hard drives before we take hundreds more. Some of us are guilty of not organizing our digital photos. It is not entirely our fault: intuitive, easy-to-use personal photo collection management tools, available commercially or otherwise, are scarce.

Third, there are the unique challenges of analysis in various domains, such as medical image indexing and retrieval, which is a very important application receiving considerable attention. Hospitals produce and maintain not only text-based records but also an enormous amount of visual data in the form of MRI and CT scan images. Medical images are becoming integral parts of medical records, which contain multimedia information. Medical researchers and practitioners rely on these multimedia records to make critical and, literally, life and death decisions. What is more, medical imaging is a multi-billion-dollar-a-year industry. While medical image indexing and retrieval are not much different from those for non-medical digital images, they nonetheless pose some distinct challenges. For example, there are often only subtle variations between two images, especially if they are within the same domain category or anatomical region, making it difficult to identify the difference. Other challenges include modality (whether the medical image was generated using MRI, CT scan or X-ray), anatomical region (e.g., head, lung/chest or leg) and the fact that medical images have lower signal-to-noise ratios than non-medical images.

Current Models and Systems
To fully understand and appreciate the challenges when it comes to image indexing and retrieval, a brief discussion of the state-of-the-art of current models and systems is necessary. There are two broad approaches to image indexing and retrieval: concept-based and content-based. The latter is the most widely used, not because it is superior in its performance (at least as far as human users are concerned) but because it is practical and economical in terms of time and manual effort. Concept-based image indexing and retrieval relies mainly on humans to annotate or index (using text) images manually, whereas content-based image indexing and retrieval uses automated (machine-based) image content or feature extraction (usually low level features such as color, texture, shape) and algorithms for image indexing and similarity matching. A general and simplified model of a query-by-example (QBE) content-based image retrieval (CBIR) system is shown in Figure 1.

Figure 1: A general model of CBIR system 

The color content of an image (in the form of a histogram or probability distribution depicting the intensities of pixels in an image) is the most widely used feature for content-based image retrieval (CBIR), while texture and shape features are also used, albeit to a lesser degree. The three types of image features are utilized in various CBIR applications ranging from scene/object and fingerprint identification and matching to face and pattern recognition. More often than not, a single feature is not enough to discriminate among a homogeneous group of images. In such cases, either pairs of these features or all of them are used for the purpose of indexing and retrieval. 
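To make the idea of a color histogram concrete, here is a minimal sketch (not from the article, and far simpler than a production CBIR system) that quantizes each RGB channel into a few levels and counts how often each color combination occurs, producing a compact, size-independent signature for an image:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count how often
    each (r, g, b) combination occurs, yielding a bins**3-dimensional
    color signature normalized to sum to 1."""
    # image: H x W x 3 array of 8-bit RGB values
    quantized = (image.astype(np.int64) * bins) // 256  # 0..bins-1 per channel
    # Fold the three per-channel indices into a single bin index
    idx = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3)
    return hist / hist.sum()  # normalize so images of different sizes compare

# A synthetic 4x4 "image": left half pure red, right half pure blue
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :2, 0] = 255  # red pixels
image[:, 2:, 2] = 255  # blue pixels
h = color_histogram(image)
print(h.max())  # 0.5 -- each of the two colors fills half the pixels
```

Normalizing the histogram is what lets a thumbnail and a full-resolution copy of the same scene produce comparable signatures.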

Similarity matching, through metrics called similarity measures, is done to determine the degree of relevance of an image in a collection to a query. Similarity matching is a key component of a content-based image retrieval (CBIR) system because finding a set of images similar to the image the user had in mind is its primary goal. In an ideal world, image indexing and retrieval systems would not only be able to identify and index relevant features of an image, but also be able to retrieve only relevant images matching a query by a human user. For today's systems, this ideal remains a long-term goal rather than a present capability.
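One widely used similarity measure for normalized histograms is histogram intersection. The sketch below (an illustrative toy, not any particular system's implementation) scores each indexed image against a query histogram and ranks the collection, which is the essence of query-by-example retrieval:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms:
    1.0 means identical distributions, 0.0 means no overlap."""
    return float(np.minimum(h1, h2).sum())

def rank_by_similarity(query, collection):
    """Return indices into `collection`, most similar to `query` first."""
    scores = [histogram_intersection(query, h) for h in collection]
    return sorted(range(len(collection)), key=lambda i: scores[i], reverse=True)

# Three toy 4-bin histograms standing in for indexed images
query = np.array([0.5, 0.5, 0.0, 0.0])
collection = [
    np.array([0.5, 0.5, 0.0, 0.0]),  # identical to the query
    np.array([0.4, 0.4, 0.1, 0.1]),  # similar
    np.array([0.0, 0.0, 0.5, 0.5]),  # no overlap
]
print(rank_by_similarity(query, collection))  # [0, 1, 2]
```

A real CBIR system would combine several such measures over color, texture and shape features, but the ranking step looks essentially like this.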

The above discussion may sound as though all the news about image indexing and retrieval is bad, and there is no light at the end of the tunnel. Actually, there is some good news. First, researchers have come to realize that in order to bridge the semantic gap (the difference between the physical, pixel-based, representation/description of a digital image and the semantic interpretation of the image by humans) a combination of concept-based and content-based approaches to image indexing and retrieval should be adopted. Second, system designers are adding relevance feedback and visualization tools as well as interactivity to CBIR systems, and they are testing their systems using agreed upon benchmarks. 

The business community is taking note of the advances made in visual information indexing and retrieval and investing in the field. Of course nothing propels a technology to new heights like an injection of money, and we need look no further than the Internet and the World Wide Web to find examples of technologies where potential for business applications made a significant contribution to their development and wider use.

Video Indexing and Retrieval
These days, with the help of a mobile phone, an individual can record an event and upload video clips to free video sharing websites such as YouTube faster than a major TV network, constrained by scheduling and other barriers, could broadcast it. Videos are used for entertainment, education, law enforcement and communication. Video indexing and retrieval involve temporal segmentation of video into shots (depicting an object or event) and scenes (a shot or series of shots of a single place or action), parsing (detecting scene changes or the boundaries between camera shots in a video), identification of key frames (representative of a shot), indexing using the key frames and similarity matching. The difficulty with video indexing and retrieval stems from the fact that video shots contain spatial, temporal and high-level semantic information, prompting the use of multiple features, including motion-based features and similarity measures that are not part of indexing and retrieval systems for still images.
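A common first step in the temporal segmentation described above is detecting hard cuts by comparing the gray-level histograms of consecutive frames. The following is a minimal sketch under simplifying assumptions (grayscale frames, a fixed threshold; real systems handle gradual transitions, fades and camera motion):

```python
import numpy as np

def detect_shot_boundaries(frames, bins=16, threshold=0.5):
    """Flag a shot boundary wherever the gray-level histogram of
    consecutive frames changes by more than `threshold` (L1 distance
    between normalized histograms, which ranges from 0 to 2)."""
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        levels = (frame.astype(np.int64).ravel() * bins) // 256  # quantize gray levels
        hist = np.bincount(levels, minlength=bins)
        hist = hist / hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)  # a new shot starts at frame i
        prev_hist = hist
    return boundaries

# Synthetic "video": three dark frames, then three bright frames (a hard cut)
dark = np.full((8, 8), 30, dtype=np.uint8)
bright = np.full((8, 8), 200, dtype=np.uint8)
frames = [dark] * 3 + [bright] * 3
print(detect_shot_boundaries(frames))  # [3]
```

Once boundaries are found, a key frame can be chosen from each shot and indexed with the same still-image techniques discussed earlier.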

Obviously, two similar shots or scenes should have similar contents. That is, they should contain similar frames or images. This property of videos could be used to adapt methods and algorithms for still image indexing and retrieval for video retrieval as well. Currently, query-by-content is the preferred method of video retrieval. With regard to benchmarking and evaluation of video indexing and retrieval systems, the main event is the TREC Video Retrieval Evaluation, TRECVid (www-nlpir.nist.gov/projects/trecvid/).

Benchmarking and Evaluation of Image and Video Indexing and Retrieval Systems
Within the last few decades, we have witnessed the positive role played by benchmarking and evaluation campaigns in text retrieval. For instance, the Text REtrieval Conference (TREC) (http://trec.nist.gov/overview.html) has contributed tremendously to the technology used by today's commercial search engines. In general, benchmarking is a process of gauging and comparing the performance of different systems with similar functions or applications. Given that current content-based image retrieval systems are either early-stage commercial ventures or open source and research prototypes, it is imperative that benchmarking and evaluation be an integral part of the development cycle. In content-based image retrieval, benchmarking requires an agreed-upon test collection of images with ground truth (degrees of similarity established through relevance judgments by users), search/query tasks and topics based on real user queries, and measures or metrics (such as precision and recall) to gauge system performance. Of course there should be forums or events where results of system performance evaluations are presented and discussed.
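Precision and recall, the two metrics mentioned above, are simple to state: precision is the fraction of retrieved images that are relevant, and recall is the fraction of relevant images that were retrieved. A small illustrative computation (the image names are made up for the example):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returns 4 images; ground truth says 5 images are relevant;
# 3 of the returned images (img1, img2, img7) are among the relevant ones.
retrieved = ["img1", "img2", "img3", "img7"]
relevant = ["img1", "img2", "img5", "img6", "img7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.75 0.6
```

The tension between the two (returning more images tends to raise recall but lower precision) is exactly what benchmark campaigns measure across systems on a shared test collection.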

Such events are held annually (for descriptions see http://muscle.prip.tuwien.ac.at/workshop2005_proceedings/loy.pdf). The most prominent among these are the cross-language image retrieval campaign ImageCLEF (http://ir.shef.ac.uk/imageclef/), which focuses on still non-medical images, and the Medical Image Retrieval Challenge Evaluation, ImageCLEFmed (http://ir.ohsu.edu/image/). These events have already seen participation from both academic and commercial research groups worldwide. ImageCLEF has been running for four years, and researchers have been submitting test/evaluation results based on the ImageCLEFmed medical image test collection (CasImage) for the last three years. Both are part of the Cross Language Evaluation Forum (CLEF) (http://clef.iei.pi.cnr.it/), a benchmarking event for multilingual information retrieval held annually since 2000. A similar event for benchmarking video retrieval systems is the TREC Video Retrieval Evaluation, TRECVid (http://www-nlpir.nist.gov/projects/trecvid/), which, like other TREC events, is sponsored by the U.S. National Institute of Standards and Technology (NIST). TRECVid has been running since 2001.

This section is brief because the other contributors to this special section of the Bulletin of the American Society for Information Science and Technology, who are all accomplished researchers in the field, discuss the topic in detail. Because image (medical and non-medical) and video indexing and retrieval share common problems and similar approaches to benchmarking and evaluation, the three articles in this issue each examine one benchmarking and evaluation campaign in detail. Paul Clough discusses ImageCLEF, William Hersh and Henning Müller discuss ImageCLEFmed, and Alan Smeaton describes TRECVid. The four authors are coordinators of these events.

Conclusion and a Look to the Future 
There is exponential growth in the number of images (both medical and non-medical) and videos, as well as in their users and in the research and publications on image and video indexing and retrieval. At the same time, research on and design of visual information retrieval systems are still uncoordinated. However, recent benchmarking campaigns, as discussed in the following articles, have brought a degree of coordination to image and video indexing and retrieval research.

Let us hope that there will come a time when not only the colors of pixels and the shapes of objects in an image can be extracted, but the emotions the image evokes in humans can also be identified – preferably automatically – and properly tagged. Only then can we say that the semantic gap has been bridged. For this goal to become a reality, content-based image retrieval systems should be evaluated and improved continuously. These tests should also be done in a coordinated manner so that they benefit all parties – system designers/developers and users alike. The websites of CLEF (http://clef.iei.pi.cnr.it/), ImageCLEF (http://ir.shef.ac.uk/imageclef/), ImageCLEFmed (http://ir.ohsu.edu/image/) and TRECVid (www-nlpir.nist.gov/projects/trecvid/) are good starting places for information on these campaigns and events.