Bulletin of the American Society for Information Science

Vol. 26, No. 5

June/July 2000



Special Section

Sound and Speech in Information Retrieval: An Introduction

by Abby Goodrum and Edie Rasmussen

Through most of its history, information retrieval was synonymous with retrieval of the printed word. The last 30 years have seen rapid progress in retrieval of digital information: first Boolean queries, then natural language, now moving toward question answering. New technologies, and in particular the introduction of the World Wide Web as an information delivery mechanism, have changed our expectations for information organization and retrieval in ways that go beyond simple text retrieval. As information in media such as image, video and sound proliferates, we demand new ways of organizing and searching it. We want to match a face or an image, hum a tune to locate it, instantly find the right video clip. Research into techniques for content-based retrieval of images has resulted in systems that deliver images based on their color, texture and shape; recognition and retrieval in domains such as faces, fingerprints and trademarks are available. Researchers working with video address problems such as segmenting video by scene or topic, creating abstracts and supporting rapid browsing. Audio retrieval, especially music and speech, is also an active research area.

Retrieval of sound was explored in two panels presented at the ASIS Annual Meeting in Washington, DC, in November 1999. The first, The Sound of Information: Auditory Browsing and Audio Information Retrieval, was organized and moderated by Abby Goodrum, with papers presented by Stephen Downie, University of Illinois; Marilyn Tremaine, Drexel University; and Myke Gluck, Florida State University. The second, Information Retrieval from Speech, was organized and moderated by Edie Rasmussen, with papers presented by Ellen Voorhees, NIST; Douglas Oard, University of Maryland; Matthew Siegler, MediaSite; and Lynn Connaway, University of Denver, and Bob Bruce, netLibrary. Recognizing the commonality of theme for the two panels - the use of sound in support of information retrieval - the presenters were invited to document their presentations for the Bulletin, and this special section is the result.

Speech retrieval presents all the problems of text retrieval while adding a layer of its own. To conduct retrieval operations on speech, it must first be transcribed into text, and historically the high cost of manual transcription was a barrier. With automatic speech recognition, the cost and time barriers have fallen, but the recognition process is far from perfect (especially with multiple speakers and accented speech), raising research questions about the impact of imperfect transcripts on retrieval performance. Moreover, language patterns in speech differ from those in text, and vary widely with the circumstances surrounding the speech, which raises questions about the effectiveness of retrieval techniques developed and tested on text. The combination of continuous speech recognition and information retrieval is referred to as spoken document retrieval. Speech is also a component of other media, such as video, which allows the retrieval process to draw on combined sources of evidence, though at the cost of added complexity.
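The effect of recognition errors on retrieval can be seen in a minimal tf-idf sketch. Everything here is illustrative: the two toy transcripts, the hypothetical misrecognition of "recognition" as "wreck a nation," and the simple scoring function are assumptions for demonstration, not drawn from any system described in this section.

```python
from collections import Counter
import math

# Toy ASR transcripts. In d2 the word "recognition" has been
# misrecognized as "wreck a nation" (a hypothetical ASR error).
docs = {
    "d1": "speech recognition enables retrieval of spoken documents",
    "d2": "speech wreck a nation errors degrade retrieval of spoken documents",
}

def tf_idf_scores(query, docs):
    """Score each transcript by summed tf-idf weight of shared query terms."""
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter()                       # document frequency per term
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)              # term frequency within this transcript
        scores[d] = sum(tf[t] * math.log(1 + n / df[t])
                        for t in query.split() if t in tf)
    return scores

scores = tf_idf_scores("speech recognition retrieval", docs)
```

Because the recognizer garbled "recognition" in d2, that transcript matches one fewer query term and scores lower, even though the underlying audio was relevant. This is the kind of degradation the spoken document retrieval research examines.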

Non-verbal audio is also a rich source of information, whether we are talking about the subtle interplay of violins and cellos in a symphony or the familiar Doppler effect created by a race car as it speeds past. Music, for example, has its own semantics, calling for new forms of retrieval. The problems of digitizing and segmenting are joined by problems of representing non-textual, non-verbal information. Finding, for example, all instances of a certain pitch, harmony, rhythm or timbre challenges IR systems built essentially to match words in a query to words in a database.

Not only does audio convey its own information content, but it can also be used as an adjunct to other channels of information acquisition. Spreading our information retrieval and browsing abilities across sensory modalities allows visual attentiveness to be used elsewhere. For example, we use sound direction, echo and loudness as navigational tools and way-finding anchors.

In spite of its great potential - both as an information-bearing object and as an adjunct to support information seeking - research into audio retrieval and browsing is in its infancy. The purpose of this special section of the Bulletin is to provide a broad introductory perspective on the challenges and opportunities embodied in audio information retrieval.


Spoken document retrieval has been a research program (track) within the TREC (Text REtrieval Conference) since 1997. Its addition recognized the potential of this domain for retrieval in large multimedia collections. In her paper, "The TREC Spoken Document Retrieval Track," Voorhees describes the track and its success in providing a research infrastructure and impetus for improvement in retrieval performance from spoken documents. Oard, in his paper "User Interface Design for Speech-Based Retrieval," makes a compelling case for the future of information retrieval from the growing corpora of audio broadcasts. He describes current research programs and argues that interface design will be critical in formulating queries that take advantage of the varied features of speech.

Stephen Downie's paper, "Access to Music Information: The State of the Art," examines music information retrieval (MIR) systems from the perspective of allowing users to find musical information using music itself as an access point. This ability to frame music queries musically is on the cutting edge of MIR research today. What is most striking about this area of research is the challenge of adapting existing text retrieval systems to retrieve "musical words" such as the n-gram intervals contained within melodies. Representation is also a central theme of Myke Gluck's paper, "The Use of Sound for Data Exploration." He explores the use of sound as an additional element to support data representation and visualization for data mining, and describes a software tool for combined sonification and visualization studies designed for spatial data analysis.
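The idea of treating melodic interval n-grams as "musical words" can be sketched in a few lines. The MIDI pitch encoding, the choice of n = 3 and the underscore-joined token format are illustrative assumptions, not a description of any particular MIR system.

```python
def interval_ngrams(pitches, n=3):
    """Turn a melody (MIDI pitch numbers) into interval n-gram 'words'."""
    # Melodic intervals: semitone difference between consecutive notes.
    # Using intervals rather than absolute pitches makes the
    # representation invariant to transposition.
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    # Slide a window of length n to produce tokens a text retrieval
    # engine could index like ordinary words.
    return ["_".join(str(i) for i in intervals[k:k + n])
            for k in range(len(intervals) - n + 1)]

# Opening of "Twinkle, Twinkle, Little Star" as MIDI pitches:
# C4 C4 G4 G4 A4 A4 G4
query = interval_ngrams([60, 60, 67, 67, 69, 69, 67])
```

A hummed query, once converted to the same tokens, could then be matched against an index of such "words" by a conventional text retrieval engine, which is precisely the adaptation challenge the paper highlights.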

Importance of Audio Retrieval Research

The development of our aural capacity begins in the womb at around 16 weeks and is not complete until about 24 weeks, a fact that hints at the complex nature of hearing. Our ability to hear includes reception of vibrations through our skin, skeleton and vestibular system as well as the ear. From the moment we are born, hearing is one of the most important mechanisms we have for gathering information and making sense of our world. Our aural abilities are quite sensitive at birth, and even though some of this capability diminishes over time, for most of us hearing remains a strong mechanism for information acquisition. It is reasonable, therefore, to continue to develop mechanisms to extract, structure and create surrogates for speech and other audio information, and to pursue aural information retrieval and audio interface design research with vigor.

Abby Goodrum is with the College of Information Science and Technology, Drexel University, Philadelphia, PA. She can be reached by phone at 215/895-6627 and by e-mail at abby.goodrum@cis.drexel.edu. Edie Rasmussen is with the School of Information Sciences, University of Pittsburgh. She can be reached at 412/624-9459 or erasmus@mail.sis.pitt.edu.


© 2000, American Society for Information Science