The current research tested a template for image description to be used by naive image searchers in recording their descriptions of images. The attribute classes derived in the previous research were used to model the image description template. The results indicate that users may need training and/or more guidance in order to correctly assign descriptors to higher-level classes. Thus the template may be of more use to indexers, in guiding decisions about what to index, than as a guide to searchers in formulating structured image search requests.
One interesting result is that when participants were exposed to the full range of attribute classes in the follow-up research, the distribution of attribute classes changed from that of the original unprompted descriptions and more closely matched the results from image sorting and searching (as opposed to describing) tasks in the original research. This suggests that image indexing needs to accommodate a wide variety of attributes, including a range of "perceptual," "interpretive," and "reactive" attributes.
It should therefore be no surprise that language-based image indexing systems often fail the user in retrieval (Enser 1993; Keister 1994), and indexing for images has been referred to as "this largely unsolved problem" (Lynch 1991). Many authors concur that the major intellectual problem involved in access to images is the question of how to index them.
Creators of image indexing systems base their work largely on introspection to determine indexing terms that will meet the perceived needs of specific user communities. There are a number of extant classification systems, each addressing particular aspects of an image (pictorial matter or iconographical meaning), the needs of a particular collection (Christian art, academic library), or the type of user (art historian, architect). An indexing system, once chosen, then limits the types of access available for images to that provided by the particular indexing system.
This problem has become much more acute recently because of several factors: the availability of a number of new image indexing programs on the market, the rapid increase in the production of digital images and collections of such images, and the addition of many new imagebases to publicly available sites on the Internet. One author claims that new digital images are produced at a rate of millions per day (Jain 1993).
A methodology was needed which would allow participants to describe the images in as natural and unconstrained a manner as possible. Pictorial perception is a complex and as yet little understood process, and understanding this process requires an approach which preserves as much context as possible. Empirical reality is described similarly in discussions of qualitative research: complex, intertwined, best understood as a contextual whole, and inseparable from the individuals who know that reality (Bradley and Sutton 1993). The researcher thus chose a qualitative and exploratory approach as appropriate for pursuing this research, using multiple tasks to provide an environment in which a full range of image attributes would emerge.
Participants performed a describing task with images in several different contexts. Participants (N=82) were drawn from all levels of an academic setting. Color images were drawn randomly from a collection of notable illustrations from the Twenty-Fifth Annual American Society of Illustrators awards. Subject matter and style ranged from realistic to fantasy.
The tasks were administered to participants in groups. Six images were projected one at a time for two minutes each; participants were asked to generate written natural language descriptions of the images. One group was asked to simply describe what they "noticed" about an image (Descriptive Viewing Task). A second group was asked to describe the projected image as if it were an image which they hoped to find in a collection of images (Descriptive Search Task). The third task (Descriptive Memory Task) was administered to the first group after four weeks. Participants were asked to describe what they remembered about each image. These tasks generated data in the form of written words, phrases, and sentences.
To determine the distribution of attributes, each occurrence of a term was coded. This method follows the model of eye-tracking experiments (Buswell 1935; Yarbus 1967), in which each fixation is considered to be an indication of the focus of attention at a particular moment. A similar assumption was made for this research: each term was indicative of the focus of attention at a particular moment, and therefore each occurrence of a term was coded.
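As a minimal sketch of this counting rule, and not the procedure actually used in the study, the Python fragment below tallies every coded occurrence of a term and converts the tallies to percentages. The function name `class_distribution` and the sample (term, Class) pairs are hypothetical and are introduced only for illustration.

```python
from collections import Counter


def class_distribution(coded_terms):
    """Return {Class: percent of all coded term occurrences}.

    `coded_terms` is a list of (term, Class) pairs, one pair per
    occurrence, so a term that appears twice is counted twice.
    """
    counts = Counter(cls for _term, cls in coded_terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders Classes from most to least frequent.
    return {cls: round(100.0 * n / total, 1) for cls, n in counts.most_common()}


# Hypothetical coded occurrences; these are NOT data from the study.
sample = [
    ("woman", "PEOPLE"),
    ("wings", "LITERAL OBJECT"),
    ("creature", "LITERAL OBJECT"),
    ("red", "COLOR"),
    ("creature", "LITERAL OBJECT"),  # a repeated occurrence is coded again
]

dist = class_distribution(sample)
for cls, pct in dist.items():
    print(f"{cls:<16}{pct:>6.1f}%")
```

Counting occurrences rather than distinct terms mirrors the fixation analogy: a term mentioned twice is treated as two moments of attention.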
The product of this analysis was a baseline description of attributes typically reported by participants performing the describing tasks. The description contains three components:
Forty-eight image attributes and twelve higher level Classes of attributes were derived from the data, using the techniques described above. The twelve higher-level Classes, with brief definitions, are listed below.
Results from all three of the describing tasks in the original research were nearly identical. Results from the Descriptive Viewing Task, the task upon which the current research was based, are summarized in the table below. LITERAL OBJECT was overwhelmingly the most typically described component of images in this task. Several other Classes were described at roughly similar rates: COLOR, PEOPLE, LOCATION, CONTENT/STORY, and VISUAL ELEMENTS. The least typically described Class was ABSTRACT CONCEPTS. Other sorting and searching tasks in the original research showed different distributions of Classes, with more emphasis on Classes such as ART HISTORICAL INFORMATION and ABSTRACT CONCEPTS.
The results from the previous research seem to indicate that enumeration of attributes by participants was primarily dependent on the nature of the task; tasks which involved description of a pictorial image stimulated responses composed primarily of Perceptual attributes, while sorting stimulated Interpretive and Reactive attributes.
CLASS                   %
LITERAL OBJECT          34.3%
COLOR                    9.2%
PEOPLE                   8.7%
LOCATION                 8.3%
CONTENT/STORY            7.4%
VISUAL ELEMENTS          7.2%
DESCRIPTION              6.0%
PEOPLE QUALITIES         5.2%
ART HISTORICAL INFO.     3.8%
PERSONAL REACTION        3.7%
EXTERNAL RELATION        3.3%
ABSTRACT                 3.0%
Pretesting showed that the previous two-minute time limit was still adequate for completion of the task, and that participants were able to follow the directions to attempt to match a term with its Class. In order to control for any presentation order effects, the order of Classes was randomly varied in four different versions of the template which were distributed to the participants.
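The paper does not say how the four orderings were generated. One way such randomized template versions could be produced is sketched below in Python; the function name `make_template_versions`, the fixed seed, and the requirement that the four orderings be distinct are all assumptions made for illustration. The Class names are those reported in the distribution table.

```python
import math
import random

CLASSES = [
    "LITERAL OBJECT", "COLOR", "PEOPLE", "LOCATION", "CONTENT/STORY",
    "VISUAL ELEMENTS", "DESCRIPTION", "PEOPLE QUALITIES",
    "ART HISTORICAL INFO.", "PERSONAL REACTION", "EXTERNAL RELATION",
    "ABSTRACT",
]


def make_template_versions(classes, n_versions=4, seed=0):
    """Return `n_versions` distinct random orderings of `classes`."""
    if n_versions > math.factorial(len(classes)):
        raise ValueError("not enough distinct orderings available")
    rng = random.Random(seed)  # seeded so the versions are reproducible
    versions, seen = [], set()
    while len(versions) < n_versions:
        order = list(classes)
        rng.shuffle(order)
        key = tuple(order)
        if key not in seen:  # discard an ordering that was already drawn
            seen.add(key)
            versions.append(order)
    return versions


versions = make_template_versions(CLASSES)
for number, order in enumerate(versions, start=1):
    print(f"Version {number}: " + ", ".join(order))
```

Each participant would then receive one of the versions, so that no single Class always appears first on the template.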
The attempt to analyze the participants' matching of their raw terms with the template Classes proved to be problematic for a number of reasons. Participants were inconsistent in their assignment of terms to Classes. Some would assign a single term to a Class ("adventurer" assigned to PEOPLE QUALITIES), while others would assign an entire phrase to a single Class ("woman with wings riding big four-legged creature" assigned to PEOPLE). Thus, one result of this approach would be that much visual information about the picture could be lost. An additional problem was that it became clear that some Classes were consistently misinterpreted. For instance, LOCATION was defined as "Locations of picture components," meaning use of terms such as "above," "background," "center." However, participants consistently assigned terms referring to the setting of the picture ("cave," "Japanese") to LOCATION.
It quickly became obvious from the results that naive users of such a template needed more detailed instructions in its use and perhaps even training to achieve any measure of consistency in matching terms to Classes. Because of the difficult nature of the records, the attempt to use the participants' own characterization of relationships of terms to Classes was not pursued. The analysis thus proceeded using the methods used in the original research and only the second distribution was analyzed. The researcher assigned the terms to their appropriate Classes using the previous coding scheme, and the distribution of Classes was analyzed and compared to the previous research. The research thus primarily addressed the question of whether the presentation of a full range of attribute Classes would change the image descriptions produced by participants, or whether the previous distribution of Classes was determined predominantly by the nature of the describing task, eliciting mainly Perceptual attributes.
CLASS                   %
LITERAL OBJECT          17.7%
CONTENT/STORY           14.9%
ABSTRACT                14.4%
COLOR                   12.3%
PEOPLE QUALITIES         8.1%
PEOPLE                   7.2%
DESCRIPTION              5.2%
ART HISTORICAL INFO.     5.8%
PERSONAL REACTION        5.0%
EXTERNAL RELATION        4.1%
VISUAL ELEMENTS          3.8%
LOCATION                 1.6%
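The comparison between the unprompted distribution and the template-prompted distribution can be made explicit by subtracting one set of percentages from the other. The Python sketch below does only that arithmetic on the figures reported in the two tables; the variable names are hypothetical, and no statistical test is implied.

```python
# Percentages from the original Descriptive Viewing Task (unprompted).
original = {
    "LITERAL OBJECT": 34.3, "COLOR": 9.2, "PEOPLE": 8.7, "LOCATION": 8.3,
    "CONTENT/STORY": 7.4, "VISUAL ELEMENTS": 7.2, "DESCRIPTION": 6.0,
    "PEOPLE QUALITIES": 5.2, "ART HISTORICAL INFO.": 3.8,
    "PERSONAL REACTION": 3.7, "EXTERNAL RELATION": 3.3, "ABSTRACT": 3.0,
}

# Percentages from the current research (template-prompted).
prompted = {
    "LITERAL OBJECT": 17.7, "CONTENT/STORY": 14.9, "ABSTRACT": 14.4,
    "COLOR": 12.3, "PEOPLE QUALITIES": 8.1, "PEOPLE": 7.2,
    "DESCRIPTION": 5.2, "ART HISTORICAL INFO.": 5.8,
    "PERSONAL REACTION": 5.0, "EXTERNAL RELATION": 4.1,
    "VISUAL ELEMENTS": 3.8, "LOCATION": 1.6,
}

# Change in percentage points for each Class (prompted minus original).
shift = {cls: round(prompted[cls] - original[cls], 1) for cls in original}

# List Classes from largest increase to largest decrease.
for cls, delta in sorted(shift.items(), key=lambda item: item[1], reverse=True):
    print(f"{cls:<22}{delta:+6.1f}")
```

Read this way, the largest gain falls to ABSTRACT and the largest loss to LITERAL OBJECT, with CONTENT/STORY also rising and LOCATION falling.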
The order of presentation of the list of Classes did not appear to have an effect on those Classes most typically noted, as the distributions of Classes remained similar regardless of the Class order.
However, these results do suggest that such a template may be useful in eliciting a wider range of attributes from searchers than they may otherwise spontaneously generate. The results also suggest that a wide range of attributes may contribute to a searcher's notion of "similarity." Results from a sorting task in the previous research demonstrated that Interpretive and Reactive attributes contribute to the notion of "similarity" as strongly as Perceptual attributes. The current research indicates that these types of attributes may be just as important in image descriptions, even though they are not spontaneously named. There appear to be constraints operating in image description which limit named attributes to primarily Perceptual types of attributes, yet these descriptions may be too limited when the task is to find "similar" types of images. Assistance in eliciting a full range of attributes from searchers without a human intermediary may thus facilitate searching.
Arnheim, Rudolf (1974). Art and Visual Perception: A Psychology of the Creative Eye. Berkeley, Calif.: University of California Press.
Bradley, J., & Sutton, B. (1993). "Reframing the paradigm debate." The Library Quarterly, 63(4): 405-410.
Buswell, G. T. (1935). How People Look at Pictures. Chicago: University of Chicago Press.
Drabenstott, Karen Markey (1984). "Interindexer consistency tests: a literature review and report of a test of consistency in indexing visual materials." Library and Information Science Research 6: 155-177.
Enser, P. G. B. (1993). "Query analysis in a visual information retrieval context." Journal of Document and Text Management, 1(1): 25-52.
Glaser, B. and A. Strauss (1967). The Discovery of Grounded Theory. Chicago: Aldine.
Hendee, William R. and Peter N. T. Wells, eds. (1993). The Perception of Visual Information. New York: Springer-Verlag.
Jain, R. C. (1993). "Storage and retrieval for image and video databases." In Proceedings - SPIE, the International Society for Optical Engineering in San Jose, CA, Wayne Niblack (ed.), 198-218.
Jörgensen, Corinne (1995). "Image Attributes: An Investigation." Ph. D., Syracuse University.
Keister, L. H. (1994). "User types and queries: impact on image access systems." In R. Fidel, T. B. Hahn, E. M. Rasmussen, & P. J. Smith (eds.), Challenges in Indexing Electronic Text and Images. Medford, NJ: Learned Information, Inc.
Liddy, E. D. (1991). "The discourse-level structure of empirical abstracts: An exploratory study." Information Processing & Management, 27(1): 55-81.
Lynch, C. A. (1991). "The technologies of electronic imaging." Journal of the American Society for Information Science, 42(8): 578-585.
Medin, Douglas L. and Lawrence W. Barsalou (1987). "Categorization processes and categorical perception." In Categorical Perception, Stevan Harnad (ed.). New York: Cambridge University Press, 1-25.
Saint-Martin, Fernande (1990). "Semiotics of visual languages." Advances in Semiotics. Bloomington and Indianapolis: Indiana University Press, 21-33.
Wilson, Brent G. (1966). "The Development And Testing of an Instrument to Measure Aspective Perception of Paintings." Ph. D., The Ohio State University.
Yarbus, A. L. (1967). Eye Movements and Vision. New York: Plenum Press.
© 1996, American Society for Information Science. Permission to copy and distribute this document is hereby granted provided that this copyright notice is retained on all copies and that copies are not altered.