A Unified Similarity Coefficient for Navigating Through Multi-Dimensional Information

Douglas Tudhope
Computer Studies Department
University of Glamorgan
Pontypridd
Mid Glamorgan CF37 1DL
Wales, UK.
dstudhope@glamorgan.ac.uk
tel: 01443-482271

Carl Taylor
2DK Multimedia
Covent Garden
London WC2E 8PS, UK
cdtaylor@dkmm.co.uk
tel: 0171-8365411

Abstract

This paper describes an integrated approach to similarity coefficients, for information spaces with multiple dimensions of different types of index term. Applications of similarity coefficients underlying different navigation tools in hypermedia are categorised by type of term. An implementation of a unified similarity coefficient based on work in numerical taxonomy is described, with illustrative scenarios from an experimental navigation via similarity tool for a prototype social history museum hypermedia system. The underlying architecture is based on a semantic approach, where semantic relationships can exist between index terms. This allows imprecise matching when comparing for similarity, with distance measures yielding a degree of match. A ranked list of matching items over several weighted dimensions is returned by the similarity navigation tool. The approach has the potential of allowing different access methods to multimedia data to be combined.

Introduction

This paper discusses the potential of a unified approach to similarity coefficients that allows multiple index dimensions to be integrated into a single, flexible navigation tool. This work is part of a larger project on a semantic modelling approach to hypermedia architecture, in which navigation in the system ultimately results in a query on a separate link store. The link store is based on an extended binary relational model (Beynon-Davies et al 1994). There are different dimensions of indexing in the link store and index dimensions can be structured with semantic relationships between index terms. The similarity coefficient described here utilises semantic distance measures to reason over these transitive relationships in order to provide non-exact matching of queries and information items.

Previously we have described the similarity measures and how they are used in navigation tools (Tudhope et al 1995a, 1995b). A series of research prototypes of social history museum hypermedia systems (in Lisp/HyperCard) have been implemented, and demonstrated to curators and historians, but not subjected to extended usability testing. The dataset is a small collection of historical photographs of Pontypridd from the archives of the Pontypridd Historical and Cultural Centre. They were indexed by a standard subject classification for social history, the Social History and Industrial Classification (SHIC 1983). SHIC has five hierarchical levels, of which we have implemented four. Two other fundamental dimensions of indexing were space and time.

This paper reports on current work on an experimental implementation of a unified similarity coefficient and an associated navigation tool. Integrating multiple dimensions allows the strengths of different indexing techniques to be combined. We first categorise different types of similarity measures by the kinds of index terms they employ. We then go on to describe the unified similarity coefficient and a navigation tool based on it, followed by a brief description of the underlying similarity measures.

Similarity

A similarity coefficient underlies many navigation techniques in hypermedia. In information retrieval, the vector space model is based on a similarity measure between two weighted vectors of real terms, which are frequently word stems automatically extracted from a text passage and a user's query (Salton 1989, Salton and Allan 1993). In hypermedia applications, a common form of navigation is for a user to find a node of interest using some means, and then to request other information items 'similar' according to certain criteria. Typical scenarios might include a user employing standard browsers to navigate to the first item, or having the first item returned as results of an initial query, whereupon, not being completely satisfied, the user selects the item to refine or replace the query. It is interesting that similarity (using a multi-dimensional set of index terms) was one of the basic retrieval strategies observed by the BookHouse field study of 'real life' user-librarian interaction in requests for books (Pejtersen 1989). An Apple/Grolier research project (Salomon et al 1989) based the Guides which offered different perspectives on the collection of articles on a similar similarity coefficient. Strathtutor (Kibby and Mayes 1989), a learning-by-browsing tutorial system, allowed tutors to code each frame of material with up to 60 attributes. It employed binary keywords, although the approach was influenced by Hintzman's human memory model with a 3-state scale. As an aid to 'taxonomic reasoning', Parunak's (1993) Hyperset system made use of a set-based similarity measurement based on the overlap between two sets of index attributes.

Hypermedia (multimedia) applications employ a wider range of types of index descriptors than has been usual in information retrieval. For example in content-based indexing and retrieval, image processing may be used to derive statistical characterisations of images or video frames based on colour, texture, or automatic feature extraction. This would allow a user to request an image database for similar images to a given drawing or photograph (eg the 'media-based navigation' of Hirata et al 1993). Figure 1 summarises a sample of similarity coefficient applications in hypermedia with respect to different types of index attribute, and whether index terms are predefined or automatically calculated.

Table 1: Different types of attributes in similarity coefficients

TermsPredefinedAutomatic
DichotomousHyperSet, BookHouse, Apple/Grolier, StrathtutorIR Vector Space
QuantitativeTalaria,spatial Euclidean distanceContent-based Image/Video, IR Vector Space
SemanticThesaurus Search AidsAutomatic Classification

Predefined terms may be standard keywords, classifications or thesauri, or specially defined for the application. Automatic calculation of terms might occur in free text retrieval methods, or as feature vectors resulting from automatic feature extraction to characterise components of a digital image.

Dichotomous terms are the most common type in hypermedia applications (the terminology is taken from Gower's characterisation of terms, see the unified similarity coefficient below). Here the descriptor vector simply indicates the presence/absence of (boolean) terms in the information item or query. Frequently vector space applications use weighted vectors of real terms to reflect the relative importance of terms in a document, or relevance to a particular query, and Table 1 lists them as both dichotomous and quantitative. Gower also distinguishes 'qualitative terms' which are multi-level terms (eg, red, green, blue) that do not form an ordered set, and with no semantic relationships between terms. The Talaria hypermedia healthcare training tool (Madigan et al 1994) extends the Strathtutor approach by assigning information units a position in a high-dimensional 'repertory hypergrid'. Terms can have six levels representing the degree of match between the information unit and one of 29 traits. Identification of traits and ranking of information units was the result of a prior knowledge elicitation exercise. Information units can be automatically linked by a Euclidean distance measure since terms are multi-level. Quantitative terms could also represent the cartographic data used in multimedia geographic information systems, and can represent the real coordinate systems used to express the results of the feature extraction algorithms used (say) in the Miyabi system (Hirata et al 1993) to index digital images, or in video (Zhang et al 1995) - the automatic content-based indexing of video (television news programs) to allow direct navigation to news segments. In museum applications, information items are frequently indexed by a standard classification system or thesaurus, where semantic relationships may exist between a set of index terms, for example hierarchic subclass/superclass relationships, or related terms. A semantic closeness measure allows a measure of similarity to be asserted between non-identical terms. Thus semantically structured index spaces could be considered another basic type.

Certain classes of hypermedia applications, including museum systems, require a multi-dimensional classification of information. Different aspects of an information space are indexed by independent dimensions, which may require different data types. Wu et al (1995) have implemented a content-based retrieval engine for multi-dimensional multimedia objects in image databases that includes a similarity coefficient combining different levels of indexing and similarity measures. They have developed applications in computer-aided facial image (mugshot) retrieval and a trademark archive. In social history museum information systems space, time, and subject classification are fundamental categories of information. Our previous work treated each dimension separately and intersected the results. Here we have been exploring the potential that a unified similarity coefficient can offer.

Unified Similarity Coefficient

We have adopted a unified similarity coefficient that operates over multiple index dimensions of different data types from work in numerical taxonomy (Sneath and Sokal, 1973). Numerical Taxonomy is concerned to measure taxonomic distance between items in order to automatically create a new biological classification. We wish to explore the use of unified similarity coefficients in hypermedia navigation, and have implemented a version of Gower's (1971) general coefficient of similarity (see also Tudhope et al 1995b). The general form of the coefficient S for two information items i and j is

Equation 1: The Unified Similarity Coefficient

Sijk is the similarity between the two items on the kth dimension of indexing. Wijk is a user-defined weight for each dimension, but is set to zero when no valid comparison is possible between two items on that dimension. Thus, the coefficient handles missing data (for example, dates might not be available or indexing might be incomplete). Each dimension returns an individual similarity measure between 0 and 1. Gower distinguished between quantitative, dichotomous, and qualitative terms, as described above. (Additionally, he distinguished 'alternative' terms from dichotomous where two absences count as a match.) We have extended the coefficient with semantic closeness measures where semantic relationships exist between terms, and in the time dimension included temporal periods. The aim is to incorporate:

Unified Similarity Tool

The demonstrator has six index dimensions: SHIC, time, and space carried forward from previous work (Taylor et al 1994), and three new ones: Keywords, the number of people in a media item, and Exhibited (object currently on exhibit), a dichotomous datatype. The new dimensions included were included as examples of the different datatypes. Npeople is a quantitative type (0..10 thesholding at 10) and has a simple distance measure described below. It would probably be usual for terms such as Exhibited to be used as an initial query to restrict the set of items considered by the tool.

A media item can have anything up to five keyword terms associated with it - an arbitrary limit. In contrast to controlled vocabularies, in museum archives there is frequently a short free text passage associated with an information item, perhaps hand writing on the back of a photograph, or entered by a curator to describe how the object was acquired. Such information may not be easy to reconcile with controlled vocabulary classification systems. However it is amenable to automatic term extraction and free text vector space techniques, for example. Incorporating such terms into a unified similarity coefficient offers the possibility of combining the strengths of both controlled vocabulary and free text information retrieval. The keyword terms for these artificial examples were keyed in (some were manually extracted from a text in the system produced by a former worker at the Brown Lenox Chainworks, commenting on the photographs). However, similar keywords or stems could be automatically extracted from any relevant documents using standard techniques.

Both structured, controlled vocabulary (thesauri, classifications) and free text approaches have strengths and weaknesses (Aitchison and Gilchrist 1987) in terms of retrieval. Here we try to illustrate the potential for combining the two. For another approach to the problem, with different levels of indexing for free text terms and semantic concepts, see Agosti et al (1995). The Cheshire II advanced online catalog (Larson et al 1995) combines boolean and free text queries with probabilistic ranking to exploit (in a working library environment) the advantages of both types of search strategy. Romer (1996) sees a need for combining text-based and automatic content-based methods for accessing image databases.

The figures show illustrative scenarios of how the tool might be used. Figure 1 shows the similarity tool. In the scenario, the user has previously used the hierarchical SHIC browser to view media items classified by the third level term Manufacture of Metal Goods. From the Media Viewer's list of matching media items, MI93 (Chainmaking: left photograph in Figure 1) was selected as the basis of a Find Similar operation. Realising that it would be useless to further use SHIC to refine the query, the user assigns weights to Keywords and Npeople in the media similarity tool and ignores the other dimensions. The results of the search over the media items previously found by the SHIC browser are shown in the Media Viewer in Figure 1. Some items have been discarded, and those that match reasonably well over the keywords and have fairly similar numbers of people are retained. In fact, our practice in the original SHIC indexing of the dataset was somewhat uneven; the set of items relating to the Brown Lenox chainworks were only indexed by a single SHIC term, unlike some other parts of the dataset, where several SHIC terms were employed in finer-grained classifications. However, we have encountered museum collections which include sections created by different curators with different practices in collection management; some may favour systematic indexing, while others may favour rich, unstructured textual descriptions at the expense of classification.

Figure 2 shows the results of a second navigation using four of the index dimensions: SHIC, Geography, Keywords, and Npeople. The weightings can be seen in the top left field of the media viewer window. The Figure shows two media items, the original media item (left), MI39, and one similar photograph (right). Having come across an interesting street scene of coronation celebrations, the user asks for similar items fairly generally. The photograph on the right is the same scene at night and scores highly here for its identical geographical and SHIC classifications (Events in Community Life, Food Retailing). In fact, plausible, but non-matching keyword terms (eg, butcher Vs shop) were artificially assigned to these photographs to illustrate one potential advantage of including a structured classification dimension (in this case SHIC). The semantic closeness measure over SHIC permits expression of similarity between photographs with differing numbers of terms and between close, but not exactly marching terms, in the SHIC hierarchical classification (in this scenario, Events in Community Life Vs Community Life, or Food Retailing Vs Retail Distribution). The geographical dimension also helps to return several photographs of the outdoor Market in the same general area. The media item, MI42, returned by the search is an interesting case, in that it does not possess classifications in each dimension. Due to incomplete information, no geographical classification could be assigned to it, and no comparison is possible on that dimension. In the unified similarity coefficient dimensions over which it is impossible to compute similarity are simply discounted from the calculation, something not possible in our previous work.

Semantic Distance Measures

There are different implementations of the semantic distance measures depending on the dimensions of indexing that support such measures (Tudhope et al 1995b, Taylor 1996). The temporal measure supports a notion of a time period, allowing closeness to be measured between a year and a decade (say), or between two periods. The spatial measure is modelled by a semantic relationship representing street intersection in an urban spatial model. The semantic closeness measure between two terms in the hierarchical subject index is based on a traversal of the transitive semantic relationships in the schema - the semantic net connecting the two classification terms in question. A cost factor is associated with each traversal between two directly connected terms. We are currently looking at extending the measures to deal with the complete range of thesaurus relationships. Keyword terms are matched using a maximal set similarity algorithm (Tudhope et al 1995b). However they could equally well be matched using standard vector space coefficients. The similarity of quantitative datatypes is calculated using Equation 2, which represents the Distance the terms are apart over the total range of the character.

Equation 2: Similarity of Quantitative Characters

Future work

Although the tool has not been through an evaluation phase, the examples presented attempt to demonstrate potential applications for navigation through multi-dimensional hypermedia information systems, with a similarity coefficient that combines index dimensions with different datatypes. For text-based access methods, combining free text and structured classifications may hold advantages.

We are exploring criteria for choosing the cost factor associated with each traversal in the semantic closeness measures, and what control can be passed up to the user interface. The question of usability of the tool needs its own investigation. The present experimental demonstrator employs a slow, brute force method when searching for similar information items; more work is needed on efficient versions of the unified similarity algorithm.

Acknowledgements

Assistance from the Pontypridd Historical and Cultural Centre, and in particular Brian Davies and David Gwyer, is gratefully acknowledged. We would also like to thank Steve Donne from Cyfarthfa Castle Museum, and Alice Grant from the Science Museum for helpful discussions, and Joe Male for the history of Brown Lenox. Thanks also to our colleagues Paul Beynon-Davies and Chris Jones.

References

Aitchison J., Gilchrist A. (1987). Thesaurus construction: a practical manual. London: Association for Information Management.

Agosti M., Melucci M, Crestani F. (1995). Automatic authoring and construction of hypermedia for information retrieval. Multimedia Systems, 3(1), 15-24.

Beynon-Davies P., Tudhope D., Taylor C., Jones C. (1994). A Semantic Database Approach to Intelligent Hypermedia Systems. Information and Software Technology, 36 (6), 323-329.

Gower, J. (1971). A General Coefficient of Similarity and Some of its Properties. Biometrics 27, 857-74.

Hirata K., Hara Y., Shibata N. (1993). Media-based Navigation for Hypermedia Systems. Proceedings ACM Conference on Hypertext, Seattle, 159-173.

Kibby M.R., Mayes J.T. (1989). Towards Intelligent Hypertext. In R. McCaleese (Ed.), Hypertext: Theory into Practice (pp 164-172). Oxford: Intellect.

Larson R., Moon R., McDonough J., Kuntz L., O'Leary P. (1995). Cheshire II: Design and evaluation of a next-generation online catalog system. Proceedings of the 58th Annual Meeting of the American Society for Information Science (ASIS'95), Chicago, 215-225.

Madigan, D., Chapman, C., Gavrin, J., Villumsen O., Boose J. (1994). ÔRepertory Hypergrids: An application to Clinical Practices Guidelines. Proceedings ACM European Conference on Hypermedia Technology, Edinburgh, 117-125.

Pejtersen A. (1989). A library system for information retrieval based on a cognitive task analysis and supported by an icon-based interface. Proceedings ACM Conference on Information Retrieval, 40-47.

Parunak, H. (1993). Hypercubes grow on trees (and other observations from the land of Hypersets). Proceedings ACM Conference on Hypertext, Seattle. 73-81.

Salomon G., Oren T., Kreitman K. (1989). Using Guides to Explore Multimedia Databases, Proceedings 22nd Hawaii International Conference on System Sciences, 3-12.

Salton G. (1989). Automatic Text Processing. Reading,MA: Addison-Wesley.

Salton G. and Allan J. (1993). Selective Text Utilization and Text Traversal. Proceedings ACM Conference on Hypertext, Seattle. 131-144.

Romer D. (1996). Image and Multimedia Retrieval. Getty Art History Information Program Research Agenda Project Report, http://www.ahip.getty.edu/agenda/image.html.

SHIC Working Party. (1983). Social History and Industrial Classification: A Subject Classification for Museum Collections. University of Sheffield, UK: Centre for English Cultural Tradition and Language.

Sneath P., Sokal R. (1973). Numerical Taxonomy: The Principles and Practice of Numerical Classification. San Francisco: W. H. Freeman and Company.

Taylor C., Tudhope D., Beynon-Davies P. (1994). Representation and Manipulation of Conceptual, Temporal and Geographical Knowledge in a Museum Hypermedia System. Proceedings ACM European Conference on Hypermedia Technology. Edinburgh, 239-244.

Taylor C. (1996). Semantic Modelling and Navigation in a Museum Hypermedia System. PhD Thesis. University of Glamorgan.

Tudhope D., Taylor C., Beynon-Davies P. (1995a). Semantic Navigation Tools for Museum Hypermedia. Computers and the History of Art, 5(2), 51-63.

Tudhope D., Taylor C., Beynon-Davies P. (1995b). Classification and Hypermedia. Proceedings 6th ASIS Sig/CR Classification Research Workshop, Chicago, 173-192.

Wu J., Narasimhalu A., Mehtre B., Lam C., Gao Y. (1995). CORE: a content-based retrieval engine for multimedia information systems. Multimedia Systems, 3(1), 25-41.

Zhang H., Tan S., Smoliar S., Yihong G. (1995). Automatic parsing and indexing of news video. Multimedia Systems 2(6), 256-266.


© 1996, American Society for Information Science. Permission to copy and distribute this document is hereby granted provided that this copyright notice is retained on all copies and that copies are not altered.