Cross-Language Transfer of Indexing Concepts for Storage and Retrieval of Moving Images: Preliminary Results

James M. Turner
École de Bibliothéconomie et des Sciences de L'Information
Université de Montréal
Montréal, Québec
turner@ere.umontreal.ca

ABSTRACT

In previous research, participants who screened a videotape of stock footage were asked to assign terms that could be used for retrieval of each shot. The shots were taken from the National Film Board of Canada's stockshot collection, and the data for the study were collected in English. Analysis focused on the most popular terms, since they are the potential indexing terms. In the present study, a French-language version of the research tapes was prepared, using the same images, and data were collected in French. The analysis compares the most popular terms identified in each of the two studies for each of the shots, in order to determine the rate of correspondence between potential indexing terms in each language. The long-term goal of this research is to investigate whether useful indexes to moving images can be generated automatically in other languages by filtering the indexing terms assigned in one language through a controlled vocabulary to derive the appropriate terms in the other language.

Introduction

The central problem in picture indexing is summed up in Brian O'Connor's statement that "there is no saying just how many words or just which words are required to describe any individual picture" (O'Connor 1996). However, some progress toward a partial answer in this area is being made. Previous research (Turner 1994) on user-assigned words for storage and retrieval of still and moving images found that the data, arranged in matrices relating participant responses to individual shots, followed Zipf's distribution: only a few terms are named frequently, most are named only once, and the majority of cells in the matrix are empty (Furnas et al. 1987). The few terms named frequently are the ones of interest for studying questions related to the metadata used for storage and retrieval of moving images, since they are the potential indexing terms.

The results of a follow-up study (Turner 1995) indicate a high degree of correspondence between the terms participants gave, when asked to supply words and phrases that could be used for later retrieval of the shots they saw on the research tapes, and the terms assigned to those same shots by professional indexers. In addition, the study found that the most popular terms supplied by users almost always appear in the running descriptions for these shots. These findings strongly suggest that the process of indexing moving images at the shot level can be successfully automated, using the textual metadata as the source for generating the index to the shots. Several approaches to this are possible, and studies to explore some of the possibilities have been planned (Turner 1996).

In order to consolidate the theoretical basis for this ongoing work, three studies are presently underway. The first is a replication of the original study with an important difference: the data for the original study were collected from English-speakers, while those for the study reported here are collected from French-speakers. The two other studies in this series will be undertaken using a new sample from the same database to collect data in English in one case, and in French in the other. In addition, the new sample will be completely random, whereas the shots used in the original research tapes constituted a selection of 50 shots based on agreement on the subject matter of the material from a random sample of 200 shots. The goal of the study reported here is to determine to what extent there is a correspondence between the data in English and French in terms of language and culture in the naming activity which formed the basis of the task of the participants. The long-term purpose of the study is to gather empirical evidence to form the theoretical basis for building bilingual and multilingual information systems for moving-image materials.


Methodology

The study is based on previous work, and a detailed description of the methodology used is given elsewhere (Turner 1994). This section discusses changes required to accommodate the particularities of the present study. These involve producing French-language versions of the data collection materials, which fall into two categories: printed materials and the research tapes. Changes made to the printed materials were straightforward, in that copies of the electronic files used to create the English-language forms were made, and the English text replaced by a French translation. In this way, the same appearance was maintained throughout. The printed materials include an information questionnaire used to categorize the participants, an instruction sheet explaining the task to be performed, a consent form each participant was required to sign, and the response sheets used for recording the terms each participant wished to associate with each shot on the research tapes.

Verbal information on the research tapes is in the form of text and a voiceover. None of the shots on the tapes has sound, but each shot is preceded by an announcement of its identifying number, and followed by a repetition of this number and the indication to participants that it is time to inscribe the terms they wish to assign to the shot on the response sheet. This information appears in written form on the screen and is accompanied by a voice reading the text, in order to reinforce it and to alert participants who may still be writing their responses that the next shot is about to be shown. Most of the participants providing French-language data were expected to be native French speakers. However, since most of these could be expected to understand English as well because of the environment in which they would be recruited (i.e. the greater Montreal area), it was reasonable to assume that the original research tapes could be used. Nevertheless, the possibility of introducing interference in the data collection process because of the constant cognitive processing required on the part of participants, and the accompanying possibility of confusion in the data, were worrisome. Thus, in order to remove these obstacles which might influence the quality of the data, a French-language version of the original research tapes was prepared.

Better equipment was available for generating the text component that appeared on screen and for the audio inserts than had been available for the original study, so that there is some improvement in the French-language version in the form of easier-to-read titles and a digitized voice track. Since the same titles and voice-over inserts are repeated throughout the tape (the only change between them being the shot numbers as the tape rolls), participants can quickly relegate them to background consciousness; thus it is not thought that the improved quality of these components in the French-language version of the tapes has any influence on the quality of the data collected. What is significant is that the preparation time of the tapes was considerably reduced from that of the original study. All aspects of the shots used on each of the two research tapes (identity, order, playing time, etc.) are identical to those of the tapes used in the original study. As with the tapes used in the 1994 study, both tapes contain the same shots, a mix of still and moving images, shown in the same order. The difference between the two tapes is that the shots in moving form on one tape are in still form on the other.


Results

Overall, the patterns in the data collected from French speakers seem very similar to those in the data gathered from English speakers in the 1994 study. Table 1 compares results for the shots classified as simple (i.e. with very few significant objects or events to name in the picture), in still and moving formats, as compiled from data collected from both English-speaking and French-speaking students classified in the "nonvisual" category (i.e. who are not registered in a program in which the focus is on construction or analysis of images). In the table heads, "NS" means "Nonvisual students", "SS" means "Simple still shots", "SM" means "Simple moving shots", "T1" means "Research Tape 1", and "T2" means "Research Tape 2". For each shot, "Term" represents the term which achieved the highest score among the data collected from these participants (i.e. the term which was named the most often), "Share" represents the share of the total number of occurrences of terms provided for the shot by the participants in the category that the top-scoring term achieved, represented as a percentage, and "Named" represents the percentage of participants in the category who named the top-scoring term. Since participants were permitted to name up to five terms for each shot, once a stem is recorded as a score for any given participant, any further occurrences of that stem in the participant's responses to the same shot are disregarded in the calculation of the metrics reported here.
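The calculation of "Share" and "Named" can be illustrated with a minimal Python sketch. The responses below are invented for illustration and are not data from the study; the STEMS lookup table is a hypothetical stand-in for the stemming method, and the sketch assumes that the total number of occurrences is counted after repeated stems from the same participant have been disregarded.

```python
from collections import Counter

# Hypothetical stand-in for the stemming step: a tiny lookup table.
STEMS = {"flags": "flag"}


def stem(term):
    """Lowercase a term and map it to its stem when one is listed."""
    term = term.lower()
    return STEMS.get(term, term)


def top_term_metrics(responses):
    """Return (top terms, Share %, Named %) for one shot.

    `responses` holds one list of terms per participant.
    """
    counts = Counter()
    for terms in responses:
        # A stem scores at most once per participant for a given shot.
        counts.update({stem(t) for t in terms})
    total = sum(counts.values())
    best = max(counts.values())
    # Ties for first place are all reported, in alphabetical order.
    top_terms = sorted(t for t, c in counts.items() if c == best)
    share = 100 * best / total           # share of all term occurrences
    named = 100 * best / len(responses)  # share of participants
    return top_terms, share, named


# Invented responses from four participants to a single shot.
responses = [
    ["flag", "flags", "wind"],
    ["flag", "canada"],
    ["flag", "pole"],
    ["sky"],
]

top_terms, share, named = top_term_metrics(responses)
print(top_terms, round(share), round(named))  # ['flag'] 43 75
```

In this invented example "flag" accounts for three of the seven recorded occurrences (43%) and was named by three of the four participants (75%).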



Table 1. Comparative data for English- and French-speaking students in the "nonvisual" category.

Shot  Metric  NS/SS/T1 (N=25)        NS/SS/T1 (N=29)        NS/SM/T2 (N=19)        NS/SM/T2 (N=24)
              English-speaking       French-speaking        English-speaking       French-speaking

01    Term    flag                   drapeau                flag                   drapeau
      Share   59%                    57%                    36%                    57%
      Named   96%                    100%                   63%                    100%

07    Term    chicken                poule                  chicken                poule
      Share   41%                    36%                    32%                    36%
      Named   92%                    90%                    74%                    100%

13    Term    forest/mountain/train  montagne               train                  train
      Share   17%                    18%                    18%                    33%
      Named   28%                    41%                    37%                    46%

15    Term    bird                   oiseau                 bird                   envol
      Share   36%                    36%                    41%                    22%
      Named   64%                    69%                    79%                    46%

16    Term    train                  train                  train                  train
      Share   38%                    38%                    38%                    49%
      Named   72%                    90%                    63%                    92%

18    Term    geography/lake         lac                    lake                   vue
      Share   12%                    18%                    18%                    24%
      Named   20%                    41%                    37%                    50%

24    Term    ship                   bateau                 ship                   bateau/traversier
      Share   26%                    22%                    19%                    15%
      Named   52%                    59%                    42%                    38%

26    Term    woodpecker             oiseau                 woodpecker             oiseau
      Share   40%                    32%                    31%                    33%
      Named   76%                    69%                    53%                    75%

28    Term    building               édifice                building               édifice
      Share   37%                    16%                    25%                    21%
      Named   64%                    34%                    47%                    42%

33    Term    moose                  forêt                  moose                  orignal
      Share   32%                    18%                    31%                    37%
      Named   56%                    34%                    58%                    88%

35    Term    sun                    coucher                sun                    coucher
      Share   63%                    33%                    37%                    33%
      Named   96%                    76%                    84%                    83%



Some aspects of the data need to be explained. As was the case with the data collected in the English-language study, the stemming method used is that described in Furnas et al. (1983). However, because phrases are constructed differently in English and French, some seemingly unrelated terms should be considered direct matches. This is the case for shot numbers 15 and 35. In shot 15, the participants who saw the moving version of the shot noted that the birds are taking off from the ground, then flying. The still version of the shot shows the birds airborne. In English, a phrase such as "birds taking off" or "birds flying" is stemmed to "bird" according to the method used. For French-speaking participants, the way that came to mind to express "birds taking off" was "envol des oiseaux". Using the stemming method, "envol" becomes the high-scoring term, but it is significant that "oiseau" ["bird"] and "vol" ["flight"] are the second and third runners-up respectively. Thus, in practical applications in information systems, if the top three terms are considered in the indexing (the approach adopted in the data analysis of the original study), there is a close correspondence among the significant terms across the languages.

Similarly, shot 35 (a sunrise) is interesting, first because it is not clear to most participants whether it is a sunrise or a sunset. In the English-language study, responses were divided equally between these two concepts, and it is noteworthy that in a practical application the shot could perhaps be used to represent either. In the data thus far collected, French-speakers exhibit the same ambiguity, most participants either giving the equivalent of both "sunrise" ["lever du soleil"] and "sunset" ["coucher du soleil"] or expressing in some other way that they are not sure which it is. Secondly, since the concepts as expressed in French are inverted in relation to the English expressions, both of the latter starting with "sun", the seemingly disparate results are actually very similar. Thus, for those who saw the still version of the shot, "coucher du soleil" ["sunset"], stemmed to "coucher", is the top-scoring term, "soleil" ["sun"] is the first runner-up, and "lever du soleil" ["sunrise"] is the second. For those who saw the moving version, "coucher du soleil", stemmed to "coucher", is again the top-scoring term, "paysage" ["landscape"] is the first runner-up, and "soleil" ["sun"] and "lever du soleil" ["sunrise"] are tied for second runner-up.

Among the French-speaking participants in this category, shot number 33 shows an interesting pattern. "Forêt" ["forest"] is the top term in the still version, and "orignal" ["moose"] in the moving version. This is likely due to the fact that the moose is rather camouflaged by the foliage in the image, and the motion cues in the moving version of the shot helped participants distinguish the presence of the moose.


Discussion

The Collins-Robert French-English, English-French Dictionary (Collins 1993), a widely used and respected bilingual dictionary, is used as a basis of comparison to determine whether there is equivalence between the terms given in each language for each shot. If the term in one language appears as a possible translation of the term in the other language, then the terms are deemed to be equivalent. On this basis, there is a direct correspondence between the two languages among the top terms in at least one version of the shot for ten of the eleven shots in the category reported here.
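The equivalence rule can be sketched in a few lines of Python. The two lookup tables below are small hand-made stand-ins, not entries copied from the Collins-Robert dictionary, and the function names are hypothetical. The top terms for shots 01, 26 and 33 are those reported in Table 1.

```python
# Hand-made stand-ins for the two halves of a bilingual dictionary.
# Each headword maps to the set of translations listed for it.
EN_TO_FR = {
    "flag": {"drapeau"},
    "woodpecker": {"pic"},
    "bird": {"oiseau"},
    "forest": {"forêt"},
}
FR_TO_EN = {
    "drapeau": {"flag"},
    "pic": {"woodpecker"},
    "oiseau": {"bird"},
    "orignal": {"moose"},
    "forêt": {"forest"},
}


def equivalent(en_term, fr_term):
    """Terms are equivalent if either one is listed as a possible
    translation of the other."""
    return (fr_term in EN_TO_FR.get(en_term, set())
            or en_term in FR_TO_EN.get(fr_term, set()))


def corresponds(shot):
    """True if the top terms match in at least one version of the shot."""
    return any(equivalent(en, fr) for en, fr in shot.values())


# Top terms (English, French) from Table 1 for three of the shots.
TOP_TERMS = {
    "01": {"still": ("flag", "drapeau"), "moving": ("flag", "drapeau")},
    "26": {"still": ("woodpecker", "oiseau"),
           "moving": ("woodpecker", "oiseau")},
    "33": {"still": ("moose", "forêt"), "moving": ("moose", "orignal")},
}

for number, shot in TOP_TERMS.items():
    print(number, corresponds(shot))
```

With these stand-in tables, shot 01 corresponds in both versions, shot 33 corresponds in the moving version only, and shot 26 does not correspond, since "oiseau" is not listed as a translation of "woodpecker".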

The remaining shot, number 26, reflects a more general indexing level among the French-speaking participants, the top term among whom was "oiseau" ["bird"], whereas with the English-speaking participants the more specific "woodpecker" scored highest. It is noteworthy that among the French-speaking participants, "pic" ["woodpecker"] was the first runner-up. Possibly, when all the data for the study are analysed, the overall results will absorb these differences in scores in a single category of participants.

In both languages in the data set reported here, a few shots have terms tied for the position of top-scoring term. In English, shot number 13 has a three-way tie and shot number 18 a two-way tie. In French, shot number 24 has a two-way tie. Again, little significance should be attached to this until all the data are compiled and analysed.


Conclusion

A cursory glance at the percentages given in Table 1 indicates a great deal of similarity, in the majority of cases, among the percentages calculated for the top terms supplied by English-speaking participants and those supplied by French-speaking participants for the corresponding shots screened in the same conditions. Although the calculations are preliminary and may change somewhat when all the data collected have been analysed, it is likely that the patterns which occur here will prevail.

If the results reported here are indeed representative of those of the completed study, then for the type of material that is the object of this research, namely everyday non-art images in still and moving form, the transfer of visual representations of objects and events to verbal expressions of these objects and events takes place in the same way for French-speakers as it does for English-speakers. This suggests that shot-level indexing of such material in either of these two languages could be transferred to the other language using automated techniques to filter the indexing terms through a bilingual controlled vocabulary database. In the context of the overall research agenda and in light of the results of the studies completed so far, it seems clear that the development of automated techniques for subject indexing of moving-image production materials rests on solid foundations.
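As a rough illustration of what such filtering might look like, the following Python sketch passes the indexing terms assigned to a shot in one language through a small bilingual controlled vocabulary. The vocabulary structure, its entries and the function name are assumptions made for this example only; no such database is described in the study.

```python
# Hypothetical bilingual controlled vocabulary. Each concept lists its
# accepted terms per language; the first term is the preferred one.
CONCEPTS = {
    "ship": {"en": ["ship", "boat"], "fr": ["bateau", "navire"]},
    "train": {"en": ["train"], "fr": ["train"]},
    "lake": {"en": ["lake"], "fr": ["lac"]},
}


def transfer_index(terms, source, target):
    """Map index terms from `source` to `target` language.

    Returns the preferred target-language terms, plus the source terms
    that have no entry in the vocabulary and need human review.
    """
    lookup = {term: concept
              for concept, langs in CONCEPTS.items()
              for term in langs[source]}
    transferred, unmatched = [], []
    for term in terms:
        concept = lookup.get(term)
        if concept is None:
            unmatched.append(term)
            continue
        preferred = CONCEPTS[concept][target][0]
        if preferred not in transferred:  # avoid duplicate index terms
            transferred.append(preferred)
    return transferred, unmatched


fr_terms, leftover = transfer_index(["boat", "lake", "geography"], "en", "fr")
print(fr_terms, leftover)  # ['bateau', 'lac'] ['geography']
```

Terms with no entry in the vocabulary are returned separately rather than dropped, so that an indexer could review them.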

Acknowledgements

This work is being carried out under grant number 96-NC-1546 of the Fonds pour la Formation de chercheurs et l'aide à la recherche (FCAR). Thanks are due to Isabelle Laplante, who prepared the French-language version of the tapes used for data collection, and to Kumiko Vézina and Anne-Marie Labelle, who collected much of the data and prepared the computer files for analysis.

References

Collins-Robert French-English, English-French dictionary (1993). Beryl T. Atkins et al., eds. 3d. ed. Paris: Harper Collins Publishers and Dictionnaires Le Robert.

Furnas, G.W., T.K. Landauer, L.M. Gomez, and S.T. Dumais (1987). The vocabulary problem in human-system communication. Communications of the ACM 30, no. 11 (November): 964-71.

Furnas, G.W., T.K. Landauer, L.M. Gomez, and S.T. Dumais (1983). Statistical semantics: analysis of the potential performance of key-word information systems. The Bell System Technical Journal 62, no. 6 (July-August): 1753-1806.

O'Connor, Brian C. (1996). Pictures, aboutness, and user-generated descriptors. The SIG VIS News 1, no. 2 (spring).
http://www.unt.edu/~aag0001/oconnor.html

Turner, James M. (1996). Storage and retrieval of moving images: a research agenda. Annual conference of the Association for the Study of Canadian Radio and Television, Brock University, Saint Catharines, ON, 1996 05 28.
http://tornade.ere.umontreal.ca/~turner/ASCRT96.html

Turner, James M. (1995). Comparing user-assigned terms with indexer-assigned terms for storage and retrieval of moving images: research results. Proceedings of the 58th ASIS Annual Meeting, Chicago, Illinois, October 9-12, 1995, vol. 32, 9-12.

Turner, James Ian Marc (1994). Determining the subject content of still and moving image documents for storage and retrieval: an experimental investigation. PhD thesis, University of Toronto.