B U L L E T I N
Metadata Generation: Processes, People and Tools
Jane Greenberg is assistant professor, School of Information and Library Science, University of North Carolina, Chapel Hill. She can be reached by e-mail at firstname.lastname@example.org
Metadata generation is the act of creating or producing metadata. Generating good quality metadata in an efficient manner is essential for organizing and making accessible the growing number of rich resources available on the Web. The success of digital libraries, the sustenance of interoperability – as promoted by the Open Archives Initiative – and the evolution of the Semantic Web all rely on efficient metadata generation. This article sketches a metadata generation framework that involves processes, people and tools. It also presents selected research initiatives and highlights the goals of the Metadata Generation Research Project.
Metadata Generation Processes
In today's networked environment metadata is produced by both human and automatic processes. Human metadata generation takes place when a person, such as a professional metadata creator or content provider, produces metadata. The quality of human generated metadata is often determined by semantic and syntactic adherence to a metadata schema specification. Historically the only form of metadata creation, this process still dominates libraries, museums, archives and other information resource centers. Human metadata generation is popular on the Web, as demonstrated by a notable increase in the supply of "keyword" and "description" XHTML META tags. Further evidence is Adobe's recent enhancement of XMP (eXtensible Metadata Platform) to support the creation and storage of Dublin Core metadata within the Resource Description Framework (RDF) (www.w3.org/TR/rdf-schema/). The prevalence of manually generated metadata can, in part, be attributed to the fact that human metadata production is often superior to automatically produced metadata.
Automatic metadata generation depends on machine processing. Most familiar to the library and information science community is automatic indexing, which primarily focuses on a resource's subject content. Commercial search engines practice automatic metadata generation in two situations. First, metadata is produced automatically prior to a user's search by spiders that traverse the Web daily and extract and store resource metadata in the search engine's host database. A user's query is run first against this metadata store. Second, metadata is produced automatically and dynamically at the time of the user's search by executing a search and retrieval algorithm against the Web's global store of resources (beyond the search engine's host database). The second situation occurs when a user's query fails to match the metadata stored in the search engine's host database. Document representations in both situations are generally composed of the first few lines of a document (Web resource), locator information such as a uniform resource locator (URL) and title tag metadata.
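The spider-built document representation described above (title tag metadata, a locator and the first few lines of content) can be sketched in a few lines of Python. This is an illustrative simplification, not any search engine's actual extraction pipeline; the class and function names are invented for the example.

```python
from html.parser import HTMLParser

class RepresentationBuilder(HTMLParser):
    """Collects the <title> tag and the leading body text of an HTML page,
    roughly mimicking the document representation a search-engine spider
    might store: title metadata plus the first few lines of content."""

    def __init__(self, snippet_length=120):
        super().__init__()
        self.snippet_length = snippet_length
        self.title = ""
        self.snippet = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif len(self.snippet) < self.snippet_length:
            self.snippet += data.strip() + " "

def represent(url, html):
    """Build a minimal record: URL locator, title-tag metadata, opening text."""
    parser = RepresentationBuilder()
    parser.feed(html)
    return {"url": url, "title": parser.title,
            "snippet": parser.snippet.strip()[:parser.snippet_length]}
```

A record such as `represent("http://example.org/", page_html)` would then be stored in the search engine's host database and matched against user queries.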
If we liberally define objects for which metadata can be generated as "any entity, form or mode for which contextual data can be recorded" (see Greenberg, 2002, in For Further Reading), we find automatic generation operations taking place daily. Examples include the automatic generation of metadata on statements documenting an online purchase of an airline ticket, an automatic bank teller machine (ATM) transaction or telephone calls made during a previous month. Automatic processes permit human resources to be directed to more intellectually challenging metadata generation activities.
Classes of Persons
Among the classes of persons involved in metadata generation are professional metadata creators, technical metadata creators, content creators and community or subject enthusiasts. Although these classes of persons are defined separately here for clarity, the distinctions are not always absolute.
Professional metadata creators include catalogers, indexers, Web masters and other persons who have had high-level training through a formal educational curriculum or an official on-the-job training program. Professional metadata creators have the intellectual capacity to make sophisticated interpretative metadata-related decisions and work with classificatory systems, complex schemas and other complex information standards. They are third-party metadata creators because they produce metadata for content created by other individuals. Given expert knowledge and skill, metadata professionals' greatest contributions may be in working with the more complex metadata schemas, instituting and overseeing a metadata production operation, instructing less skilled persons in metadata creation or helping to develop tools that facilitate metadata production.
Technical metadata creators include paraprofessionals, data inputters and other persons who have had metadata training – but much less extensive training than the metadata professional. Technical creators are not expected to exercise discretion to the same degree as the metadata professionals, although they may take on more sophisticated tasks over time. They generally work with simpler metadata schemas and perform routine processes that are part of more complex metadata generation activities. For example, when a library orders a book, a library technical assistant, generally in the acquisition department, creates an "acquisition level" (acq level) MARC bibliographic record, which is a basic bibliographic description without authorized subject or name headings. When the book arrives in the library, a metadata professional uses the "acq level" metadata to create a full-level AACR2 MARC record – resulting in a richer and more standardized bibliographic description. It should be pointed out that the title of technical or bibliographic assistant can be misleading because frequently persons who are identified and paid as though they were technical metadata creators perform professional-like activities.
Content creators are individuals responsible for the creation of the intellectual content of a work. Researchers regularly produce abstracts, keywords and other types of metadata for their scientific and scholarly publications. Visual artists and crafts-persons sign and date their works. In the Web environment, content creators can provide metadata via a template or editor (see the Tools discussion below), while a webmaster "webifies" the work (makes it Web-accessible). The Networked Digital Library of Theses and Dissertations (NDLTD) (www.ndltd.org) and the Synthesis Coalition's National Engineering Education Delivery System (NEEDS) digital library for engineering education (www.needs.org/needs/index.jhtml) are digital library projects supporting author-generated metadata. Exploratory research has in fact shown that authors can produce fairly good metadata [Cruz & Krichel (2000); Greenberg et al. (2001)]. Facilitating content creator generated metadata makes sense when weighing the rapid growth of the Internet against the economics of hiring metadata professionals.
Community or subject enthusiasts have not had any formal metadata creation training, but have special subject knowledge and want to assist with documentation. A rudimentary view of this activity is found on personal Web pages that roughly classify, hyperlink to and/or provide other limited metadata for Web pages that document a topic of passion or interest. See topics such as "golden retrievers" (http://home.att.net/~johnwaf/) or "Italy and things Italian" (www.geocities.com/alejna2004/Alejnas_italy.html).
The Fine Arts Museums of San Francisco's Thinker ImageBase (www.thinker.org/fam) provides an interesting and more formal example involving enthusiasts. Initiated during the Legion of Honor renovation following the Loma Prieta earthquake, ImageBase contains images and corresponding metadata for objects from the collections of the Fine Arts Museums of San Francisco (the de Young Museum and the Legion of Honor). Through a collaborative arrangement, museum staff provided artist's name, date of creation, technique and other types of official museum registration metadata, and community enthusiasts (volunteers) assigned keywords to approximately 20,000 images. Community enthusiasts were used for two key reasons – to assist museum staff with object documentation and to enhance access through the provision of additional subject terms. Collaborative efforts between metadata professionals and community enthusiasts, as demonstrated by the ImageBase project, while not the norm, may become more common over time. The Web's connectivity may increase collaborative metadata generation among other classes of persons as well.
Metadata Generation Tools
Metadata generation is supported by the following types of tools:
· Human beings: intellectual tools with the capacity to exercise discretion and perform data input
· Standards & documentation: metadata specifications, content guidelines, thesauri, classification lists and other types of standards and documentation that guide metadata creation
· Devices: technical applications that "capture" and "store" metadata for either a database or resource header (e.g., the header of an HTML or XML document). Devices, as defined below, include templates, editors and generators.
Templates are basic cribsheets that sketch a framework or provide an outline of schema elements without linking to supporting documentation. Templates, in both print and electronic format, have been predominant in metadata generation, probably because they are simple to produce and maintain. These tools guide metadata creation through the provision of a form without the bells and whistles. An example is the Linux Software Map (LSM) Entry Template (ftp://ftp.execpc.com/pub/lsm/LSM.README) for metadata about Linux software packages. Guidelines associated with the LSM schema refer to the RFC822 standard for author name content syntax, among other standards, but the official template provides no linking mechanisms. Persons using this template generally work in a text editor, seek standards documentation on their own and submit their LSM records to a Linux repository via the File Transfer Protocol (FTP). The MARC bibliographic form supporting cataloging in many second-generation online catalogs has functioned in much the same way, without any sort of automatic linking to authority files and content guidelines. This situation is fortunately changing as many catalogs become Web-based and hyperlink to cataloging documentation, thus functioning more like an editor.
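A template in this sense is nothing more than a bare outline of schema elements for a person to fill in by hand. The sketch below illustrates the idea in Python; the element names loosely follow the LSM entry style but are illustrative, not the official LSM specification.

```python
# A minimal plain-text template: an outline of schema elements with no
# links to supporting documentation -- the creator fills in values by hand.
# Element names are illustrative, loosely following the LSM entry style.
LSM_STYLE_ELEMENTS = ["Title", "Version", "Entered-date", "Description",
                      "Author", "Primary-site"]

def blank_template(elements):
    """Render a fill-in-the-blank form, one 'Element:' line per element."""
    return "\n".join(element + ":" for element in elements)

def parse_record(text):
    """Read a completed template back into a dict of element -> value."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            record[name.strip()] = value.strip()
    return record
```

The template itself enforces no content rules; as the article notes, the person filling it in must seek out the relevant standards documentation on their own.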
Editors are similar to templates in that they require human input. They are more sophisticated in that they provide direct access to standards and documentation underlying metadata creation. These tools often assist with syntactical aspects of metadata creation via automatic means. One of the most popular Dublin Core editors is the Nordic Dublin Core Metadata Template (www.lub.lu.se/cgi-bin/nmdc.pl). Although the term template appears in the official name of the Nordic Dublin Core Metadata Template, it is an editor according to the definitions offered in this article. This editor supports the generation of metadata records with HTML META tags for embedding in the header of a Web resource. The Nordic Template has been adapted to many different Dublin Core projects. A partial list of them is at: http://dublincore.org/tools/. Another example is the Reggie Metadata Editor (http://metadata.net/dstc/), which allows for metadata to be generated within RDF. Editors can also include off-the-shelf software like Metabrowser (http://metabrowser.spirit.net.au/), which hyperlinks to documentation supporting metadata generation and automatically provides the correct syntactical encoding. People work with a wide variety of Web forms when joining an organization, posting information on an online community bulletin board or purchasing a product over the Internet. All of these forms function as metadata editors documenting transactions, activities, events and other types of objects beyond the traditional information resource.
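The syntactic assistance an editor provides can be sketched as follows: the metadata creator supplies only element values, and the tool handles the encoding details. This Python sketch emits Dublin Core HTML META tags of the kind such editors embed in a resource header; the Dublin Core element names are real, but the function and its exact output format are illustrative assumptions, not the Nordic Template's actual behavior.

```python
from html import escape

def dc_meta_tags(record):
    """Encode a dict of Dublin Core element values as HTML META tags,
    handling the syntactic details (DC. prefixing, attribute escaping)
    automatically, so the person supplies only the values."""
    tags = []
    for element, value in record.items():
        tags.append('<meta name="DC.%s" content="%s">'
                    % (escape(element, quote=True), escape(value, quote=True)))
    return "\n".join(tags)
```

For example, `dc_meta_tags({"title": "Metadata Generation", "creator": "Jane Greenberg"})` yields two well-formed META tags ready for a Web page header, with any quote characters in the values safely escaped.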
Generators support automatic metadata production. Note that the distinction given in this article between editors and generators is based loosely on those found under the "tools" link on the Meta Matters website produced by the National Library of Australia (www.nla.gov.au/meta/). In the context of the Web, generators first require the submission of a uniform resource locator (URL), a persistent uniform resource identifier (PURL) or another Web address in order to locate the object. An algorithm is then used to comb an object's content, including its source code, and automatically assign metadata. An example is found with the DC.dot generator (www.ukoln.ac.uk/metadata/dcdot/), which requires the submission of a URL to locate and scan a resource's content. This generator then automatically produces a Dublin Core record within RDF. The metadata can be embedded in the header of an XHTML or XML document or stored in a database. DC.dot supports metadata generation according to a number of different metadata schemas. Full "schema-specific" generators remain fairly experimental: they can produce moderately accurate metadata for some elements, such as the date a resource was last updated or its MIME type, but results vary greatly for more intellectually demanding metadata, such as subject descriptors.
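The uneven reliability of generators follows from how they work: some elements can be read mechanically off the resource or its transfer headers, while others require interpretation. The Python sketch below illustrates this split; it is an invented, simplified example (a real generator such as DC.dot would fetch the URL itself and apply far more analysis).

```python
import mimetypes
import re

def generate_metadata(url, html, last_modified=None):
    """Comb an already-fetched resource and auto-assign metadata.
    Format and date elements are easy to derive mechanically; the
    subject element is left empty to mark where automatic assignment
    becomes unreliable. A real generator would fetch the URL itself."""
    mime_type, _ = mimetypes.guess_type(url)
    title_match = re.search(r"<title>(.*?)</title>", html,
                            re.IGNORECASE | re.DOTALL)
    return {
        "identifier": url,
        "format": mime_type or "text/html",       # derivable from the URL/headers
        "date": last_modified,                    # e.g., HTTP Last-Modified header
        "title": title_match.group(1).strip() if title_match else "",
        "subject": "",  # intellectually demanding; automatic accuracy varies widely
    }
```

Note that the easy elements (identifier, format, date) fall out of the transfer itself, while a meaningful subject descriptor would require content analysis, the step where, as noted above, results vary greatly.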
One approach to dealing with the experimental and unpredictable nature of generators has been the creation of hybrid metadata tools that combine aspects of both editors and generators. DC.dot functions this way in that an editor permits a person to edit the metadata that is automatically generated. Document editors like Microsoft's FrontPage and Macromedia's Dreamweaver exhibit the features of a hybrid tool by supporting human metadata generation of certain elements such as "keywords" and "description" and automatically producing other metadata as part of the document-creation process such as "date document was produced" and "document format."
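The hybrid workflow, generate automatically, then let a person correct the result, reduces to a simple merge in which human-supplied values override machine-assigned ones. This Python sketch is an invented illustration of that pattern (not DC.dot's actual mechanism); it also records each element's provenance, an assumption added so a reviewer can see which values were machine-assigned.

```python
def hybrid_record(auto_record, human_edits):
    """Merge automatically generated metadata with human corrections.
    Human-supplied values win; each element's provenance is tracked so
    it is clear which values came from the automatic process."""
    merged, provenance = {}, {}
    for element, value in auto_record.items():
        merged[element], provenance[element] = value, "automatic"
    for element, value in human_edits.items():
        merged[element], provenance[element] = value, "human"
    return merged, provenance
```

In a DC.dot-style session, `auto_record` would come from the generator pass and `human_edits` from the editing form presented to the person afterward.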
A Research Context
This overview has identified the processes, persons and tools comprising a metadata generation framework. Although we have commented on the strengths and weaknesses of these means and mechanisms, the discussion has served more to sketch the framework than to present empirical results. Decisions about the processes, persons and tools to employ for metadata generation depend on a project's architecture, the complexity of the desired metadata schema, project deliverables and the availability of human, financial and time resources. Clearly, different combinations of these metadata generation components will be more effective in different environments. Research efforts testing various combinations of processes, people and tools will help establish useful models to guide metadata generation activities. A list of selected research projects exploring aspects of metadata generation is found in the sidebar.
This article concludes by highlighting the Metadata Generation Research Project (http://ils.unc.edu/~janeg/mgr) being conducted at the School of Information and Library Science, University of North Carolina (SILS/UNC-CH), in collaboration with the National Institute of Environmental Health Sciences (NIEHS), an Institute of the National Institutes of Health (NIH). The Metadata Generation Research Project was launched with funding from Microsoft Research and is continuing with support from OCLC, Online Computer Library Center. Research goals underlying this project include studying human and automatic metadata generation processes, developing protocols for collaboration between resource authors (content creators) and metadata professionals during metadata generation, evaluating the integration of collaborative human metadata generation processes with automatic generation processes and considering implications for the development of the Semantic Web. A model is being developed to facilitate the most efficient and effective means of metadata production in scientific research centers. The model and the underlying methods of inquiry will, in turn, aid future metadata generation investigations. In closing, this work, along with the other initiatives listed, will ultimately aid in the efficient generation of good quality metadata and in making the Web's vast collection of rich resources more accessible.
For Further Reading
Acknowledgement: Portions of this article originally appeared in the Encyclopedia of Library and Information Science (Greenberg, J. (2002)).
Selected Metadata Generation Research Projects
· Breaking the Metadata Generation Bottleneck, School of Information Studies, National Science Foundation (www.cnlp.org/research/project.asp?recid=6), is utilizing natural language processing and machine learning technologies to automate the assignment of metatags to educational resources in math and science.
· Computational Linguistics for Metadata Building (CLiMB) Project, Center for Research on Information Access, Columbia University Libraries (www.columbia.edu/cu/cria/climb/), is using the latest developments in natural language processing to study problems of automatically extracting metadata from text.
· GEM (Gateway to Educational Materials) Research (www.geminfo.org/Research/) is exploring automatic indexing of major search elements as a mechanism for machine derivation of GEM metadata, providing a fairly precise "rough cut" for resource discovery and retrieval.
· Metadata Generation Research Project, School of Information and Library Science, University of North Carolina (SILS/UNC-CH), in collaboration with the National Institute of Environmental Health Sciences (NIEHS), an Institute of the National Institutes of Health (NIH) (http://ils.unc.edu/~janeg/mgr), is developing a model to facilitate the most efficient and effective means of metadata production by integrating human and automatic processes in scientific research centers.
· Tools for Information Resource Discovery on the World Wide Web, Charlotte Jenkins's research associated with further development of the Wolverhampton Web Library (WWLib) (www.scit.wlv.ac.uk/~ex1253/research.html), is developing the "Classifier" – a tool that automatically classifies documents using the Dewey Decimal Classification within RDF.
Copyright © 2003, American Society for Information Science and Technology