Options for Organizing Electronic Resources: The Coexistence of Metadata

by Sherry L. Vellucci

With the rapid growth in the number of electronic resources available via the Internet, a variety of methods have been developed to organize and access these objects. Librarians, scholars and computing engineers each have applied their own techniques to the process. Catalogers envision a "super-catalog"; scholars spawn encoded texts; computer engineers design robot-generated indexes. Each method has its strengths and its weaknesses, and each group casts a wary eye upon competing systems. In fact, it is unrealistic to believe that the organizational system of any one group will be adopted by all players in the electronic arena.

While general principles of organization should apply to all approaches, cataloging the Internet is inherently different from cataloging a library. Libraries systematically develop collections of primarily fixed objects, usually under the control of one institution or agency. The Internet is more analogous to the ubiquitous ebb and flow of information and services appearing throughout society. Its structure and content are neither systematically developed nor stable; there is no single user group or purpose for the Internet; and there is no single controlling agency. Thus, there is no one community, be it scholars, computing engineers or librarians, that is clearly vested with authority to decide upon the best means to bring control over this electronic chaos.

Considering that the Internet has always been a decentralized initiative, it is not surprising that the efforts to organize it have been similarly independent and uncoordinated. Analogous to a loosely coupled communication system, each community working to organize the Internet is somewhat responsive but essentially autonomous. It is likely that each of these autonomous groups will continue to develop methods for description and access that best meet their own perceived needs. At best, cooperative efforts may result in some common understanding of a core set of descriptive data; but it is unlikely that the application, structure and use of that data will be identical within all communities. It is important, therefore, that each group recognize the contributions of the others and that together they provide bibliographic control methods that can be layered, interchanged and translated within a broad but loosely coupled system of organization. In this way, each community can continue to develop methods compatible with its own users' needs, while taking advantage of data and systems created by other groups.

Remote electronic resources have engendered discussion of the catalog's scope and functions, the concept of a collection and the appropriateness of including bibliographic surrogates for remote resources in a local catalog. Traditionally, library catalogs have provided macro level access to whole items in the collection, relying on other bibliographic tools, such as indexes and bibliographies, to provide micro level access to bodies of literature and parts of items.

One criticism of this distributed bibliographic system is its failure to provide access to everything in the library collection through one totally integrated catalog. More sophisticated computer technology offers new possibilities for correcting this problem. Electronic indexes and databases mounted on the local computer provide, if not a truly integrated catalog, a reasonably integrated interface that allows access to several bibliographic tools from one terminal. Work continues to develop more sophisticated layering systems that offer searchers the convenience of accessing simultaneously the OPAC and several other databases with a single search query. While providing the illusion of searching a single database, these systems could use both front-end interfaces that convert differing search commands into one common command language and multi-language thesauri that translate terms used by one database into those used in another. These techniques of layering, exchanging and translating data allow us to avoid the time, expense and proprietary problems involved with actually creating one large integrated catalog. This is another example of a loosely coupled organizational system and provides a model for the more complex goal of organizing the Internet.
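The front-end translation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the database names, the command syntaxes and the thesaurus entry are all hypothetical and are not taken from any real system.

```python
# Hypothetical command syntaxes for two databases behind one interface.
SYNTAX = {
    "opac": {"author": "a={term}", "title": "t={term}", "subject": "s={term}"},
    "index_db": {"author": "AU {term}", "title": "TI {term}", "subject": "DE {term}"},
}

# Hypothetical thesaurus: common subject term -> the term a database prefers.
THESAURUS = {
    "index_db": {"motion pictures": "films"},
}


def translate_query(field, term, target):
    """Convert one common-language query into the target database's syntax."""
    if field == "subject":
        # Only subject terms are run through the thesaurus.
        term = THESAURUS.get(target, {}).get(term.lower(), term)
    return SYNTAX[target][field].format(term=term)


def fan_out(field, term):
    """Produce one translated query per database from a single user query."""
    return {db: translate_query(field, term, db) for db in SYNTAX}


if __name__ == "__main__":
    for db, query in fan_out("subject", "Motion pictures").items():
        print(db, "->", query)
```

The searcher types one query; each database receives it in its own command language and vocabulary, and no merged catalog is ever built.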

Due to the economic constraints of the past decade, most libraries have shifted their collection development focus from exclusive "ownership" to providing "access," a shift that has not only blurred the concept of "collection," but has caused many librarians and information specialists to lose sight of the beneficial service provided by the process of collection development. It is essential and desirable that the confining parameters that define a collection be expanded to accommodate documents that are not owned and physically housed within the library's walls; however, it is just as essential and desirable to continue the value-added concept of a collection as interrelated parts selected with a purpose. Remotely accessed electronic resources should not be excluded from the collection development scrutiny accorded documents in all other formats. In light of discussions on scholarly communication and the problems of electronic "publishing" without peer review, the inclusion of Internet resources in the library catalog based upon appropriate selection criteria may be even more necessary than in the past to assist users in finding relevant resources.

Organizing the Internet in Local Library Catalogs

The current organization of electronic resources can be described at two levels: the local agency's catalog and catalogs of Internet resources beyond the auspices of any one library. At level one, a description of the resource is contained in the local library catalog, along with bibliographic surrogates for all other materials for which that library provides access, artifactual and electronic. The Anglo-American Cataloging Rules have proved adaptable to new formats and have been modified, interpreted and supplemented to accommodate remote electronic resources. The existing MARC record also has been revised to provide special fields and coded data elements for the description and access of remote electronic resources. Specifically, field 856 has been added to convey information necessary to locate and access the electronic object. Thus, the current methods of cataloging continue to be used to describe resources that are in some ways fundamentally different from those currently owned by libraries. At present, however, only a portion of library catalogs are able to provide active links through the OPAC to access the electronic documents directly; but as libraries migrate to Web-based catalogs, these "hotlinks" will become more common.
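As a rough illustration of how field 856 carries location data, the Python sketch below builds a simplified 856 field. The details are assumptions drawn from the MARC format as currently published rather than from this article: subfield $u holds the address, subfield $z a public note, and the first indicator records the access method (1 for FTP, 4 for HTTP). The class, the function and the example.org address are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MarcField:
    """A simplified MARC variable field: tag, two indicators, subfields."""
    tag: str
    indicators: str = "  "
    subfields: list = field(default_factory=list)  # list of (code, value)

    def display(self):
        # Show blank indicators as '#', a common display convention.
        body = " ".join(f"${code} {value}" for code, value in self.subfields)
        return f"{self.tag} {self.indicators.replace(' ', '#')} {body}"


def electronic_location(url, note=None):
    """Build an 856 field for a remote resource (illustrative only)."""
    if url.startswith(("http://", "https://")):
        first_indicator = "4"   # access method: HTTP
    elif url.startswith("ftp://"):
        first_indicator = "1"   # access method: FTP
    else:
        first_indicator = " "   # no information provided
    subfields = [("u", url)]
    if note:
        subfields.append(("z", note))
    return MarcField("856", first_indicator + " ", subfields)


if __name__ == "__main__":
    print(electronic_location("http://example.org/text.html", "Full text").display())
```

A Web-based catalog can turn the $u value into an active link, which is the "hotlink" behavior described above.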

It is not surprising that the vanguard attempts to catalog Internet resources use current and familiar cataloging methods. But this AACR2/MARC-based cataloging system is not without problems. The MARC record is designed primarily for single object description and linear access. It cannot easily be adapted to describe adequately multi-level hypertext objects. In addition, these bibliographic records are designed to carry a large amount of carefully selected data and to allow access to much of that data. They are, therefore, detailed and complex structures which require much time to create and encode correctly. Recent efforts to save time and money by simplifying the cataloging process argue against traditional cataloging methods that create full MARC records for each electronic resource. As library administrators attempt to save money by purchasing cataloging data from outside sources, new ways of creating bibliographic records are being sought.

While the library cataloging community was attempting to determine the feasibility of using traditional cataloging standards for Internet resources, scholars in the area of humanities computing concentrated on the electronic document itself and on a project to develop a text encoding scheme for complex electronic textual objects. The project became known as the Text Encoding Initiative (TEI). The TEI guidelines give recommendations on what features of the text to encode and how to encode them. They also include a chapter on creating a header that is embedded within the electronic document and precedes the electronic text. This TEI header consists primarily of four parts: file description, encoding description, text profile and text revision history. Designed for multipurpose use, the TEI header provides metadata needed by librarians who will catalog the text, scholars who will use the text, and software programs that will operate on the text. The file description portion of the header is intended to serve as the electronic equivalent of the title page of a printed work and is the only required part of the header. When data elements for all four components of the header are encoded, TEI header structures may be as lengthy and complex as MARC records.
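The four-part structure of the header can be sketched in Python with the standard library's XML tools. The element names (teiHeader, fileDesc, encodingDesc, profileDesc, revisionDesc) follow the TEI guidelines; the function name and the sample values are hypothetical, and the three optional parts are left as empty placeholders rather than fully encoded.

```python
import xml.etree.ElementTree as ET


def build_tei_header(title, author, publication_note, source_note, full=False):
    """Build a skeleton TEI header; only the file description is required."""
    header = ET.Element("teiHeader")

    # File description: the electronic equivalent of a title page.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = publication_note
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_note

    if full:
        # Optional parts, shown here as empty placeholders only.
        ET.SubElement(header, "encodingDesc")   # encoding description
        ET.SubElement(header, "profileDesc")    # text profile
        ET.SubElement(header, "revisionDesc")   # revision history
    return header


if __name__ == "__main__":
    header = build_tei_header(
        "A Sample Text", "Doe, Jane",
        "Distributed by a hypothetical archive.",
        "Transcribed from a printed edition.",
        full=True,
    )
    print(ET.tostring(header, encoding="unicode"))
```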

Several groups within the library community are now examining the possibility of using TEI header information as a source of data in the automated creation of MARC records for electronic documents. Computer programs have been produced that convert TEI-tagged bibliographic headers into MARC formatted records. As research develops in the area of metadata formats, the process of TEI header to MARC data conversion may become the copy cataloging of tomorrow.
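A minimal Python sketch of such a conversion appears below. It is not the method used by any of the programs mentioned; the mapping table, function name and sample header are hypothetical. The MARC tags used (100 for a personal name main entry, 245 for the title, 260 $b for the publisher) are standard, but a production converter would also have to handle indicators, punctuation and many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample header; the values are invented for illustration.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Sample Text</title>
      <author>Doe, Jane</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Hypothetical Text Archive</publisher>
    </publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""

# Illustrative mapping: (path within the header, MARC tag, subfield code).
TEI_TO_MARC = [
    ("fileDesc/titleStmt/author", "100", "a"),
    ("fileDesc/titleStmt/title", "245", "a"),
    ("fileDesc/publicationStmt/publisher", "260", "b"),
]


def tei_header_to_marc(header_xml):
    """Return (tag, subfield code, value) tuples for elements that are present."""
    header = ET.fromstring(header_xml)
    fields = []
    for path, tag, code in TEI_TO_MARC:
        value = header.findtext(path)
        if value and value.strip():
            fields.append((tag, code, value.strip()))
    return fields


if __name__ == "__main__":
    for tag, code, value in tei_header_to_marc(SAMPLE_HEADER):
        print(f"{tag} ${code} {value}")
```

A cataloger would then review and complete the generated record, much as is done with copy cataloging today.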

Library Independent Organization of the Internet

At level two, the goal is to organize Internet resources independently of any library agency. Several means of organization and access currently exist, including separate catalogs of selected Internet resources, subject browsing lists and robot-generated search tools. The common feature of all these "second level" access tools is the exclusive focus on Internet resources. OCLC's InterCat Catalog is one example of a second level bibliographic tool designed through a cooperative library endeavor. Consisting of MARC records created by individual catalogers from many libraries, the database, with its Web interface, represents the beginnings of an Internet Union Catalog (IUC) that is not tied to any one library as information access provider. The database still relies, however, on individual catalogers creating and entering MARC records. The IUC, while one level above the local library OPAC, still contains many of the local OPAC's useful characteristics. The documents included have been selected by librarians for their quality and appropriateness; and bibliographic surrogates in the database contain the kinds of data and authority controlled access points that have proven useful in information retrieval.

When the level two organizational goal extends beyond the library selection process to the entire corpus of information on the Internet, the task of providing description and access for these electronic resources and services becomes monumental. The amount of information requiring organization is beyond the existing methods and systems of individual professional catalogers, indexers and abstractors to manage either efficiently or effectively. Several communities have wrestled with the problem and developed different ways to provide access to these resources.

Over the last few years, a dramatic increase in the amount of information on the Internet heightened the need for improved access methods. Savvy computer engineers realized they had to develop retrieval systems that could be mastered easily by the non-scientific community that now used the Internet so heavily. Their efforts resulted in three primary access methods: direct address, directory browsing lists and robot-generated searchable indexes. Direct address, i.e., entering the specific pathname, directory and filename (e.g., the URL), is a known object search and requires no organizational tools to locate the site. The two remaining methods, directory browsing lists and robot-generated searchable indexes, are organizational tools that are used heavily across all Internet user communities.

Directory browsing lists help people find Internet resources by arranging those resources by subject. The most commonly used lists, such as Yahoo!, use an alphabetico-classed arrangement with verbal subject topics hierarchically ordered. This arrangement is especially useful for browsing, where the user can move from the general subject to the more specific topic and then connect directly to the resources assigned to that topic. Although easy to use, this subject arrangement presents several problems: there are few cross references; many users may find difficulty identifying the hierarchy under which their narrower topic might be found; and these lists soon become lengthy and unwieldy.

Other browsing tools, created outside but influenced by the library community, such as Patrick's Subject Catalog and Mundie's CyberDewey, are arranged by classification number. This affords a more logical hierarchical subject approach that is not dependent on the vagaries of the alphabet for its arrangement. Although classification notation is less tied to semantic content than the verbally based subject systems, some type of verbal clarification is required to interpret the notation correctly. The Dewey Decimal Classification appears to be the system of choice for most of these classification-based browsing tools and has the advantage of being familiar to many users, both within the United States and internationally. As with the other subject lists, however, classification lists also need a system of cross-referencing, and they too can become lengthy and unwieldy.

Directory browsing lists also have other limitations. Most require a great deal of human effort to collect, arrange, encode and annotate the list of resources. When the vast number of Internet resources to be organized is considered, this could be viewed as a serious handicap.

The second type of search tool, the robot-generated searchable index, attempts to overcome many of the limitations of the browsing lists. Robot-generated indexes, such as Lycos and Alta Vista, rely on computer power to collect and index electronic resources and provide an interactive interface that allows the user to enter searches for specific terms or phrases. Many allow the user to perform very sophisticated searches employing Boolean operators, adjacency and proximity operators and embedded truncation. As a result of the automated collecting and indexing process, these search engines provide access to a much larger volume of electronic resources than the browsing lists.
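The core idea behind such an index can be sketched in Python as an inverted index with Boolean AND and OR searches. This is a toy illustration, not a description of how Lycos or Alta Vista work; the page addresses and texts are invented, and the collecting step (the robot itself) is replaced by a small dictionary of already-gathered pages.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page addresses whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search_and(index, *terms):
    """Pages containing every term (Boolean AND)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Pages containing at least one term (Boolean OR)."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


if __name__ == "__main__":
    # Hypothetical pages standing in for documents a robot has collected.
    pages = {
        "http://example.org/a": "Cataloging electronic resources with MARC",
        "http://example.org/b": "Encoding electronic texts with TEI headers",
        "http://example.org/c": "A history of card catalogs",
    }
    index = build_index(pages)
    print(sorted(search_and(index, "electronic", "MARC")))
    print(sorted(search_or(index, "MARC", "TEI")))
```

Because the words are extracted directly from the documents with no human review, nothing in this process standardizes names or subjects.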

These computer generated search tools also have their own weaknesses. One problem is that they frequently produce extensive lists of search results, requiring users to spend time examining a great number of irrelevant and often incomprehensible citations in order to find a few pertinent hits. Searchers familiar with library catalogs and bibliographic databases may well wonder at the bewildering content of many retrieved citations. Many confusing citations are caused by extracting data directly from the document without human review and by the lack of standardized descriptions and authority control for names and subjects.

Thus, while directory browsing lists and robot-generated indexes have the advantage of providing immediate access to a vast array of Internet resources and are often easy to use, they lack many features, such as selectivity, descriptive citation data and authority control, that have made the more traditional bibliographic tools most useful.

If robot-generated indexes provide too little descriptive information for effective and efficient document retrieval and manually created MARC and TEI headers contain too much information for rapid and inexpensive document description, is there some method of describing Internet resources that mediates between these two extremes? The Dublin Core was developed to fill this gap by defining a core data set that could describe a wide range of electronic objects. If such a core data set were adopted as an established standard and incorporated into the electronic object at the time of creation, it would increase the amount and quality of available resource descriptions with a minimal amount of human intervention. These descriptions would improve the quality of index citations created by automated tools, while enabling MARC or TEI header based records to be developed with much less expense.

The Dublin Core development workshop focused on data elements necessary for the discovery of document-like objects, with consideration given to mechanisms for extending the elements to meet specialized needs. The syntax of the Dublin Core was left deliberately unspecified so that the data elements could be mapped into a variety of more complex structures from the MARC format and TEI headers to some as yet undefined structure. It is this aspect of sharing and translating descriptive data that holds the most promise, for by creating a minimum standard for descriptive data that can be used as desired by each community, we have the basis for an umbrella organizational structure based on a loosely coupled system. Then each stakeholder can continue to develop and improve its own system and structures, while overarching structures can be created to exchange, translate and layer common data.
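The mapping idea can be sketched in Python as a small crosswalk table. The table below is purely illustrative and is not an official crosswalk: the element names, the MARC field choices and the TEI header paths are assumptions made for this example, and the sample record is invented.

```python
# Illustrative crosswalk: core element -> location in each richer structure.
CROSSWALK = {
    "title": {"marc": "245$a", "tei": "fileDesc/titleStmt/title"},
    "creator": {"marc": "720$a", "tei": "fileDesc/titleStmt/author"},
    "subject": {"marc": "653$a", "tei": "profileDesc/textClass/keywords/term"},
    "publisher": {"marc": "260$b", "tei": "fileDesc/publicationStmt/publisher"},
}


def translate(record, target):
    """Re-express a core record as (target location, value) pairs.

    Elements with no entry in the crosswalk are skipped, not guessed at.
    """
    pairs = []
    for element, value in record.items():
        mapping = CROSSWALK.get(element)
        if mapping is not None:
            pairs.append((mapping[target], value))
    return pairs


if __name__ == "__main__":
    core_record = {  # hypothetical description supplied by a document's creator
        "title": "A Sample Text",
        "creator": "Doe, Jane",
        "subject": "Metadata",
    }
    print(translate(core_record, "marc"))
    print(translate(core_record, "tei"))
```

The same minimal description feeds both structures, so each community can enrich its own records without re-creating the core data.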

Moving to Metacatalogs

While level two provides access to a much wider range of Internet resources than any level one catalog, most information about nonelectronic resources that is provided at level one is lost with level two access tools. Although it is possible to locate and search library OPACs on the Internet via telnet or a Web interface, OPAC searches must be performed as a separate activity. A single query cannot retrieve both information found at large on the Internet and information held in the library OPAC, because library OPAC databases are not included in searches performed by current Internet search engines.

As we move forward with efforts to create minimum standards for descriptive data, develop ways for creators of electronic resources to provide descriptive information at the time of document creation and refine search engines to enable simultaneous searches of multiple databases, we should have as our goal the creation of a third level access tool -- the metacatalog. We must develop search tools whereby a user can identify specific library catalogs to include in a search query of other Internet databases, much as some existing search engines allow a user to select Web, gopher, ftp, newsgroups or commercial databases for searching.

The metacatalog should be able to translate and interpret MARC records, TEI headers, SGML, HTML and any future coding. There should be language translators built in, so that searchers can use search commands and view data labels in their language of choice and translate terms in subject browsers. Multiple thesauri could be employed to assist with vocabulary control across communities, and a high-level authority control component could transparently identify and link variant forms of names. In other words, the next generation metacatalogs should be able to access all relevant information seamlessly, no matter what format or language. To accomplish this, each stakeholder community must stop trying to recast the tools created by others into its own structure and instead concentrate on developing ways to layer, exchange and translate data within a loosely coupled organizational system.
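A toy Python sketch of this layering follows. It assumes two hypothetical sources, one holding simplified MARC-like fields and one holding core-element descriptions, plus a tiny invented authority file; none of the names or data come from an existing system. Each source keeps its own format, an adapter translates records into a common form at search time, and the authority file links variant forms of a name.

```python
# Hypothetical authority file: variant form of a name -> preferred heading.
AUTHORITY = {
    "twain, mark": "Twain, Mark",
    "clemens, samuel langhorne": "Twain, Mark",
}


def authorized_form(name):
    """Return the preferred heading for a name, or the name itself if unknown."""
    return AUTHORITY.get(name.strip().lower(), name.strip())


def from_marc(fields):
    """Adapter: simplified MARC-like (tag, code, value) tuples -> common record."""
    values = {tag: value for tag, _code, value in fields}
    return {"creator": values.get("100", ""), "title": values.get("245", "")}


def from_core(record):
    """Adapter: core-element dictionary -> common record."""
    return {"creator": record.get("creator", ""), "title": record.get("title", "")}


def metasearch(name, sources):
    """Search every source for a name, matching on the authorized form."""
    wanted = authorized_form(name)
    hits = []
    for label, adapter, records in sources:
        for raw in records:
            common = adapter(raw)
            if authorized_form(common["creator"]) == wanted:
                hits.append((label, common["title"]))
    return hits


# Invented sample data for the two sources.
OPAC = [[("100", "a", "Clemens, Samuel Langhorne"),
         ("245", "a", "Adventures of Huckleberry Finn")]]
WEB = [{"creator": "Twain, Mark", "title": "Life on the Mississippi"},
       {"creator": "Austen, Jane", "title": "Emma"}]
SOURCES = [("opac", from_marc, OPAC), ("web", from_core, WEB)]

if __name__ == "__main__":
    print(metasearch("Twain, Mark", SOURCES))
```

Neither source is recast into the other's structure; only the thin adapters and the shared authority file sit between them, which is the loosely coupled arrangement argued for here.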

Sherry L. Vellucci is assistant professor at St. John's University in Jamaica, New York. She can be reached by e-mail at 4652928@mcimail.com