Cataloging Internet Resources: Survey and Prospectus

by Erik Jul

Think back to a time before the Internet. No one created Internet resources, and no one cataloged them. Now people do both every day. What has happened in the interim, and what is likely to happen in the future? This brief article describes key events in the cataloging of Internet resources, surveys current activities and outstanding issues, and looks ahead to the near and distant future. A survey of this scope must be brief, to the detriment of fuller discussion and at the risk of omitting significant activities. Attribute omissions to the limited scope of this paper and to my bias toward the parts of the subject I know best, a familiarity gained through two OCLC Internet cataloging projects.

Background

In 1991, OCLC initiated an examination of Internet resources to gain greater insight into the type and nature of resources available on the Internet at that time (there were no Web resources) and to test the suitability and applicability of MARC/AACR2 cataloging for Internet resources. Automated sampling and statistical analyses revealed the type, characteristics and distribution of Internet resources (Dillon, et al. 1993). This study revealed a small but growing body of data, information and knowledge resources of the sort typically collected by libraries.

To determine the potential impact of such resources on library cataloging, OCLC initiated a cataloging experiment. Some 30 catalogers from around the world, whose experience cataloging computer files ranged from 2 to 20 years, volunteered to attempt to catalog a sample of Internet resources. Project staff assembled a sample of 300 Internet resources, some selected randomly and some manually. Divided into sets of 30, the sample resources were distributed so that each resource was cataloged independently by three catalogers. Importantly, this experiment did not ask whether any particular Internet resource should be cataloged, only whether it could be cataloged.

Three key findings arose: (1) with some distinct exceptions, MARC/AACR2 cataloging appeared to accommodate the description of Internet resources; (2) a method of linking the bibliographic record to the described resource seemed desirable; and (3) despite the catalogers' familiarity with computer file cataloging, instructional materials on the cataloging of Internet resources seemed warranted.

The first finding was not surprising and provided sufficient confidence to undertake additional cataloging efforts, described below. The publication of Cataloging Internet Resources: A Manual and Practical Guide (Olson 1995) and other reference resources addressed the third finding.

To address the second finding, project staff and advisors, including the Library of Congress, devised and proposed USMARC field 856, "Electronic Location and Access," a field in the USMARC Format for Holdings and Locations. This field, created before the now ubiquitous URL (Uniform Resource Locator), enables catalogers to encode the discrete information necessary to effect a link between a bibliographic record and a remote electronic resource. The subsequent addition of subfield $u, "URL," allows catalogers to encode a fully qualified URL.
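To make the idea concrete, here is a minimal sketch of how a URL might be written out as an 856 field in a human-readable display form. The function name, the "#" notation for a blank indicator and the display layout are assumptions made for illustration. The first-indicator values (0 for e-mail, 1 for FTP, 2 for remote login, 4 for HTTP) reflect my understanding of the field as currently documented and should be verified against the format documentation before use.

```python
from urllib.parse import urlsplit

# Assumed mapping from URL scheme to the 856 first indicator (access method).
# Verify these values against the MARC format documentation.
ACCESS_METHOD = {"mailto": "0", "ftp": "1", "telnet": "2", "http": "4"}


def format_856(url, public_note=None):
    """Return a display-form 856 field carrying the URL in subfield $u."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in ACCESS_METHOD:
        raise ValueError(f"no access-method indicator assumed for {scheme!r}")
    subfields = [("u", url)]
    if public_note:
        # Subfield $z holds a note intended for public display.
        subfields.append(("z", public_note))
    body = " ".join(f"${code} {value}" for code, value in subfields)
    # The second indicator is left blank, shown here as "#".
    return f"856 {ACCESS_METHOD[scheme]}# {body}"


print(format_856("http://purl.org/net/intercat"))
# 856 4# $u http://purl.org/net/intercat
```

A Web OPAC can then turn the contents of subfield $u into a hyperlink when it displays the record.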

Upon the adoption of the 856 field, OCLC developed the InterCat Catalog (http://purl.org/net/intercat) as a proof-of-concept database. Composed of records for Internet resources extracted from WorldCat (the OCLC Online Union Catalog), the InterCat Catalog demonstrated the union of catalog searching and its host of functions -- keyword and phrase searching, selected index searching and Boolean operations -- with direct Web access to the Internet resources. The basic level of search access we have come to expect for a common book had been extended to a collection of Internet resources. Soon, library system vendors began modifying their products to take advantage of the linking capability of the 856 field, and the Web OPAC was born.

Bolstered by the belief that MARC/AACR2 cataloging could and should be extended to selected Internet resources, OCLC initiated a second Internet cataloging project. Under this project, funded in part by the U.S. Department of Education from October 1, 1994 through June 30, 1996, OCLC solicited the voluntary participation of librarians worldwide to identify, select and catalog Internet resources. Resulting bibliographic records were to be in USMARC format, conform to AACR2 rules for descriptive cataloging and contain one or more 856 fields. Participating libraries were to select resources for cataloging as guided by local collection development policies and the expressed or anticipated interests of their respective users. By the end of the project, 231 participants representing nearly all types of libraries had selected and cataloged some 4,707 Internet resources. As of this writing, some 500 different OCLC-member libraries have cataloged more than 16,000 Internet resources representing more than 330,000 individual library holdings.

Three Critical Questions

Cataloging Internet resources raises many critical questions, among which three stand out as fundamental: (1) Are Internet resources worth cataloging? (2) Is traditional MARC/AACR2 cataloging appropriate for Internet resources? (3) What about resources that change location? Addressing these three questions provides an opportunity to discuss many important and related aspects of cataloging Internet resources.

Warrant. Collection development is a significant value-adding function. Before cataloging any Internet resources, libraries should exercise selection processes similar to those extended to material in any other format. That a resource is in electronic format and accessible via the Internet is insufficient for making selection and collection development decisions. Resources that meet a library's selection criteria are candidates for cataloging. A fair rule of thumb might be this: All things being equal, if this resource were in any other medium, say, print, would the library catalog it? This question sets aside the resource's format and manner of access and focuses on its content, merit and warrant.

Some years ago the relative merit of Internet resources, at least in terms of library collection policies, may have been in question. Now, some 16,000 bibliographic records later, there is ample evidence that selected Internet resources are on par with library materials in other media.

Whether Internet resources are worth cataloging should no longer be a question. Whether they should be cataloged, however, is still a fair question and one that prompts a healthy and essential self-examination of library practice. In this regard, it is useful to remind ourselves that cataloging has been and ought to be an evolutionary art and practice. This frees us from placing unhelpful limits on a discussion of cataloging Internet resources or from thinking of cataloging only in familiar terms.

Traditional Cataloging Approaches. The benefits of cataloging selected Internet resources are manifest. Catalogers represent a highly trained workforce, with practitioners distributed across library types, subject areas and geographic locations. Moreover, this distributed workforce is linked through common standards, practices and systems that can be applied immediately to the problems of providing improved description and access for Internet resources. USMARC format bibliographic records can be readily exchanged among library systems, which means that the work of one cataloger can be distributed and subsequently incorporated into any number of OPACs. The hallmark of cooperative cataloging, this exchange of bibliographic records leverages and multiplies the value-adding process of cataloging. By incorporating bibliographic records for Internet resources in local OPACs, libraries can provide their users access to an integrated database of resources, and users can find, through a single system, books and Internet resources alike with equal ease. Finally, libraries that catalog Internet resources gain insight, available only through experience, that will help mold the future evolution of cataloging.

Libraries and library users alike can realize immediate benefits from cataloging Internet resources using existing standards, practices and systems, and cataloging can exploit and extend the useful economic life of these standards, practices and systems, but at a cost. Libraries that catalog Internet resources may need to divert resources from other activities, and this involves an opportunity cost. Other indirect costs may derive from a loss of productivity during a learning curve, although many libraries report that the learning curve is not as steep or as long as had been feared. Direct costs can arise from the need for additional training, new Web OPAC interfaces or the need for additional staff with systems expertise. All of these costs notwithstanding, however, the cost of cataloging Internet resources is attributable chiefly to the fact that it is labor intensive and requires specialized skills.

Whether the costs of cataloging Internet resources are in line with the benefits of doing so is outside the scope of this paper. Nevertheless, we must imagine that applied technologies will continue to provide new opportunities for cost savings. One immediate and significant cost saving could be achieved if catalogers could reduce the amount of transcription required to create a bibliographic record for an Internet resource. This problem may be particularly amenable to the application of a metadata system -- a simplified but robust schema and syntax that would enable resource creators to record certain basic information (e.g., author name, title, publication date and publisher) in a way that could be manipulated automatically by some subsequent process. In this scenario, basic bibliographic information may be transferred to a preliminary bibliographic record automatically rather than through manual transcription. Beyond this rudimentary improvement, we can begin to imagine catalogers invoking increasingly sophisticated systems to assist with subject heading assignment or classification. Technological advances along these lines are not likely to eliminate the need for cataloger intervention anytime soon, but the opportunities for improved cataloger efficiency seem palpable.

Other hindrances to traditional cataloging lie in the philosophical underpinnings of the Anglo-American Cataloging Rules. To address these large issues, the Joint Steering Committee for Revision of Anglo-American Cataloging Rules is convening an International Conference on the Principles and Future Development of AACR this fall in Toronto, Canada. An excellent collection of related papers is available at

http://www.nlc-bnc.ca/jsc/confpap.htm

Cataloging Moving Objects. Libraries add value to Internet resources by selecting, cataloging and integrating the resulting records into local OPACs. But this still leaves open the questions of transience and impermanence. What Internet user has not encountered "Error 404, File not found"? This all-too-frequent error message can signal, among other things, the fact that a resource, once identified by a URL, has moved or ceased to exist.

URLs represent address-specific locations, and encoding them in bibliographic records that are meant to be distributed across systems is problematic. Some would call it a catalog maintenance nightmare, and such a characterization is not far off. Data collected during the OCLC InterCat project revealed that, on average, 3% of URLs in the InterCat Catalog could not be accessed during any given test. An unknown subset of these links failed because the URL had changed (the resource had moved). (Other sources of failure include the remote system or the network, and determining the exact cause of a URL's failure can be difficult.) It soon became apparent that encoding URLs in bibliographic records brought both benefit and liability.
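A periodic test of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the InterCat project: the function names are hypothetical, the check is a simple HTTP HEAD request made with Python's standard library, and, as noted above, a failed check does not reveal whether the resource moved, the remote system was down or the network was at fault.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10):
    """Return True if an HTTP HEAD request for the URL succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors such as 404, unreachable hosts,
        # timeouts and malformed URLs.
        return False


def failed_links(urls, check=is_reachable):
    """Return the URLs for which the check reports a failure."""
    return [url for url in urls if not check(url)]


def failure_rate(urls, check=is_reachable):
    """Return the fraction of URLs that failed during this test run."""
    urls = list(urls)
    if not urls:
        return 0.0
    return len(failed_links(urls, check)) / len(urls)
```

The check is passed in as a parameter so that the same counting logic can be run against a live network or against a stub. Running such a test repeatedly and averaging the results would yield a figure comparable to the one reported above.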

To address this problem, staff from the InterCat project and the OCLC Office of Research developed a working solution: Persistent URLs, or PURLs (http://purl.org). The concept is a simple aliasing system: create a PURL that does not change and associate it with a URL that can be changed as needed. This process is managed by the PURL server. Registered users can create a PURL and establish and maintain the PURL/URL relationship. Maintenance authorization can be carefully controlled and shared with other registered users or groups of users.

The PURL, which itself is a URL in form and function, contains within it all the information necessary to be "resolved" by the PURL server (an expressed protocol, a host name and a user-assigned character string). So the PURL, http://purl.org/net/intercat, uses the http protocol to connect to the PURL server (purl.org), which looks up the URL associated with the character string net/intercat. The PURL server returns to the client (typically, a Web browser) the current URL, which the client then follows to complete the connection.

The advantage of this system is obvious: a PURL can occur any number of times (e.g., in multiple copies of a bibliographic record, on Web pages, in an electronic paper, in a bookmark list), yet if the URL for the referenced resource were to change, that change would only have to be made one time using the PURL server. Suddenly, catalog maintenance (and link maintenance generally) is reduced to a single update. Not a perfect solution, but it works.
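The aliasing idea can be sketched as a small lookup table. This is a toy model, not the PURL server's implementation: the class and method names are hypothetical, the target URLs on example.org are invented, and where a real server would answer with an HTTP redirect, this sketch simply returns the current URL.

```python
from urllib.parse import urlsplit


class PurlServer:
    """Toy resolver that maps stable names to changeable target URLs."""

    def __init__(self, host):
        self.host = host
        self._targets = {}

    def set_target(self, name, url):
        """Create a PURL, or repoint an existing one at a new URL."""
        self._targets[name] = url

    def resolve(self, purl):
        """Return the URL currently associated with the PURL."""
        parts = urlsplit(purl)
        if parts.netloc != self.host:
            raise LookupError(f"{purl} is not served by {self.host}")
        # The path, minus its leading slash, is the user-assigned string.
        name = parts.path.lstrip("/")
        if name not in self._targets:
            raise LookupError(f"no PURL registered for {name!r}")
        return self._targets[name]


server = PurlServer("purl.org")
purl = "http://purl.org/net/intercat"

server.set_target("net/intercat", "http://old.example.org/intercat")
print(server.resolve(purl))  # http://old.example.org/intercat

# The resource moves: one update on the server, no records edited.
server.set_target("net/intercat", "http://new.example.org/intercat")
print(server.resolve(purl))  # http://new.example.org/intercat
```

Every copy of the PURL, wherever it has been recorded, picks up the new location the next time it is resolved.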

Looking Ahead

I have alluded to future scenarios already in this paper, but now I would like to focus on some near- and long-term prospects. Initially, any library that has not yet begun to grapple with the host of issues surrounding improved description and access for Internet resources would do well to identify, select and catalog at least one Internet resource. More is to be gained from this relatively low-cost, low-risk enterprise than from reading a thousand papers on the subject (including this paper unless I have successfully motivated you to take some action). Libraries that have started cataloging Internet resources are poised to improve internal processes, such as selection, cataloging workflow or database maintenance, as necessary and to adjust the rate at which they catalog to meet user demand. Again, these actions can bring immediate and widespread benefits both within and outside the cataloging library.

Bibliographic records need not reside only in OPACs. Some libraries are using databases of bibliographic records to drive other applications, such as automatically building Web pages based on database content. Web page links, functioning as database queries, present information to users based on the current contents of the database and order the presentation of that information based on subject or classification information present in individual bibliographic records. The advantages to this approach are manifold, including (1) currency (Web pages built on the fly are as current as the underlying database), (2) the elimination of manually maintained library Web pages and (3) leveraging the intellectual content of catalog records.
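As a rough illustration of this approach, the sketch below builds an HTML listing from a handful of records, grouping them by subject. The record layout (a dictionary with title, url and subjects keys) and the function name are assumptions made for the example; a production system would query the catalog database rather than an in-memory list.

```python
from collections import defaultdict
from html import escape


def build_subject_page(records):
    """Return an HTML list of records grouped under sorted subject headings."""
    by_subject = defaultdict(list)
    for record in records:
        # A record with several subjects is listed under each of them.
        for subject in record["subjects"]:
            by_subject[subject].append(record)

    lines = ["<ul>"]
    for subject in sorted(by_subject):
        lines.append(f"<li>{escape(subject)}<ul>")
        for record in sorted(by_subject[subject], key=lambda r: r["title"]):
            href = escape(record["url"], quote=True)
            title = escape(record["title"])
            lines.append(f'<li><a href="{href}">{title}</a></li>')
        lines.append("</ul></li>")
    lines.append("</ul>")
    return "\n".join(lines)
```

Because the page is generated from the records each time it is requested, adding, correcting or removing a record changes the page without anyone editing HTML by hand.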

Metadata, often cast about as some sort of magical cure, actually holds considerable promise. The Dublin Core (http://purl.org/metadata/dublin_core) represents significant effort building consensus among disparate user groups for a basic set of data elements that facilitate network resource discovery and retrieval. Apart from the element set, which is extensible by definition, work continues apace on a metadata transfer syntax. Once in place, producers of Internet resources will be able to express and associate metadata with the resource. Used by some, this may equate to Cataloging in Publication (CIP) data that book publishers often provide. Used by others, metadata may be little more than a mishmash of uncontrolled fields. No doubt, among those more concerned about the accessibility of their resources, certain "best practices" will prevail. Indeed, the work of the Text Encoding Initiative, under the auspices of the Center for Electronic Text in the Humanities (http://www.ceth.rutgers.edu), exemplifies meta-tagging with scholarly seriousness. Nevertheless, rich content or poor, with the deployment of metadata systems we will have taken another important step toward capturing bibliographic and other information at the time of its creation.

Given metadata as input, we ought to expect a system to be able to give us a preliminary bibliographic record as output. For selected resources, a fuller cataloging treatment may be warranted. Even here, we can see systems on the horizon that are moving out of experimentation and into implementation, systems that perform automated indexing, subject analysis, classification and even abstracting. The cataloger's toolkit will certainly be rich, eliminating or reducing certain tasks and leaving the cataloger free for whatever aspects of cataloging (and there will be many for some time) remain the domain of the human intellect.
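The first step, metadata in and a preliminary record out, might look something like the following sketch. The metadata keys (author, title, publisher, date, url) are hypothetical names chosen for illustration, not the element names or syntax of any particular metadata scheme. The mapping to tags 100, 245, 260 and 856 is deliberately simplified: indicators are omitted, and punctuation and choice of entry are left to the cataloger.

```python
# Assumed mapping from hypothetical metadata keys to MARC tags and subfields.
FIELD_MAP = [
    ("100", [("a", "author")]),
    ("245", [("a", "title")]),
    ("260", [("b", "publisher"), ("c", "date")]),
    ("856", [("u", "url")]),
]


def preliminary_record(metadata):
    """Return display-form lines for whatever metadata was supplied."""
    lines = []
    for tag, subfield_map in FIELD_MAP:
        parts = [
            f"${code} {metadata[key]}"
            for code, key in subfield_map
            if metadata.get(key)  # skip elements the creator left out
        ]
        if parts:
            lines.append(f"{tag} " + " ".join(parts))
    return lines


for line in preliminary_record({
    "author": "Doe, Jane",
    "title": "An Example Resource",
    "publisher": "Example Press",
    "date": "1997",
    "url": "http://www.example.org/resource",
}):
    print(line)
```

The output is a starting point only. A cataloger would still review the transcription, supply access points and add subject analysis.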

Conclusion

Libraries continue to make significant contributions to the knowledge revolution, and knowledge access management is likely the future domain of libraries and librarians. Such heady times will not arrive in a moment, however, but are more likely to be achieved through continuous evolution. Given an array of options and constrained by limited resources, libraries face the challenge of doing the "next right thing." Each library will define that next step differently, but usually, the next right thing is not a discontinuous, high-risk, high-cost leap. More often, the next change is marginal, and so is the one after that. The risks are lower and the costs more affordable. Cataloging Internet resources, metadata, PURLs -- these represent some changes at the margin that offer significant immediate and long-term benefits.

Lest we watch the waves and ignore the tide, however, we should be open and alert to the continued application of computing technology to information management. Old ways of working are good ways of working only if they continue to make sense. So, too, with the precepts of cataloging.

Think back to a time before the Internet. For many of us, it wasn't that long ago. Consider how far we have come in such a short time. Yet the Internet is still nascent. We are in a time of unprecedented changes, many of which are directly related to recent advances in computing and telecommunications technologies and their rapid and widespread adoption. Libraries will not stay this flood of technology, but, with reasoned action, should expect to do more than just survive.

References

Dillon, et al. 1993. Assessing Information on the Internet: Toward Providing Library Services for Computer-Mediated Communication. Dublin, OH: OCLC.

Olson, Nancy B. 1995. Cataloging Internet Resources: A Manual and Practical Guide. Dublin, OH: OCLC. http://www.oclc.org/oclc/man/9256cat/toc.htm

Erik Jul is associate director of the OCLC Institute, 6565 Frantz Road, Dublin, OH 43017-3395. He can be reached by phone at 614/764-4364 and e-mail at jul@oclc.org