by Bella Hass Weinberg
Heting Chu of Long Island University organized an Annual Meeting session entitled Improved Internet Access: Guidance from Research on Indexing and Classification. Wallace Koehler of the University of Oklahoma served as moderator for the session, sponsored by ASIS SIGs/CR, CRS, ALP and IAE; I was the reactor.
Though unable to attend the meeting, Marisa Urgo of KRA Corporation submitted a paper entitled A Shape for Internet Information: An Alternative Metaphor for Web Site Information. Professor Koehler read an excerpt and directed the audience to the full paper at
Urgo believes that Web sites are unique and hence require alternative methods of indexing (see section II).
Shaoyi He of Long Island University presented Hyperlinks as Index Terms: Exhaustivity, Specificity and Beyond, of which the lead author is Heting Chu. They compared the links on the home pages of Web sites of five organizations, including ASIS. They examined the number of links as a percentage of the number of words on a page, as well as the consistency and specificity of the terminology.
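Chu & He's link-density measure can be approximated in a few lines. The sketch below, using only Python's standard library, is my paraphrase of the metric, not their actual instrument; the counting rules (what counts as a word, whether every anchor with an href is a link) are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Counts hyperlinks and visible words in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        # Treat any anchor with an href as a link.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

    def handle_data(self, data):
        # Whitespace-delimited tokens serve as a rough word count.
        self.words += len(data.split())

def link_density(html: str) -> float:
    """Links as a percentage of the word count on a page."""
    parser = LinkDensityParser()
    parser.feed(html)
    return 100.0 * parser.links / parser.words if parser.words else 0.0

page = ('<html><body><p>Welcome to our <a href="/about">About</a> page '
        'and our <a href="/news">News</a> section.</p></body></html>')
print(round(link_density(page), 1))  # 2 links / 9 words -> 22.2
```

Consistency and specificity of link vocabulary, of course, require human judgment and cannot be computed this mechanically.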
Philip Smith of Ohio State University discussed The Use of Classificatory Knowledge Structures to Guide the Design and Evaluation of Search Interfaces. He pointed out that most search systems engineers do not have an information science background and proposed that experts from our field develop a list of generic tasks for information retrieval to assist such designers.
Wallace Koehler then presented Classifying Web Sites: Site Metrics and URL Characteristics as Markers. He demonstrated how elements of URLs, such as country codes, can be used to derive information about the content of Web sites. The full paper will appear in Journal of Library and Information Science.
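Koehler's full method appears in his paper; as a minimal sketch of the general idea — inferring something about a site from URL markers such as the top-level domain — consider the following Python fragment. The lookup tables are a tiny invented subset for the example, not Koehler's data.

```python
from urllib.parse import urlparse

# Illustrative subsets only, not Koehler's actual code tables.
COUNTRY_CODES = {"uk": "United Kingdom", "de": "Germany",
                 "jp": "Japan", "fr": "France"}
GENERIC_TLDS = {"com": "commercial", "edu": "US higher education",
                "gov": "US government", "org": "organization"}

def classify_by_tld(url: str) -> str:
    """Infer a coarse class for a site from the last label of its hostname."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in COUNTRY_CODES:
        return f"country: {COUNTRY_CODES[tld]}"
    if tld in GENERIC_TLDS:
        return f"type: {GENERIC_TLDS[tld]}"
    return "unknown"

print(classify_by_tld("http://www.ox.ac.uk/"))  # country: United Kingdom
print(classify_by_tld("http://www.asis.org/"))  # type: organization
```

As the discussion below notes, such external markers can only go so far as evidence of a site's subject content.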
Two of the four papers were not based on traditional indexing methods; one applied traditional index evaluation techniques to the Web's linking structure, and the last focused on search interfaces. In reacting to these papers on improved Internet access, the primary author I cited was King Solomon (849-797 B.C.E.), who in his collection of observations on life, entitled Ecclesiastes, made three points that are germane to a discussion of organizing the Web. These provide the outline for my paper.
In this limited space I cannot develop any theme in depth; therefore I indicate through brief bibliographic references where you can "read more about it."
I. The Making of Books Is Without Limit (Ecclesiastes XII:12)
Koehler suggested that the Web is unindexable by traditional methods because of the huge number of sites on it. In 800 B.C.E., however, King Solomon considered the number of books excessive. The phrase "information explosion" was frequently used in the decades preceding the Internet. Today it is commonly perceived that print is declining and that most new documents are issued in electronic form, but Charles Meadow's Ink into Bits (1998) shows that more books are being published now than ever before, and the number of journals is increasing slightly.
Many people predict that everything ever published will soon be converted to machine-readable form and mounted on the Web. Rebuttals to such predictions of the paperless society, notably Walt Crawford & Michael Gorman's Future Libraries (1995), contain estimates of the number of pages of books that would have to be digitized, which still exceeds the current number of Web pages. The authors show that full conversion is not economically feasible.
Koehler estimates that there are two million Web sites. That sounds like a tremendous number of files to catalog, and some have given up on the task. Let us compare this statistic with the number of catalog records in the major bibliographic utilities. The Research Libraries Information Network (RLIN) and the Online Computer Library Center (OCLC) have 30 million and 39 million unique bibliographic records, respectively. Both networks contain records for a variety of media, and each of these documents may be compared to a Web site.
Information science evolved from the documentation movement, an offshoot of library science that focused on providing access to the information in parts of documents, i.e., indexing. Users would like detailed access to individual Web pages, but catalogers consider their number overwhelming. To place this challenge in perspective, we note that DIALOG, the largest vendor of electronic indexes, provides access to 200 million unique records. Positing an average of 8 descriptors per article yields 1.6 billion humanly assigned index terms in DIALOG databases.
If one-fourth of nonfiction books have indexes, and each has 1000 index entries, that's a lot of human content analysis for the printed record. The estimate that 25% of books contain indexes is a conservative one, falling between the 9% of indexes Hans Wellisch found in his study of incunabula, books printed before 1501 (The Indexer, April 1994), and Bishop, Liddy and Settel's statistic of 82% for contemporary English books (Indexing Tradition and Innovation, 1990), a figure that would not hold for the global scene.
While book indexes preceded the printing era, electronic bibliographic records go back only about three decades – to the founding of OCLC and MEDLARS. Librarians and indexers have created an amazing number of records in that brief period. Web sites, however, are expected to be cataloged instantaneously. There is the widespread perception that robots automatically index Web sites with the speed of light, but a study in Information Technology and Libraries (September 1998) shows that some search engines take more than three weeks to visit a site. Many indexing services beat that, and the Library of Congress manages to provide Cataloging in Publication (CIP) data in 14 days – prior to the publication of a book!
CIP is not provided for all books, and the view that not all Web sites merit cataloging was expressed in a Library Journal article on "Cataloging the Net" (October 1, 1998). This echoes James Anderson's point made at a prior ASIS conference – not all documents deserve human indexing.
Another point on quality: Libraries do not retain everything they acquire. Multiple copies of bestsellers are rented during periods of high demand, but not kept. Paperbacks are displayed, and pamphlets are stored in file cabinets, but neither form gets full cataloging or is kept permanently. The management of ephemera is germane to the Internet.
II. There Is Nothing New Under the Sun (Ecclesiastes I:9)
Urgo claims that "Web sites are unique. They have little in common with our familiar forms of information: books, periodicals, etc. Because of their qualities, alternative methods of classifying and indexing Web sites should be explored . . . ." She notes that the components of a Web site have different functions, as do the rooms of a dollhouse. (The Web version of her paper includes an image of a dollhouse.)
The parts of Web sites described by Urgo are analogous to the departments of a magazine. In designing an index to a periodical, one considers the indexability of the various components, e.g., letters to the editor and book reviews. Hans Wellisch's Indexing from A to Z (1996) deals with this under the rubric of indexable matter. The journal was invented in the 1600s, and the index to multiple journals in the 1800s. We have a lot of experience in dealing with this format.
Many Web sites contain single texts structured much like books and pamphlets. Marshall McLuhan's thesis, "The old is the content of the new" [Understanding Media (1964)], applies here. ASIS Director Dick Hill taught me the term shovelware: machine-readable text that is shoveled onto the Web, without considering whether it is appropriately structured for the medium or up-to-date. Multimedia Web sites are often considered radically different from traditional documents, but librarians have substantial experience in cataloging pictures, sound recordings and even electronic files.
Web sites are also considered unlike traditional documents because their content may change, but scholars have dealt with such changes for centuries. Koehler pointed out the variation inherent in the oral tradition. In the preprinting era manuscripts were copied by people, not Xerox machines. The human factor introduced errors, but there were other sources of variation between copies of a work, notably censorship. Codicologists still study such differences, and the research is facilitated by digital libraries.
Many assume that printed copies of a single edition of a book are identical, but the phenomenon of "corrections at press" is familiar to bibliographers. If an error was noticed in the middle of printing, pieces of type were switched, but the pages with the error were not discarded, and some copies of the book would have them (Ronald McKerrow, An Introduction to Bibliography, 1928, Ch. 6). Scanning technology helps in the identification of these variations. With newer printing techniques such changes are rare, but an errata slip may be inserted into a book after printing. Catalogers note these in bibliographic records (Olive Swain, Notes Used on Catalog Cards, 1940).
CIP data is often prepared from proofs of the preliminaries to a book; changes may subsequently be made by the publisher to every descriptive element, such as title. After publication, the subject of the work may be reinterpreted. Revised CIP is common.
Web sites that are frequently modified are comparable to printed serials. Librarians track the new issues and deal with title changes, mergers and splits of journals.
Indexers have tremendous experience in dealing with changes to text. Book indexers usually work while the author is reading proofs. Because a book may be revised and repaginated during this process, indexers keep a record of the terms assigned to each passage, which facilitates revision of index entries if spelling changes, renumbering of locators when a passage is shifted, or deletion of entries if a section is removed. Loose-leaf publications constitute a challenge for indexers, but expertise in dealing with their revised pages exists, particularly in the legal indexing community. The Library of Congress's Subject Cataloging Manual (5th ed., 1996) exemplifies a well-indexed loose-leaf publication.
Indexers of technical documentation often work with electronic documents, using embedded indexing software, specifically to allow for revisions. Hidden text records the index entries assigned to a given passage, and the indexer modifies the entries as the document is revised. Nancy Mulvany's Indexing Books (1994) describes the index revision process in the print and electronic environments.
From the world of journal indexing we note Medline's indexing of retraction notices following the discovery of scientific fraud (Bulletin of the Medical Library Association, October 1989). The original text remains in the journal, but the retraction notice indicates a change in the status of the article from a peer-reviewed work to an invalid document. (We should apply this system to the Web!)
Web sites are also considered different because their location changes, but that is like the reclassification of a book in a library. Librarians know how to control all the references to a call number in a catalog and revise them when the number changes. My 1996 ASIS Proceedings paper suggested applying these control mechanisms to the problem of changing URLs. The disappearance of a Web site may be compared to the withdrawal of a book from a library. When books are stolen from libraries, the decision has to be made whether to retain the catalog records. This is analogous to recording Web sites that no longer exist.
A final point on change: Librarians and indexers continuously update classification schemes, subject heading lists and thesauri to reflect literary warrant – the actual documents that require content analysis; this experience can be applied to the indexing of the ever-changing Web.
III. There Is No Recollection of the [Knowledge of the] Former [Generations] (Ecclesiastes I:11)
Koehler noted that computer scientists are reinventing the organizational systems of librarians. The verse from Ecclesiastes reflects the basic theme of Phil Smith's paper. Along with his outline, he sent me an article from ACM Transactions on Information Systems (July 1989) containing many ideas that can be applied to the design of interfaces for Web searching. Many people think that the concept of interface design is very recent, but Smith's 1989 paper cites a review of the literature on that topic, by D. Thompson, published in JASIS in 1971.
King Solomon's verse has been echoed in other recent works: Trudi Bellardo Hahn demonstrated in the Bulletin of the American Society for Information Science (April/May 1998) that all of the search capabilities found on the Internet were invented in the early days of online systems. F.W. Lancaster's Indexing and Abstracting in Theory and Practice (1998) reminds us of this as well. Marcia Bates' article in JASIS, November 1998, also reviews research that should be applied to the design of indexing/access systems for the Internet.
In this brief session it was not possible to review all indexing and classification structures that are relevant to the organization of the Internet; I focused only on one – the alphabetico-classed system, which is alluded to but not named in Bates' review. Bates observed that when end users input a general subject heading, such as Psychology, they expect to be shown topical subdivisions that are narrower terms of it, e.g., Visual Perception. Alphabetico-classed systems have been deprecated in both cataloging and indexing in favor of alphabetico-specific headings; yet the highly successful Yahoo! system is essentially an alphabetico-classed system. Alternatively, it can be viewed as a controlled vocabulary that displays narrower terms. Alphabetico-classed systems are discussed in John Metcalfe's Information Indexing and Subject Cataloging (1957) and in an exchange between Ben-Ami Lipetz and me that was collected in Subheadings: A Matter of Opinion (1995).
Chu & He's analysis of hypertext links in terms of exhaustivity and consistency is commendable for its application of well-established indexing constructs to the Internet. This promises to be a fruitful area of research. My structural analysis of the Web sites they selected differs from theirs, however. First, it may not be appropriate to analyze the hyperlinks of organizational Web sites in terms of consistency. My university is implementing a standard format for faculty home pages, but no one has legislated uniformity in the structure or vocabulary of the Web sites of unrelated organizations.
Second, I do not consider the rubrics on home pages to be index terms. Instead I compare them to the headings in the table of contents of a book or magazine. The hyperlink saves you the minimal exertion of turning from the contents page to a later page. The indexing literature discusses whether section headings should be converted to index entries. Many should not, and others have to be reformulated. One hyperlink analyzed by Chu & He has the anchor "Go Shopping." While this is a catchy phrase for a button, it is not likely to be indexed under G, but rather under S.
The URL underlying a hyperlink to an external Web site is essentially a footnote. The link saves you the trouble of physical document retrieval – assuming the URL is correct. Links from a node of text to other pages of a Web site may be viewed as cross references. This concept can be explained through a sample entry from Encyclopaedia Judaica (1971): An asterisk before a name or term indicates the availability of a full article on that topic. A hyperlink saves you the effort of turning to another page or volume, but both structures are basically see also references. This approach to hypertext is found in Challenges in Indexing Electronic Text and Images (1994).
In HTML there are no typed links, and so we cannot automatically differentiate between see and see also references. XML, Extensible Markup Language, promises this capability, but its increased complexity may lead to greater inconsistency in the organization of the Web.
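To make the contrast concrete: an HTML anchor encodes only a target, while the XLink portion of the XML family (a W3C proposal) defines attributes that could carry a link type. The fragment below is hypothetical – the element name and the plain label in xlink:role (XLink formally expects a URI there) are illustrative only.

```xml
<!-- HTML: the anchor says where the link goes, but not what kind of
     reference it is (see vs. see also) -->
<a href="perception.html">visual perception</a>

<!-- XML with XLink (hypothetical markup): attributes could type the link -->
<seealso xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="simple"
         xlink:href="perception.html"
         xlink:role="see-also-reference">visual perception</seealso>
```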
Koehler suggests analyzing the structure of URLs to classify Web sites automatically. His analogy with the field of diplomatics is charming, but inferences about the content of a document from its external form can go only so far. Recently I received a scroll tied up with a ribbon. While diplomatics would suggest it is a will, this scroll holds the menu of a kosher wedding!
Subject analysis is the stepchild of the metadata literature. Much attention is devoted to descriptive fields, such as creator of a Web site. Subject fields are given a token nod in discussions of metadata, and facile comments are made about automatic vocabulary switching when different thesauri are used (Bulletin of the American Society for Information Science, October/November 1997).
Subject indexing was called a "sticky wicket" in Trudi Bellardo Hahn's 1991 primer on this topic. In the context of the Web, subject indexing is even stickier, but we will have to face this challenge: automatic full-text indexing and relevance ranking are not doing the job.
I reject the view that the Web is totally different from the types of documents that catalogers and indexers have handled in the pre-Internet era. The primary problem I see is economic: catalogers have traditionally created records for materials that their libraries own.
Book indexers are hired to add value to publications that will be sold, while journal indexers are paid to create thesauri and analyze documents for a database producer who charges for access to the resulting product.
Nobody owns the Internet, however, and users are reluctant to pay to consult an electronic index to find out which Web sites are relevant to their interests. It is difficult to recognize the effort that goes into creating an index that is easy to use.
OCLC has asked volunteers to contribute catalog records for Web sites to InterCat. This database documents 45,000 sites, a fraction of the total number. The redundancy of library Web pages that identify key sites for a discipline is being recognized. Perhaps librarians will stop devoting time to this activity.
The skills of indexers are being applied to the organization of Intranets, proprietary sites protected by firewalls. Search engines with a profit motive, such as Northern Light, employ indexers to add value to quality publications and make their money from document delivery.
Koehler discussed selective search engines, which may be compared to indexing and abstracting services limited in scope to a single discipline. Each service covers a piece of the bibliographic pie. Dividing up the work may ultimately be the best approach to indexing the Internet.
Bella Hass Weinberg is a professor in the Division of Library and Information Science, St. John's University, Jamaica, NY 11439; 718/990-1456; fax 718/990-2071. A past president of the American Society of Indexers, she received its 1998 Hines Award for distinguished service to the profession. Information Today recently published her anthology, Can You Recommend a Good Book on Indexing?: Collected Reviews on the Organization of Information.