The recent explosion of electronic information has far outpaced the availability of automated tools to manage it effectively. As the amount of accessible electronic information increases, the cost of accessing this information also increases. End users must spend their valuable time wading through masses of irrelevant documents to get the information they require. Accordingly, communities unfamiliar with library science are beginning to grapple with metadata and the organization of large collections of data. Most notably, several Internet search providers now offer services beyond simple searching. For example, Yahoo! now presents information in categories to try to reduce information overload.
Historically, librarians have successfully organized the world's information by creating surrogates to manage, classify and filter information of many types. For instance, to manage an item, a catalog entry is created to describe the item. Then, the patron need not have the item in hand to know what it is about, who created it, where it can be found, etc. Creating metadata about an item makes searching, filtering, organizing and retrieving the item very efficient. This is true of traditional materials as well as new electronic resources.
If every electronic item had a catalog entry or its equivalent, then interfaces could more effectively provide access to both the raw content (the free Internet search service model) and the metadata (the library model). Unfortunately, librarians are so overburdened that they can barely keep up with their traditional workload, let alone begin to catalog and organize the vast amounts of information available electronically. Electronic resources will never all be cataloged by hand; it is simply too expensive. We need automated tools that apply library science ideas like classification to electronic resources at high speed and low cost.
Subject Assignment via Scorpion
The two most popular classification schemes in the world are the Library of Congress Classification and the Dewey Decimal Classification (Dewey for short). Classification schemes partition the world of knowledge to describe resources succinctly. They have mostly been used to shelve items in libraries, not to describe the complete subject content of an item.
It is important to note that the primary topic used to classify an item is not necessarily the most important (or only) subject of the item. While the use of primary topics for classification has proven extremely useful, resources are seldom about only a single topic. Thus, when cataloging an item, librarians denote its subject content using subject headings from an authoritative list like the Library of Congress Subject Headings.
While traditional catalog entries contain a wealth of metadata, the subject portion is arguably the most important when it comes to building advanced search and retrieval interfaces. Tools to access and display electronic resources more effectively could be built if there were some way to assign subject headings or concept domains automatically to electronic items. That's where Scorpion comes in.
Scorpion is a research project attempting to combine indexing and cataloging, based on the observation that these are complementary activities. Scorpion specifically focuses on building tools for automatic subject recognition by combining library science and information retrieval techniques. A thesis of Scorpion is that Dewey can be used to perform automatic subject assignment for electronic items. That is, Dewey can be used to classify an item and denote subject headings. The basic idea behind Scorpion is that resources to be cataloged can be treated as queries against a special Dewey database that returns a ranked list of potential subjects.
While Scorpion cannot replace human cataloging, Scorpion can produce tools that help reduce the cost of traditional cataloging by automating subject assignment when items are available electronically. For instance, a list of potential subjects could be presented by Scorpion to a human cataloger who could then choose the most appropriate subject.
Dewey is maintained electronically via the Editorial Support System (ESS) at OCLC Forest Press. The corresponding ESS records contain all the raw information used to produce the printed Dewey Decimal Schedules and Tables. ESS records comprise a variety of labeled fields. Some or all of these fields can be used to build ranked retrieval databases for automatically assigning subject codes to documents. By treating documents as queries against such a database, the result set can be viewed as possible subjects for the document.
Figure 1 contains an overview of this process. First, a set of ESS records is selected for inclusion in the ranked retrieval database. Then, selected fields from these records are used to build the database. To assign subjects automatically to an electronic resource, the resource is turned into a database query, and the ranked results are treated as a list of potential subjects for the resource.
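The document-as-query idea can be sketched with a toy ranked retrieval database. Everything below is invented for illustration: the miniature "ESS" records, their captions, and the plain term-frequency cosine scoring. The real Scorpion databases are built from many labeled ESS fields with more sophisticated weighting.

```python
import math
from collections import Counter

# Hypothetical miniature "ESS" records: a Dewey number plus caption text.
# Real ESS records contain many labeled fields; these entries are invented.
RECORDS = {
    "004": "computers computer science data processing",
    "025.04": "information storage retrieval systems online",
    "634.9": "forestry forest exploitation trees lumbering",
}

def tokenize(text):
    return text.lower().split()

def build_index(records):
    """The 'build' step: one term-frequency vector per ESS record."""
    return {dewey: Counter(tokenize(text)) for dewey, text in records.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_subjects(document, index, top=3):
    """Treat the document as a query; return ranked Dewey candidates."""
    query = Counter(tokenize(document))
    scores = [(cosine(query, vec), dewey) for dewey, vec in index.items()]
    return [dewey for score, dewey in sorted(scores, reverse=True) if score > 0][:top]

index = build_index(RECORDS)
print(rank_subjects("an online retrieval system for library information", index))
```

The highest-ranked Dewey numbers returned for a document serve as its candidate subjects, which a human cataloger could then confirm or reject.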
While the initial ideas of Scorpion were easy to prove, we have spent a great deal of time refining the ranked retrieval databases and working on pre-processing of the input data and post-processing of the result sets. For instance, we have built over 30 different databases trying various combinations of ESS records, ESS fields, stemming and other characteristics.
An interesting result that we have found is that different collections require different pre-processing and post-processing. For instance, a pre-processing step for an HTML collection of documents might be to remove HTML tags except for META tags. Other collections may need different pre-processing. An encyclopedia with structured articles might necessitate splitting the articles up by sections and sending each section through Scorpion so that subjects could be assigned to each individual section and not just the article as a whole.
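A minimal sketch of the HTML pre-processing step just described, using Python's standard html.parser: markup is stripped, but the content of META tags is kept, since META tags often carry author-supplied keywords and descriptions. The class name and sample page are invented; Scorpion's actual pre-processing may differ.

```python
from html.parser import HTMLParser

class ScorpionPreprocessor(HTMLParser):
    """Strip HTML markup but keep the text plus any META tag content."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Keep the value of a META content attribute, e.g.
            # <meta name="keywords" content="libraries, cataloging">
            for name, value in attrs:
                if name == "content" and value:
                    self.parts.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

def preprocess(html):
    p = ScorpionPreprocessor()
    p.feed(html)
    return " ".join(p.parts)

page = ('<html><head><meta name="keywords" content="Dewey, classification">'
        '<title>OCLC</title></head><body><b>Scorpion</b> research</body></html>')
print(preprocess(page))  # → Dewey, classification OCLC Scorpion research
```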
A good example of pre-processing data for Scorpion was performed by Jean Godby at OCLC. (Figure 2.) She was interested in automatically extending Dewey with new concepts. As an example, she took a set of articles containing the term "AT&T" and assumed that the collection of articles constituted a general definition of AT&T. (Given enough example articles, one can see how this would be true.) She then collapsed this collection of articles into a single article and submitted it to Scorpion. Scorpion returned good, general results, but she thought she could do better. So, she ran the raw data through her natural language processing tools to create a surrogate of the important phrases contained in the raw text. This surrogate was then run through Scorpion. The surrogate results were better and more specific than the raw-data results. Note that Scorpion itself did not change; only its input did.
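The collapse-and-surrogate approach might be approximated, very crudely, as follows: merge the example articles into one text, then keep only recurring word bigrams as stand-in "important phrases." Godby's actual natural language processing tools are far more sophisticated; the helper names and sample articles here are invented for illustration.

```python
import re
from collections import Counter

def collapse(articles):
    """Collapse a set of example articles into one text, treating the
    collection as an implicit definition of the shared concept."""
    return " ".join(articles)

def phrase_surrogate(text, min_count=2):
    """A crude stand-in for real phrase extraction: keep only word
    bigrams that recur, as candidate 'important phrases'."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = Counter(zip(words, words[1:]))
    return [" ".join(b) for b, n in bigrams.most_common() if n >= min_count]

articles = [
    "the long distance telephone network carries voice traffic",
    "a long distance carrier runs the telephone network",
]
print(phrase_surrogate(collapse(articles)))
```

The surrogate, not the raw text, would then be submitted to Scorpion as the query.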
As mentioned earlier, it may also be important to perform some clean up of result sets. This post-processing of Scorpion results has become a major research effort for us. While we are already producing good result sets, we are continually seeking ways to remove the occasional entry that humans immediately note as unwanted. For instance, we once got "forest exploitation" returned as a possible subject for OCLC's homepage because the page mentioned "OCLC Forest Press." Clearly, cleanup of this type is important.
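One possible post-processing heuristic for the "OCLC Forest Press" problem is to discount candidate subjects whose supporting terms occur only inside proper-name phrases. The sketch below is an assumption, not Scorpion's actual cleanup: it masks runs of capitalized words before counting how often each candidate's key terms occur, and the candidate list and page text are invented.

```python
import re

def supporting_occurrences(term, text):
    """Count occurrences of a term that are NOT part of a proper-name
    phrase (crude heuristic: a run of two or more capitalized words
    is treated as a proper name and masked out first)."""
    proper = re.compile(r"(?:[A-Z][a-z]+\s+){1,}[A-Z][a-z]+")
    masked = proper.sub(" ", text)
    return len(re.findall(rf"\b{re.escape(term)}\b", masked, re.IGNORECASE))

def postprocess(candidates, text):
    """Keep a candidate subject only if at least one of its key terms
    occurs outside a proper name in the source text."""
    return [subject for subject, terms in candidates
            if any(supporting_occurrences(t, text) > 0 for t in terms)]

page_text = "OCLC Forest Press publishes the Dewey Decimal Classification schedules."
candidates = [("634.92 Forest exploitation", ["forest"]),
              ("025.4 Classification", ["classification", "schedules"])]
print(postprocess(candidates, page_text))  # "forest" appears only in "Forest Press"
```

Here the "forest exploitation" candidate is dropped because "forest" appears only inside the proper name "Forest Press," while the classification candidate survives via "schedules."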
Scorpion is a research project. While many initial tools have been built and evaluation has begun, there is still much work to be done. We have several ideas for improvement and expect many more suggestions from external users of the Scorpion tools. Some anticipated results include
where additional documentation explaining the Scorpion tools and experimental results will be posted. At the time this article was written, the Scorpion Web demonstration required an ID and password. Anyone desiring to see the Scorpion tools in action should contact the author for an ID and password.