Special Section

Text Retrieval Online: Historical Perspective on Web Search Engines

by Trudi Bellardo Hahn

The first online text retrieval systems appeared more than 30 years ago, and in the years since, they have continued to evolve. At a quick glance, it might appear that their development has been one long line of continuous improvement in functionality and usability. But is that true? Are all users - not just expert searchers - able to search with increasing facility and speed and with greater satisfaction with their results than the users of the early online era?

A part of the answer to that question can be revealed by examining the basic features of online systems, tracing when those features were introduced and comparing them with the functionalities and power of modern Web-based search engines. These features include search capabilities, browse capabilities and other miscellaneous helpful functions.

In the research that Charles Bourne and I are conducting for our forthcoming book on the early development of online systems, we are attempting to establish the milestones of invention and implementation. We will report on many details of development and implementation, in order to give credit to the many individuals who contributed to the progress of online retrieval. However, verifying dates, establishing priority of discovery and invention and giving proper credit to genuine trailblazers is a difficult and complex task. In this brief article, the examples given are not necessarily the very first or only of their type; they were, however, among the first solid trailblazing efforts that resulted in working systems.

What are the basic capabilities of online systems and search engines, when were they first introduced and by whom? Note that persons named are the individuals most often associated with a project. However, the pioneers did not work alone; they always were part of a research and development team. It is worth noting also that many of these search features had been conceptualized a decade or so earlier and used in serial searching of databases on magnetic tape.

Search Capabilities

The goal of a search capability is to match a user's specified information need with items in a database that will answer it. Two types of search capabilities are used: those that help to specify the relationship between terms in a search statement and those that facilitate the interpretation of a particular word.

Relationships Between Terms

Interpretation of a Particular Word

Browse Capabilities

Once a search is completed, browse capabilities help a user to determine which items are of interest and to select them to be displayed more fully. Since searches usually retrieve non-relevant items, browse capabilities assist in focusing on items that have the highest likelihood of meeting the information need.

Miscellaneous Capabilities

From the 1960s forward, online systems have offered additional functions that reduce the time needed to input queries, make query entry easier and reduce the likelihood of entering a poor query.

Web Search Engine Capabilities

Search engines are a common way to find documents on the Web that contain content relevant to a specific word or topic. While hundreds of search engines index the Web, this analysis focuses on the large, general-purpose ones such as InfoSeek, Excite, AltaVista and Lycos. Each of these "supermarket" services indexes about 50 million sites in its database (although not necessarily the same 50 million). All offer statistically based, relevance-ranked search retrieval, as well as other functionalities. Which of the basic capabilities of online systems are found in Web search engines, and how well do Web search engines respond to individual users' needs and preferences?

Boolean operations are common but not universal. Sometimes the logical relationships are automatic or implied; it can be hard for a user to determine when or whether an AND or OR is automatic. Proximity operations and word phrase searching are often but not always available. Natural language query creation is rare.
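To see why an implied operator matters, consider a minimal sketch in Python. It is not drawn from any of the search engines named above; the three documents, the `build_index` and `search` functions and the `implied` parameter are all invented for illustration. The same two-word query is run once with an implied AND and once with an implied OR.

```python
# Toy collection: document id -> text (all data here is invented).
DOCS = {
    1: "online retrieval systems of the 1960s",
    2: "web search engines and relevance ranking",
    3: "online search engines for the web",
}


def build_index(docs):
    """Build an inverted index: word -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query, implied="AND"):
    """Combine the postings of every query word with one implied operator."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    if implied == "AND":
        return set.intersection(*postings)  # documents containing every word
    return set.union(*postings)             # documents containing any word


INDEX = build_index(DOCS)
print(sorted(search(INDEX, "online engines", implied="AND")))  # [3]
print(sorted(search(INDEX, "online engines", implied="OR")))   # [1, 2, 3]
```

The query text is identical in both calls, yet the result sets differ. A user who cannot tell which operator the system applies has no way to predict which of the two sets will come back.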

Fuzzy search is standard. What is rarer is the ability to search for an exact match of a query, which may be all that a user wants or needs. Truncation is generally available, but it is sometimes applied automatically and is not under users' control, which may result in unwanted retrieval. Numeric and date ranging, term weighting and field limiting are available, but vary considerably from one search engine to another. Elimination of words on a stop list is common, but it is difficult for users to determine what is on the stop list, which may frustrate a search for certain words or phrases. The automatic incorporation of synonyms into search formulations is common, but users usually cannot disable it when it is not wanted. Case sensitivity is sometimes available, sometimes not; it is often hard for a user to determine which. "More like this" is rare. Citation searching is not found.
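Two of these behaviors, stop-list elimination and automatic truncation, can be illustrated with a short Python sketch. The stop list, the vocabulary and the function names below are assumptions made up for the example; they do not describe how any particular search engine is built.

```python
# Hypothetical stop list; real systems rarely show theirs to the user.
STOP_WORDS = {"the", "of", "to", "be", "or", "not", "a"}

# Hypothetical set of words found in the indexed documents.
VOCABULARY = {"cat", "cats", "catalog", "catalogs", "category", "dog"}


def remove_stop_words(query):
    """Drop every query word that appears on the stop list."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


def truncation_match(stem, vocabulary):
    """Right-hand truncation: return every indexed word that begins with the stem."""
    return sorted(word for word in vocabulary if word.startswith(stem))


print(remove_stop_words("to be or not to be"))  # []
print(truncation_match("cat", VOCABULARY))
# ['cat', 'catalog', 'catalogs', 'category', 'cats']
```

In this sketch the phrase "to be or not to be" disappears entirely once stop words are removed, and a search on "cat" silently expands to "catalog" and "category". Both effects are harmless when the user asks for them and puzzling when the system applies them without saying so.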

Ranking and relevance feedback are available, but based on a variety of criteria unknown to users. Output options are limited in most systems. If zoning is offered, the system usually does not display enough text for the user to make a relevance judgment. Highlighting is occasionally available. The number of records displayed is typically determined by the system and cannot be altered by the user.

Iterative search and canned query are generally not available. Vocabulary browse, concept hierarchies or thesaurus expansion displays are rare. Sometimes, however, expansion is done automatically without a user's knowledge or control.

The Web is an incredible document delivery medium, unthinkable in size and variety just a few years ago. The choice of a novice or experienced user interface is sometimes available. Global access is a given on the World Wide Web.

Research and Development Needed

In spite of the difficulties in dating "firsts," we can determine that nearly all the basic functions and features were developed for online text retrieval in the 1960s. These "firsts" were developed at many different institutions by many pioneers. Most of the systems of that era were not much more than experimental or laboratory systems with small databases, but the functions and features did actually work and in some cases were applied to significant military, legal or educational applications during that first decade.

In comparing the early developments with the retrieval systems employed on the World Wide Web today, we must be careful not to think of today's search engines as simply more fully evolved versions of the early systems. They do not use all the same functionalities and they incorporate some different retrieval principles. Thus we should expect different performance. For example, the retrieval of thousands of records in response to a query need not be a cause for alarm if a user can see the most relevant ones on the first screen or two. Expert searchers know that the performance of some Web search engines can be improved by changing the order in which terms are entered or by adding or removing synonyms and related terms.

Nonetheless, important performance issues remain that deserve more research and development. Few actual users of Web search engines understand how to manipulate and control a query to maximize the quality of their retrieval. The documentation is limited, sometimes not up-to-date or even nonexistent, and most users are too impatient or unaware to consult it anyway. Thus, despite the vast amount of information that is, in theory at least, accessible via the World Wide Web, most users still retrieve documents that have little or nothing to do with the topic of interest and fail to find the material most pertinent to them.

Furthermore, many search functions and features are performed automatically, without a user's knowledge or control. The underlying philosophy of Web search engines seems to be that the system knows best, and that users would do well not to interfere. The Web searcher has given up a great deal of control in exchange for simplicity of use. The result of this philosophy is that, despite its potential, the Web remains difficult for large numbers of users to navigate.

Some of the limitations of Web search engines are being compensated for by librarians and other information providers, who are creating Web directories that organize sites by broad subject category, metasites of links to related sites and publications that list the URLs of sites important to a given topic. Other organizations are offering filtering software to eliminate certain types of unwanted retrieval. Still others are conducting research on adding metadata to Web records to facilitate intelligent retrieval by man or machine.

In addition to search, browse and miscellaneous other capabilities, the performance of a search system can be assessed in other ways, including completeness of results (recall), finding a needle in a haystack without a mouthful of straw (precision), response time, database coverage, update frequency, output overlap, output options, reliability and friendliness of interface. If we were to revisit the 1960s, we would find that all of these performance characteristics were considered then and that rigorous evaluation studies were conducted.
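Recall and precision have standard definitions that are easy to state in code. The Python sketch below uses those definitions; the function name and the sample document numbers are invented for illustration and do not come from any evaluation study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items in the database
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 8 documents retrieved, 4 relevant ones exist, 2 were found.
precision, recall = precision_recall(
    retrieved=[1, 2, 3, 4, 5, 6, 7, 8],
    relevant=[2, 4, 10, 11],
)
print(precision, recall)  # 0.25 0.5
```

Here three-quarters of what the user sees is straw, and half of the needles are still in the haystack. Note that recall can only be computed when the full set of relevant items is known, which is practical in a controlled test collection but not on the open Web.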

Today's Web designers, developers and evaluators have not yet addressed all these performance issues. In spite of the amazing ability of Web search engines to scan millions of text records of every conceivable type in an instant, users are still frustrated with unsatisfactory results. The apparent ease of use masks the actual difficulty in finding useful information. Designers of search engines of the future must make some critical decisions about how much complexity in document type and content domain they can effectively handle and whether expected advances in bandwidth, language parsing and search algorithms can really address users' needs for useful information. The answer to giving control and power to users might be found in older features such as field specification, synonym incorporation, thesaurus expansion or truncation - or it might not. But remembering the basic functionalities and why they were designed the way they were might reveal just what it is about online text retrieval systems that makes them not only easy to use, but also powerful tools for text retrieval.

Trudi Bellardo Hahn is manager of User Education Services at the University of Maryland, College Park Libraries. She can be reached at McKeldin Library, University of Maryland, College Park, Maryland 20742-7011, or by e-mail at th90@umail.umd.edu