Special Section

Enhanced Text Retrieval Using Natural Language Processing

by Elizabeth D. Liddy

It makes common sense for linguistic processing to be used in the task of text retrieval, given that users' queries are linguistic expressions, and the relevant documents that the system is attempting to retrieve are also linguistic objects. While this may seem obvious today, it has not always been the case.

In the early days of information retrieval (IR) research, approaches to IR remained mainly statistical. This state of affairs was particularly true after funding support for machine translation (MT) research was all but abandoned due to the ALPAC (Automatic Language Processing Advisory Committee of the National Academy of Sciences-National Research Council) Report of 1966. This report said that MT was beyond then-available computational capabilities and recommended that it not be funded. Some low-level linguistic techniques, such as stemming, were introduced and spread widely. However, most efforts to include more complex techniques, such as natural language processing (NLP), were scoffed at. The same situation continued to hold true in the 1970s and 1980s. Those who attempted to demonstrate that NLP had enhanced capabilities to offer IR had an uphill struggle, given the predominant focus on successful statistical approaches by the leaders of the field.

However, by 1993 and 1994, when Dave Lewis and I presented tutorials on the use of NLP for IR at the annual conferences of the Association for Computational Linguistics and ACM-SIGIR (Special Interest Group in Information Retrieval), attendance was exceptionally high. Equally high were both the skepticism and the optimism about whether NLP could improve the effectiveness of real IR applications. Still, the large attendance augured well for the future. While we acknowledged the difficulties others had pointed out as endemic to NLP, the field had advanced sufficiently that the difficulties earlier researchers had deemed insurmountable now seemed more tractable to those in attendance. Inclusion of a broader range of NLP techniques has gradually increased since that time, but it is really only in very recent years that demonstrated successes in the use of NLP have given the beleaguered NLP paradigm a chance at inclusion in large-scale IR systems.

We will look later at the circumstances that can be considered responsible for this change after a brief overview of natural language processing.

Definition of NLP

Natural language processing is a range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of particular tasks or applications. The levels of linguistic analysis are:

Phonological
Morphological
Lexical
Syntactic
Semantic
Discourse
Pragmatic

The above levels of linguistic processing reflect an increasing size of unit of analysis as well as increasing complexity and difficulty as we move from top to bottom. The larger the unit of analysis becomes (i.e., from morpheme to word to sentence to paragraph to full document), the less precise the language phenomena and the greater the free choice and variability. This decrease in precision results in fewer discernible rules and more reliance on less predictable regularities as one moves from the lowest to the highest levels. Additionally, higher levels presume reliance on the lower levels of language understanding, and the theories used to explain the data move more into the areas of cognitive psychology and artificial intelligence. As a result, the lower levels of language processing have been more thoroughly investigated and incorporated into IR systems. I am aware of only one system that includes all levels of language analysis.

Use of NLP in IR

The central task in NLP for IR is the translation of potentially ambiguous natural language queries and documents into unambiguous internal representations on which matching and retrieval can take place. In fact, the ideal IR system is one in which users can express their information needs naturally and with all requisite detail - exactly as they would state them to a research librarian. The system should then "understand" the underlying meaning of the query in all its complexity and subtlety. Likewise, a full NLP IR system will represent the contents of documents - no matter the nature of the document - at all the same levels of understanding, thereby permitting full-fledged conceptual matching of queries and documents.

Those who employ NLP in IR applications may elect to use one or more of the multiple levels of language processing and may elect to apply these levels of language processing to just the queries or to both the queries and documents. Unfortunately for the public's understanding of NLP, some systems that call themselves NLP systems do, in fact, use only a few levels of NLP and use them only on the queries. However, as users have become more sophisticated in their understanding of what is meant by NLP, their expectations are that the documents will likewise be processed and that the language processing will be more complex than just stemming.

Having now understood the levels of language - all of which convey meaning - let's take a look at how these levels of NLP can be utilized in an IR system. Some of these uses will be more familiar than others to readers who know IR, most likely because the less familiar levels are not as frequently incorporated in IR systems. The reasons why some levels are not implemented include


Starting with the lowest unit, the phonological level comes into play in speech recognition systems which accept spoken queries or even provide spoken documents. For these applications, the phonological level is an obvious requirement, but for other applications in IR, this level has obviously not come into play.

The morphological level is the level of linguistic processing most commonly incorporated in IR systems and has the longest history of inclusion. Stemming of terms in documents and queries so that morphological variants between query and document will match has a long history in both experimental and commercial systems. And while there have been differing empirical results on the impact of stemming in English, most current IR systems support stemming to avoid the potential for obviously missed relevant documents. For example, if the plural forms of nouns in documents are not stemmed, these documents will not match to the singular form of the term of interest in a query or vice versa. It should be noted that for other languages that have a richer morphology, the attention to morphological processing offers a much more obvious and larger pay-off for IR than it does for English.
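The plural/singular mismatch described above can be illustrated with a minimal sketch. The suffix rules below are toy assumptions made up for this illustration; they are not the Porter stemmer or the rule set of any particular IR system, and the function names are hypothetical.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English plural suffixes (toy rules, assumed for illustration)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # libraries -> library
    if w.endswith("sses"):
        return w[:-2]                # classes -> class
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # documents -> document
    return w


def matches(query: str, document: str) -> bool:
    """True if every stemmed query term occurs among the stemmed document terms."""
    doc_stems = {toy_stem(t) for t in document.split()}
    return all(toy_stem(t) in doc_stems for t in query.split())


def matches_unstemmed(query: str, document: str) -> bool:
    """Same test on raw lowercased tokens, for comparison."""
    doc_terms = {t.lower() for t in document.split()}
    return all(t.lower() in doc_terms for t in query.split())
```

With these assumed rules, the singular query "library document" matches the text "Digital libraries index documents" only when both sides are stemmed; the unstemmed comparison misses it.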

The lexical level of linguistic processing can be used in IR either for part-of-speech tagging or for the utilization of lexicons from which the detailed features of individual terms can be accessed. The lexical level of language is evidenced in the knowledge contained in thesauri and other similar resources, which were originally manual consultation tools for both indexers and searchers. They were and are utilized to ensure that a common vocabulary is used in selecting appropriate indexing or searching terms/phrases. These lexicons, which provide both syntagmatic and paradigmatic relations of terms, can be used in IR systems for automated or semi-automated assistance in building queries. Recognition, tagging and indexing of specific lexical features of interest (e.g., proper nouns) also reflect the use of lexical information.

The syntactic level of linguistic processing utilizes the part-of-speech tagging output from the lexical level and can assign phrase and clause brackets. This semi-parsed text can then be used to drive the selection of better indexing entries because phrases can be automatically recognized and used to represent the documents' contents rather than single-word indexing which frequently introduces ambiguity into the representation and resultant retrieval. Similarly, syntactically identified phrases extracted from the query can provide better searching keys for matching against similarly bracketed documents.
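As a rough sketch of how part-of-speech tags can drive phrase recognition, the function below brackets simple noun phrases (optional adjectives followed by nouns) in an already-tagged sentence. The tag names follow the common Penn Treebank style, the input is assumed to be hand-tagged, and the single pattern used here is a deliberate simplification rather than the grammar of any real parser.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}
PHRASE_TAGS = NOUN_TAGS | {"JJ"}   # adjectives may appear inside a phrase


def noun_phrases(tagged):
    """Return multi-word phrases made of adjective/noun runs that end in a noun.

    `tagged` is a sequence of (word, tag) pairs; tagging is assumed done upstream.
    """
    phrases, run = [], []
    # A sentinel token with a non-phrase tag flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in PHRASE_TAGS:
            run.append((word, tag))
            continue
        # Drop trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


sentence = [
    ("The", "DT"), ("system", "NN"), ("retrieves", "VBZ"),
    ("natural", "JJ"), ("language", "NN"), ("queries", "NNS"),
    ("from", "IN"), ("digital", "JJ"), ("libraries", "NNS"),
]
index_phrases = noun_phrases(sentence)
```

Here the phrases "natural language queries" and "digital libraries" become candidate indexing entries, while the lone noun "system" is left to single-word indexing.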

Use of the semantic level of language in IR includes interpretation of the meaning of sentences as the unit of understanding, as opposed to processing at the individual word or phrase level. This level of processing can include the semantic disambiguation of words with multiple senses, the identification of predicate argument relations in sentences or the expansion of the query by addition of all synonymous equivalents of the query terms. Term expansions can be obtained from lexical sources such as WordNet or IR-style thesauri, but the challenge here is to add just those terms which are expansions of the particular sense of the word intended in the query. Another usage of semantic processing is the production of semantic vectors to represent both queries and documents, but this also requires that the appropriate sense of each term has been determined and the appropriate semantic category selected for inclusion in the semantic vector.
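The challenge of adding only the expansions that fit the intended sense can be sketched as follows. The small sense inventory is entirely hypothetical (it is not WordNet data or its API), and the sense is chosen by a crude overlap between each sense's cue words and the other query terms, which stands in for real disambiguation.

```python
# Hypothetical sense inventory: term -> sense name -> cue words and synonyms.
SENSES = {
    "bank": {
        "finance": {
            "cues": {"loan", "money", "account", "interest"},
            "synonyms": ["lender", "financial institution"],
        },
        "river": {
            "cues": {"river", "water", "erosion", "flood"},
            "synonyms": ["riverbank", "shore"],
        },
    },
}


def expand_query(terms):
    """Append synonyms only for the sense best supported by the other query terms."""
    expanded = list(terms)
    context = set(terms)
    for term in terms:
        senses = SENSES.get(term)
        if not senses:
            continue
        best, best_score = None, 0
        for name, info in senses.items():
            score = len(info["cues"] & (context - {term}))
            if score > best_score:
                best, best_score = name, score
        # With no supporting context, add nothing rather than guess a sense.
        if best is not None:
            expanded.extend(senses[best]["synonyms"])
    return expanded
```

Under these assumptions, the query terms "bank", "erosion" and "flood" pick up "riverbank" and "shore", while "bank" and "loan" pick up the financial synonyms, and "bank" on its own is left unexpanded.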

The discourse level of language processing goes beyond the sentence to understand and represent meaning and therefore can utilize the structure and organizing principles implicitly used by writers of documents and queries. Such processing would take into account the predictable script-like structure of communications which are oft-repeated by a community which actually relies on this structure to convey meaning above and beyond that conveyed in individual words or sentences. In IR, the discourse level structure can be utilized to understand what the specific role of a piece of information is in a document, for example - is this a conclusion, is this an opinion, is this a prediction or is this a fact?

Additionally, the recognition and resolution of anaphora (abbreviated subsequent reference to a concept introduced earlier in the text, e.g., pronouns) would result in an improved representation of both documents and queries. The representation would be improved because anaphora resolution would enable the implicit presence of concepts to be more completely accounted for at the lexical level and an integrated representation of the contents of a query or document to be produced at both the semantic and discourse levels.

The pragmatic level of language, which is concerned with how the external world impacts the meaning of communications, would come into play primarily at the query processing and understanding level. In the same way that good reference librarians can elicit from users the purpose to which they plan to put the information they are seeking, IR systems need to understand users and their needs in the context of their history and their goals. Gricean maxims and other principles of communication can be incorporated in the user interface of IR systems to facilitate the "conversation" between the user and the IR system.

Commercial Use of NLP in IR

Some linguistic enhancements have finally reached commercially available search engines, but there is no standard use of terminology to describe their processes. As a result, what the various search engines are actually doing is often far from clear. Most use of linguistics is rudimentary. The engines expect minimal one-word or two-word queries and are optimized for them, rather than for sentences, which would enable the user to fully present their information need. In general, the linguistic enhancements one now sees include

In general, linguistic approaches have begun to filter into the Web search engine world, but they do not appear to have hit the traditional online systems, whose most common form of NLP continues to be automatic truncation. While some services permit the user to enter a query without using the idiosyncratic formats previously required, query processing still appears to be dependent at most on simple morphological and lexical levels of processing. Additionally, most systems which state that they use NLP appear to perform linguistic processing on just the queries. The documents in the database have not been processed with any level of linguistic analysis. However, current attention to linguistics among IR vendors seems to indicate that they will be incorporating more NLP in the future.

Those who are interested in learning about a full NLP-based IR system that does incorporate all the levels of language processing described above may want to check out the DR-LINK (Document Retrieval using LINguistic Knowledge) System at www.textwise.com or www.mnis.net.

This system was developed to demonstrate the powerful capabilities which NLP has to offer IR.

Elizabeth D. Liddy is president of TextWise, LLC, and professor in the Syracuse University School of Information Studies. She can be reached there by mail at 4-206 Center for Science and Technology, Syracuse, NY 13244-4100; by phone at 315/443-4456; or by e-mail at liddy@mailbox.syr.edu.