In the early days of information retrieval (IR) research, approaches to IR remained mainly statistical. This was particularly true after funding support for machine translation (MT) research was all but abandoned following the ALPAC (Automatic Language Processing Advisory Committee of the National Academy of Sciences-National Research Council) Report of 1966. This report said that MT was beyond then-available computational capabilities and recommended that it not be funded. Some low-level linguistic techniques, such as stemming, were introduced and spread widely. However, most efforts to include more complex techniques, such as natural language processing (NLP), were scoffed at. The same situation continued to hold true in the 1970s and 1980s. Those who attempted to demonstrate that NLP had enhanced capabilities to offer IR faced an uphill struggle, given the predominant focus on successful statistical approaches by the leaders of the field.
By 1993 and 1994, however, when Dave Lewis and I presented tutorials on the use of NLP for IR at the annual conferences of the Association for Computational Linguistics and ACM-SIGIR (Special Interest Group on Information Retrieval), attendance was exceptionally high. Equally high were both the skepticism and the optimism about whether NLP could improve the effectiveness of real IR applications. Still, the large attendance augured well for the future. While we acknowledged the difficulties others had pointed out as endemic to NLP, the field had advanced sufficiently that difficulties earlier researchers had deemed insurmountable now seemed more tractable to those in attendance. Inclusion of a broader range of NLP techniques has gradually increased since that time, but it is really only in very recent years that demonstrated successes in the use of NLP have given the beleaguered NLP paradigm a chance at inclusion in large-scale IR systems.
We will look later at the circumstances that can be considered responsible for this change after a brief overview of natural language processing.
Those who employ NLP in IR applications may elect to use one or more of the multiple levels of language processing and may elect to apply these levels of language processing to just the queries or to both the queries and documents. Unfortunately for the public's understanding of NLP, some systems that call themselves NLP systems do, in fact, use only a few levels of NLP and use them only on the queries. However, as users have become more sophisticated in their understanding of what is meant by NLP, their expectations are that the documents will likewise be processed and that the language processing will be more complex than just stemming.
Having now understood the levels of language - all of which convey meaning - let's take a look at how these levels of NLP can be utilized in an IR system. Some of these uses will be less familiar than others, even to those with some familiarity with IR, most likely because they are less frequently incorporated in IR systems. The reasons why some levels are not implemented include
The morphological level is the level of linguistic processing most commonly incorporated in IR systems and has the longest history of inclusion. Stemming of terms in documents and queries so that morphological variants between query and document will match has a long history in both experimental and commercial systems. And while there have been differing empirical results on the impact of stemming in English, most current IR systems support stemming to avoid the potential for obviously missed relevant documents. For example, if the plural forms of nouns in documents are not stemmed, these documents will not match to the singular form of the term of interest in a query or vice versa. It should be noted that for other languages that have a richer morphology, the attention to morphological processing offers a much more obvious and larger pay-off for IR than it does for English.
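The plural/singular mismatch described above can be illustrated with a minimal sketch. The `toy_stem` function below is a hypothetical suffix stripper written only for this illustration; it is not the Porter algorithm or any stemmer used in a real system, and the three sample documents are invented. The sketch stems terms at both indexing and query time so that morphological variants meet at the same index entry.

```python
def toy_stem(term):
    """Strip a few common English plural suffixes (illustrative only)."""
    t = term.lower()
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "y"          # policies -> policy
    if t.endswith("sses"):
        return t[:-2]                # classes -> class
    if t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        return t[:-1]                # patents -> patent
    return t


def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every stemmed query term."""
    postings = [index.get(toy_stem(tok), set()) for tok in query.split()]
    return set.intersection(*postings) if postings else set()


# Invented sample collection (an assumption for this example).
docs = {
    "d1": "retrieval of patents",
    "d2": "patent law",
    "d3": "library policies",
}
index = build_index(docs)

print(search(index, "patent"))   # matches both the singular and plural documents
print(search(index, "policy"))   # matches the document containing "policies"
```

Without the call to `toy_stem` on both sides, the query "patent" would miss document `d1` and the query "policy" would miss `d3`, which is exactly the kind of obviously missed relevant document that stemming is meant to prevent.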
The lexical level of linguistic processing can be used in IR either for part-of-speech tagging or for the utilization of lexicons from which the detailed features of individual terms can be accessed. The lexical level of language is evidenced in the knowledge contained in thesauri and other similar resources, which were originally manual consultation tools for both indexers and searchers. They were and are utilized to ensure that a common vocabulary is used in selecting appropriate indexing or searching terms/phrases. These lexicons, which provide both syntagmatic and paradigmatic relations of terms, can be used in IR systems for automated or semi-automated assistance in building queries. Recognition, tagging and indexing of specific lexical features of interest (e.g., proper nouns) also reflect the use of lexical information.
The syntactic level of linguistic processing utilizes the part-of-speech tagging output from the lexical level and can assign phrase and clause brackets. This semi-parsed text can then be used to drive the selection of better indexing entries because phrases can be automatically recognized and used to represent the documents' contents rather than single-word indexing which frequently introduces ambiguity into the representation and resultant retrieval. Similarly, syntactically identified phrases extracted from the query can provide better searching keys for matching against similarly bracketed documents.
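As a rough sketch of how tagged text might be bracketed into indexing phrases, the example below groups maximal runs of adjectives and nouns that end in a noun. The tag names (`ADJ`, `NOUN`, and so on), the function `bracket_noun_phrases`, and the hand-tagged sentence are all assumptions made for illustration; a real system would take its tags from an actual part-of-speech tagger and would likely use a fuller grammar.

```python
def bracket_noun_phrases(tagged):
    """Return multi-word noun phrases from a list of (token, tag) pairs.

    A phrase is a maximal run of ADJ/NOUN tokens that ends in a NOUN.
    Single-word runs are skipped, since the point here is phrase indexing.
    """
    phrases, run = [], []
    for token, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # Drop trailing adjectives so every phrase ends in a noun head.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(tok.lower() for tok, _ in run))
        run = []
    return phrases


# Hand-tagged example sentence (tags are assumed, not tagger output).
tagged_sentence = [
    ("The", "DET"), ("junior", "ADJ"), ("college", "NOUN"),
    ("offers", "VERB"), ("information", "NOUN"), ("retrieval", "NOUN"),
    ("courses", "NOUN"), ("in", "ADP"), ("spring", "NOUN"),
]

print(bracket_noun_phrases(tagged_sentence))
# ['junior college', 'information retrieval courses']
```

Indexing "junior college" as a unit avoids the ambiguity of the single words "junior" and "college", and the same function applied to a tagged query would yield phrase keys that can be matched against the bracketed documents.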
Use of the semantic level of language in IR includes interpretation of the meaning of sentences as the unit of understanding, as opposed to processing at the individual word or phrase level. This level of processing can include the semantic disambiguation of words with multiple senses, the identification of predicate argument relations in sentences or the expansion of the query by addition of all synonymous equivalents of the query terms. Term expansions can be obtained from lexical sources such as WordNet or IR-style thesauri, but the challenge here is to add just those terms which are expansions of the particular sense of the word intended in the query. Another usage of semantic processing is the production of semantic vectors to represent both queries and documents, but this also requires that the appropriate sense of each term has been determined and the appropriate semantic category selected for inclusion in the semantic vector.
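The challenge of adding only the synonyms for the intended sense can be sketched as follows. The `SENSES` table is a tiny invented stand-in for a resource such as WordNet or an IR-style thesaurus, and the sense selection rule (pick the sense whose cue words overlap most with the other query terms) is a simplified assumption, not the method of any particular system.

```python
# Hypothetical sense inventory: term -> {sense label: (cue words, synonyms)}.
SENSES = {
    "bank": {
        "finance": ({"loan", "money", "account", "interest"},
                    ["lender", "depository"]),
        "river": ({"water", "river", "erosion", "flood"},
                  ["riverbank", "shore"]),
    },
}


def expand_query(query):
    """Append synonyms for the sense best supported by the query context.

    If no sense has any cue word in the rest of the query, the term is
    left unexpanded rather than expanded with the wrong sense.
    """
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        senses = SENSES.get(term)
        if not senses:
            continue
        context = set(terms) - {term}
        best_label, best_overlap = None, 0
        for label, (cues, _synonyms) in senses.items():
            overlap = len(cues & context)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        if best_label is not None:
            expanded.extend(senses[best_label][1])
    return expanded


print(expand_query("bank erosion flood"))
# ['bank', 'erosion', 'flood', 'riverbank', 'shore']
print(expand_query("bank loan interest"))
# ['bank', 'loan', 'interest', 'lender', 'depository']
print(expand_query("bank holiday"))
# ['bank', 'holiday']
```

Declining to expand when the context gives no evidence is a deliberate choice in this sketch: adding the synonyms of every sense would pull in documents about the wrong meaning of the term.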
The discourse level of language processing goes beyond the sentence to understand and represent meaning and therefore can utilize the structure and organizing principles implicitly used by writers of documents and queries. Such processing would take into account the predictable script-like structure of communications which are oft-repeated by a community which actually relies on this structure to convey meaning above and beyond that conveyed in individual words or sentences. In IR, the discourse level structure can be utilized to understand what the specific role of a piece of information is in a document, for example - is this a conclusion, is this an opinion, is this a prediction or is this a fact?
Additionally, the recognition and resolution of anaphora (abbreviated subsequent reference to a concept introduced earlier in the text, e.g., pronouns) would result in an improved representation of both documents and queries. The representation would be improved because anaphora resolution would enable the implicit presence of concepts to be more completely accounted for at the lexical level and an integrated representation of the contents of a query or document to be produced at both the semantic and discourse levels.
The pragmatic level of language, which is concerned with how the external world impacts the meaning of communications, would come into play primarily at the query processing and understanding level. In the same way that good reference librarians can elicit from users the purpose to which they plan to put the information they are seeking, IR systems need to understand the user and his/her needs in the context of their history and their goals. Gricean maxims and other principles of communication can be incorporated in the user interface of IR systems to facilitate the "conversation" between the user and the IR system.
Those who are interested in learning about a full NLP-based IR system that does incorporate all the levels of language processing described above may want to check out the DR-LINK (Document Retrieval using LINguistic Knowledge) System at
This system was developed to demonstrate the powerful capabilities which NLP has to offer IR.