Bulletin of The American Society for Information Science

Vol. 27, No.1

October/November 2000


Text Mining
by Elizabeth D. Liddy

Elizabeth D. Liddy is with the Center for Natural Language Processing in the School of Information Studies at Syracuse University.

 Text mining is the process of analyzing naturally occurring text in order to discover and capture semantic information, which is then inserted and stored in what I'll call a Knowledge Organization Structure (KOS). The ultimate goal is to enable knowledge discovery, via either textual or visual access, for use in a wide range of significant applications.

 Text mining is appropriately considered a sub-specialty of the broader domain of Knowledge Discovery from Data (KDD), which in turn can be defined as the computational process of extracting useful information from massive amounts of digital data by mapping low-level data into richer, more abstract forms and by detecting meaningful patterns implicitly present in the data. KDD, which is typically conducted on structured, relational databases, has data mining as one of its sub-tasks. While data mining has become the more popular term, it is in fact only one of the steps within the KDD process. The full KDD process includes data storage and access, data cleansing, pattern detection and extraction, and data interpretation, while data mining refers more narrowly to the particular step of applying specific algorithms for detecting and extracting patterns.

  Text mining has extended the applicability of KDD dramatically by the use of sophisticated natural language processing [www.asis.org/Bulletin/Apr-98/liddy.html]. This means that there is no need to limit KDD to only that information that is available in structured databases; nor do the knowledge bases of interest need to be manually constructed. Given that much of the information of value for mining already resides in naturally occurring texts (or can be elicited as text), NLP provides the necessary techniques for text mining to extract knowledge automatically from these texts.

  Organizations interested in accomplishing knowledge management have begun to realize that a substantial proportion of the knowledge that needs to be exploited and utilized already exists in textual form. A few examples of the range of information typically available within an organization include e-mail from customers, intranet memos and briefings, internal technical reports and patents, and newspaper and news wire stories about competitors and about external views of the organization. Therefore, text mining, the process of analyzing texts to extract information useful both for discovery of patterns and trends and for confirmation of hypotheses, has begun to gain acceptance as a highly desirable technology.

 As predicted by the Gartner Group in 1998, text-mining capabilities have begun to appear in leading information retrieval (IR) products. Beginning with the release of IBM's Intelligent Text Miner in 1998, the bar for IR has been raised, and new IR products are now expected to have at least a clustering capability that will group texts according to similarity of content, if not providing full mining capabilities.
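 To make the idea of grouping texts "according to similarity of content" concrete, here is a minimal sketch. It assumes a plain bag-of-words representation, cosine similarity and a greedy single-pass grouping rule; none of these choices comes from any product named above, and the function names, the threshold of 0.3 and the sample headlines are all invented for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words counts for one text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering (an assumed, simplistic rule):
    each text joins the first cluster whose seed text is similar
    enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [member indices])
    for index, text in enumerate(texts):
        vector = term_vector(text)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


# Made-up headlines, used only to exercise the sketch.
texts = [
    "Acme releases a new text mining product",
    "New text mining product released by Acme",
    "Stock markets fell sharply on Monday",
    "Markets fell again as stock prices dropped",
]
groups = cluster(texts)
print(groups)
```

 A production system would more likely weight terms and use a more robust clustering algorithm; the sketch only shows the basic mechanism of comparing texts by the words they share.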

 While more traditional definitions of IR focus on document retrieval, a more expansive view of the goal of IR is that it should minimize the human resources required to find the necessary information to accomplish a goal by

  • permitting users to convey their needs in the most convenient and expressive mode possible;
  • placing the burden on the system to understand and adapt itself to users' needs; and
  • providing precise results, pre-analyzed by the system and determined to be precisely relevant.

 If a user requires a simple answer, the ideal IR system would provide just that, not a list of potentially relevant documents. If discovery of trends across a document collection is the goal, then an IR system should be able to perform text mining.

 This broader definition of IR recognizes that the information needs of many users, particularly in strategic intelligence, are of such range and sophistication that a truly useful system must go beyond simple retrieval and provide a broad range of information access and analytic capabilities. Text mining can accomplish this goal through reliance on NLP, which consists of a range of computational techniques for analyzing and representing naturally occurring texts at all levels of linguistic analysis in order to achieve human-like language processing capable of supporting this kind of analysis.

  A fully featured Information Access & Analytic System is one that combines both IR and text mining capabilities. Such a technology would

  • detect the specific sources that contain information worth mining;
  • recognize and extract meaningful entities that convey valuable knowledge;
  • produce a semantic interpretation of the information;
  • store the semantically interpreted information in an efficient data structure; and
  • provide means for easy access and utilization of this knowledge base for new insights or for utilization in decision-making tasks.

 Potential uses of such a text mining capability include those essential to strategic or competitive intelligence. For example, text mining would enable a company to process public news sources, extracting information about its competitors' responses to events. This capability would then enable a strategic analyst within the company to construct a model of how each competitor reacts to specific stimuli and thereby enable the company to predict how each competitor would react in a similar new situation. As a next step, it would enable a strategic analyst to characterize what a "generic" competitor would look like. Based on daily tracking of the company's known competitors, the analyst could identify the characteristics, actions and events that define a generic competitor for that company. The resulting model could be used as a daily profiler and extractor to mine news feeds and recognize new competitors.
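 As a rough illustration of the competitor-response model described above, the following sketch assumes that text mining has already produced records of the form (competitor, event, response). It then predicts a reaction simply as the most frequent past response to the same kind of event. The record format, the frequency-based rule, the function names and all the data are assumptions made for illustration, not a description of any actual system.

```python
from collections import Counter, defaultdict

# Made-up extracted records: (competitor, triggering event, observed response).
observations = [
    ("Acme", "price cut", "matched price"),
    ("Acme", "price cut", "matched price"),
    ("Acme", "price cut", "launched promotion"),
    ("Acme", "new product", "announced rival product"),
    ("Globex", "price cut", "no response"),
]


def build_model(records):
    """Count observed responses for each (competitor, event) pairing."""
    model = defaultdict(Counter)
    for competitor, event, response in records:
        model[(competitor, event)][response] += 1
    return model


def predict(model, competitor, event):
    """Most frequent past response, or None if the pairing was never seen."""
    counts = model.get((competitor, event))
    return counts.most_common(1)[0][0] if counts else None


model = build_model(observations)
print(predict(model, "Acme", "price cut"))
print(predict(model, "Globex", "new product"))
```

 A real analyst's model would weigh far more context than raw frequency, but the sketch shows how extracted events become structured evidence that can be queried.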

 The text mining process consists of the same three stages that constitute the more traditional KDD process, namely data preparation, data processing and data analysis.

Text preparation: the selection, cleansing and preprocessing of text. In this stage selection of sites or sources for text mining would occur, usually under the guidance of a human expert or a well-trained software agent, and early text preprocessing, such as sentence/paragraph identification and part-of-speech tagging, would take place.
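The sentence/paragraph identification mentioned above can be sketched with two simple rules. This is a deliberately naive illustration: the regular expressions, function names and sample text are assumptions, and part-of-speech tagging is left out because it requires a trained tagger.

```python
import re


def split_paragraphs(text):
    """Treat one or more blank lines as a paragraph boundary."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by
    whitespace and an uppercase letter. An abbreviation followed by a
    capitalized word (e.g. "Dr. Smith") would be split incorrectly."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    return [s.strip() for s in parts if s.strip()]


# Made-up sample text.
raw = """Acme Corp. announced a merger on Monday. Analysts were surprised.

Shares rose 4% in early trading! Will rivals respond?"""

prepared = [split_sentences(p) for p in split_paragraphs(raw)]
for paragraph in prepared:
    print(paragraph)
```

The output is a list of paragraphs, each a list of sentences, which is the kind of structure later processing stages can work from.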

Text processing: the use of a data-mining algorithm to process the prepared data, compressing and transforming it to identify latent nuggets of information. At this stage, a fully featured NLP system would determine canonical and variant identities of entities (people, companies, organizations, etc.), identify conceptual relations between entities, and even instantiate particular frames of interest. Slot-filling of participants, dates and outcomes, as well as tables of extracted entities and relations, provides meaningful features for standard algorithms and techniques such as decision trees, neural networks, case-based learning, association rules or genetic algorithms.
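A small sketch may help show how canonical and variant identities turn text into features for standard algorithms. It assumes a hand-written variant table, whereas a fully featured NLP system would derive these identities itself, and it counts how often pairs of canonical entities share a sentence, which is the kind of input an association-rule step might use. The table, the function names and the sentences are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical variant-to-canonical table, written by hand for the sketch.
CANONICAL = {
    "acme corporation": "Acme Corporation",
    "acme corp": "Acme Corporation",
    "acme": "Acme Corporation",
    "globex inc": "Globex Inc.",
    "globex": "Globex Inc.",
}


def find_entities(sentence):
    """Return the canonical entities whose variants occur as whole words."""
    lowered = sentence.lower()
    return {
        canon
        for variant, canon in CANONICAL.items()
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered)
    }


def cooccurrences(sentences):
    """Count how often each pair of canonical entities shares a sentence."""
    pairs = Counter()
    for sentence in sentences:
        entities = sorted(find_entities(sentence))
        pairs.update(combinations(entities, 2))
    return pairs


# Made-up sentences.
sentences = [
    "Acme Corp signed a supply agreement with Globex.",
    "Globex Inc and Acme expanded the project.",
    "Acme Corporation released a new product.",
]
pair_counts = cooccurrences(sentences)
print(pair_counts)
```

Because the three surface forms of each company map to one canonical name, the first two sentences count as evidence about the same pair of entities.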

Text analysis: the evaluation of the output to see if knowledge was discovered and to determine its importance. Having run the algorithms, the mined text/data is submitted to various techniques that will enable direct usage of the mined information, either by a Link Discovery tool or by visualization in a tool that will enable human analysts to complete the analysis begun by the text mining technology.
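One simple reading of link discovery is finding a chain of extracted relations that connects two entities. The sketch below assumes the relations are available as (entity, relation, entity) triples and uses a breadth-first search over an undirected graph. The triples, the function names and the choice of breadth-first search are illustrative assumptions, not a description of any particular Link Discovery tool.

```python
from collections import defaultdict, deque

# Made-up extracted relations: (entity, relation, entity).
relations = [
    ("Acme", "acquired", "Initech"),
    ("Initech", "supplies", "Globex"),
    ("Globex", "partnered with", "Umbrella"),
]


def build_graph(triples):
    """Undirected adjacency sets; relation labels are ignored here."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def shortest_link(graph, start, goal):
    """Breadth-first search for the shortest chain of entities."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain connects the two entities


graph = build_graph(relations)
chain = shortest_link(graph, "Acme", "Globex")
print(chain)
```

A chain such as this one is a candidate finding only; it is the human analyst who judges whether the connection is meaningful.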

 These three stages must be accomplished in a thoughtful manner, with appropriate attention paid to the goals of the particular text-mining task, the limitations of the data/text being mined and the strengths and weaknesses of the particular algorithm selected for the task. If these conditions are met, experience has shown that both confirming and disconfirming information will be discovered. Some quite unexpected "ahas" will also result, which is the goal of text mining, data mining and all KDD.


© 2000, American Society for Information Science