Bulletin of the American Society for Information Science

Vol. 27, No. 1

October/November 2000


Systematic Knowledge Management and Knowledge Discovery

by Igor Jurisica

Igor Jurisica is with the Ontario Cancer Institute/Princess Margaret Hospital, University Health Network, 610 University Avenue, Rm. 10-303, Toronto, Ontario M5G 2M9; e-mail: jurisica@fis.utoronto.ca

The volume and complexity of relevant information are ever increasing, and we need to handle that information effectively. If we fail, we will not find what we need when we need it. To support this process, it is beneficial to extend our notion of information systems. In general, data are values for observable, measurable or calculable attributes. Data in context is information. Knowledge is validated and actionable information. The trend is to support knowledge management and facilitate knowledge sharing by building computerized information systems with richer and more flexible representation structures, supplemented by new services such as cooperative query processing, similarity-based retrieval and knowledge discovery. This trend also includes support for knowledge representation schema evolution and for the integration and coexistence of unstructured, semi-structured, structured and hyper-structured information.

Although many approaches to knowledge organization are available, it is a challenge to organize evolving domains, since relying only on humans to create relationships among individual knowledge sources is not sufficient. It is not scalable, and it may be subjective. In order to support systematic knowledge management we need to complement traditional knowledge management techniques with approaches that automate parts of the process. Tools for information quality control help us to find missing, unexpected, incorrect or incomplete information. In addition, introducing knowledge-discovery systems helps to automate organizing, utilizing and evolving large knowledge repositories.

Research in the area of data warehouses and organizational and business knowledge management has generated important results. However, there are several reasons why traditional techniques for managing information are inadequate for knowledge management. Knowledge management systems support representation, organization, acquisition, creation, usage and evolution of knowledge in its many forms. But knowledge is complex, and we want to support knowledge management in many domains that are characterized by complex data and information, many unknowns, lack of complete theories and rapid knowledge evolution as a result of scientific discoveries. Human experts also need to be considered. When the theory is lacking, much of the reasoning process is based on experience. Experts remember positive cases for possible reuse of solutions and can recall negative experiences for avoiding potentially unsuccessful results. Thus, storing and reasoning with experiences may facilitate efficient and effective knowledge management in highly evolving domains.

Knowledge Management Systems

In order to support efficient and effective knowledge management, we must organize computer-represented knowledge into structures that are semantically meaningful and computationally efficient. The meaning of information is captured by conceptual information models, which offer semantic terms for modeling applications and structuring information. In general, the models comprise the following:

Primitive concepts that describe an application, e.g., entity, activity, agent and goal;

Abstraction mechanisms that are used to organize the information, e.g., generalization, aggregation and classification; and

Operations that can access, update and process information, i.e., they provide knowledge management operations.

Defining a conceptual model requires making assumptions about the application to be modeled. For example, if the application consists of interrelated entities, such as patient, ailment and treatment, then we need to include terms such as entity and relationship in the conceptual model. In addition, the semantics of these concepts and their relationships are used to define knowledge management operations, such as navigation, search, retrieval, update and reasoning. The set of concepts selected for modeling the world in which the system must support these operations is referred to as an ontology. We can identify four broad groups of ontologies: static, dynamic, intentional and social.
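
To make these notions concrete, here is a minimal, hypothetical Python sketch of entity and relationship primitives together with one knowledge management operation (navigation); the class and attribute names are illustrative and not drawn from any particular system described in this article.

```python
# Hypothetical sketch of conceptual-model primitives: entities, relationships
# and a simple navigation operation. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                      # e.g., "patient", "ailment", "treatment"
    attributes: dict = field(default_factory=dict)

@dataclass
class Relationship:
    label: str                     # e.g., "diagnosedWith", "treatedBy"
    source: Entity
    target: Entity

class ConceptualModel:
    """Holds primitive concepts and supports simple navigation."""
    def __init__(self):
        self.entities, self.relationships = {}, []

    def add_entity(self, entity):
        self.entities[entity.name] = entity

    def relate(self, label, source_name, target_name):
        self.relationships.append(
            Relationship(label, self.entities[source_name], self.entities[target_name]))

    def neighbours(self, entity_name):
        """Navigation operation: entities directly related to a given one."""
        return [r.target.name for r in self.relationships if r.source.name == entity_name]

model = ConceptualModel()
for name in ("patient", "ailment", "treatment"):
    model.add_entity(Entity(name))
model.relate("diagnosedWith", "patient", "ailment")
model.relate("treatedBy", "ailment", "treatment")
print(model.neighbours("patient"))   # ['ailment']
```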

Abstraction mechanisms support the organization of knowledge and its effective use. As a result, knowledge management operations can be performed more efficiently. Abstraction involves suppression of irrelevant detail. Relevance depends on the task and the use of the information, and thus it changes with the context. There are six main abstraction mechanisms: classification, generalization, aggregation, contextualization, materialization and normalization.

Knowledge management support in complex and dynamic domains benefits from extending traditional approaches with automated methods. The next section describes knowledge-discovery techniques that are useful in determining conceptual models and can help with information system optimization and domain evolution.

Knowledge Discovery

The process of finding useful patterns in data has been referred to as data mining, knowledge extraction, information discovery, information harvesting, data archaeology and data pattern processing. The phrase knowledge discovery was coined at the first Knowledge Discovery in Databases workshop in 1989. Although it has goals similar to those of data mining, knowledge discovery emphasizes the end product of the process, which is knowledge. Thus, discovered patterns should be novel and expressed in a form that human users can understand and use. Knowledge discovery (KD) usually employs statistical data analysis methods, but also methods from pattern recognition and artificial intelligence (AI). Database management systems (DBMS) ensure that the process is scalable (see Figure 1).

Discovered knowledge has three important facets: form, representation and degree of certainty. A pattern is a statement that describes relationships among a subset of facts with a particular degree of certainty. Such patterns include clusters, sequences and association rules. To be potentially useful, a pattern must be simpler than an enumeration of all the facts in the subset being considered. In addition, since a discovered pattern is usually not true across all the data, a certainty factor is required to determine how much confidence to place in a discovery. The certainty involves the integrity of the data, the size of the sample on which the discovery was performed and the degree of support from available domain knowledge. Only interesting patterns are considered to be knowledge. Interesting patterns are nontrivial to compute, novel and useful.
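
As a rough illustration of a pattern with an attached certainty factor, the following sketch computes the support and confidence of a single association rule over a toy set of transactions; the data are invented for the example and the measures shown are only two of the many possible certainty estimates.

```python
# Illustrative only: support and confidence for one association rule
# ("if A appears, B appears") over made-up transactions.
transactions = [
    {"aspirin", "bandage"},
    {"aspirin", "bandage", "thermometer"},
    {"aspirin"},
    {"bandage"},
]

def rule_stats(antecedent, consequent, transactions):
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                             # how general the pattern is
    confidence = both / ante if ante else 0.0      # certainty of the rule
    return support, confidence

print(rule_stats({"aspirin"}, {"bandage"}, transactions))   # (0.5, 0.666...)
```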

The main uses of discovered knowledge are hypothesis formulation and verification; building models for forecasting, planning and prediction; decision-rule discovery; information cleaning; identifying outliers; information organization; and structure determination. However, as we move from manufacturing and marketing domains into the sciences, we need to apply mining algorithms not only to data, but to knowledge as well. As a result, we aim not only at discovering patterns, but also at supporting knowledge management and evolution. In addition, discovered knowledge can be used for the optimization of information systems.

Knowledge Discovery Process

There are two groups of knowledge discovery algorithms: quantitative discovery, which locates relationships among numeric values, and qualitative discovery, which finds logical relationships among values. In many domains, including the biological and medical, qualitative approaches must incorporate symbolic attribute values (such as full text, logical values, taxonomic values or symbols) as well as visual attributes from still images (such as shape, color or morphometry of identified objects). The discovery process involves two main steps: identification and description of patterns. In general, several additional steps are involved, from selecting, preprocessing and transforming the data set, through using diverse data-mining algorithms, to visualizing the patterns (see Figure 2).
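
The following schematic sketch shows how these steps might be chained; the function bodies are placeholders, and any real system would substitute its own selection, cleaning, transformation, mining and visualization code.

```python
# Placeholder pipeline for the steps named above: select, preprocess,
# transform, mine, visualize. All logic here is a stand-in.
def select(source):
    return [row for row in source if row is not None]           # choose the data set

def preprocess(rows):
    return [r for r in rows if all(v is not None for v in r)]   # drop incomplete records

def transform(rows):
    return [tuple(float(v) for v in r) for r in rows]           # normalize representation

def mine(rows):
    # stand-in for any data-mining algorithm; here: records whose values agree
    return {r for r in rows if r[0] == r[1]}

def visualize(patterns):
    for p in patterns:
        print("pattern:", p)

raw = [(1, 1), (2, 3), None, (4, 4), (5, None)]
visualize(mine(transform(preprocess(select(raw)))))
```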

Identification of patterns is a process of clustering records into subclasses that reflect the patterns (sometimes called unsupervised learning). There are two basic methods: numeric and conceptual.

Traditional numeric methods use cluster analysis and mathematical taxonomy. These algorithms produce classes that maximize similarity within classes and minimize similarity between classes. Although many different similarity measures have been proposed, the most widely used metric is the Euclidean distance between numeric attributes. The similarity of two objects can also be represented as a number, the value of a similarity function applied to the symbolic descriptions of the objects. However, such similarity measures are context free, and these methods fail to capture the properties of a cluster as a whole, properties that are not derivable from the properties of individual entities. In addition, they cannot use background knowledge, such as likely shapes of clusters. As a result, these algorithms work well on numeric data only.
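
A minimal sketch of such a context-free numeric method, assuming a toy k-means procedure with Euclidean distance, is shown below; it considers only attribute values, with no background knowledge, and is not any specific algorithm discussed in the article.

```python
# Toy k-means with Euclidean distance: assign points to the nearest centroid,
# recompute centroids, repeat. Context-free in the sense criticized above.
import math, random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: euclidean(p, centroids[i]))].append(p)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

data = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9)]
print(kmeans(data, 2))   # two clusters of nearby points
```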

One useful application is the identification of "outliers," that is, data outside the normal region of interest or input. In general, outliers can be unusual but correct, or unusual and incorrect. They can be detected using histograms and then removed by threshold filters, or identified by calculating the mean and standard deviation and removed by specifying a "window," for example, two standard deviations from the mean.
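
A minimal sketch of this "window" approach, assuming a width of two standard deviations and invented readings, might look like this:

```python
# Flag values more than two standard deviations from the mean as outliers.
import statistics

def outliers(values, width=2.0):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    low, high = mean - width * std, mean + width * std
    return [v for v in values if not (low <= v <= high)]

readings = [10, 11, 9, 10, 12, 11, 10, 48]   # 48 is unusual; it may still be correct
print(outliers(readings))                    # [48]
```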

Conceptual clustering methods determine clusters not only by attribute similarity but also by conceptual cohesiveness, as defined by background information. Inductive learning is a process of generalizing specific facts or observations. It is a basic strategy by which one can acquire and organize knowledge. For example, generalization can be used to organize existing concepts into taxonomies, referred to as "isA" or generalization hierarchies. For instance, ASIS isA professional society. The most general concept is the root of the hierarchy. The inheritance inference rule states that attributes and properties of a class are also attributes and properties of its specializations. This process can be exploited further by using two main forms of inductive learning: instance-to-class induction and clustering.

During instance-to-class induction the learning system is presented with independent instances representing a class, and the task is to induce a general description of the class. This process can result in introducing either a new entity-class or a super-class relationship. If two subclasses share the same properties, then it is possible to introduce a new super-class as a generalization of the subclasses. If the subclasses of a generalization are associated with an entity-class, then the entity-class can be moved to the super-class. For example, after learning that student, faculty and staff isA person, and that faculty and staff have salary while student has scholarship, a new entity-class, employee isA person, may be introduced. Thus, faculty and staff isA employee, while student and employee isA person.
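
A toy version of this induction step, with invented class and property names, is sketched below: if two subclasses share a property, a new super-class carrying the shared property is introduced.

```python
# Toy super-class induction from shared properties (names are illustrative).
subclasses = {
    "student": {"scholarship"},
    "faculty": {"salary"},
    "staff":   {"salary"},
}

def induce_superclass(subclasses, new_name):
    shared = set.intersection(*(props for props in subclasses.values()))
    if not shared:
        return None
    members = [name for name, props in subclasses.items() if shared <= props]
    return {"class": new_name, "properties": shared, "isA_members": members}

print(induce_superclass({k: subclasses[k] for k in ("faculty", "staff")}, "employee"))
# {'class': 'employee', 'properties': {'salary'}, 'isA_members': ['faculty', 'staff']}
```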

Clustering problems arise when several objects or situations are presented to a learner, and the task is to find classes that group the objects. For both numeric and conceptual clustering, there are two main approaches: flat clustering, in which there is only one level of clusters, and hierarchical clustering, in which nested clusters are created. The clustering process can progress either top-down or bottom-up. The top-down approach starts with all items as members of one cluster and then removes outliers into new clusters. The bottom-up approach starts with no items assigned to clusters and then groups similar items into clusters. The process can be made interactive by combining domain knowledge with the user's knowledge and visual skills to solve more complex problems. Interactive clustering algorithms prove to be useful, especially in evolving domains.
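
As one sketch of the bottom-up strategy, the following toy single-linkage agglomerative procedure starts with each item in its own cluster and repeatedly merges the two closest clusters, recording the resulting hierarchy as nested tuples; it is an illustration, not a specific algorithm from the article.

```python
# Toy bottom-up (agglomerative) hierarchical clustering with single linkage.
import math

def dist(a, b):
    return math.dist(a, b)          # Euclidean distance (Python 3.8+)

def leaves(cluster):
    return [cluster] if isinstance(cluster, int) else leaves(cluster[0]) + leaves(cluster[1])

def single_linkage(c1, c2, points):
    return min(dist(points[i], points[j]) for i in leaves(c1) for j in leaves(c2))

def agglomerate(points):
    clusters = list(range(len(points)))          # each item starts in its own cluster
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points))
        merged = (clusters[i], clusters[j])      # record the hierarchy as nested tuples
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

print(agglomerate([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]))   # ((0, 1), (2, 3))
```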

Description of patterns is achieved by summarizing relevant qualities of the identified classes (a process also referred to as supervised learning). The task is to determine commonalities or differences among class members by deriving descriptions of the classes. There are two main streams: empirical algorithms, which detect statistical regularity of patterns, and knowledge-based algorithms (explanation-based learning), which incorporate domain knowledge to explain class membership. For example, aggregation views patterns as aggregates of their components. In turn, components of a pattern might themselves be aggregates of other, simpler components.
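
A simple empirical sketch of class description, summarizing each labeled class by the ranges of its numeric attributes, is shown below; the data are invented, and real systems would typically use rule induction or decision trees instead.

```python
# Summarize each labeled class by the range of its numeric attributes.
from collections import defaultdict

examples = [                          # (attribute values, class label) - toy data
    ((36.6, 70), "healthy"),
    ((36.9, 75), "healthy"),
    ((39.2, 95), "feverish"),
    ((38.8, 102), "feverish"),
]

def describe_classes(examples):
    by_class = defaultdict(list)
    for attrs, label in examples:
        by_class[label].append(attrs)
    return {
        label: [(min(col), max(col)) for col in zip(*members)]
        for label, members in by_class.items()
    }

print(describe_classes(examples))
# {'healthy': [(36.6, 36.9), (70, 75)], 'feverish': [(38.8, 39.2), (95, 102)]}
```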

Identified and described patterns must also include measures of their quality:

  • Certainty, measuring the extent to which we put our trust in the pattern;
  • Interestingness, specifying how novel, useful, nontrivial and understandable the discovered pattern is;
  • Generality, expressing the strength of a pattern in terms of the size of the subset of objects described by the pattern; and
  • Efficiency, measuring the time required to compute the pattern and space needed to store it.
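
As a rough illustration only, the sketch below attaches some of these measures to a single pattern; the formulas are simplifications invented for the example, not the measures used by any particular system.

```python
# Toy quality measures for one pattern: generality as coverage of the data
# set, certainty as the fraction of covered objects satisfying the conclusion,
# and efficiency as the time taken to evaluate the pattern.
import time

def evaluate_pattern(matches_condition, matches_conclusion, objects):
    start = time.perf_counter()
    covered = [o for o in objects if matches_condition(o)]
    correct = [o for o in covered if matches_conclusion(o)]
    elapsed = time.perf_counter() - start
    return {
        "generality": len(covered) / len(objects),
        "certainty": len(correct) / len(covered) if covered else 0.0,
        "efficiency_seconds": elapsed,
    }

ages = [12, 25, 31, 47, 52, 68, 70, 81]
print(evaluate_pattern(lambda a: a > 40, lambda a: a > 60, ages))
```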

Applications of Knowledge Discovery

There are numerous applications of knowledge discovery. Here, we will focus only on a few that are related to information systems and knowledge management and thus manipulate symbols, text and images. Symbolic knowledge discovery can be used for three main tasks: system optimization, knowledge evolution and evidence creation. For example, image-feature extraction algorithms can automatically and objectively annotate and describe images, supporting knowledge evolution. Text-mining applications focus on improving access to and organization of content, in this case full text; they are fundamental to digital libraries and the Internet and support both system optimization and knowledge evolution.

Three Basic Tasks

Using KD for system optimization involves developing a knowledge representation that is effective for a given context and task and then organizing the knowledge sources (or documents) into context-based clusters. The importance of individual descriptors (knowledge properties) may change with the context or task, and thus different clusters may be created from the same database depending on requirements. During navigation and reasoning, relevant knowledge sources are located in physical proximity as a result of the clustering. This grouping results in more efficient access. In addition, context-dependent clusters provide useful meta-information about the content organization. Browsing and analyzing the context-dependent clusters can therefore help users comprehend the structure and contents of the repository.

Knowledge evolution involves adding descriptors to assist knowledge discrimination during prediction and classification, removing redundant knowledge and descriptors, creating hierarchies of descriptors and their values, and finding associations. This process is useful in creating hierarchies of, and relationships among, knowledge sources. Creating hierarchies of descriptors and their values enhances knowledge organization and thus improves system performance and domain comprehensibility.

Evidence-based reasoning is achieved by analyzing created clusters, hierarchies and associations of knowledge sources to identify underlying principles in the domain. Evidence could be a rule discovered by generalization from individual experiences. Although some rules may be obtained from a human expert, deriving them by analyzing an existing knowledge repository provides an explanation of the rule and justification of its validity. Furthermore, over time the support for the rule may change, and it can be automatically removed or modified to reflect the change in the knowledge repository. Thus, this implements a systematic process of knowledge compilation to support evidence-based reasoning.
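
A hypothetical sketch of this maintenance step follows: the support of a stored rule is re-derived from the current repository and the rule is retired when its support falls below a threshold. The rule structure, data and threshold are all invented for the example.

```python
# Toy rule maintenance: refresh each rule's support from current cases and
# retire rules whose support drops below a (made-up) threshold.
def rule_support(rule, cases):
    applicable = [c for c in cases if rule["condition"](c)]
    if not applicable:
        return 0.0
    return sum(1 for c in applicable if rule["outcome"](c)) / len(applicable)

def maintain(rules, cases, min_support=0.7):
    kept = []
    for rule in rules:
        support = rule_support(rule, cases)
        if support >= min_support:
            kept.append({**rule, "support": support})   # keep, with refreshed evidence
        # otherwise the rule is retired (or flagged for revision)
    return kept

cases = [{"treatment": "A", "improved": True},
         {"treatment": "A", "improved": True},
         {"treatment": "A", "improved": False}]
rules = [{"name": "A helps",
          "condition": lambda c: c["treatment"] == "A",
          "outcome":   lambda c: c["improved"]}]
print(maintain(rules, cases))    # support 0.67 < 0.7, so the rule is retired
```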

Text and Web Mining

The effectiveness of a retrieval system depends upon the richness of its representation formalism and its query subsystem. It is relatively easy to find relevant information in library catalogues. However, it is difficult to retrieve relevant information from the Internet with both high recall and precision. One of the difficulties in searching through, organizing and using unstructured or semi-structured text, such as documents on the Internet, is that we do not always have the necessary metadata available. In addition, the information content is distributed and highly dynamic. Text mining can be used to automatically find, extract, filter, categorize and evaluate the desired information and resources, as Elizabeth Liddy discusses elsewhere in this issue. Tools for text analysis are used to recognize significant vocabulary items and to uncover relationships among many controlled vocabularies by creating a meta-thesaurus. They can also recognize all names referring to a single entity and find multi-word terms that have a meaning of their own, as well as abbreviations in a text linked to their full forms. Text analysis tools automatically assign documents to preexisting categories and detect document clusters. The text analysis process can change a document from unstructured to highly structured by generating new metadata and organizing it.
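
One possible realization of automatic document clustering, not the specific tools referred to above, is to weight terms with TF-IDF and cluster the resulting vectors; the sketch below assumes scikit-learn is available and uses invented document snippets.

```python
# Toy document clustering: TF-IDF term weighting followed by k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "gene expression in cancer cells",
    "tumor suppressor gene mutation",
    "library catalogue metadata standards",
    "controlled vocabulary and thesaurus construction",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for doc, label in zip(documents, labels):
    print(label, doc)     # documents with similar vocabulary share a cluster label
```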

Web mining is the discovery and analysis of useful information from the World Wide Web. Text mining applies to Web content mining, that is, the automatic retrieval, filtering and categorization of information and resources. The other aspect of Web mining is Web usage mining, the discovery and analysis of user access patterns, as Bernard Jansen and Amanda Spink report.

Image-Feature Discovery

There are numerous applications where images are a required part of a knowledge repository. Automatic image indexing is used to make complex visual image comparison algorithms scalable to large image databases. In order to support operations on images, one has to have an ontology that describes them. Feature-based retrieval methods use image properties, such as color, shape or texture, to access relevant images. Because these features describe the content of an image, such techniques are also referred to as content-based retrieval. Current systems focus on using either a human expert's abstractions or simple image characteristics (e.g., color, shape or shade). This approach has only limited applicability: expert abstractions cannot be generated manually for large image databases, and simple image properties do not capture the semantics of the image content.
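
As a small sketch of one such content-based feature, the code below computes a coarse, normalized color histogram that could serve as part of an image index; the file name is a placeholder, and Pillow and NumPy are assumed to be available.

```python
# Coarse color histogram as a content-based image feature (illustrative only).
import numpy as np
from PIL import Image

def color_histogram(path, bins_per_channel=4):
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    histogram, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                                  range=((0, 256),) * 3)
    return (histogram / pixels.shape[0]).flatten()   # normalized feature vector

# Similar images tend to have similar histograms, so a simple distance between
# these vectors can already support feature-based (content-based) retrieval.
features = color_histogram("example.png")            # placeholder path
print(features.shape)                                 # (64,) for 4 bins per channel
```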

The complexity of the image segmentation and analysis tasks prevents the application of generic, fully automatic feature extraction techniques to large databases. However, in domains with limited scope automated techniques can be applied, especially when a domain expert controls and guides the process. Recognized features can then be used to support retrieval, reasoning and knowledge discovery. During retrieval, the user can describe image properties to the system in several ways: by selecting an example (query-by-example retrieval), by providing drawings or sketches or by selecting colors and textures from menus. Although humans can analyze images more flexibly, image processing techniques make the process more objective and precise. Extracted image properties (morphological information) can then complement other knowledge represented in the information system. For example, a combination of image analysis techniques and a decision-support system can serve as

  • A feature-extraction technique, which enables us to use traditional information retrieval algorithms for fast image access;
  • An indexing approach, which makes content-based image retrieval scalable by organizing images into hierarchies; and
  • An analysis tool, which brings additional insight into relations among images and between image and symbolic features. Extracted features can be used to identify image metadata and prototypical features that correlate with important clusters.

Conclusions

After the initial business success of data mining and knowledge discovery techniques, we can focus on their integration into information systems. We need robust approaches to deal with missing and noisy information. Algorithms should be flexible enough to be applicable in diverse tasks. Although algorithmic efficiency is required to cope with large and complex information repositories, the output of the analysis must also be understandable and actionable by users. It is essential that the process be interactive.



© 2000, American Society for Information Science