Bulletin of The American Society for Information Science

Vol. 27, No. 1

October/November 2000


Knowledge Discovery with a Little Perspective
by Melanie J. Norton

Melanie J. Norton is an assistant professor in the School of Library and Information Science at the University of Southern Mississippi, Box 5146, Hattiesburg, MS 39406-5146. She can be reached by e-mail at mjnorton@ocean.otr.usm.edu

The application of knowledge discovery (KD) techniques has in recent years become associated with the creation of computer algorithms designed to reduce data into recognizable if sometimes barren patterns of information to be distilled and explored for various applications. As an emerging technology, KD is being hailed as the ultimate prospector for information value in the vast historical and growing data mines. KD still has many obstacles to overcome, but its potential cannot be denied. As usual with emerging technology, it is necessary to emphasize that great advances do have a price and that there are always issues to be considered for its further development and implementation.

 Knowledge discovery in databases (KDD) refers to a series of processes involved in extracting usable information from any data collection in any format or media. These processes may require significant repetition and modification to appropriately distinguish promising patterns of data from mere experimental or statistical phenomena. Characteristics of data collection, database design and data entry practices lead to the creation of a heterogeneous corpus of potential "data mines." The multiple types of computer and network operating systems, database programs and interfaces, as well as the assorted collection points in any enterprise, contribute to a lack of structural consistency. These inconsistencies may require extensive additional processing as part of the discovery attempt.

 Nonetheless, collections of data have significant potential for yielding a variety of productive commercial or research ventures, ranging from identifying hitherto unnoticed consumer habits in markets, credit use and entertainment to ferreting out hidden intellectual treasures in the interoffice memos of the corporate world. Applications of these processes may extract new value from previously unexplored sources or improve an organization's ability to target and deliver products. Examples of potential uses include the following:

    • The purchasing habits of a clientele in a specific locale could be ascertained from a combination of regional grocery and department store purchasing data collected at the checkout and connected to an individual's credit card or checking method. Such data could be used to monitor spending habits, identify days or times of heavy traffic, identify spending patterns, indicate what items move out of inventory most quickly and target consumers with customized coupons or advertising notices. These types of data in real time could improve staff planning to handle peak periods and allow re-stocking to be scheduled at times least distracting to customers. They could also provide data to regulate other resources in the environment.
    • Historical data collected as a byproduct of computerized student enrollment and record keeping systems could be analyzed to investigate success ratios for students based on data divided by disciplines, grade point average classifications, standardized testing scores, previous school records, age and other discriminating characterizations available through such records.
    • Exploration of health data collections might identify patterns in the incidence of unusual health problems in an area, such as below-average birth weights, cyclic episodes of poisoning or any number of concerns that may not be obvious in smaller data sets. Careful combination of health data collections could provide evidence of pharmaceutical interactions in categories of people or suggest relationships between otherwise disparate literatures, as suggested by Don Swanson's work with ArrowSmith [Library Trends, Summer 1999, 49(1)].
    • Mining the collected weather data now enables the substantiation of long-term global weather trends not previously identifiable, improving weather prediction.
    • Analyzing data captured by system monitoring of networks, electrical grids, telecommunications and similar services might reveal activity patterns instrumental to improving services at peak times and redistributing resources in less used times.
    • Seeking patterns in circulation records or acquisition records of libraries and museums could reveal ways to streamline acquisition or indicate a need to evaluate the collection.
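As a rough illustration of the simplest kind of pattern seeking described in the checkout example above, the following Python sketch counts transactions by hour of day and by item. The records, field layout and function names are invented for illustration and do not come from any real system; an actual KD effort would involve far larger data and more careful analysis.

```python
from collections import Counter
from datetime import datetime

# Hypothetical checkout records: (ISO timestamp, item purchased).
# These values are invented purely for illustration.
TRANSACTIONS = [
    ("2000-10-02T09:15:00", "milk"),
    ("2000-10-02T17:05:00", "bread"),
    ("2000-10-02T17:40:00", "milk"),
    ("2000-10-03T12:10:00", "milk"),
    ("2000-10-03T17:20:00", "coffee"),
]


def busiest_hours(transactions, top_n=1):
    """Return the top_n (hour, count) pairs with the most transactions."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts, _ in transactions)
    return counts.most_common(top_n)


def fastest_movers(transactions, top_n=1):
    """Return the top_n (item, count) pairs that sold most often."""
    counts = Counter(item for _, item in transactions)
    return counts.most_common(top_n)


if __name__ == "__main__":
    print("Busiest hour:", busiest_hours(TRANSACTIONS))
    print("Fastest-moving item:", fastest_movers(TRANSACTIONS))
```

Even a tally this simple depends on every record carrying a consistent, machine-readable timestamp, which is one reason the planning and documentation issues discussed in the rest of this article matter.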

 Using KDD has spawned and will continue to spawn new questions as well as new information value activities, and it must be explored and developed further. Realistically employing KDD systems requires confronting several important obstacles. The primary obstacle in seeking knowledge in any setting is acquiring some notion of what one is pursuing. Certainly that pursuit may be based on intuition, prior research, a wild idea or even in some environments mere curiosity. Open-ended research has always had a valid and important, if often overlooked and underestimated, status. Such open-ended research is often the first step in determining whether there is viable knowledge discovery and recovery possible.

 Much like the preliminary research that any doctoral student should undertake before declaring a dissertation topic, an overview examination should be undertaken of the area of concern and any available related data. Such an initial exploration for KD purposes, whether the new computerized version or the old human labor-intensive method, is critical to any possible success in this endeavor. The increasing ability of computerized KD has made it possible to examine larger and larger databases or data collections. KD tools can also scan huge bodies of data from various perspectives, with a multiplicity of inquiries all seeking to highlight any patterns that might yield a clue to the next step of inquiry or even a recombination of data points for exploitation. In the case of applying computer power to examining databases, the foremost obstacle in locating potentially useful patterns is the possibility, even the likelihood, that the database was not constructed with these extensive discovery processes in mind.

 Beyond the obstacle of purpose there are others. The legacy databases being examined were not constructed with the extensive investigative intent that knowledge discovery in databases may involve. What influenced the construction of early data collections was the limited availability of inexpensive memory technology, ineffectual and inflexible database design, poor documentation of the structures or a very narrow view of the possible larger applications of the data. Databases, even research collections of data, are still not being appropriately constructed to enhance the potential for KD. Part of the underlying issue is the relative youth of the KDD endeavor. Just as algorithms are being created and tested, database designs are still being explored for ways to enhance the collection of information as well as access to it. There is no single method to construct databases that will permit KD techniques, though there are some fairly standard processes to augment the potential.

 The essential considerations are planning at the database construction, implementation and interface design stages as well as documentation, consistency and repeatability. Planning, difficult because it may be impossible to predict how data will be used, is critical. Planning has to include a methodical approach to data collection, with attention to unique requirements across computer and telecommunications systems if the data is to be gathered directly and extensive investment in the user interface if data entry will be conducted by people rather than machines.

 User interface designs must take into account not only the demands of the potential database, but also the organizational culture and identity of the employees as well as their role in the gathering of data or application of the resulting KD. While the structure of the data storage may be invisible to the users, its interface and their understanding of the data will dramatically influence their role in data collection and applications. If data entry is too complex or too mysterious to the user, there will be little effort to be accurate in the process. If the data/information displays of the KD are too complex, or not oriented to the users' mission, the KD results will not be used.

 In planning the data storage itself, taking advantage of every available technological innovation may be critical to extending the life and usefulness of the acquired data. At minimum, planning for multiple and cross access to data fields, documenting the creation and coding of such fields, maintaining consistency in data collection and entry, and keeping clean time dating systems could provide significant assistance when attempting to use KD. The ability to construct complex searches depends on the flexibility of the field structures employed, not just the creativity of the researcher. If only one field in a record is searchable, the manipulation of the data is seriously restricted.
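To make the point about searchable fields concrete, here is a minimal Python sketch using the standard sqlite3 module. The table, its columns and the sample loan records are hypothetical, chosen only to echo the library circulation example; the sketch simply shows that a search combining several fields is possible only when each of those fields is stored separately and can be queried.

```python
import sqlite3

# Hypothetical circulation records: (item_id, subject, patron_type, loan_date).
SAMPLE_LOANS = [
    ("B001", "chemistry", "student", "2000-09-01"),
    ("B002", "chemistry", "faculty", "2000-09-15"),
    ("B003", "history", "student", "2000-10-01"),
    ("B004", "chemistry", "student", "2000-10-05"),
]


def build_circulation_db(loans):
    """Create an in-memory table in which every field is separately searchable."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE loans ("
        "item_id TEXT, subject TEXT, patron_type TEXT, loan_date TEXT)"
    )
    # Index each field that an investigator might later want to search on.
    for field in ("subject", "patron_type", "loan_date"):
        conn.execute(f"CREATE INDEX idx_{field} ON loans ({field})")
    conn.executemany("INSERT INTO loans VALUES (?, ?, ?, ?)", loans)
    return conn


def loans_matching(conn, subject, patron_type, since):
    """Combine three fields in one search; dates are ISO strings, so >= works."""
    rows = conn.execute(
        "SELECT item_id FROM loans "
        "WHERE subject = ? AND patron_type = ? AND loan_date >= ? "
        "ORDER BY item_id",
        (subject, patron_type, since),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_circulation_db(SAMPLE_LOANS)
    print(loans_matching(conn, "chemistry", "student", "2000-10-01"))
```

Had the subject, patron type and date been packed into a single free-text field, the same question would require parsing every record by hand, which is the restriction the paragraph above warns about.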

 As data storage systems are built, detailed documentation of the structures, relationships and programming components, coding methods, abbreviations and authority files should be maintained in a current state. Protocols for data collection and entry should be documented, tested and monitored. Protocols for changing any data structure should be explicit and followed. Changing the use of a field in a data storage body can lead to data loss; for example, all zip codes could be dropped from an address file because of length limitations encountered while adding the four-digit zip extension. Having built-in dating systems with appropriate backup and data integrity testing will help to maintain the quality of the stored data. Entry systems that verify that critical fields contain content, and that the content can be tested for appropriateness, will help to ensure critical data collection.
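One way an entry system might check that critical fields contain content, and that the content is appropriate, is sketched below in Python. The field names, the choice of which fields count as critical and the zip code pattern (five digits with an optional four-digit extension) are assumptions made for illustration, not a description of any particular system.

```python
import re

# Assumed rule: a U.S. zip code is five digits, optionally followed by a
# hyphen and the four-digit extension (for example, 39406-5146).
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

# Hypothetical list of fields this entry form treats as critical.
REQUIRED_FIELDS = ("name", "zip")


def _text(record, field):
    """Return the field as stripped text, treating a missing value as empty."""
    value = record.get(field)
    return "" if value is None else str(value).strip()


def validate_entry(record):
    """Return a list of problems found; an empty list means the entry passes."""
    problems = []
    # First check: every critical field must contain some content.
    for field in REQUIRED_FIELDS:
        if not _text(record, field):
            problems.append(f"missing {field}")
    # Second check: content that is present must be appropriate for the field.
    zip_code = _text(record, "zip")
    if zip_code and not ZIP_PATTERN.match(zip_code):
        problems.append(f"invalid zip: {zip_code!r}")
    return problems


if __name__ == "__main__":
    print(validate_entry({"name": "A. Reader", "zip": "39406-5146"}))
    print(validate_entry({"name": "", "zip": "3940"}))
```

Rejecting or flagging an entry at the moment of collection is far cheaper than trying to repair or infer the value later, when the data is being cleaned for a discovery attempt.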

 Recognizing that it is not possible to identify all the data that might be useful, many organizations are collecting all the data they can. While that might be a plan, it is not sufficient if the documentation and codification methods are not stable and consistent. Data collection needs to be undertaken as if it were a research project from the inception of the collection. All the methodology and underlying contemplation that led to the choice of methodology need to be documented, evaluated and reviewed. If the system designer alone understands why and how a system works, that system becomes dependent on that individual and probably a nightmare inheritance to any successors. When KD is attempted, it will require far more time and investment than would have been necessary had appropriate care been taken with design, documentation and collection.

 Retrospectively, the abundance of existing data collections that may be examined by KD complicates the process because of the lack of good planning and documentation techniques, as well as serious shortcomings in consistency of data entry. The previous and current lack of protocols and check systems permits irrational and undocumented modifications to database structures. Collection integrity may be diminished by a loss of control over data inclusion resulting from user interface failures or from entire system conversions and upgrades that may not include complete data transitions. The same problems that have confronted data collection systems from their inception (memory, user and documentary limitations) will continue to plague the process. In essence, and the KDD community is very conscious of this, everything derived from KDD inquiries should be subjected to significant and extensive external review, research and validation.

 The major obstacle of data structure inconsistency requires extensive cleaning of data, which can corrupt the ultimate result or contribute to spurious findings and misleading patterns. Naturally the KDD community is seeking better algorithms to address the necessary cleaning chores and admits that not all patterns discovered yield any intelligent data or knowledge. KDD is not the end-all of data analysis; it is only a process that may enhance our ability to exploit what data we collect within the limitations of our prior documentary failings and our future neglect of the same. The major obstacle to improving the productivity of KDD may well be what causes the need for it and our own insatiable yearning to find information with value in every deed and thought secured for posterity.

© 2000, American Society for Information Science