Warehousing Structured and Unstructured Data for Data Mining

L.L. Miller and Vasant Honavar
Department of Computer Science
Iowa State University
Ames, IA 50011

Tom Barta
Department of Industrial Engineering
Iowa State University
Ames, IA 50011

Abstract

More data, especially unstructured data, is available to users than ever before, so much that users find it difficult to make use of their data in its raw form. To handle the diversity of data types, we have designed and prototyped a multidatabase/warehouse system that is specifically intended to facilitate the interaction of structured and unstructured data. The system is built on object-oriented views.


This paper presents the main features of the view mechanism, especially as they relate to textual documents. The system is designed to take target documents either from large repositories or from the Web, and we examine the issues raised by each source. We also look at how the view approach supports interaction among data taken from structured (e.g., relational), semistructured (e.g., object-oriented), and unstructured (e.g., text) data sources. The warehouse support provided by the system is briefly examined, and the paper concludes with our approach to data mining and how the system will operate in the complete environment.