GCDIS-Assisted Search for Knowledge (ASK)

by Roberta Y. Rand

The United States Global Change Research Program (USGCRP), an interagency project within the federal government, is charged with observing, understanding and predicting global change and making the group's results available for use in policy matters. In the Global Change Data and Information Management Program Plan, published in 1992, the participating federal agencies agreed to do the following:

The Global Change Data and Information System Implementation Plan builds on the Program Plan to define a Global Change Data and Information System (GCDIS), calling on the participating agencies to do the following in order to achieve a better understanding of the environment:

To start this process, the agencies will implement several pilot projects that are intended to broaden the scope of the GCDIS. One of the projects calls for the development of a prototype open and extensible assisted search for knowledge (ASK) system to demonstrate the GCDIS mission by using software that is relevant, currently available off the shelf, based on a client/server architecture and adherent to common standards.

Among the elements to be addressed in the pilot project are a symbiotic linking of the services of existing heterogeneous data and information resources in a distributed evolutionary environment; accommodation of the individual needs of all users by providing access at multiple skill levels and via multiple paths, including offering alternatives when computer technology is not readily available; and provision of extensive knowledge bases of electronically published dictionaries and thesauri to allow users to query the system using their own language for more precise and accurate retrieval.

Overview of the ASK System

A fairly complex system architecture is required for the ASK system because the data and information provided by the many participating agencies are diverse in content and format; the audiences needing access to this data and information vary widely in their ability to use search tools effectively and to access data and information; and international access will require an intelligent interpreter that can assist both experienced and inexperienced users in navigating the databases and can assist data sources or providers in making their data easily accessible, regardless of format and location.

The ASK system will contain three modules: a client module, an assisted search module and a data collection module.

Client Module

The client or user portion of the system will consist of a personal computer or workstation running a graphical user interface (GUI) that gives users of all levels (researchers, policy makers, K-12 students and experts) easy and accurate access to the global change data and information. Users will be able to access a GUI matched to their level of expertise. Multiple GUIs (such as Windows, Motif, Macintosh) will be made available in order to support the multiple platforms employed by users.

Users will have several search options. They will be able to search local data (present on their PC or workstation), as well as remote databases connected via dial-up or the Internet. Expert searchers who know where the data they want resides can query the appropriate server directly, without assistance. Access to a local user knowledge base (in addition to the primary knowledge base in the assisted search module) will allow users to query local data using their own unique set of terms. Finally, the GUIs will allow access to general reference information available on a local CD-ROM.

A Query Scheduler in the client portion of the system will enable searches to be performed in real time or on a delayed basis, with the results routed to the user.

User interfaces will be made available to all clients (in order to accommodate all Internet users) and will provide functionality that is consistent across platforms. Additionally, a character-based interface will be available in the assisted search module for use over the Internet with any VT100-style terminal or terminal emulator.

Assisted Search Module

The Assisted Search Module will be made up of multiple components. This will be the main system (or systems) for assisting and executing searches and collecting results. Multiple assisted search modules may be required, depending on user load, throughput considerations, network size and number of databases. The assisted search modules will handle a hierarchically structured guide that offers multiple points of access for identifying relevant data and information sets and that includes inventory information and library and information center resources. This module will be dedicated to the assisted search process; any user query in the assisted mode will be processed by this module. The multiple components of the assisted search module are discussed below.

NL Search Engine. The natural language (NL) search engine is the heart of the system. To leverage the latest developments in computer technology, the NL search engine combines the best of both statistical and concept-based searching. Statistical searching measures criteria such as the number of occurrences of search terms or the proximity of those terms to one another within a document. Concept-based searching enhances the process by adding user knowledge and expertise; users can choose concept definitions from a knowledge base, improving accuracy through additional contextual evidence.
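As a rough illustration of statistical searching, the following Python sketch scores a document by counting occurrences of the query terms and adding a bonus when different query terms appear close together. The scoring rule, the function names and the window size are assumptions made for this example; the article does not specify how the NL search engine weighs these criteria.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def statistical_score(query, document, window=5):
    """Score a document by term occurrences plus a proximity bonus.

    Hypothetical rule (not taken from the ASK design): one point for each
    occurrence of a query term, plus one point for each pair of different
    query terms that lie within `window` tokens of each other.
    """
    terms = set(tokenize(query))
    tokens = tokenize(document)
    # Positions of every query-term occurrence, in document order.
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in terms]
    occurrences = len(positions)
    proximity = 0
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            (i, first), (j, second) = positions[a], positions[b]
            if j - i > window:
                break  # later occurrences are even farther away
            if first != second:
                proximity += 1
    return occurrences + proximity


near = "Ozone depletion over Antarctica; ozone levels fell."
far = "Ozone is a gas. Many unrelated words follow here before depletion appears."
print(statistical_score("ozone depletion", near))  # 5
print(statistical_score("ozone depletion", far))   # 2
```

The first document scores higher both because the terms occur more often and because they occur near each other, which are the two kinds of evidence the statistical approach relies on.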

In order to eliminate the costly, labor-intensive process of manually building and maintaining the knowledge bases required by concept-based searching, the prototype system will contain a built-in knowledge base of information from published lexical sources (such as dictionaries and thesauri). It will also use advanced linguistic techniques, such as morphological analysis, and natural language processing techniques in order to identify concepts to be used for searching.

Users will be able to enter search requests as Boolean queries, as plain English queries or by geographical parameters. The search engine will be able to enhance the search request with related terms and concepts, which may be refined by the user prior to search execution.

The search engine will include a set of document loading tools used to build compact indexes to the documents. These will enable structured or unstructured documents to be retrieved at high speed and displayed in their native formats.

Extendable Knowledge Bases. Although the basic search engine will include a built-in knowledge base consisting of dictionaries/thesauri, it will also allow for extensions, with the capability of easy modification and loading of new terms. The diversity of vocabularies among data sources also makes it mandatory that users be able to identify and utilize terminology unique to a particular discipline or agency, thus enabling concept-based searching from a particular user perspective.
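The idea of a built-in knowledge base that can be extended with discipline-specific terms, and then used to enhance a search request, can be sketched in Python as follows. The data layout (a mapping from a term to a set of related terms), the sample vocabulary and the function names are assumptions for illustration only; the article does not describe the internal format of the knowledge bases.

```python
# Hypothetical built-in knowledge base: term -> related terms.
BASE_KB = {
    "warming": {"heating", "temperature rise"},
    "forest": {"woodland"},
}


def load_extension(base, extension):
    """Return a new knowledge base with the extension merged into the base.

    The base is copied, so the built-in knowledge base is left unchanged.
    """
    merged = {term: set(related) for term, related in base.items()}
    for term, related in extension.items():
        merged.setdefault(term, set()).update(related)
    return merged


def expand_query(terms, kb):
    """Return the original terms followed by their related terms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + sorted(kb.get(term, ())):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# An agency-specific extension adds new vocabulary on top of the built-in terms.
agency_terms = {"forest": {"silviculture"}, "runoff": {"drainage"}}
kb = load_extension(BASE_KB, agency_terms)
print(expand_query(["forest", "runoff"], kb))
```

In a real system the user would be shown the expanded list and allowed to remove unwanted terms before the search is executed, as described above.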

Smart Query Scheduler and Smart Query Servers. The nature of the distributed environment to be handled by the assisted search module requires that several processes be controlled by this module. Because clients can query in real time or on a delayed basis and can query over multiple databases in multiple locations simultaneously, queries must be scheduled and routed for maximum efficiency and speed. This requires a client/server architecture in which several processes (transparent to users) are controlled by the assisted search module and its interfaces to other remote servers.

One or more Smart Query Servers will exist for each database to be searched; the server will actually execute the search (through an interface to the search engine) and return the results to a Client Handler. When a client (user) makes a search request, the Client Handler first requests an available Query Server from the Smart Query Scheduler. The Scheduler will wait (if necessary) for an available Query Server and then allocate the server to the Client Handler. The Client Handler will connect to the Query Server and initiate the search over the requested database. When the search is complete, the Query Server will notify the Scheduler that it is available for additional searches. Each Query Server must be able to handle multiple simultaneous requests. The Smart Query Scheduler must also monitor all Query Servers so that no one server is overutilized.
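The allocation handshake among the Client Handler, the Smart Query Scheduler and the Query Servers can be sketched in Python as shown below. This is a simplified model under stated assumptions: the class and method names are invented for the example, each server handles one request at a time (the design above calls for multiple simultaneous requests per server), and the search itself is a plain substring match rather than a call to the search engine.

```python
import queue


class QueryServer:
    """Hypothetical stand-in for a Smart Query Server bound to one database."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents
        self.scheduler = None  # set by the scheduler at registration

    def search(self, term):
        """Run the search, then notify the scheduler that this server is free."""
        try:
            return [d for d in self.documents if term.lower() in d.lower()]
        finally:
            self.scheduler.release(self)


class SmartQueryScheduler:
    """Hands out idle Query Servers and takes them back when they finish."""

    def __init__(self, servers):
        self._idle = queue.Queue()
        for server in servers:
            server.scheduler = self
            self._idle.put(server)

    def allocate(self, timeout=None):
        """Wait (if necessary) for an idle server; raises queue.Empty on timeout."""
        return self._idle.get(timeout=timeout)

    def release(self, server):
        """Mark a server as available for additional searches."""
        self._idle.put(server)

    def idle_count(self):
        return self._idle.qsize()


def client_handler(scheduler, term):
    """Request a server from the scheduler and run the search on it."""
    server = scheduler.allocate()
    return server.name, server.search(term)


documents = ["Ozone trends 1979-1992", "Sea ice extent", "Antarctic ozone hole"]
scheduler = SmartQueryScheduler(
    [QueryServer("qs1", documents), QueryServer("qs2", documents)]
)
print(client_handler(scheduler, "ozone"))
```

Because idle servers are kept in a first-in, first-out queue, a server that has just finished goes to the back of the line, which spreads requests across servers. That is only one simple way to keep any single server from being overutilized; the article does not say how the Scheduler's monitoring is to be done.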

Expanded Query Generator and Results Synthesizer. The assisted search module will enable users to query on data and information cataloged using various search engines and their associated loading and formatting mechanisms. To handle the multiplicity of formats and still take advantage of the built-in knowledge base and relevancy ranking provided by the NL search engine, an Expanded Query Generator will translate a user query into one or more queries recognizable by the database being searched. All archived and new data may remain in or be loaded in whatever format is desired, without sacrificing the greater accuracy of queries processed via the NL Search Engine and knowledge bases.
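A minimal Python sketch of the Expanded Query Generator idea follows: one list of search terms is translated into a separate query string for each target database. The two query dialects, the database names and the function names are hypothetical and chosen only to show the translation step; the article does not define the query syntaxes the system will need to support.

```python
def to_boolean(terms):
    """Render terms for a hypothetical engine that expects explicit AND operators."""
    return " AND ".join(f'"{t}"' if " " in t else t for t in terms)


def to_keywords(terms):
    """Render terms for a hypothetical engine that accepts a bare keyword list."""
    return " ".join(word for t in terms for word in t.split())


# Hypothetical registry mapping a dialect name to its translator.
TRANSLATORS = {"boolean": to_boolean, "keywords": to_keywords}


def generate_queries(terms, databases):
    """Return one translated query string per target database.

    `databases` maps a database name to the dialect its search engine uses.
    """
    return {name: TRANSLATORS[dialect](terms) for name, dialect in databases.items()}


targets = {"catalog_a": "boolean", "archive_b": "keywords"}
print(generate_queries(["ozone", "sea ice"], targets))
```

Keeping the translators in a registry means a new database format can be supported by adding one function, without changing how queries are entered or expanded.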

After the databases are searched and results are returned, a Results Synthesizer will combine the returned documents into a single, ranked list for the user. The Synthesizer will allow the NL Search Engine to apply relevancy ranking uniformly, even though the returned documents may have been retrieved from diverse databases and engines.
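The merging step performed by the Results Synthesizer can be sketched in Python as follows. The sketch assumes that each database returns a list of (document ID, text) pairs and that a single scoring function stands in for the NL Search Engine's relevancy ranking; both are assumptions for this example, as are all of the names used.

```python
def synthesize(result_sets, score):
    """Merge per-database result lists into a single ranked list.

    `result_sets` maps a source name to a list of (doc_id, text) pairs.
    `score` is one relevance function applied uniformly to every document,
    regardless of which database or engine returned it.
    Returns (score, doc_id, source) tuples, best first; a document returned
    by several sources is kept once, under the first source that returned it.
    """
    merged = {}
    for source, results in result_sets.items():
        for doc_id, text in results:
            if doc_id not in merged:
                merged[doc_id] = (score(text), doc_id, source)
    # Highest score first; ties are broken by document ID for a stable order.
    return sorted(merged.values(), key=lambda r: (-r[0], r[1]))


def count_ozone(text):
    """Toy relevance function: number of times 'ozone' appears in the text."""
    return text.lower().count("ozone")


results = {
    "wais_server": [("d1", "Ozone and ozone holes"), ("d2", "Sea ice extent")],
    "catalog": [("d3", "Ozone trends"), ("d1", "Ozone and ozone holes")],
}
for row in synthesize(results, count_ozone):
    print(row)
```

Scoring every document with the same function, rather than trusting each engine's own scores, is what makes the combined ranking comparable across sources.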

As part of this component, interfaces to a variety of search and storage standards (such as WAIS, World Wide Web, and others) will be explored.

Profiles and Routing. The flow of text will be managed so that new documents added to data collections can be routed automatically to users with profiles matching the subject code of the document. In this way, users can be notified automatically of new and pertinent data and information.

Each available source database will contain a particular code that identifies, in general terms, the subject matter of its documents. Users can create individual profiles that further define the types of information about which they wish to be notified. On a periodic basis, a Smart Drone will search documents added to target databases in order to match them to user profiles. Documents ranked above a particular threshold (based on matches with the user profile) will then be mailed to those users.

Assisted Search Metadata. One portion of the search assistance provided via the GCDIS ASK system is metadata (or data about data) that will be maintained on the assisted search module. This MetaGuide will consist of information such as participating Internet addresses and IDs, server names and locations, search engines, available knowledge bases, user profile and routing data and the subject matter contained in various databases. The MetaGuide directory will serve as a global navigation tool and will be provided in the form of a main directory, as well as in an alphabetical index. Data for the MetaGuide directory will be gathered from all participating providers of global change data and information.
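The profile-matching pass performed by the Smart Drone (described under Profiles and Routing) can be sketched in Python as follows. The record layout, the subject codes and the scoring rule (the fraction of a profile's terms found in a new document) are assumptions made for illustration; the article states only that documents ranked above a threshold are mailed to the matching users.

```python
def route_new_documents(new_documents, profiles, threshold=0.5):
    """Return {user: [doc_id, ...]} for documents scoring above the threshold.

    Hypothetical scoring: a document is considered only if its subject code
    matches the profile's code; its score is the fraction of the profile's
    terms that appear as words in the document text.
    """
    deliveries = {}
    for user, profile in profiles.items():
        if not profile["terms"]:
            continue  # an empty profile cannot be scored
        for doc in new_documents:
            if doc["subject_code"] != profile["subject_code"]:
                continue
            words = set(doc["text"].lower().split())
            matched = sum(1 for term in profile["terms"] if term in words)
            if matched / len(profile["terms"]) > threshold:
                deliveries.setdefault(user, []).append(doc["id"])
    return deliveries


profiles = {
    "alice": {"subject_code": "ATM", "terms": ["ozone", "antarctic"]},
    "bob": {"subject_code": "HYD", "terms": ["runoff"]},
}
new_documents = [
    {"id": "n1", "subject_code": "ATM", "text": "Antarctic ozone measurements"},
    {"id": "n2", "subject_code": "ATM", "text": "Cloud cover statistics"},
    {"id": "n3", "subject_code": "HYD", "text": "Seasonal runoff totals"},
]
print(route_new_documents(new_documents, profiles))
```

In this sketch the returned mapping would be handed to a mail step; the subject-code check acts as a coarse filter, and the profile terms provide the finer ranking.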

GIS Option. A geographic information system (GIS) is required to facilitate searching for documents that meet certain geographic parameters. As demonstrated in a GCDIS Pilot Project, the semantic network structure can already identify hierarchical references (such as "part of") to political boundaries (as in Arizona is part of the United States, which is part of North America), as well as specific latitude/longitude references. An existing GIS will be available in the GCDIS Assisted Search Module.

Help Desk. The Help Desk will assist users who have difficulty accessing the system, locating Internet address information or databases to be searched, formulating queries, etc. Users can access the Help Desk via electronic messages over the Internet, via telephone, etc.
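The hierarchical "part of" references mentioned under the GIS Option can be sketched in Python as a chain of links that is followed upward from a place to the regions that contain it. The table below holds only the example given in the text; the data structure and function names are assumptions, and a real semantic network or GIS would be far richer.

```python
# "Part of" links taken from the example in the text: child -> parent.
PART_OF = {
    "Arizona": "United States",
    "United States": "North America",
}


def is_part_of(place, region, part_of=PART_OF):
    """Follow 'part of' links upward from `place` until `region` is found.

    Returns False if the chain ends first. The `seen` set guards against
    accidental cycles in the link table.
    """
    seen = set()
    current = place
    while current in part_of and current not in seen:
        seen.add(current)
        current = part_of[current]
        if current == region:
            return True
    return False


def documents_within(documents, region, part_of=PART_OF):
    """Select documents whose place tag is the region itself or lies inside it."""
    return [
        doc_id
        for doc_id, place in documents
        if place == region or is_part_of(place, region, part_of)
    ]


tagged = [("d1", "Arizona"), ("d2", "North America"), ("d3", "Antarctica")]
print(is_part_of("Arizona", "North America"))
print(documents_within(tagged, "North America"))
```

A search for North America therefore also finds documents tagged only with Arizona, which is the benefit of identifying hierarchical references rather than matching place names literally.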

Data Collection Module

Multiple government agencies and industry participants will be providing access to global change data and information. This data and information will typically reside on a local server at the participating agency and will be accessible via the assisted search module (or directly by expert users). Individual providers will determine in what format their data and information will be maintained and updated.

Project Plan

The Project Plan for the ASK prototype system consists of three phases - design, implementation and maintenance - scheduled for completion by the end of 1995. The primary objective of each phase is described below:

Design Phase - identify the level of effort required to build the ASK prototype system. Specific major objectives include evaluating off-the-shelf vs. customized interfaces, researching optimum system configurations, developing a design specification and modifying the proposed implementation plan as necessary to reflect that specification.

Implementation Phase - build and test an operational version of the GCDIS ASK system, which users can employ as a prototype to search specific sources.

Maintenance Phase - provide for the staffing of the Help Desk and enable the expansion of the system to include more users and data sources.

Roberta Y. Rand is the U.S. Department of Agriculture Change Data and Information Management Coordinator in the Information Systems Division of the National Agricultural Library.