The United States Global Change Research Program (USGCRP), an interagency project within the federal government, is charged with observing, understanding and predicting global change and making the group's results available for use in policy matters. In the Global Change Data and Information Management Program Plan, published in 1992, the participating federal agencies agreed to do the following:
The ASK system will contain three modules: client module, assisted search module and data collection module.
A Query Scheduler in the client portion of the system will enable searches to be performed in real time or on a delayed basis and routed to the user.
User interfaces will be made available to all clients (in order to accommodate all Internet users) and will allow functionality consistent across platforms. Additionally, a character-based interface will be available in the assisted search module for use over the Internet with any VT100 style terminal or terminal emulator.
NL Search Engine is the heart of the system. To leverage the latest developments in computer technology, the NL search engine combines the best of both statistical and concept-based searching. Statistical searching measures criteria such as the number of occurrences of search terms or the proximity of those terms to one another within a document. Concept-based searching enhances the process by adding user knowledge and expertise; users can choose concept definitions from a knowledge base, improving accuracy through additional contextual evidence.
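The statistical side of this can be illustrated with a minimal sketch that scores a document by how often two search terms occur and how close together they appear. The function names and the scoring formula (occurrence count plus the inverse of the smallest gap) are assumptions made for illustration, not the ASK engine's actual method.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(tokens, terms):
    """Count the total occurrences of the search terms."""
    wanted = set(terms)
    return sum(1 for tok in tokens if tok in wanted)

def min_proximity(tokens, term_a, term_b):
    """Smallest distance, in tokens, between the two terms (None if either is absent)."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def statistical_score(text, term_a, term_b):
    """Hypothetical score: occurrence count plus a bonus for closeness."""
    tokens = tokenize(text)
    freq = term_frequency(tokens, [term_a, term_b])
    gap = min_proximity(tokens, term_a, term_b)
    bonus = 1.0 / gap if gap else 0.0  # no bonus if a term is missing or both terms are identical
    return freq + bonus
```

A document that mentions "ozone depletion" twice as an adjacent phrase would score higher here than one that mentions each word once, several words apart.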
To eliminate the costly, labor-intensive process of manually building and maintaining the knowledge bases required by concept-based searching, the prototype system will contain a built-in knowledge base drawn from published lexical sources (such as dictionaries and thesauri). It will also use advanced linguistic techniques, such as morphological analysis, and natural language processing techniques to identify concepts to be used for searching.
Users will be able to enter search requests in the form of Boolean queries, plain English queries or by geographical parameters. The search engine will be able to enhance the search request with related terms and concepts, which may be refined by the user prior to search execution.
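The enhancement step might look roughly like the following sketch, in which a plain-English query is expanded with related terms and the user can reject suggestions before the search runs. The `THESAURUS` contents and the `expand_query` function are hypothetical stand-ins for the built-in knowledge base.

```python
# Hypothetical, tiny stand-in for the built-in lexical knowledge base.
THESAURUS = {
    "drought": ["aridity", "water shortage"],
    "warming": ["temperature rise"],
}

def expand_query(query, thesaurus=THESAURUS, rejected=()):
    """Return the original terms followed by related terms the user has not rejected."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in rejected and related not in expanded:
                expanded.append(related)
    return expanded
```

Passing `rejected={"aridity"}` models a user refining the suggested terms prior to search execution.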
The search engine will include a set of document loading tools used to build compact indexes to the documents. These will enable structured or unstructured documents to be retrieved at high speed and displayed in their native formats.
Extendable Knowledge Bases. Although the basic search engine will include a built-in knowledge base consisting of dictionaries/thesauri, it will also allow for extensions, with the capability of easy modification and loading of new terms. The diversity of vocabularies among data sources also makes it mandatory that users be able to identify and utilize terminology unique to a particular discipline or agency, thus enabling concept-based searching from a particular user perspective.
Smart Query Scheduler and Smart Query Servers. The nature of the distributed environment to be handled by the assisted search module requires that several processes be controlled by this module. Because clients can query in real time or on a delayed basis and can query over multiple databases in multiple locations simultaneously, queries must be scheduled and routed for maximum efficiency and speed. This requires a client/server architecture in which several processes (transparent to users) are controlled by the assisted search module and its interfaces to other remote servers.
One or more Smart Query Servers will exist for each database to be searched; the server will actually execute the search (through an interface to the search engine) and return the results to a Client Handler. When a client (user) makes a search request, the Client Handler first requests an available Query Server from the Smart Query Scheduler. The Scheduler will wait (if necessary) for an available Query Server and then allocate the server to the Client Handler. The Client Handler will connect to the Query Server and initiate the search over the requested database. When the search is complete, the Query Server will notify the Scheduler that it is available for additional searches. Each Query Server must be able to handle multiple simultaneous requests. The Smart Query Scheduler must also monitor all Query Servers so that no one server is overutilized.
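The allocation handshake between Client Handler, Scheduler and Query Server can be sketched as below. This is a simplified illustration under stated assumptions: the class and function names are invented, each server handles one request at a time (the design calls for multiple simultaneous requests per server), and a first-in, first-out pool stands in for load monitoring.

```python
import queue

class SmartQueryScheduler:
    """Hypothetical scheduler that hands out idle Query Servers in FIFO order."""

    def __init__(self, server_names):
        self._available = queue.Queue()
        for name in server_names:
            self._available.put(name)
        # Per-server count of finished searches, kept for load monitoring.
        self.completed = {name: 0 for name in server_names}

    def acquire(self, timeout=None):
        """Wait (if necessary) for an idle server and allocate it to the caller."""
        return self._available.get(timeout=timeout)

    def release(self, name):
        """Called when a search completes: the server is available again."""
        self.completed[name] += 1
        self._available.put(name)

def client_handler(scheduler, database, query, search_fn):
    """Request a server, run the search over the database, then free the server."""
    server = scheduler.acquire()
    try:
        return server, search_fn(database, query)
    finally:
        scheduler.release(server)
```

Because a released server goes to the back of the queue, successive requests rotate through the pool, which is one simple way to keep any single server from being overutilized.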
Expanded Query Generator and Results Synthesizer. The assisted search module will enable users to query on data and information cataloged using various search engines and their associated loading and formatting mechanisms. To handle the multiplicity of formats and still take advantage of the built-in knowledge base and relevancy ranking provided by the NL search engine, an Expanded Query Generator will translate a user query into one or more queries recognizable by the database being searched. All archived and new data may remain in or be loaded in whatever format is desired, without sacrificing the greater accuracy of queries processed via the NL Search Engine and knowledge bases.
After the databases are searched and results are returned, a Results Synthesizer will combine the returned documents into a single, ranked list for the user. The Synthesizer will allow the NL Search Engine to apply relevancy ranking uniformly, even though the returned documents may have been retrieved from diverse databases and engines.
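A minimal sketch of the synthesis step follows: results from several sources are de-duplicated, rescored with one shared scoring function, and sorted into a single list. The `synthesize` function, the source labels and the shape of the input are assumptions for illustration; the real Synthesizer would rely on the NL Search Engine's own relevancy ranking.

```python
def synthesize(result_sets, score_fn):
    """Merge per-source results into one list ranked by a single scoring function.

    result_sets maps a source name to a list of (doc_id, text) pairs.
    Returns a list of (score, doc_id, source), best score first.
    """
    seen = {}
    for source, docs in result_sets.items():
        for doc_id, text in docs:
            # Keep the first copy of a document returned by more than one source.
            if doc_id not in seen:
                seen[doc_id] = (source, text)
    ranked = [(score_fn(text), doc_id, source)
              for doc_id, (source, text) in seen.items()]
    # Sort by descending score; break ties by document id for a stable order.
    ranked.sort(key=lambda row: (-row[0], row[1]))
    return ranked
```

Applying the same `score_fn` to every document is what makes the ranking uniform, regardless of which database or engine returned it.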
As part of this component, interfaces to a variety of search and storage standards (such as WAIS, World Wide Web, and others) will be explored.
Profiles and Routing. The flow of text will be managed so that new documents added to data collections can be routed automatically to users with profiles matching the subject code of the document. In this way, users can be notified automatically of new and pertinent data and information.
Each available source database will contain a particular code that identifies, in general terms, the subject matter of its documents. Users can create individual profiles that further define the types of information about which they wish to be notified. On a periodic basis, a Smart Drone will search documents added to target databases in order to match them to user profiles. Documents ranked above a particular threshold (based on matches with the user profile) will then be mailed to those users.
Assisted Search Metadata. One portion of the search assistance provided via the GCDIS ASK system is metadata (or data about data) that will be maintained on the assisted search module. This MetaGuide will consist of information such as participating Internet addresses and IDs, server names and locations, search engines, available knowledge bases, user profile and routing data and the subject matter contained in various databases. The MetaGuide directory will serve as a global navigation tool and will be provided in the form of a main directory, as well as in an alphabetical index. Data for the MetaGuide directory will be gathered from all participating providers of global change data and information.
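The profile-matching step described under Profiles and Routing (a periodic pass over newly added documents, with delivery only above a threshold) could be sketched as follows. The `Profile` fields, the keyword-overlap score and the default threshold of 0.5 are all assumptions for illustration, not details given by the design.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Hypothetical user profile: subject codes of interest plus refining keywords."""
    user: str
    subject_codes: set
    keywords: set

def match_score(profile, doc_text):
    """Fraction of the profile's keywords that appear in the document."""
    if not profile.keywords:
        return 0.0
    words = set(doc_text.lower().split())
    return len(profile.keywords & words) / len(profile.keywords)

def route_new_documents(new_docs, profiles, threshold=0.5):
    """Map each user to the ids of new documents that clear the threshold.

    new_docs is a list of (doc_id, subject_code, text) tuples.
    """
    deliveries = {}
    for doc_id, code, text in new_docs:
        for profile in profiles:
            if code in profile.subject_codes and match_score(profile, text) >= threshold:
                deliveries.setdefault(profile.user, []).append(doc_id)
    return deliveries
```

The subject code acts as a coarse filter, and the profile keywords supply the finer ranking that decides whether a document is mailed.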
GIS Option. A geographic information system (GIS) is required to facilitate searching for documents that meet certain geographic parameters. As demonstrated in a GCDIS Pilot Project, the semantic network structure can already identify hierarchical references (such as "part of") to political boundaries (as in Arizona is part of the United States, which is part of North America), as well as specific latitude/longitude references. An existing GIS will be available in the GCDIS Assisted Search Module.
Help Desk. The Help Desk will assist users who have difficulty accessing the system, locating Internet address information or databases to be searched, formulating queries, etc. Users can access the Help Desk via electronic messages over the Internet, via telephone, etc.
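The hierarchical "part of" references mentioned under the GIS Option can be illustrated with a small sketch that walks a chain of containment links, so that a document tagged with Arizona also matches a search on North America. The `PART_OF` table and both function names are hypothetical; the actual semantic network structure is not described in enough detail to reproduce.

```python
# Hypothetical "part of" links, following the Arizona example in the text.
PART_OF = {
    "Arizona": "United States",
    "United States": "North America",
}

def containing_regions(place, part_of=PART_OF):
    """Return every region that contains the place, nearest first."""
    chain = []
    seen = {place}
    while place in part_of:
        place = part_of[place]
        if place in seen:  # guard against accidental cycles in the table
            break
        seen.add(place)
        chain.append(place)
    return chain

def matches_region(doc_place, query_region, part_of=PART_OF):
    """True if the document's place is the query region or lies inside it."""
    return doc_place == query_region or query_region in containing_regions(doc_place, part_of)
```

The relation is deliberately one-directional: a search on Arizona does not match a document tagged only with North America.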
Design Phase - identify the level of effort required to build the ASK prototype system. Specific major objectives include evaluating off-the-shelf vs. customized interfaces, researching optimum system configurations, developing a design specification and modifying the proposed implementation plan as necessary to reflect that specification.
Implementation Phase - build and test an operational version of the GCDIS ASK system, which users can use as a prototype to search specific sources.
Maintenance Phase - provide for the staffing of the Help Desk and enable the expansion of the system to include more users and data sources.
Roberta Y. Rand is the U.S. Department of Agriculture Change Data and Information Management Coordinator in the Information Systems Division of the National Agricultural Library.