Bulletin of The American Society for Information Science

Vol. 27, No.1

October/November 2000


Methodological Approach in Discovering User Search Patterns through Web Log Analysis
by Bernard J. Jansen and Amanda Spink

Bernard J. Jansen is in the Computer Science Program, University of Maryland (Asian Division), Seoul, 140-022 Korea. He can be reached by e-mail at jjansen@acm.org. Amanda Spink is with the School of Information Sciences and Technology, The Pennsylvania State University, 511 Rider I Building, 120 S. Burrowes St., University Park, PA 16801. She can be reached by telephone at 814/865-4454; by fax at 814/865-5604; or by e-mail at spink@ist.psu.edu.

The Web is a whole new searching environment, and therefore, a new category of user searching studies presents itself. For the past three years, we have been involved with an extensive research project focusing on an analysis of Web queries submitted by searchers of the Excite (www.excite.com) information retrieval system, a major Web search engine. The Excite project focused first on a data set of about 51,000 queries, later a data set of approximately 1 million queries and recently a data set of over 2.5 million queries. We thank Excite for making these data sets available for research. Without their cooperation, this research would not be possible.

The number of individuals now collaborating on the research project has grown to about 10, many of whom have never met in person. Most of the communication among researchers is conducted via the Web. This is an indication of the Web's impact on collaborative research. Without a doubt, the Web has had a major impact on society, certainly in terms of information access. In terms of information quality, the utility of the Web now matches that of a professional reference librarian.

The Excite project research is extremely fascinating and provides amazing insight into information searchers and their perceptions of the Web. Since these data sets from Excite are composed of real queries by real searchers outside of the academic setting, the interaction among the searchers and the search engine is sometimes almost unbelievable. For example, take the query (an actual query submitted to the Excite search engine by a real user) "nude pictures of myself." Think for a moment what this query says about the user's view of both the computer system and the underlying knowledge base!

Unfortunately, there is still an extremely limited number of statistical studies on Web searching, other than those generated from this research project, although other studies are appearing. There are certainly abundant anecdotal studies and articles containing unsupported statements about Web searching characteristics. This situation has resulted in an almost general acceptance that "we all know the structure of Web queries and Web search topics" (e.g., sex). For systematic study and investigation, one requires data and analysis, not opinion.

It is extremely challenging to construct valid Web user studies. Having worked with these large data sets for some time now, we present in this article how we structured our study and suggest improvements for future studies. We do not focus on the results of the Excite research project; however, we do provide some brief results and a list of citations for the interested researcher.

Descriptive Information

The descriptive information section records necessary background data on the searchers, the IR system, the data set and how the data were collected. In a transaction log analysis, demographic information concerning the actual searchers may be impossible to get. However, information on the Web IR system, the number of searchers and visitors in a given time period, the primary language of the queries and document collection, and the domain of the searchers is available from other sources.

Necessary descriptive information about the IR system also includes the simple and advanced searching rules in effect during the data collection period. Because these rules change rapidly in "Web time," the rules in effect during the data collection period may not be the rules in effect when the results are published. The information concerning the document collection should address the number of documents in the collection and the size (MB, GB, TB, etc.) of the document collection. Other system information includes how the IR system handles the indexing of text, video, audio, images and URLs.

The manner in which the data were collected is pertinent and will affect the conclusions one can draw from any analysis. Transaction logs and logging systems are different, and the data collected may vary. One needs a precise definition of each field in the transaction log, including data format and any assumptions made. Specific items to discuss are the identification of the user, the time period of the logging process and the format of the query.
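As an illustration of what a precise field definition might look like, the following minimal sketch defines a log record with the three items mentioned above: a user identifier, a time stamp and the query. The field names, the tab-separated layout and the sample line are assumptions made for this example; they do not describe the actual Excite log format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    """One transaction-log entry; field names are illustrative assumptions."""
    user_id: str         # anonymous identifier assigned by the logging system
    timestamp: datetime  # time the server received the query
    query: str           # query exactly as entered, possibly empty


def parse_record(line: str) -> LogRecord:
    """Parse a tab-separated line: user_id, ISO 8601 time stamp, query text."""
    user_id, stamp, query = line.rstrip("\n").split("\t", 2)
    return LogRecord(user_id, datetime.fromisoformat(stamp), query)


# Made-up sample line, not taken from any real log.
record = parse_record("u001\t1997-03-10T09:15:02\tdigital libraries\n")
print(record.user_id, record.timestamp.hour, record.query)
```

Writing the definition down in this form makes explicit, for instance, that an empty query is a legal value and that the query is stored as entered rather than normalized.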

Levels of Analysis

Because of the nature of the Excite transaction logs, we focused our research at three levels of analysis: the session, the query and the term.

    1) Session Level of Analysis: The session is the entire sequence of queries entered by a searcher. The primary aim at the session level is to determine the number of queries per searcher. Deciding which queries to include in a session and which to exclude can be difficult. For example, if a searcher goes to the query page but does not enter a query, is that page access included in the session count? If the IR system generates a query to view results, is that query included? The inclusion or exclusion of certain types of queries will affect the analysis, and therefore, any assumptions must be explained.
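The effect of such an inclusion rule can be shown with a short sketch. It assumes the log has already been reduced to (user_id, query) pairs; the function name, the `count_empty` parameter and the sample data are hypothetical and serve only to make the stated assumption visible in code.

```python
from collections import Counter


def queries_per_session(log, count_empty=True):
    """Count queries per searcher from (user_id, query) pairs.

    count_empty records the analyst's assumption about whether an empty
    query (a visit to the query page with nothing entered) is counted.
    """
    counts = Counter()
    for user_id, query in log:
        if count_empty or query.strip():
            counts[user_id] += 1
    return counts


# Made-up sample log: searcher u1 submits one empty query.
sample_log = [("u1", "jaguar"), ("u1", ""), ("u1", "jaguar cars"), ("u2", "weather")]
with_empty = queries_per_session(sample_log)
without_empty = queries_per_session(sample_log, count_empty=False)
print(with_empty["u1"], without_empty["u1"])
```

The same searcher has a session length of three under one assumption and two under the other, which is why the rule needs to be reported alongside the results.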

    2) Query Level of Analysis: Sessions are composed of queries. When using transaction logs, a query can be defined as a string of zero or more characters entered into the Web IR system. This is a mechanical definition as opposed to the common information seeking definition. Within each session, the queries can be further classified:

    • The first query by a particular searcher we refer to as the initial query.

    • A subsequent query by the same searcher that is identical to one of the searcher's previous queries is a repeat query.

    • A subsequent query by the same searcher that is different from any of the searcher's previous queries is a modified query.

    • A unique query refers to a query that is different from all other queries, regardless of the searcher.

    Of course, one can have various sub-components of these classifications.
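The classification above can be sketched as follows. This is one possible reading, not the project's own code: it assumes queries are compared by exact string match, that the log is in chronological order, and that "unique queries" means the set of distinct query strings across all searchers. The function names are hypothetical.

```python
def classify_queries(log):
    """Label each (user_id, query) pair as initial, repeat or modified."""
    seen = {}    # user_id -> set of queries already entered by that searcher
    labels = []
    for user_id, query in log:
        previous = seen.setdefault(user_id, set())
        if not previous:
            label = "initial"      # first query by this searcher
        elif query in previous:
            label = "repeat"       # identical to an earlier query
        else:
            label = "modified"     # differs from every earlier query
        previous.add(query)
        labels.append((user_id, query, label))
    return labels


def unique_queries(log):
    """Return the set of distinct query strings, regardless of searcher."""
    return {query for _, query in log}


# Made-up sample log in chronological order.
sample_log = [("u1", "cats"), ("u1", "cats"), ("u1", "cats and dogs"), ("u2", "cats")]
labels = [label for _, _, label in classify_queries(sample_log)]
print(labels, len(unique_queries(sample_log)))
```

Whether comparison should ignore case or extra white space is a further assumption that a study would need to state.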

    At the query level of analysis, one is generally interested in determining query length, query complexity and failure rate.

    • Query length is measured in terms.

    • Query complexity examines the query syntax.

    • Query syntax includes the use of advanced searching techniques such as Boolean operators, phrase searching and stemming.

    Many Web IR systems permit the use of symbols, such as "+," "-" and "!," to accomplish many of the same functions as Boolean operators. These symbols are referred to as term modifiers and are also a component of query syntax. The failure rate measures how often queries deviate from the published rules of the IR system.
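A rough sketch of how query length and query syntax might be extracted from a query string is given below. The rules it encodes are assumptions for illustration only: Boolean operators are taken to be the upper-case words AND, OR and NOT, term modifiers are taken to be prefixes of a term, and a phrase is any text in double quotation marks. The published rules of a particular IR system may differ.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}   # assumed upper-case operators
TERM_MODIFIERS = ("+", "-", "!")           # symbols named in the text


def query_syntax(query):
    """Report query length and which syntax features appear in a query."""
    terms = query.split()   # blank space assumed as the delimiter
    return {
        "length": len(terms),   # operators are counted as terms here
        "boolean": any(t in BOOLEAN_OPERATORS for t in terms),
        "modifier": any(t.startswith(TERM_MODIFIERS) and len(t) > 1 for t in terms),
        "phrase": re.search(r'"[^"]+"', query) is not None,
    }


print(query_syntax("+jaguar -car"))
print(query_syntax('"new york" AND hotels'))
```

A failure analysis would compare such detected features against the system's published rules, for example flagging a lower-case "and" where the system requires upper case.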

    3) Term Level of Analysis: The final level of analysis is the term level. A term is defined as a string of characters separated by some delimiter such as a space, colon or period. The researcher decides which delimiter to utilize. For example, if a system rule requires terms to be separated by a blank space, searchers may still use other delimiters, such as a period. Is the blank or the period the delimiter? The choice will affect the term count. One also has to state whether Boolean operators are counted as terms. There are advantages and disadvantages to including or excluding them. On the one hand, removing Boolean operators keeps the system-imposed operators out of the term count. On the other hand, in practice it is difficult, and sometimes impossible, to distinguish a term that the searcher intended to be a Boolean operator from one intended to be a conjunction.
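The effect of the delimiter choice on the term count can be seen in a small sketch. The function name, its parameters and the upper-case operator list are assumptions made for this example.

```python
import re


def split_terms(query, delimiters=" ", drop_boolean=False):
    """Split a query into terms on any of the given delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    terms = [t for t in re.split(pattern, query) if t]
    if drop_boolean:
        # Only upper-case operators are removed; a lower-case "and" stays,
        # since it may be an ordinary conjunction.
        terms = [t for t in terms if t not in {"AND", "OR", "NOT"}]
    return terms


query = "cats.dogs and birds"
print(len(split_terms(query)))                    # blank as the only delimiter
print(len(split_terms(query, delimiters=" .")))   # blank or period
print(split_terms("cats AND dogs", drop_boolean=True))
```

The same query yields three terms with the blank as delimiter and four when the period is also treated as a delimiter, so the rule chosen has to be reported with the term counts.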

Statistical Analysis

The statistical analysis section includes the mean, the standard deviation and the median wherever justified. These metrics permit one to compare and contrast results among studies. Because one can never present every statistical measure that fellow researchers might want, the data should be presented at the finest practical level of detail. For example, in presenting the query length (the number of terms per query), it is better to list the number and percentage of queries with one term, two terms and so forth than to group them ("three or fewer terms per query") and present an aggregate number. At the term level of analysis, it is useful to compare the distribution of terms to known distributions, along with measures determining the goodness of fit.
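The following sketch shows one way to report query length both as summary statistics and as an ungrouped distribution. The sample values are made up, the function name is hypothetical, and the standard deviation used is the sample standard deviation from Python's statistics module.

```python
from collections import Counter
from statistics import mean, median, stdev


def length_distribution(query_lengths):
    """Return {length: (count, percentage)} for terms-per-query values."""
    counts = Counter(query_lengths)
    total = len(query_lengths)
    return {n: (c, 100.0 * c / total) for n, c in sorted(counts.items())}


lengths = [1, 2, 2, 3, 1, 2, 5, 0]   # made-up terms-per-query values
print(mean(lengths), median(lengths), round(stdev(lengths), 2))
for n, (count, pct) in length_distribution(lengths).items():
    print(f"{n} terms: {count} queries ({pct:.1f}%)")
```

Listing every length separately, including zero-term queries, lets other researchers regroup the figures as they see fit.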

Results from Data Analysis

Although our research has not analyzed all aspects of Web searching, one can draw some general impressions of Web searchers from the research results. Most Web searchers entered about one query per session. Concerning query length, Web searchers appear to use queries of approximately two terms or fewer. Boolean usage among users is about 8%. Web searchers appear to examine a small number of the documents they retrieve, with most searchers looking at a maximum of 10 items. These statistics are in line with the results from studies of other Web search engines.

In short, "typical" Web searchers use approximately two terms in a query, do not use complex query syntax, view no more than 10 documents from the results list and have a session length of one query. The "typical" information professional would probably wonder how these users find anything. However, it appears Web searchers do find information. In our survey of Excite users, almost 70% of the searchers stated that they had located relevant information on the search engine.

Table 1 presents some statistics from the three studies.

This article has presented an overview of the methodology and analysis used in the Excite Research Project, which is an on-going study of searching on the Web. We hope that this project will be the precursor to numerous other major, long-term research efforts focusing on the unique IR environment of the Web. The study has been and continues to be a fascinating look into the public's searching behavior.

Measure                    1997 Excite Study   1997 Excite Study   1999 Excite Study
                           (51K Queries)       (1M Queries)        (1M Queries from 2.5M sample)

Mean Queries Per Session   2.8                 4.8                 2.0

Session Length
  1 Query                  31%                 26.4%               25.8%
  2 Query                  31%                 31.5%               26%
  3+ Query                 18%                 18.2%               15%

Mean Terms Per Query       2.32                2.4                 2.35

Table 1. Statistics from three Excite Transaction Log Studies

Note 1: The percentage of queries in Session Length having 1, 2, or 3+ terms does not add up to 100% because some queries had 0 terms.

Note 2: Data for the 1999 study are based on a 1 million query subset of the 2.5 million queries.

For Further Reading about the Excite Research Project

    Spink, A., Wolfram, D., Jansen, B. J., & Saracevic, T. (Forthcoming). Searching the Web: The public and their queries. Journal of the American Society for Information Science.

    Jansen, B. & Pooch, U. (Forthcoming). Web user studies: A review of and framework for future research. Journal of the American Society for Information Science.

    Jansen, B. J., Goodrum, A., & Spink, A. (Forthcoming). Searching for multimedia: An analysis of audio, video, and image web queries. World Wide Web Journal.

    Spink, A., Jansen, B. J., & Ozmultu, C. (Forthcoming). Use of query reformulation and relevance feedback by Web users. Internet Research: Electronic Networking Applications and Policy.

    Ross, N. & Wolfram, D. (2000). End user searching on the Internet: An analysis of term pair topics submitted to the Excite search engine. Journal of the American Society for Information Science, 51(10), 949-958.

    Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis of user queries on the Web. Information Processing and Management, 36(2), 207-227.

    Jansen, B. J., Spink, A., & Saracevic, T. (1999). The use of relevance feedback on the Web: Implications for Web IR system design. Proceedings of the 1999 World Conference on the WWW and Internet, Honolulu, Hawaii.

    Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998). Real life information retrieval: A study of user queries on the web. ACM SIGIR Forum, 32(1), 5-17.

    Spink, A., Bateman, J., & Jansen, B. J. (1999). Searching the Web: A survey of Excite users. Journal of Internet Research: Electronic Networking Applications and Policy, 9(2), 117-128.

    Spink, A., Bateman, J., & Jansen, B. J. (1998). Searching heterogeneous collections on the Web: Behavior of Excite users. Information Research: An Electronic Journal, 4(2).

    Jansen, B. J., Spink, A., & Saracevic, T. (1998). Failure analysis in query construction: Data and analysis from a large sample of web queries. Proceedings of the 3rd ACM Conference on Digital Libraries, Pittsburgh, PA.

 



© 2000, American Society for Information Science