Direction on the construction and application of classification schemes such as taxonomies is readily available, but relatively little has been offered on evaluating the schemes themselves and their use to categorize content. A classification scheme can be judged for how well it meets its purpose and complies with standards, and a strong evaluative framework is reflected in S.R. Ranganathan’s principles of classification. The degree of certainty of classification decisions depends on objective understanding of the object to be classified, the scope and details of the class and the coverage and organization of the overall classification scheme. The more complete the information about each class, the more reliable the goodness-of-fit for an object to a class is likely to be, whether chosen by human or machine classifiers. This information comes through definitions, examples, prior use and semantic relationships. The risk of misclassification can be reduced by analyzing the goodness-of-fit of objects to classes and the patterns of missed or erroneous selections.

classification schemes
manual indexing
automatic classification

Bulletin, December 2012/January 2013

Special Section

Evaluating Classification Schema and Classification Decisions

by Denise Bedford

This article considers how evaluation pertains to taxonomies. Taxonomies and evaluation are both rich concepts, so it is best to start with some definitions that frame our discussion. What do we mean by taxonomy? And what do we mean by evaluation?

For seasoned information professionals the traditional characterization of a taxonomy is as a hierarchical classification scheme. This characterization has expanded in the last 20 years as the taxonomy community and the information environment have expanded. Today the taxonomy community includes people who design taxonomies, those who build systems that support them and those who use them. Our complex information environment may call for a variety of taxonomic structures, including

  • flat taxonomies such as lists of languages or lists of countries;
  • hierarchical taxonomies such as topical or subject classifications, business classifications or service classifications;
  • faceted taxonomies such as metadata or parametric search structures; 
  • ring taxonomies such as synonyms or authority control data; and
  • network taxonomies such as fully relationed thesauri or knowledge networks.

Each of these structures has its own set of principles and behaviors. And each requires an evaluation method that aligns with those principles and behaviors. This article focuses on the second type of taxonomy – the traditional classification scheme or hierarchical taxonomy. Classification schemes govern the organization of objects into groups according to explicit properties or values. Classification schemes are in widespread use in everyday life – from grocery stores to websites to personal information spaces. 

To evaluate something is to determine or fix a value through careful appraisal. There seem to be two important evaluation points related to classification schemes. The first is an evaluation of the classification scheme itself. The second is how well the scheme supports classification decisions. Each requires its own framework and context. 

Evaluating a Classification Scheme
We can evaluate a classification scheme based on its intended goal and purpose and by how well it aligns with professional standards and principles. Goals and purpose will be institution-specific and are best addressed internally by those who design and work with the scheme. Evaluating a classification scheme according to professional standards and principles, though, is a process that can be generalized. ISO 11179-2 Information Technology – Metadata Registries (MDR). Part 2. Classification (2005) [1] provides advice for constructing data structures and relationships that are used to represent a scheme. ISO 25964 Thesaurus Schemas [2] and ANSI-NISO Z39.19 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies [3] provide some guidance on the distinction between a thesaurus (a network taxonomy structure) and a classification scheme (a hierarchical taxonomy). Important advice on how to construct classes in a classification scheme, though, comes from other sources such as S.R. Ranganathan’s Prolegomena to Library Classification [4] and discussions of set theory found in the mathematical sciences literature [5]. Table 1 provides a sample set of principles for constructing classes derived, but reinterpreted, from Ranganathan for an information technology team. The challenge with using these sources for evaluation is that they generally require substantial translation to be understandable by the teams that are building and appraising the scheme. A full set of interpreted principles is available from the author upon request.

Table 1. S. R. Ranganathan’s principles re-interpreted for an information technology team.

It has generally been my experience that Ranganathan’s principles align with but are more exhaustive than the popular guidelines found in the usability engineering literature. The challenge, though, is that they are difficult for anyone outside the information science profession to interpret. Adapting and interpreting Ranganathan’s principles will provide you with a very strong framework for evaluating the strength of your classification scheme. In fact, the principles convert very nicely into a working checklist for periodic evaluations.

Evaluating Classification Decisions 
Evaluating classification decisions is less straightforward. In order to evaluate a classification decision, we need a good description of the classification process. Within the process description we can pinpoint what to evaluate. Classification is a decision-making process that involves making choices. Typically, choices are made in the context of an existing classification scheme by a human or machine classifier and for a given object (Figure 1). In theory, this choice seems like a straightforward decision process. The classifier, who knows the classification scheme, considers what is known about the object and what is known about the possible classes, and then decides which class in the scheme is the best fit for the object.

Figure 1. Simple characterization of the classification process.

The research on classification evaluation is extensive. The citations in this article are illustrative and not comprehensive. The research tends to focus on several contexts for evaluation, including

  • comparison of human and machine classification practices [5, 6, 7, 8, 9, 10];
  • assessment of the variability of classification decisions among human and machine classifiers [11, 12, 13];
  • comparison of machine-generated classification structures and well-established classification schemes and thesauri [14, 15, 16, 17, 18];
  • the quality of classification in the context of information retrieval [19, 20, 21]; and
  • evaluations of the quality of statistically generated classes [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]. 

One fundamental perspective appears to receive less treatment – the simple question of how well the object fits a class in a classification scheme. We suggest there are two simple reasons why this perspective has not received more attention. First, to date most classification has been done by people, and we have always assumed that humans make optimal decisions. Second, until recently we have not had the capacity to evaluate decisions in a direct and controlled way. Rather, we have had to evaluate them from an information-retrieval and end-user perspective. What would a direct evaluation of the fit between an object and a chosen class look like?

In an expanding universe of information, classification decisions may be made by people who have neither professional information science training nor subject expertise. Classification decisions may also be made by machine classifiers. Regardless of who makes decisions, the goal is to ensure that those decisions are optimal. An optimal classification decision reflects the best choice that can be made given the information available at the time. An optimal choice may be defined as a good fit between the object and possible classes. An optimal decision also reduces the risk of misclassification. Misclassification may take two forms. The first form occurs when we assign the object to a class for which it is not a good fit. In this case, the object will be presented to the user in error. The second form occurs when we fail to assign the object to a class for which it is a good fit. In this case, the object will be overlooked because it is not in the class. So our evaluation point for classification decisions is determining how well the classified object aligns with the chosen class(es).
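The two forms of misclassification can be made concrete with a small sketch. It assumes that, for a single object, we already know both the classes that were assigned and the classes for which the object is judged a good fit. The function name and the sample class labels are illustrative only and do not come from any particular system.

```python
def misclassification_report(assigned, good_fit):
    """Compare assigned classes with classes judged to be a good fit.

    Both arguments are collections of class labels for one object.
    """
    assigned, good_fit = set(assigned), set(good_fit)
    return {
        # First form: assigned, but not a good fit (presented to users in error).
        "erroneous": sorted(assigned - good_fit),
        # Second form: a good fit, but never assigned (overlooked by users).
        "missed": sorted(good_fit - assigned),
        # Assigned and a good fit.
        "correct": sorted(assigned & good_fit),
    }


# Hypothetical labels, for illustration only.
report = misclassification_report(
    assigned=["Livestock", "Pollution Control"],
    good_fit=["Livestock", "Biodiversity"],
)
print(report)
```

Tallying the "erroneous" and "missed" lists across many objects is one simple way to see the patterns of erroneous and missed selections for a scheme.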

Reducing Uncertainty in the Classification Decision
Information economists tell us that optimal decisions result from reducing uncertainty. One way to improve a classification decision, then, is to reduce the uncertainty in the process. Classification is characterized by several kinds of uncertainty. The classifier may have an incomplete understanding of the object. Uncertainty may be high where the classifier has access only to an abstract or summary of the object. The classifier may be uncertain as to what properties or attributes define the class. Uncertainty about the class may result from an incomplete understanding of the scheme or an incomplete specification of the domain – perhaps not all relevant classes have been defined in the schema. Perhaps the classifier has imperfect knowledge of all the topics covered in the scheme. The scope and coverage of the classes may not be explicitly available, requiring subjective interpretation by the classifier.

Any of these uncertainties may lead to a suboptimal classification decision. In some cases, making a less than perfect classification decision may be acceptable – perhaps the risk resulting from a classification decision is not so great. If a young reader overlooks a book about the role of snakes in a desert ecosystem for a school project because it was misclassified, the risk is low. In other cases, though, the risk of misclassification may be significant. Where an energetic-materials scientist overlooks an important report of a chemical experiment, national security risks may arise. 

These uncertainties are important when we are making a classification decision and when we are evaluating a classification decision. How can we reduce the uncertainty we find in the classification decision? How can we improve the information we need to evaluate classification decisions? The answer is simple: by expanding the information we have about the object, the individual classes and the overall makeup and purpose of the classification scheme. Table 2 identifies some of the conditions that might produce low, moderate and high rates of risk in classification decisions.

Table 2. Methods for Reducing Uncertainty

Notice that the more explicit and objective information we have about each of the factors, the lower the uncertainty. Uncertainty is highest when we rely on subjective interpretation of objects, where there is no direct access to objects and where there is no formal and extensive representation of a class. High levels of uncertainty may result in higher probabilities of misclassification. 

Today we are not likely to encounter uncertainty about an object because the classifier – human or machine – will have the object in hand or will be able to access it in its entirety in digital form. It is more probable that we will encounter uncertainty about a class. While humans have constructed hierarchical classification schemes for centuries, often they have not provided rigorous characterizations of those classes sufficient to reduce uncertainty in the decision process. For example, the classes in classification schemes are often represented

  • through narrative scope notes (Figure 2);
  • through dictionary definitions (Figure 3);
  • by default through subclasses (Figure 4);
  • by de facto practice as defined in collections (Figure 5); and
  • through associated subject headings and descriptors (Figure 6).

In each of these cases, the classifier has little explicit knowledge to work with. As a result, the choice is made based on a subjective interpretation of the class. A human classifier relies on personal subject knowledge and experience. The choice made by the machine classifier will be based on simple word matching and relevancy ranking.

Figure 2. Class defined through use of scope note in the ERIC Thesaurus [37].

Figure 3. American Heritage Dictionary definition of term [38].

Figure 4. FAO (Food and Agricultural Organization) Classification of Livestock. Default definition through subclasses [39].

Figure 5. Class default definition from library catalog collections.

Figure 6. Class defined through associated subject headings or terms [40].

Optimizing the Classification Decision Using Extensive Class Descriptions
Our first evaluation criterion for a classification decision was the alignment or goodness of fit of an object to the chosen class. The challenge is that we likely don’t have enough information about the class to conduct a good evaluation. Uncertainty rules in this situation [36]. The easiest way to reduce uncertainty is to provide a full and explicit representation of the class, its properties and its values. Such representation is not a trivial task, though. Today subject experts and human classifiers rely on a deep understanding of a field that they have built up over time. What we need is a way to efficiently and reliably create a full and explicit class definition that can be used to evaluate the choice of class for any object. And, to evaluate that choice, we need an objective, quantifiable and verifiable approach.

One approach that appears to work leverages a combination of machine and human methods. The first step in this process is to assemble a rich but representative sample of objects for the class. It must represent a variety of perspectives – expert-novice, popular-academic, brand specific-generic. And it must represent all aspects or facets of the class. To this collection we apply natural language processing and concept extraction methods to construct a draft representation. Domain experts and classifiers review and revise the draft representation, perhaps several times. When the representation has passed their review, it may serve as an explicit representation of a class. Figure 7 provides an example of a fully elaborated class representation for Livestock that was generated using this approach. This is one of 750 classes in a scheme. The representation comprises 3,341 concepts that were reviewed and approved by domain experts and professional indexers.
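The machine half of this process can be sketched in a few lines. The example below is a deliberately simplified stand-in for natural language processing and concept extraction: it proposes, as draft concepts, the single words that appear in at least a minimum number of sample objects. The stop list, the `min_documents` cutoff and the sample sentences are all assumptions made for illustration, and the output is only a draft for domain experts and classifiers to review and revise.

```python
import re
from collections import Counter

# Hypothetical stop list; a real project would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def draft_class_representation(sample_texts, min_documents=2):
    """Return candidate concepts found in at least `min_documents` samples.

    The result maps each candidate term to the number of sample texts
    in which it occurs.
    """
    document_frequency = Counter()
    for text in sample_texts:
        # Count each word once per sample, ignoring stop words.
        words = set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS
        document_frequency.update(words)
    return {
        term: count
        for term, count in document_frequency.items()
        if count >= min_documents
    }


# Invented sample objects for a livestock class.
samples = [
    "Cattle and sheep are grazing livestock.",
    "Dairy cattle need pasture for grazing.",
    "Vaccination of sheep and cattle on the farm.",
]
draft = draft_class_representation(samples)
print(sorted(draft.items()))
```

A production process would extract multiword concepts and draw on a far larger and more varied sample, as the paragraph above requires.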

Figure 7. Elaborated class definition for livestock and animal husbandry – machine classification [41].

Evaluating the Classification Decision
Given an extensive representation of a class, we can make a strong classification decision. It also supports an objective and verifiable evaluation. This approach provides the information we need to evaluate our first criterion – the goodness of fit of an object to a class. Generally, a good fit will result from a high number of matching properties and a high occurrence of those properties. Figure 8 illustrates the way in which a machine categorization engine might report on the goodness of fit to one or more classes. The classification engine can swiftly conduct a property-by-property, value-by-value comparison of the object and class. 

Figure 8. Evaluating the classification process.
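A goodness-of-fit comparison of this kind might be sketched as follows. The article does not give a scoring formula, so the two numbers returned here (how many class concepts appear in the object, and how often they appear in total) are an assumed, minimal reading of "a high number of matching properties and a high occurrence of those properties." The concept set and the sample text are invented for illustration, and only single-word concepts are handled.

```python
import re
from collections import Counter


def goodness_of_fit(object_text, class_concepts):
    """Score how well an object matches one class representation.

    Returns (distinct_matches, total_occurrences): the number of class
    concepts found in the object and their combined frequency.
    """
    tokens = Counter(re.findall(r"[a-z]+", object_text.lower()))
    # Keep only the class concepts that actually occur in the object.
    matches = {c: tokens[c] for c in class_concepts if tokens[c] > 0}
    return len(matches), sum(matches.values())


# Hypothetical class representation and object text.
livestock = {"cattle", "sheep", "grazing", "herd"}
text = "The herd of cattle moved to new grazing land; cattle prices rose."
print(goodness_of_fit(text, livestock))  # (3, 4)
```

Running the same function against every class representation in a scheme gives a ranked list of candidate classes for the object.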

Our second evaluation criterion for the classification process pertained to minimizing the risk of misclassification. We can better manage misclassification when we have a full picture of goodness of fit of an object to all classes in the classification scheme. A goodness of fit indicator can be calculated for any class where a full representation is available. Figure 9 illustrates the way in which a machine categorization engine might report on the goodness of fit to all classes. Using this approach, institutions may establish thresholds for classification decisions that can be monitored and evaluated. Misclassification is minimized where we can explicitly see which classes may have been overlooked or which may have been selected in error. 

Figure 9. Class goodness of fit results for sample object [42]
<CATEGOR_ITEM.mdx name="Livestock and Animal Husbandry" GOF="1240"/>
<CATEGOR_ITEM.mdx name="Biodiversity" GOF="690"/>
<CATEGOR_ITEM.mdx name="Pollution Control" GOF="552"/>
<CATEGOR_ITEM.mdx name="Dairies and Dairying" GOF="551"/>
<CATEGOR_ITEM.mdx name="Food and Beverage Industry" GOF="446"/>
<CATEGOR_ITEM.mdx name="Agribusiness" GOF="439"/>
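An institutional threshold can be applied to results like those in Figure 9 with a short script. The goodness-of-fit values below are copied from Figure 9; the threshold of 600 and the "previously assigned" classes are hypothetical, chosen only to show how overlooked and possibly erroneous selections would surface.

```python
# Goodness-of-fit values copied from Figure 9.
gof_scores = {
    "Livestock and Animal Husbandry": 1240,
    "Biodiversity": 690,
    "Pollution Control": 552,
    "Dairies and Dairying": 551,
    "Food and Beverage Industry": 446,
    "Agribusiness": 439,
}


def review_decision(scores, assigned, threshold):
    """Flag classes that may have been overlooked or selected in error."""
    above = {name for name, score in scores.items() if score >= threshold}
    assigned = set(assigned)
    return {
        # Scored at or above the threshold but not assigned.
        "possibly_overlooked": sorted(above - assigned),
        # Assigned but scored below the threshold.
        "possibly_in_error": sorted(assigned - above),
    }


# Hypothetical threshold and hypothetical prior assignment.
result = review_decision(
    gof_scores,
    assigned=["Livestock and Animal Husbandry", "Agribusiness"],
    threshold=600,
)
print(result)
```

The appropriate threshold depends on the institution and on the risk attached to each form of misclassification, so it is something to monitor and adjust rather than fix once.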

Conclusions and Observations

We considered evaluation of a hierarchical taxonomy or classification scheme and the classification decisions made when working with a hierarchical taxonomy. We offer three observations for evaluating hierarchical taxonomies. 

  1. The principles we need to evaluate and improve classification schemes are readily available. While they are understandable to information science students, some interpretation is needed for designers, engineers and the general public. 
  2. We can convert these principles into institutional checklists to support periodic evaluation and improvement of classification schemes.
  3. Information science education should include assessment of general hierarchical taxonomies in the curriculum, in addition to introducing students to the commonly used classification schemes. 

In regard to evaluation of classification decisions, we offer four observations.

  1. Classifications are often evaluated indirectly. We have suggested an approach that targets the classification decision more directly.
  2. This approach requires more information about the classes in a classification scheme. It makes explicit the implicit knowledge of classifiers used to make decisions. Providing more information about a class is not a trivial task for existing schemes. However, it is manageable for new schemes.
  3. The approach allows an institution to objectively judge the goodness-of-fit of any decision and to assess the risk of misclassification.
  4. This approach supports rather than substitutes for other evaluation perspectives. Understanding the nature of the classification decision helps us to better understand end user responses to that decision. 

While we considered evaluation of the scheme and the decision separately, we should not overlook the dependencies between a well-formed classification scheme and a well-executed classification decision. Making an optimal classification choice is dependent upon a good representation of the class, a well-formed class and a well-formed classification scheme. While the context in which taxonomies are used has expanded significantly in the past 20 years, the criteria for evaluation have not changed. The expansion in affordable computing power and the availability of semantic technologies provide the capacity to make and evaluate classification decisions in low-risk and objective ways.

Resources Mentioned in the Article
[1] International Standards Organization. (2006). ISO 11179-2 Information technology – Metadata registries (MDR). Part 2. Classification. Retrieved September 15, 2012, from

[2] International Standards Organization. (2011). ISO 25964 Thesaurus schemas. Retrieved September 15, 2012, from

[3] American National Standards Institute (2010). ANSI/NISO Z39.19 (R2010) Guidelines for the construction, format, and management of monolingual controlled vocabularies. Retrieved September 15, 2012, from

[4] Ranganathan, S. R. (1957). Prolegomena to library classification. London: The Library Association. 

[5] Devlin, K. (1993). The joy of sets. New York: Springer Verlag.

[6] Borko, H., & Bernick, M. (1963). Automatic document classification. Journal of the Association for Computing Machinery, 10(2), 151-162.

[7] Borko, H. (1964). Measuring the reliability of subject classification by men and machines. American Documentation, 15(4), 268-273. 

[8] Borko, H., & Bernick, M. (1964). Automatic document classification. Part II. Additional experiments. Journal of the Association for Computing Machinery, 11(2), 138-151. 

[9] Borko, H. (May 1-3, 1962). The construction of an empirically based mathematically derived classification system. Proceedings of the Joint Computer Conference, 21, 279-289. 

[10] Maron, M. E. (1961). Automatic indexing: An experimental inquiry. Journal of the Association for Computing Machinery, 8(3), 407-417. 

[11] Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27(4), 345-359.

[12] Lee, W.-C., & Brennan, R. L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33(5), 374-390.

[13] Lee, W.-C., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26(4), 412-432.

[14] Bedford, D.A.D., & Gracy, K. F. (July 2012). Leveraging semantic analysis technologies to increase effectiveness and efficiency of access to information. Qualitative and Quantitative Methods in Libraries (QQML), 1(1), 13-26.

[15] Fischer, K., (2005). Critical views of LCSH, 1990-2001: The third bibliographic essay. Cataloging & Classification Quarterly, 41(1), 63-109. 

[16] Olson, T. (2008). LCSH to MeSH, MeSH to LCSH. Cataloging & Classification Quarterly, 46(4), 438-439.

[17] Schabas, A. H. (1982). Post coordinate retrieval: A comparison of two indexing languages. Journal of the American Society for Information Science, 33(1), 32-37.

[18] Strader, C. R. (2009). Author-assigned keywords versus Library of Congress Subject Headings: Implications for the cataloging of electronic theses and dissertations. Library Resources and Technical Services, 53(4), 243-250. Retrieved November 14, 2012 from 

[19] Boros, E., Ibaraki, T., & Makino, K. (1998). Error-free and best-fit extensions of partially defined Boolean functions. Information and Computation, 140(2), 254-283. Retrieved November 15, 2012, from 

[20] Chen, M.-H., Dey, D. K., & Ibrahim, J. G. (2004). Bayesian criterion based model assessment for categorical data. Biometrika, 91(1), 45-63.

[21] Cieslak, D., & Chawla, N. (2009). A framework for monitoring classifiers’ performance: When and why failure occurs. Knowledge and Information Systems, 18(1), 83-108.

[22] Cunningham, M. A. (2006). Accuracy assessment of digitized and classified land cover data for wildlife habitat. Landscape and Urban Planning, 78(3), 217-228.

[23] Garland, K. (1983). An experiment in automatic hierarchical document classification. Information Processing and Management, 19(3), 113-120. 

[24] Hammer, P. L., Kogan, A., Simeone, B., & Szedmak, S. (2004). Pareto-optimal patterns in logical analysis of data. Discrete Applied Mathematics, 114(1-2), 79-102. 

[25] Heaps, H. S. (1973). A theory of relevance for automatic document classification. Information and Control, 22(3), 268-278. 

[26] Hidalgo, J. M. G. (2002). Evaluating cost-sensitive unsolicited bulk email categorization. Proceedings of the 2002 ACM Symposium on Applied Computing (SAC ’02), 615-620.

[27] Kattan, M. W., & Cooper, R. B. (1998). The predictive accuracy of computer-based classification decision techniques: A review and research directions. Omega: The International Journal of Management Science, 26(4), 467-482. 

[28] Kelil, A., Wang, S., Jiang, Q., & Brzezinsky, R. (2010). A general measure of similarity for categorical sequences. Knowledge and Information Systems, 24(2), 197-220. 

[29] Lewis, D. D. (1991). Evaluating text categorization. In HLT ’91: Proceedings of the Workshop on Speech and Natural Language [pp. 312-318]. Stroudsburg, PA: Association for Computational Linguistics.

[30] Szollosi, D., Denes, L. D., Firtha, F., Kovacs, Z., & Fekete, A. (2012). Comparison of six multiclass classifiers by the use of different classification performance indicators. Chemometrics, 15(3-4), 76-84. 

[31] Subramanian, V., Hung, M.S., & Hu, M.Y. (1993). An experimental evaluation of neural networks for classification. Computers in Operations Research, 20(7), 769-782. 

[32] Sun, A., & Lim, E.-P. (2001). Hierarchical text classification and evaluation. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001) [pp. 521-528]. New York, IEEE. Retrieved November 14, 2012, from 

[33] Villa Medina, J. L., Boqué, R., & Ferré, J. (2009). Bagged k-nearest neighbour’s classification with uncertainty in the variables. Analytica Chimica Acta, 646(1-2), 62-68.

[34] Winters, W. K. (1965). A modified method of latent class analysis for file organization in information retrieval. Journal of the Association for Computing Machinery, 12(3), 356-363. 

[35] Wyse, A. E. (2011). The potential impact of not being able to create parallel tests on expected classification accuracy. Applied Psychological Measurement, 35(2), 110-126. 

[36] Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1-2), 69-90.

[37] Education Resources Information Center. Thesaurus of ERIC Descriptors. Retrieved September 15, 2012, from

[38] American Heritage Dictionary of the English Language. Livestock. Retrieved September 15, 2012, from

[39] Food and Agriculture Organization. Classification of livestock. Retrieved September 15, 2012, from 

[40] Library of Congress. Library of Congress Subject Headings. Retrieved September 15, 2012, from 

[41] Kahneman, D., Slovic, P., & Tversky, A. (Eds.) (1982). Judgment under uncertainty: Heuristics and biases. New York: Cambridge University Press.

[42] SAS Inc. (2012). SAS enterprise content categorization studio. Retrieved September 15, 2012, from

Denise Bedford is the Goodyear Professor of Knowledge Management at Kent State University. She is a member of several professional associations including ASIS&T, ACM, AIIM, SLA, DAMA, SCIP, ALA and AAAI. She can be reached at