Organizations using controlled vocabularies are highly diverse, but they all rely on a terminology list that is internally standardized, commonly understood and widely applied for metatagging. The basic steps for creating a controlled vocabulary, taxonomy or thesaurus are the same in each case, starting with determining the scope to be covered and identifying representative content sources. These steps are followed by gathering and organizing terms, enhancing them with synonyms and relationships and, often, review by subject matter experts. Using a dedicated taxonomy management tool is recommended to facilitate management, visualization and export in a machine-readable format such as OWL or SKOS. Posting the controlled vocabulary to a registry or data warehouse enables sharing and may stimulate broad acceptance. The MITRE Corporation has followed this process in developing taxonomies consistent with the Department of Defense’s Net-Centric Data Strategy, which bases shared understanding within and among DoD programs on the use of controlled vocabularies and crosswalking equivalent data elements.

controlled vocabularies
index language construction
metadata
information mapping
standardization

Bulletin, December 2012/January 2013


Special Section

Building Controlled Vocabularies for Metadata Harmonization

by Marcie Zaharee

What do IBM Watson, the National Oceanic and Atmospheric Administration (NOAA) National Weather Service (NWS) website and the Department of Defense (DoD) have in common? The answer is… a controlled vocabulary (CV)! Watson has roughly 200 million pages of natural language content (equivalent to 1 million books), a library of facts and a hierarchy of terms [1]. NOAA’s NWS glossary contains information on more than 2,000 terms, phrases and abbreviations used by NWS forecasters to communicate with each other [2]. The DoD has a 553-page Dictionary of Military and Associated Terms [3] that presents approved terminology for use by all DoD components. Watson, NOAA NWS and the DoD benefit from CVs because these tools 

  • provide a standardized list of terms used to tag data and information;
  • provide a shared, commonly understood language to enable communication and knowledge exchange among stakeholders;
  • build a common community of interest (COI)-wide terminology resource; and 
  • define content and other document types, the fields that will describe attributes and the actual values (metadata) [4].

Definitions
The term CV is often used interchangeably with taxonomy and thesaurus, but technically they are different. CVs, taxonomies and thesauri represent increasing levels of complexity, as shown in Figure 1. 

Figure 1. Structure of controlled vocabularies. Redmond-Neal, Alice, 2006 [5]. (Reprinted with permission)

  • CV – A specific list of terms for a specialized purpose. It is controlled because only terms from the list may be used for the subject area covered [6].
  • Taxonomy – A collection of controlled vocabulary terms organized into a hierarchical structure with parent-child relationships [7].
  • Thesaurus – A controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators [7].
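The three levels of complexity can be illustrated with a minimal Python sketch. It is not drawn from the cited sources: the terms (sensors, radar, sonar, lidar, echo ranging) and all names in the code are hypothetical, chosen only to show how each structure adds to the previous one. The BT, NT, RT and UF labels in the comments are the conventional thesaurus relationship indicators (broader term, narrower term, related term, used for).

```python
from dataclasses import dataclass, field

# Controlled vocabulary: a flat, closed list of approved terms.
# All terms here are hypothetical examples.
CV = {"sensors", "radar", "sonar", "lidar"}


def is_allowed(term: str) -> bool:
    """Only terms on the list may be used for tagging."""
    return term.strip().lower() in CV


# Taxonomy: the same terms plus parent-child links (child -> parent).
PARENT = {"sensors": None, "radar": "sensors", "sonar": "sensors", "lidar": "sensors"}


def ancestors(term: str) -> list:
    """Walk up the hierarchy from a term to the top-level category."""
    path = []
    parent = PARENT.get(term)
    while parent is not None:
        path.append(parent)
        parent = PARENT.get(parent)
    return path


# Thesaurus: each preferred term carries typed relationships.
@dataclass
class Entry:
    preferred: str
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    used_for: list = field(default_factory=list)   # UF: non-preferred synonyms


THESAURUS = {
    "sensors": Entry("sensors", narrower=["radar", "sonar", "lidar"]),
    "sonar": Entry("sonar", broader=["sensors"], related=["radar"],
                   used_for=["echo ranging"]),
}


def preferred_term(term: str):
    """Resolve a term to its preferred form, or None if it is not covered."""
    term = term.strip().lower()
    if term in THESAURUS:
        return term
    for entry in THESAURUS.values():
        if term in entry.used_for:
            return entry.preferred
    return None
```

In this sketch, a tagger would reject any term for which `is_allowed` returns False, while `preferred_term("echo ranging")` redirects a non-preferred synonym to the preferred term "sonar".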

Several standards govern the development of CVs. The most common is American National Standards Institute (ANSI)/National Information Standards Organization (NISO) Z39.19-2005 – Guidelines for the Construction, Format and Management of Monolingual Controlled Vocabularies [6]. This standard focuses on CVs used to represent content objects in knowledge organization systems such as lists, synonym rings, taxonomies and thesauri.

Developing a CV or Taxonomy 

Figure 2 outlines a suggested approach for creating a CV or taxonomy: 

Figure 2. Suggested approach for developing a CV/taxonomy

  • Determine Scope. Taxonomists assert that there is no one right way to create a CV and that taxonomies are always evolving. This assertion is true. However, it is important to meet with stakeholders to determine the project scope up front. This step may lead to production of a business case document or scope statement [8] that captures the activities covered by a project and, equally important, defines the elements that are out of scope.
  • Identify Sources. It is essential to review a representative collection of content, to have access to subject matter experts (SMEs) in the relevant domain and to use literature from the COI domain to identify common terminology. Sources include books, journals, department terminology, databases, web resources and search logs (taxonomies built from the latter are sometimes referred to as behavior-based taxonomies) [5]. These sources will help determine core areas for the CV/taxonomy. A key decision at this point is whether to build or buy. Can you use or modify an existing taxonomy? It is a good practice to survey existing taxonomy resources for your domain and then compare them against the terms gathered to date.
  • Plan for Maintenance. A CV/taxonomy is never finished, unless it is no longer used for indexing or its database is no longer being updated. Plan for maintenance as part of development. A poorly maintained CV/taxonomy quickly becomes a liability rather than an asset [9]. Maintaining a CV/taxonomy includes changing terms, changing status of terms, deleting terms or relationships, adding new terms/relationships and even changing the structure of the CV/taxonomy [10].
  • Gather Terms. There are two main methods for building a CV/taxonomy: top down or bottom up. A top-down approach might be to convene a group of SMEs to decide on the scope and categories of terms to be included. A bottom-up method would be to assemble a set of representative, already indexed documents and use the indexed terms as a preliminary term list [9]. You can maintain the list of terms in a flat or relational database.
  • Categorize Terms. Next, you should organize terms into major categories and record the relationships among them, such as parent/child relationships, related terms and preferred and non-preferred terms. ANSI/NISO Z39.19-2005 is an excellent resource for learning and understanding common terminology associated with organizing CVs and taxonomies.
  • Manage Terms. It is beneficial to use a commercial off-the-shelf (COTS) taxonomy management tool or database to manage terms. While there is no authoritative list of taxonomy software, many taxonomy management tools can maintain terms, their associated relationships and other attributes [6]. They include single-user desktop software, large-scale thesaurus systems, free and open source software and other software with taxonomy management components [6]. The nature of the taxonomy project determines what software product is needed.
  • Visualize Terms. A graphical representation of the taxonomy facilitates SME review and validation of subject categorization. Mind-mapping software can graphically display the categorization of a CV/taxonomy. Ontology tools can also provide a visualization capability; however, these tools require expert knowledge of ontology [11]. 
  • Export Terms. Terms should be extracted into a machine-readable language such as Web Ontology Language (OWL) or Simple Knowledge Organization System (SKOS). 
    • OWL is designed for use by applications that must process the content of information rather than merely present information to humans. The language facilitates greater machine interpretability of web content than eXtensible Markup Language (XML), Resource Description Framework (RDF) and RDF Schema (RDF-S) by providing additional vocabulary as well as a formal semantics. OWL has three increasingly expressive sublanguages: OWL Lite, OWL DL and OWL Full. 
    • SKOS provides a standard way to represent knowledge organization systems using the RDF [12].
  • Review/Validate. SMEs can offer invaluable assistance in reviewing terms. Develop a SME review process for reviewing both the terms and the categorization of terms. The RDF/XML from exported files should also be validated. The World Wide Web Consortium (W3C) serves as a reference source.

  • Post to a Registry or Data Warehouse. Finally, you should find a registry or data warehouse where you can post content for knowledge sharing and reuse. Your colleagues will be grateful.
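Three of the steps above (bottom-up term gathering, review/validation and export) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the process used by any tool named in this article: the concept records, the `http://example.org/cv/` base address and all function names are hypothetical. The sketch writes SKOS in Turtle syntax rather than RDF/XML to keep the output readable, and its `validate` function performs only a simple structural check; it does not replace validating the exported file against the W3C specifications.

```python
from collections import Counter

# The SKOS namespace published by the W3C.
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical concept records: id -> preferred label, synonyms, parent id.
# Ids are assumed to be simple names with no spaces or punctuation.
CONCEPTS = {
    "sensors": {"pref": "sensors"},
    "sonar": {"pref": "sonar", "alt": ["echo ranging"], "broader": "sensors"},
}


def gather_terms(indexed_docs):
    """Bottom-up gathering: count index terms already assigned to documents."""
    counts = Counter()
    for terms in indexed_docs.values():
        counts.update(t.strip().lower() for t in terms)
    return counts


def validate(concepts):
    """Structural check before SME review: labels present, parents known."""
    problems = []
    for cid, concept in concepts.items():
        if not concept.get("pref"):
            problems.append(f"{cid}: missing preferred label")
        parent = concept.get("broader")
        if parent is not None and parent not in concepts:
            problems.append(f"{cid}: unknown broader concept '{parent}'")
    return problems


def _literal(text):
    """Format an English-language Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def to_skos_turtle(concepts, base="http://example.org/cv/"):
    """Serialize concept records as SKOS concepts in Turtle syntax."""
    lines = [f"@prefix skos: <{SKOS_NS}> .", f"@prefix ex: <{base}> .", ""]
    for cid in sorted(concepts):
        concept = concepts[cid]
        props = ["a skos:Concept", "skos:prefLabel " + _literal(concept["pref"])]
        props += ["skos:altLabel " + _literal(alt) for alt in concept.get("alt", [])]
        if concept.get("broader"):
            props.append("skos:broader ex:" + concept["broader"])
        lines.append(f"ex:{cid} " + " ;\n    ".join(props) + " .")
    return "\n".join(lines) + "\n"
```

A typical run would call `gather_terms` on a set of already indexed documents to propose a preliminary term list, have SMEs turn the most frequent terms into concept records, confirm that `validate(CONCEPTS)` returns an empty list and then write the result of `to_skos_turtle(CONCEPTS)` to a file for posting.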

Use Case
The DoD Net-Centric Data Strategy (NCDS) – the DoD guide to enabling data sharing in a net-centric environment – seeks to promote shared understanding within and among DoD programs. According to the strategy, the key to understandability is to create and maintain a CV. 

The MITRE Corporation’s Tactical Intelligence, Surveillance, and Reconnaissance (ISR) Integration Metadata Harmonization (MDH) effort serves as an example of a project that directly supports the NCDS vision by developing a method for making data accessible and understandable to users within the intelligence community and the broader DoD. As a part of this project, MDH SMEs identify key data assets – entities composed of data, to include databases, documents and web pages for harmonization – and map the elements of these assets to elements in the DoD Discovery Metadata Specification (DDMS). This process is referred to as performing a crosswalk. The SMEs then prepare a crosswalk package documenting results of the harmonization process and post it on the DoD Metadata Registry for reuse. Ontological products, such as a CV, taxonomy and thesaurus, complement each crosswalk to ensure that the package meets the NCDS goal of ensuring understandability. 
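At its simplest, a crosswalk can be thought of as a lookup table from the element names of a data asset to the element names of the target specification. The Python sketch below illustrates that idea only; it is not the MDH team's method. The source field names are hypothetical, and the target names (title, creator, description) are illustrative placeholders that have not been checked against the DDMS schema.

```python
# Hypothetical crosswalk: source element name -> target element name.
CROSSWALK = {
    "rpt_title": "title",
    "author_name": "creator",
    "summary_text": "description",
}


def apply_crosswalk(record, crosswalk):
    """Rename a record's elements and report the ones with no mapping."""
    mapped, unmapped = {}, []
    for source_name, value in record.items():
        target_name = crosswalk.get(source_name)
        if target_name is None:
            unmapped.append(source_name)   # needs SME attention
        else:
            mapped[target_name] = value
    return mapped, unmapped
```

The list of unmapped elements is the useful by-product: it shows SMEs where the crosswalk is incomplete and where a CV, taxonomy or thesaurus may be needed to reconcile terminology.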

Summary
As illustrated by Watson, NOAA NWS and NCDS, CVs/taxonomies are key to making data assets widely and unambiguously understandable. Users of CVs and taxonomies benefit by having common terms for metadata, enhancing communication between COIs and increasing discovery of information. 

Developing CVs/taxonomies is time consuming; thus, a repeatable process is a key factor in efficiency and success. The approach proposed in this article can serve as a guide to building useful CVs and/or taxonomies. 

Resources Mentioned in the Article
[1] IBM Systems and Technology. (February 2011) Watson – a system designed for answers: The future of workload optimized systems design. (IBM White Paper). Retrieved October 19, www-03.ibm.com/innovation/us/engines/assets/9442_Watson_A_System_White_Paper_POW03061-USEN-00_Final_Feb10_11.pdf 

[2] National Oceanic and Atmospheric Administration. National Weather Service. (n.d.). National Weather Service glossary. Retrieved September 10, 2012, from http://w1.weather.gov/glossary

[3] U.S. Department of Defense. (Original publication November 8, 2010). Dictionary of military and associated terms. (JCS Pub 1-02 Nov 8, 2010 (as amended Aug 15, 2012)). 

[4] Earley & Associates. (August 2012) The business value of taxonomies [webinar]. 

[5] Redmond-Neal, A. (May 21, 2006). Building taxonomies [PowerPoint slides]. Enterprise Search Summit. Retrieved October 18, 2012, from www.accessinn.com/library/presentations/buildingtaxonomies/buildingtaxonomies05-displaysanduses.ppt

[6] Hedden, H. (2010). The accidental taxonomist. Medford, NJ: Information Today.

[7] ANSI/NISO. (July 5, 2005). Guidelines for the construction, format, and management of monolingual controlled vocabularies. (ANSI/NISO Z39.19-2005, July 5, 2005).

[8] Breininger, K., & Whittaker, M. Developing a taxonomy. Presented at Internet Librarian 2007 – Track B: Enterprise Trends. Retrieved September 13, 2012, from www.infotoday.com/il2007/Presentations/B205_Breininger_Whittaker_Handout.pdf

[9] American Society for Indexing. (n.d.) How do I build a thesaurus? About Indexing. Retrieved September 10, 2012, from www.asindexing.org/i4a/pages/index.cfm?pageid=3623 

[10] Hlava, M. (2010). Taxonomy creation – Experts or not [PowerPoint slides]. http://wiki.sla.org/download/attachments/54264068/Subj_Expertise-Hlava-SLA2010.pdf

[11] Lanzenberger, M., Sampson, J., & Rester, M. (2010). Ontology visualization: Tools and techniques for visual representation of semi-structured meta-data. Journal of Universal Computer Science, 16(7), 1036-1054.

[12] W3C. (February 10, 2004). OWL Web Ontology Language: Overview. Retrieved September 10, 2012, from www.w3.org/TR/owl-features/ 
 


Marcie Zaharee currently works in the MITRE Center for Integrated Intelligence Systems (CIIS) where she is the lead for developing a controlled vocabulary, taxonomy and thesaurus for the intelligence, surveillance and reconnaissance (ISR) community. Marcie joined MITRE in 2005 as the associate department head for information management and practice where she worked to advance knowledge management in MITRE, championing efforts that support staff collaboration, knowledge sharing and strengthening MITRE's knowledge base. She can be reached at mzaharee<at>mitre.org.

MITRE is a not-for-profit corporation, chartered to work solely in the public interest, that operates multiple federally funded research and development centers (FFRDCs). An FFRDC is a unique organization that assists the U.S. government with scientific research and analysis, development and acquisition, and/or systems engineering and integration. FFRDCs address long-term problems of considerable complexity, analyze technical questions with a high degree of objectivity and provide creative and cost-effective solutions to government problems.
Approved for Public Release: 12-3989. Distribution Unlimited
©2012-The MITRE Corporation. All rights reserved