A thesaurus should quietly streamline website operations behind the scenes, but it can become an obstacle if it is inadequate or falls out of date, a situation the Public Library of Science (PLOS) recognized in 2011. The organization sought help from Access Innovations, Inc., to bring its thesaurus up to speed, comply with the Z39.19 standard and ensure support of machine categorization of content. An analysis revealed over- and under-used terms and provided recommendations and specific guidance. With this knowledge and various use cases, PLOS determined that a new customized thesaurus was needed. Access Innovations expanded and organized the original PLOS vocabulary of 3,132 terms to produce a thesaurus of more than 10,000 terms covering 10 science topics in seven levels. PLOS monitored weekly progress, and subject matter experts reviewed the draft for terminology and organization. Thesaurus terms will be applied to content as subject metadata using Data Harmony’s MAIstro, a rule-based categorization system permitting editorial oversight and incremental improvements. After testing and tuning rules for accuracy, the system will be used to automatically index all PLOS content.
index language construction
machine aided indexing
Bulletin, December 2012/January 2013
Case Study: Developing the PLOS Thesaurus
by Jonas Dupuich and Gabe Carr
When a good thesaurus is working well, you barely notice it. People searching your website click on relevant terms and find relevant results. Those responsible for creating metadata do so with pleasure and ease. “Yes,” they may say, “this concept perfectly captures a key aspect of this work.”
Thesauri less suited to their tasks lead to any number of difficulties. Creating metadata becomes a slow and difficult task, and results suffer. Discovery is hindered, and users shy away from relying on selected terms. Analysis becomes impossible.
In late 2011, the Public Library of Science (PLOS) found itself struggling to get value from its thesaurus. Key terms were missing across a variety of disciplines, while more than 1,000 terms went unused. It was time for a change.
To gain a better understanding of the problem, PLOS contracted with Access Innovations (AI), an information services provider, to analyze its thesaurus. AI created a study with the aims of
- creating a robust, standards-compliant (Z39.19) version of the thesaurus adequate for machine analysis;
- testing the thesaurus against the published PLOS corpus – over 21,000 papers;
- identifying the work needed to produce a thesaurus that meets PLOS needs; and
- developing a budget for the work.
AI completed the study in January 2012. The study identified over-used, under-used and unused terms. Radial graphs highlighted imbalances in term distribution. Most importantly, the study provided a path forward. Key recommendations provided general guidance about term selection:
- All terms should be ones that would normally be typed into search queries.
- All terms should represent concepts in articles covered by PLOS journals.
In addition, it provided specific guidance about how to build out the PLOS thesaurus:
- Review terms used fewer than five times as possible synonyms.
- Break up terms used more than 1,000 times into smaller conceptual areas.
- Review unused terms with an eye toward removal.
- Check for under-indexed articles as a way of analyzing for missing concepts.
- Add terms from AI’s STEM thesaurus.
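The usage thresholds in this guidance lend themselves to a simple report. The following is a minimal sketch, not AI's actual analysis: it assumes each article's applied terms are available as a list, and the function names, the data layout and the `minimum` of three terms per article are hypothetical choices made for illustration.

```python
from collections import Counter

# Thresholds come from the study's guidance; everything else is illustrative.
RARE_BELOW = 5          # terms used fewer than five times
OVERUSED_ABOVE = 1000   # terms used more than 1,000 times


def classify_terms(thesaurus_terms, article_terms):
    """Group thesaurus terms by how many articles they are applied to.

    thesaurus_terms: iterable of term strings.
    article_terms: mapping of article id -> list of applied terms.
    """
    usage = Counter()
    for terms in article_terms.values():
        usage.update(set(terms))  # count each term once per article

    report = {"unused": [], "rare": [], "overused": [], "ok": []}
    for term in sorted(thesaurus_terms):
        n = usage[term]
        if n == 0:
            report["unused"].append(term)    # review with an eye toward removal
        elif n < RARE_BELOW:
            report["rare"].append(term)      # review as possible synonyms
        elif n > OVERUSED_ABOVE:
            report["overused"].append(term)  # candidates for breaking up
        else:
            report["ok"].append(term)
    return report


def under_indexed(article_terms, minimum=3):
    """Return ids of articles with fewer than `minimum` distinct terms.

    The default of 3 is an assumed cutoff, not a figure from the study.
    """
    return sorted(a for a, terms in article_terms.items()
                  if len(set(terms)) < minimum)
```

A report like this only flags candidates. Deciding whether a rare term is really a synonym, or how to split an over-used term, remains an editorial judgment.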
One of the biggest outstanding questions for PLOS was whether existing thesauri would meet its needs or whether rebuilding its current thesaurus would provide the most effective starting point. AI’s study recommended the latter.
Well-known thesauri like MeSH, the National Library of Medicine’s controlled vocabulary thesaurus, are great for the areas they cover, but they tend to be deep within defined fields or shallow across a broad array of subjects. As a multi-disciplinary publisher, PLOS publishes papers that don’t fit into neat categories. For a thesaurus to successfully represent content across all biomedical fields as well as disciplines as diverse as information science and paleontology, PLOS needed a custom-built thesaurus.
With a clear understanding of the project scope and the resources required to complete it, PLOS committed to rebuilding the thesaurus.
Rebuilding the Thesaurus
Before starting the work to rebuild the thesaurus, PLOS collected the use cases that would define its utility and guide its development.
The collection of use cases extended beyond current cases to include future uses for the thesaurus. One of the more painful consequences of an ill-suited thesaurus was that it became a barrier to the adoption of new services designed to leverage well-classified content.
As a primary use case, PLOS uses terms from its thesaurus to enhance the metadata of its published papers. Once terms have been assigned to a given paper, they assist in discovery services, including the search and browse features available on PLOS websites. Future cases include strategic activities like trend and data analysis and identifying gaps in editorial boards. Table 1 provides a complete list of use cases.
|Table 1. Current and proposed use cases for the PLOS thesaurus.|
Workflow and Matching Activities
A final decision leading up to the rebuild was vendor selection. As AI had outlined the scope of the project and completed a lot of the prefatory work, PLOS was able to leverage this effort by contracting with AI to rebuild the thesaurus.
To kick things off, PLOS delivered the following to AI:
- current copy of the PLOS thesaurus
- use cases for the thesaurus
- over 2,000 changes and additions collected by editorial staff
- guidance for developing a research analysis and methods branch
As a resource for expanding the thesaurus, AI drew upon an internally created science, technology, engineering and medicine (STEM) thesaurus. AI produced a list of approximately 15,000 STEM terms that were absent from the existing thesaurus, along with the frequencies with which they appeared in the PLOS corpus. From this list, AI added those terms judged sufficiently significant and specific to be valuable for detailed subject coverage and optimum indexing accuracy.
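A candidate list of this kind can be approximated with a set difference and a frequency count. The sketch below is an illustration under stated assumptions, not AI's tooling: the function name is hypothetical, frequency is counted as the number of articles mentioning a term, and matching is a naive case-insensitive whole-phrase search.

```python
import re
from collections import Counter


def candidate_terms(stem_terms, existing_terms, corpus):
    """List source-thesaurus terms missing from the existing thesaurus.

    Returns (term, article_count) pairs, most frequent first.
    corpus: iterable of article texts (plain strings).
    """
    existing = {t.lower() for t in existing_terms}
    missing = [t for t in stem_terms if t.lower() not in existing]

    counts = Counter()
    for text in corpus:
        for term in missing:
            # Whole-phrase match; real matching would also handle
            # plurals, hyphenation and spelling variants.
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[term] += 1

    return sorted(((t, counts[t]) for t in missing),
                  key=lambda pair: (-pair[1], pair[0]))
```

The ranked output would serve only as input to the selection step: judging which candidates are significant and specific enough to add is still a taxonomist's call.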
During the rebuild process, the structure of the hierarchy developed naturally, with the placement of new terms often suggesting the need for new branches or reorganization of existing ones. PLOS met with AI weekly to discuss any concerns that arose during the week. These conversations usually revolved around basic questions of term organization such as, “Where does cognitive science belong?” “What is the relationship between information science and computer science?” “Will we limit the thesaurus to five tiers or six – or more?”
Following AI’s progress over the course of the project was easy as AI provided PLOS with web access to the current draft thesaurus. While AI put the finishing touches on the thesaurus, PLOS lined up reviewers and subject matter experts to provide feedback on AI’s draft. This allowed PLOS to begin the review the day after the draft arrived.
The review focused on two tasks – ensuring the right terms were included in the thesaurus and ensuring the terms were organized in a fashion that suited the PLOS use cases. The latter task proved difficult. On one hand, PLOS wanted the terms to be organized in an absolute sense that accurately reflected the relationships between terms. On the other hand, PLOS wanted the terms to provide users with clear categories that represented the areas in which it publishes. The tension between these goals extended the review period beyond the original estimate of two weeks, but it ultimately allowed PLOS to address important questions that were not identified up front. An outline of the review process included the following steps:
- Senior editors provided feedback about the top-level terms.
- A subject matter expert used this feedback to create a candidate list of top-level terms.
- A thesaurus policy group reviewed all feedback and forwarded its recommendations to AI.
- AI suggested minor enhancements, which were later approved by the thesaurus policy group.
The result was a 10,400-term thesaurus comprising seven tiers. Table 2 displays a list of top-tier terms.
|Table 2. Top-tier terms|
Applying the Thesaurus
Rebuilding the PLOS thesaurus goes a long way toward addressing PLOS needs for a controlled vocabulary, but it only gets at half the problem. In addition to identifying deficiencies in the PLOS taxonomy, AI’s initial study made it clear that terms had not been adequately applied to the published corpus. Typically, submitting authors select terms from the PLOS taxonomy to identify concepts addressed in their papers. Journal staff supplement these terms to ensure that submissions have a minimum number of appropriate terms.
While this process captures many of the more important concepts present in papers, significant numbers of concepts were being neglected. This finding is not surprising given the approach. Machine suggestions can be far more complete than manual suggestions. Experts selecting from machine-suggested terms tend to produce even better results, but as this approach isn’t currently available, PLOS decided to investigate machine-aided indexing to improve its subject area metadata.
Relying on industry contacts and previous experience, PLOS identified candidate software applications to perform machine-aided indexing. After analyzing the top contenders, PLOS selected AI’s MAIstro for the job.
Figure 1. PLOS 2012 thesaurus in Data Harmony's dynamic view. In line with PLOS' strong emphasis on medicine, disciplines like neurology and its associated pathologies received particular attention as the thesaurus was rebuilt and expanded.
One of the benefits to PLOS of the MAIstro approach is that indexing relies on a flexible rule-based system that can be enhanced over time to produce high quality results. By default, every thesaurus term has a simple rule (for example, apply “Biology” when the term biology appears in the text). By applying conditions, such as requiring upper or lower case, proximity with other text strings or position within the sentence, AI creates complex rules that can better isolate concepts and index with more appropriate terms (for example, distinguish between plant “cells” and prison “cells”). The better the rule base, the better the metadata. At AI’s suggestion, PLOS will aim for a mix of 80% simple rules and 20% complex rules as a starting point before integrating the service into its production workflow.
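The difference between simple and complex rules can be illustrated with a small sketch. This is not MAIstro's rule syntax or API, which this article does not show; the functions, the proximity window of five words and the context words are all hypothetical and chosen only to mirror the "cells" example.

```python
import re


def simple_rule(term):
    """Default rule: apply the term whenever it appears in the text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return lambda text: term if pattern.search(text) else None


def proximity_rule(term, trigger, context_words, window=5):
    """Complex rule: apply `term` only when `trigger` occurs within
    `window` words of at least one of `context_words`."""
    context = {w.lower() for w in context_words}
    trigger = trigger.lower()

    def rule(text):
        words = re.findall(r"[A-Za-z]+", text.lower())
        for i, word in enumerate(words):
            if word == trigger:
                nearby = words[max(0, i - window): i + window + 1]
                if context.intersection(nearby):
                    return term
        return None

    return rule


def index_text(text, rules):
    """Run every rule and collect the terms that fire, without duplicates."""
    found = []
    for rule in rules:
        term = rule(text)
        if term and term not in found:
            found.append(term)
    return found


rules = [
    simple_rule("Biology"),
    proximity_rule("Plant cells", "cells", ["plant", "leaf", "chloroplast"]),
]
print(index_text("Biology students imaged leaf cells under a microscope.", rules))
print(index_text("Inmates were held in crowded prison cells.", rules))
```

In this sketch the first sentence receives both terms, while the second receives none because no plant-related word appears near "cells". Each added condition trades some recall for precision, which is one reason a rule base is tuned incrementally.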
The rule building is proceeding as this article is being written. Upon receipt of the final rules, PLOS will test sample output, make necessary adjustments and prepare for implementation. Implementation will include the following steps:
- PLOS will submit article text to MAIstro.
- MAIstro will return terms and term frequencies to be indexed for search purposes.
- PLOS will re-index all previously published articles (now over 50,000 articles).
- PLOS will make the new terms available to users in Editorial Manager, the peer-review workflow management system produced by Aries Systems.
As AI is working to complete the rule building, PLOS is preparing an implementation plan. As a starting point, PLOS is looking to submit the following sections of published content to MAIstro for indexing:
- Methods and Materials (when present)
- Results (when present)
From these sections, PLOS will add the top seven terms returned by MAIstro in an effort to capture the key concepts present in submitted articles.
Although PLOS doesn’t expect to obtain optimal results from this approach as trade-offs are involved, it believes these additions will be a good starting point from which it can learn and continue to improve in the future.
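The selection step described here can be sketched as follows. The sketch assumes the indexing service returns a mapping of term to frequency for each submitted section and that frequencies are summed across sections before ranking; both the data shape and the summing are assumptions for illustration, not details confirmed by PLOS or AI.

```python
from collections import Counter

SECTIONS = ("Methods and Materials", "Results")  # sections named in the plan
TOP_N = 7                                        # "top seven terms"


def top_terms(section_terms, sections=SECTIONS, n=TOP_N):
    """Combine per-section term frequencies and keep the n most frequent.

    section_terms: mapping of section title -> {term: frequency}.
    Sections missing from the mapping are skipped ("when present").
    Ties are broken alphabetically so the result is deterministic.
    """
    combined = Counter()
    for name in sections:
        combined.update(section_terms.get(name, {}))  # adds frequencies
    ranked = sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:n]]
```

A fixed cutoff is one of the trade-offs involved: an article with few strong concepts may pick up marginal terms, while a broad article may lose relevant ones.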
Jonas Dupuich is product manager at the Public Library of Science (PLOS). He can be reached at jdupuich<at>plos.org.
Gabe Carr is a taxonomist at Access Innovations. He can be reached at gabe_carr<at>accessinn.com.