For millennia, information about biological diversity has been collected as a way to understand the living world. The information record has evolved from early papyrus to modern electronic collections, but challenges to information access persist, despite and partly due to a variety of digital techniques and applications. Information exchange is hampered by lack of access to previous work, inconsistent naming, changes over time and insufficient resources to create comprehensive databases supporting federated search with Darwin Core metadata. With progress in biodiversity informatics, we will see greater use of DNA barcoding and metagenomic techniques to describe species, remote sensing tools and geographic information systems to detect and describe species’ locations and movements and to identify habitats and environmental conditions. Biodiversity informatics provides essential scientific knowledge to better understand global ecosystems and to inform land use and policy decisions.

biology
informatics
scientific and technical information

Bulletin, August/September 2011


Biodiversity Informatics

by P. Bryan Heidorn

Biodiversity informatics has been around since early civilization, including the Ebers Egyptian Medical Papyrus dated from 1500 BCE, which was derived from much older texts. In fact, some might argue that biodiversity informatics began when shamans and healers began teaching their pupils about the names and uses of plants. It is only the technology that has changed to answer the key questions: What plants and animals live in a location, what are they called, how do they live and how do they relate to humans? Every amateur birder, butterfly collector and nature lover is actively engaged in biodiversity informatics. Many of them use paper maps and paper-based dichotomous keys to identify their birds, lepidopterons or other living things. Some use and contribute to more recently developed online databases of birdcalls and observation records of fellow enthusiasts. The remainder of this article will focus on advances in the field over the past couple of decades.

Biodiversity informatics is a subdiscipline of biological informatics. Biological informatics comprises the information tools used in all of biology including everything from biomolecular structure to global ecosystems. Biodiversity informatics tends to focus more on entire organisms, the interaction of different organisms and their place in the environment. The contraction, bioinformatics, has come to refer to molecular biology only, but many in the field of biodiversity informatics use the term bioinformatics to refer to their work. 

Relying on Literature Standards
The form of literature on biodiversity has changed over its several thousand years of history, from formularies baked on clay tablets to papyrus, scrolls and finally bound books. Some of these works can be found at the great museums of the world where natural history and cultural history cross. Biodiversity – and taxonomy in particular – is unique in that the literature since the time of Linnaeus is used regularly. A scientist cannot name a new species until she has searched the literature to ensure that it does not already have a Linnaean scientific name. Some of the publications are now hundreds of years old and rare. Some contain now-valuable art depicting plants and animals and are much sought after by collectors.

This lasting and continuing value has led to efforts to digitize the collections and make them available in a more cost-effective manner to scientists all over the world, including scientists and other interested people in the developing world, who previously did not have access to these publications unless they traveled to Europe or the United States. The Biodiversity Heritage Library (www.biodiversitylibrary.org/) is one such effort that has digitized 36 million pages to date. Originally a collaboration between major museum libraries in the United States and the United Kingdom, the collaboration has recently expanded to include much of Europe, China and Australia. 

While these efforts provide unprecedented access to these materials, there are many informatics challenges, which include, among others, poor optical character recognition (OCR) and page-level access. Existing OCR technology cannot handle the huge variation in fonts in these publications and the multiple languages sometimes within the same document. Particularly troublesome is the identification of scientific names in the OCR output. Special-purpose software tools such as TaxonFinder have improved the situation, but the problem is far from solved. There is no global index to articles, chapters and taxonomic treatments in these newly digitized documents, so another area of research is the identification of sections, articles and taxonomic treatments within the publications so that improved indexes can be constructed.
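To make the name-identification problem concrete, here is a minimal sketch of one naive approach: look for a capitalized word followed by a lowercase word and keep the pair only if the first word is a known genus. The pattern and the genus list are assumptions made for illustration; this is not how TaxonFinder works, and it would miss names that OCR has garbled.

```python
import re

# Capitalized genus followed by a lowercase epithet, e.g. "Rosa multiflora".
# This pattern is an illustrative assumption, not TaxonFinder's algorithm.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_names(text, known_genera):
    """Return binomial-like strings whose first word is a known genus."""
    found = []
    for genus, epithet in BINOMIAL.findall(text):
        # Filtering on the genus discards ordinary sentence openers
        # such as "The invasive" or "Near the".
        if genus in known_genera:
            found.append(f"{genus} {epithet}")
    return found


# Hypothetical OCR output and a tiny hypothetical genus list.
ocr_text = "The invasive Rosa multiflora was seen. Near the road Lavia frons roosted."
genera = {"Rosa", "Lavia"}
names = candidate_names(ocr_text, genera)
print(names)
```

The genus filter is what keeps false positives down. Without it, any sentence that starts with a capitalized word would yield a spurious "name".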

Standardizing Markup and Extraction. While some people wish to read biodiversity publications from front to back, more often people simply wish to find facts in the publications. To address this need, biodiversity informatics also includes the semantic markup, semantic information extraction and text fusion from biodiversity materials. Semantic markup and extraction use software techniques such as machine learning to add semantic tagging, sometimes in XML, within digital documents to identify relevant facts. The techniques might be used to identify not only treatment boundaries but also taxonomy (scientific names) and morphological characteristics in descriptions, such as leaf shape in plants, antennae shapes in beetles, butterflies and moths or any other morphological characteristic of a plant or animal. This markup makes it easy to extract the information to make new indexes or more structured, machine-readable descriptions as discussed next. Text fusion collects facts from multiple publications to create a more detailed description of a plant or animal and to help detect inconsistencies in descriptions.

New biodiversity literature is born-digital, but unfortunately most of that text has structural or presentation markup but not semantic markup, meaning that it cannot be used for machine-to-machine processing. TaxonX and taXMLit are two semantic markup standards that are being used by different projects, but application of the formats is expensive and time consuming, so an active area of research is the development of tools to facilitate the process.
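The difference semantic markup makes can be sketched in a few lines. In the example below, the element names (`treatment`, `taxonName`, `character`) and the description text are hypothetical and chosen only for illustration; they are not taken from the TaxonX or taXMLit schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names and invented description text, for illustration only.
# These are NOT the element names defined by TaxonX or taXMLit.
MARKED_UP = """
<treatment>
  <taxonName>Rosa multiflora</taxonName>
  <description>
    Leaves <character name="leaf shape">pinnately compound</character>,
    flowers <character name="flower color">white</character>.
  </description>
</treatment>
"""


def extract_facts(xml_text):
    """Pull the taxon name and the tagged characters out of one treatment."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("taxonName")
    characters = {c.get("name"): c.text for c in root.iter("character")}
    return name, characters


name, characters = extract_facts(MARKED_UP)
print(name, characters)
```

Once the facts are tagged, a program can collect them without any understanding of the surrounding prose. Structural or presentation markup alone does not allow this.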

Introducing Interactive Taxonomic Keys. Biodiversity literature also contains information to help people identify species. Such identification is accomplished with morphological descriptions and geographic range but also with the more structured taxonomic keys. A key identifies for the reader distinguishing characteristics to differentiate groups of species. For example, if a reader wishes to identify a pine tree in a forest, a key might instruct the reader to first determine the easily distinguishable characteristic of number of needles per bundle. One set of species has five needles per bundle; another set of pines has four needles per bundle and another three and yet another two. Once the reader has decided the number of needles per bundle, the key then instructs the reader to look for other characteristics such as needle length, which will reduce the number of candidate species even further. Given enough characteristics the reader can identify the species of pine. This method can be error prone, however, because of the need to have a fixed order of characteristics.

Computers have made interactive keys possible. These tools are sometimes called multi-entry keys. Some interactive keys such as Lucid and IntKey allow characteristics to be identified by the user in any order. Some keys are tolerant of errors and still suggest identifications even if some characters are entered incorrectly. Some keys dynamically reorder characteristics to suggest the next best distinguishing characteristic based on information theory or on an estimate of how easily a user can see the characteristic. Since there are now many interactive key programs with different advantages and disadvantages, it is useful to be able to exchange information among the programs. This exchange is made possible through the Structure of Descriptive Data (SDD) standard developed by the Taxonomic Databases Working Group (TDWG), now called Biodiversity Information Standards.
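A multi-entry key can be sketched as a filter over a character matrix, with Shannon entropy used to suggest which characteristic to ask about next. The taxa and character states below are invented placeholders, and the entropy ranking is only one plausible reading of "based on information theory"; it is not a description of how Lucid or IntKey are implemented.

```python
import math
from collections import Counter

# Toy character matrix; taxa and character states are invented placeholders.
TAXA = {
    "pine A": {"needles per bundle": 5, "needle length": "long"},
    "pine B": {"needles per bundle": 3, "needle length": "long"},
    "pine C": {"needles per bundle": 2, "needle length": "long"},
    "pine D": {"needles per bundle": 2, "needle length": "short"},
}


def filter_taxa(taxa, observations):
    """Keep taxa consistent with every observation, entered in any order."""
    return {
        name: chars
        for name, chars in taxa.items()
        if all(chars.get(k) == v for k, v in observations.items())
    }


def entropy(taxa, characteristic):
    """Shannon entropy (bits) of a characteristic over the candidate taxa."""
    counts = Counter(chars[characteristic] for chars in taxa.values())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def next_best(taxa, already_used):
    """Suggest the unused characteristic that splits the candidates most evenly."""
    if not taxa:
        return None
    remaining = [c for c in next(iter(taxa.values())) if c not in already_used]
    if not remaining:
        return None
    return max(remaining, key=lambda c: entropy(taxa, c))


print(next_best(TAXA, set()))
print(list(filter_taxa(TAXA, {"needle length": "short", "needles per bundle": 2})))
```

Because `filter_taxa` accepts observations as an unordered mapping, the user can start with whichever characteristic is easiest to see. This removes the fixed order that makes printed keys error prone.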

A Rose by Any Other Name. A rose by any other name leads to confusion. “D’ardaigh” only means something if we know it is Irish for rose. Roses fall under the family Rosaceae, and there are over 100 species of rose and perhaps thousands of varieties and cultivars. The hybrid tea rose of Valentine's Day is very different from the invasive Rosa multiflora that has taken over many roadsides and forest clearings. It is not clear what we are talking about unless we are much more precise with the name. This kind of confusion was why Linnaeus developed the binomial naming system for living things. No published list of all named species exists. Given that there are almost two million named species, many with description, this list would be a very long book. 

Unfortunately, many inconsistencies have crept into the naming of species since the time of Linnaeus. Some species have unknowingly been given different scientific names by different researchers simply because the second researcher was unable to find the first reference in the literature. Sometimes they have been given different names just because scientists disagree. The names for species and genera have changed over time as we have learned more about species and their ancestral relationships using phylogenetics, paleontology and other techniques.

Biodiversity informatics provides tools to attempt to create a digital version of a complete list of species. There are multiple projects around the world centered on different taxonomic groups that are collected by Species 2000. The goal of the Species 2000 project is to create a validated checklist of all of the world's species including plants, animals, fungi and microbes (www.sp2000.org). Such a list of course excludes the species that remain unnamed and undescribed, which is the majority of species. Technologies such as life science identifiers (LSID) and other forms of global unique identifiers (GUID) are being tested to help untangle the name references as they change through time. 

These lists contain only names. E. O. Wilson has an even more ambitious vision of a web page of every species on earth. The Encyclopedia of Life project (www.eol.org) is attempting to do just that by bringing together digital information about species from many sources. Unfortunately, there frequently are not standards or institutions for this synthesis of information. 

Accessing Museum Collections
One way to organize the field of biodiversity informatics is to begin in the field with the original collection event, field being the biologist’s term for the forests, meadows, deserts, lakes, rivers, oceans and even frozen glaciers where they work. When biologists go to the field they observe and collect living organisms ranging in size from microbes to whales. Much of the information about these species is gathered in museums, as we all have seen on the plaques attached to the dinosaurs in the main display rooms of museums around the world. Sometimes the specimen itself may be put in the museum along with information about the specimen, ideally including the name, date of collection, location, collector and environmental information. Sometimes it is only the information and not the specimen that makes it to the museum. Figure 1 is an example of an herbarium specimen. Typical information includes the name or taxonomy of the item, the location where it was collected, the date of collection, the name of the collector and perhaps some information about the habitat where it was collected. 

While you could not tell it from the public displays of specimens, there are billions of specimens in museums kept back in the reference collections where professionals can use them for a large number of tasks discussed below. These collections are the main representation of the biodiversity of the planet, and the organization of these collections facilitates their systematic use.

Figure 1. An example of an herbarium specimen


One of the chief uses of the collection is for the identification of species. For example, there are about a million species of beetle in the world and about 400,000 of these are named and in museums. When a scientist finds a new specimen and is not certain of the name, not being able to memorize hundreds of thousands of names, the scientist searches the collections for specimens that have been named by previous entomologists. Interactive keys do not exist for this range of species. For ease of reference and access, Linnaeus developed a classification scheme that orders life into larger similar groups including genus, family, order and other taxonomic levels. Even with this and other orderings in the collection, specimens are very difficult to find, so shortly after computers became commercially available, scientists began constructing museum management systems geared to biological collections. Modern descendants of these computer systems allow scientists to search a museum by Linnaean taxonomy, location collected and many other attributes of the specimen. Some highly functional custom-built systems still exist including the TROPICOS system at the Missouri Botanical Garden that is used by thousands of scientists. Most museum management systems have been replaced by commercial or non-profit, professionally developed systems such as KE Emu and SPECIFY. 

Facilitating Data Federation
There is a key weakness in systems that only index information in a single museum. A scientist wanting to find, for example, specimens of Circellium bacchus (a South African dung beetle) in the past would have needed to search the databases of each and every museum in the world to find all specimens of interest. In the 1990s, experimental systems such as Species Analyst were developed that provided for federated search. Federated search meant that people could submit one query to multiple museums all at one time. Systems like Species Analyst were inspired by the library search federation standard Z39.50. Around 2008 members of the TDWG developed standards for biological data federation. These standards included Darwin Core (DwC), named to acknowledge Dublin Core (DC) that had helped inspire it. Rather than structuring bibliographic information, like DC, DwC structures biological collections records. DC elements such as Creator and Date were replaced with terms relating to scientific name, collection event, location and other pertinent information.
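The idea of federated search over Darwin Core records can be sketched as follows. The field names (`institutionCode`, `catalogNumber`, `scientificName`, `eventDate`, `country`) are Darwin Core terms, but the museums and records are invented, and the providers are plain in-memory lists. A real federation would send the query over a network protocol rather than loop over local data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    # Field names are Darwin Core terms; all values below are invented.
    institutionCode: str
    catalogNumber: str
    scientificName: str
    eventDate: str
    country: str


# Hypothetical provider databases, one list of records per museum.
PROVIDERS = {
    "MUSEUM-A": [
        Occurrence("MUSEUM-A", "A-0001", "Circellium bacchus", "1987-02-14", "South Africa"),
        Occurrence("MUSEUM-A", "A-0002", "Lavia frons", "1990-06-03", "Kenya"),
    ],
    "MUSEUM-B": [
        Occurrence("MUSEUM-B", "B-0417", "Circellium bacchus", "2001-11-20", "South Africa"),
    ],
}


def federated_search(providers, scientific_name):
    """Send one query to every provider and merge the matching records."""
    hits = []
    for records in providers.values():
        hits.extend(r for r in records if r.scientificName == scientific_name)
    return hits


hits = federated_search(PROVIDERS, "Circellium bacchus")
for hit in hits:
    print(hit.institutionCode, hit.catalogNumber, hit.eventDate)
```

The merge step is trivial here only because every provider uses the same field names. That shared vocabulary is what a standard such as DwC supplies.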

DwC can be serialized in XML but also in tabular and other forms. There are discussions underway within the biodiversity informatics community to represent this information in RDF for use on the semantic web. DwC data can be collected in central locations using protocols such as TAPIR. The Global Biodiversity Information Facility (GBIF) is an international effort centered in Copenhagen to create one central reference point where scientists and others can go to search the collections of the museums of the world. Readers are encouraged to go to www.gbif.org to search for museum specimen records for their favorite bird, butterfly, plant or Carabid beetle. 

While there are hundreds of millions of records in GBIF, it falls far short of the estimated billions of specimens in museums. The difficulty is that less than 10% of specimen records are in digital format and a much smaller percentage of this 10% is available through the Internet in standards such as DwC. Estimates of the cost of digitizing a single specimen vary from $0.50 to several dollars depending on the richness of the digital record. This cost becomes prohibitive when multiplied by billions of specimens. Consequently, one area of current biodiversity informatics research focuses on the digitization process; that is, getting specimen records from specimens and paper into databases. Methods such as those used to read addresses of postal mail and other technologies have proven to be inadequate for the diversity of data on museum specimens.
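A back-of-the-envelope calculation shows why the total is prohibitive. The inputs are assumptions chosen to be conservative readings of the figures above: one billion specimens as the low end of "billions", 10% already digitized, and $3.00 as a stand-in for "several dollars".

```python
def digitization_cost(total_specimens, fraction_digitized, cost_per_specimen):
    """Cost in dollars to digitize the specimens that are not yet digital."""
    remaining = total_specimens * (1 - fraction_digitized)
    return remaining * cost_per_specimen


# Assumed inputs: 1 billion specimens, 10% already digitized.
low = digitization_cost(1_000_000_000, 0.10, 0.50)   # $0.50 per specimen
high = digitization_cost(1_000_000_000, 0.10, 3.00)  # assumed "several dollars"

print(f"low estimate:  ${low:,.0f}")
print(f"high estimate: ${high:,.0f}")
```

Even under these conservative assumptions the total runs from hundreds of millions to billions of dollars, which is why lowering the per-specimen cost is a research priority.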

Advancing Genome-Based Species Identification
Barcode of Life. Even if we did have full access to the metadata about museum holdings, only highly trained experts would be able to correctly identify a beetle by comparing it to the 400,000 known species. Similar problems exist with all insects and many other forms of life. DNA technology is helping to solve the species identification problem. Full DNA sequencing is still very expensive and only a very small set of species has been sequenced, most frequently the so-called model organisms like mouse and Arabidopsis. We are still a long way from the tricorder available in Star Trek, but a technique called DNA barcoding is being applied to drastically lower the cost of identifying a species using DNA to below a dollar per sample. The International Barcode of Life project (ibol.org) helps organize efforts to create a library of relatively short (about 600 nucleotide) DNA sequences that uniquely identify species.

While the creation of the sequences is biochemistry, the management of the data and statistical classification of the sequences is informatics. The Barcode of Life Data System (BOLD) now holds about 1.2 million specimen sequences representing over 100,000 individual named species. Different groups of organisms might need to use different parts of the DNA, but the basic idea is the same. When someone has an unknown moth or dung beetle, they can generate a DNA barcode for the specimen. This is compared to the database to see if there is a matching sequence. This capability is particularly useful for understanding the linkages of species that have different forms. Except for some herculean efforts, which raised caterpillars to adult moths and butterflies, science did not know which caterpillars developed into which adult. DNA barcoding made answering this question much easier. The technique can also help with preserving biodiversity by allowing inspectors to test if fish being sold in a market are indeed the species on their labels.
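The comparison step can be illustrated with a deliberately simple sketch: score a query barcode against each reference sequence by the fraction of matching positions and accept the best hit only above a threshold. The 20-nucleotide sequences, species labels and 0.9 threshold are all invented for illustration. This is not BOLD's identification method, which must also handle alignment and much longer sequences.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(query, library, threshold=0.9):
    """Return (name, score) for the closest reference, or (None, score)."""
    name, score = max(
        ((n, identity(query, seq)) for n, seq in library.items()),
        key=lambda pair: pair[1],
    )
    return (name, score) if score >= threshold else (None, score)


# Toy reference library; real barcodes are about 600 nucleotides long.
LIBRARY = {
    "species X": "ACGTACGTACGTACGTACGT",
    "species Y": "ACGTTCGAACGTACGGACGT",
}

match = best_match("ACGTACGTACGTACGTACGA", LIBRARY)
no_match = best_match("TTTTTTTTTTTTTTTTTTTT", LIBRARY)
print(match, no_match)
```

Returning `None` below the threshold matters in practice. A specimen with no close match may belong to a species that is not yet in the library, so the closest entry should not be reported as an identification.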

Metagenomic Databases. Biodiversity data collection is not limited to single organism specimens and observations. New metagenomic techniques make it possible to use genetic sequences to characterize the diversity of microbes that exist in environmental conditions but have not been cultured or characterized beyond the DNA. Work since 2006 indicates that our prior estimates of microbial diversity were off by at least two orders of magnitude. New computational methods allow analysis of hundreds of unique species all mixed into the same environmental sample. The databases needed to store the newly discovered sequences challenge the limits of digital storage technology. New techniques are being developed not just to quantify the number of species in a sample but also the main metabolic pathways of these species and therefore their functional niche within the environment. This work has profound consequences for areas such as agriculture, bio-remediation, soil and ocean carbon dioxide sequestration. 

Observing Species through Remote Sensing
For larger organisms biodiversity informatics tools now allow for remote detection of species. For example, arrays of microphones can be placed in environments of interest to record the chirping of frogs or birdcalls. Researchers can review the recordings from anywhere on Earth to identify individual species. In a subfield of eco-acoustics, computers can be programmed using machine-learning techniques to automatically recognize individual species and their relative location over time, greatly expanding the capability of researchers and land use managers to understand biodiversity. 

Informatics tools are also revolutionizing the collection and analysis of animal behavioral data. The underlying question is where animals live and what they do while they are there. It is now possible using radio collars to track the movements of elephants across the African Savannas or the migration paths of whales. When combined with visualization tools the information can be used to warn farmers of an approaching herd of elephants or to plan the boundaries of a new nature preserve. Miniaturization allows researchers to attach transponders to birds and record not only location but also vocalizations that have never before been heard. Miniaturizing information technology even further, it is now possible to trace the movements of insects using RFID (Radio-Frequency Identification). 

While RFID allows us to follow ants, sensors on artificial satellites allow us to get a broader view of biodiversity. Information technology is also allowing us to study and understand larger sections of the landscape using remote sensing. For example, the Terra and Aqua satellites carry a moderate resolution imaging spectroradiometer (MODIS) that gathers images in 36 spectral bands. This information can be used to calculate forest cover. When combined with data from other satellites and plane-based sensors such as light detection and ranging (LIDAR), it is possible to use computational technology to estimate canopy heights, forest biomass and even the species of individual trees. The Google Earth Engine uses massively parallel computation and remote sensing to map forest cover for large swaths of the earth as well as many other measurements. Google will donate 10 million CPU hours over the next two years to make the resources more readily available.

Plotting Species Distribution with Geographic Information Systems
Desktop and server tools such as ArcView continue to be a mainstay of geographic information system (GIS) mapping of species distributions. Google tools such as Google Maps, Google Earth and Fusion Tables are making GIS readily available to a much larger number of scientists and amateurs who work to plot the distribution of species based on observation location and museum collections. For example, Figure 2 is a map of the collection location of specimens of the African Lavia frons, the yellow-winged bat, in museums that contributed their data through GBIF. In fact, GBIF can export data for Google Earth. These point location maps can help us to quickly estimate the distribution of a species or set of species. They can tell us where ranges overlap or where populations are isolated from one another. 

Figure 2. Collection locations of specimens of Lavia frons, the yellow-winged bat found in Africa.


Predicting Species Distribution with Niche Modeling
Biodiversity informatics includes not only information about biological items themselves but also about environmental conditions that can impact biodiversity, such as temperature, rainfall, pH, dissolved oxygen in water and many other factors. To plot the area in which we guess a species lives we could simply draw a line around the observed locations. A method called ecological or environmental niche modeling can provide a potentially more accurate picture. The observation record is incomplete both because most museum collections have not been digitized and because scientists simply have not looked for species in all of the locations where they could be. Because of these gaps there is no reason to assume that the species might not also exist in a wider range than current observations suggest. Following this line of reasoning, perhaps Lavia frons (Figure 2) actually lives a little further west in the Congo Basin. We might also guess that it does not live in high altitudes or other conditions unsuitable for its existence. We can use mathematical and logical models in ecological niche modeling to automatically and objectively identify environmental or ecological conditions where the species may exist. In this method, known observation data and sometimes proven absence data are combined with other environmental information – altitude where observations were made, minimum and maximum temperature ranges, amount of rain in different months of the year or ground cover type at the location of observation and any number of other factors – that might influence the survival of a species. If the species has never been observed in savannas or on mountaintops, these can be excluded from the species distribution maps thus giving a more accurate picture of where the species might live. 
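One of the simplest forms of niche model is a rectangular environmental envelope: record the minimum and maximum of each environmental variable at the sites where the species has been observed, then treat any unsurveyed site that falls inside all of those ranges as potentially suitable. The sketch below uses invented presence records and variable names. Models used in practice are generally statistical or machine-learning methods and may also use absence data, so this is an illustration of the idea and not a recommended method.

```python
def fit_envelope(presence_records):
    """Min/max of each environmental variable over the presence records."""
    variables = presence_records[0].keys()
    return {
        v: (min(r[v] for r in presence_records), max(r[v] for r in presence_records))
        for v in variables
    }


def is_suitable(envelope, site):
    """True if every variable at the site falls inside the fitted range."""
    return all(lo <= site[v] <= hi for v, (lo, hi) in envelope.items())


# Invented environmental values at three observation sites.
presence = [
    {"altitude_m": 300, "annual_rain_mm": 900, "min_temp_c": 18},
    {"altitude_m": 1100, "annual_rain_mm": 1400, "min_temp_c": 14},
    {"altitude_m": 650, "annual_rain_mm": 1100, "min_temp_c": 16},
]
envelope = fit_envelope(presence)

# Two unsurveyed candidate sites (also invented).
lowland = {"altitude_m": 500, "annual_rain_mm": 1000, "min_temp_c": 17}
mountaintop = {"altitude_m": 3200, "annual_rain_mm": 1200, "min_temp_c": 2}
print(is_suitable(envelope, lowland), is_suitable(envelope, mountaintop))
```

The lowland site is flagged as suitable even though the species has never been recorded there, while the mountaintop is excluded. This mirrors the reasoning in the text: the prediction rests on environmental conditions, not on where collectors happened to look.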

Similar biodiversity informatics techniques can be applied to predict the ranges of potentially invasive species. For example, the Burmese Python (Python molurus bivittatus) is native to Southeast Asia but has been introduced to southern Florida where it is reproducing in the wild and spreading. Ecological niche models are used to predict limits to its northern spread in the United States. Niche modeling can also be used to predict the impact of climate change on species distribution, allowing a predictive map of the future distribution of species. For example, the female wolverine (Gulo gulo luscus) requires deep snows with late spring snowmelt to build birthing dens. Current observations in northern Montana and climate prediction models show elimination of these snow conditions even at the highest elevations. Species distribution models predict extirpation of the species from the lower 48 states. At the same time changing rain patterns and temperatures are changing the ranges of mosquitoes that carry the West Nile virus.

Summary
In the future biodiversity informatics will play a critical role in understanding biodiversity and ecosystem services of critical importance to human health and well-being. The evolution of biodiversity informatics tools is driven by advances in computational power, telecommunications and the evolution of software and hardware development tools. The development of biodiversity informatics is also governed by the questions that are being asked by scientists, land use managers and policy makers. There will be an expanding need to understand biodiversity at all scales from microbes to ecosystems. Expanding databases and computational tools will allow us to better understand microbial diversity and the role of that diversity in ecosystem services in land, sea and air. We will need more powerful information tools to track and quantify the distribution and movement of individual species for the purposes of conservation, mitigation of damage from invasive species and disease vectors and to increase and secure food production using disease resistance and other genetic properties of wild relatives. Biodiversity informatics will help us to understand the linked fates of some of our most important water resources. For example, the Gulf of Mexico at the Mississippi River delta, the Chesapeake Bay and Lake Victoria are all suffering ecosystem and fisheries declines because of siltation and eutrophication (nutrient surplus induced hypoxia). Information gathering, analysis and use can help scientists, citizens and policy makers make more informed decisions. 


P. Bryan Heidorn is the director of the School of Information Resources and Library Science at the University of Arizona, president of the JRS Biodiversity Foundation and a previous program officer in the division of biological infrastructure at the National Science Foundation. He can be reached by email at heidorn<at>email.arizona.edu