ASIS&T 2006 START Conference Manager    

Automated Concept Discovery in Corpora of Morphological Descriptions

Hong Cui et al.

ASIS&T Annual Meeting - 2006 (ASIS&T 2006)
Austin, Texas, November 3-9, 2006


The creation of morphological descriptions is one of the major scientific activities taxonomists perform to record the existence and characters of living organisms. Unlike other types of documents, morphological descriptions are especially rich in concepts, including the names of organs (flower, style, etc.), characters (shape, vestiture, etc.), and character states (cordate, pubescent, etc.). This trait makes taxonomic descriptions well suited to structured representations, such as relational databases (Taylor, 1995; Abascal & Sánchez, 1999) and XML-based formats (Cui, 2005). Information extraction and automatic markup are two techniques used to transform legacy, less structured taxonomic descriptions into structured representations that facilitate information retrieval and specimen identification tasks. While the focus of these projects was to prove the feasibility of the techniques, little has been reported on whether the templates for information extraction or the XML DTDs/Schemas for markup are broad enough to cover all the worthy concepts that appear in a collection of descriptions.
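
As a rough illustration of what such a structured representation might look like, the sketch below renders organ, character, and character state triples as XML. The element names (description, organ, character) and the pairing of terms are assumptions made for this example only; they are not taken from any schema used in the cited projects.

```python
import xml.etree.ElementTree as ET

# Hypothetical concept triples (organ, character, state). The terms appear in
# the abstract, but pairing them this way is for illustration only.
concepts = [
    ("flower", "shape", "cordate"),
    ("style", "vestiture", "pubescent"),
]


def to_xml(triples):
    """Build a <description> element with one <organ> child per triple."""
    root = ET.Element("description")
    for organ, character, state in triples:
        organ_el = ET.SubElement(root, "organ", name=organ)
        char_el = ET.SubElement(organ_el, "character", name=character)
        char_el.text = state  # the character state is stored as element text
    return root


xml_text = ET.tostring(to_xml(concepts), encoding="unicode")
print(xml_text)
```

A markup scheme of this kind can only record concepts it has an element or attribute for, which is why the coverage of the template or schema matters.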

Representing all worthy concepts in the new formats is important for specimen identification tasks, because we cannot foresee what kind of rare specimen will be in question. However, discovering all the worthy concepts in a comprehensive collection of morphological descriptions, such as Flora of China (25 volumes), is not a trivial task. Living organisms have amazingly diverse characters, making the distribution of characters among taxa highly skewed (Cui, 2005). Furthermore, taxonomists have great freedom to present the characters of a taxon selectively within a more or less limited space. Since a taxonomic work is often authored by multiple taxonomists (e.g., in the case of Flora of China, more than 600 contributors are said to be involved), the variation introduced by the diversity of authorship is also considerable.

We propose an unsupervised learning approach to the fast discovery of worthy concepts in legacy collections of morphological descriptions, with or without a small number of seed concepts. We implemented a prototype system and tested the approach on a small set of morphological descriptions extracted from Feist et al. (2005). The experimental results appear promising: first, the precision of the discovered concepts was satisfactory; second, the recall of worthy concepts was very high; and third, the results were directly usable for marking up the morphological descriptions.
