|ASIS&T 2006||START Conference Manager|
Representing all worthy concepts in the new formats is important for specimen identification tasks, because we cannot foresee what kind of rare specimen will be in question. However, discovering all the worthy concepts in a comprehensive collection of morphological descriptions, such as Flora of China (25 volumes), is not a trivial task. Living organisms have amazingly diverse characters, which makes the distribution of characters among taxa highly skewed (Cui, 2005). Furthermore, taxonomists have great freedom to selectively present the characters of a taxon in a more or less limited space. Since a taxonomic work is often authored by multiple taxonomists (e.g., Flora of China is said to involve more than 600 contributors), the variation introduced by the diversity of authorship is also considerable.
We propose an unsupervised learning approach to the fast discovery of worthy concepts in legacy collections of morphological descriptions, with or without a small number of seed concepts. We implemented a prototype system and tested the approach on a small set of morphological descriptions extracted from Feist et al. (2005). The experimental results appear promising: first, the precision of discovered concepts was satisfactory; second, the recall of worthy concepts was very high; and third, the results were directly usable for marking up the morphological descriptions.