ASIST 2012 Annual Meeting 
Baltimore, MD, October 26-30, 2012 


Completeness, Coverage & Equivalence in Scientific Data Records
Andrea Thomer, Karen Baker, Simone Sacchi and David Dubin

Monday, 6:30pm


Summary

Earlier we asked, "When is a record data and when is it a fish?" (Wickett et al., 2012a). In this work, we ask, "When and in what contexts are a record and a fish equivalent?" We describe and compare a collection of potentially equivalent records describing a Mola mola, or Ocean Sunfish, specimen. We calculate the Metadata Coverage Index (MCI) of each record and explore the use of the Systematic Assertion Model (SAM; Dubin, 2010) to support investigation of the assertions contained in these data records.

Natural history museum specimen records are increasingly provisioned and discovered online through cloud-hosted databases such as GBIF and VertNet. Increased use of standard vocabularies like Darwin Core makes these records easier to aggregate and to make interoperable (Wieczorek et al., 2012). However, cross-walking legacy data and then transferring records from local to cloud-based databases with different representation formats, encodings, and harvesting protocols results in a proliferation of different versions of the "same" record. Depending on the vocabulary and/or schema used, these roughly equivalent records make different amounts and types of data available, and thus their fitness-for-purpose or analytic potential varies across contexts (Hill et al., 2010; Palmer, Weber & Cramer, 2011). In prior work (Wickett et al., 2012a, 2012b) we explored these issues using a Mola mola species occurrence record pulled from a Darwin Core Archive (DwC-A) file available on the University of Kansas Biodiversity Institute Integrated Publishing Toolkit (KUBI IPT) installation ("Gbif.org - Darwin Core Archives"). Here, we compare five records describing this same specimen, each downloaded from a different data provider; explore the metadata coverage and completeness of these records; and more fully discuss the nuances of determining their equivalence. Where data sources made conflicting assertions, we used the Systematic Assertion Model (SAM) as a starting point for comparing them.
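
The kind of comparison described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from this abstract: coverage is computed as the percentage of a chosen reference list of fields that carry a non-empty value, which is our reading of the Metadata Coverage Index; the reference list is a small hypothetical subset of Darwin Core terms; and the two records and all their values are invented for illustration, not the records studied here. The function names are likewise hypothetical.

```python
from typing import List, Mapping, Optional, Sequence

# Hypothetical reference list: a small subset of Darwin Core terms,
# chosen only for illustration.
TERMS: List[str] = [
    "catalogNumber", "scientificName", "eventDate",
    "country", "decimalLatitude", "decimalLongitude",
]

Record = Mapping[str, Optional[str]]


def is_filled(record: Record, term: str) -> bool:
    """True if the record has a non-empty, non-whitespace value for term."""
    value = record.get(term)
    return value is not None and value.strip() != ""


def coverage_index(record: Record, terms: Sequence[str] = TERMS) -> float:
    """Percentage of reference terms filled in the record (assumed MCI reading)."""
    filled = sum(1 for term in terms if is_filled(record, term))
    return 100.0 * filled / len(terms)


def conflicting_terms(a: Record, b: Record,
                      terms: Sequence[str] = TERMS) -> List[str]:
    """Terms where both records have a value and the values differ."""
    return [
        term for term in terms
        if is_filled(a, term) and is_filled(b, term)
        and a[term].strip() != b[term].strip()
    ]


# Invented example records (NOT the records analyzed in this work).
record_a = {
    "catalogNumber": "X-0001",
    "scientificName": "Mola mola",
    "eventDate": "",
    "country": "United States",
}
record_b = {
    "catalogNumber": "X-0001",
    "scientificName": "Mola mola (Linnaeus, 1758)",
    "eventDate": "1950-01-01",
    "country": "United States",
    "decimalLatitude": "30.0",
    "decimalLongitude": "-80.0",
}

print(coverage_index(record_a))               # 3 of 6 terms filled
print(coverage_index(record_b))               # 6 of 6 terms filled
print(conflicting_terms(record_a, record_b))  # terms with differing values
```

A string comparison of this kind only flags candidate conflicts. Deciding whether two differing values express the same assertion, as with a scientific name given with and without its authority, is the sort of question the equivalence analysis is meant to address.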