Relationships and data: extending the concepts of RDF

ASIS-PNC Annual Meeting, 2000

Andrew Grove

Information Services, Microsoft Corporation

 

Abstract:

Resource Description Framework (RDF) has at its core the concept of paired entities held together by a relationship.  In RDF, this concept is applied to information resources and descriptions of them; however, it has potential for applications far beyond such resources.  Indeed, much of that potential arises because entity-relationship-entity triples are often direct representations of real-world situations.  By linking such triples together, infinitely extensible chains and networks of chains can be created which accurately record complex, multi-dimensional realities.  Ironically enough, these complex webs can be captured, recorded, and manipulated in very simple data structures, which themselves have the potential to be realized in a variety of platforms and applications.

 

Introduction:

A bit over ten years ago, the foundational course for hopeful MLS students was Bibliographic Control.  There, they learned the principles and practice of managing documents through the mechanism of surrogates such as bibliographic records.  Several actors connected users to actual information:  reference librarians, a few online databases, various finding aids (such as LCSH, the LC Classification Scheme, Dewey Decimal Classification, and indexing and abstracting services), and the documents themselves.  In all cases the critical element was the bibliographic record, or some similar surrogate.  Anyone who has spent appreciable time at the reference desk has stories of users asking for particular documents, only to reveal upon inquiry that their real information need was for some formula, or place name, or political event, all of it information found in various documents.

 

All of this has shifted in the last decade.  First to arrive was the realization that users are interested in ‘information’; they are not as concerned with the package as librarians appear to believe.  Certainly, situations exist in which the document is crucial; children’s collections and images come readily to mind.  Parallel to this awareness is the development of information systems, primarily databases, which support direct information delivery.  It is possible to look directly for a chemical formula and molecular diagram, rather than retrieve a mineralogy reference and then look them up.  New data structures, or long-forgotten ones in new incarnations, hold the promise of modeling the complex interconnections between real-world information elements and relationships.  Such data structures go far beyond storage and retrieval of document surrogates, beyond information elements, and possibly into the realm of genuine knowledge.

 

Information and knowledge - A three-part model of what humans know is terribly incomplete but still serves some purposes well.  The parts are: data (discrete facts); information (collections of data related to each other in meaningful ways); and knowledge (information related to other information in meaningful ways).  For example, canaries are yellow, they make pleasant sounds, they eat seeds, their feet grip things, they weigh about 3 oz., they live an average of 5 years, they lay 4-6 eggs 3-4 times per year, etc.  All these facts combine with others to make “information about canaries.”  Information about canaries connects with information about finches, which connects with information about other perching birds, which eventually connects with information about other animals.  At some point a critical mass occurs, and we say, “This is knowledge about animals, such as birds, including canaries.”  And, it’s all a network of facts related to other facts in a complex web of knowledge.  That web depends a lot on language.

 

An important component of dealing with masses of data, information, and knowledge is organizing it.  Without organization, there could be no differentiation; facts would pop up, connect and disconnect with other facts randomly, and the concept of knowledge would have no meaning.  We are forever attempting to organize our experience in the world.  Fortunately, we also collaborate somewhat about organization schemes and agree enough to have survived thus far.  Language is one of the great tools for organizing our experience.  Conversely, if we organize language (or, more accurately, certain parts of it), we find that we’ve also organized our experience.  Such organization may be incomplete, it may be one-sided, it may be inaccurate (whatever that means); it is also likely to be highly useful.  A process for organizing language, familiar to information practitioners, is that used for indexing languages in the form of thesauri.  There, the standard organizational principles are that concepts, represented by words and phrases, have either genus-species or whole-part relationships to each other, and that these relationships are hierarchical in nature.  That is, they have properties of class, encapsulation and inheritance; everything that is true of a particular entity is also true of its species or parts, which all have at least one property distinguishing them from their siblings.  The hierarchy becomes a chain of connected bits of information.  This works fairly well in clearly defined and bounded domains, and with common nouns and certain proper nouns.  Unfortunately, it’s fairly easy to push the limits of these principles and wander into territory that doesn’t respond well to such treatment.  North America is part of the world.  The United States is part of North America.  And so forth, down to Portland and the surrounding area, all part of Oregon.  All of that can be represented in a tree, or an outline arrangement of text.
But what do we do with the Columbia River, which travels through two countries, a state and a province, borders a third state, passes several cities and towns, crosses at least three mountain ranges, and finally empties into the Pacific Ocean?  And then there’s the Pacific Ocean.  How can the complex relationships among these entities be represented in a data store?  Clearly, the genus-species and whole-part models have limits.  This experiment of analyzing and deconstructing our knowledge about natural features leads me to wonder:  can we devise data structures which will enable more complete modeling of the world as we understand it?  There are a number of systems, most of them proprietary, which claim to offer tools for doing so.  I am going to describe one approach which is not proprietary, and the beginnings of an implementation at Microsoft.
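To make the Columbia River problem concrete, here is a minimal Python sketch, offered as an illustration rather than as anything we built.  It records the geography as entity-relationship-entity triples.  The relationship labels ("part of", "flows through", and so on) and the function names are assumptions chosen for the example, and the names of the province and states (British Columbia, Washington, Oregon) are filled in from general geography.  The whole-part chain for Portland can be walked like a tree, while the river simply has several differently labeled links and no single parent.

```python
# Each fact is one (entity, relationship, entity) triple.
# The relationship labels are illustrative, not drawn from any standard.
TRIPLES = [
    ("North America", "part of", "World"),
    ("United States", "part of", "North America"),
    ("Canada", "part of", "North America"),
    ("Oregon", "part of", "United States"),
    ("Washington", "part of", "United States"),
    ("British Columbia", "part of", "Canada"),
    ("Portland", "part of", "Oregon"),
    ("Columbia River", "flows through", "British Columbia"),
    ("Columbia River", "flows through", "Washington"),
    ("Columbia River", "borders", "Oregon"),
    ("Columbia River", "empties into", "Pacific Ocean"),
]


def related(entity):
    """Return every (relationship, other entity) pair recorded for an entity."""
    return [(rel, obj) for subj, rel, obj in TRIPLES if subj == entity]


def ancestors(entity):
    """Follow 'part of' links upward, the way a whole-part tree would."""
    chain = []
    current = entity
    while True:
        parents = [obj for rel, obj in related(current) if rel == "part of"]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)


if __name__ == "__main__":
    print("Portland is within:", ancestors("Portland"))
    for rel, obj in related("Columbia River"):
        print("Columbia River", rel, obj)
```

The river has no "part of" parent at all, so a tree has nowhere to put it; the triple list accommodates it without any change to the structure.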

 

At the beginning of 1999, the situation across the company was this:

 

Many indexing languages, controlled vocabularies, taxonomies, terminologies, etc.

Many intranet sites and pages, support information systems, help files, and so on.

No general agreement on the form of names and terms for common concepts and topics.

Many tag schemes and database architectures for capturing and recording information about that content.

Despite the many schemes, very little tagging and cataloging of that content, as we understand it.

Few finding mechanisms and retrieval tools which leveraged tagged content and indexing languages to any extent.

No adequate tools for developing and managing index languages or tag schemes.

Several significant entrances to the corporate intranet.

 

At the same time, my own thinking and knowledge consisted of a lot of experience with controlled vocabularies and some experience with relational databases for managing them.  The vocabularies I’d worked with followed the thesaurus model for the most part, but also extended thesaurus-like organizational concepts to place name authority files, and other topics generally treated with flat lists.  I was aware that standard thesauri (those conforming to ANSI/NISO Z39.19 – 1993) are limited, for the most part, to topical concepts, although there is room for names within certain domains, geography and organizations in particular.  The need to go beyond controlling the form of proper names and to organize them in some logical structure was leading me to think about organizing the terminology directly, rather than the concepts referenced by the terms.  Although this seems like a hair-splitting philosophical distinction, the choice made results in very different architectures and designs for index languages, tagging schemes, and information-finding mechanisms.  This line of thinking influenced my leap from RDF to the architecture we later developed, a leap made possible through the use of some unconventional data structures in the relational model.

 

So, in early 1999, our goals were:

Develop processes and working relationships with other Microsoft groups to build and maintain various indexing languages.

Develop and build tools to:

1.  Build and manage vocabularies and schemas

2.  Associate vocabularies and schemas with terms and tags

3.  Provide an improved user experience searching for information using the Microsoft Web portal into the corporate intranet.

Extend tools and processes to an unknown number of portals, vocabularies, and content sets.

 

We began with an initiative to improve the corporate user’s experience searching for content on MSWeb (MSW), a major portal into the corporate intranet.  An important component addressed early was the indexing languages for tagging selected content to showcase on MSW search results.  That began with an architecture and tools for administering the index languages and tag schemes.  Although the first application of this would be the limited space of MSW and one set of showcased content, we wanted the design to extend to an unknown number of portals, vocabularies, and content sets.  We decided there was a realistic opportunity for agreement on the form of individual terms and only a limited possibility for a common organization or structure.  The solution is a common ‘pool’ of terms and a means of collecting them into many different structures.  Each collection is a ‘vocabulary’.  We also realized there would be little or no interest in a shared set of tags for holding the descriptions of information items, although different groups might all use a ‘title’ or an ‘author’ tag.  At first, the proposed architecture held vocabulary terms in one table, with a second table which connected them to each other via the desired relationship and proper vocabulary identification.  Tags were in a separate table, which also had a ‘linking’ table for collecting the tags into schemas.
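The four-table layout just described can be sketched with Python's standard sqlite3 module.  This is a hedged illustration only: the table names, column names, and sample rows are all hypothetical, since the text above specifies nothing beyond a term table, a term-linking table carrying the relationship and vocabulary, a tag table, and a tag-linking table for schemas.

```python
import sqlite3

# Hypothetical table and column names; only the four-table shape comes from the text.
SCHEMA = """
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    label   TEXT NOT NULL UNIQUE          -- one agreed form per term
);
CREATE TABLE term_link (
    from_term    INTEGER NOT NULL REFERENCES term(term_id),
    relationship TEXT    NOT NULL,        -- the desired relationship
    to_term      INTEGER NOT NULL REFERENCES term(term_id),
    vocabulary   TEXT    NOT NULL         -- which vocabulary the pair belongs to
);
CREATE TABLE tag (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);
CREATE TABLE schema_link (
    schema_name TEXT    NOT NULL,         -- which schema collects the tag
    tag_id      INTEGER NOT NULL REFERENCES tag(tag_id)
);
"""


def build_demo():
    """Create an in-memory database holding a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO term (term_id, label) VALUES (?, ?)",
                     [(1, "Databases"), (2, "Microsoft Access")])
    conn.execute("INSERT INTO term_link VALUES (?, ?, ?, ?)",
                 (1, "broader than", 2, "Products"))
    conn.executemany("INSERT INTO tag (tag_id, name) VALUES (?, ?)",
                     [(1, "title"), (2, "author")])
    conn.executemany("INSERT INTO schema_link VALUES (?, ?)",
                     [("Portal", 1), ("Portal", 2), ("HelpFiles", 1)])
    return conn


def links_in_vocabulary(conn, vocabulary):
    """Return (term, relationship, term) rows belonging to one vocabulary."""
    rows = conn.execute(
        "SELECT a.label, l.relationship, b.label FROM term_link AS l "
        "JOIN term AS a ON a.term_id = l.from_term "
        "JOIN term AS b ON b.term_id = l.to_term "
        "WHERE l.vocabulary = ?", (vocabulary,))
    return rows.fetchall()


def tags_in_schema(conn, schema_name):
    """Return the tag names collected into one schema, alphabetically."""
    rows = conn.execute(
        "SELECT tag.name FROM schema_link JOIN tag USING (tag_id) "
        "WHERE schema_link.schema_name = ? ORDER BY tag.name", (schema_name,))
    return [name for (name,) in rows]
```

In this sketch the ‘title’ tag is stored once and collected into two schemas, which mirrors the idea of a common pool gathered into many different structures.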

 

Much of this design was informed by a growing understanding of the ‘triple’ concept, most familiarly expressed in the Resource Description Framework, RDF.  The idea there, which we have extended beyond describing resources, is that information items have properties which exist in relation to each other and to the item.  The item and any given property are expressed in ‘triples’: a resource, a relationship – “has property”, and a property value.  We applied that to vocabulary terms: a term, a relationship, another term.  By defining the nature of the relationship, we are able to use one data structure to store terms and their relationships to each other.  Most critically, we can define and capture the nature of the relationship; a pair of terms could have a Broader-Narrower relationship, they could have a Variant-Authorized Term relationship, or they could have a non-hierarchical Authorized-Authorized Term relationship.  Additionally, by creating the concept of relationship classes (hierarchical, variant-authorized term, and associative), we are able to define further types of relationships, such as genus-species, or whole-part, or instance-of, or exemplar.  With all this information defined and recorded, we are able to write applications which give added dimensions to vocabulary display (hierarchical trees, for instance), provide ‘smart’ term retrieval (variant terms return authorized terms), and enhance tagging and cataloging (for example, the display of authorized terms suggests other authorized terms).  The same information can be used in navigation and search tools, as well.  This extension of the entity-relationship-entity idea appeared very promising for storing vocabulary values and their relations to one another.
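The three applications mentioned above (tree display, ‘smart’ retrieval, and term suggestion) can all be driven from one list of typed triples.  The following Python sketch assumes that each relationship type is mapped to a relationship class; the type names, the class names, the sample terms, and the function names are all hypothetical.

```python
# Relationship type -> relationship class.  All names here are illustrative.
RELATIONSHIP_CLASS = {
    "genus-species": "hierarchical",
    "whole-part": "hierarchical",
    "variant-of": "variant-authorized",
    "related-to": "associative",
}

# (term, relationship type, term).  For hierarchical types the first term is
# the broader one; for "variant-of" the first term is the variant form.
TRIPLES = [
    ("Birds", "genus-species", "Perching birds"),
    ("Perching birds", "genus-species", "Finches"),
    ("Finches", "genus-species", "Canaries"),
    ("Canary", "variant-of", "Canaries"),
    ("Canaries", "related-to", "Bird seed"),
]


def authorized_form(term):
    """'Smart' retrieval: a variant term returns its authorized term."""
    for subj, rel, obj in TRIPLES:
        if subj == term and RELATIONSHIP_CLASS[rel] == "variant-authorized":
            return obj
    return term


def narrower(term):
    """Terms directly below a term through any hierarchical relationship."""
    return [obj for subj, rel, obj in TRIPLES
            if subj == term and RELATIONSHIP_CLASS[rel] == "hierarchical"]


def tree(term, depth=0):
    """Vocabulary display: an indented outline of the hierarchy below a term."""
    lines = ["  " * depth + term]
    for child in narrower(term):
        lines.extend(tree(child, depth + 1))
    return lines


def suggestions(term):
    """Tagging aid: authorized terms associated with a term, in either direction."""
    found = []
    for subj, rel, obj in TRIPLES:
        if RELATIONSHIP_CLASS[rel] != "associative":
            continue
        if subj == term:
            found.append(obj)
        elif obj == term:
            found.append(subj)
    return found


if __name__ == "__main__":
    print("\n".join(tree("Birds")))
    print(authorized_form("Canary"), suggestions("Canaries"))
```

Because the functions test the relationship class rather than the individual type, adding a new hierarchical type such as instance-of would need only one new entry in the class mapping.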

 

We went a step further; we applied this to schemas and tags.  If a vocabulary is a collection of terms, and a schema is a collection of tags, there is no essential difference between them; vocabularies and schemas are collections, terms and tags are elements.  This development gave us the power to associate terms with other terms in as many relationships as necessary and, by associating pairs of terms with vocabularies, in as many vocabularies as necessary.  Thus, one string, ‘access’ for example, could serve as an Entry Term in one vocabulary and as an authorized term in another.  Or, more likely, ‘Microsoft Access’ could be the broader of many more specific product names in one vocabulary and the sole representation of that product in another.  We have the ability to use one form of a term in a multitude of organizational structures.  We can associate terms across vocabularies:  Windows NT – Related Term – Dave Cutler (the chief architect of it); and Dave Cutler – Related Term – Senior Distinguished Engineer (his job title); Dave Cutler – Related Term – the group he works in; and so on.
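A short Python sketch of this idea follows, on the assumption that every link carries the name of the collection it belongs to.  The collection names, the ‘remote access’ term, and the relationship labels are made up for illustration; the ‘access’ and Dave Cutler examples follow the text above.  Tags collected into schemas would take the same four-part shape.

```python
# (element, relationship, element, collection).  Collection names, the
# "remote access" term, and the relationship labels are hypothetical.
LINKS = [
    ("access", "entry term for", "Microsoft Access", "Products"),
    ("access", "broader than", "remote access", "Networking"),
    ("Windows NT", "related term", "Dave Cutler", "Cross-vocabulary"),
    ("Dave Cutler", "related term", "Senior Distinguished Engineer",
     "Cross-vocabulary"),
]


def role_of(element, collection):
    """Report how one string is used within one collection."""
    present = False
    for subj, rel, obj, coll in LINKS:
        if coll != collection:
            continue
        if subj == element and rel == "entry term for":
            return "entry term"
        if element in (subj, obj):
            present = True
    return "authorized term" if present else None


def collections_of(element):
    """List every collection in which the element takes part."""
    return sorted({coll for subj, _, obj, coll in LINKS
                   if element in (subj, obj)})


def follow(element, relationship):
    """Follow one kind of relationship outward from an element."""
    return [obj for subj, rel, obj, _ in LINKS
            if subj == element and rel == relationship]
```

Here the single string ‘access’ is an entry term in one collection and an authorized term in another, without being stored twice.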

 

The light bulb idea in all of this, that which is opening the doors to possibilities for knowledge bases and, perhaps, ontologies, is that relationships are themselves entities.  As such, they have at least one property, that of identity.  They also belong to classes.  These are relatively straightforward values to capture in the data structure we’ve developed.  The real challenge is identifying useful relationships between information entities and using them as the glue to build and hold information and knowledge directly, not via surrogates.  With the ability to define relationships, we have real opportunities to describe and record real-world circumstances in widely available database management systems.  The entity-relationship-entity concept is strikingly similar to the grammatical concept of subject-verb-object.  Now we’re talking about language; and language is getting close to human thought.  We, and I mean all of us information professionals, are no longer limited to dealing with documents and document surrogates, which might or might not contain the information people need.  Now, we can directly record that information and people can access it directly.
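One way to picture relationships as entities is the following Python sketch.  It is an assumption-laden illustration, not a description of our data structure: the class names, the identifiers, the "spatial" and "meta" relationship classes, and the "inverse of" relationship are all invented for the example.  The point it shows is that a relationship has an identity and a class, and can itself stand as the subject or object of a statement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    ident: int      # identity, the one property every entity has
    label: str


@dataclass(frozen=True)
class Relationship(Entity):
    rel_class: str  # relationships also belong to classes


@dataclass(frozen=True)
class Statement:
    subject: Entity
    relationship: Relationship
    obj: Entity


# Hypothetical sample data.
river = Entity(1, "Columbia River")
ocean = Entity(2, "Pacific Ocean")
empties_into = Relationship(3, "empties into", "spatial")
receives = Relationship(4, "receives", "spatial")
inverse_of = Relationship(5, "inverse of", "meta")

STATEMENTS = [
    Statement(river, empties_into, ocean),
    # A relationship used as the subject and the object of a statement.
    Statement(receives, inverse_of, empties_into),
]


def statements_in_class(rel_class):
    """Statements whose relationship belongs to the given class."""
    return [s for s in STATEMENTS if s.relationship.rel_class == rel_class]


def about(entity):
    """Statements in which an entity appears as subject or object."""
    return [s for s in STATEMENTS if entity in (s.subject, s.obj)]
```

Because `Relationship` is a kind of `Entity`, nothing special is needed to make statements about a relationship; it is stored and queried like any other entity.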