Relationships and data:
extending the concepts of RDF
ASIS-PNC Annual Meeting,
2000
Andrew Grove
Information Services,
Microsoft Corporation
Abstract:
Resource Description Framework (RDF) has at its
core the concept of paired entities held together by a relationship. In RDF, this concept is applied to
information resources and descriptions of them; however, it has potential for
applications far beyond such resources.
Indeed, much of that potential arises because entity-relationship-entity
triples are often direct representations of real-world situations. By linking such triples together, infinitely
extensible chains and networks of chains can be created which accurately record
complex, multi-dimensional realities.
Ironically enough, these complex webs can be captured, recorded, and
manipulated in very simple data structures, which themselves have the potential
to be realized in a variety of platforms and applications.
Introduction:
A bit over ten years ago, the foundational course
for hopeful MLS students was Bibliographic Control. There, they learned the principles and practice of managing
documents through the mechanism of surrogates such as bibliographic
records. Several actors connected users
to actual information: reference
librarians, a few online databases, various finding aids (such as LCSH, the LC
Classification Scheme, Dewey Decimal Classification, Indexing and Abstracting
services), and the documents themselves.
In all cases the critical element was the bibliographic record, or some
similar surrogate. Anyone who has spent
appreciable time at the reference desk has stories of users asking for
particular documents, only to reveal upon inquiry that their real information need
was for some formula, place name, or political event, all of it information
found in various documents.
All of this has shifted in the last decade. First to arrive was the realization that
users are interested in ‘information’; they are not as concerned with the
package as librarians appear to believe.
Certainly, situations exist in which the document is crucial; children’s
collections and images come readily to mind.
Parallel to this awareness is the development of information systems,
primarily databases, which support direct information delivery. It is possible to look directly for a
chemical formula and molecular diagram, rather than retrieve a mineralogy
reference and then look them up. New
data structures, or long-forgotten ones in new incarnations, hold the promise of
modeling the complex interconnections between real-world information elements
and relationships. Such data structures
go far beyond storage and retrieval of document surrogates, beyond information
elements, and possibly into the realm of genuine knowledge.
Information and knowledge - A three-part model of
what humans know is terribly incomplete but still serves some purposes well. The parts are: data (discrete facts);
information (collections of data related to each other in meaningful ways); and
knowledge (information related to other information in meaningful ways). For example, canaries are yellow, they make
pleasant sounds, they eat seeds, their feet grip things, they weigh about 3
oz., they live an average of 5 years, they lay 4-6 eggs 3-4 times per year,
etc. All these facts combine with
others to make “information about canaries.”
Information about canaries connects with information about finches,
which connects with information about other perching birds, which eventually
connects with information about other animals.
At some point a critical mass occurs, and we say, “This is knowledge
about animals, such as birds, including canaries.” And, it’s all a network of facts related to other facts in a
complex web of knowledge. That web
depends a lot on language.
An important component of dealing with masses of
data, information, and knowledge is organizing it. Without organization, there could be no differentiation; facts
would pop up, connect and disconnect with other facts randomly, and the concept
of knowledge would have no meaning. We
are forever attempting to organize our experience in the world. Fortunately, we also collaborate somewhat
about organization schemes and agree enough to have survived thus far. Language is one of the great tools for
organizing our experience. Conversely, if
we organize language (or, more accurately, certain parts of it), we find that
we’ve also organized our experience.
Such organization may be incomplete, it may be one-sided, it may be
inaccurate (whatever that means); it is also likely to be highly useful. One process for organizing language, familiar
to information practitioners, is that used to build indexing languages in the form of
thesauri. There, the standard
organizational principles are that concepts, represented by words and phrases,
have either genus-species or whole-part relationships to each other and these
relationships are hierarchical in nature.
That is, they have properties of class, encapsulation and inheritance;
everything that is true of a particular entity is also true of its species or
parts, which all have at least one property distinguishing them from their
siblings. The hierarchy becomes a chain
of connected bits of information. This
works fairly well in clearly defined and bounded domains, and with common nouns
and certain proper nouns.
Unfortunately, it’s fairly easy to push the limits of these principles
and wander into territory that doesn’t respond well to such treatment. North America is part of the world. The United States is part of North America. And so forth, down to Portland and the
surrounding area – all part of Oregon.
And that can be represented in a tree, or outline arrangement of
text. But what do we do with the
Columbia River, which travels through two countries, a state and a province,
borders a third state, passes along several cities and towns, through at least
three mountain ranges, and finally empties into the Pacific Ocean? And then there’s the Pacific Ocean. How can the complex relationships among
these entities be represented in a data store? Clearly, the genus-species and
whole-part models have limits. This
experiment of analyzing and deconstructing our knowledge about natural features
leads me to wonder: can we devise data structures which will enable more
complete modeling of the world as we understand it? There are a number of systems, most of them proprietary, which
claim to offer tools for doing so. I am
going to describe one approach which is not proprietary, and the beginnings of
an implementation at Microsoft.
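The contrast between the Portland chain and the Columbia River can be sketched in a few lines of Python. This is only an illustration, not the implementation described later: the relationship names ("flows through", "borders", "empties into") and the specific state and province names are assumptions filled in for the example.

```python
# Whole-part tree: every place has exactly one parent, so a plain
# mapping from child to parent is enough.
PART_OF = {
    "Portland": "Oregon",
    "Oregon": "United States",
    "United States": "North America",
    "North America": "World",
}


def ancestors(place):
    """Walk the single-parent chain upward and return it as a list."""
    chain = []
    while place in PART_OF:
        place = PART_OF[place]
        chain.append(place)
    return chain


# The river needs several differently typed links at once, which a
# single-parent tree cannot hold. Each row is an
# (entity, relationship, entity) triple; the relationship names are
# hypothetical.
RIVER_TRIPLES = [
    ("Columbia River", "flows through", "Canada"),
    ("Columbia River", "flows through", "United States"),
    ("Columbia River", "flows through", "British Columbia"),
    ("Columbia River", "flows through", "Washington"),
    ("Columbia River", "borders", "Oregon"),
    ("Columbia River", "empties into", "Pacific Ocean"),
]


def related(subject, relationship, triples):
    """Return every entity linked to `subject` by `relationship`."""
    return [o for s, r, o in triples if s == subject and r == relationship]


print(ancestors("Portland"))
print(related("Columbia River", "flows through", RIVER_TRIPLES))
```

The tree answers only one question (what is this a part of?), while the list of triples can answer as many questions as there are relationship names.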
At the beginning of 1999, the situation across the
company was this:
Many indexing languages, controlled vocabularies, taxonomies, terminologies, etc.
Many intranet sites and pages, support information systems, help files, and so on.
No general agreement on the form of names and terms for common concepts and topics.
Many tag schemes and database architectures for capturing and recording information about that content.
Despite the many schemes, very little tagging and cataloging of that content, as we understand it.
Few finding mechanisms and retrieval tools which leveraged tagged content and indexing languages to any extent.
No adequate tools for developing and managing index languages or tag schemes.
Several significant entrances to the corporate intranet.
At the same time, my own thinking and knowledge
consisted of a lot of experience with controlled vocabularies and some
experience with relational databases for managing them. The vocabularies I’d worked with followed
the thesaurus model for the most part, but also extended thesaurus-like
organizational concepts to place name authority files, and other topics generally
treated with flat lists. I was aware
that standard thesauri (those conforming to ANSI/NISO Z39.19 – 1993) are
limited, for the most part, to topical concepts; although there is room for
names within certain domains, geography and organizations in particular. The need to go beyond controlling the form
of proper names and to organize them in some logical structure was leading me
to think about organizing the terminology directly, rather than the
concepts referenced by the terms.
Although this seems like a hair-splitting philosophical distinction, the
choice made results in very different architectures and designs for index
languages, tagging schemes, and information-finding mechanisms. This line of thinking influenced my subsequent
leap from RDF to the architecture we developed, a leap made possible through the use of some unconventional data
structures in the relational model.
So, in early 1999, our goals were:
Develop processes and working relationships with
other Microsoft groups to build and maintain various indexing languages.
Develop and build tools to:
1. Build and manage vocabularies and schemas
2. Associate vocabularies and schemas with terms and tags
3. Provide an improved user experience searching for information
using the Microsoft Web portal into the corporate intranet.
Extend tools and processes to an unknown number of
portals, vocabularies, and content sets.
We began with an initiative to improve the corporate
user’s experience searching for content on MSWeb (MSW), a major portal into the
corporate intranet. An important
component addressed early was the indexing languages for tagging selected
content to showcase on MSW search results.
That began with an architecture and tools for administering the index
languages and tag schemes. Although the
first application of this would be the limited space of MSW and one set of
showcased content, we wanted the design to extend to an unknown number of
portals, vocabularies, and content sets.
We decided there was a realistic opportunity for agreement on the form
of individual terms and only a limited possibility for a common organization or
structure. The solution is a common
‘pool’ of terms and a means of collecting them into many different structures. Each collection is a ‘vocabulary’. We also realized there would be little or no
interest in a shared set of tags for holding the descriptions of information
items, although different groups might all use a ‘title’ or an ‘author’
tag. At first, the proposed
architecture held vocabulary terms in one table, with a second table which
connected them to each other via the desired relationship and proper vocabulary
identification. Tags were in a separate
table, which also had a ‘linking’ table for collecting the tags into schemas.
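The four-table arrangement just described might look something like the following sketch, written against SQLite from Python. All table names, column names, and sample values here are assumptions made for illustration; the paper does not give the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    label   TEXT NOT NULL UNIQUE      -- one agreed form per term
);
CREATE TABLE term_link (              -- term, relationship, term, in a vocabulary
    from_term    INTEGER NOT NULL REFERENCES term(term_id),
    relationship TEXT NOT NULL,
    to_term      INTEGER NOT NULL REFERENCES term(term_id),
    vocabulary   TEXT NOT NULL
);
CREATE TABLE tag (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);
CREATE TABLE schema_tag (             -- collects tags into schemas
    schema_name TEXT NOT NULL,
    tag_id      INTEGER NOT NULL REFERENCES tag(tag_id)
);
""")


def get_term_id(label):
    """Add the label to the shared pool if needed and return its id."""
    conn.execute("INSERT OR IGNORE INTO term(label) VALUES (?)", (label,))
    row = conn.execute("SELECT term_id FROM term WHERE label = ?", (label,))
    return row.fetchone()[0]


def link(from_label, relationship, to_label, vocabulary):
    """Record one term-relationship-term row inside a vocabulary."""
    conn.execute(
        "INSERT INTO term_link VALUES (?, ?, ?, ?)",
        (get_term_id(from_label), relationship, get_term_id(to_label), vocabulary),
    )


def narrower(label, vocabulary):
    """List the terms directly narrower than `label` in one vocabulary."""
    rows = conn.execute(
        """
        SELECT t2.label FROM term_link AS l
        JOIN term AS t1 ON t1.term_id = l.from_term
        JOIN term AS t2 ON t2.term_id = l.to_term
        WHERE t1.label = ? AND l.relationship = 'Broader-Narrower'
          AND l.vocabulary = ?
        ORDER BY t2.label
        """,
        (label, vocabulary),
    ).fetchall()
    return [r[0] for r in rows]


# Hypothetical sample data.
link("Software", "Broader-Narrower", "Databases", "products")
link("Databases", "Broader-Narrower", "Microsoft Access", "products")

for schema_name, tag_name in [("schema_a", "title"), ("schema_a", "author"),
                              ("schema_b", "title")]:
    conn.execute("INSERT OR IGNORE INTO tag(name) VALUES (?)", (tag_name,))
    tag_id = conn.execute(
        "SELECT tag_id FROM tag WHERE name = ?", (tag_name,)
    ).fetchone()[0]
    conn.execute("INSERT INTO schema_tag VALUES (?, ?)", (schema_name, tag_id))

print(narrower("Software", "products"))
```

Note that the term pair and the tag pair of tables have the same shape: a pool of elements plus a table that places them in collections.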
Much of this design was informed by a growing
understanding of the ‘triple’ concept, most familiarly expressed in the
Resource Description Framework, RDF.
The idea there, which we have extended beyond describing resources, is
that information items have properties which exist in relation to each other
and to the item. The item and any given
property are expressed in ‘triples’: a resource, a relationship – “has
property”, and a property value. We
applied that to vocabulary terms: a term, a relationship, another term. By defining the nature of the relationship,
we are able to use one data structure to store terms and their relationships to
each other. Most critically, we could
define and capture the nature of the relationship: a pair of terms could have a
Broader-Narrower relationship, they could have a Variant-Authorized Term
relationship, or they could have a non-hierarchical Authorized-Authorized Term
relationship. Additionally, by creating
the concept of relationship classes (hierarchical, variant-authorized term, and
associative), we are able to define further types of relationships,
such as genus-species, or whole-part, or instance-of, or exemplar. With all this information defined and
recorded, we are able to write applications which give added dimensions to
vocabulary display (hierarchical trees, for instance), provide ‘smart’ term
retrieval (variant terms return authorized terms), and enhance tagging and
cataloging (for example, the display of authorized terms suggests other
authorized terms). The same information
can be used in navigation and search tools, as well. This extension of the entity-relationship-entity idea appeared
very promising for storing vocabulary values and their relations to one
another.
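A minimal sketch of how typed triples and relationship classes could drive the three behaviors mentioned above (tree display, 'smart' retrieval, and suggestion of related terms) follows. The relationship type names, the sample terms, and the function names are all hypothetical, chosen only to make the idea concrete.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject relationship obj")

# Relationship type -> relationship class. The type names are examples only.
RELATIONSHIP_CLASS = {
    "genus-species": "hierarchical",
    "whole-part": "hierarchical",
    "variant-of": "variant-authorized",
    "related-to": "associative",
}

TRIPLES = [
    Triple("Birds", "genus-species", "Perching birds"),
    Triple("Perching birds", "genus-species", "Finches"),
    Triple("Finches", "genus-species", "Canaries"),
    Triple("Canary birds", "variant-of", "Canaries"),
    Triple("Canaries", "related-to", "Bird seed"),
]


def objects_of(term, relationship_class):
    """Objects linked to `term` by any relationship in the given class."""
    return [t.obj for t in TRIPLES
            if t.subject == term
            and RELATIONSHIP_CLASS[t.relationship] == relationship_class]


def authorized(term):
    """'Smart' retrieval: a variant term returns its authorized term."""
    targets = objects_of(term, "variant-authorized")
    return targets[0] if targets else term


def tree_lines(term, depth=0):
    """Indented hierarchical display built from hierarchical triples."""
    lines = ["  " * depth + term]
    for child in objects_of(term, "hierarchical"):
        lines.extend(tree_lines(child, depth + 1))
    return lines


def suggestions(term):
    """Other authorized terms suggested by associative relationships."""
    return objects_of(term, "associative")


print("\n".join(tree_lines("Birds")))
print(authorized("Canary birds"))
print(suggestions("Canaries"))
```

Because every function filters on the relationship class rather than on a specific type, a new hierarchical type such as instance-of would appear in the tree display without any change to the code.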
We went a step further; we applied this to schemas
and tags. If a vocabulary is a
collection of terms, and a schema is a collection of tags, there is no
essential difference between them; vocabularies and schemas are collections,
terms and tags are elements. This
development gave us the power to associate terms with other terms in as many
relationships as necessary and, by associating pairs of terms with
vocabularies, in as many vocabularies as necessary. Thus, one string, ‘access’ for example, could serve as an Entry
Term in one vocabulary and as an authorized term in another. Or, more likely, ‘Microsoft Access’ could be
the broader of many more specific product names in one vocabulary and the sole
representation of that product in another.
We have the ability to use one form of a term in a multitude of
organizational structures. We can
associate terms across vocabularies:
Windows NT – Related Term – Dave Cutler (its chief architect);
Dave Cutler – Related Term – Senior Distinguished Engineer (his job title);
Dave Cutler – Related Term – the group he works in; and so on.
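To make the collection-and-element idea concrete, here is a small sketch in which vocabularies and schemas share one structure and a single string plays different roles in different collections. The collection names, the "Security" term, the "record" tag, and the "Has Tag" relationship are assumptions introduced for the example; only 'access', 'Microsoft Access', and the Windows NT associations come from the text above.

```python
# Each row: (collection, element, relationship, element).
# Vocabularies and schemas are both just collections of such rows.
COLLECTION_KIND = {
    "products": "vocabulary",
    "security": "vocabulary",
    "people": "vocabulary",
    "doc_schema": "schema",
}

ROWS = [
    ("products", "access", "Variant-Authorized", "Microsoft Access"),
    ("security", "Security", "Broader-Narrower", "access"),
    ("people", "Windows NT", "Related Term", "Dave Cutler"),
    ("people", "Dave Cutler", "Related Term", "Senior Distinguished Engineer"),
    ("doc_schema", "record", "Has Tag", "title"),
    ("doc_schema", "record", "Has Tag", "author"),
]


def role(element, collection):
    """Entry term if the element is a variant here, authorized term if it
    appears here in any other way, None if it is absent."""
    appears = False
    for coll, subject, relationship, obj in ROWS:
        if coll != collection:
            continue
        if subject == element and relationship == "Variant-Authorized":
            return "entry term"
        if element in (subject, obj):
            appears = True
    return "authorized term" if appears else None


def collections_using(element):
    """Every collection in which the element takes part."""
    return sorted({coll for coll, s, _, o in ROWS if element in (s, o)})


def elements(collection):
    """All elements of a collection, whether it is a vocabulary or a schema."""
    found = set()
    for coll, subject, _, obj in ROWS:
        if coll == collection:
            found.update((subject, obj))
    return sorted(found)


def follow(element, relationship):
    """Follow one relationship outward from an element, in any collection."""
    return [o for _, s, r, o in ROWS if s == element and r == relationship]


print(role("access", "products"), "/", role("access", "security"))
print(follow("Windows NT", "Related Term"))
print(elements("doc_schema"))
```

The same `elements` function serves terms and tags alike, which is the practical payoff of treating vocabularies and schemas as one kind of thing.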
The light bulb idea in all of this, that which is
opening the doors to possibilities for knowledge bases and, perhaps,
ontologies, is that relationships are themselves entities. As such, they have at least one property,
that of identity. They also belong to
classes. These are relatively
straightforward values to capture in the data structure we’ve developed. The real challenge is identifying useful
relationships between information entities and using them as the glue to build
and hold information and knowledge directly, not via surrogates. With the ability to define relationships, we
have real opportunities to describe and record real-world circumstances in
widely available database management systems.
The entity-relationship-entity concept is strikingly similar to the
grammatical concept of subject-verb-object. Now we’re talking about language, and language is getting close
to human thought. We, and I mean all of
us information professionals, are no longer limited to dealing with documents
and document surrogates, which might or might not contain the information
people need. Now, we can directly
record that information and people can access it directly.