MIT Libraries

Data Management and Publishing

 

Data Integration

The goals of data management are for data to be preserved, reused and integrated with other datasets. Data integration is probably the most challenging of these goals. By integration we mean that a dataset can be retrieved and added to other datasets to create larger, more robust and more useful datasets. For example, water temperature data from Long Island Sound could be added to Cape Cod water temperature data, and in turn integrated with water temperature data from Seattle. While this may sound simple, many variables can affect the consistency of the information that is collected. For example, is the temperature recorded in Fahrenheit or Celsius? At what intervals were the temperature readings taken? What times of day? Integrating even simple datasets such as these can easily become cumbersome.
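As a rough illustration of the unit problem described above, the following Python sketch converts every reading to Celsius before combining two small datasets. The readings, the tuple layout and the function names are invented for illustration; they are not real measurements or part of any particular tool.

```python
# Hypothetical readings as (site, date, value, unit). Not real measurements.
long_island_sound = [("Long Island Sound", "2008-04-01", 44.0, "F")]
cape_cod = [("Cape Cod", "2008-04-01", 7.0, "C")]


def to_celsius(value, unit):
    """Convert a temperature reading to degrees Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit!r}")


def integrate(*datasets):
    """Merge datasets into one list with every value expressed in Celsius."""
    merged = []
    for dataset in datasets:
        for site, date, value, unit in dataset:
            merged.append((site, date, round(to_celsius(value, unit), 2), "C"))
    return merged


combined = integrate(long_island_sound, cape_cod)
for row in combined:
    print(row)
```

Even in this tiny case the unit has to travel with each value. If it is missing, there is no safe way to tell whether 44 and 7 can be compared.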

The Semantic Web is an approach intended to allow the integration of disparate datasets such as those in the example above. To use the Semantic Web, one needs to do the following. Each step is explained in more detail below:
  1. Mark up the data in XML or a similar format
  2. Arrange the data elements in Resource Description Framework (RDF) statements
  3. Identify each data element with a URI and make each data element available via that URI
  4. Use a consistent ontology to label data elements

1. XML

eXtensible Markup Language (XML) is a markup language that looks a bit like HTML, but instead of a fixed set of tags it lets you define your own. It can be used to describe almost anything, and its ease of use and flexibility make it well suited to marking up Semantic Web data. As in HTML, XML data is enclosed by tags. An example of a few lines of XML code is:

<name>
<firstname>Jane</firstname>
<lastname>Smith</lastname>
</name>

You can see that although the tag format is similar to HTML, the tag names themselves (name, firstname, lastname) are not HTML tags. They were chosen to describe the data they enclose.
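To show what these tags give a program to work with, here is a minimal Python sketch that reads the same few lines using the standard library's xml.etree.ElementTree module. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

xml_text = """<name>
<firstname>Jane</firstname>
<lastname>Smith</lastname>
</name>"""

root = ET.fromstring(xml_text)

# Each child element pairs a tag (the label) with its text (the value).
fields = {child.tag: child.text for child in root}
print(fields)
```

The program does not need to know in advance what the tags are called; it discovers the labels from the markup itself.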

2. RDF Statements

The Semantic Web is sometimes said to be about relationships between data elements. The key to RDF is that you take a complex dataset and break it down into its simplest form. RDF is simply a data model -- a way to express complex data in simple, straightforward statements, each made up of a subject, a predicate and an object. To use our example from above, perhaps you have a simple spreadsheet from a pier in Woods Hole, Massachusetts that records the water temperature in degrees Fahrenheit each day of the year.



This spreadsheet could be expressed in RDF using XML. One key point to remember is that RDF is a language of sorts -- it is flexible. There is no one way to say something in RDF, just as there is no one way to say something in English. Here is an example of five days of that spreadsheet's data written as simple statements, without XML.

On April 1, 2008, the temperature was 44 degrees.
On April 2, 2008, the temperature was 45 degrees.
On April 3, 2008, the temperature was 44 degrees.
On April 4, 2008, the temperature was 46 degrees.
On April 5, 2008, the temperature was 45 degrees.

Although the above statements are clear to you and me, a computer cannot easily interpret them. That's where XML comes in. XML labels the data elements so that a computer can "read" the data. This same data, in XML, might look like this:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="Woods Hole">
<date>04012008</date>
<temperature>44</temperature>
</rdf:Description>
<rdf:Description rdf:about="Woods Hole">
<date>04022008</date>
<temperature>45</temperature>
</rdf:Description>
<rdf:Description rdf:about="Woods Hole">
<date>04032008</date>
<temperature>44</temperature>
</rdf:Description>
<rdf:Description rdf:about="Woods Hole">
<date>04042008</date>
<temperature>46</temperature>
</rdf:Description>
<rdf:Description rdf:about="Woods Hole">
<date>04052008</date>
<temperature>45</temperature>
</rdf:Description>
</rdf:RDF>
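One way to picture these statements is as subject-predicate-object triples. The Python sketch below is one possible modelling, not the only one: it gives each reading its own made-up identifier (reading1, reading2, ...) so that a date stays paired with its temperature. The predicate names are plain labels chosen for illustration.

```python
# (date, temperature) pairs taken from the five statements.
readings = [
    ("04012008", 44),
    ("04022008", 45),
    ("04032008", 44),
    ("04042008", 46),
    ("04052008", 45),
]

triples = []
for number, (date, temperature) in enumerate(readings, start=1):
    subject = f"reading{number}"  # hypothetical identifier for one observation
    triples.append((subject, "location", "Woods Hole"))
    triples.append((subject, "date", date))
    triples.append((subject, "temperature", temperature))

# A simple query: collect the object of every "temperature" statement.
temperatures = [obj for subj, pred, obj in triples if pred == "temperature"]
print(temperatures)
```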

3. Uniform Resource Identifiers (URIs)

Now you may be wondering, how does the computer know that what is meant by temperature in this document means the same thing in a document about temperature data from another location? This is where URIs come in. URIs look like URLs, but their job is to identify a thing or a term, not only to give the address of a web page. When documents point to the same URI for a term, software can tell that they mean the same term, which keeps them consistent with one another. With URIs, the text above would then look like this:

<?xml version="1.0"?>
<!-- The "term" prefix is shorthand for http://purl.org/, so term:date
     stands for http://purl.org/date and term:temperature for
     http://purl.org/temperature. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:term="http://purl.org/">
<rdf:Description rdf:about="http://purl.org/location/WoodsHole">
<term:date>04012008</term:date>
<term:temperature>44</term:temperature>
</rdf:Description>
<rdf:Description rdf:about="http://purl.org/location/WoodsHole">
<term:date>04022008</term:date>
<term:temperature>45</term:temperature>
</rdf:Description>
<rdf:Description rdf:about="http://purl.org/location/WoodsHole">
<term:date>04032008</term:date>
<term:temperature>44</term:temperature>
</rdf:Description>
<rdf:Description rdf:about="http://purl.org/location/WoodsHole">
<term:date>04042008</term:date>
<term:temperature>46</term:temperature>
</rdf:Description>
<rdf:Description rdf:about="http://purl.org/location/WoodsHole">
<term:date>04052008</term:date>
<term:temperature>45</term:temperature>
</rdf:Description>
</rdf:RDF>
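XML ties a tag name to a URI through a namespace prefix: the prefix stands for the first part of the URI and the tag's local name supplies the rest. The following Python sketch, using only the standard library, parses a single reading and prints the full URI behind each tag. The prefix name term is an arbitrary choice made here, and the purl.org addresses are the illustrative ones from this example, not confirmed live vocabularies.

```python
import xml.etree.ElementTree as ET

rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:term="http://purl.org/">
  <rdf:Description rdf:about="http://purl.org/location/WoodsHole">
    <term:date>04012008</term:date>
    <term:temperature>44</term:temperature>
  </rdf:Description>
</rdf:RDF>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

root = ET.fromstring(rdf_xml)

statements = []
for description in root.findall(f"{{{RDF_NS}}}Description"):
    subject = description.get(f"{{{RDF_NS}}}about")
    for prop in description:
        # ElementTree reports tags as "{namespace}localname"; joining the two
        # parts gives the full URI that the prefixed tag stands for.
        namespace, local_name = prop.tag[1:].split("}")
        statements.append((subject, namespace + local_name, prop.text))

for statement in statements:
    print(statement)
```

Two documents that choose different prefixes but declare the same namespace yield the same full URIs, which is what lets software treat their terms as identical.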

4. Ontologies

Of course, the terms that you use to define your data are not random. They would be useless if they didn't have some kind of semantic meaning, that is, if the relationships among the terms weren't established. This is where ontologies come in. Ontologies are structured vocabularies that RDF statements refer to in order to give their terms meaning. Anyone can create an ontology and put it anywhere on the Web, but it only really becomes useful when many people use it, because shared terms are what allow datasets to interact with other datasets. For example, you might have a "location" ontology that includes the names of cities around the globe. Anytime you refer to one of these cities, like "Woods Hole," you simply point to the URI where that ontology, and thus the term, is located. A search engine for ontologies is Swoogle.
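The value of a shared vocabulary can be sketched in a few lines of Python. Everything below is hypothetical: the vocabulary is a plain dictionary standing in for a published ontology, the URIs are placeholders, and both datasets and their column names are invented.

```python
# Hypothetical shared vocabulary: concept name -> URI (placeholder URIs).
ONTOLOGY = {
    "date": "http://purl.org/date",
    "temperature": "http://purl.org/temperature",
}

# Two invented datasets whose local column names differ. Each declares which
# ontology concept every column corresponds to.
woods_hole = {
    "mapping": {"day": "date", "temp": "temperature"},
    "rows": [{"day": "04012008", "temp": 44}],
}
seattle = {
    "mapping": {"observed": "date", "water_temp": "temperature"},
    "rows": [{"observed": "04012008", "water_temp": 48}],
}


def relabel(dataset):
    """Return the rows with local column names replaced by ontology URIs."""
    mapping = dataset["mapping"]
    return [
        {ONTOLOGY[mapping[column]]: value for column, value in row.items()}
        for row in dataset["rows"]
    ]


combined = relabel(woods_hole) + relabel(seattle)
temperatures = [row[ONTOLOGY["temperature"]] for row in combined]
print(temperatures)
```

Once both datasets are keyed by the same URIs, they can be queried together even though their original column names had nothing in common.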

 

For advice on a data management project, contact:

data-management@mit.edu




Text licensed under Creative Commons, unless otherwise noted. All other media all rights reserved unless otherwise noted.