Semantic Web Design Patterns—Data Publishing Patterns

Data Management Design Patterns for the Semantic Web

Introduction

This is the fifth tutorial on Linked Data patterns. In the last tutorial we introduced the concept of named graphs and looked at some patterns for applying them to help manage our RDF data.

In this tutorial we will be looking at some data publishing patterns that can guide the process of publishing and discovering datasets.

Prerequisites

Today's Lesson

Whether we're publishing data internally within an enterprise, or on the web as Linked Open Data, it's important to think carefully about the life-cycle of a dataset. There are a number of established patterns that can be applied to support the evolution of a dataset, enable dataset discovery, and manage the migration of datasets to new locations.

Publishing Linked Data

There are many different methods for publishing RDF data to the web or within an enterprise. Data can be made accessible as static RDF documents; via templated views from an existing web application; through mapping an existing RDBMS into RDF; or by directly publishing data from an RDF triplestore. The right approach depends on many factors including the amount of data being published, its frequency of update, and the costs of extending or replacing existing systems.

But no matter what approach is taken, there are a number of issues to consider. One obvious question is: How much data could or should be shared? Security and privacy are considerations in whatever context you're publishing data. Modelling and converting data also takes effort, so understanding which data items are most valuable can help inform decisions about how to prioritize which data should be published first.

A good rule of thumb is to publish early and often. If data is accessible then users can begin consuming it. This provides a useful feedback loop that can help guide further data-publishing efforts.

But we then need to consider how to improve a dataset over time to increase its utility. We also need to ensure that the data, and any related APIs, are discoverable by client applications. Every dataset is likely to have its own lifecycle and, ultimately, may need to be migrated to a new home or a new owner, or even retired.

There are design patterns that can guide us towards good solutions to each of these problems.

Data Publishing Patterns

Progressive Enrichment

How can the quality of data or a data model be improved over time?

Solution

After publishing an initial version of a dataset, create new releases that steadily add more data and refine the model. More information can be cleanly added without impacting existing clients, while the new data becomes available for use in more powerful queries. The richness of the graph model can evolve and expand over time.
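
As a rough sketch of how this might look in practice, the following uses Python's rdflib to merge an enrichment release into an existing graph; the example.org URIs and the properties used are purely illustrative:

    from rdflib import Graph

    # Initial release: a minimal description of a single resource.
    initial = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://example.org/people/alice> a foaf:Person ;
        foaf:name "Alice" .
    """

    # A later enrichment release: extra detail plus a newly added resource.
    enrichment = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://example.org/people/alice>
        foaf:based_near <http://example.org/places/bristol> ;
        foaf:knows <http://example.org/people/bob> .

    <http://example.org/people/bob> a foaf:Person ;
        foaf:name "Bob" .
    """

    g = Graph()
    g.parse(data=initial, format="turtle")
    # Merging the new release is just another parse; existing statements are untouched.
    g.parse(data=enrichment, format="turtle")

    print(g.serialize(format="turtle"))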

Discussion

The flexibility of RDF's graph model—and the ease with which new statements can be merged with old—makes it easy to steadily enrich a dataset over time. Progressive Enrichment can happen along several different dimensions.

Firstly, the resources already present in the dataset can be annotated with additional properties or relationships. This increases the amount of detail in the dataset.

Secondly, new resources or types of resources can be added to expand the coverage or scope of the dataset.

Finally, the model itself (i.e. the structure of the graph) can be expanded and revised, e.g. to increase its accuracy. For example, simple binary relationships between resources can be supplemented with Qualified Relations that provide useful context.

New data, relationships and model elements can be added to a dataset without impacting previously published data. Clients can adapt to use the newly available information at their own pace. One important way that a dataset can evolve is through the addition of links to other datasets.

Equivalence Links

How do we indicate that different URIs refer to the same resource or concept?

Solution

Datasets published by different organisations often redefine the same resources or types. Declaring these equivalences simplifies data integration across different sources. Equivalences can be declared using several different properties. Use the property that is most appropriate for the type of equivalence you wish to express (a short sketch follows the list):

  • owl:sameAs is used to declare that two resources with different URIs are, in fact, identical
  • owl:equivalentProperty or owl:equivalentClass are used to declare that properties or classes in two vocabularies are actually the same
  • skos:exactMatch is used to declare that two concepts in a SKOS thesaurus are the same
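
As a brief sketch (again using rdflib; all of the URIs below are hypothetical), each kind of equivalence is simply another statement added to our data:

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL, SKOS

    g = Graph()

    # Two publishers have minted different URIs for the same person.
    g.add((URIRef("http://example.org/people/alice"),
           OWL.sameAs,
           URIRef("http://other.example.com/id/alice-smith")))

    # Two vocabularies define equivalent classes.
    g.add((URIRef("http://example.org/ns#Author"),
           OWL.equivalentClass,
           URIRef("http://other.example.com/schema#Writer")))

    # Two SKOS concept schemes contain the same concept.
    g.add((URIRef("http://example.org/concepts/astronomy"),
           SKOS.exactMatch,
           URIRef("http://other.example.com/subjects/astronomy")))

    print(g.serialize(format="turtle"))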

Discussion

Linked Data is published by different organisations, at different times, with little or no prior coordination. This makes it inevitable that there will be overlaps between datasets. They may use different, but equivalent, vocabulary for describing their data. They may also define URIs for the same concept or resource. Integrating data between different organisations requires us to resolve these overlaps.

By including Equivalence Links in our data, we can unambiguously declare where resources are identical. Publishing this data simplifies data integration and enriches the entire graph of Linked Data. A reasoner can use these equivalence statements to automatically integrate data.

It is important to use the right property when declaring equivalence links. This avoids applications deriving incorrect conclusions. OWL defines properties for declaring equivalences between classes, properties, and individual resources. The SKOS vocabulary defines its own equivalence linking terms. Other vocabularies may also define new terms.
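
To illustrate how a reasoner can exploit these links, here is a small sketch using rdflib together with the owlrl package (one possible choice of reasoner, not the only one); after expansion, statements made about one URI also apply to its owl:sameAs equivalent:

    from rdflib import Graph
    from owlrl import DeductiveClosure, OWLRL_Semantics

    data = """
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://example.org/people/alice> owl:sameAs <http://other.example.com/id/alice-smith> .
    <http://other.example.com/id/alice-smith> foaf:name "Alice Smith" .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    # Materialise the OWL RL entailments: the foaf:name statement is now
    # also asserted against the http://example.org URI.
    DeductiveClosure(OWLRL_Semantics).expand(g)

    print(g.serialize(format="turtle"))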

Dataset Autodiscovery

How can an application discover the datasets, and associated APIs, published by a website?

Solution

Publishing machine-readable descriptions of our datasets in a well-known location can help applications bootstrap themselves into using our data. The conventional way to describe an RDF dataset is using the VoID vocabulary. Using VoID we can provide metadata about our dataset, including pointers to a SPARQL endpoint. The recommended location in which to publish these VoID descriptions is the /.well-known/void URL.

For example, a VoID description of datasets available from http://example.org/ would be found at http://example.org/.well-known/void. A GET request to that URL will return the VoID description in RDF/XML or Turtle.
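
As a sketch of what a client might do with such a description (using rdflib; the dataset, endpoint, and dump URLs are invented for the example, and in practice the description would be fetched from the well-known URL rather than embedded inline):

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    VOID = Namespace("http://rdfs.org/ns/void#")

    # A sample VoID description; a real client would GET
    # http://example.org/.well-known/void instead.
    description = """
    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    <http://example.org/datasets/people> a void:Dataset ;
        dcterms:title "Example People Dataset" ;
        void:sparqlEndpoint <http://example.org/sparql> ;
        void:dataDump <http://example.org/dumps/people.nt.gz> .
    """

    g = Graph()
    g.parse(data=description, format="turtle")

    # Bootstrap: find each dataset's SPARQL endpoint and data dump, if declared.
    for dataset in g.subjects(RDF.type, VOID.Dataset):
        print("Dataset:  ", dataset)
        print("Endpoint: ", g.value(dataset, VOID.sparqlEndpoint))
        print("Data dump:", g.value(dataset, VOID.dataDump))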

Discussion

RFC 5785 defines a way to register well-known locations for exposing metadata about a website. An entry in that registry defines a location for discovering VoID descriptions of datasets exposed by a domain.

By publishing dataset descriptions at that location we allow applications to use Dataset Autodiscovery to find datasets and APIs. This can allow an application to automatically make use of additional data or resources. For example, a Linked Data browser may be able to provide a richer view over a dataset if it can discover a SPARQL endpoint. Or a crawler might be able to use a VoID description to discover data dumps to support more efficient harvesting.

Unpublish

How do we temporarily or permanently remove some Linked Data from the web?

Solution

There are a number of reasons why a dataset might need to be temporarily or permanently removed from the web. This might be part of an infrastructure upgrade or because a dataset is no longer being maintained. Datasets may also need to be migrated to new locations. In all of these cases the correct initial procedure is to use an appropriate HTTP status code to communicate the change in status of the affected resources (a server-side sketch follows the list):

  • 503 to indicate that a resource is temporarily unavailable (e.g. the server is down or under maintenance)
  • 410 to indicate that a resource has been removed
  • 301 to indicate that a resource has permanently moved to a new location
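
As a server-side sketch (assuming, purely for illustration, a small Flask application fronting the affected URLs; the paths and target domain are hypothetical):

    from flask import Flask, Response, redirect

    app = Flask(__name__)

    @app.route("/data/retired/<path:resource>")
    def gone(resource):
        # 410 Gone: the resource has been permanently removed.
        return Response("This resource has been unpublished.", status=410)

    @app.route("/data/moved/<path:resource>")
    def moved(resource):
        # 301 Moved Permanently: the dataset now lives at a new location.
        return redirect("https://data.example.net/" + resource, code=301)

    @app.route("/data/maintenance/<path:resource>")
    def unavailable(resource):
        # 503 Service Unavailable: temporary outage; Retry-After hints when to retry.
        return Response("Dataset temporarily unavailable.", status=503,
                        headers={"Retry-After": "3600"})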

Discussion

Linked Data ties our data into the web, so we need to use web infrastructure in order to help manage it. The HTTP protocol defines status codes that communicate some useful semantics around whether a resource is temporarily or permanently unavailable, or may have moved location. Correct use of status codes will ensure that applications can correctly interpret what has happened.

In circumstances where a dataset has permanently moved to a new set of URLs (e.g. to a new domain), publishing Equivalence Links between the old and new locations can help clients integrate the data. Further, to support archiving scenarios, providing data dumps of an Unpublished dataset is also important, should data licensing allow it.

Summary

In this tutorial we've looked at some patterns associated with publishing Linked Data. Publishing early and often helps set up a feedback loop with data consumers. Progressively enriching a dataset with additional data and with links to other data sources allows a dataset to be refined over time. We don't have to fully model or expose everything immediately. Links to third-party datasets will certainly evolve over time as more data becomes available from other sources.

Datasets are only useful if they are discoverable. By publishing dataset descriptions, particularly in well-known locations, we can help applications bootstrap themselves into using the data and APIs we publish.

In the next and final tutorial in this series we will move on to explore patterns that can help us create good Linked Data applications.

About the Author

Leigh Dodds
Freelance consultant specialising in Open and Linked Data