Semantic Web Design Patterns—Identifier Design Patterns
This is the second tutorial on Linked Data patterns. In the first tutorial we looked at the concept of a design pattern and the benefits of applying such patterns in development. Design patterns are a great way to capture and share knowledge within a community, building up a shared vocabulary that references tried and tested solutions to common problems.
In this tutorial we will be looking more closely at identifier design patterns. These patterns capture useful guidance on how to create good identifiers for RDF data. While many of these patterns are drawn from Linked Open Data use cases, they apply equally well within the enterprise.
Identifiers in Linked Data
Identifiers are the cornerstone of any RDF dataset. RDF uses URIs to identify resources as well as the properties and classes that we use to describe them. One major benefit of being consistent about how new URIs are created and which URIs you refer to is that data from multiple sources can be integrated on the fly with little extra effort. Publishing data for use within the enterprise, and particularly as Open Data on the web, is therefore best achieved using clean, stable HTTP URIs.
It is already best practice in modern web frameworks to have simple clear URLs for the pages in an application. This makes them easy to link to, bookmark, crawl, and "like". In RDF datasets we apply the same principle when assigning URIs to resources. Strong identifiers make it easier to link together datasets and enrich an existing dataset with additional annotations. Within the enterprise we may want to facilitate integration across different departments; on the web we're enabling linking and annotation by a variety of different users and organizations.
While RDF allows us to capture data using "blank nodes" (resources without a global identifier), these are best avoided, as mentioned in RDF 101, because they limit the ability to later reference a resource or annotate it with more data.
When we're creating a new dataset it's often useful to pay some initial attention to deciding how best to structure the identifiers for our resources. These rules for how to construct URIs are often referred to as an identifier scheme.
The guidance around creating Cool URIs is equally applicable whether we're publishing documents or data. But we can tease out some useful patterns that are mainly applicable to data publishing.
The next sections introduce some identifier patterns covering some questions that are commonly encountered when developing an identifier scheme for a dataset.
For the purposes of this article we'll use a slightly abbreviated form of design pattern focusing on the question, a solution and some brief discussion. Links are provided to the full pattern description in the Linked Data Patterns book.
How can we create unique URIs from existing data?
Examine the dataset to identify the existing unique identifiers for resources and use those as the basis for constructing URIs. For example, a database table containing product information may have a unique numeric code for each item. We might use these codes to create identifiers under a common base URI.
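As a minimal sketch of this approach, the following Python snippet appends each product code to a common base URI. The base URI `http://example.com/` and the sample codes are assumptions made for illustration; they do not come from any real dataset.

```python
from urllib.parse import quote

# Hypothetical base URI, assumed for illustration only.
BASE_URI = "http://example.com/"


def uri_from_natural_key(key, base=BASE_URI):
    """Build a URI by appending a percent-encoded natural key to a base URI."""
    # safe="" also encodes "/" so a key can never add extra path segments.
    return base + quote(str(key), safe="")


# Hypothetical product codes, as might be read from a primary key column.
product_codes = [1001, 1002, 1003]
product_uris = [uri_from_natural_key(code) for code in product_codes]
```

Percent-encoding the key is a defensive choice here: numeric codes pass through unchanged, while keys containing spaces or slashes still produce a valid URI.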
Relational databases will typically have primary keys that contain unique identifiers for every record in a table. Other sources of identifiers are any property whose value uniquely identifies a resource. In some cases multiple properties may need to be combined to create a uniquely identifying value. These existing identifiers are referred to as Natural Keys.
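Where no single property is unique, one possible way to combine several values into a key is sketched below. The order number and line number are hypothetical fields, and the hyphen separator is an assumption; any separator works as long as it cannot occur inside the individual values.

```python
from urllib.parse import quote


def composite_key(*parts, separator="-"):
    """Join several property values into a single URI-safe key."""
    # Each part is percent-encoded on its own before joining.
    return separator.join(quote(str(part), safe="") for part in parts)


# Hypothetical example: an order line identified by order number plus line number.
key = composite_key(2041, 3)
```

Note that the hyphen is not percent-encoded, so values that themselves contain hyphens would make the combined key ambiguous; in that case a different separator would be needed.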
By deriving our URIs from an existing identifier we avoid the need to define a new process for assigning an identifier to a resource. We also avoid the need to maintain a mapping between "legacy" identifiers and URIs.
There may be several Natural Keys for a given resource. Choosing the right identifier can help us link together datasets more easily. We can also provide more structure to our URIs to help make them more meaningful. The following two patterns explore those aspects in more detail.
How can we create more predictable, human-readable URIs?
Add structure to a URI using human-readable labels that indicate the type of resource to which the URI has been assigned. A frequently used convention is to use the plural form of a class name. For example, URIs for products and orders might each be created by placing a Natural Key under the plural name of its class.
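The convention can be sketched with a small table of URI templates, one per resource type. The base URI, the path segments `products` and `orders`, and the sample key are assumptions chosen for illustration.

```python
from urllib.parse import quote

# Hypothetical base URI, assumed for illustration only.
BASE_URI = "http://example.com"

# One template per resource type: plural class name followed by the natural key.
URI_TEMPLATES = {
    "Product": BASE_URI + "/products/{key}",
    "Order": BASE_URI + "/orders/{key}",
}


def patterned_uri(resource_type, key):
    """Fill the template for the given resource type with an encoded key."""
    return URI_TEMPLATES[resource_type].format(key=quote(str(key), safe=""))


# The same numeric key yields distinct URIs for different resource types.
product_uri = patterned_uri("Product", 1001)
order_uri = patterned_uri("Order", 1001)
```

Because the resource type is part of the path, a product and an order that happen to share the key 1001 still receive different URIs.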
Patterned URIs are a natural way to create URIs from Natural Keys. Clean, clear URIs are more memorable and can be easier for developers to work with. Patterned URIs can be created using URI templates that indicate how to generate identifiers for the different types of resource in a dataset. These conventions can be used not just when publishing the data, but also when building links into a dataset.
By partitioning the URIs based on the type of resource we also avoid any potential clashes between Natural Keys. For example, both Products and Orders might use simple numeric identifiers in a relational database; giving them separate URI spaces avoids potential confusion.
How can we simplify the inter-linking of datasets?
Where possible, create Patterned URIs using Natural Keys that are not tied to a particular application or dataset, such as industry-standard identifiers or codes. To continue our earlier example, a product database might have both an internal primary key for a product and a GTIN (a global product code). We can use the GTIN in our Patterned URI.
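A minimal sketch of this choice follows: the record carries both an internal key and a GTIN, and the URI is built from the GTIN. The record layout, the template and the GTIN value are all placeholders assumed for illustration.

```python
from urllib.parse import quote

# Hypothetical URI template, assumed for illustration only.
PRODUCT_TEMPLATE = "http://example.com/products/{gtin}"

# Hypothetical database row: `id` is the internal primary key,
# `gtin` is the shared key (a placeholder value, not a real product).
row = {"id": 1001, "gtin": "00012345678905", "name": "Example product"}


def product_uri(record):
    """Build the product URI from the shared key rather than the internal key."""
    return PRODUCT_TEMPLATE.format(gtin=quote(record["gtin"], safe=""))


uri = product_uri(row)
```

A consumer who knows only the GTIN and the template can construct the same URI without access to the internal primary key.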
It is very common that an organization or an industry has a set of standard identifiers or codes that are used to help data integration. These Shared Keys might not be used internally in an application or database as the primary identifier, but they are usually correlated with these "local" identifiers.
When we're publishing Linked Data for others to consume, the goal is to simplify data integration, so it is natural to use more standardized identifiers when constructing Patterned URIs. That way we can make it easy for developers to build links into a dataset or discover data by simply building a URI from a known identifier and a URI template.
How do we publish non-global identifiers in RDF?
A dataset may have several Natural Keys that identify a resource. All of these identifiers should be included as actual data in the dataset and not just used to construct Patterned URIs. Continuing our previous example, we can include a GTIN for a resource as a simple literal property using, for example, the Dublin Core identifier property.
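One way to state a key as data is sketched below: a small function that writes a single N-Triples line using the Dublin Core Terms `identifier` property. The product URI and GTIN value are placeholders assumed for illustration, and the escaping handles only backslashes and double quotes, which is enough for simple keys.

```python
# Dublin Core Terms identifier property.
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"


def literal_key_triple(subject_uri, key, predicate=DCT_IDENTIFIER):
    """Return one N-Triples line stating a key as a plain literal."""
    # Minimal escaping: backslash first, then double quote.
    escaped = str(key).replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_uri}> <{predicate}> "{escaped}" .'


# Hypothetical product URI and GTIN, assumed for illustration only.
triple = literal_key_triple(
    "http://example.com/products/00012345678905",
    "00012345678905",
)
```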
Where we have multiple identifiers we can add additional repeated properties, perhaps using a custom property to help document what kind of identifier is being described.
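As a rough sketch of repeated identifier properties, the function below emits a generic identifier triple for every key and, where a more specific property is known, a second triple that documents the kind of identifier. The custom property URIs under `http://example.com/schema/` and the sample keys are hypothetical.

```python
# Dublin Core Terms identifier property.
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# Hypothetical custom properties documenting the kind of identifier.
KEY_PROPERTIES = {
    "gtin": "http://example.com/schema/gtin",
    "sku": "http://example.com/schema/sku",
}


def identifier_triples(subject_uri, keys):
    """Return (subject, predicate, literal) tuples for every known key."""
    result = []
    for kind, value in keys.items():
        # Always state the key with the generic identifier property.
        result.append((subject_uri, DCT_IDENTIFIER, str(value)))
        # Add a typed triple when a custom property exists for this kind.
        if kind in KEY_PROPERTIES:
            result.append((subject_uri, KEY_PROPERTIES[kind], str(value)))
    return result


# Hypothetical product with a shared key and an internal key.
triples = identifier_triples(
    "http://example.com/products/00012345678905",
    {"gtin": "00012345678905", "internal": 1001},
)
```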
Including Literal Keys in a dataset makes it possible for a developer or application to discover the URI for a resource using a simple SPARQL query. While a dataset may be using Patterned URIs constructed from shared Natural Keys, the consumer of that data may not be aware of the URI templates being used, or have the necessary shared identifier used to construct those URIs. Including all available identifiers in the data ensures that mappings between keys and URIs can be shared and extracted. This supports data discovery as well as integration with legacy systems.
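The lookup described above can be sketched as follows. A real consumer would send the SPARQL query to an endpoint; here a plain Python scan over an in-memory list of triples stands in for the query engine. The triples, URIs and key values are hypothetical.

```python
# Dublin Core Terms identifier property.
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# Hypothetical in-memory dataset of (subject, predicate, literal) triples.
triples = [
    ("http://example.com/products/00012345678905", DCT_IDENTIFIER, "00012345678905"),
    ("http://example.com/products/00012345678905", DCT_IDENTIFIER, "1001"),
    ("http://example.com/products/00098765432105", DCT_IDENTIFIER, "1002"),
]

# The SPARQL query a consumer might send for the same lookup.
SPARQL_TEMPLATE = 'SELECT ?uri WHERE {{ ?uri <{predicate}> "{key}" }}'


def find_uris(key, data=triples, predicate=DCT_IDENTIFIER):
    """Return the sorted URIs of resources that carry the given literal key."""
    return sorted({s for s, p, o in data if p == predicate and o == key})


# Discover the URI for a resource known only by its legacy key.
query = SPARQL_TEMPLATE.format(predicate=DCT_IDENTIFIER, key="1001")
matches = find_uris("1001")
```

The consumer needs to know only the legacy key, not the URI template, to find the resource.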
In this article we've introduced the basic identifier patterns that are most commonly applied when designing the URI scheme for a dataset. The patterns cover basic URI construction from existing identifiers, guidance on choosing between identifiers, and use of URI templates to help create more memorable and well-structured URIs.
By carefully selecting the right identifiers to use when constructing our URIs we can help make it easier for re-users to discover more data. By also ensuring that we include legacy identifiers in a dataset, we can help capture mappings between different legacy identifier schemes.
In the next tutorial in this series we'll look at some useful modeling patterns that can help us structure our dataset to get the most from RDF's graph model.