Authorities and Linked Data: Making Authorities Accessible as Linked Data
This document reflects six years of experience working with 11 authority sources to provide access to them as linked data. Based on that experience, I make recommendations for best practices in creating a linked data API for an authority.
The authority data can be represented in any ontology appropriate for that data. This can be a common ontology (e.g. SKOS, schema.org, BIBFRAME), a custom ontology specific to the authority (e.g. DBpedia, GeoNames, MeSH), or a mix of predicates from multiple ontologies. Using an existing ontology is preferred over starting a new one from scratch.
API requests return results as an RDF serialization. The serialization can be any RDF format (e.g. JSON-LD, N-Triples, Turtle, RDF/XML). Best practice is to set the response Content-Type header to the MIME type of the RDF format.
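For reference, the standard MIME types for these serializations are collected in the small Python mapping below (the mapping and its name are mine, not part of any particular API):

```python
# Standard MIME types for common RDF serializations; a server should
# set the response Content-Type header to the value matching the
# serialization it returns.
RDF_MIME_TYPES = {
    "json-ld":   "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle":    "text/turtle",
    "rdf-xml":   "application/rdf+xml",
}
```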
Best practice is to use RDF language tagging of literals. This facilitates use of the authority in multilingual sites. Not all authorities have multilingual literals, and it is OK to omit language tags, but you may save yourself refactoring the data if you add support for multilingual literals later.
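A minimal sketch of language-tagged literals using Python's rdflib; the entity URI and labels are invented for illustration:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
entity = URIRef("http://example.org/authority/term/123")  # made-up URI

# The same label in two languages; clients can filter on the tag.
g.add((entity, SKOS.prefLabel, Literal("water", lang="en")))
g.add((entity, SKOS.prefLabel, Literal("agua", lang="es")))

# Select only the English literals.
english = [o for o in g.objects(entity, SKOS.prefLabel)
           if o.language == "en"]
```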
This API request is given an identifier for a single entity and returns the relevant data about the entity as RDF.
URIs should be resolvable and return an RDF serialization of the graph of data related to the entity. The bounds of the graph are determined by the authority provider as makes sense for the authority’s data.
In addition to resolvable URIs, some APIs provide a URL to which the URI is passed as a parameter to identify the term.
In a few cases, authorities create a term fetch API that accepts an ID as the parameter. If the ID is used, it is highly recommended that there be a triple for the entity that specifies the ID exactly as it should be passed to the API request. Again, this is in addition to resolvable URIs.
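The sketch below shows both fetch styles from a client's perspective, using Python's requests library; the base URL, endpoint path, and parameter name are hypothetical:

```python
import requests

# Hypothetical entity URI; real authorities publish their own URI patterns.
uri = "http://example.org/authority/term/123"

# Resolve the URI directly, requesting an RDF serialization via
# content negotiation (assumes the server honors the Accept header).
resp = requests.get(uri, headers={"Accept": "application/ld+json"})
resp.raise_for_status()
data = resp.json()

# Some authorities instead accept the URI (or an ID) as a parameter
# on a fetch endpoint; the endpoint path here is illustrative.
resp = requests.get("http://example.org/authority/fetch",
                    params={"uri": uri},
                    headers={"Accept": "application/ld+json"})
```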
In all cases, the results are returned in an RDF serialization. The result graph generally includes all triples where the requested URI is the subject. Depending on the ontology and authority data, it may also include additional triples extending the graph to include all meaningful data for the requested entity.
For example, for an authority whose data is primarily in the SKOS ontology, first-level triples are probably sufficient. A more complex ontology, like BIBFRAME, requires constructing a more complex graph to get all the data about the entity.
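As a rough illustration, the following Python helpers build SPARQL CONSTRUCT queries for these two cases; the exact shape of the extended graph depends on the ontology:

```python
def first_level_query(uri: str) -> str:
    # Every triple where the entity is the subject; adequate for
    # flat ontologies such as SKOS.
    return f"CONSTRUCT {{ <{uri}> ?p ?o }} WHERE {{ <{uri}> ?p ?o }}"

def two_level_query(uri: str) -> str:
    # Extends the graph one hop further; a more complex ontology like
    # BIBFRAME may need additional hops or specific property paths.
    return (
        f"CONSTRUCT {{ <{uri}> ?p ?o . ?o ?p2 ?o2 }} "
        f"WHERE {{ <{uri}> ?p ?o . OPTIONAL {{ ?o ?p2 ?o2 }} }}"
    )
```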
Given a string query, the API returns a set of entities as results with data about each entity represented in an RDF serialization.
NOTE: The exact name of the parameter is not important. The parameter names shown here are the ones commonly used for passing these values into the API.
parameter | description |
---|---|
q | string query |
Minimal parameters plus…
parameter | description |
---|---|
maxRecords | how many results to return |
startRecord | which record within the full set of search results should be the first record returned |
NOTE: startRecord and maxRecords can be used together to implement pagination. For example,
- startRecord=1&maxRecords=5 returns records 1-5
- startRecord=6&maxRecords=5 returns records 6-10
- startRecord=11&maxRecords=5 returns records 11-15
- etc.
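A small helper, assuming the parameter names from the tables above, captures this arithmetic:

```python
def page_params(page: int, page_size: int = 5) -> dict:
    # startRecord is 1-based: page 1 -> 1, page 2 -> 6, page 3 -> 11, ...
    return {"startRecord": (page - 1) * page_size + 1,
            "maxRecords": page_size}
```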
Minimal and good parameters plus…
parameter | description |
---|---|
lang | return literals in the specified language |
entity | when the authority's data is significantly separated by entity class, the entity parameter allows limiting the result set to the matching subset of the authority data |
It is fine to require additional parameters to facilitate subsetting or sorting of the authority data in a meaningful way.
In our experience, using SPARQL directly for search can have performance issues: at best it is slow, and at worst results are not returned at all. Even when it is performant, it does not provide ranking of search results. Without ranking, the same search can return results in a different order each time it is run, giving an inconsistent experience for end users.
It is recommended that data stored in a triple store be accompanied by a search index (e.g. Lucene, Solr, Elasticsearch) for effective and efficient search. The index is generated over the set of literals that makes the most sense for the authority data. Minimally, this includes the primary label. It may also include other literals (e.g. alternate labels, broader terms, narrower terms, notes). For our local cache, we work with our metadata specialists to determine the best set of literals to include. With Lucene or Solr, the literal values can be weighted to refine the search results.
NOTE: If supporting language tagging, the literals in the search index can be included in multiple languages allowing for native language searching of the authority.
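The sketch below, assuming rdflib and a SKOS-based authority, shows how index documents might be built over the chosen literals; the field names are illustrative:

```python
from rdflib import Graph
from rdflib.namespace import SKOS

def index_documents(g: Graph):
    # One index document per entity, covering the literals chosen for
    # search (here prefLabel and altLabel; per-language fields could
    # be added when supporting language tagging).
    for subject in set(g.subjects(SKOS.prefLabel, None)):
        yield {
            "uri": str(subject),
            "label":     [str(o) for o in g.objects(subject, SKOS.prefLabel)],
            "alt_label": [str(o) for o in g.objects(subject, SKOS.altLabel)],
        }
```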
Our cache system's Search API performs the following steps to fulfill a search query request. This or a similar approach is recommended for authority providers.
- search the index for the query string, which returns a set of subject URIs and a search rank for each
- construct a performant SPARQL query to make a precise request by URI from the triple store for each match
  - This SPARQL query pulls enough content from the graph around each subject URI to provide context for the match. (More on context below; see the Data in Results section.)
  - If supporting language tagging, the value of the lang parameter can be used to limit the literals returned as part of the search results.
- inject a rank predicate for each search result's subject URI to provide a means for consistently sorting the results of a search (see the sketch after this list). We use the http://vivoweb.org/ontology/core#rank predicate; you can use a different predicate if you prefer.
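A minimal sketch of the rank-injection step, using rdflib and the VIVO rank predicate mentioned above:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import XSD

VIVO_RANK = URIRef("http://vivoweb.org/ontology/core#rank")

def inject_ranks(results: Graph, ranked_uris) -> Graph:
    # ranked_uris: subject URIs in index-score order, best match first.
    # The explicit rank triple lets clients re-sort results even after
    # the graph is reloaded and stream order is lost.
    for rank, uri in enumerate(ranked_uris, start=1):
        results.add((URIRef(uri), VIVO_RANK,
                     Literal(rank, datatype=XSD.integer)))
    return results
```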
The original search workflow, which combined an index search with SPARQL queries to gather extended context, continued to perform poorly. This led to an approach that removes all SPARQL queries from the search process.
To allow for this, all SPARQL queries that gather extended context are run during generation of the index. The data gathered by these queries is assembled into a payload graph that is stored in a result field on the index document. This simplifies the search workflow:
- search the index for the query string
- for each result, retrieve the graph stored in the index document's result field (see the sketch after this list)
- inject a rank predicate for each search result's subject URI, exactly as in the original workflow
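A minimal sketch of storing and retrieving the payload graph with rdflib; the result field name follows the description above, and N-Triples is just one workable serialization:

```python
from rdflib import Graph

def store_payload(doc: dict, context_graph: Graph) -> dict:
    # Index time: serialize the pre-computed context graph into the
    # index document's result field.
    doc["result"] = context_graph.serialize(format="nt")
    return doc

def load_payload(doc: dict) -> Graph:
    # Search time: parse the stored serialization back into a graph;
    # no SPARQL query is needed at this point.
    return Graph().parse(data=doc["result"], format="nt")
```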
In both the original and updated search workflows, the architectural split between indexing/search and response graph formulation supports flexibility in the semantics of search (particularly the ordering of results) while also maintaining flexibility in shaping context for the user through how the graph is populated. New requirements for context can thus be accommodated without fully reindexing the authority source. The ranking triple ensures that result ordering is retained across the various system interfaces, particularly when result graphs are reloaded into local caches, which erases the positional ordering of the incoming result stream.
The results for each matching entity will include a subset of the full graph associated with the entity. Below I specify common types of data that are included in the subset graph. They are specified by a role instead of a specific predicate or ldpath because each authority may be using a different ontology.
role | description |
---|---|
primary label | the primary label for the entity (e.g. skos:prefLabel, madsrdf:authoritativeLabel) |
role | description |
---|---|
rank | rank in the search results that allows for sorting |
NOTE: This is marked as HIGHLY RECOMMENDED only because, at this writing, I have yet to work with an authority that provides a rank predicate in its search results. This is one of the major drivers for caching external authorities. If it were completely up to me, I would mark this as REQUIRED.
role | description |
---|---|
id | unique identifier for the entity within the local authority system (e.g. n1234, c_1234). This is often the final part of the URI, but is not required to be. |
alt label | an alternate label for the entity (e.g. skos:altLabel, madsrdf:variantLabel) |
same as | URI of another entity that is considered the same entity as the result (e.g. skos:exactMatch, owl:sameAs). URI is typically for an entity outside the authority. |
broader | URI of another entity that is a broader term for the result (e.g. skos:broader, geonames:parentFeature). URI is typically for an entity within the authority. |
narrower | URI of another entity that is a narrower term for the result (e.g. skos:narrower, mesh:mapped_from). URI is typically for an entity within the authority. |
Our metadata specialists have identified additional parts of the graph that provide context to aid users in their selection process. These are authority data specific.
For example, in our local cache of the Library of Congress Name Authority for persons, the result graph includes:
role | ldpath |
---|---|
birth date | madsrdf:identifiesRWO/madsrdf:birthDate/rdfs:label |
death date | madsrdf:identifiesRWO/madsrdf:deathDate/rdfs:label |
field of activity | madsrdf:identifiesRWO/madsrdf:fieldOfActivity/rdfs:label |
occupation | madsrdf:identifiesRWO/madsrdf:occupation/madsrdf:authoritativeLabel |
NOTE: This example also shows how the data in the results can come from deeper in the graph. The notation used to specify the path to the data is Marmotta's LDPath.
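For readers more familiar with SPARQL, LDPath expressions like these correspond to SPARQL 1.1 property paths. A sketch for the birth date path (the query and variable names are mine):

```python
# The LDPath madsrdf:identifiesRWO/madsrdf:birthDate/rdfs:label
# maps to an equivalent SPARQL 1.1 property path.
BIRTH_DATE_QUERY = """
PREFIX madsrdf: <http://www.loc.gov/mads/rdf/v1#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entity ?birthDate WHERE {
  ?entity madsrdf:identifiesRWO/madsrdf:birthDate/rdfs:label ?birthDate .
}
"""
```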
The recommendations in this document support the following combined workflow…
- search for a string using the Search Request API, which returns a subset of the data for each entity
- user selects a single entity
- extract the subject URI from the selected entity
- fetch the full data for the selected entity by resolving the subject URI or passing the URI to the Term Fetch Request API (see the sketch after this list)
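Putting it together, a client-side sketch of this workflow using Python's requests; the base URL, endpoint, and JSON-LD response shape are hypothetical:

```python
import requests

BASE = "http://example.org/authority"  # hypothetical authority base URL

def pick_one(results):
    # Placeholder for the user-selection step; here we just take the
    # first subject URI in the JSON-LD @graph (response shape is
    # illustrative, not prescribed by this document).
    return results["@graph"][0]["@id"]

# 1. search for a string; the response is an RDF serialization of matches
results = requests.get(f"{BASE}/search",
                       params={"q": "Ithaca", "maxRecords": 5},
                       headers={"Accept": "application/ld+json"}).json()

# 2./3. the user selects a single entity; extract its subject URI
selected_uri = pick_one(results)

# 4. fetch the full data by resolving the subject URI
full = requests.get(selected_uri,
                    headers={"Accept": "application/ld+json"}).json()
```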
Original article published May 21, 2019 on the Cornell University blog (https://bit.ly/2wddOxO).