Authorities and Linked Data: Making Authorities Accessible as Linked Data
This document reflects six years of experience working with 11 authority sources to provide access to them as linked data. Based on that experience, I make recommendations for best practices in creating a linked data API for an authority.
The authority data can be represented in any ontology appropriate for that data. This can be a common ontology (e.g. SKOS, schema.org, BIBFRAME), a custom ontology specific to the authority (e.g. DBpedia, GeoNames, MeSH), or a mix of predicates from multiple ontologies. Using an existing ontology is preferred over starting a new one from scratch.
API requests return results as an RDF serialization. The serialization can be any RDF format (e.g. JSON-LD, N-Triples, Turtle, RDF/XML). Best practice is to set the response Content-Type header to the MIME type of the RDF format.
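For reference, the standard MIME types for these serializations are collected in the small Python mapping below (the mapping and its name are mine, not part of any particular API):

```python
# Standard MIME types for common RDF serializations; a server should
# set the response Content-Type header to the value matching the
# serialization it returns.
RDF_MIME_TYPES = {
    "json-ld":   "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle":    "text/turtle",
    "rdf-xml":   "application/rdf+xml",
}
```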
Best practice is to use RDF language tagging of literals. This facilitates use of the authority in multilingual sites. Not all authorities have multilingual literals, and it is OK to omit language tags, but you may save yourself refactoring the data if you add support for multilingual literals later.
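A minimal sketch of language-tagged literals using Python's rdflib; the entity URI and labels are invented for illustration:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
entity = URIRef("http://example.org/authority/term/123")  # made-up URI

# The same label in two languages; clients can filter on the tag.
g.add((entity, SKOS.prefLabel, Literal("water", lang="en")))
g.add((entity, SKOS.prefLabel, Literal("agua", lang="es")))

# Select only the English literals.
english = [o for o in g.objects(entity, SKOS.prefLabel)
           if o.language == "en"]
```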
This API request is given an identifier for a single entity and returns the relevant data about the entity as RDF.
URIs should be resolvable and return an RDF serialization of the graph of data related to the entity. The bounds of the graph are determined by the authority provider as makes sense for the authority’s data.
In addition to resolvable URIs, some APIs provide a URL to which the URI is passed as a parameter to identify the term.
In a few cases, authorities create a term fetch API that accepts an ID as the parameter. If the ID is used, it is highly recommended that there be a triple for the entity that specifies the ID exactly as it should be passed to the API request. Again, this is in addition to resolvable URIs.
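The sketch below shows both fetch styles from a client's perspective, using Python's requests library; the base URL, endpoint path, and parameter name are hypothetical:

```python
import requests

# Hypothetical entity URI; real authorities publish their own URI patterns.
uri = "http://example.org/authority/term/123"

# Resolve the URI directly, requesting an RDF serialization via
# content negotiation (assumes the server honors the Accept header).
resp = requests.get(uri, headers={"Accept": "application/ld+json"})
resp.raise_for_status()
data = resp.json()

# Some authorities instead accept the URI (or an ID) as a parameter
# on a fetch endpoint; the endpoint path here is illustrative.
resp = requests.get("http://example.org/authority/fetch",
                    params={"uri": uri},
                    headers={"Accept": "application/ld+json"})
```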
In all cases, the results are returned in an RDF serialization. The result graph generally includes all triples where the requested URI is the subject. Depending on the ontology and authority data, it may also include additional triples extending the graph to include all meaningful data for the requested entity.
For example, for an authority whose data is primarily in the SKOS ontology, first-level triples are probably sufficient. A more complex ontology, like BIBFRAME, requires constructing a more complex graph to get all the data about the entity.
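As a rough illustration, the following Python helpers build SPARQL CONSTRUCT queries for these two cases; the exact shape of the extended graph depends on the ontology:

```python
def first_level_query(uri: str) -> str:
    # Every triple where the entity is the subject; adequate for
    # flat ontologies such as SKOS.
    return f"CONSTRUCT {{ <{uri}> ?p ?o }} WHERE {{ <{uri}> ?p ?o }}"

def two_level_query(uri: str) -> str:
    # Extends the graph one hop further; a more complex ontology like
    # BIBFRAME may need additional hops or specific property paths.
    return (
        f"CONSTRUCT {{ <{uri}> ?p ?o . ?o ?p2 ?o2 }} "
        f"WHERE {{ <{uri}> ?p ?o . OPTIONAL {{ ?o ?p2 ?o2 }} }}"
    )
```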
Given a string query, the API returns a set of entities as results with data about each entity represented in an RDF serialization.
NOTE: The exact name of the parameter is not important. The parameter names shown here are the ones commonly used for passing these values into the API.
parameter | description |
---|---|
q | string query |
Minimal parameters plus…
parameter | description |
---|---|
maxRecords | how many results to return |
startRecord | which record within the full set of search results should be the first record returned |
NOTE: startRecord and maxRecords can be used together to implement pagination. For example,
- startRecord=1&maxRecords=5 returns records 1-5
- startRecord=6&maxRecords=5 returns records 6-10
- startRecord=11&maxRecords=5 returns records 11-15
- etc.
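A small helper, assuming the parameter names from the tables above, captures this arithmetic:

```python
def page_params(page: int, page_size: int = 5) -> dict:
    # startRecord is 1-based: page 1 -> 1, page 2 -> 6, page 3 -> 11, ...
    return {"startRecord": (page - 1) * page_size + 1,
            "maxRecords": page_size}
```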
Minimal and good parameters plus…
parameter | description |
---|---|
lang | return literals in the specified language |
entity | when the authority's data is significantly separated by entity class, the entity parameter allows limiting the result set to the matching subset of the authority data |
It is fine to require additional parameters to facilitate subsetting or sorting of the authority data in a meaningful way.
In our experience, using SPARQL directly for search can have performance issues: at best it is slow, and at worst results are not returned at all. Even when it is performant, it does not provide ranking of search results. Without ranking, the same search can return results in a different order each time it is run, giving an inconsistent experience for end users.
It is recommended that data stored in a triple store be accompanied by a search index (e.g. Lucene, Solr, Elasticsearch) for effective and efficient search. The index is generated over the set of literals that makes the most sense for the authority data. Minimally, this includes the primary label. It may also include other literals (e.g. alternate labels, broader terms, narrower terms, notes). For our local cache, we work with our metadata specialists to determine the best set of literals to include. With Lucene or Solr, the literal values can be weighted to refine the search results.
NOTE: If supporting language tagging, the literals in the search index can be included in multiple languages allowing for native language searching of the authority.
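The sketch below, assuming rdflib and a SKOS-based authority, shows how index documents might be built over the chosen literals; the field names are illustrative:

```python
from rdflib import Graph
from rdflib.namespace import SKOS

def index_documents(g: Graph):
    # One index document per entity, covering the literals chosen for
    # search (here prefLabel and altLabel; per-language fields could
    # be added when supporting language tagging).
    for subject in set(g.subjects(SKOS.prefLabel, None)):
        yield {
            "uri": str(subject),
            "label":     [str(o) for o in g.objects(subject, SKOS.prefLabel)],
            "alt_label": [str(o) for o in g.objects(subject, SKOS.altLabel)],
        }
```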
Our cache system's Search API performs the following steps to fulfill a search query request. This or a similar approach is recommended for authority providers.
- search the index for the query string, which returns a set of subject URIs and a search rank for each
- construct a performant SPARQL query to make a precise request by URI from the triple store for each match
  - This SPARQL query pulls enough content from the graph around each subject URI to provide context for the match. (More on context below; see the Data in Results section.)
  - If supporting language tagging, the value of the lang parameter can be used to limit the literals returned as part of the search results.
- inject a rank predicate for each search result's subject URI to provide a means for consistently sorting the results of a search (see the sketch after this list). We use the http://vivoweb.org/ontology/core#rank predicate; you can use a different predicate if you prefer.
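A minimal sketch of the rank-injection step, using rdflib and the VIVO rank predicate mentioned above:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import XSD

VIVO_RANK = URIRef("http://vivoweb.org/ontology/core#rank")

def inject_ranks(results: Graph, ranked_uris) -> Graph:
    # ranked_uris: subject URIs in index-score order, best match first.
    # The explicit rank triple lets clients re-sort results even after
    # the graph is reloaded and stream order is lost.
    for rank, uri in enumerate(ranked_uris, start=1):
        results.add((URIRef(uri), VIVO_RANK,
                     Literal(rank, datatype=XSD.integer)))
    return results
```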
The original search workflow, which combined an index search with SPARQL queries to gather extended context, continued to perform poorly. This led to an approach that removes all SPARQL queries from the search process.
To allow for this, all SPARQL queries that gather extended context are run during generation of the index. The data gathered by these queries is assembled into a payload graph that is stored in a result field on the index document. This simplifies the search workflow:
- search the index for the query string
- for each result, retrieve the graph stored in the index document's result field (see the sketch after this list)
- inject a rank predicate for each search result's subject URI, exactly as in the original workflow
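A minimal sketch of storing and retrieving the payload graph with rdflib; the result field name follows the description above, and N-Triples is just one workable serialization:

```python
from rdflib import Graph

def store_payload(doc: dict, context_graph: Graph) -> dict:
    # Index time: serialize the pre-computed context graph into the
    # index document's result field.
    doc["result"] = context_graph.serialize(format="nt")
    return doc

def load_payload(doc: dict) -> Graph:
    # Search time: parse the stored serialization back into a graph;
    # no SPARQL query is needed at this point.
    return Graph().parse(data=doc["result"], format="nt")
```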
In both the original and updated search workflows, the architectural split between indexing/search and response graph formulation supports flexibility in the semantics of search (particularly the ordering of results) while also maintaining flexibility in shaping context for the user through how the graph is populated. New requirements for context can thus be accommodated without fully reindexing the authority source. The ranking triple ensures that result ordering is retained across the various system interfaces, particularly when result graphs are reloaded into local caches, which erases the positional ordering of the incoming result stream.
The results for each matching entity will include a subset of the full graph associated with the entity. Below I specify common types of data that are included in the subset graph. They are specified by a role instead of a specific predicate or ldpath because each authority may be using a different ontology.
role | description |
---|---|
primary label | the primary label for the entity (e.g. skos:prefLabel, madsrdf:authoritativeLabel) |
role | description |
---|---|
rank | rank in the search results that allows for sorting |
NOTE: This is marked as HIGHLY RECOMMENDED only because, at this writing, I have yet to work with an authority that provides a rank predicate in its search results. This is one of the major drivers for caching external authorities. If it were completely up to me, I would mark this as REQUIRED.
role | description |
---|---|
id | unique identifier for the entity within the local authority system (e.g. n1234, c_1234). This is often the final part of the URI, but is not required to be. |
alt label | an alternate label for the entity (e.g. skos:altLabel, madsrdf:variantLabel) |
same as | URI of another entity that is considered the same entity as the result (e.g. skos:exactMatch, owl:sameAs). URI is typically for an entity outside the authority. |
broader | URI of another entity that is a broader term for the result (e.g. skos:broader, geonames:parentFeature). URI is typically for an entity within the authority. |
narrower | URI of another entity that is a narrower term for the result (e.g. skos:narrower, mesh:mapped_from). URI is typically for an entity within the authority. |
Our metadata specialists have identified additional parts of the graph that provide context to aid users in their selection process. These are authority data specific.
For example, in our local cache of the Library of Congress Name Authority for persons, the result graph includes:
role | ldpath |
---|---|
birth date | madsrdf:identifiesRWO/madsrdf:birthDate/rdfs:label |
death date | madsrdf:identifiesRWO/madsrdf:deathDate/rdfs:label |
field of activity | madsrdf:identifiesRWO/madsrdf:fieldOfActivity/rdfs:label |
occupation | madsrdf:identifiesRWO/madsrdf:occupation/madsrdf:authoritativeLabel |
NOTE: This example also shows how the data in the results can come from deeper in the graph. The notation used to specify the path to the data is Marmotta's LDPath.
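For readers more familiar with SPARQL, LDPath expressions like these correspond to SPARQL 1.1 property paths. A sketch for the birth date path (the query and variable names are mine):

```python
# The LDPath madsrdf:identifiesRWO/madsrdf:birthDate/rdfs:label
# maps to an equivalent SPARQL 1.1 property path.
BIRTH_DATE_QUERY = """
PREFIX madsrdf: <http://www.loc.gov/mads/rdf/v1#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entity ?birthDate WHERE {
  ?entity madsrdf:identifiesRWO/madsrdf:birthDate/rdfs:label ?birthDate .
}
"""
```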
The recommendations in this document support the following combined workflow…
- search for a string using the Search Request API, which returns a subset of the data for each entity
- user selects a single entity
- extract the subject URI from the selected entity
- fetch the full data for the selected entity by resolving the subject URI or passing the URI to the Term Fetch Request API (see the sketch after this list)
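Putting it together, a client-side sketch of this workflow using Python's requests; the base URL, endpoint, and JSON-LD response shape are hypothetical:

```python
import requests

BASE = "http://example.org/authority"  # hypothetical authority base URL

def pick_one(results):
    # Placeholder for the user-selection step; here we just take the
    # first subject URI in the JSON-LD @graph (response shape is
    # illustrative, not prescribed by this document).
    return results["@graph"][0]["@id"]

# 1. search for a string; the response is an RDF serialization of matches
results = requests.get(f"{BASE}/search",
                       params={"q": "Ithaca", "maxRecords": 5},
                       headers={"Accept": "application/ld+json"}).json()

# 2./3. the user selects a single entity; extract its subject URI
selected_uri = pick_one(results)

# 4. fetch the full data by resolving the subject URI
full = requests.get(selected_uri,
                    headers={"Accept": "application/ld+json"}).json()
```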
Original article published May 21, 2019 on the Cornell University blog (https://bit.ly/2wddOxO).