Mapping Template Language (MTL)
This page provides a description of how the Mapping Template Language (MTL) extends the Velocity Template Language (VTL) to support mappings between different data representations.
A set of example mapping templates is available in the examples folder.
This documentation assumes a basic knowledge of the Apache Velocity Template Engine and VTL. More information on these aspects can be found in the Velocity User Guide. MTL extends VTL by introducing three variables that are bound at runtime to the template `Context` and are therefore accessible while specifying a mapping template:
- `$reader` to access the input data
- `$functions` to execute statically configured Java functions
- `$map` to access other information not available in the input data

The initial step for defining a mapping from a file A in a specific format to a file B in a different format involves reading the contents of A.
In the Mapping Template Language (MTL), input data is accessed via the `Reader` interface. Using a reference formulation for a specific format, a `Reader` allows extracting one or more data frames from the input data. A data frame is a flat, non-hierarchical, tabular data structure. It is encoded as a `List` of `Map`s, where each map corresponds to a row in the data frame and its values can be accessed using the column names (i.e., the keys).
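As a rough illustration of this structure, the following Python sketch models a data frame as a list of dictionaries. It is not part of the library (which is written in Java); the column names and values are illustrative only.

```python
# A data frame modelled as a list of row dictionaries (illustrative data).
data_frame = [
    {"stopId": "645", "stopName": "International Airport", "busId": "25"},
    {"stopId": "651", "stopName": "Conference center", "busId": "25"},
]

# Rows are accessed by position, cells by column name (the map key).
first_row = data_frame[0]
stop_names = [row["stopName"] for row in data_frame]
columns = list(first_row.keys())
```

Because every row is flat, any nesting in the input has to be resolved by the query that produces the data frame.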
The currently available implementation supports RDF, CSV, XML, JSON, and SQL inputs via dedicated `Reader`s:
- For RDF input files or a remote triplestore, the `$reader` variable is bound to an `RDFReader` accepting SPARQL queries to extract a data frame.
- For CSV input files, the `$reader` variable is bound to a `CSVReader` automatically generating a data frame from the CSV.
- For XML input files, the `$reader` variable is bound to an `XMLReader` accepting XQuery queries to extract a data frame.
- For JSON input files, the `$reader` variable is bound to a `JsonReader` accepting multiple JsonPath queries to extract a data frame.
- For SQL databases, the `$reader` variable is bound to a `SQLReader` accepting SQL queries to extract a data frame.

The `$reader` variable can be used in the template to access the `Reader`. Additional `Reader` implementations can be added to the library, including ones that support alternative reference formulations for the same format.
The `$reader` variable can be automatically bound to the input data (e.g., if the `mapping-template` is run via CLI), or a `Reader` can be instantiated at runtime from within the template. For the latter case, the `$functions` variable exposes the following methods:
- `getRDFReaderFromFile(String filename)` and `getRDFReaderFromString(String s)`: dynamically return an RDFReader from an RDF file or string
- `getRDFReaderForRepository(String address, String repositoryId, String context)`: dynamically returns an RDFReader for a remote triplestore
- `getXMLReaderFromFile(String filename)` and `getXMLReaderFromString(String s)`: dynamically return an XMLReader from an XML file or string
- `getJSONReaderFromFile(String filename)` and `getJSONReaderFromString(String s)`: dynamically return a JSONReader from a JSON file or string
- `getCSVReaderFromFile(String filename)` and `getCSVReaderFromString(String s)`: dynamically return a CSVReader from a CSV file or string
- `getSQLReaderFromDatabase(String driver, String url, String databaseName, String username, String password)`: dynamically returns an SQLReader for a remote SQL database (MySQL and Postgres are currently supported)

This approach can be used to combine data frames extracted from different data sources within the same mapping template.
Let the input A be the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<transport>
<bus id="25">
<route>
<stop id="645">International Airport</stop>
<stop id="651">Conference center</stop>
</route>
</bus>
</transport>
Reading data from A into a data frame can then be written as:
#set( $query = '
for $stop in /transport/bus/route//stop
return map {
"stopId": $stop/@id,
"stopName": $stop/text(),
"busId": $stop/ancestor::bus/@id
}')
#set( $data = $reader.getDataframe($query))
Here, `#set` is a VTL directive that stores a value in a variable. Variables are denoted with the prefix `$`. In this case, an XQuery query is stored in the `$query` variable.
This query is then used to obtain a data frame. The content of the data frame stored in the `$data` variable will be:
stopId | stopName | busId |
---|---|---|
"645" | "International Airport" | "25" |
"651" | "Conference center" | "25" |
The data extracted from input A is represented by a data frame whose keys are those specified via XQuery and the values are the results obtained by applying the query to A.
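To make the extraction step concrete, the sketch below reproduces the same rows in Python using the standard library's `xml.etree.ElementTree` instead of XQuery. It only models the result of the query; the function name `extract_stops` is hypothetical and the XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

XML_INPUT = """<transport>
  <bus id="25">
    <route>
      <stop id="645">International Airport</stop>
      <stop id="651">Conference center</stop>
    </route>
  </bus>
</transport>"""


def extract_stops(xml_text):
    """Build one flat row per stop, roughly mirroring /transport/bus/route//stop."""
    root = ET.fromstring(xml_text)  # root is the <transport> element
    rows = []
    for bus in root.findall("bus"):
        for route in bus.findall("route"):
            # iter() visits every <stop> descendant of the route
            for stop in route.iter("stop"):
                rows.append({
                    "stopId": stop.get("id"),
                    "stopName": stop.text,
                    "busId": bus.get("id"),  # taken from the enclosing bus
                })
    return rows


data = extract_stops(XML_INPUT)
```

Each dictionary in `data` corresponds to one row of the table above, with the bus identifier repeated on every row to flatten the hierarchy.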
The manipulation of a data frame can be defined using different functions and the VTL directives.
To provide commonly required functionalities, a subset of the Apache Velocity Tools can be used inside template files. These are:
- `$math`, MathTool providing math functions.
- `$date`, ComparisonDateTool used to format, parse and compare dates.
- `$number`, NumberTool used to format numbers.

A default set of utility functions for data transformation and data frame combination is made available through the `$functions` variable:
- `rp(String s)`: if a prefix is set, removes it from the parameter string. If a prefix is not set, or the prefix is not contained in the given string, it returns the string as it is.
- `setPrefix(String prefix)`: sets a prefix for the `rp` method.
- `sp(String s, String substring)`: returns the substring of the parameter string after the first occurrence of the parameter substring.
- `p(String s, String substring)`: returns the substring of the parameter string before the first occurrence of the parameter substring.
- `replace(String s, String regex, String replacement)`: returns a string replacing all the occurrences of the regex with the replacement provided.
- `newline()`: returns a newline string.
- `hash(String s)`: returns a string representing the hash of the parameter.
- `checkString(String s)`: returns `true` if the string is not null and not an empty string.
- `checkList(List<T> l)`: returns `true` if the list is not null and not empty.
- `checkList(List<T> l, T o)`: returns `true` if the list is not null, not empty and contains `o`.
- `checkMap(Map<K,V> m)`: returns `true` if the map is not null and not empty.
- `checkMap(Map<K, V> m, K key)`: returns `true` if the map is not null, not empty and contains the key `key`.
- `mergeResults(List<Map<String,String>> results, List<Map<String,String>> otherResults)`: merges two data frames.

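The string helpers in the list above can be modelled with the following Python sketch. It is an approximation of the documented behaviour, not the library's Java implementation. In particular, returning the input unchanged when the substring is not found is an assumption, since the descriptions above do not cover that case, and Java and Python regular expression dialects differ in places.

```python
import re


def sp(s, substring):
    """Text after the first occurrence of substring (models sp)."""
    _before, found, after = s.partition(substring)
    return after if found else s  # assumption: unchanged when not found


def p(s, substring):
    """Text before the first occurrence of substring (models p)."""
    before, found, _after = s.partition(substring)
    return before if found else s  # assumption: unchanged when not found


def replace(s, regex, replacement):
    """Replace every match of regex with replacement (models replace)."""
    return re.sub(regex, replacement, s)


def check_string(s):
    """True if s is neither None nor empty (models checkString)."""
    return s is not None and s != ""


uri = "http://trans.example.com/stop/645"
stop_id = sp(uri, "stop/")
base = p(uri, "stop/")
label = replace("Conference  center", r"\s+", "_")
```

In a template, the corresponding calls would be made on the `$functions` variable, for example to derive an identifier from a URI before writing it to the output.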
Custom subclasses of the `TemplateFunctions` class may be defined and provided (e.g., using the `-fun` option via CLI) to modify the set of functions available when processing the template via the `$functions` interface. The provided class is compiled at runtime and made available through the `$functions` variable in the template.
The `$map` variable contains key-value pairs that can be specified independently from the declarative mapping template and are evaluated at runtime. This is useful if the same template should be run on different input data and the generated output should contain constant information that depends on the considered input but is not available in the input data.
To represent the data according to an expected data format and data model, a set of declarative mapping rules should be defined to specify how the data in the data frame should be combined to obtain the desired output. The flexibility of VTL can be leveraged to generate any textual data representation.
As an example, the previously shown snippet of a mapping can be expanded to generate a set of RDF triples from the data in the extracted data frame.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix transit: <http://vocab.org/transit/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix ex: <http://trans.example.com/>.
#set( $query = '
for $stop in /transport/bus/route//stop
return map {
"stopId": $stop/@id,
"stopName": $stop/text(),
"busId": $stop/ancestor::bus/@id
}')
#set( $data = $reader.getDataframe($query))
#foreach($stop in $data)
ex:$stop.busId rdf:type transit:stop ;
transit:stop "$stop.stopId"^^xsd:int ;
rdfs:label "$stop.stopName" .
#end
At the beginning of the mapping, the RDF prefixes and corresponding URIs are declared. When this mapping is executed, everything that is not a VTL directive or variable reference is copied unchanged to the generated output.
At the end of the mapping, each row in the data frame is used to populate the structure of the desired RDF representation in the Turtle format.
The VTL `#foreach` directive is used to loop over all the rows in the data frame. Values are retrieved using the `map.key` property access syntax.
The specified mapping results in the following RDF in Turtle format.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix transit: <http://vocab.org/transit/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix ex: <http://trans.example.com/>.
ex:25 rdf:type transit:stop ;
transit:stop "645"^^xsd:int ;
rdfs:label "International Airport" .
ex:25 rdf:type transit:stop ;
transit:stop "651"^^xsd:int ;
rdfs:label "Conference center" .
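The effect of the `#foreach` loop can be pictured as a per-row string substitution. The Python sketch below mimics it with `str.format`; it is only a model of the template semantics, and the names `ROW_TEMPLATE` and `render` are hypothetical.

```python
# One Turtle fragment per row; {name} placeholders play the role of $stop.name.
ROW_TEMPLATE = (
    'ex:{busId} rdf:type transit:stop ;\n'
    '  transit:stop "{stopId}"^^xsd:int ;\n'
    '  rdfs:label "{stopName}" .\n'
)


def render(rows):
    """Emit the fragment once for every row of the data frame."""
    return "".join(ROW_TEMPLATE.format(**row) for row in rows)


rows = [
    {"stopId": "645", "stopName": "International Airport", "busId": "25"},
    {"stopId": "651", "stopName": "Conference center", "busId": "25"},
]
turtle_body = render(rows)
```

As in the template, the constant text is emitted once per row and only the placeholders change.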
- It is better to avoid nested loops in the template by using support data structures to access large data frames efficiently. A set of functions is made available through the `$functions` variable to optimise access to data frames:
  - `getMap(List<Map<String, String>> results, String key)`: creates a support data structure to access data frames faster. Builds a map associating each value of a specified column (`key` parameter) with a single row. The assumption is that the value for the given column is unique for each row; otherwise, the result will be incomplete.
  - `getListMap(List<Map<String, String>> results, String key)`: creates a support data structure to access data frames faster. Builds a map associating each value of a specified column (`key` parameter) with all rows having that value.
  - `getMapValue(Map<K, V> map, K key)`: if `checkMap(map, key)` is `true`, returns the value for `key` in `map`, otherwise returns `null`.
  - `getListMapValue(Map<K, List<V>> listMap, K key)`: if `checkMap(listMap, key)` is `true`, returns the value for `key` in `listMap`, otherwise returns an empty list.
- Data access heavily affects the performance of the mappings. It is better to extract the data from the input source into as few data frames as possible, rather than defining several small data frames, one for each mapping rule.
- Very large templates may degrade performance. If feasible for the specific scenario, splitting templates into multiple files and then combining the results may improve performance.

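The idea behind `getMap` and `getListMap` can be sketched in Python as follows. This models the described behaviour and is not the library's code. The `buses` data frame and its `line` column are invented for the example, and keeping the last row for a duplicated key in `get_map` is an assumption (the description above only states that the result is incomplete in that case).

```python
from collections import defaultdict


def get_map(rows, key):
    """Index rows by a column assumed unique (models getMap).

    Assumption: with duplicate values, a later row overwrites an earlier one.
    """
    return {row[key]: row for row in rows}


def get_list_map(rows, key):
    """Group rows by the value of a column (models getListMap)."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return dict(index)


def get_list_map_value(list_map, key):
    """Rows for key, or an empty list if absent (models getListMapValue)."""
    return (list_map or {}).get(key, [])


stops = [
    {"stopId": "645", "stopName": "International Airport", "busId": "25"},
    {"stopId": "651", "stopName": "Conference center", "busId": "25"},
]
buses = [{"busId": "25", "line": "Airport Express"}]  # hypothetical second data frame

# Build the index once, then join with a single pass over buses
# instead of scanning every stop for every bus.
stops_by_bus = get_list_map(stops, "busId")
joined = [
    (bus["line"], stop["stopName"])
    for bus in buses
    for stop in get_list_map_value(stops_by_bus, bus["busId"])
]
stop_by_id = get_map(stops, "stopId")
```

Building the index costs one pass over the data frame, after which each lookup takes constant time on average, which is why it is preferable to a nested loop for large inputs.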
When the mapping-template is passed a map containing multiple `Reader`s, where each entry is a key-value pair of the form `"readerName": Reader`, each `Reader` is accessible in the template file as `$readers.readerName` (or `$readers['readerName']`).

If the map contains only a single `Reader`, that `Reader` can be accessed in two ways:
- `$readers.readerName` (or `$readers['readerName']`)
- `$reader`, to facilitate reusing existing mappings

This usage scenario occurs when the mapping-template is used as a library by another application, for example Chimera. There is no way to provide a map of `Reader`s as input to the mapping-template when using it as a command line application.