Steps:
- Create RDF ttl files using Table2qb
- (Optional) Load the RDF files into a triple store
- Run CubiQL against the generated RDF data.
table2qb is a command-line tool for generating RDF data cubes from tidy CSV data. See the table2qb
repository for installation instructions and an example of using it to generate data cubes.
Requirements:
- Java 8
- Unix, Linux, Mac and Windows environments
- On Windows, run table2qb with `java -jar table2qb.jar exec ...`
Create the codelists using the `codelist-pipeline`
Example:
table2qb exec codelist-pipeline --codelist-csv csv/gender.csv --codelist-name "Gender" --codelist-slug "gender" --base-uri http://statistics.gov.scot/ --output-file ttl/gender.ttl
- Codelists should exist for every dimension. If a codelist already exists, there is no need to create a new one.
- A separate CSV file is required for each codelist.
- Each codelist CSV should contain all the possible values for that codelist.
- An example CSV is gender.csv (a sketch follows this list).
- In the CSV, use the `Parent Notation` column to define hierarchies.
- The URIs created by the `codelist-pipeline`:
  - Concept scheme (codelist): `{base-uri}/def/concept-scheme/{codelist-slug}`
  - Concepts: `{base-uri}/def/concept/{codelist-slug}/{Notation}` (where `{Notation}` comes from the CSV `Notation` column)
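A minimal sketch of what gender.csv might look like, assuming the codelist CSV uses the table2qb columns `Label`, `Notation` and `Parent Notation`; the rows and the hierarchy are illustrative only:

```
Label,Notation,Parent Notation
All,all,
Male,male,all
Female,female,all
```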
Create the cube components using the `components-pipeline`
Example:
table2qb exec components-pipeline --input-csv csv/components.csv --base-uri http://statistics.gov.scot/ --output-file ttl/components.ttl
- The cube components are the `qb:DimensionProperty`, `qb:MeasureProperty` and `qb:AttributeProperty` resources.
- In the CSV, use one row for each cube component. If some of the components have already been created (e.g. for another cube), do not include them in the CSV.
- In the CSV column `Component Type`, the valid types are: `Dimension`, `Measure` and `Attribute`.
- In the CSV column `Codelist`, use the URIs created by the `codelist-pipeline`.
- An example CSV is components.csv (a sketch follows this list).
- The URIs created by the `components-pipeline` (with `{Label}` derived from the CSV `Label` column):
  - Dimension: `{base-uri}/def/dimension/{Label}`
  - Measure: `{base-uri}/def/measure/{Label}`
  - Attribute: `{base-uri}/def/attribute/{Label}`
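A minimal sketch of what components.csv might look like, assuming the columns `Label`, `Description`, `Component Type` and `Codelist`; the Gender and Count components and the codelist URI are illustrative:

```
Label,Description,Component Type,Codelist
Gender,The gender of a person,Dimension,http://statistics.gov.scot/def/concept-scheme/gender
Count,The number of people,Measure,
```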
Create the cube DSD and observations using the `cube-pipeline`
Example:
table2qb exec cube-pipeline --input-csv csv/input.csv --dataset-name "Employment" --dataset-slug "employment" --column-config csv/columns.csv --base-uri http://statistics.gov.scot/ --output-file ttl/cube.ttl
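A minimal sketch of what the observations file input.csv might contain; the column titles (Gender, Measure Type, Value) are assumptions and must match the `title` entries in the column configuration described below:

```
Gender,Measure Type,Value
All,Count,100
Male,Count,48
Female,Count,52
```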
- Use a CSV file to define the mappings between each CSV column and the relevant cube component.
- The CSV should contain one row per component, i.e. each `qb:DimensionProperty`, `qb:MeasureProperty` and `qb:AttributeProperty`, including a row for the `qb:measureType`. It should also contain a row for the observation `value`.
- In the CSV, the column `component_attachment` can be: `qb:dimension`, `qb:measure` or `qb:attribute`.
- In the CSV, the column `property_template` should match the URIs created for the components by the `components-pipeline`. Use the CSV column `name` in the template if required (e.g. `http://example.gr/def/measure/{measure_type}`).
- In the CSV, the column `value_template` should match the URIs created for the concepts by the `codelist-pipeline`. Use the CSV column `name` in the template if required (e.g. `http://example.gr/def/concept/stationid/{station_id}`).
- If there are numbers among the dimension values (as opposed to the observation value), use datatype `string`. Otherwise, if datatype `number` is used, the URIs will have a form like `http://example.gr/def/concept/year/2000.0`.
- In the CSV row that holds the mapping for the measure (i.e. `component_attachment` `qb:measure`), leave the `value_template` empty.
- In the CSV row for the value (i.e. with `name` `value`), leave the `component_attachment` and `value_template` empty.
- In each CSV row, use a `value_transformation` if required. Possible values are `slugize`, `unitize` or blank. `slugize` converts column values into URI components, e.g. `(slugize "Some column value")` is `some-column-value`. `unitize` translates the literal `£` into `GBP`, e.g. `(unitize "£ 100")` is `gbp-100`. Be careful: `slugize` converts everything to lower case, so the URIs of the dimension values must still match the concept URIs created through the `codelist-pipeline`. An example column configuration is sketched after this list.
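As referenced above, here is a minimal sketch of a column configuration consistent with the hypothetical input.csv shown earlier. The header row uses the table2qb column-configuration fields; all URIs and names are illustrative:

```
title,name,component_attachment,property_template,value_template,datatype,value_transformation
Gender,gender,qb:dimension,http://statistics.gov.scot/def/dimension/gender,http://statistics.gov.scot/def/concept/gender/{gender},string,slugize
Measure Type,measure_type,qb:dimension,http://purl.org/linked-data/cube#measureType,http://statistics.gov.scot/def/measure/{measure_type},string,slugize
Count,count,qb:measure,http://statistics.gov.scot/def/measure/count,,,
Value,value,,,,number,
```

Note how the measure row (Count) leaves `value_template` empty, and the value row leaves both `component_attachment` and `value_template` empty, as described above.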
Advice:
- Use the same base URI in all pipelines. Although it is not mandatory, it will ease the transformation process.
- Be careful to use URIs that match between the pipelines. E.g. the `property_template` URIs in the `cube-pipeline` should match the URIs created for the components by the `components-pipeline`.
A complete example can be found on GitHub.
CubiQL can be run directly against a local directory containing the generated RDF data; however, this loads the data into memory, so it is unsuitable for large amounts of data. Loading the data into a dedicated triple store first is therefore recommended.
Supported triple stores:
- Virtuoso
- Stardog
- Fuseki (unconfirmed)
What to load at the triple store:
- All the RDF Turtle files created by Table2qb:
  - RDF for each of the codelists
  - RDF for the cube components
  - RDF for the DSD and the observations
- RDF for the QB vocabulary
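As a sketch, the Turtle files can be loaded via the SPARQL 1.1 Graph Store Protocol; the endpoint URL below assumes a local Fuseki dataset named `ds` and will differ for Virtuoso or Stardog:

```
# POST each generated Turtle file into the default graph (Fuseki-style endpoint assumed)
for f in ttl/*.ttl; do
  curl -X POST -H "Content-Type: text/turtle" \
       --data-binary @"$f" "http://localhost:3030/ds/data?default"
done
```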
Configuration of the triple store:
- Set the maximum limit on returned results, e.g. 100000 (depending on the size and number of cubes).
- In Virtuoso the limit is defined by the `ResultSetMaxRows` parameter.
Run CubiQL using the default configuration:
java -jar cubiql-standalone.jar --port 8085 --endpoint http://myendpoint.com
If running against a local directory containing the data, the `--endpoint` parameter should specify the path to the directory:
java -jar cubiql-standalone.jar --port 8085 --endpoint ./ttl
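To check that the service is up, a plain GraphQL introspection query can be sent; this sketch assumes CubiQL serves its GraphQL endpoint at /graphql on the configured port:

```
curl -X POST http://localhost:8085/graphql \
     -H "Content-Type: application/json" \
     -d '{"query": "{ __schema { queryType { name } } }"}'
```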
The default configuration:
{:geo-dimension-uri nil
:time-dimension-uri nil
:codelist-source "component"
:codelist-predicate "http://publishmydata.com/def/qb/codesUsed"
:codelist-label-uri "http://www.w3.org/2000/01/rdf-schema#label"
:dataset-label-uri "http://www.w3.org/2000/01/rdf-schema#label"
:schema-label-language nil
:max-observations-page-size 2000}
If the default configuration does not match your data, use a custom configuration file:
java -jar cubiql-standalone.jar --port 8085 --endpoint http://myendpoint.com/ --configuration myconfig.edn
Configuration parameters:
- `:geo-dimension-uri` defines the geo dimension. The values of the geo dimension should have a label.
- `:time-dimension-uri` defines the time dimension. The values of the time dimension should be defined by reference.data.gov.uk, e.g. http://reference.data.gov.uk/id/year/2016
- `:codelist-source` defines the source of the codelist that contains only the values used in the cube. The source can be (i) "component" or (ii) "dimension". By default Table2qb uses "component".
- `:codelist-predicate` defines the predicate that connects the `:codelist-source` with the codelist that contains only the values used in the cube. By default Table2qb uses http://publishmydata.com/def/qb/codesUsed
- `:codelist-label-uri` defines the label property used in the codelists. By default Table2qb uses http://www.w3.org/2000/01/rdf-schema#label
- `:dataset-label-uri` defines the label property used for the dataset, i.e. the cube, DSD and components. By default Table2qb uses http://www.w3.org/2000/01/rdf-schema#label
- `:schema-label-language` — datasets, dimensions, measures and codelist members should all have a label with a language tag matching this value. `nil` can be specified to use strings without an associated language tag.
- `:max-observations-page-size` defines the maximum number of observations returned per page, e.g. set it high enough if you need to get all the observations with a single query.
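For example, a custom myconfig.edn overriding the time dimension and the label language might look like the following; the dimension URI is purely illustrative:

```
{:geo-dimension-uri nil
 :time-dimension-uri "http://statistics.gov.scot/def/dimension/refPeriod"
 :codelist-source "component"
 :codelist-predicate "http://publishmydata.com/def/qb/codesUsed"
 :codelist-label-uri "http://www.w3.org/2000/01/rdf-schema#label"
 :dataset-label-uri "http://www.w3.org/2000/01/rdf-schema#label"
 :schema-label-language "en"
 :max-observations-page-size 2000}
```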