HPC pipeline to aggregate knowledge graphs from EMBL-EBI resources, the MONARCH Initiative KG, ROBOKOP, Ubergraph, and other sources into large (up to multi-terabyte) transient Neo4j+Solr databases, perform queries, and materialise result tables for dissemination.
The GrEBI pipeline is being applied to a number of projects including the International Mouse Phenotyping Consortium (IMPC) knowledge graph and the EMBL Human Ecosystems Transversal Theme (HETT) ExposomeKG.
GrEBI has two main outputs: (1) materialised queries and (2) database exports.
We run a collection of queries periodically at EBI and materialise the results as tables, which can be loaded using standard data processing libraries such as Pandas or Polars. The input queries are stored in YAML files in the `materialised_queries` directory of this repository, and the outputs are uploaded to our FTP server in the `query_results` directories. For example, the latest materialised query results for the `impc_x_gwas` graph can be found at https://ftp.ebi.ac.uk/pub/databases/spot/kg/impc_x_gwas/latest/query_results/.
The results are stored in JSON Lines format, which can be loaded using Pandas:

```python
import pandas as pd

results = pd.read_json('impc_x_gwas.results.jsonl', lines=True)
```
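The same file can be read with Polars; a minimal sketch, assuming the results file has already been downloaded from the FTP directory above:

```python
import polars as pl

# JSON Lines ("newline-delimited JSON") is read with read_ndjson
results = pl.read_ndjson("impc_x_gwas.results.jsonl")
print(results.schema)  # inspect the columns produced by the materialised query
print(results.head())
```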
The database exports can be downloaded from https://ftp.ebi.ac.uk/pub/databases/spot/kg/
| Name | Description | # Nodes | # Edges | Neo4j DB size |
|---|---|---|---|---|
| `ebi_monarch_xspecies` | All datasources with cross-species phenotype matches merged | ~130m | ~850m | ~900 GB |
| `ebi_monarch` | All datasources with cross-species phenotype matches separated | | | |
| `impc_x_gwas` | Limited to data from IMPC, GWAS Catalog, OpenTargets, and related ontologies and mappings | ~30m | ~184m | |
Note that the purpose of this pipeline is not to supply another knowledge graph, but to facilitate querying and analysis across existing ones. Consequently, the above exports should be considered temporary and may be removed and/or replaced with new ones without warning.
- Choose carefully where you would like to run Neo4j. This could be locally, on a server, or on an HPC cluster, depending on which export you would like to query. You will need plenty of disk space (see above) and at least 32 GB RAM. More complex queries will require more RAM.
- Download and extract the Neo4j export. For example, to download the latest `impc_x_gwas` export:

  ```
  curl https://ftp.ebi.ac.uk/pub/databases/spot/kg/impc_x_gwas/latest/impc_x_gwas_neo4j.tgz | tar xzf -
  ```
- Start a Neo4j 5.18.0 server from the extracted folder. You can do this easily using Docker:

  ```
  docker run -p 7474:7474 -p 7687:7687 -v $(pwd)/data:/data -e NEO4J_AUTH=none neo4j:5.18.0
  ```

  Your graph should now be accessible on port 7474, e.g. at http://localhost:7474 if you are running locally. You should now also be able to try out some of the Jupyter notebooks.
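As a quick sanity check you can also query the graph programmatically. The following is a minimal sketch using the official `neo4j` Python driver (an assumption here: the package has been installed, e.g. with `pip install neo4j`, and the server is running locally with authentication disabled as in the Docker command above):

```python
from neo4j import GraphDatabase

# NEO4J_AUTH=none means no credentials are required
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    record = session.run("MATCH (n) RETURN count(n) AS node_count").single()
    print("Nodes in graph:", record["node_count"])
driver.close()
```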
The exact instructions will vary depending on your HPC environment. At EBI we use Slurm and Singularity. If your cluster is similar, you should be able to adapt the following instructions to get started.
- Start a shell on a Slurm worker with appropriate resources:

  ```
  srun --pty --time 1-0:0:0 -c 32 --mem 300g bash
  ```
- Download and extract Neo4j as shown above, ideally to local flash-based storage. If you have a very large HPC node you may be able to extract Neo4j to ramdisk, e.g. `/dev/shm`, for maximum performance.
- Find out the hostname of the worker so we can connect to it later:

  ```
  hostname
  ```
- Start Neo4j with Singularity:

  ```
  mkdir -p neo4j_plugins tmp_neo &&
  singularity run \
      --bind "$(pwd)/data:/data" \
      --bind "neo4j_plugins:/var/lib/neo4j/plugins" \
      --writable-tmpfs \
      --tmpdir tmp_neo \
      --env NEO4J_AUTH=none \
      --env NEO4J_server_memory_heap_initial__size=120G \
      --env NEO4J_server_memory_heap_max__size=120G \
      --env NEO4J_server_memory_pagecache_size=60G \
      --env NEO4J_dbms_memory_transaction_total_max=60G \
      --env NEO4J_apoc_export_file_enabled=true \
      --env NEO4J_apoc_import_file_enabled=true \
      --env NEO4J_apoc_import_file_use__neo4j__config=true \
      --env "NEO4J_dbms_security_procedures_unrestricted=apoc.*" \
      --env TINI_SUBREAPER=true \
      --env 'NEO4J_PLUGINS=["apoc"]' \
      docker://ghcr.io/ebispot/grebi_neo4j_with_extras:5.18.0
  ```
Now you should be able to connect to Neo4j at the host shown earlier by `hostname`.
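For example, from a login node or notebook server on the same network you can check the connection with the Neo4j Python driver (a sketch; `my-slurm-worker` is a placeholder for the hostname printed above):

```python
from neo4j import GraphDatabase

# Replace "my-slurm-worker" with the output of `hostname` on the worker
driver = GraphDatabase.driver("bolt://my-slurm-worker:7687", auth=None)
driver.verify_connectivity()
print("Connected to Neo4j on the Slurm worker")
driver.close()
```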
The following mapping tables are loaded:
- https://data.monarchinitiative.org/mappings/latest/gene_mappings.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/hp_mesh.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/mesh_chebi_biomappings.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/mondo.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/umls_hp.sssom.tsv
- https://data.monarchinitiative.org/mappings/latest/upheno_custom.sssom.tsv
- https://raw.githubusercontent.com/mapping-commons/mh_mapping_initiative/master/mappings/mp_hp_mgi_all.sssom.tsv
- https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/mappings/oba-efo.sssom.tsv
- https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/mappings/oba-vt.sssom.tsv
- https://github.com/biopragmatics/biomappings/raw/refs/heads/master/src/biomappings/resources/mappings.tsv
In all of the currently configured outputs, `skos:exactMatch` mappings are used for clique merging. In `ebi_monarch_xspecies`, `semapv:crossSpeciesExactMatch` is used for clique merging (so e.g. corresponding HP and MP terms will share a graph node). As this is not always desirable, a separate graph `ebi_monarch` is also provided where `semapv:crossSpeciesExactMatch` mappings are represented as edges.
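To see which rows of a mapping table would participate in clique merging, the SSSOM files listed above can be inspected directly. A small sketch with pandas, assuming the standard SSSOM column names (`subject_id`, `predicate_id`, `object_id`):

```python
import pandas as pd

url = "https://data.monarchinitiative.org/mappings/latest/mondo.sssom.tsv"
# SSSOM files begin with a '#'-prefixed YAML metadata block, which pandas can skip
mappings = pd.read_csv(url, sep="\t", comment="#")
exact = mappings[mappings["predicate_id"] == "skos:exactMatch"]
print(exact[["subject_id", "object_id"]].head())
```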
The pipeline is implemented as Rust programs with simple CLIs, orchestrated with Nextflow. Input KGs are represented in a variety of formats including KGX, RDF, and JSONL files. After loading, a simple "bruteforce" integration strategy is applied:
- All strings that begin with any IRI or CURIE prefix from the Bioregistry are canonicalised to the standard CURIE form (see the sketch after this list)
- All property values that are the identifier of another node in the graph become edges
- Cliques of equivalent nodes are merged into single nodes
- Cliques of equivalent properties are merged into single properties (and for ontology-defined properties, the qualified safe labels are used)
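The canonicalisation step can be illustrated outside the pipeline (which itself is implemented in Rust) with the `bioregistry` Python package; a rough sketch:

```python
import bioregistry

# IRIs and non-standard CURIEs are normalised to the Bioregistry's standard CURIE form
print(bioregistry.curie_from_iri("http://purl.obolibrary.org/obo/HP_0001250"))  # e.g. 'hp:0001250'
print(bioregistry.normalize_curie("HP:0001250"))                                # e.g. 'hp:0001250'
```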
The primary output of the pipeline is a property graph for Neo4j. The nodes and edges are also loaded into Solr for full-text search and into SQLite for ID-to-compressed-object resolution.