Blazegraph
Nexus uses Blazegraph as an RDF (triple) store to provide advanced querying capabilities on the hosted data. This store is treated as a specialized index on the data, so, as with Elasticsearch, the system can be fully restored from the primary store in case of failures. While the technology is advertised to support High Availability and Scaleout deployment configurations, we have yet to be able to set up a deployment in this fashion.
Blazegraph can be deployed by:
- using the docker image created by BBP: https://hub.docker.com/r/bluebrain/blazegraph-nexus
- deploying Blazegraph using the prepackaged tar.gz distribution available for download from GitHub (see the sketch below)
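For instance, a minimal single-node setup could look like the following sketch. The port is Blazegraph's default; the container data path, volume location, heap size and image tag are assumptions to adapt to your environment.

```bash
# Option 1: run the BBP image (tag omitted; pick one from Docker Hub).
# The container data path is an assumption; check the image documentation.
docker run -d --name blazegraph \
  -p 9999:9999 \
  -v /data/blazegraph:/var/lib/blazegraph \
  bluebrain/blazegraph-nexus

# Option 2: from the tar.gz distribution, start the bundled jar directly.
java -server -Xmx4g -jar blazegraph.jar
```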
We’re looking at alternative technologies, as well as possible application-level (within Nexus) sharding and replication.
Blazegraph is no longer actively maintained since Amazon hired its development team to work on their Neptune product. This makes it harder to operate: resources about monitoring are scarce, and known bugs and issues will remain unresolved.
Running and monitoring
CPU: high CPU usage suggests heavy indexing and query operations. It is likely to happen when reindexing after updating search, but if it happens regularly:
- Review the SPARQL queries made to Blazegraph, as shown below
- Review the indexing strategy
- Allocate more resources to Blazegraph
- The Blazegraph and composite views plugins can point to different Blazegraph instances; this can be considered if both plugins see heavy usage
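To review what is currently running on an instance, Blazegraph's status page can list the active queries. A minimal check, assuming the default NanoSparqlServer location:

```bash
# List the SPARQL queries currently running on this instance
# (host and port are assumptions; adjust to your deployment).
curl 'http://localhost:9999/blazegraph/status?showQueries'
```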
Memory and garbage collection: Blazegraph uses the available RAM in two ways: the JVM heap and the file system cache. So, like Elasticsearch, the frequency and duration of JVM garbage collection are important to monitor.
Using the supervision endpoints for Blazegraph and composite views together with the RAM guidance from Blazegraph gives some insight into how much memory should be assigned.
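In practice this usually means capping the JVM heap and leaving the remaining RAM to the file system cache. A sketch, assuming the standalone jar; the sizes are illustrative, not recommendations:

```bash
# A fixed heap leaves the rest of the host's RAM to the OS page cache,
# which Blazegraph relies on for reads; GC logs help monitor pauses.
java -server -Xms4g -Xmx4g \
  -Xloggc:/var/log/blazegraph/gc.log \
  -jar blazegraph.jar
```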
Storage:
Blazegraph stores data in an append-only journal, which means updates will use additional disk space.
Disk usage will therefore grow over time, at a rate that depends on the frequency of updates.
The journal can be compacted using the CompactJournalUtility to reduce disk usage, but this takes quite a bit of time and requires taking the software offline during the process.
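The utility ships with the Blazegraph jar and writes a compacted copy of the journal to a new file; a sketch, with paths as assumptions:

```bash
# Blazegraph must be stopped first: the journal cannot be in use.
# Writes a compacted copy of the source journal to the target path;
# once it completes, swap the files and restart the server.
java -cp blazegraph.jar com.bigdata.journal.CompactJournalUtility \
  /data/blazegraph.jnl /data/blazegraph-compact.jnl
```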
An alternative is to reindex on a fresh instance of Blazegraph; this approach also allows reconfiguring the underlying namespaces.
Using the supervision endpoints for Blazegraph and composite views together with the Data on disk guidance from Blazegraph gives some insight into how much storage should be assigned.
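As an illustration, polling those endpoints from Delta could look like this; the paths below are assumptions, so check the supervision API documentation for your Delta version:

```bash
# Supervision endpoints (paths are assumptions and typically require
# elevated permissions, hence the bearer token).
curl -H "Authorization: Bearer $TOKEN" "$DELTA_URL/v1/supervision/blazegraph"
curl -H "Authorization: Bearer $TOKEN" "$DELTA_URL/v1/supervision/composite-views"
```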
Query and indexing performance:
Query load and latency must be monitored to make sure that information is delivered to the client in a timely manner.
On the write side, indexing load and latency must also be watched, especially if the data must be available for search as soon as possible.
To help with this, Delta records long-running queries and stores them in the blazegraph_queries table, and also leverages Kamon tracing with spans to measure these operations and report any failures.
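Since Delta's primary store is PostgreSQL, the recorded queries can be inspected directly. Only the table name comes from Delta; the connection parameters below are assumptions:

```bash
# Look at the latest slow queries recorded by Delta
# (host, user and database are assumptions; the table name is not).
psql -h localhost -U nexus -d nexus \
  -c 'SELECT * FROM blazegraph_queries LIMIT 10;'
```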
Tools and resources
The Hardware Configuration section in the documentation gives a couple of hints about the requirements to operate Blazegraph, and there are additional sections on Performance, IO and Query optimizations.
The Nexus repository also provides:
- A jetty configuration which allows tuning Blazegraph to better handle Nexus indexing/querying
- A log4j configuration, as the default one provided by Blazegraph is insufficient
- The docker compose file used for tests, which shows how to configure those files via system properties (see the sketch after this list)
- A python script allowing to scrape Blazegraph metrics and push them to a Prometheus instance
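For reference, wiring those files in via system properties could look like the following; the property names follow Blazegraph's conventions, while the file locations are assumptions:

```bash
# Point Blazegraph to the custom jetty and log4j configurations.
java -server -Xmx4g \
  -Djetty.xml=/etc/blazegraph/jetty.xml \
  -Dlog4j.configuration=file:/etc/blazegraph/log4j.properties \
  -jar blazegraph.jar
```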
Backups
Pushing data to Blazegraph is time-consuming, especially via the API, as triples are verbose and Blazegraph does not allow large payloads.
So even if it is possible to repopulate a Blazegraph instance from the primary store, it is better to perform backups using the online backup API endpoint.
Here is an example of a backup script using this endpoint, compressing the resulting file and creating a checksum of it.
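The sketch below assumes the default NanoSparqlServer location and a backup path writable by the Blazegraph server process (which is the process that writes the snapshot):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Host and paths are assumptions; adjust to your deployment.
BLAZEGRAPH_URL="http://localhost:9999/blazegraph"
BACKUP_FILE="/backups/blazegraph-$(date +%Y%m%d-%H%M%S).jnl"

# Trigger Blazegraph's online backup; block=true makes the call return
# only once the snapshot is complete.
curl -f --data-urlencode "file=${BACKUP_FILE}" \
     --data-urlencode "block=true" \
     --data-urlencode "compress=false" \
     "${BLAZEGRAPH_URL}/backup"

# Compress the snapshot and store a checksum next to it.
gzip "${BACKUP_FILE}"
sha256sum "${BACKUP_FILE}.gz" > "${BACKUP_FILE}.gz.sha256"
```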
The Nexus repository also provides such a script. It is meant to be run in Kubernetes as a cronjob but should be easily adaptable to other contexts by replacing the kubectl commands.
To restore a backup, the journal file related to the instance needs to be replaced by the one generated by the backup operation.
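Concretely, restoring amounts to stopping the server, swapping the journal and starting it again; a sketch where the service name and paths are assumptions:

```bash
# Stop Blazegraph before touching the journal (service name is an assumption).
systemctl stop blazegraph

# Decompress the snapshot and put it in place of the live journal.
gunzip -k /backups/blazegraph-20240101-000000.jnl.gz
mv /backups/blazegraph-20240101-000000.jnl /data/blazegraph.jnl

systemctl start blazegraph
```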