Table of Contents
This chapter describes how to leverage Apache Hadoop and Apache Spark to configure JanusGraph for distributed graph processing. These steps will provide an overview on how to get started with those projects, but please refer to those project communities to become more deeply familiar with them.
JanusGraph-Hadoop works with TinkerPop’s hadoop-gremlin package for general-purpose OLAP.
For the scope of the example below, Apache Spark is the computing framework and Apache Cassandra is the storage backend. The directions can be followed with other packages with minor changes to the configuration properties.
Note | |
---|---|
The examples in this chapter are based on running Spark in local mode or standalone cluster mode. Additional configuration is required when using Spark on YARN or Mesos. |
For running OLAP queries from the Gremlin Console, a few prerequisites need to be fulfilled. You will need to add the Hadoop configuration directory into the CLASSPATH
, and the configuration directory needs to point to a live Hadoop cluster.
Hadoop provides a distributed access-controlled file system. The Hadoop file system is used by Spark workers running on different machines to have a common source for file based operations. The intermediate computations of various OLAP queries may be persisted on the Hadoop file system.
For configuring a single node Hadoop cluster, please refer to official Apache Hadoop Docs
Once you have a Hadoop cluster up and running, we will need to specify the Hadoop configuration files in the CLASSPATH
. The below document expects that you have those configuration files located under /etc/hadoop/conf
.
Once verified, follow the below steps to add the Hadoop configuration to the CLASSPATH
and start the Gremlin Console, which will play the role of the Spark driver program.
export HADOOP_CONF_DIR=/etc/hadoop/conf export CLASSPATH=$HADOOP_CONF_DIR bin/gremlin.sh
Once the path to Hadoop configuration has been added to the CLASSPATH
, we can verify whether the Gremlin Console can access the Hadoop cluster by following these quick steps:
gremlin> hdfs ==>storage[org.apache.hadoop.fs.LocalFileSystem@65bb9029] // BAD gremlin> hdfs ==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1229457199_1, ugi=user (auth:SIMPLE)]]] // GOOD
JanusGraph-Hadoop works with TinkerPop’s hadoop-gremlin package for general-purpose OLAP to traverse over the graph, and parallelize queries by leveraging Apache Spark.
The backend demonstrated here is Cassandra for the OLAP example below. Additional configuration will be needed that is specific to that storage backend. The configuration is specified by the gremlin.hadoop.graphReader
property which specifies the class to read data from the storage backend.
JanusGraph currently supports following graphReader classes:
-
Cassandra3InputFormat
for use with Cassandra 3 -
CassandraInputFormat
for use with Cassandra 2 -
HBaseInputFormat
andHBaseSnapshotInputFormat
for use with HBase
The following properties file can be used to connect a JanusGraph instance in Cassandra such that it can be used with HadoopGraph to run OLAP queries.
# read-cassandra-3.properties # # Hadoop Graph Configuration # gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat gremlin.hadoop.jarsInDistributedCache=true gremlin.hadoop.inputLocation=none gremlin.hadoop.outputLocation=output gremlin.spark.persistContext=true # # JanusGraph Cassandra InputFormat configuration # # These properties defines the connection properties which were used while write data to JanusGraph. janusgraphmr.ioformat.conf.storage.backend=cassandra # This specifies the hostname & port for Cassandra data store. janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1 janusgraphmr.ioformat.conf.storage.port=9160 # This specifies the keyspace where data is stored. janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph # This defines the indexing backned configuration used while writing data to JanusGraph. janusgraphmr.ioformat.conf.index.search.backend=elasticsearch janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1 # Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr). # # Apache Cassandra InputFormat configuration # cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner # # SparkGraphComputer Configuration # spark.master=local[*] spark.executor.memory=1g spark.serializer=org.apache.spark.serializer.KryoSerializer spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
First create a properties file with above configurations, and load the same on the Gremlin Console to run OLAP queries as follows:
bin/gremlin.sh \,,,/ (o o) -----oOOo-(3)-oOOo----- plugin activated: janusgraph.imports gremlin> :plugin use tinkerpop.hadoop ==>tinkerpop.hadoop activated gremlin> :plugin use tinkerpop.spark ==>tinkerpop.spark activated gremlin> // 1. Open a the graph for OLAP processing reading in from Cassandra 3 gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties') ==>hadoopgraph[cassandra3inputformat->gryooutputformat] gremlin> // 2. Configure the traversal to run with Spark gremlin> g = graph.traversal().withComputer(SparkGraphComputer) ==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer] gremlin> // 3. Run some OLAP traversals gremlin> g.V().count() ...... ==>808 gremlin> g.E().count() ...... ==> 8046
The steps followed in the previous section can also be used with a Spark standalone cluster with only minor changes:
-
Update the
spark.master
property to point to the Spark master URL instead of local -
Update the
spark.executor.extraClassPath
to enable the Spark executor to find the JanusGraph dependency jars - Copy the JanusGraph dependency jars into the location specified in the previous step on each Spark executor machine
Note | |
---|---|
We have copied all the jars under janusgraph-distribution/lib into /opt/lib/janusgraph/ and the same directory structure is created across all workers, and jars are manually copied across all workers. |
The final properties file used for OLAP traversal is as follows:
# read-cassandra-3.properties # # Hadoop Graph Configuration # gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat gremlin.hadoop.jarsInDistributedCache=true gremlin.hadoop.inputLocation=none gremlin.hadoop.outputLocation=output gremlin.spark.persistContext=true # # JanusGraph Cassandra InputFormat configuration # # These properties defines the connection properties which were used while write data to JanusGraph. janusgraphmr.ioformat.conf.storage.backend=cassandra # This specifies the hostname & port for Cassandra data store. janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1 janusgraphmr.ioformat.conf.storage.port=9160 # This specifies the keyspace where data is stored. janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph # This defines the indexing backned configuration used while writing data to JanusGraph. janusgraphmr.ioformat.conf.index.search.backend=elasticsearch janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1 # Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr). # # Apache Cassandra InputFormat configuration # cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner # # SparkGraphComputer Configuration # spark.master=spark://127.0.0.1:7077 spark.executor.memory=1g spark.executor.extraClassPath=/opt/lib/janusgraph/* spark.serializer=org.apache.spark.serializer.KryoSerializer spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
Then use the properties file as follows from the Gremlin Console:
bin/gremlin.sh \,,,/ (o o) -----oOOo-(3)-oOOo----- plugin activated: janusgraph.imports gremlin> :plugin use tinkerpop.hadoop ==>tinkerpop.hadoop activated gremlin> :plugin use tinkerpop.spark ==>tinkerpop.spark activated gremlin> // 1. Open a the graph for OLAP processing reading in from Cassandra 3 gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties') ==>hadoopgraph[cassandra3inputformat->gryooutputformat] gremlin> // 2. Configure the traversal to run with Spark gremlin> g = graph.traversal().withComputer(SparkGraphComputer) ==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer] gremlin> // 3. Run some OLAP traversals gremlin> g.V().count() ...... ==>808 gremlin> g.E().count() ...... ==> 8046
Apache TinkerPop provides various vertex programs. A vertex program runs on each vertex until either a termination criteria is attained or a fixed number of iterations has been reached. Due to the parallel nature of vertex programs, they can leverage parallel computing frameworks like Spark or Giraph to improve their performance.
Once you are familiar with how to configure JanusGraph to work with Spark, you can run all the other vertex programs provided by Apache TinkerPop, like Page Rank, Bulk Loading and Peer Pressure. See the TinkerPop VertexProgram docs for more details.