OLAP count using a Spark cluster taking much longer than expected.

Hi all, we have set up a Spark cluster to run OLAP queries on JanusGraph with Bigtable as the storage backend. Details:
Backend: *Bigtable*
Vertices: *~4 billion*
Data in backend: *~3.6 TB*
Spark workers: *2 workers, each with 6 CPUs and 25 GB RAM*
Spark executors: *6 executors per worker, each with 1 CPU and 4 GB RAM*
Now I'm trying to count all the vertices with the label ticket, which we know number on the order of ~100k. The query fired to do that is as follows:
graph = GraphFactory.open("conf/hadoop-graph/read-hbase-cluster.properties")
g = graph.traversal().withComputer(org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer)
// restrict the computation to vertices matching the filter (the label must be a quoted string)
g.withComputer(Computer.compute().vertices(hasLabel('ticket'))).V().count()
The query has been running for the past 36 hours and is still not complete. Given the average read throughput (>50 MB/s), the full ~3.6 TB should have been scanned in roughly 20 hours (3.6 TB / 50 MB/s ≈ 72,000 s), so it should have read all the data by now. Is it possible to use indexes while running the OLAP query so that only the relevant subgraph is loaded into Spark RDDs (currently it is scanning the full graph)?
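For context, a read properties file like the one referenced above typically looks roughly like the sketch below, based on the stock read-hbase.properties shipped with the JanusGraph distribution; the master URL, table name, and values here are placeholders, not our exact settings:

# Sketch of conf/hadoop-graph/read-hbase-cluster.properties (values are placeholders)
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# JanusGraph storage settings, forwarded to the input format
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
# Spark settings matching the cluster described above
spark.master=spark://spark-master:7077
spark.executor.cores=1
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator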
1 Reply
Bo · 11mo ago
> Is it possible to use indexes while running the olap query resulting in faster loading of the subgraph into spark rdds (currently it is scanning the full graph)?
Unfortunately, no. It sounds like an interesting idea, though!
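One workaround worth noting, sketched here with hypothetical names: for a count like this, an OLTP traversal opened directly against JanusGraph (rather than through HadoopGraph/Spark) can use a composite index instead of scanning the full graph. This assumes the vertices carry an indexed property, e.g. a type property covered by a composite index; hasLabel() alone cannot use a composite index, and the config path below is a placeholder:

// Hypothetical OLTP sketch: count via an indexed property instead of a full scan
graph = JanusGraphFactory.open('conf/janusgraph-bigtable.properties')  // placeholder config path
g = graph.traversal()
g.V().has('type', 'ticket').count()  // index lookup on 'type', iterates only the ~100k matches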