Implementing Graph Filter for Sub-Graph Loading in Spark Cluster with JanusGraph

Hello, I'm currently using JanusGraph 0.6.4 with Bigtable as the storage backend and running into difficulties when attempting to run OLAP queries on my graph via SparkGraphComputer. The graph is quite large, containing billions of vertices, and I'm only able to execute queries on significantly smaller graphs. My queries are run through the Gremlin console, and the problem appears to be in loading the graph into the Spark RDD. I'd like to apply a filter so that only vertices and edges with specific labels are loaded before the query runs. I've noticed that applying a graph filter to the computer, as described in the TinkerPop documentation (https://tinkerpop.apache.org/docs/current/reference/#graph-filter), loads only the specified subgraph into the Spark RDDs:
graph.compute().
    vertices(hasLabel("person")).
    vertexProperties(__.properties("name")).
    edges(bothE("knows")).
    program(PageRankVertexProgram...)
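For reference, a complete, runnable version of that snippet would look roughly like the following — filling in the elided program call with the standard TinkerPop PageRankVertexProgram builder is my illustration, not part of the docs excerpt:

// illustrative only: the docs elide the program(...) argument
result = graph.compute().
    vertices(hasLabel("person")).
    vertexProperties(__.properties("name")).
    edges(bothE("knows")).
    program(PageRankVertexProgram.build().create(graph)).
    submit().get()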
Is it possible to apply this filtering directly through the Gremlin console? I've attempted hg.V().limit(1), but without success; I suspect the entire graph is being loaded into the RDD for this query as well. Here's the code I used:
graph = GraphFactory.open("conf/hadoop-graph/read-hbase-cluster.properties")
hg = graph.traversal().withComputer(org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer)
hg.V().limit(1)
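For reference, a read-hbase properties file of this kind typically follows the JanusGraph HadoopGraph layout, roughly like the sketch below — the values are illustrative placeholders rather than my actual config, and the Bigtable-specific connection settings are omitted:

# HadoopGraph over JanusGraph's HBase input format (placeholder values)
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# storage settings passed through to the input format
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=<host>
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
# Spark settings
spark.master=yarn
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator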
Any insights or suggestions would be greatly appreciated. Thank you.
3 Replies
spmallette (6mo ago)
Is it possible to implement this filtering directly through the Gremlin console?
sure. the Gremlin Console is just a Groovy shell so it can run any code you would normally run in Java. one way to do it:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.withComputer(Computer.compute().vertices(hasLabel('person'))).V().label()
==>person
==>person
==>person
==>person
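the same idea should carry over to your HadoopGraph/Spark setup — an untested sketch, reusing your properties file, with SparkGraphComputer passed to Computer.compute() and 'person' standing in for whatever label you want to keep:

// untested sketch: load only 'person' vertices into the Spark RDDs
graph = GraphFactory.open('conf/hadoop-graph/read-hbase-cluster.properties')
g = graph.traversal().withComputer(
        Computer.compute(SparkGraphComputer).
            vertices(hasLabel('person')))
g.V().count()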
shivam.choudhary
Thanks @spmallette, I was able to run the query via the Gremlin console, and I'm marking the question as answered since the filter was applied successfully. However, the job has now been running for the past 36 hours and isn't completing. My Spark cluster has 12 executors and the graph data is being read at above 50 MB/s; given that Spark will scan the full graph of 3.6 TB, I'd expect the job to have finished by now. The OLAP query which I've fired is:
g.withComputer(Computer.compute().vertices(hasLabel('ticket'))).V().count()
Is there anything I might be missing here? Thanks in advance.
spmallette (6mo ago)
that seems really long for just a count(), especially when you've limited it to just "ticket" vertices. have you looked at any of the logs to see what sort of activity is happening there? it's been a long time since i've done much with OLAP/Spark and I don't know the implications for Bigtable JanusGraph. you might want to start a discussion on their Discord server about that. https://discord.gg/vcWNP7T6
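one thing you could try, since count() only needs vertices: filter out the edges as well so Spark doesn't keep them around. the graph-filter section of the TinkerPop docs uses bothE().limit(0) for that. a rough, untested sketch:

// untested sketch: keep only 'ticket' vertices and filter out all edges
g.withComputer(Computer.compute().
        vertices(hasLabel('ticket')).
        edges(bothE().limit(0))).
    V().count()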