Can `CqlInputFormat` do predicate pushdowns/query based prefilters?

Hi! First of all, thank you all for your work on JanusGraph.

In my use case, I have a medium-large graph, ~3TB currently, which might be 1-2 orders of magnitude bigger later. The data in it is generally clustered in a time-based fashion, e.g. newer vertices are mostly connected to other newer vertices (a timestamp is stored as a vertex property). I am writing an OLAP pipeline with Spark where JanusGraph, backed by Cassandra, is the source, and I use TinkerPop's hadoop-gremlin to build vertex programs and run OLAP Gremlin queries.

Per my understanding, in this setup the only point of contact with JanusGraph is through the CqlInputFormat and the server itself is not involved at all. Is that correct?

A very common operation that I'm going to have to do, based on the above clustering assumption, is pre-filtering vertices by a timestamp range before running my logic on the subgraph. As an example, I would like to, say, download the last couple of days' worth of vertices onto my laptop for running some tests. Per my understanding, currently this would entail unconditionally loading the entire dataset into the Spark cluster's memory every time. Is that correct? Is there an alternative?

I have looked into CqlInputFormat's code and I noticed that you can add WHERE clauses, but it looks like there are caveats to that, and I could not understand how to map a (simple) predicate on a vertex property to a CQL clause. I was considering rolling my own input format class once I grokked how to run CQL queries directly.

I'm not super familiar with JanusGraph's codebase, nor am I really a Java expert, but I'm willing to get my hands dirty. Could I please ask for a bird's-eye-view explanation of how graph data is mapped into the backend, or even just pointers on how to navigate the codebase pertaining to that? Or do you have other suggestions that could point me in the right direction?

Thank you! 🙌 cc @criminosis
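For illustration only, the result the pre-filtering step is after can be sketched in plain Python: keep the vertices whose timestamp falls in a window, plus the edges between them. Every name here (`Vertex`, `time_window_subgraph`) is hypothetical and none of it is a TinkerPop or JanusGraph API. The sketch still scans every vertex, so it shows the desired output, not a pushdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    vid: int
    timestamp: int  # epoch seconds, standing in for the vertex property


def time_window_subgraph(vertices, edges, start, end):
    """Keep vertices with start <= timestamp < end and the edges between them."""
    kept = {v.vid: v for v in vertices if start <= v.timestamp < end}
    # An edge survives only if both of its endpoints are inside the window.
    kept_edges = [(src, dst) for (src, dst) in edges if src in kept and dst in kept]
    return list(kept.values()), kept_edges


vertices = [Vertex(1, 100), Vertex(2, 950), Vertex(3, 990)]
edges = [(1, 2), (2, 3)]
recent_vertices, recent_edges = time_window_subgraph(vertices, edges, 900, 1000)
```

If the time-based clustering assumption holds, few edges cross the window boundary, so little is lost by dropping them.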
4 Replies
criminosis, 7d ago
@porunov / @Bo any ideas or any recommendations on who may know? We weren't sure if this was more meaningful to ask here or on the Tinkerpop side.
cdegroc, 7d ago
👋🏻 Hey! We've used CqlInputFormat to dump entire graphs. I agree with your analysis. I think the WHERE clause you're referring to is https://github.com/JanusGraph/janusgraph/blob/v1.0/cassandra-hadoop-util/src/main/java/org/apache/cassandra/hadoop/cql3/CqlConfigHelper.java#L61. I believe this is a CQL configuration option and not a JanusGraph one. Since JanusGraph encodes rows in its own binary format, I doubt this type of filtering would work well (happy to be wrong though!).
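To make that concern concrete, here is a toy sketch, assuming the store only sees each row as opaque byte columns. The byte layout below is invented for the example and is not JanusGraph's actual serialization. The function names are hypothetical too. The point is only that a range predicate on a property value has to run after the reader decodes the row, since the database cannot interpret the bytes.

```python
import struct


def encode_row(vertex_id, timestamp):
    """Pack a toy row as (key, column1, value), all opaque bytes.

    Invented layout for illustration; NOT JanusGraph's real encoding.
    """
    key = struct.pack(">q", vertex_id)
    column1 = b"\x01"  # hypothetical marker meaning "timestamp property"
    value = struct.pack(">q", timestamp)
    return key, column1, value


def decode_timestamp(row):
    """Recover the timestamp from the toy row's value bytes."""
    _key, _column1, value = row
    return struct.unpack(">q", value)[0]


def filter_rows_client_side(rows, start, end):
    """Apply the range predicate only after decoding each row."""
    return [row for row in rows if start <= decode_timestamp(row) < end]


rows = [encode_row(1, 100), encode_row(2, 950), encode_row(3, 990)]
recent = filter_rows_client_side(rows, 900, 1000)
recent_ids = [struct.unpack(">q", key)[0] for key, _column1, _value in recent]
```

All three rows are read and decoded before any of them can be discarded, which is the full-scan cost described in the original question.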
cdegroc, 7d ago
Moreover, Hadoop's CqlInputFormat is an old class, which IIRC was deprecated in favor of https://github.com/datastax/spark-cassandra-connector. That newer project could be a better long-term solution.
johndisandonato (OP), 7d ago
That looks great, I wasn't aware of it. However, I don't see any way to make it play with TinkerPop: it needs an InputFormat implementation somewhere for the graphReader property, and spark-cassandra-connector has no references to that.