Verifying the count of ingested vertices and edges after bulk loading in JanusGraph

I have bulk loaded around 600k vertices and 800k edges into my JanusGraph cluster backed by Bigtable. I want to verify the number of vertices with a given label 'A' using a Gremlin query, but I'm getting an evaluation timeout error. The evaluation timeout is set to 5 minutes. The Gremlin query I used is:

g.V().hasLabel('A').count()

Can anyone help me with how to verify the count of vertices and edges loaded into the graph? Thanks.
2 Replies
spmallette (15mo ago)
Someone from @janusgraph might wish to comment, but you could of course keep increasing the timeout. You have less than a million elements to count, so a timeout increase probably isn't a bad idea. I'm not sure if 5 minutes is "slow" or not for Bigtable, so it's hard to say if the speed is unexpected. Typically, if you're counting tens of millions of things, you'd want to use spark-gremlin for these sorts of global traversals, where you basically have to touch all the data in the graph, but I don't think your data size should warrant that complexity. You might want to ask this question specifically in the JanusGraph Discord if you don't get any additional comment here.
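As a rough sketch of the timeout suggestion above: if the TinkerPop version behind your JanusGraph supports per-request options, the timeout can be raised for a single traversal instead of for the whole server. The 15-minute value is an arbitrary assumption, and `g` is assumed to be a traversal source that sends requests to Gremlin Server.

```groovy
// Assumption: `g` is a remote traversal source bound to Gremlin Server.
// 900000L (15 minutes, in milliseconds) is an illustrative value, not a recommendation.
timeoutMs = 900000L

// Count vertices with label 'A' under the longer per-request timeout.
g.with('evaluationTimeout', timeoutMs).V().hasLabel('A').count()

// One full scan that returns the count for every vertex label at once.
g.with('evaluationTimeout', timeoutMs).V().groupCount().by(label)

// The same idea for edges.
g.with('evaluationTimeout', timeoutMs).E().groupCount().by(label)
```

The groupCount() form still touches every element, so it is not faster than a single count, but it avoids repeating the scan once per label.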
Solution
porunov (15mo ago)
I think @spmallette already provided a good answer, but I will add several more notes below.

1. Index usage with count operations. JanusGraph doesn't currently support indices based on label only, but it does allow an index based on a label plus one or more properties. If you have a common property on all your vertices of a specific type (let's say an "id" property), then you could create a mixed index with key "id", constrained to your label 'A'. In that case, your query to count all vertices of a specific label would look like:

g.V().hasLabel('A').has('id').count()

That query will use the mixed index for counting instead of a full scan (which is what you are doing right now). Notice that I said mixed index and not composite index, because a composite index supports only equality operations on property values. Thus, with a composite index you would need to provide a specific value in your query, i.e.:

g.V().hasLabel('A').has('id', 123).count()

2. Distributed count computing. If you don't want to use indices for counting, then you could think about computing the count using the Spark computer (useful when you have a lot of data distributed across many machines). As already noted, though, your data set is quite small, so I doubt it will benefit from distributed computing.

3. Timeout and tuning. If neither of those solutions (index counting or distributed computing) is suitable for you, then I would suggest increasing the timeout and tuning your storage backend as well as your JanusGraph nodes for full scan operations. That said, tuning is a whole different topic. Increasing the read timeout in Gremlin Server should be sufficient in your case, I think.
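For the index option, a minimal sketch using the JanusGraph management API might look like the following. Several things here are assumptions rather than details from this thread: `graph` is an already opened JanusGraph instance, a mixed index backend is configured under the name 'search', 'uid' is a hypothetical stand-in for the common property on your 'A' vertices, and 'labelAByUid' is a made-up index name. Because the data is already loaded, the index has to be rebuilt over existing vertices, which is itself a scan and may take a while.

```groovy
import org.janusgraph.core.schema.SchemaAction
import org.janusgraph.core.schema.SchemaStatus
import org.janusgraph.graphdb.database.management.ManagementSystem
import org.apache.tinkerpop.gremlin.structure.Vertex

// Assumptions: `graph` is an open JanusGraph instance, 'search' is the configured
// mixed index backend name, 'uid' and 'labelAByUid' are hypothetical names.
graph.tx().rollback()                      // close any open transaction before schema changes

mgmt = graph.openManagement()
uidKey = mgmt.getPropertyKey('uid')        // existing property key shared by all 'A' vertices
labelA = mgmt.getVertexLabel('A')
mgmt.buildIndex('labelAByUid', Vertex.class).
     addKey(uidKey).
     indexOnly(labelA).                    // constrain the index to label 'A'
     buildMixedIndex('search')
mgmt.commit()

// Wait until every JanusGraph instance has registered the new index.
ManagementSystem.awaitGraphIndexStatus(graph, 'labelAByUid').
     status(SchemaStatus.REGISTERED).call()

// Rebuild the index over the vertices that were bulk loaded earlier.
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex('labelAByUid'), SchemaAction.REINDEX).get()
mgmt.commit()

// Wait for the index to become usable, then run the index-backed count.
ManagementSystem.awaitGraphIndexStatus(graph, 'labelAByUid').
     status(SchemaStatus.ENABLED).call()

g = graph.traversal()
g.V().hasLabel('A').has('uid').count()
```

Whether the count is answered from the index depends on your JanusGraph version and index backend, so it is worth checking the query plan with profile() before relying on it.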