How to run the MapReduce reindexing job
Has anyone succeeded in running the MapReduce reindexing job? We ran into the usual dependency nightmare. I assume we should bundle all the dependencies into an uber-jar, right? Otherwise we would have to put the JanusGraph dependencies on the YARN node classpath, no?
5 Replies
I used to have a working setup on a YARN cluster, but I cannot find it anymore. IIRC, an uber-jar is the way to go.
"we should put in the yarn node classpath the janusgraph dependencies"
I am not 100% sure, but I don't think I did this.
Thank you 🙏 The uber-jar seems the most plausible solution. Then there is the usual mess of pulling all the dependencies together.
One last point: did you ever consider creating a reindexing job based on Spark instead of MR? It would be more portable, since MR is tied to the Hadoop environment (YARN, etc.).
I agree we should gradually move away from MapReduce, or at least allow people to do reindexing using Spark. I don't foresee that happening in the near future, unless someone is willing to tackle it.
There's an ad hoc way to do it yourself: use a Spark job to scan all vertices, update the properties that you want to reindex, and commit. It could be a no-op update that just rewrites the value in place without changing it, but it will trigger reindexing for that vertex/edge (if I recall correctly).
In case you don't know how to "use Spark job to scan all vertices ... and commit", here's an example: https://github.com/Citegraph/citegraph/blob/main/backend/src/main/java/io/citegraph/data/spark/loader/VertexPropertyEnricher.java
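To make the shape of that approach concrete, here is a minimal, self-contained sketch of the "rewrite each property with its current value, then commit once per batch" loop. It deliberately uses no JanusGraph or Spark classes: GraphTx, InMemoryTx, batches and rewriteBatch are hypothetical names made up for illustration, and the in-memory store only stands in for a real graph transaction. In a real Spark job each batch would roughly correspond to one partition, with the graph opened inside the partition; whether a no-op write actually refreshes the index is the recollection above, not something this sketch verifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the no-op rewrite pattern. Every name here is hypothetical. */
public class NoOpReindexSketch {

    /** Hypothetical stand-in for a graph transaction; NOT a JanusGraph API. */
    interface GraphTx {
        Object readProperty(long vertexId, String key); // null if the property is absent
        void writeProperty(long vertexId, String key, Object value);
        void commit();
    }

    /** Splits the ids into consecutive batches of at most batchSize elements. */
    static List<List<Long>> batches(List<Long> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Long>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            out.add(new ArrayList<>(ids.subList(i, Math.min(i + batchSize, ids.size()))));
        }
        return out;
    }

    /**
     * Rewrites the given property with its current value for every vertex in
     * the batch that has it, then commits once. Returns the number of rewrites.
     */
    static int rewriteBatch(GraphTx tx, List<Long> batch, String key) {
        int touched = 0;
        for (long id : batch) {
            Object value = tx.readProperty(id, key);
            if (value != null) {
                tx.writeProperty(id, key, value); // same value: a no-op update
                touched++;
            }
        }
        tx.commit();
        return touched;
    }

    /** In-memory fake so the sketch runs without a cluster. */
    static final class InMemoryTx implements GraphTx {
        final Map<Long, Map<String, Object>> store = new HashMap<>();
        int writes = 0;
        int commits = 0;

        @Override
        public Object readProperty(long vertexId, String key) {
            Map<String, Object> props = store.get(vertexId);
            return props == null ? null : props.get(key);
        }

        @Override
        public void writeProperty(long vertexId, String key, Object value) {
            store.computeIfAbsent(vertexId, id -> new HashMap<>()).put(key, value);
            writes++;
        }

        @Override
        public void commit() {
            commits++;
        }
    }

    public static void main(String[] args) {
        InMemoryTx tx = new InMemoryTx();
        for (long id = 1; id <= 5; id++) {
            tx.writeProperty(id, "name", "v" + id);
        }
        List<Long> ids = new ArrayList<>(tx.store.keySet());
        int total = 0;
        for (List<Long> batch : batches(ids, 2)) {
            total += rewriteBatch(tx, batch, "name");
        }
        System.out.println("rewrote " + total + " properties in " + tx.commits + " commits");
    }
}
```

Committing once per batch rather than once per vertex keeps transactions small without paying a commit for every single write; the batch size is a tuning knob, not a recommended value.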
Thank you so much, this is very helpful.