StaleIndexRecordUtil.forceRemoveVertexFromGraphIndex not working

The job completes successfully, but I can still see the vertices in the graph. Oddly, even if I pass a non-existent vertex ID, the job does not fail. Any idea why it behaves this way? Or am I missing something?
11 Replies
Bo, 15mo ago
@porunov would you like to chime in?
porunov, 15mo ago
The tool removes a vertex record from the graph index if it exists, and it removes records only from the selected index. Note that for a composite index you need the vertex id plus the property values used in that composite index. For a mixed index the vertex id is enough. Thus, removing stale records from composite indexes is usually more complicated if you don't know the values that were used in your index record. Make sure you removed your non-existing vertex from all the indexes. Perhaps you removed it from some indexes but not from others.

Note that the tool was designed specifically for non-existing vertices. It's wrong to use it on non-stale, existing vertices, because you corrupt your index that way. The tool is meant to clean up the mess in non-atomic environments (for example, cleaning up mixed indexes, or cleaning up composite indexes when your use case wasn't following atomicity). I'm not sure what your use case is or how exactly you use the tool, but please use it carefully, because it may corrupt your index if used wrongly. If you did remove the wrong records from your index, you will need to trigger a reindexing job to restore it OR add the proper records to your index manually (which might be more complicated depending on your use case).
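A minimal sketch of the distinction described above, as a toy in-memory model rather than JanusGraph code. The class `ToyIndexes` and all of its methods are hypothetical; the assumption is only that a composite index record is located by its property value while a mixed index record is located by vertex id. It also illustrates why a removal for an id with no matching record can finish without an error: there is simply nothing to delete.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical classes, NOT the JanusGraph API.
class ToyIndexes {
    // Composite index: one record per indexed property value, holding vertex ids.
    final Map<String, Set<Long>> compositeByValue = new HashMap<>();
    // Mixed index: one document per vertex id.
    final Map<Long, Map<String, Object>> mixedByVertexId = new HashMap<>();

    // Adds the vertex to both toy indexes under the given "fileId" value.
    void index(long vertexId, String fileId) {
        compositeByValue.computeIfAbsent(fileId, k -> new HashSet<>()).add(vertexId);
        Map<String, Object> doc = new HashMap<>();
        doc.put("fileId", fileId);
        mixedByVertexId.put(vertexId, doc);
    }

    // Composite removal needs the vertex id AND the indexed value to find the record.
    // Returns false, without throwing, when no matching record exists.
    boolean removeFromComposite(long vertexId, String fileId) {
        Set<Long> ids = compositeByValue.get(fileId);
        if (ids == null || !ids.remove(vertexId)) {
            return false;
        }
        if (ids.isEmpty()) {
            compositeByValue.remove(fileId);
        }
        return true;
    }

    // Mixed removal needs only the vertex id.
    boolean removeFromMixed(long vertexId) {
        return mixedByVertexId.remove(vertexId) != null;
    }

    public static void main(String[] args) {
        ToyIndexes idx = new ToyIndexes();
        idx.index(42L, "1234");
        System.out.println(idx.removeFromComposite(999L, "1234"));       // unknown id: false
        System.out.println(idx.removeFromComposite(42L, "wrong-value")); // wrong value: false
        System.out.println(idx.removeFromComposite(42L, "1234"));        // true
        System.out.println(idx.removeFromMixed(42L));                    // true
    }
}
```

In this model a "successful" call says nothing about whether a record was actually removed, which is one plausible reading of the behaviour in the original question.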
HailDevil, 15mo ago
Thanks Oleksandr for explaining it so well! The issue I am struggling with is that I can see thousands of vertices in my graph that do not have the property values I have indexed. These vertices are isolated: they have no incoming or outgoing edges. I am not sure how, but these vertices are sometimes encountered during traversals of other vertices which are indexed properly. They have become a problem because they are causing serious failures in production.

To resolve this, I identified all these empty vertices with g.V().hasNot('property-name') and used StaleIndexRecordUtil to get rid of them (my understanding was that these vertices might be stale index records). In my case I have multiple composite indexes, each on a single property. As you said, I should be removing them from all the indexes, which I missed: I called StaleIndexRecordUtil for only one index.

One interesting fact is that I was able to remove these vertices with g.V(id).drop().iterate() (I have only tested this in the dev environment so far, and I am not sure if I should be using it in prod). Do you think I can go ahead with this?

One more observation: sometimes g.V().has('fileId', '1234') returns a vertex, but surprisingly, if I do a valueMap on it, it returns an empty result []. The way my service works, I won't be able to reindex it in this form. So I assume something is wrong with the indexes only!
porunov, 15mo ago
If you can remove vertices with g.V(id).drop().iterate() - that's what should be done. Don't use StaleIndexRecordUtil if you have the vertices in the graph (storage backend). StaleIndexRecordUtil is useful if you removed your vertex from the storage backend but it is left in the index backend for any reason. However, this is an advanced tool which should not be used in most cases. I would say StaleIndexRecordUtil is the last step, to be used only when you can't resolve your issues using pure Gremlin or other JanusGraph APIs.

If you did remove your vertex from the storage backend but that vertex is still recorded by some indexes, you will need to:
1. Remove this vertex from all mixed indexes. This step is easy using StaleIndexRecordUtil because it requires the vertex id only.
2. Remove this vertex from all composite indexes. This step is more complicated because JanusGraph requires the values of all properties of that vertex which are used for each composite index.

If you did save those property values somewhere outside of the storage backend, you can reuse them to remove those composite index records. However, if the vertex is removed and the values are lost, then to figure out the lost values you will need to scan all the records of those composite indexes to find out which records still point to your removed vertex. Scanning the full index table might be slow because it's essentially a full scan job, but over indexes instead of vertices.

Again, try to use g.V(vertexId).drop().iterate(). If your vertex is still in the storage backend, it will check the vertex properties and update all the affected indexes automatically. Thus, you won't need StaleIndexRecordUtil at all in that case.
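The "scan all the records" idea from the reply above can be sketched roughly as follows. This is a toy model, not a reader for the real index table: `StaleRecordScan` and `findStale` are hypothetical names, and the composite index is assumed to be already decoded into a map from indexed value to vertex ids. The scan walks every record and reports the ids that no longer exist in storage, together with the values that would be needed to remove their records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: hypothetical names, NOT the JanusGraph API or its on-disk format.
class StaleRecordScan {
    // For each vertex id that is indexed but missing from storage, collect the
    // indexed values of the records that still point to it.
    static Map<Long, List<String>> findStale(Map<String, Set<Long>> compositeRecords,
                                             Set<Long> liveVertexIds) {
        Map<Long, List<String>> stale = new TreeMap<>();
        for (Map.Entry<String, Set<Long>> record : compositeRecords.entrySet()) {
            for (Long vertexId : record.getValue()) {
                if (!liveVertexIds.contains(vertexId)) {
                    stale.computeIfAbsent(vertexId, k -> new ArrayList<>())
                         .add(record.getKey());
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Index says: value "1234" -> vertices 42 and 7, value "5678" -> vertex 8.
        Map<String, Set<Long>> records = Map.of(
                "1234", Set.of(42L, 7L),
                "5678", Set.of(8L));
        // Storage only contains vertices 7 and 8, so 42 is stale.
        Set<Long> live = Set.of(7L, 8L);
        System.out.println(findStale(records, live)); // prints {42=[1234]}
    }
}
```

The cost is proportional to the number of index records, which matches the warning that this is essentially a full scan over the index.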
HailDevil, 15mo ago
"requires values of all properties of that vertex which are used for each composite index"
This could be difficult for me to identify! Is there any way to read the graphindex and janusgraph_ids tables to identify which vertex id is mapped to which properties? I have a Spark integration which I think can do this quickly. Do you have any reference on how to do this scan?
porunov, 15mo ago
I don't have any examples of how to do that properly because I haven't done it myself. However, I would imagine you need to scan the graphindex table in your storage backend and find your vertex ids as well as the values used in some of the cells. Perhaps someone who has done that already could chime in and give more details on the process.

Alternatively, you could use a system which catches such failures at runtime and automatically triggers cleanup of the relevant index values. For example, let's assume your code has something like: g.V().has("name", name).drop().iterate() In that case you will receive an exception which tells you the failing vertices. You could catch it and trigger removal of the vertex's index records from the relevant indexes using your failing vertex ids + the name value for your nameIndex, for example. This process could serve as an in-place recovery mechanism, though it could be more complicated to build.

Alternatively, you could simply migrate your data to a new keyspace using a full-scan job, which will re-create the indexes properly (i.e. copy vertices + edges to another keyspace).
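The catch-and-clean-up idea from the reply above might look roughly like this, again as a toy in-memory model. Everything here is hypothetical: `InPlaceRecovery`, `MissingVertexException`, and the maps standing in for the storage backend and for nameIndex. In a real setup the exception type and the index cleanup call would come from JanusGraph, and their exact shape is not shown in this thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, NOT the JanusGraph API.
class InPlaceRecovery {
    // Stand-in for the failure raised when an indexed vertex is missing from storage.
    static class MissingVertexException extends RuntimeException {
        final long vertexId;
        MissingVertexException(long vertexId) {
            super("vertex " + vertexId + " not found in storage");
            this.vertexId = vertexId;
        }
    }

    final Map<Long, String> storage = new HashMap<>();        // vertex id -> name
    final Map<String, Set<Long>> nameIndex = new HashMap<>(); // name -> vertex ids

    void removeIndexRecord(String name, long vertexId) {
        Set<Long> ids = nameIndex.get(name);
        if (ids != null) {
            ids.remove(vertexId);
            if (ids.isEmpty()) {
                nameIndex.remove(name);
            }
        }
    }

    // Normal drop: removes the vertex and its index record, or fails if it is gone.
    void dropVertex(long vertexId) {
        String name = storage.remove(vertexId);
        if (name == null) {
            throw new MissingVertexException(vertexId);
        }
        removeIndexRecord(name, vertexId);
    }

    // Drops every vertex the index returns for this name. When a drop fails, the
    // failing id plus the known name value are used to delete the stale record.
    // Returns how many stale records were repaired.
    int dropByName(String name) {
        int repaired = 0;
        Set<Long> ids = new HashSet<>(nameIndex.getOrDefault(name, new HashSet<>()));
        for (long vertexId : ids) {
            try {
                dropVertex(vertexId);
            } catch (MissingVertexException e) {
                removeIndexRecord(name, e.vertexId);
                repaired++;
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        InPlaceRecovery graph = new InPlaceRecovery();
        graph.storage.put(1L, "alice");
        // Vertex 2 is indexed under "alice" but no longer exists in storage.
        graph.nameIndex.put("alice", new HashSet<>(Set.of(1L, 2L)));
        System.out.println(graph.dropByName("alice")); // prints 1
        System.out.println(graph.nameIndex.isEmpty()); // prints true
    }
}
```

The point of the pattern is that the query itself supplies the missing property value (here the name), so no separate index scan is needed for records that are hit by normal traffic.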
HailDevil, 15mo ago
I can go ahead with the .drop().iterate() method. But the issue is that in the past I have seen that, even though I drop such vertices with drop(), they somehow come back into existence after some time (when I say 'come back' I literally mean with the same vertex ID). That is why I doubt the correctness of using drop. If I can get the mapping of vertex ID to property value, I may be able to find some workaround 🙂 Just out of curiosity, what does mgmt.removeGhostVertices() do?
Bo, 15mo ago
It scans your entire database and removes ghost vertices. A ghost vertex is basically stale data after an unsuccessful partial delete.
HailDevil, 15mo ago
OK, I tried running it in the Gremlin console, but did not see any effect on the count of such vertices.
HailDevil, 15mo ago
@boxuanli I executed GhostVertexRemover and got this output: [image: GhostVertexRemover output] How can I see exactly which vertices are getting removed?
Bo, 15mo ago
That's not supported as of now. One indirect way to do it is to use Change Data Capture if your primary storage supports it.