Mixed Index (ElasticSearch) Backpressure Ignored?

I understand that JG treats writes to mixed indices as "best effort" and documents (https://docs.janusgraph.org/operations/recovery/#transaction-failure) that mixed-index failures within a transaction are left to a separate periodic cleanup process once JG's WAL is enabled:
If the primary persistence into the storage backend succeeds but secondary persistence into the indexing backends or the logging system fail, the transaction is still considered to be successful because the storage backend is the authoritative source of the graph. ... In addition, a separate process must be setup that reads the log to identify partially failed transaction and repair any inconsistencies caused. ....
But I wanted to ask whether 429 Too Many Requests could be exempted from this behavior and handled directly instead.
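For context, here's roughly the setup the recovery docs describe (sketch only: the properties path and start time are illustrative, and I'm going from memory on the startTransactionRecovery signature):

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class IndexRepairProcess {
    public static void main(String[] args) {
        // The graph must be opened with the write-ahead transaction log enabled,
        // i.e. tx.log-tx = true in the graph configuration.
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph.properties");

        // Separate, long-running process that reads the transaction log and
        // repairs partially failed transactions (e.g. dropped mixed-index writes).
        var recovery = JanusGraphFactory.startTransactionRecovery(
                graph, Instant.now().minus(2, ChronoUnit.HOURS));

        // ... keep this process running alongside the main application.
    }
}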
4 Replies
criminosis · 4mo ago
I've been observing JanusGraph logging a series of these in my deployment:
2024-04-18 19:49:55.259
Caused by: org.elasticsearch.client.ResponseException: method [POST], host [<my-es-cluster-hostname>], URI [/_bulk], status line [HTTP/1.1 429 Too Many Requests] ...
    at org.janusgraph.diskstorage.es.rest.RestElasticSearchClient.bulkRequest(RestElasticSearchClient.java:404) ~[janusgraph-es-1.0.0.jar:?]
    at org.janusgraph.diskstorage.es.rest.RestElasticSearchClient.performRequest(RestElasticSearchClient.java:546) ~[janusgraph-es-1.0.0.jar:?]
    at org.janusgraph.diskstorage.es.rest.RestElasticSearchClient.performRequest(RestElasticSearchClient.java:555) ~[janusgraph-es-1.0.0.jar:?]
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:288) ~[elasticsearch-rest-client-8.10.4.jar:8.10.4] ...
But my client doesn't get any kind of error response. I'm assuming that's because, as a mixed index backend, it's just left to the WAL and the separate repair process per the docs mentioned above. Given that the underlying issue isn't ES failing to handle the request but JG submitting requests faster than ES can accept them, it seems like it'd be more robust for JG's ES client code to react to the 429 and respect the backpressure. It looks like it'd just need a retry loop around this line (https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/rest/RestElasticSearchClient.java#L555) that catches ES's ResponseException and checks for the 429 status.
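Roughly the shape I'm picturing (just a sketch of the idea against the low-level ES RestClient, not the actual JG code; the method name, retry count, and delay policy are made up for illustration):

import java.io.IOException;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

class Retry429 {
    // Retry the request when ES signals backpressure with HTTP 429,
    // rethrowing anything else (or once retries are exhausted).
    static Response performRequestWithRetry(RestClient client, Request request,
                                            int maxRetries, long baseDelayMs) throws IOException {
        for (int attempt = 0; ; attempt++) {
            try {
                return client.performRequest(request);
            } catch (ResponseException e) {
                int status = e.getResponse().getStatusLine().getStatusCode();
                if (status != 429 || attempt >= maxRetries) {
                    throw e; // not backpressure, or we've given up
                }
                long delayMs = baseDelayMs * (1L << attempt); // simple exponential backoff
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}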
criminosis · 4mo ago
If that's acceptable to the team, I'd be happy to put up a PR for this. FWIW, this would be congruent with Elasticsearch's guidance: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#multiple-workers-threads
Make sure to watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException with the Java client), which is the way that Elasticsearch tells you that it cannot keep up with the current indexing rate. When it happens, you should pause indexing a bit before trying again, ideally with randomized exponential backoff.
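For the delay itself, something like this matches the "randomized exponential backoff" they suggest (names and constants are mine, purely illustrative):

import java.util.concurrent.ThreadLocalRandom;

class Backoff {
    // Full-jitter backoff: grow the ceiling exponentially per attempt, cap it,
    // then pick a uniformly random delay below the ceiling.
    static long delayMs(int attempt, long baseMs, long capMs) {
        long ceiling = Math.min(capMs, baseMs * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}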
Bo · 4mo ago
Sounds like a reasonable solution to me. Please go ahead.
criminosis · 4mo ago
Awesome, thanks @Boxuan Li. The PR is up: https://github.com/JanusGraph/janusgraph/pull/4409. Looking forward to working with y'all to refine it so it's acceptable for merging. I know at least my deployment of JanusGraph will benefit greatly from this being merged and released; I often see 429 errors in my logs.