Very slow regex query (AWS Neptune)

We have a query that searches a data set of about ~400,000 vertices, matching properties using a case insensitive TextP.regex() expression. We are observing very bad query performance; even after several other optimizations, it still takes 20-45 seconds, often timing out.

Simplified query:
g.V()
.has_label("foo").or_(
  __.has("property_1", TextP.regex("(?i)^bar")),
  __.has("property_2", TextP.regex("(?i)^bar")),
  __.has("property_3", TextP.regex("(?i)^bar")),
)
.order()
.by("date", Order.asc)
.limit(1)
.value_map(True)


We are on a db.r6g.xlarge instance, and do NOT observe any meaningful CPU or memory spikes from this query. We have profiled the query and the TextP.regex() portion seems to take 99%+ of the total runtime.

We're looking for any information that might help us optimize this query or at least understand the poor performance a little better. Thanks in advance!
Solution
When Neptune stores data it stores it in 3 different indexed formats (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html#feature-overview-storage-indexing), each of which are optimized for a specific set of common graph patterns. Each of these indexes is optimized for exact match lookups so when running queries that require partial text matches, such as a regex query, all the matching property data needs to be scanned to see if it matches the provided expression.

To get a performant query for partial text matches the suggestion is to use the Full Text search integration (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html) , which will integrate with OpenSearch to provide robust full text searching capabilities within a Gremlin query
Learn about the four positions of a Neptune quad element.
Was this page helpful?