Seeking Help: Building a Text-to-Gremlin Corpus Generator - AST Parsing

Hey everyone, I'm working on fine-tuning a large language model for text-to-Gremlin generation. To do this, I need a substantial dataset of natural language queries paired with their corresponding Gremlin queries. I'm currently building a corpus generator for this. I've seen some work on text-to-Cypher where they parsed the Cypher AST (Abstract Syntax Tree). However, the ASTs for Cypher and Gremlin are quite different. Does anyone have suggestions on how to tackle this? Specifically:

* Are there any existing tools for parsing Gremlin ASTs?
* Alternatively, are there any methods to build such a corpus generator without relying on AST parsing?

Any help or ideas would be greatly appreciated! Thanks!
26 Replies
spmallette, 2w ago
interesting. can you say how large a corpus you think you will need?
imbajin, 2w ago
I'm back~
niuzj (OP), 2w ago
I think the more data, the better for model performance. So, I need to create a corpus generator.
spmallette, 2w ago
i would think more is better, but was hoping you had an idea of how large a corpus you intended to generate. i ask because we already have a pretty good corpus of 1600 working gremlin queries in the test suite with expected results. they lack a natural language description though. if there were some effort to remedy that, i think it could make for a good base to help with the task you're talking about, wouldn't it? as a bonus, the tests use a structured format, given that they are written in gherkin, which makes them really easy to work with programmatically
spmallette, 2w ago
GitHub
tinkerpop/gremlin-test/src/main/resources/org/apache/tinkerpop/grem...
Apache TinkerPop - a graph computing framework. Contribute to apache/tinkerpop development by creating an account on GitHub.
spmallette, 2w ago
as these tests represent the mechanism for enforcing Gremlin semantics for providers, i think this could be a useful piece for tuning up an LLM. i've thought we might modify each test definition to have something like:
Scenario: g_V_count
Given the modern graph
And the traversal description
"""
This traversal counts the total number of vertices in the graph.
"""
And the traversal of
"""
g.V().count()
"""
When iterated to list
Then the result should be ordered
| result |
| d[6].l |
not sure i wholly like the syntax of "And the traversal description", but that would be the idea.
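If descriptions were added along the lines proposed above, pairing each one with its traversal could be fairly mechanical. The following is a minimal Python sketch, not TinkerPop tooling. It assumes the hypothetical "the traversal description" step from the example (which is not part of the test suite today), and the function name `description_query_pairs` and the sample feature header are made up for illustration.

```python
import re

# Hypothetical step from the proposal in this thread; not in the current test suite.
DESCRIPTION_STEP = "the traversal description"
# Existing step that introduces the Gremlin query.
TRAVERSAL_STEP = "the traversal of"

# A "Given"/"And" step line followed directly by a triple-quoted doc string.
DOCSTRING_STEP = re.compile(
    r'^[ \t]*(?:Given|And)[ \t]+(?P<step>[^\n]+?)[ \t]*\n'
    r'[ \t]*"""[ \t]*\n(?P<body>.*?)\n[ \t]*"""',
    re.MULTILINE | re.DOTALL,
)


def description_query_pairs(feature_text):
    """Return one record per scenario that has both a description and a traversal."""
    pairs = []
    # Everything before the first "Scenario:" is the feature header, so skip it.
    blocks = re.split(r'^[ \t]*Scenario:', feature_text, flags=re.MULTILINE)[1:]
    for block in blocks:
        name = block.split("\n", 1)[0].strip()
        steps = {m.group("step"): m.group("body") for m in DOCSTRING_STEP.finditer(block)}
        if DESCRIPTION_STEP in steps and TRAVERSAL_STEP in steps:
            pairs.append({
                "scenario": name,
                # Collapse line breaks and indentation inside the description.
                "text": " ".join(steps[DESCRIPTION_STEP].split()),
                "gremlin": steps[TRAVERSAL_STEP].strip(),
            })
    return pairs


# Sample input mirroring the proposed scenario layout (the header line is invented).
SAMPLE = '''
Feature: Step - count()

  Scenario: g_V_count
    Given the modern graph
    And the traversal description
      """
      This traversal counts the total number of vertices in the graph.
      """
    And the traversal of
      """
      g.V().count()
      """
    When iterated to list
    Then the result should be ordered
      | result |
      | d[6].l |
'''

for pair in description_query_pairs(SAMPLE):
    print(pair)
```

Scenarios without a description are simply skipped, so a script like this could run against the suite while descriptions are still being filled in.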
Captator, 2w ago
Last time I needed to look into fine-tuning (~6mo ago, so potentially out of date), on the order of 1k high-quality samples was the recommended data volume for PEFT approaches. Which fine-tuning approaches/techniques are you exploring?
niuzj (OP), 2w ago
Before this, we did not have a large corpus, so I think PEFT is OK. Thank you very much for providing these.
Andrea, 2w ago
One should note that some of the Gherkin tests are for error scenarios (i.e. invalid queries), so those should not be included in your corpus. Those should have an expectation that 'the traversal will raise an error'. See https://github.com/apache/tinkerpop/blob/master/gremlin-test/src/main/java/org/apache/tinkerpop/gremlin/features/StepDefinition.java for more information, specifically the methods annotated with @Given
spmallette, 2w ago
i'll take a moment to go back to the original question here:
"I've seen some work on text-to-Cypher where they parsed the Cypher AST (Abstract Syntax Tree). However, ASTs for Cypher and Gremlin are quite different. Does anyone have suggestions on how to tackle this? Specifically: Are there any existing tools for parsing Gremlin ASTs? Alternatively, are there any methods to build such a corpus generator without relying on AST parsing?"
I'm not sure what text-to-cypher was doing in their work. perhaps you could expand a bit on what they did? For Gremlin we do have a grammar which you can find here: https://github.com/apache/tinkerpop/blob/b86ae2663fbd71067a412361e26efaf0a63b5987/gremlin-language/src/main/antlr4/Gremlin.g4 and a parser that will produce a Java Traversal object from that grammar.
spmallette, 2w ago
i'm no expert, but i thought i'd heard it was valuable to include examples of what doesn't work as much as what does. that said, worth calling out that we do have tests for error conditions in there. thanks!
Andrea, 2w ago
I'm no expert either but that's a good call, as long as the error scenarios are separated from the others
spmallette, 2w ago
another thing i've been curious about in doing this is knowing how much of a description each test needs. that example i gave was for an extremely simple query, but for some of the more complicated ones, i don't know what would be an appropriate description in length or language. if anyone paying attention to this thread has information on that i'd be interested to hear thoughts on the matter.

@niuzj had asked: "may I ask how you generated these test files?" and i said i'd reply here. they aren't machine generated, if that's what you mean. we author those files manually and they are a part of our test kit. providers use these tests to validate that their graph databases work as expected and are compliant with TinkerPop.
niuzj (OP), 2w ago
I mean whether these test files have some reusable templates or similar content that can be referenced.
Andrea, 2w ago
These tests use cucumber testing framework and gherkin syntax to define test 'scenarios' - https://cucumber.io/docs/
Andrea, 2w ago
There is a custom step definition to define the Gremlin queries, and it is denoted by the phrase "the traversal of". If you were looking to extract all queries from these feature tests, then you could use that phrase to do so. Example:
And the traversal of
"""
g.V(vid1).out().values("name").inject("daniel").as("a").map(__.length()).path()
"""
spmallette, 2w ago
Details for the specifics of our Gherkin DSL are found here: https://tinkerpop.apache.org/docs/current/dev/developer/#gremlin-language-test-cases
Andrea, 2w ago
Out of curiosity I had AI generate a script to extract all queries from .feature files, separating those that raise errors. It seems to do the job? I only quickly glanced at the results.
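The script itself was not posted in the thread. As a rough idea of what such an extraction could look like, here is a minimal Python sketch under stated assumptions: it relies only on the two phrases mentioned earlier in the thread ("the traversal of" and "the traversal will raise an error"), takes the first traversal in each scenario, and uses made-up function names (`split_queries`, `split_corpus`).

```python
import re
from pathlib import Path

# The "the traversal of" step followed by a triple-quoted doc string holding the query.
TRAVERSAL = re.compile(
    r'^[ \t]*(?:Given|And)[ \t]+the traversal of[ \t]*\n'
    r'[ \t]*"""[ \t]*\n(?P<query>.*?)\n[ \t]*"""',
    re.MULTILINE | re.DOTALL,
)
# Expectation phrase that marks an error scenario, as described in this thread.
ERROR_MARKER = "the traversal will raise an error"


def split_queries(feature_text):
    """Return (valid, invalid) lists of Gremlin strings from one feature file's text."""
    valid, invalid = [], []
    # Text before the first "Scenario:" is the feature header, so skip it.
    blocks = re.split(r'^[ \t]*Scenario:', feature_text, flags=re.MULTILINE)[1:]
    for block in blocks:
        match = TRAVERSAL.search(block)
        if match is None:
            continue  # scenario without a traversal step
        query = match.group("query").strip()
        if ERROR_MARKER in block:
            invalid.append(query)
        else:
            valid.append(query)
    return valid, invalid


def split_corpus(root):
    """Walk a directory tree and collect queries from every .feature file."""
    valid, invalid = [], []
    for path in sorted(Path(root).rglob("*.feature")):
        file_valid, file_invalid = split_queries(path.read_text(encoding="utf-8"))
        valid.extend(file_valid)
        invalid.extend(file_invalid)
    return valid, invalid
```

Calling `split_corpus("path/to/features")` on a local checkout (the path here is a placeholder) would return the two lists, which keeps the error scenarios available as negative examples without mixing them into the main corpus.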
spmallette, 2w ago
note that we have a feature built in Java for that sort of thing. it actually even grabs the Gremlin out of the Documentation itself: https://github.com/apache/tinkerpop/tree/master/gremlin-language/src/main/java/org/apache/tinkerpop/gremlin/language/corpus
Andrea, 2w ago
wow so much to discover about what's in tinkerpop project 🙂
niuzj (OP), 2w ago
Would it be correct to assume that all these test files adhere to the same schema? @spmallette
spmallette, 2w ago
i assume you mean the graph schema? if so, then no. that is driven by the test itself. this line mostly tells you which schema is in use:
Given the modern graph
in that case, it's TinkerPop's "modern" dataset and you'd need to consider the schema for that in writing the test. One thing to note is that we do also have an "empty graph" option for cases where we need to contrive a graph to make a test fit what we want to validate. we've generally tried to make the schema used in those cases match "modern" but not sure that's always the case. it wouldn't be too hard to do that. for an LLM's purposes, i imagine it's advantageous to have multiple schemas within the tests (though we don't have a ton of variation - mostly it's "modern" as that little structure covers so many scenarios, surprisingly). I also think that we could adjust the tests that use "empty" to include a schema description or something. @niuzj do you have a sense at this point as to how useful the tests will be in your work and what work needs to be done to improve them for fine-tuning an LLM?
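To get a feel for how much schema variation the tests actually contain, one could tally the graph named in each scenario's "Given the ... graph" line. This is a small Python sketch under assumptions: the step wording for the empty case is taken to be "Given the empty graph" based on the description above, and the function names `graph_per_scenario` and `graph_usage` are invented for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Given the modern graph" and captures the graph name.
GRAPH_STEP = re.compile(
    r'^[ \t]*Given[ \t]+the[ \t]+(?P<graph>\w+)[ \t]+graph[ \t]*$',
    re.MULTILINE,
)


def graph_per_scenario(feature_text):
    """Map each scenario name to the graph its Given step names ("unknown" if none)."""
    result = {}
    # Text before the first "Scenario:" is the feature header, so skip it.
    blocks = re.split(r'^[ \t]*Scenario:', feature_text, flags=re.MULTILINE)[1:]
    for block in blocks:
        name = block.split("\n", 1)[0].strip()
        match = GRAPH_STEP.search(block)
        result[name] = match.group("graph") if match else "unknown"
    return result


def graph_usage(feature_text):
    """Count how many scenarios use each graph."""
    return Counter(graph_per_scenario(feature_text).values())
```

Scenarios that come back as "empty" would be the ones to review by hand, since their effective schema is defined by whatever the test itself constructs.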
niuzj (OP), 2w ago
More graph schemas are clearly beneficial for model training. Utilizing the test file data you provided as fine-tuning corpus is one of the approaches. This link shows how a student in another community did it, and we'd like to use that as a reference.
niuzj (OP), 2w ago
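WeChat Official Accounts Platform
Community Contribution | Awesome-Text2GQL: Open-source, fully automated Text2GQL corpus gener...
This article introduces Awesome-Text2GQL, the work of Pang, an open-source contributor to the TuGraph community from the University of Science and Technology of China.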
niuzj (OP), 2w ago
GitHub
GitHub - TuGraph-family/Awesome-Text2GQL: Fine-Tuning Dataset Auto-...
Fine-Tuning Dataset Auto-Generation for Graph Query Languages. - TuGraph-family/Awesome-Text2GQL
Andrea, 2w ago
This is a great project and would be very beneficial to the community 👍
