Seeking Help: Building a Text-to-Gremlin Corpus Generator - AST Parsing
Hey everyone,
I'm working on fine-tuning a large language model for text-to-Gremlin generation. To do this, I need a substantial dataset of natural language queries paired with their corresponding Gremlin queries. I'm currently building a corpus generator for this.
I've seen some work on text-to-Cypher where they parsed the Cypher AST (Abstract Syntax Tree). However, the ASTs for Cypher and Gremlin are quite different.
Does anyone have suggestions on how to tackle this? Specifically:
* Are there any existing tools for parsing Gremlin ASTs?
* Alternatively, are there any methods to build such a corpus generator without relying on AST parsing?
Any help or ideas would be greatly appreciated! Thanks!
26 Replies
interesting. can you say how large a corpus you think you will need?
I'm back~
I think the more data, the better for model performance. So, I need to create a corpus generator.
i would think more is better, but was hoping you had an idea of how large a size you intended to generate. i ask because we already have a pretty good corpus of 1600 working gremlin queries in the test suite with expected results. they lack a natural language description though. if there were some effort to remedy that, i think it could make for a good base to help with the task you're talking about, wouldn't it?
as a bonus, the tests use a structured format, given that they are written in gherkin, which makes them really easy to work with programmatically
you can find them all in these directories: https://github.com/apache/tinkerpop/tree/master/gremlin-test/src/main/resources/org/apache/tinkerpop/gremlin/test/features
as these tests represent the mechanism for enforcing Gremlin semantics for providers, i think this could be a useful piece for tuning up an LLM
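To get a feel for how much material those directories hold, a minimal Python sketch like the one below could walk a local checkout and count scenarios per file. It assumes the feature files end in `.feature` and that each test starts with a `Scenario:` line; the function name `count_scenarios` is made up for illustration.

```python
import re
from pathlib import Path

# Assumption: every test in a feature file opens with a "Scenario:" line.
SCENARIO_LINE = re.compile(r"^[ \t]*Scenario:", re.MULTILINE)


def count_scenarios(features_dir):
    """Map each .feature file (path relative to features_dir) to its scenario count."""
    root = Path(features_dir)
    counts = {}
    for path in sorted(root.rglob("*.feature")):
        text = path.read_text(encoding="utf-8")
        counts[path.relative_to(root).as_posix()] = len(SCENARIO_LINE.findall(text))
    return counts
```

Calling `sum(count_scenarios(path_to_features).values())` on a checkout would give a rough total to compare against the figure quoted above.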
i've thought we might modify each test definition to have something like:
not sure i wholly like the syntax of "And the traversal description", but that would be the idea.
Last time I needed to look into fine-tuning (~6mo ago, so potentially out of date), on the order of 1k high-quality samples was the recommended data volume for PEFT approaches.
Which fine-tuning approaches/techniques are you exploring?
We haven't had a large corpus up to now, so I think PEFT is OK.
Thank you very much for providing these.
One should note that some of the Gherkin tests are for error scenarios (i.e. invalid queries), so those should not be included in your corpus. Such tests carry the expectation 'the traversal will raise an error'. See https://github.com/apache/tinkerpop/blob/master/gremlin-test/src/main/java/org/apache/tinkerpop/gremlin/features/StepDefinition.java for more information, specifically the methods annotated with @Given.
i'll take a moment to go back to the original question here:
"I've seen some work on text-to-Cypher where they parsed the Cypher AST (Abstract Syntax Tree). However, ASTs for Cypher and Gremlin are quite different. Does anyone have suggestions on how to tackle this? Specifically: Are there any existing tools for parsing Gremlin ASTs? Alternatively, are there any methods to build such a corpus generator without relying on AST parsing?"
I'm not sure what text-to-cypher was doing in their work. perhaps you could expand a bit on what they did? For Gremlin we do have a grammar which you can find here: https://github.com/apache/tinkerpop/blob/b86ae2663fbd71067a412361e26efaf0a63b5987/gremlin-language/src/main/antlr4/Gremlin.g4 and a parser that will produce a Java Traversal object from that grammar.
i'm no expert, but i thought i'd heard it was valuable to include examples of what doesn't work as much as what does. that said, worth calling out that we do have tests for error conditions in there. thanks!
I'm no expert either but that's a good call, as long as the error scenarios are separated from the others
another thing i've been curious about in doing this is knowing how much of a description each test needs. that example i gave was for an extremely simple query, but for some of the more complicated ones, i don't know what would be an appropriate description in length or language. if anyone paying attention to this thread has information on that i'd be interested to hear thoughts on the matter.
@niuzj had asked: "may I ask how you generated these test files?" and i said i'd reply here.
they aren't machine generated, if that's what you mean. we author those files manually and they are a part of our test kit. providers use these tests to validate that their graph databases work as expected and are compliant with TinkerPop.
I meant to ask whether these test files follow some reusable templates or similar content that can be referenced.
These tests use the Cucumber testing framework and Gherkin syntax to define test 'scenarios' - https://cucumber.io/docs/
There is a custom step definition to define the Gremlin queries, and it is denoted by the phrase "the traversal of". If you were looking to extract all queries from these feature tests then you could use that phrase to do so. Example:
Details for the specifics of our Gherkin DSL are found here: https://tinkerpop.apache.org/docs/current/dev/developer/#gremlin-language-test-cases
Out of curiosity I had AI generate a script to extract all queries from .feature files, separating those that raise errors. It seems to do the job? I only quickly glanced at the results.
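That script was not posted, so the following is only a minimal Python sketch of the same idea, not the script itself. It assumes each test begins with a `Scenario:` line, that the phrase "the traversal of" is followed by a triple-quoted Gherkin doc string holding the Gremlin, and that error tests contain the phrase 'the traversal will raise an error' quoted earlier in the thread. The names `extract_queries`, `TRAVERSAL` and `ERROR_PHRASE` are made up for illustration.

```python
import re
import textwrap

# Assumption: the Gremlin sits in a triple-quoted doc string after the phrase.
TRAVERSAL = re.compile(r'the traversal of\s*"""(.*?)"""', re.DOTALL)
# Phrase quoted earlier in the thread as the marker of an error scenario.
ERROR_PHRASE = "the traversal will raise an error"


def extract_queries(feature_text):
    """Return (valid, invalid) lists of Gremlin strings from one .feature file."""
    valid, invalid = [], []
    # Everything before the first "Scenario:" is the feature header, so skip it.
    chunks = re.split(r"^[ \t]*Scenario:", feature_text, flags=re.MULTILINE)[1:]
    for chunk in chunks:
        match = TRAVERSAL.search(chunk)
        if match is None:
            continue  # scenario without a traversal doc string
        # Remove the doc string's common indentation but keep inner line breaks.
        query = textwrap.dedent(match.group(1)).strip()
        (invalid if ERROR_PHRASE in chunk else valid).append(query)
    return valid, invalid
```

Keeping the invalid queries in their own list leaves the choice open: drop them from the corpus, or keep them as clearly separated negative examples.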
note that we have a feature built in Java for that sort of thing. it actually even grabs the Gremlin out of the Documentation itself: https://github.com/apache/tinkerpop/tree/master/gremlin-language/src/main/java/org/apache/tinkerpop/gremlin/language/corpus
wow so much to discover about what's in tinkerpop project 🙂
Would it be correct to assume that all these test files adhere to the same schema? @spmallette
i assume you mean the graph schema? if so, then no. that is driven by the test itself. this line mostly tells you which schema is in use:
in that case, it's TinkerPop's "modern" dataset and you'd need to consider the schema for that in writing the test. One thing to note is that we do also have an "empty graph" option for cases where we need to contrive a graph to make a test fit what we want to validate. we've generally tried to make the schema used in those cases match "modern" but not sure that's always the case. it wouldn't be too hard to do that.
for an LLM's purposes, i imagine it's advantageous to have multiple schemas within the test (though we don't have a ton of variation - mostly it's "modern" as that little structure covers so many scenarios surprisingly). I also think that we could adjust the tests that use "empty" to include a schema description or something.
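To check how much schema variation a feature file really has, a small Python sketch could tally which graph each scenario declares. The step wording `Given the <name> graph` is an assumption (the line itself is not reproduced in this thread), and `graphs_in_use` is a made-up helper name.

```python
import re
from collections import Counter

# Assumed step wording, e.g. "Given the modern graph" or "Given the empty graph".
GRAPH_STEP = re.compile(r"^[ \t]*Given the (\w+) graph[ \t]*$", re.MULTILINE)


def graphs_in_use(feature_text):
    """Count how many scenarios in one .feature file declare each named graph."""
    return Counter(GRAPH_STEP.findall(feature_text))
```

Scenarios counted under "empty" would be the ones to review by hand, since their schema is contrived inside the test rather than fixed by a named dataset.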
@niuzj do you have a sense at this point as to how useful the tests will be in your work and what work needs to be done to improve them for fine-tuning an LLM?
More graph schemas are clearly beneficial for model training.
Utilizing the test file data you provided as fine-tuning corpus is one of the approaches.
This link shows how a student in another community did it, and we'd like to use that as a reference.
https://mp.weixin.qq.com/s/PCV4Qi9w9K-tRf1vMWHpEQ?poc_token=HPpfMGij1UamxY4g2XxzP0NYrwU5rhGopf4925D8
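WeChat Official Accounts Platform
Community Contribution | Awesome-Text2GQL: Open-Source Fully Automated Text2GQL Corpus Gen...
This article introduces Awesome-Text2GQL, the work of Pang, a TuGraph community open-source contributor from the University of Science and Technology of China.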
GitHub
GitHub - TuGraph-family/Awesome-Text2GQL: Fine-Tuning Dataset Auto-...
Fine-Tuning Dataset Auto-Generation for Graph Query Languages. - TuGraph-family/Awesome-Text2GQL
This is a great project and would be very beneficial to the community 👍