Apache TinkerPop•2w ago

Seeking Help: Building a Text-to-Gremlin Corpus Generator - AST Parsing

Hey everyone, I'm working on fine-tuning a large language model for text-to-Gremlin generation. To do this, I need a substantial dataset of natural language queries paired with their corresponding Gremlin queries. I'm currently building a corpus generator for this. I've seen some work on text-to-Cypher where they parsed the Cypher AST (Abstract Syntax Tree). However, the ASTs for Cypher and Gremlin are quite different. Does anyone have suggestions on how to tackle this? Specifically:

* Are there any existing tools for parsing Gremlin ASTs?
* Alternatively, are there any methods to build such a corpus generator without relying on AST parsing?

Any help or ideas would be greatly appreciated! Thanks!

26 Replies

spmallette•2w ago

interesting. can you say how large a corpus you think you will need?

imbajin•2w ago

I'm back~

niuzjOP•2w ago

I think the more data, the better for model performance. So, I need to create a corpus generator.

spmallette•2w ago

i would think more is better, but was hoping you had an idea of how large a size you intended to generate. i ask because we already have a pretty good corpus of 1600 working gremlin queries in the test suite with expected results. they lack a natural language description though. if there were some effort to remedy that, i think it could make for a good base to help with the task you're talking about, wouldn't it? as a bonus, the tests use a structured format, given that they are written in Gherkin, which makes them really easy to work with programmatically.

spmallette•2w ago

you can find them all in these directories: https://github.com/apache/tinkerpop/tree/master/gremlin-test/src/main/resources/org/apache/tinkerpop/gremlin/test/features

spmallette•2w ago

as these tests represent the mechanism for enforcing Gremlin semantics for providers, i think this could be a useful piece for tuning up an LLM. i've thought we might modify each test definition to have something like:

Scenario: g_V_count
  Given the modern graph
  And the traversal description
    """
    This traversal counts the total number of vertices in the graph.
    """
  And the traversal of
    """
    g.V().count()
    """
  When iterated to list
  Then the result should be ordered
    | result |
    | d[6].l |

not sure i wholly like the syntax of "And the traversal description", but that would be the idea.
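
To make the idea concrete, here is a minimal sketch of how a scenario annotated this way could be turned into a (text, Gremlin) training pair. It assumes the proposed "the traversal description" step exists, which it does not in the current test suite. The function name `scenario_to_pair` and the output keys `text` and `gremlin` are made up for illustration, and the regex is a rough stand-in for a real Gherkin parser.

```python
import json
import re

# A Gherkin step line followed by a triple-quoted doc string.
DOCSTRING_STEP = re.compile(
    r'^[ \t]*(?:Given|And|When|Then)[ \t]+(?P<step>[^\n]+?)[ \t]*\n'
    r'[ \t]*"""[ \t]*\n(?P<body>.*?)\n[ \t]*"""',
    re.DOTALL | re.MULTILINE,
)

def scenario_to_pair(scenario_text):
    """Return {"text": ..., "gremlin": ...}, or None if either part is missing."""
    parts = {}
    for match in DOCSTRING_STEP.finditer(scenario_text):
        # Collapse a multi-line doc string into a single line.
        body = " ".join(line.strip() for line in match.group("body").splitlines())
        parts[match.group("step")] = body.strip()
    description = parts.get("the traversal description")  # hypothetical step
    gremlin = parts.get("the traversal of")               # existing step
    if not description or not gremlin:
        return None
    return {"text": description, "gremlin": gremlin}

SAMPLE = '''
Scenario: g_V_count
  Given the modern graph
  And the traversal description
    """
    This traversal counts the total number of vertices in the graph.
    """
  And the traversal of
    """
    g.V().count()
    """
  When iterated to list
  Then the result should be ordered
    | result |
    | d[6].l |
'''

if __name__ == "__main__":
    # One JSON object per line is a common shape for fine-tuning data.
    print(json.dumps(scenario_to_pair(SAMPLE)))
```

Scenarios without a description come back as `None`, so they can be skipped or queued for manual annotation.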

Captator•2w ago

Last time I needed to look into fine-tuning (~6mo potentially out of date disclaimer), of order 1k high quality samples was the recommended data volume for PEFT approaches. Which fine-tuning approaches/techniques are you exploring?

niuzjOP•2w ago

Before this, we did not have a large corpus, so I think PEFT is OK. Thank you very much for providing these.

Andrea•2w ago

One should note that some of the Gherkin tests are for error scenarios (i.e. invalid queries), so those should not be included in your corpus. Those should have an expectation that 'the traversal will raise an error'. See https://github.com/apache/tinkerpop/blob/master/gremlin-test/src/main/java/org/apache/tinkerpop/gremlin/features/StepDefinition.java for more information, specifically the methods annotated with @Given

spmallette•2w ago

i'll take a moment to go back to the original question here:

I've seen some work on text-to-Cypher where they parsed the Cypher AST (Abstract Syntax Tree). However, ASTs for Cypher and Gremlin are quite different. Does anyone have suggestions on how to tackle this? Specifically: Are there any existing tools for parsing Gremlin ASTs?, Alternatively, are there any methods to build such a corpus generator without relying on AST parsing?

I'm not sure what text-to-cypher was doing in their work. perhaps you could expand a bit on what they did? For Gremlin we do have a grammar which you can find here: https://github.com/apache/tinkerpop/blob/b86ae2663fbd71067a412361e26efaf0a63b5987/gremlin-language/src/main/antlr4/Gremlin.g4 and a parser that will produce a Java Traversal object from that grammar.

spmallette•2w ago

i'm no expert, but i thought i'd heard it was valuable to include examples of what doesn't work as much as what does. that said, worth calling out that we do have tests for error conditions in there. thanks!

Andrea•2w ago

I'm no expert either but that's a good call, as long as the error scenarios are separated from the others

spmallette•2w ago

another thing i've been curious about in doing this is knowing how much of a description each test needs. that example i gave was for an extremely simple query, but for some of the more complicated ones, i don't know what would be an appropriate description in length or language. if anyone paying attention to this thread has information on that i'd be interested to hear thoughts on the matter.

@niuzj had asked: "may I ask how you generated these test files?" and i said i'd reply here. they aren't machine generated, if that's what you mean. we author those files manually and they are a part of our test kit. providers use these tests to validate that their graph databases work as expected and are compliant with TinkerPop.

niuzjOP•2w ago

I mean whether these test files have some reusable templates or similar content that can be referenced.

Andrea•2w ago

These tests use cucumber testing framework and gherkin syntax to define test 'scenarios' - https://cucumber.io/docs/

Andrea•2w ago

There is a custom step definition to define the Gremlin queries, and it is denoted by the phrase "the traversal of". If you were looking to extract all queries from these feature tests, you could use that phrase to do so. Example:

And the traversal of
      """
      g.V(vid1).out().values("name").inject("daniel").as("a").map(__.length()).path()
      """
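
A rough sketch of that extraction in Python, under a few assumptions: each scenario starts with a `Scenario:` line, the query sits in a doc string right after "the traversal of", and error scenarios contain the phrase "the traversal will raise an error" mentioned earlier in the thread. The function names and the second sample scenario are made up for illustration, and this is not the script attached later in the thread.

```python
import re
from pathlib import Path

SCENARIO_SPLIT = re.compile(r'^[ \t]*Scenario:', re.MULTILINE)
TRAVERSAL = re.compile(
    r'the traversal of[ \t]*\n[ \t]*"""[ \t]*\n(?P<body>.*?)\n[ \t]*"""',
    re.DOTALL,
)
ERROR_MARKER = "the traversal will raise an error"

def extract_traversals(feature_text):
    """Return (valid, failing) lists of Gremlin strings from one feature file."""
    valid, failing = [], []
    # The first chunk is whatever precedes the first Scenario, so skip it.
    for chunk in SCENARIO_SPLIT.split(feature_text)[1:]:
        match = TRAVERSAL.search(chunk)
        if match is None:
            continue
        # Collapse a multi-line query into a single line.
        query = " ".join(line.strip() for line in match.group("body").splitlines())
        (failing if ERROR_MARKER in chunk else valid).append(query)
    return valid, failing

def extract_from_dir(root):
    """Apply extract_traversals to every .feature file under root."""
    valid, failing = [], []
    for path in sorted(Path(root).rglob("*.feature")):
        ok, bad = extract_traversals(path.read_text(encoding="utf-8"))
        valid.extend(ok)
        failing.extend(bad)
    return valid, failing

FEATURE = '''
Feature: illustrative sample

  Scenario: g_V_count
    Given the modern graph
    And the traversal of
      """
      g.V().count()
      """
    When iterated to list

  Scenario: made_up_error_case
    Given the modern graph
    And the traversal of
      """
      g.V().madeUpQuery()
      """
    When iterated to list
    Then the traversal will raise an error
'''

if __name__ == "__main__":
    print(extract_traversals(FEATURE))
```

Keeping the failing queries in their own list makes it easy to drop them from the corpus or to use them separately as negative examples.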

spmallette•2w ago

Details for the specifics of our Gherkin DSL are found here: https://tinkerpop.apache.org/docs/current/dev/developer/#gremlin-language-test-cases

Andrea•2w ago

Out of curiosity I had AI generate a script to extract all queries from .feature files, separating those that raise errors. It seems to do the job? I only quickly glanced at the results.

extract_gremlin_quer...

spmallette•2w ago

note that we have a feature built in Java for that sort of thing. it actually even grabs the Gremlin out of the Documentation itself: https://github.com/apache/tinkerpop/tree/master/gremlin-language/src/main/java/org/apache/tinkerpop/gremlin/language/corpus

Andrea•2w ago

wow so much to discover about what's in tinkerpop project 🙂

niuzjOP•2w ago

Would it be correct to assume that all these test files adhere to the same schema? @spmallette

spmallette•2w ago

i assume you mean the graph schema? if so, then no. that is driven by the test itself. this line mostly tells you which schema is in use:

Given the modern graph

in that case, it's TinkerPop's "modern" dataset and you'd need to consider the schema for that in writing the test. One thing to note is that we also have an "empty graph" option for cases where we need to contrive a graph to make a test fit what we want to validate. we've generally tried to make the schema used in those cases match "modern", but i'm not sure that's always the case. it wouldn't be too hard to make them match.

for an LLM's purposes, i imagine it's advantageous to have multiple schemas within the tests (though we don't have a ton of variation - mostly it's "modern", as that little structure covers so many scenarios, surprisingly). I also think that we could adjust the tests that use "empty" to include a schema description or something.

@niuzj do you have a sense at this point as to how useful the tests will be in your work and what work needs to be done to improve them for fine-tuning an LLM?
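
For anyone who wants to check how the graphs are distributed across the tests, a small sketch like the following could count the "Given the ... graph" lines. The function name `graph_usage` is made up, and the pattern assumes the graph name is a single word, as in "modern" and "empty".

```python
import re
from collections import Counter

# Matches lines such as "Given the modern graph" and captures "modern".
GRAPH_LINE = re.compile(
    r'^[ \t]*Given the (?P<name>\w+) graph[ \t\r]*$', re.MULTILINE
)

def graph_usage(feature_text):
    """Count how often each named graph is used in a feature file."""
    return Counter(m.group("name") for m in GRAPH_LINE.finditer(feature_text))

TEXT = (
    "Scenario: a\n  Given the modern graph\n"
    "Scenario: b\n  Given the empty graph\n"
    "Scenario: c\n  Given the modern graph\n"
)

if __name__ == "__main__":
    print(graph_usage(TEXT).most_common())
```

The same captured name could be stored next to each extracted query, so that a training record carries the schema it was written against.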

niuzjOP•2w ago

More graph schemas are clearly beneficial for model training. Utilizing the test file data you provided as fine-tuning corpus is one of the approaches. This link shows how a student in another community did it, and we'd like to use that as a reference.

niuzjOP•2w ago

https://mp.weixin.qq.com/s/PCV4Qi9w9K-tRf1vMWHpEQ?poc_token=HPpfMGij1UamxY4g2XxzP0NYrwU5rhGopf4925D8

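WeChat Official Accounts Platform

Community Contribution | Awesome-Text2GQL: open-source, fully automated Text2GQL corpus gen...

This article introduces Awesome-Text2GQL, the work of Pang, an open-source contributor to the TuGraph community from the University of Science and Technology of China.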

niuzjOP•2w ago

https://github.com/TuGraph-family/Awesome-Text2GQL

GitHub

GitHub - TuGraph-family/Awesome-Text2GQL: Fine-Tuning Dataset Auto-...

Fine-Tuning Dataset Auto-Generation for Graph Query Languages. - TuGraph-family/Awesome-Text2GQL

Andrea•2w ago

This is a great project and would be very beneficial to the community 👍
