C#•3y ago

Best way to implement a regex based lexer [Answered]

In a regex lexer, you can loop over every pattern and do something like this:

if (match.Success && (match.Index - currentIndex) == 0)
  //Token Found

if (match.Success && (match.Index - currentIndex) == 0)
  //Token Found

The problem with this is that, unlike other Regex implementations, if c# regex sees that the first char does not match the pattern, instead of returning null/false, it keeps going until a match is found, meaning the if statement is needed and extra match calls are made. Is there any way to get this to work without the extra calls? I could use the index of each match and add them into a list in order based off the match's index position, but that seems needlessly complicated if there is a better way

10 Replies

EzlandingOP•3y ago

an example of how I'd want regex matching to work is in JS. see here where if it doesn't match the first char it returns null https://gist.github.com/pepasflo/4afa5813606b6ee73526a0d21d0d1035#file-lexer-js. So for example pattern " " (checks for space character) would return null with the string "hello world" in JS , while in c# it would recognize the space between o and w (index 5) @nekodjin (sorry for the ping) can you help me with this? You seem to be knowledgeable about c# regex

nekodjin•3y ago

put ^ in the beginning of your regex pattern to match the beginning of the string, then it will only match the pattern if the pattern occurs at the very beginning. however this will require you to take substrings of the input string.

EzlandingOP•3y ago

If that's the best solution I'll do it it's not my favorite though :|

nekodjin•3y ago

yeah it's not ideal however that is what the JS solution does that you posted i looked in the API and unfortunately it doesn't seem as if there is such an option that is unfortunate because taking substrings can be expensive oh well..

EzlandingOP•3y ago

¯\_(ツ)_/¯

nekodjin•3y ago

it'd be nice if it had that option but alas the closest thing is you can specify an index to stop looking but that's ever-so-slightly different and doesn't work with arbitrarily long tokens like identifiers

Anton•3y ago

I've heard antlr is good for this you could try that

nekodjin•3y ago

yeah if you care about really optimizing stuff there are other options but for a hobby project you generally don't and for things more complicated than hobby projects, lexers and parsers are practically never the bottleneck in langdev

EzlandingOP•3y ago

ok thanks

Accord•3y ago

✅ This post has been marked as answered!

Gaming

Programming

Best way to implement a regex based lexer [Answered]

Did you find this page helpful?