C
C#2y ago
Ezlanding

Best way to implement a regex based lexer [Answered]

In a regex lexer, you can loop over every pattern and do something like this:
if (match.Success && (match.Index - currentIndex) == 0)
//Token Found
if (match.Success && (match.Index - currentIndex) == 0)
//Token Found
The problem with this is that, unlike other Regex implementations, if c# regex sees that the first char does not match the pattern, instead of returning null/false, it keeps going until a match is found, meaning the if statement is needed and extra match calls are made. Is there any way to get this to work without the extra calls? I could use the index of each match and add them into a list in order based off the match's index position, but that seems needlessly complicated if there is a better way
10 Replies
Ezlanding
Ezlanding2y ago
an example of how I'd want regex matching to work is in JS. see here where if it doesn't match the first char it returns null https://gist.github.com/pepasflo/4afa5813606b6ee73526a0d21d0d1035#file-lexer-js. So for example pattern " " (checks for space character) would return null with the string "hello world" in JS , while in c# it would recognize the space between o and w (index 5) @nekodjin (sorry for the ping) can you help me with this? You seem to be knowledgeable about c# regex
nekodjin
nekodjin2y ago
put ^ in the beginning of your regex pattern to match the beginning of the string, then it will only match the pattern if the pattern occurs at the very beginning. however this will require you to take substrings of the input string.
Ezlanding
Ezlanding2y ago
If that's the best solution I'll do it it's not my favorite though :|
nekodjin
nekodjin2y ago
yeah it's not ideal however that is what the JS solution does that you posted i looked in the API and unfortunately it doesn't seem as if there is such an option that is unfortunate because taking substrings can be expensive oh well..
Ezlanding
Ezlanding2y ago
¯\_(ツ)_/¯
nekodjin
nekodjin2y ago
it'd be nice if it had that option but alas the closest thing is you can specify an index to stop looking but that's ever-so-slightly different and doesn't work with arbitrarily long tokens like identifiers
Anton
Anton2y ago
I've heard antlr is good for this you could try that
nekodjin
nekodjin2y ago
yeah if you care about really optimizing stuff there are other options but for a hobby project you generally don't and for things more complicated than hobby projects, lexers and parsers are practically never the bottleneck in langdev
Ezlanding
Ezlanding2y ago
ok thanks
Accord
Accord2y ago
✅ This post has been marked as answered!