C
C#9mo ago
Asher

❔ Format parsing problem with character collision

Hi everyone, assume I have the following bencode listl4:spam5:helloe (where the last e signifies the end of the list) Apart from reading each element until there is an e which would create code duplication for me, is there any way to easily identify the difference between an e inside of a string and an e as the end of the list object? (or other objects such as integers for that matter)
27 Replies
Asher
Asher9mo ago
int indexOfNextE = str.IndexOf('e', i);
// TODO: add 0 check
BencodeInteger newInteger = DecodeInteger(str.Substring(i + 1, indexOfNextE - 1));
list.Add(newInteger);
int indexOfNextE = str.IndexOf('e', i);
// TODO: add 0 check
BencodeInteger newInteger = DecodeInteger(str.Substring(i + 1, indexOfNextE - 1));
list.Add(newInteger);
this is an example of what I mean important to note, here I can be sure that the next e will be the end of the object https://wiki.theory.org/BitTorrentSpecification#Bencoding <- the format I'm working with
Pobiega
Pobiega9mo ago
Not really. bencoding relies on the length of each item, as I'm sure you know the only reason we know that e ends the list is that hello was length-prefixed Does your parser not handle this for you?
Asher
Asher9mo ago
I'm building one from scratch
Pobiega
Pobiega9mo ago
Thats fair, but again, why are you not handling this? 😄 Not sure how this would result in code duplication. A simple parser combinator would handle bencoding fairly well I imagine
Asher
Asher9mo ago
well, I have 4 separate functions for parsing each type of object str int dict & list
Pobiega
Pobiega9mo ago
sure, yep
Asher
Asher9mo ago
string and int were easy enough obviously but a list can contain a list in which case, I'd prob need to call the function recursively
Pobiega
Pobiega9mo ago
Makes sense, yep
Asher
Asher9mo ago
the way I built my functions, they get a substring that they parse in their own way and return (my version) some BencodeObject derived class for example, in the list parse function, this is how I parse a string
for (int i = 0; i < str.Length; i++)
{
char c = str[i];

// BencodeString
if(char.IsNumber(c))
{
int j = i;
while(str[j] != ':')
{
j++;
}

int numCharacters = int.Parse(str.Substring(i, j - 1));

BencodeString newString = DecodeString(str.Substring(i, j + numCharacters));
list.Add(newString);
i = j + numCharacters;
}
... // rest of the objects
for (int i = 0; i < str.Length; i++)
{
char c = str[i];

// BencodeString
if(char.IsNumber(c))
{
int j = i;
while(str[j] != ':')
{
j++;
}

int numCharacters = int.Parse(str.Substring(i, j - 1));

BencodeString newString = DecodeString(str.Substring(i, j + numCharacters));
list.Add(newString);
i = j + numCharacters;
}
... // rest of the objects
list is the list that will eventually be returned from this function if that makes sense so assume that i=0 is right where the list starts and i = str.length - 1 is an e that ends the list
Pobiega
Pobiega9mo ago
A fairly common approach to parsers is to return two things first the result of the parse (the list, string, int etc) and secondly.. the rest of the string ie, where our parser stopped
Asher
Asher9mo ago
I assume this is where I do i = j + numCharacters but in way of a return statement
Pobiega
Pobiega9mo ago
sort of, I guess Let me just say that I'm in no way an expert on parsers, and I've only ever written parser combinators before but its a very neat way of doing it
Asher
Asher9mo ago
I will mention I don't know what parser combinators are but you did give me a different idea
Pobiega
Pobiega9mo ago
well, you can google it and I suggest you do, there are plenty of articles and videos on the topic. Its really interesting too, imho
Asher
Asher9mo ago
I was thinking of possibly having a recursive function for every element and it combines it into the root
Pobiega
Pobiega9mo ago
but the idea is that we can write a parser that parses a single character and if we combine several of those, we can parse a word etc so you build progressively larger parsers, by combining smaller ones it eventually leads to some very pretty code, where you declare a list as "zero or more valid bencode tokens" and a "bencode token" is declared as "a bencoded list, dictionary, byte, string or int" etc
Asher
Asher9mo ago
that sounds like what I need I'm gonna look at some articles might try my idea aswell but this seems like it's perfectly suited for me
Pobiega
Pobiega9mo ago
there is a library for C# called Pidgin that helps you write parsers but its also entirely doable to make something from scratch, if you want to
Asher
Asher9mo ago
I'll see that aswell
Pobiega
Pobiega9mo ago
Got it, using Pidgin 🙂
dictionary: d4:spam4:eggs2:hii16ee
spam: "eggs"
hi: 16
dictionary: d4:spam4:eggs2:hii16ee
spam: "eggs"
hi: 16
Asher
Asher9mo ago
mind sharing your code? I won't copy as I want to do this myself, haven't had the time to look at Pidgin yet either just as a general idea
Pobiega
Pobiega9mo ago
public abstract record BToken
{
}

public record BString(string Value) : BToken
{
public override string ToString() => $"\"{Value}\"";
}

public record BInteger(int Value) : BToken
{
public override string ToString() => Value.ToString();
}

public record BList(ImmutableArray<BToken> Elements) : BToken;

public record BDictionary(ImmutableDictionary<string, BToken> Elements) : BToken;
public abstract record BToken
{
}

public record BString(string Value) : BToken
{
public override string ToString() => $"\"{Value}\"";
}

public record BInteger(int Value) : BToken
{
public override string ToString() => Value.ToString();
}

public record BList(ImmutableArray<BToken> Elements) : BToken;

public record BDictionary(ImmutableDictionary<string, BToken> Elements) : BToken;
my models that Im parsing into
private static readonly Parser<char, char> _i = Char('i');
private static readonly Parser<char, char> _l = Char('l');
private static readonly Parser<char, char> _d = Char('d');
private static readonly Parser<char, char> _e = Char('e');
private static readonly Parser<char, char> _colon = Char(':');

private static Parser<char, string> StringWithLength(int length) => Any.RepeatString(length);

private static readonly Parser<char, int> _int = UnsignedInt(10);
private static readonly Parser<char, int> BencodeLength = _int.Before(_colon);

public static readonly Parser<char, BToken> BencodeString =
BencodeLength.Bind(StringWithLength).Select<BToken>(x => new BString(x));

public static readonly Parser<char, BToken> BencodeInt =
_i.Then(_int).Before(_e).Select<BToken>(x => new BInteger(x));

public static readonly Parser<char, BToken> Bencoded =
BencodeString.Or(BencodeInt).Or(Rec(() => BencodeList)).Or(Rec(() => BencodeDictionary));

public static readonly Parser<char, BToken> BencodeList =
Bencoded.Many().Between(_l, _e).Select<BToken>(x => new BList(x.ToImmutableArray()));

private static readonly Parser<char, KeyValuePair<string, BToken>> KeyValuePair =
BencodeString
.Then(Bencoded, (name, val) => new KeyValuePair<string, BToken>((name as BString)!.Value, val));

public static readonly Parser<char, BToken> BencodeDictionary =
KeyValuePair.Many().Between(_d, _e).Select<BToken>(x => new BDictionary(x.ToImmutableDictionary()));
private static readonly Parser<char, char> _i = Char('i');
private static readonly Parser<char, char> _l = Char('l');
private static readonly Parser<char, char> _d = Char('d');
private static readonly Parser<char, char> _e = Char('e');
private static readonly Parser<char, char> _colon = Char(':');

private static Parser<char, string> StringWithLength(int length) => Any.RepeatString(length);

private static readonly Parser<char, int> _int = UnsignedInt(10);
private static readonly Parser<char, int> BencodeLength = _int.Before(_colon);

public static readonly Parser<char, BToken> BencodeString =
BencodeLength.Bind(StringWithLength).Select<BToken>(x => new BString(x));

public static readonly Parser<char, BToken> BencodeInt =
_i.Then(_int).Before(_e).Select<BToken>(x => new BInteger(x));

public static readonly Parser<char, BToken> Bencoded =
BencodeString.Or(BencodeInt).Or(Rec(() => BencodeList)).Or(Rec(() => BencodeDictionary));

public static readonly Parser<char, BToken> BencodeList =
Bencoded.Many().Between(_l, _e).Select<BToken>(x => new BList(x.ToImmutableArray()));

private static readonly Parser<char, KeyValuePair<string, BToken>> KeyValuePair =
BencodeString
.Then(Bencoded, (name, val) => new KeyValuePair<string, BToken>((name as BString)!.Value, val));

public static readonly Parser<char, BToken> BencodeDictionary =
KeyValuePair.Many().Between(_d, _e).Select<BToken>(x => new BDictionary(x.ToImmutableDictionary()));
my parsers fixed and here is some test code
var test = "d4:spam4:eggs2:hii16ee";
Console.WriteLine($"dictionary: {test}");

var result = BencodeDictionary.ParseOrThrow(test) as BDictionary;

foreach (var token in result.Elements)
{
Console.WriteLine($"{token.Key}: {token.Value}");
}
var test = "d4:spam4:eggs2:hii16ee";
Console.WriteLine($"dictionary: {test}");

var result = BencodeDictionary.ParseOrThrow(test) as BDictionary;

foreach (var token in result.Elements)
{
Console.WriteLine($"{token.Key}: {token.Value}");
}
the Between combinator was awesome
WEIRD FLEX
WEIRD FLEX9mo ago
do you really have to use e? you are using utf8, right?
Pobiega
Pobiega9mo ago
the e is set by the encoding, so yes you can't just come up with your own characters and still claim to be fully bencode compliant its like replacing the } in json :p
WEIRD FLEX
WEIRD FLEX9mo ago
aaaaa ok i didn't know it was a standard also because if it was a standard, i thought, what is the problem, it's all already done because there would be already an escaping method or something
Pobiega
Pobiega9mo ago
As Asher said, he is implementing a parser from scratch. There are ofc several existing parsers.
Accord
Accord9mo ago
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.