The Bencoding Specification is Broken

Recently, I decided to try to implement a Bittorrent client from scratch, for fun and to learn more about network programming. For those who are unaware, Bittorrent is a network protocol that lets computers send each other files across a network, without relying on a specific intermediate computer (i.e it’s a peer-to-peer aka p2p protocol).

Bittorrent includes a custom data serialization format, called bencoding, which is used in Bittorrent any time it needs to send a data structure over the network. There are reasons for this—many of the serialization formats people are used to, like JSON, are not really suitable to be sent over the network directly because they are not length-delimited (Also, Bittorrent is kind of old now, and many serialization formats people are used to today did not exist when it came out which thinking about it, is probably the primary reason. This length-delimiting stuff is probably me projecting my hopes and dreams on Bram Cohen which is unlikely to be healthy).

Having your data serialization format be length-delimited is an important property for network protocols. This is because with a length delimited structure, you can reject wrongly sized inputs as early as possible, which makes the protocol more efficient, and less likely to make errors in parsing (akin to the typical implicit C-style buffer overflow in functions like gets(3). This is important as any service on a public network should be robust to adversarial inputs.

Let’s look at the the spec now. On the topic of serializing byte-vectors (not strings!) it says: “Strings are length-prefixed base ten followed by a colon and the string. For example 4:spam corresponds to ‘spam’.”.

There are at least two problems with this: First of all, these are not strings (what is a string anyway? a sequence of valid unicode codepoints? (in what encoding?))—they are byte-vectors, and it is allowed to put arbitrary binary data in them. There’s another spec which says this explicitly, and it makes me sad. Second: The byte-vector itself is length-prefixed, which is good. But how long is the length? It is unclear. Reading the spec at face value, you could have an arbitrarily large size, so as long as I never emit a colon and keep sending you numbers, you will be stuck parsing just the length (Also, ASCII encoding of numbers is inefficient and a bit silly, but that’s a nitpick in comparison). This is awful! And this is the meat of my complaint.

All the other types bencode specifies have similar issues: Integers are apparently explicitly arbitrarily sized: “Integers have no size limitation.”, and are not even length prefixed but sentinel delimited. Lists and Dictionaries are also not length-prefixed and are sentinel delimited.

Bencoding makes me sad. I don’t think anything new uses it these days, but if you are writing something that requires a data serialization format and are thinking of reaching for bencoding, don’t do it. If you want a recommendation, I think messagepack gets a lot of things correct. Also, I’ve been looking into richer, object-capability-based systems for moving data structures across the network lately, such as Cap’n Proto, and I think the general idea there is correct.

Finally, I should mention why this is not actually an issue in practice: Bencoded files are not very big. .torrent files and pieces of a torrent are the chief use of it, and they end up using the string data type for the vast majority of the space they take up. It turns out your serialization format doesn’t have to be robust to things you never do!