C/C++: 70x faster file embeds using string literals

Date: 2020-08-03
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00013-fast-cpp-embeds.md
Tool: https://github.com/mortie/strliteral

It's really common to want to embed some static data into a binary. Game developers want to embed their shaders. Developers of graphical apps may want to embed sounds or icons. Developers of programming language interpreters may want to embed their language's standard library. I have many times built software whose GUI is in the form of a web app served from a built-in HTTP server, where I want to embed the HTML/JS/CSS into the binary.

Since neither C nor C++ currently has a built-in way to embed files, we use work-arounds. These usually fall into one of two categories: Either we use toolchain-specific features to generate object files with the data exposed as symbols, or we generate C code which we subsequently compile to object files. Since the toolchain-specific features are, well, toolchain-specific, people writing cross-platform software generally prefer code generation.

The most common tool I'm aware of to generate C code for embedding data is xxd, whose -i option will generate C code with an unsigned char array literal.

Given the following input text:

<html>
	<head>
		<title>Hello World</title>
	</head>
	<body>
		Hello World
	</body>
</html>
index.html

The command xxd -i index.html will produce this C code:

unsigned char index_html[] = {
  0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x09, 0x3c, 0x68, 0x65, 0x61,
  0x64, 0x3e, 0x0a, 0x09, 0x09, 0x3c, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e,
  0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x3c,
  0x2f, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e, 0x0a, 0x09, 0x3c, 0x2f, 0x68,
  0x65, 0x61, 0x64, 0x3e, 0x0a, 0x09, 0x3c, 0x62, 0x6f, 0x64, 0x79, 0x3e,
  0x0a, 0x09, 0x09, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72,
  0x6c, 0x64, 0x0a, 0x09, 0x3c, 0x2f, 0x62, 0x6f, 0x64, 0x79, 0x3e, 0x0a,
  0x3c, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a
};
unsigned int index_html_len = 92;

This works fairly well. Any C or C++ compiler can compile that code and produce an object file with our static data, which we can link against to embed that data into our binary. All in a cross-platform and cross-toolchain way.

There's just one problem: It's slow. Really slow. On my laptop, embedding a megabyte this way takes 2 seconds using g++. Embedding one decent quality MP3 at 8.4MB takes 23 seconds, using 2.5 gigabytes of RAM.

bippety-boppety.mp3, an 8.4MB song

Whether or not we should embed files of that size into our binaries is a question I won't cover in this article, and the answer depends a lot on context. Regardless, processing data at just over 400kB per second is objectively terrible. We can do so much better.

The main reason it's so slow is that parsing arbitrary C++ expressions is actually really complicated. Every single byte is a separate expression, parsed using a complex general expression parser, presumably separately allocated as its own node in the syntax tree. If only we could generate code which combines lots of bytes of data into one token...

I wrote a small tool, called strliteral, which outputs data as a string literal rather than a character array. The command strliteral index.html will produce this C code:

const unsigned char index_html[] =
	"<html>\n\t<head>\n\t\t<title>Hello World</title>\n\t</head>\n\t<body>\n\t\tHello"
	" World\n\t</body>\n</html>\n";
const unsigned int index_html_len = 92;

It should come as no surprise that this is many times faster to parse than the character array approach. Instead of invoking a full expression parser for each and every byte, most of the time will just be spent in a tight loop which reads bytes and appends them to an array. The grammar for a string literal is ridiculously simple compared to the grammar for an array literal.

Compared to xxd's 23 seconds and 2.5GB of RAM usage for my 8.4MB file, my strliteral tool produces code which g++ can compile in 0.6 seconds, using only 138 megs of RAM. That's almost a 40x speed-up, and an 18x reduction in RAM usage. It's processing data at a rate of 15MB per second, compared to xxd's 0.4MB per second. As a bonus, my tool generates 26MB of C code, compared to xxd's 52MB.

Here's how that song looks, encoded with strliteral:

const unsigned char bippety_boppety_mp3[] =
	"\377\373\340D\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\242\240,\2253]5\234\316\020\234\375\246\072D\307\027\203R\030\307\221\314`\243B\370\013\301\220\256"
	"\235\036\243E\246\331\216\026\004\341\362uU&\255\030@,\227\021q]1\231L\304\010E\311\231\005W\231\210"
	"j-\"\374|\210II0\221\026\045\021}qC\206\t9<\320\013\246w\350\263EmH`#\262\037\252\304\272\340\355`7\217"
	"\343*\016\236\320\345oa\217\204\361~k\224\255|\301cy\371\375\034\366K'\236\037\271\204\371\275\rV\267"
	"\252\020\245\322~\233\350\222\343\347\204\332\340~\236-\355S.W\045\365\301=\\+\236\270F\312\246g\266"
	"CX2\376\265V\242T0\337I\031\343\347\320\336\322\020\016\020H\250\007]\031\201\235\025\300h\2628d\000"
	/* 249707 lines snipped */
	"\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252"
	"\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252"
	"\252TAG\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000Created with LMMS\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\377";
unsigned int bippety_boppety_mp3_len = 8779359;

The difference is even bigger when processing mostly-ASCII text rather than binary data. Since xxd produces the same 6 bytes of source code for every byte of input (0x, two hex digits, comma, space), the data itself doesn't matter. However, strliteral produces 4 bytes of source code (\, then three octal digits) for every "weird" character, but just one byte of source code for every "regular" ASCII character.

Graphs

I wrote some benchmarking code to compare various aspects of xxd and strliteral. All times are measured using an Intel Core i7-8705G CPU in a Dell XPS 15 9575. g++ and xxd are from the Ubuntu 20.04 repositories. strliteral is compiled with gcc -O3 -o strliteral strliteral.c using GCC 9.3.0. The benchmarking source code can be found here: https://github.com/mortie/strliteral/tree/master/benchmark

Here's a graph which shows exactly how the two tools compare, across a range of input sizes, given either text or random binary data:

The 70x number in the title comes from this graph. The 60ms spent compiling strliteral-generated code is 72x faster than the 4324ms spent compiling xxd-generated code. Comparing random binary data instead of text would show a lower - though still respectable - speed-up of 25x.

Though most of the time spent when embedding data with xxd comes from the compiler, the xxd tool itself is actually fairly slow too:

Those ~200 milliseconds xxd takes to generate code for a 2MB file isn't very significant compared to the 4.3 second compile time, but if strliteral was equally slow, 75% of the time would've been spent generating code as opposed to compiling code. Luckily, strliteral runs through 2MB of text in 11ms.

Looking at the xxd source code, the reason it's so slow seems to be that it prints every single byte using a call to fprintf:

while ((length < 0 || p < length) && (c = getc(fp)) != EOF)
  {
    if (fprintf(fpo, (hexx == hexxa) ? "%s0x%02x" : "%s0X%02X",
                (p % cols) ? ", " : &",\n  "[2*!p],  c) < 0)
      die(3);
    p++;
  }

Finally, here's a graph over g++'s memory usage:

Caveats

Update: In the reddit discussion, someone pointed out that MSVC, Microsoft's compiler, has a fairly low maximum string length limit (the exact limit is fairly complicated). I had assumed that any modern compiler would just keep strings in a variable sized array. Maybe strliteral will eventually grow an MSVC-specific workaround, but until then, using a better compiler like Clang or GCC on Windows is an option.

Using string literals for arbitrary binary data is a bit more complicated than using an array with integer literals. Both xxd and strliteral might have trouble in certain edge cases, such as when cross-compiling if the host and target disagrees on the number of bits in a byte. Using string literals adds an extra complication due to the distinction between the "source character set" and the "execution character set". The C11 spec (5.2.1p2) states:

In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters.

If you run strliteral on a file which contains the byte 97, it will output the code const unsigned char data[] = "a";. If that C code is compiled with a "source character set" of ASCII and an "execution character set" of EBCDIC, my understanding of the standard text is that the ASCII "a" (byte 97) will be translated to the EBCDIC "a" (byte 129). Whether that's even a bug or not depends on whether the intention is to embed binary data or textual data, but it's probably not what people expect from a tool to embed files.

This should only ever become an issue if you're compiling with different source and execution charsets, where the source charset and execution charset aren't based on ASCII. Compiling with a UTF-8 source charset and an EBCDIC execution charset will cause issues, but since all non-ASCII characters are printed as octal escape sequences, compiling with e.g a UTF-8 source charset and a LATIN-1 execution charset isn't an issue.

It seems extremely unlikely to me that someone will compile with a source charset and an execution charset which are both different and not based on ASCII, but I suppose it's something to keep in mind. If it does become an issue, the --always-escape option will cause strliteral to only generate octal escape sequences. That should work the same as xxd -i in all cases, just faster.

Some implementation notes

C is a weird language. For some reason, probably to better support systems where bytes are bigger than 8 bits, hex string escapes like "\x6c" can be an arbitrary number of characters. "\xfffff" represents a string with one character whose numeric value is 1048575. That obviously won't work on machines with 8-bit bytes, but it could conceivably be useful on a machine with 24-bit bytes, so it's allowed. Luckily, octal escapes are at most 3 numbers, so while "\xf8ash" won't work, "\370ash" will.

C also has a concept of trigraphs and digraphs, and they're expanded even within string literals. The string literals "??(" and "[" are identical (at least in C, and in C++ before C++17). Currently, strliteral just treats ?, : and % as "special" characters which are escaped, which means no digraphs or trigraphs will ever appear in the generated source code. I decided it's not worth the effort to add more "clever" logic which e.g escapes a ( if the two preceeding characters are question marks.

Read More