The tar archive format, its extensions, and why GNU tar extracts in quadratic time

Date: 2022-07-23
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00014-tar.md

(If you're here from Google and just need help with tar being slow: as long as you trust the tar archive, extract with -P to make tar fast.)

A couple of days ago, I had a 518GiB tar.gz file (1.1 TiB uncompressed) that I had to extract. At first, GNU tar was doing a great job, chewing through the tar.gz at around 100MiB/s. But after a while, it slowed significantly; down to less than a kilobyte per second. pv's time estimate went from a bit over an hour, to multiple hours, to over a day, to almost a week. After giving it some time, and after failing to find anything helpful through Google, I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar. And I was right; before the day was over, I had a working tar extractor, and I had successfully extracted my 1.1TiB tarball.

I will explain why GNU tar is so slow later in this post, but first, let's take a look at:

The original tar file format

Tar is pretty unusual for an archive file format. There's no archive header, no index of files to facilitate seeking, no magic bytes to help file and its ilk detect whether a file is a tar archive, no footer, no archive-wide metadata. The only kind of thing in a tar file is a file object.

So, how do these file objects look? Well, they start with a 512-byte file object header which looks like this:

struct file_header {
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	char padding[255];
};

Followed by ceil(file_size / 512) 512-byte blocks of payload (i.e file contents).

We have most of the attributes we would expect a file object to have: the file path, the mode, the modification time (mtime), the user/group ID, the file size, and the file type. To support symlinks and hard links, there's also a link path.

The original tar file format defines these possible values for the file_type field:

  • '0' (or sometimes '\0', the NUL character): Normal file
  • '1': Hard link
  • '2': Symbolic link

Future extensions to tar implement additional file types, among them '5', which represents a directory. Some old tar implementations apparently used a trailing slash '/' in a '0'-type file object to represent directories, at least according to Wikipedia.

You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). Tar is the only file format I know of which uses base 8 to encode numbers. I don't quite understand why, since octal is neither space-efficient nor human-friendly. When representing numbers in this post, I will write them in decimal (base 10).

To encode a tar archive with one file called "hello.txt" and the content "Hello World", we need two 512-byte blocks:

  1. Bytes 0-511: Header, type='0', file_path="./hello.txt", file_size=11
  2. Bytes 512-1023: "Hello World", followed by 501 zero bytes

In addition, a tar file is supposed to end with 1024 zero-bytes to represent an end-of-file marker.
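To make that concrete, here's a minimal sketch of a program which writes exactly this archive to stdout. It assumes the struct file_header definition from above, plus one detail the description so far skips: the header_checksum field holds the octal sum of all 512 header bytes, computed as if the checksum field itself were filled with spaces.

// tarwrite.c: a sketch which writes the "hello.txt" example archive

#include <stdio.h>
#include <string.h>

int main(void) {
	unsigned char block[512] = {0};
	struct file_header *header = (struct file_header *)block;

	strcpy(header->file_path, "./hello.txt");
	strcpy(header->file_mode, "0000644");
	strcpy(header->owner_user_id, "0000000");
	strcpy(header->owner_group_id, "0000000");
	sprintf(header->file_size, "%011o", 11u);  // file size 11, as an octal string
	sprintf(header->file_mtime, "%011o", 0u);
	header->file_type = '0';

	// The checksum is computed with the checksum field itself set to spaces.
	memset(header->header_checksum, ' ', 8);
	unsigned sum = 0;
	for (int i = 0; i < 512; ++i)
		sum += block[i];
	sprintf(header->header_checksum, "%06o", sum); // 6 digits, NUL, then a space

	fwrite(block, 1, 512, stdout); // bytes 0-511: the header

	memset(block, 0, sizeof(block));
	memcpy(block, "Hello World", 11);
	fwrite(block, 1, 512, stdout); // bytes 512-1023: the zero-padded payload

	// The end-of-file marker: 1024 zero bytes.
	memset(block, 0, sizeof(block));
	fwrite(block, 1, 512, stdout);
	fwrite(block, 1, 512, stdout);
}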

The two big limitations of the original tar format are that paths can't be longer than 100 characters, and files can't be larger than 8GiB (8^11 bytes). Otherwise though, I quite like the simplicity of the format. We'll discuss how various extensions address the limitations later, but first, let's try to implement an extractor:

(Feel free to skip this source code, but you should at least skim the comments)

// tarex.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>

struct file_header {
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	char padding[255];
};

// We don't bother with great error reporting, just abort on error
#define check(x) if (!(x)) abort()

// Utilities to abort on short read/write
#define xfread(ptr, size, f) check(fread(ptr, 1, size, f) == size)
#define xfwrite(ptr, size, f) check(fwrite(ptr, 1, size, f) == size)

// Tar represents all its numbers as octal
size_t parse_octal(char *str, size_t maxlen) {
	size_t num = 0;
	for (size_t i = 0; i < maxlen && str[i] >= '0' && str[i] <= '7'; ++i) {
		num *= 8;
		num += str[i] - '0';
	}

	return num;
}

// Extract one file from the archive.
// Returns 1 if it extracted something, or 0 if it reached the end.
int extract(FILE *f) {
	unsigned char header_block[512];
	xfread(header_block, sizeof(header_block), f);
	struct file_header *header = (struct file_header *)header_block;

	// The end of the archive is represented with blocks of all-zero content.
	// For simplicity, assume that if the file path is empty, the block is all zero
	// and we reached the end.
	if (header->file_path[0] == '\0') {
		return 0;
	}

	// The file path and link path fields aren't always 0-terminated, so we need to copy them
	// into our own buffers, otherwise we break on files with exactly 100 character paths.
	char file_path[101] = {0};
	memcpy(file_path, header->file_path, 100);
	char link_path[101] = {0};
	memcpy(link_path, header->link_path, 100);

	// We need these for later
	size_t file_size = parse_octal(header->file_size, sizeof(header->file_size));
	FILE *out_file = NULL;

	if (header->file_type == '0' || header->file_type == '\0') {
		// A type of '0' means that this is a plain file.
		// Some early implementations also use a NUL character ('\0') instead of an ASCII zero.

		fprintf(stderr, "Regular file: %s\n", file_path);
		out_file = fopen(file_path, "w");
		check(out_file != NULL);

	} else if (header->file_type == '1') {
		// A type of '1' means that this is a hard link.
		// That means we create a hard link at 'file_path' which links to the file at 'link_path'.

		fprintf(stderr, "Hard link: %s -> %s\n", file_path, link_path);
		check(link(link_path, file_path) >= 0);

	} else if (header->file_type == '2') {
		// A type of '2' means that this is a symbolic link.
		// That means we create a symlink at 'file_path' which links to the file at 'link_path'.

		fprintf(stderr, "Symbolic link: %s -> %s\n", file_path, link_path);
		check(symlink(link_path, file_path) >= 0);

	} else if (header->file_type == '5') {
		// A type of '5' means that this is a directory.

		fprintf(stderr, "Directory: %s\n", file_path);
		check(mkdir(file_path, 0777) >= 0);

		// Directories sometimes use the size field, but they don't contain data blocks.
		// Zero out file_size to avoid skipping entries.
		file_size = 0;

	} else {
		// There are other possible fields added by various tar implementations and standards,
		// but we'll ignore those for this implementation.
		fprintf(stderr, "Unsupported file type %c: %s\n", header->file_type, file_path);
	}

	// We have read the header block, now we need to read the payload.
	// If we're reading a file (i.e if 'out_file' is non-NULL) we will also write the body,
	// but otherwise we'll just skip it.
	char block[512];
	while (file_size > 0) {
		xfread(block, sizeof(block), f);
		size_t n = file_size > 512 ? 512 : file_size;

		file_size -= n;
		if (out_file != NULL) {
			xfwrite(block, n, out_file);
		}
	}

	if (out_file != NULL) {
		check(fclose(out_file) >= 0);
	}

	// Indicate that we have successfully extracted a file object, and are ready to read the next
	return 1;
}

int main() {
	while (extract(stdin));
}

Let's see it in action:

~/tarex $ ls
tarex.c testdir
~/tarex $ gcc -o tarex tarex.c
~/tarex $ tree
.
├── tarex.c
├── tarex
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt

~/tarex $ tar c testdir >testdir.tar
~/tarex $ mkdir extract && cd extract

~/tarex/extract $ ../tarex <../testdir.tar
Directory: testdir/
Symbolic link: testdir/hello-symlink -> hello.txt
Directory: testdir/subdir/
Regular file: testdir/hello.txt
Regular file: testdir/subdir/file.txt

~/tarex/extract $ tree
.
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt

The UStar file format

The first major extension to the tar file format we will look at is the UStar format, which increases the file length limit to 256 characters and adds some new file types. The header is expanded to this:

struct file_header {
	// Original tar header fields
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	// New UStar fields
	char magic_bytes[6];
	char version[2];
	char owner_user_name[32];
	char owner_group_name[32];
	char device_major_number[8];
	char device_minor_number[8];
	char prefix[155];

	char padding[12];
};

We now have some magic bytes (defined to be "ustar\0" for the UStar format), as well as the owner user/group names. But most importantly, we have a prefix field, which allows up to 256 character file paths. With UStar, instead of just extracting the bytes from file_path and link_path like before, we must construct a file path like this:

void read_path(char dest[257], char path[100], char prefix[155]) {
	// If there's no prefix, use the path field directly
	if (prefix[0] == '\0') {
		memcpy(dest, path, 100);
		dest[100] = '\0';
		return;
	}

	// If there is a prefix, the path is: <prefix> '/' <path>
	size_t prefix_len = strnlen(prefix, 155);
	memcpy(dest, prefix, prefix_len);
	dest[prefix_len] = '/';
	memcpy(&dest[prefix_len + 1], path, 100);
	dest[256] = '\0';
}

int extract(FILE *f) {
	/* ... */

	char file_path[257];
	read_path(file_path, header->file_path, header->prefix);
	char link_path[257];
	read_path(link_path, header->link_path, header->prefix);

	/* ... */
}

The original tar format had the file types '0' (or '\0'), '1' and '2', for regular files, hard links and symlinks. UStar defines these additional file types:

  • '3' and '4': Character devices and block devices. These are the reason for the new device_major_number and device_minor_number fields.
  • '5': Directories.
  • '6': FIFO files.
  • '7': Contiguous files. This type isn't really used much these days, and most implementations just treat it as a regular file.

This is definitely an improvement, but we can still only encode up to 256 character long paths. And that 8GiB file size limit still exists. Which leads us to:

The pax file format

The POSIX.1-2001 standard introduced the pax command line tool, and with it, a new set of extensions to the tar file format. This format is identical to UStar, except that it adds two new file object types: 'x' and 'g'. Both of these types let us define "extended header records", as the spec calls it. Records set with 'x' apply to only the next file, while records set with 'g' apply to all following files.

With this new extended header, we can encode the access and modification times with more precision, user/group IDs above 8^7, file sizes over 8^11, file paths of arbitrary length, and a whole lot more. The records are in the payload of the extended header file object, and use a simple length-prefixed key/value syntax. To represent our "hello.txt" example file with an access time attribute, we need these four 512-byte blocks:

  1. Header, type='x', file_size=30
  2. "30 atime=1658409251.551879906\n", followed by 482 zeroes
  3. Header, type='0', file_path="hello.txt", file_size=11
  4. "Hello World", followed by 501 zero bytes

Interestingly, these extended header records all seem to use decimal (base 10). On the one hand, using base 10 makes sense, but on the other hand, wouldn't it be nice to stick to one way of representing numbers?
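Parsing these records is simple enough, as long as you're careful with the length field. Here's a minimal sketch of a record parser, assuming the extended header's payload has already been read into memory (the function name and the modify-in-place approach are my own choices, not something the spec mandates):

#include <stdlib.h>
#include <string.h>

// Parse one record of the form "<len> <key>=<value>\n", where <len> is the
// decimal length of the entire record, including the length field itself
// and the trailing newline. Returns a pointer to the start of the next
// record, or NULL on a malformed record. 'key' and 'value' end up pointing
// into 'buf', which is modified in place.
char *parse_pax_record(char *buf, char *end, char **key, char **value) {
	char *space;
	size_t len = strtoul(buf, &space, 10);
	if (*space != ' ' || len == 0 || len > (size_t)(end - buf) || buf[len - 1] != '\n')
		return NULL;

	char *eq = memchr(space + 1, '=', len - (space + 1 - buf));
	if (eq == NULL)
		return NULL;

	*eq = '\0';          // terminate the key
	buf[len - 1] = '\0'; // overwrite the trailing newline
	*key = space + 1;
	*value = eq + 1;
	return buf + len;
}

An extractor would call this in a loop over the payload, then apply records like path, linkpath and size to the next file object (or, for 'g' records, to all following file objects).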

Anyways, we can see that the file format has become quite complex now. Just the file path can be provided in any of four different ways:

  • The full path might be in the file_path field.
  • The path might be a combination of the prefix and the file_path fields.
  • The previous file object might've been an 'x' type record which set a path property.
  • There might've been some 'g' type file object earlier in the archive which set a path property.

The GNU tar file format

GNU tar has its own file format, called gnu, which is different from the pax format. Like pax, the gnu format is based on UStar, but it has a different way of encoding arbitrary length paths and large file sizes:

  • It introduces the 'L' type, where the payload of the file object represents the file_path of the next file object.
  • It introduces the 'K' type, where the payload of the file object represents the link_path of the next file object.
  • A link with both a long file_path and a long link_path is preceded by both an 'L' type file object and a 'K' type file object. The order isn't specified from what I can tell.
  • If a file is over 8GiB, it will set the high bit of the first character in file_size, and the rest of the string is parsed as base 256 (i.e it's treated as a 95-bit integer, big endian). A sketch of parsing this follows below.
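Here's a minimal sketch of a size parser which handles both encodings, reusing the parse_octal function from the extractor above:

#include <stddef.h>
#include <stdint.h>

size_t parse_octal(char *str, size_t maxlen); // from the extractor above

size_t parse_size(unsigned char *field, size_t len) {
	// Plain octal string, unless the base-256 marker bit is set.
	if ((field[0] & 0x80) == 0)
		return parse_octal((char *)field, len);

	// GNU base 256: mask off the marker bit, then treat the remaining
	// bytes as one big-endian binary number.
	uint64_t num = field[0] & 0x7f;
	for (size_t i = 1; i < len; ++i)
		num = (num << 8) | field[i];
	return (size_t)num;
}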

In some ways, I prefer this approach over the pax approach, since it's much simpler; the pax format requires the extractor to parse the record grammar. On the other hand, the pax format is both more space efficient and vastly more flexible.

In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 5 different ways of reading the link path, and 3 different ways of reading the file size.

Whatever happened to the nice and simple format we started out with?

Why GNU tar extracts in quadratic time

Our simple tar extraction implementation has what could be considered a quite serious security bug: It allows people to put files outside the directory we're extracting to. Nothing is stopping an evil archive from containing a file object with file_path="../hello.txt". You might try to fix that by just disallowing file objects from using ".." as a path component, but it's not that simple. Consider the following sequence of file objects:

  1. Symlink, file_path="./foo", link_path=".."
  2. Normal file, file_path="./foo/hello.txt"

We want to allow symlinks which point to their parent directory, since there are completely legitimate use cases for that. We could try to figure out whether a symlink will end up pointing to somewhere outside of the extraction directory, but that gets complicated real fast when you have to consider symlinks to symlinks and hard links to symlinks. It might be possible to do correctly, but it's not the solution GNU tar goes for.

When GNU tar encounters a hard link or symlink with ".." as a path component in its link_path, tar will create a regular file in its place as a placeholder, and put a note about the delayed link in a linked list data structure. When it's done extracting the entire archive, it will go through the whole list of delayed links and replace the placeholders with proper links. So far, so good.

The problem comes when trying to extract a hard link which doesn't contain ".." as a path component in its link_path. GNU tar wants to create such hard links immediately if it can. But it can't create a hard link if the target is occupied by a placeholder file. That means, every time GNU tar wants to create a hard link, it first has to walk the entire linked list of delayed links and see if the target is a delayed link. If the target is a delayed link, the new link must also be delayed.
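To make the problem concrete, here's a sketch of the shape of that check (my illustration of the behavior described above, not GNU tar's actual code):

#include <string.h>

struct delayed_link {
	struct delayed_link *next;
	char *placeholder_path; // where the placeholder file was created
	/* link metadata omitted */
};

static struct delayed_link *delayed_links;

// Called for every hard link which looks safe to create immediately.
// Walks the entire list: O(number of delayed links) per hard link.
static int target_is_delayed(const char *link_target) {
	for (struct delayed_link *dl = delayed_links; dl != NULL; dl = dl->next) {
		if (strcmp(dl->placeholder_path, link_target) == 0)
			return 1; // the target is just a placeholder; delay this link too
	}
	return 0;
}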

Your time complexity alarm bells should be starting to ring now. For every hard link, we walk the list of all delayed links. But it actually gets worse; for reasons I don't quite understand yet, tar will actually go through the entire list of delayed links again if it finds that it can create the link immediately. So for all "normal" hard links, it has to go through the entire linked list of delayed links twice.

If you're a bit crafty, you can construct a tar archive which GNU tar extracts in precisely O(n^2) time; you just need to alternate between links whose link_path has ".." as a path component and thus get delayed, and "normal" hard links which don't get delayed. If you're a bit unlucky, you might have a totally benign tarball which nevertheless happens to contain a bunch of symlinks which refer to files in a parent directory, followed by a bunch of normal hard links. This is what had happened to me. My tarball happened to contain over 800 000 links with ".." as a path component. It also happened to contain over 5.4 million hard links. Every one of those hard links had to go through the entire list of every hitherto deferred link. No wonder tar got slow.

If you ever find yourself in this situation, pass the --absolute-names (or -P) parameter to tar. Tar's documentation says this about --absolute-names:

Preserve pathnames. By default, absolute pathnames (those that begin with a / character) have the leading slash removed both when creating archives and extracting from them. Also, tar will refuse to extract archive entries whose pathnames contain .. or whose target directory would be altered by a symlink. This option suppresses these behaviors.

You would never guess it from reading the documentation, but when you pass --absolute-names during extraction, tar assumes that the archive is benign and the whole delayed linking mechanism is disabled. Make sure you trust the tar archive though! When extracted with --absolute-names, a malicious archive will be able to put files anywhere it wants.

I'm absolutely certain that it's possible to make GNU tar extract in O(n) without --absolute-names by replacing the linked list with a hash map. But that's an adventure for another time.

References

These are the documents I've drawn information from when researching for my tar extractor and this blog post:

If I have represented anything inaccurately in this post, please do correct me.

Read More

C/C++: 70x faster file embeds using string literals

Date: 2020-08-03
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00013-fast-cpp-embeds.md
Tool: https://github.com/mortie/strliteral

It's really common to want to embed some static data into a binary. Game developers want to embed their shaders. Developers of graphical apps may want to embed sounds or icons. Developers of programming language interpreters may want to embed their language's standard library. I have many times built software whose GUI is in the form of a web app served from a built-in HTTP server, where I want to embed the HTML/JS/CSS into the binary.

Since neither C nor C++ currently has a built-in way to embed files, we use work-arounds. These usually fall into one of two categories: Either we use toolchain-specific features to generate object files with the data exposed as symbols, or we generate C code which we subsequently compile to object files. Since the toolchain-specific features are, well, toolchain-specific, people writing cross-platform software generally prefer code generation.

The most common tool I'm aware of to generate C code for embedding data is xxd, whose -i option will generate C code with an unsigned char array literal.

Given the following input text:

<html>
	<head>
		<title>Hello World</title>
	</head>
	<body>
		Hello World
	</body>
</html>
index.html

The command xxd -i index.html will produce this C code:

unsigned char index_html[] = {
  0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x09, 0x3c, 0x68, 0x65, 0x61,
  0x64, 0x3e, 0x0a, 0x09, 0x09, 0x3c, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e,
  0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x3c,
  0x2f, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e, 0x0a, 0x09, 0x3c, 0x2f, 0x68,
  0x65, 0x61, 0x64, 0x3e, 0x0a, 0x09, 0x3c, 0x62, 0x6f, 0x64, 0x79, 0x3e,
  0x0a, 0x09, 0x09, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72,
  0x6c, 0x64, 0x0a, 0x09, 0x3c, 0x2f, 0x62, 0x6f, 0x64, 0x79, 0x3e, 0x0a,
  0x3c, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a
};
unsigned int index_html_len = 92;

This works fairly well. Any C or C++ compiler can compile that code and produce an object file with our static data, which we can link against to embed that data into our binary. All in a cross-platform and cross-toolchain way.

There's just one problem: It's slow. Really slow. On my laptop, embedding a megabyte this way takes 2 seconds using g++. Embedding one decent quality MP3 at 8.4MB takes 23 seconds, using 2.5 gigabytes of RAM.

bippety-boppety.mp3, an 8.4MB song

Whether or not we should embed files of that size into our binaries is a question I won't cover in this article, and the answer depends a lot on context. Regardless, processing data at just over 400kB per second is objectively terrible. We can do so much better.

The main reason it's so slow is that parsing arbitrary C++ expressions is actually really complicated. Every single byte is a separate expression, parsed using a complex general expression parser, presumably separately allocated as its own node in the syntax tree. If only we could generate code which combines lots of bytes of data into one token...

I wrote a small tool, called strliteral, which outputs data as a string literal rather than a character array. The command strliteral index.html will produce this C code:

const unsigned char index_html[] =
	"<html>\n\t<head>\n\t\t<title>Hello World</title>\n\t</head>\n\t<body>\n\t\tHello"
	" World\n\t</body>\n</html>\n";
const unsigned int index_html_len = 92;

It should come as no surprise that this is many times faster to parse than the character array approach. Instead of invoking a full expression parser for each and every byte, most of the time will just be spent in a tight loop which reads bytes and appends them to an array. The grammar for a string literal is ridiculously simple compared to the grammar for an array literal.

Compared to xxd's 23 seconds and 2.5GB of RAM usage for my 8.4MB file, my strliteral tool produces code which g++ can compile in 0.6 seconds, using only 138 megs of RAM. That's almost a 40x speed-up, and an 18x reduction in RAM usage. It's processing data at a rate of 15MB per second, compared to xxd's 0.4MB per second. As a bonus, my tool generates 26MB of C code, compared to xxd's 52MB.

Here's how that song looks, encoded with strliteral:

const unsigned char bippety_boppety_mp3[] =
	"\377\373\340D\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\242\240,\2253]5\234\316\020\234\375\246\072D\307\027\203R\030\307\221\314`\243B\370\013\301\220\256"
	"\235\036\243E\246\331\216\026\004\341\362uU&\255\030@,\227\021q]1\231L\304\010E\311\231\005W\231\210"
	"j-\"\374|\210II0\221\026\045\021}qC\206\t9<\320\013\246w\350\263EmH`#\262\037\252\304\272\340\355`7\217"
	"\343*\016\236\320\345oa\217\204\361~k\224\255|\301cy\371\375\034\366K'\236\037\271\204\371\275\rV\267"
	"\252\020\245\322~\233\350\222\343\347\204\332\340~\236-\355S.W\045\365\301=\\+\236\270F\312\246g\266"
	"CX2\376\265V\242T0\337I\031\343\347\320\336\322\020\016\020H\250\007]\031\201\235\025\300h\2628d\000"
	/* 249707 lines snipped */
	"\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252"
	"\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252"
	"\252TAG\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000Created with LMMS\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\377";
unsigned int bippety_boppety_mp3_len = 8779359;

The difference is even bigger when processing mostly-ASCII text rather than binary data. Since xxd produces the same 6 bytes of source code for every byte of input (0x, two hex digits, comma, space), the data itself doesn't matter. However, strliteral produces 4 bytes of source code (\, then three octal digits) for every "weird" character, but just one byte of source code for every "regular" ASCII character.
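The core of such a tool can be a very small loop. Here's a sketch of the general technique (my illustration, not strliteral's actual source): emit printable ASCII directly, and emit everything else as a fixed-width octal escape, so that a following digit can never be absorbed into the escape sequence:

#include <stdio.h>

void emit_bytes(FILE *out, const unsigned char *buf, size_t len) {
	for (size_t i = 0; i < len; ++i) {
		unsigned char c = buf[i];
		if (c >= ' ' && c <= '~' && c != '"' && c != '\\' &&
				c != '?' && c != ':' && c != '%') {
			// Plain printable ASCII: one byte of output per byte of input.
			fputc(c, out);
		} else {
			// Everything else ('?', ':' and '%' included, to keep trigraphs
			// and digraphs out of the output): a 3-digit octal escape,
			// i.e four bytes of output per byte of input.
			fprintf(out, "\\%03o", c);
		}
	}
}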

Graphs

I wrote some benchmarking code to compare various aspects of xxd and strliteral. All times are measured using an Intel Core i7-8705G CPU in a Dell XPS 15 9575. g++ and xxd are from the Ubuntu 20.04 repositories. strliteral is compiled with gcc -O3 -o strliteral strliteral.c using GCC 9.3.0. The benchmarking source code can be found here: https://github.com/mortie/strliteral/tree/master/benchmark

Here's a graph which shows exactly how the two tools compare, across a range of input sizes, given either text or random binary data:

The 70x number in the title comes from this graph. The 60ms spent compiling strliteral-generated code is 72x faster than the 4324ms spent compiling xxd-generated code. Comparing random binary data instead of text would show a lower - though still respectable - speed-up of 25x.

Though most of the time spent when embedding data with xxd comes from the compiler, the xxd tool itself is actually fairly slow too:

Those ~200 milliseconds xxd takes to generate code for a 2MB file aren't very significant compared to the 4.3 second compile time, but if strliteral was equally slow, 75% of the time would've been spent generating code as opposed to compiling code. Luckily, strliteral runs through 2MB of text in 11ms.

Looking at the xxd source code, the reason it's so slow seems to be that it prints every single byte using a call to fprintf:

while ((length < 0 || p < length) && (c = getc(fp)) != EOF)
  {
    if (fprintf(fpo, (hexx == hexxa) ? "%s0x%02x" : "%s0X%02X",
                (p % cols) ? ", " : &",\n  "[2*!p],  c) < 0)
      die(3);
    p++;
  }

Finally, here's a graph of g++'s memory usage:

Caveats

Update: In the reddit discussion, someone pointed out that MSVC, Microsoft's compiler, has a fairly low maximum string length limit (the exact limit is fairly complicated). I had assumed that any modern compiler would just keep strings in a variable sized array. Maybe strliteral will eventually grow an MSVC-specific workaround, but until then, using a better compiler like Clang or GCC on Windows is an option.

Using string literals for arbitrary binary data is a bit more complicated than using an array with integer literals. Both xxd and strliteral might have trouble in certain edge cases, such as when cross-compiling if the host and target disagree on the number of bits in a byte. Using string literals adds an extra complication due to the distinction between the "source character set" and the "execution character set". The C11 spec (5.2.1p2) states:

In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters.

If you run strliteral on a file which contains the byte 97, it will output the code const unsigned char data[] = "a";. If that C code is compiled with a "source character set" of ASCII and an "execution character set" of EBCDIC, my understanding of the standard text is that the ASCII "a" (byte 97) will be translated to the EBCDIC "a" (byte 129). Whether that's even a bug or not depends on whether the intention is to embed binary data or textual data, but it's probably not what people expect from a tool to embed files.

This should only ever become an issue if you're compiling with different source and execution charsets, where the execution charset isn't based on ASCII. Compiling with a UTF-8 source charset and an EBCDIC execution charset will cause issues, but since all non-ASCII characters are printed as octal escape sequences, compiling with e.g a UTF-8 source charset and a LATIN-1 execution charset isn't an issue.

It seems extremely unlikely to me that someone will compile with differing source and execution charsets where the execution charset isn't based on ASCII, but I suppose it's something to keep in mind. If it does become an issue, the --always-escape option will cause strliteral to only generate octal escape sequences. That should work the same as xxd -i in all cases, just faster.

Some implementation notes

C is a weird language. For some reason, probably to better support systems where bytes are bigger than 8 bits, hex string escapes like "\x6c" can be an arbitrary number of characters. "\xfffff" represents a string with one character whose numeric value is 1048575. That obviously won't work on machines with 8-bit bytes, but it could conceivably be useful on a machine with 24-bit bytes, so it's allowed. Luckily, octal escapes are at most 3 digits, so while "\xf8ash" won't work, "\370ash" will.

C also has a concept of trigraphs and digraphs, and they're expanded even within string literals. The string literals "??(" and "[" are identical (at least in C, and in C++ before C++17). Currently, strliteral just treats ?, : and % as "special" characters which are escaped, which means no digraphs or trigraphs will ever appear in the generated source code. I decided it's not worth the effort to add more "clever" logic which e.g escapes a ( if the two preceding characters are question marks.

Read More

Hacking on Clang is surprisingly easy

Date: 2020-01-27
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00012-clang-compiler-hacking.md

I happen to think that the current lambda syntax for C++ is kind of verbose. I'm not the only one to have thought that, and there has already been a paper discussing a possible abbreviated lambda syntax (though it was rejected).

In this blog post, I will detail my attempt to implement a sort of simplest possible version of an abbreviated lambda syntax. Basically, this:

[](auto &&a, auto &&b) => a.id() < b.id();

should mean precisely:

[](auto &&a, auto &&b) { return a.id() < b.id(); };

I will leave a discussion about whether that change is worth it or not to the end. Most of this article will just assume that we want that new syntax, and discuss how to actually implement it in Clang.

If you want to read more discussion on the topic, I wrote a somewhat controversial post on Reddit discussing why I think it might be a good idea.

Here's the implementation I will go through in this post: https://github.com/mortie/llvm/commit/e4726dc9d9d966978714fc3d85c6e9c335a38ab8 - 28 additions, including comments and whitespace, across 3 files.

Getting the Clang code

This wasn't my first time compiling Clang, but it was my first time downloading the source code with the intent to change it.

LLVM has a nice page which details getting and building Clang, but the tl;dr is:

git clone https://github.com/llvm/llvm-project.git
cd llvm-project && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/inst -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=Release ../llvm
make -j 8
make install

A few points to note:

  • The build will take a long time. Clang is big.
  • I prefer -DCMAKE_BUILD_TYPE=Release because it's way faster to build. Linking Clang with debug symbols and everything takes ages and will OOM your machine.
  • This will install your built clang to inst (short for "install"). The clang binary itself will be in inst/bin/clang.

Now that we have a clang setup, we can have a look at how the project is laid out, and play with it.

Changing the Clang code

The feature I want to add is very simple: Basically, I want [] => 10 to mean the exact same thing as [] { return 10; }. In order to understand how one would achieve that, an extremely short introduction to how compilers work is necessary:

Our code is just a sequence of bytes, like [] => 10 + 20. In order for Clang to make sense of that, it will go through many steps. We can basically divide a compiler into two parts: the "front-end", which goes through many steps to build a thorough understanding of the code as a tree structure, and the "back-end" which goes through many steps to remove information, eventually ending up with a simple series of bytes again, but this time as machine code instead of ASCII.

We'll ignore the back-end for now. The front-end basically works like this:

  1. Split the stream of bytes into a stream of tokens. This step turns [] => 10 + 20 into something like (open-bracket) (close-bracket) (fat-arrow) (number: 10) (plus) (number: 20).
  2. Go through those tokens and construct a tree. This step turns the sequence of tokens into a tree: (lambda-expression (body (return-statement (add-expression (number 10) (number 20))))) (Yeah, this looks a lot like Lisp. There's a reason people say Lisp basically has no syntax; you're just writing out the syntax tree by hand.)
  3. Add semantic information, such as types.

The first phase is usually called lexical analysis, or tokenization, or scanning. The second phase is what we call parsing. The third phase is usually called semantic analysis or type checking.

Well, the change I want to make involves adding a new token, the "fat arrow" token =>. That means we'll have to find out how the lexer (or tokenizer) is implemented; where it keeps its list of valid tokens types, and where it turns the input text into tokens. After some grepping, I found the file clang/include/clang/Basic/TokenKinds.def, which includes a bunch of token descriptions, such as PUNCTUATOR(arrow, "->"). This file seems to be a "supermacro"; a file which exists to be included by another file as a form of macro expansion.

I added PUNCTUATOR(fatarrow, "=>") right below the PUNCTUATOR(arrow, "->") line.

Now that we have defined our token, we need to get the lexer to actually generate it.

After some more grepping, I found clang/lib/Lex/Lexer.cpp, where the Lexer::LexTokenInternal function is what's actually looking at characters and deciding what tokens they represent. It has a case statement to deal with tokens which start with an = character:

case '=':
	Char = getCharAndSize(CurPtr, SizeTmp);
	if (Char == '=') {
		// If this is '====' and we're in a conflict marker, ignore it.
		if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1))
			goto LexNextToken;

		Kind = tok::equalequal;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
	} else {
		Kind = tok::equal;
	}
	break;

Given that, the change to support my fatarrow token is really simple:

case '=':
	Char = getCharAndSize(CurPtr, SizeTmp);
	if (Char == '=') {
		// If this is '====' and we're in a conflict marker, ignore it.
		if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1))
			goto LexNextToken;

		Kind = tok::equalequal;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);

	// If the first character is a '=', and it's followed by a '>', it's a fat arrow
	} else if (Char == '>') {
		Kind = tok::fatarrow;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);

	} else {
		Kind = tok::equal;
	}
	break;

Now that we have a lexer which generates a tok::fatarrow any time it encounters a => in our code, we can start changing the parser to make use of it.

Since I want to change lambda parsing, the code which parses a lambda seems like a good place to start (duh). I found that in a file called clang/lib/Parse/ParseExprCXX.cpp, in the function ParseLambdaExpressionAfterIntroducer. Most of the function deals with things like the template parameter list and trailing return type, which I don't want to change, but the very end of the function contains this gem:

// Parse compound-statement.
if (!Tok.is(tok::l_brace)) {
	Diag(Tok, diag::err_expected_lambda_body);
	Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
	return ExprError();
}

StmtResult Stmt(ParseCompoundStatementBody());
BodyScope.Exit();
TemplateParamScope.Exit();

if (!Stmt.isInvalid() && !TrailingReturnType.isInvalid())
	return Actions.ActOnLambdaExpr(LambdaBeginLoc, Stmt.get(), getCurScope());

Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
return ExprError();

In other words:

  1. If the next token isn't an opening brace, error.
  2. Parse a compound statement body (i.e consume a {, read statements until the }).
  3. After some housekeeping, act on the now fully parsed lambda expression.

In principle, what we want to do is to check if the next token is a => instead of a {; if it is, we want to parse an expression instead of a compound statement, and then somehow pretend that the expression is a return statement. Through some trial, error and careful copy/pasting, I came up with this block of code which I put right before the if (!Tok.is(tok::l_brace)):

// If this is an arrow lambda, we just need to parse an expression.
// We parse the expression, then put that expression in a return statement,
// and use that return statement as our body.
if (Tok.is(tok::fatarrow)) {
	SourceLocation ReturnLoc(ConsumeToken());

	ExprResult Expr(ParseExpression());
	if (Expr.isInvalid()) {
		Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
		return ExprError();
	}

	StmtResult Stmt = Actions.ActOnReturnStmt(ReturnLoc, Expr.get(), getCurScope());

	BodyScope.Exit();
	TemplateParamScope.Exit();

	if (!Stmt.isInvalid() && !TrailingReturnType.isInvalid())
		return Actions.ActOnLambdaExpr(LambdaBeginLoc, Stmt.get(), getCurScope());

	Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
	return ExprError();
}

// Otherwise, just parse a compound statement as usual.
if (!Tok.is(tok::l_brace)) ...

This is really basic; if the token is a => instead of a {, parse an expression, then put that expression into a return statement, and then use that return statement as our lambda's body.

And it works! Lambda expressions with fat arrows are now successfully parsed as if they were regular lambdas whose body is a single return statement:

Demonstration of our new feature

Was it worth it?

Implementing this feature into Clang was definitely worth it just to get more familiar with how the code base works. However, is the feature itself a good idea at all?

I think the best way to decide if a new syntax is better or not is to look at old code which could've made use of the new syntax, and decide if the new syntax makes a big difference. Therefore, and now that I have a working compiler, I have gone through all the single-expression lambdas and replaced them with my fancy new arrow lambdas in some projects I'm working on.


Before:

std::erase_if(active_chunks_, [](Chunk *chunk) { return !chunk->isActive(); });

After:

std::erase_if(active_chunks_, [](Chunk *chunk) => !chunk->isActive());

This code deletes a chunk from a game world if the chunk isn't currently active. In my opinion, the version with the arrow function is a bit clearer, but a better solution could be C++17's std::invoke. If I understand std::invoke correctly, if C++ was to adopt std::invoke for algorithms, this code could be written like this:

std::erase_if(active_chunks_, &Chunk::isInactive);

This looks nicer, but has the disadvantage that you need to add an extra method to the class. Having both isActive and its negation isInactive as member functions just because someone might want to use it as a predicate in an algorithm sounds unfortunate. I prefer lambdas' flexibility.


Before:

return map(begin(worldgens_), end(worldgens_), [](auto &ptr) { return ptr.get(); });

After:

return map(begin(worldgens_), end(worldgens_), [](auto &ptr) => ptr.get());

This code maps a vector of unique pointers to raw pointers. This is yet another case where I think the arrow syntax is slightly nicer than the C++11 alternative, but this time, we could actually use the member function invocation if I changed my map function to use std::invoke:

return map(begin(worldgens_), end(worldgens_), &std::unique_ptr<Swan::WorldGen::Factory>::get);

Well, this illustrates that invoking a member function doesn't really work with overly complex types. Imagine if the type was instead something more elaborate:

return map(begin(worldgens_), end(worldgens_),
	&std::unique_ptr<Swan::WorldGen<int>::Factory, Swan::WorldGen<int>::Factory::Deleter>::get);

This also happens to be unspecified behavior, because taking the address of a function in the standard library is generally not legal. From https://en.cppreference.com/w/cpp/language/extending_std:

The behavior of a C++ program is unspecified (possibly ill-formed) if it explicitly or implicitly attempts to form a pointer, reference (for free functions and static member functions) or pointer-to-member (for non-static member functions) to a standard library function or an instantiation of a standard library function template, unless it is designated an addressable function.


Before:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	[](auto &w) { return w.needRender(); });

After:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	[](auto &w) => w.needRender());

Again, the short lambda version looks a bit better to me. However, here again, we could replace the lambda with a member reference if algorithms were changed to use std::invoke:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	&WidgetContainer::needRender);

Overall, I see a short lambda syntax as a modest improvement. The biggest readability win mostly stems from the lack of that awkward ; }); at the end of an expression; foo([] => bar()) instead of foo([] { return bar(); });. It certainly breaks down a bit when the argument list is long; neither of these two lines are particularly short:

foo([](auto const &&a, auto const &&b, auto const &&c) => a + b + c);
foo([](auto const &&a, auto const &&b, auto const &&c) { return a + b + c; });

I think, considering the minimal cost of implementing this short lambda syntax, the modest improvements outweigh the added complexity. However, there's also an opportunity cost associated with my version of the short lambda syntax: it makes a better, future short lambda syntax either impossible or more challenging. For example, accepting my syntax would mean we couldn't really adopt a ratified version of P0573R2's short lambda syntax in the future, even if the issues with it were otherwise fixed.

Therefore, I will argue strongly that my syntax makes code easier to read, but I can't say anything about whether it should be standardized or not.

Aside: Corner cases

If we were to standardize this syntax, we would have to consider all kinds of corner cases, not just accept what clang with my changes happens to do. However, I'm still curious about what exactly clang happens to do with my changes.

How does this interact with the comma operator?

The comma operator in C++ (and most C-based languages) is kind of strange. For example, what does foo(a, b) do? We know it calls foo with the arguments a and b, but you could technically decide to parse it as foo(a.operator,(b)).

My arrow syntax parses foo([] => 10, 20) as calling foo with one argument; a function with the body 10, 20 (where the comma operator means the 10 does nothing, and 20 is returned). I would probably want that to be changed, so that foo is called with two arguments; a lambda and an int.

This turns out to be fairly easy to fix, because Clang already has ways of dealing with expressions which can't include top-level commas. After all, there's precedent here; since Clang parses foo(10, 20) without interpreting the top-level comma as a comma operator, we can use the same infrastructure for arrow lambdas.

In clang/lib/Parse/ParseExpr.cpp, Clang defines a function ParseAssignmentExpression, which has this comment:

Parse an expr that doesn't include (top-level) commas.

Calling ParseAssignmentExpression is also the first thing the ParseExpression function does. It seems like it's just the general function for parsing an expression without a top-level comma operator, even though the name is somewhat misleading. This patch changes arrow lambdas to use ParseAssignmentExpression instead of ParseExpression: https://github.com/mortie/llvm/commit/c653318c0056d06a512dfce0799b66032edbed4c

How do immediately invoked lambdas work?

With C++11 lambdas, you can write an immediately invoked lambda in the obvious way; just do [] { return 10; }(), and that expression will return 10. With my arrow lambda syntax, it's not quite as obvious. Would [] => foo() be interpreted as immediately invoking the lambda [] => foo, or would it be interpreted as creating a lambda whose body is foo()?

In my opinion, the only sane way for arrow lambdas to work would be that [] => foo() creates a lambda with the body foo(), and that creating an immediately invoked lambda would require extra parens; ([] => foo())(). That's also how my implementation happens to work.

How does this interact with explicit return types, specifiers, etc?

Since literally all the code before the arrow/opening brace is shared between arrow lambdas and C++11 lambdas, everything should work exactly the same. That means that all of these statements should work:

auto l1 = []() -> long => 10;
auto l2 = [foo]() mutable => foo++;
auto l3 = [](auto a, auto b) noexcept(noexcept(a + b)) => a + b;

And so would any other combination of captures, template params, params, specifiers, attributes, constraints, etc, except that the body has to be a single expression.

Read More

C compiler quirks I have encountered

Date: 2018-07-26
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00011-c-compiler-quirks.md

In a previous blog post, I wrote about some weird features of C, the C preprocessor, and GNU extensions to C that I used in my testing library, Snow.

This post will be about some of the weird compiler and language quirks, limitations, and annoyances I've come across. I don't mean to bash compilers or the specification; most of these quirks have good technical or practical reasons.

Compilers lie about what version of the standard they support

There's a handy macro, called __STDC_VERSION__, which describes the version of the C standard your C implementation conforms to. We can check #if (__STDC_VERSION__ >= 201112L) to check if our C implementation conforms to C11 or higher (C11 was published in December 2011, hence 2011 12). That's really useful if, say, you're a library author and have a macro which uses _Generic, but also have alternative ways of doing the same and want to warn people when they use the C11-only macro in an older compiler.

In theory, this should always work; any implementation of C which conforms to all of C11 will define __STDC_VERSION__ as 201112L, while any implementation which doesn't conform to C11, but conforms to some earlier version, will define __STDC_VERSION__ to be less than 201112L. Therefore, unless the _Generic feature gets removed in a future version of the standard, __STDC_VERSION__ >= 201112L means that we can safely use _Generic.

Sadly, the real world is not that clean. Already in GCC 4.7, you could enable C11 by passing in -std=c11, which would set __STDC_VERSION__ to 201112L, but the first release to actually implement all non-optional features of C11 was GCC 4.9. That means, if we just check the value of __STDC_VERSION__, users on GCC 4.7 and GCC 4.8 who use -std=c11 will see really confusing error messages instead of our nice error message. Annoyingly, GCC 4.7 and 4.8 happen to still be extremely widespread versions of GCC. (Relevant: GCC Wiki's C11Status page)

The solution still seems relatively simple; just don't use -std=c11. More recent compilers default to C11 anyways, and there's no widely used compiler that I know of which will default to setting __STDC_VERSION__ to C11 without actually supporting all of C11. That works well enough, but there's one problem: GCC 4.9 supports all of C11 just fine, but only if we give it -std=c11. GCC 4.9 also seems to be one of those annoyingly widespread versions of GCC, so we'd prefer to encourage users to set -std=c11 and make the macros which rely on _Generic work in GCC 4.9.

Again, the solution seems obvious enough, if a bit ugly: if the compiler is GCC, we only use _Generic if the GCC version is 4.9 or greater and __STDC_VERSION__ is C11. If the compiler is not GCC, we just trust it if it says it supports C11. This should in theory work perfectly:

#if (__STDC_VERSION__ >= 201112L)
# ifdef __GNUC__
#  if (__GNUC__ >= 5 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#   define IS_C11
#  endif
# else
#  define IS_C11
# endif
#endif

Our new IS_C11 macro should now always be defined if we can use _Generic and never be defined when we can't use _Generic, right?

Wrong. It turns out that in their quest to support code written for GCC, Clang also defines the __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros, specifically to fool code which checks for GCC into thinking Clang is GCC. However, it doesn't really go far enough; it defines the __GNUC_* macros to correspond to the version of Clang, not the version of GCC which Clang claims to imitate. Clang gained support for C11 in 3.6, but using our code, we would conclude that it doesn't support C11 because __GNUC__ is 3 and __GNUC_MINOR__ is 6. Update: it turns out that Clang always pretends to be GCC 4.2, but the same issue still applies; __GNUC__ is 4, and __GNUC_MINOR__ is 2, so it fails our version check. We can solve this by adding a special case for when __clang__ is defined:

#if (__STDC_VERSION__ >= 201112L)
# if defined(__GNUC__) && !defined(__clang__)
#  if (__GNUC__ >= 5 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#   define IS_C11
#  endif
# else
#  define IS_C11
# endif
#endif

Now our code works with both Clang and with GCC, and should work with all other compilers which don't try to imitate GCC - but for every compiler which does imitate GCC, we would have to add a new special case. This is starting to smell a lot like user agent strings.

The Intel compiler is at least nice enough to define __GNUC__ and __GNUC_MINOR__ to match the version of GCC installed on the system; so even though our version check is completely irrelevant to the Intel compiler, at least it will only prevent an otherwise C11-compliant Intel compiler from using _Generic if the user has an older version of GCC installed.

User: Hi, I'm using the Intel compiler, and your library claims my compiler doesn't support C11, even though it does.

You: Upgrading GCC should solve the issue. What version of GCC do you have installed?

User: ...but I'm using the Intel compiler, not GCC.

You: Still, what version of GCC do you have?

User: 4.8, but I really don't see how that's relevant...

You: Try upgrading GCC to at least version 4.9.

(Relevant: Intel's Additional Predefined Macros page)
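To show what all this plumbing buys us, here's the kind of thing the IS_C11 macro can guard, with a fallback for older compilers. The print functions here are hypothetical, purely for illustration:

#include <stdio.h>

static void print_int(int x)       { printf("%d\n", x); }
static void print_double(double x) { printf("%g\n", x); }

#ifdef IS_C11
// Dispatch on the argument's type at compile time.
# define print_value(x) _Generic((x), int: print_int, double: print_double)(x)
#else
// A degraded but working fallback for pre-C11 compilers.
# define print_value(x) print_double((double)(x))
#endif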

_Pragma in macro arguments

C has had pragma directives for a long time. It's a useful way to tell our compiler something implementation-specific; something which there's no way to say using only standard C. For example, using GCC, we could use a pragma directive to tell our compiler to ignore a warning for a couple of lines, without changing warning settings globally:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
// my_float being 0 indicates a horrible failure case.
if (my_float == 0)
	abort();
#pragma GCC diagnostic pop

We might also want to define a macro which outputs the above code, so C99 introduced the _Pragma operator, which works like #pragma, but can be used in macros. Once this code goes through the preprocessor, it will do exactly the same as the above code:

#define abort_if_zero(x) \
	_Pragma("GCC diagnostic push") \
	_Pragma("GCC diagnostic ignored \"-Wfloat-equal\"") \
	if (x == 0) \
		abort(); \
	_Pragma("GCC diagnostic pop")

abort_if_zero(my_float);

Now, imagine that we want a macro to trace certain lines; a macro which takes a line of code, and prints that line of code while executing the line. This code looks completely reasonable, right?

#define trace(x) \
	fprintf(stderr, "TRACE: %s\n", #x); \
	x

trace(abort_if_zero(my_float));

However, if we run that code through GCC's preprocessor, we see this mess:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
#pragma GCC diagnostic pop
fprintf(stderr, "TRACE: %s\n", "abort_if_zero(my_float)"); if (my_float == 0) abort();

The pragmas all got bunched up at the top! From what I've heard, this isn't against the C standard, because the standard is not entirely clear on what happens when you send in _Pragma operators as macro arguments, but it sure surprised me when I encountered it nonetheless.

For the Snow library, this means that there are certain warnings which I would have loved to only disable for a few lines, but which I have to disable for all code following the #include <snow/snow.h> line.

Side note: Clang's preprocessor does exactly what one would expect, and produces this output:

fprintf(stderr, "TRACE: %s\n", "abort_if_zero(my_float)");
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
 if (my_float == 0) abort();
#pragma GCC diagnostic pop

Line numbers in macro arguments

Until now, the quirks I've shown have been issues you could potentially encounter in decent, real-world code. If this quirk has caused issues for you however, it might be a sign that you're slightly over-using macros.

All testing code in Snow happens within macro arguments. This allows for what I think is a really nice looking API, and allows all testing code to be disabled just by changing one macro definition. This is a small example of a Snow test suite:

#include <stdio.h>
#include <snow/snow.h>

describe(files, {
	it("writes to files", {
		FILE *f = fopen("testfile", "w");
		assertneq(f, NULL);
		defer(remove("testfile"));
		defer(fclose(f));

		char str[] = "hello there";
		asserteq(fwrite(str, 1, sizeof(str), f), sizeof(str));
	});
});

snow_main();

If that assertneq or asserteq fails, we would like and expect to see a line number. Unfortunately, after the code goes through the preprocessor, the entire nested macro expansion ends up on a single line. All line number information is lost. __LINE__ just returns the number of the last line of the macro expansion, which is 14 in this case. All __LINE__ expressions inside the block we pass to describe will return the same number. I have googled around a bunch for a solution to this issue, but none of the solutions I've looked at actually solve the issue. The only actual solution I can think of is to write my own preprocessor.
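Here's a tiny example of the effect, as I understand GCC's behavior (a sketch, not Snow code):

#include <stdio.h>

#define with_block(block) block

int main(void) {
	with_block({
		printf("%d\n", __LINE__);
		printf("%d\n", __LINE__);
	}); // with GCC, both printf calls print this line's number
}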

Some warnings can't be disabled with pragma

Like the above example, this is probably an issue you shouldn't have come across in production code.

First, some background. In Snow, both the code which is being tested and the test cases can be in the same file. This is to make it possible to test static functions and other functionality which isn't part of the component's public API. The idea is that at the bottom of the file, after all non-testing code, one should include <snow/snow.h> and write the test cases. In a non-testing build, all the testing code will be removed by the preprocessor, because the describe(...) macro expands to nothing unless SNOW_ENABLED is defined.

My personal philosophy is that your regular builds should not have -Werror, and that your testing builds should have as strict warnings as possible and be compiled with -Werror. Your users may be using a different compiler version from you, and that compiler might produce some warnings which you haven't fixed yet. Being a user of a rolling release distro, with a very recent version of GCC, I have way too often had to edit someone else's Makefile and remove -Werror just to make their code compile. Compiling the test suite with -Werror and regular builds without -Werror has none of the drawbacks of using -Werror for regular builds, and most or all of the advantages (at least if you don't accept contributions which break your test suite).

This all means that I want to be able to compile all files with at least -Wall -Wextra -Wpedantic -Werror, even if the code includes <snow/snow.h>. However, Snow contains code which produces warnings (and therefore errors) with those settings; among other things, it uses some GNU extensions which aren't actually part of the C standard.

I would like to let users of Snow compile their code with at least -Wall -Wextra -Wpedantic -Werror, but Snow has to disable at least -Wpedantic for all code after the inclusion of the library. In theory, that shouldn't be an issue, right? We just include #pragma GCC diagnostic ignored "-Wpedantic" somewhere.

Well, as it turns out, disabling -Wpedantic with a pragma doesn't disable all the warnings enabled by -Wpedantic; there are some warnings which are impossible to disable once they're enabled. One such warning is about using directives (like #ifdef) inside macro arguments. As I explained earlier, everything in Snow happens inside of macro arguments. That means that when compiling with -Wpedantic, this code produces a warning which is impossible to disable without removing -Wpedantic from the compiler's arguments:

describe(some_component, {
#ifndef __MINGW32__
	it("does something which can't be tested on mingw", {
		/* ... */
	});
#endif
});

That's annoying, because it's perfectly legal in GNU's dialect of C. The only reason we can't do it is that it just so happens to be impossible to disable that particular warning with a pragma.

To be fair, this issue makes sense. I imagine the preprocessor stage, which is where macros are expanded, doesn't care much about pragmas. It feels unnecessary to implement pragma parsing in the preprocessor just to let people compile files with -Wpedantic but still selectively disable this particular warning. That doesn't make it any less annoying though.

Funnily enough, I encountered this issue while writing Snow's test suite. My solution was to just define a macro called NO_MINGW which is empty if __MINGW32__ is defined, and expands to the contents of its arguments otherwise.
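
That workaround looks roughly like this (a sketch; the actual definition in Snow's test suite may differ in the details):

#ifdef __MINGW32__
/* On MinGW, swallow the code entirely */
#define NO_MINGW(...)
#else
/* Elsewhere, expand to the code unchanged */
#define NO_MINGW(...) __VA_ARGS__
#endif

describe(some_component, {
	NO_MINGW(
		it("does something which can't be tested on mingw", {
			/* ... */
		});
	)
});

Since the #ifdef now sits in a regular part of the file instead of inside a macro argument, -Wpedantic has nothing to complain about.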


Some obscure C features you might not know about

Date: 2018-01-25
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00010-obscure-c-features.md

I have been working on Snow, a unit testing library for C. I wanted to see how close I could come to making a DSL (domain specific language) with its own syntax and features, using only the C preprocessor and more obscure C features and GNU extensions. I will not go into detail about how Snow works unless it's directly relevant, so I recommend taking a quick look at the readme on the GitHub page.

Sending blocks as arguments to macros

Let's start with the trick that's probably both the most useful in everyday code, and the least technically complicated.

Originally, I defined macros like describe, subdesc, and it similar to this:

#define describe(name, block) \
	void test_##name() { \
		/* some code, omitted for brevity */ \
		block \
		/* more code */ \
	}

The intended use would then be like this:

describe(something, {
	/* code */
});

The C preprocessor doesn't really understand the code; it only copies and pastes strings around. It splits the string between the opening ( and the closing ) by comma; that means, in this case, something would be sent in as the first argument, and { /* code */ } as the second argument (pretend /* code */ is actual code; the preprocessor actually strips out comments). The C preprocessor is smart enough to know that you might want to pass function calls to macros, and function calls contain commas, so parentheses will "guard" the commas they contain. describe(something, foo(10, 20)) would therefore pass something as the first argument, and foo(10, 20) as the second argument.

Now, we're not passing in function calls, but blocks. The preprocessor only considers parentheses; braces { } or brackets [ ] don't guard their contents. That means this call will fail:

describe(something, {
	int a, b;
	/* code */
});

The preprocessor will interpret something as the first argument, { int a as the second argument, and b; /* code */ } as the third argument, but describe only takes two arguments! The preprocessor will halt and show an error message.

So, how do we fix this? Not being able to write commas outside of parentheses in our blocks is quite the limitation. Not only does it prevent us from declaring multiple variables in one statement, it also messes with array declarations like int foo[] = { 10, 20, 30 };.

Well, the preprocessor supports variadic macros: macros which can take an unlimited number of arguments. The way they are implemented is that any extra arguments (indicated by ... in the macro definition) are made available through the __VA_ARGS__ identifier; __VA_ARGS__ is replaced with all the extra arguments separated by commas. So, what happens if we define the macro like this?

#define describe(name, ...) \
	void test_##name() { \
		/* some code, omitted for brevity */ \
		__VA_ARGS__ \
		/* more code */ \
	}

Let's call describe like we did above:

describe(something, {
	int a, b;
	/* code */
});

Now, the arguments will be interpreted the same way as before; something will be the first argument, { int a will be the second argument, and b; /* code */ } will be the third. However, __VA_ARGS__ will be replaced by the second and third argument with a comma in between, and together they produce { int a, b; /* code */ }, just as we intended. The entire describe call will be expanded into this (with added newlines and indentation for clarity; the actual preprocessor would put it all on one line):

void test_something() {
	/* some code, omitted for brevity */
	{
		int a, b;
		/* code */
	}
	/* more code */
}

And just like that, we successfully passed a block of code, with unguarded commas, to a macro.

Credit for this solution goes to this stackoverflow answer.

Generic macros with _Generic

I wanted a single pair of macros, asserteq and assertneq, to handle most simple equality checks, instead of having to write asserteq_str for strings, asserteq_int for integers, etc. The C11 standard added the _Generic keyword, which sounds perfect for that; given a list of types and expressions, _Generic will choose the expression whose associated type is compatible with a controlling expression. For example, this code will print "I am an int":

_Generic(10,
	int: printf("I am an int\n"),
	char *: printf("I am a string\n")
);

By itself, _Generic isn't terribly useful, but it can be used to make faux-generic function-like macros. The cppreference.com page uses the example of a generic cbrt (cube root) macro:

#define cbrt(x) _Generic((x), \
	long double: cbrtl, \
	float: cbrtf, \
	default: cbrt)(x)

Calling cbrt on a long double will now call cbrtl, while calling cbrt on a double will call the regular cbrt function, etc. Note that _Generic is not part of the preprocessor; the preprocessor will just spit out the _Generic syntax with x replaced with the macro's argument, and it's the actual compiler's job to figure out what type the controlling expression is and choose the appropriate expression.

I have a bunch of asserteq functions for the various types; asserteq_ptr(void *a, void *b), asserteq_int(intmax_t a, intmax_t b), asserteq_str(const char *a, const char *b), etc. (In reality, the function signatures are a lot uglier, and they're prefixed with _snow_, but for the sake of this article, I'll pretend they look like void asserteq_<suffix>(<type> a, <type> b)).

At first glance, _Generic looks perfect for this use case; just define an asserteq macro like this:

#define asserteq(a, b) _Generic((b), \
	const char *: asserteq_str, \
	char *: asserteq_str, \
	void *: asserteq_ptr, \
	int: asserteq_int)(a, b)

It's sadly not that simple. _Generic will match only specific types; int matches only int, not long. void * matches void pointers, not any other form of pointer. There's no way to say "match every pointer type", for example.
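
For example, with the macro above, this snippet fails to compile, because long matches none of the listed types and there's no fallback (hypothetical variable):

long x = 10;
asserteq(x, 10L); /* error: no _Generic association matches long */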

However, there is a default clause, just like in switch statements. My first solution was to just pass anything not otherwise specified to asserteq_int, and use _Pragma (like #pragma, but can be used inside macros) to ignore the warnings:

#define asserteq(a, b) \
	do { \
		_Pragma("GCC diagnostic push") \
		_Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \
		_Generic((b), \
			const char *: asserteq_str, \
			char *: asserteq_str, \
			default: asserteq_int)(a, b); \
		_Pragma("GCC diagnostic pop") \
	} while (0)

That solution worked, but it's not exactly nice. I assume it would eventually break, either due to compiler optimizations or due to weird systems where an intmax_t is smaller than a pointer or whatever. Luckily, the good people over in ##C@freenode had an answer: subtracting a pointer from a pointer results in a ptrdiff_t! That means we can nest _Generics, and appropriately choose asserteq_int for any integer type, or asserteq_ptr for any pointer type:

#define asserteq(a, b) _Generic((b), \
	const char *: asserteq_str, \
	char *: asserteq_str, \
	default: _Generic((b) - (b), \
		ptrdiff_t: asserteq_ptr, \
		default: asserteq_int))(a, b)
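
As a quick illustration of how this dispatches (the variables here are hypothetical):

int n = 5;
int *p = &n;
const char *s = "hi";

asserteq(n, 5);    /* int - int is int           -> asserteq_int */
asserteq(p, &n);   /* int * - int * is ptrdiff_t -> asserteq_ptr */
asserteq(s, "hi"); /* matches the char * branch  -> asserteq_str */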

Defer, label pointers, and goto *(void *)

I once saw a demonstration of Golang's defer statement, and fell in love. It immediately struck me as a much better way to clean up than relying solely on the try/catch stuff we've been used to ever since 1985. Naturally, I wanted to use that for tearing down test cases in Snow, but there's not exactly any obvious way to implement it in C.

For those unfamiliar with it, in Go, defer is basically a way to say, "run this expression once the function returns". It works like a stack; when the function returns, the most recently deferred expression will be executed first, and the first deferred expression will be executed last. The beautiful part is that even if the function returns early, either because some steps can be skipped, or because something failed, all the appropriate deferred expressions, and only the appropriate deferred expressions, will be executed. Replace "function" with "test case", and it sounds perfect for tearing down tests.

So, how would you implement that in C? Well, it turns out that GCC has two useful non-standard extensions (which are also supported by Clang by the way): local labels, and labels as values.

Local labels are basically regular labels which you can jump to with goto, but instead of being global to the entire function, they're only available in the block they're declared in. That's fairly straightforward. You declare that a label should be block scoped by just putting __label__ label_name; at the top of the block, and then you can use label_name: anywhere within the block to actually create the label. A goto label_name from anywhere within the block will then go to the label, as expected.
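
In isolation, a local label looks like this (a minimal sketch):

{
	__label__ retry;
	int attempts = 0;

	retry:
	if (++attempts < 3)
		goto retry; /* legal here, but not from outside the block */
}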

Labels as values is weirder. GCC adds a new unary && operator, which gets a pointer to a label as a void *. Moreover, if you save that pointer in a variable which is accessible outside the block, you can jump back into that block from outside of it, even though it's a local label. This will print "hello" in an infinite loop:

{
	void *somelabel;

	{
		__label__ lbl;
		lbl:
		somelabel = &&lbl;
		printf("hello\n");
	}

	goto *somelabel;
}

Yes, somelabel is a void *. Yes, we dereference somelabel to jump to it. I don't know how that works, but the important part is that it does. Other than being dereferenceable, the void * we get from the unary && works exactly like any other void *, and can even be stored in an array. Knowing this, implementing defer isn't too hard; here's a simplified implementation of the it(description, block) macro (using the __VA_ARGS__ trick from before) which describes one test case, and the defer(expr) macro which can be used inside the it block:

#define it(description, ...) \
	do { \
		__label__ done_label; \
		void *defer_labels[32]; \
		int defer_count = 0; \
		int run_defer = 0; \
		__VA_ARGS__ \
		done_label: \
		run_defer = 1; \
		if (defer_count > 0) { \
			defer_count -= 1; \
			goto *defer_labels[defer_count]; \
		} \
	} while (0)

#define defer(expr) \
	do { \
		__label__ lbl; \
		lbl: \
		if (run_defer) { \
			expr; \
			/* Go to the previous defer, or the end of the `it` block */ \
			if (defer_count > 0) { \
				defer_count -= 1; \
				goto *defer_labels[defer_count]; \
			} else { \
				goto done_label; \
			} \
		} else { \
			defer_labels[defer_count] = &&lbl; \
			defer_count += 1; \
		} \
	} while (0)

That might not be the most understandable code you've ever seen, but let's break it down with an example.

it("whatever", {
	printf("Hello World\n");
	defer(printf("world\n"));
	defer(printf("hello "));
});

Running that through the preprocessor, we get this code:

do {
	__label__ done_label;
	void *defer_labels[32];
	int defer_count = 0;
	int run_defer = 0;

	{
		printf("Hello World\n");

		do {
			__label__ lbl;
			lbl:
			if (run_defer) {
				printf("world\n");

				/* Go to the previous defer, or the end of the `it` block */
				if (defer_count > 0) {
					defer_count -= 1;
					goto *defer_labels[defer_count];
				} else {
					goto done_label;
				}
			} else {
				defer_labels[defer_count] = &&lbl;
				defer_count += 1;
			}
		} while (0);

		do {
			__label__ lbl;
			lbl:
			if (run_defer) {
				printf("hello ");

				/* Go to the previous defer, or the end of the `it` block */
				if (defer_count > 0) {
					defer_count -= 1;
					goto *defer_labels[defer_count];
				} else {
					goto done_label;
				}
			} else {
				defer_labels[defer_count] = &&lbl;
				defer_count += 1;
			}
		} while (0);
	}

	done_label:
	run_defer = 1;
	if (defer_count > 0) {
		defer_count -= 1;
		goto *defer_labels[defer_count];
	}
} while (0);

That's still not extremely obvious at first sight, but it's at least more obvious than staring at the macro definitions. The first time through, run_defer is false, so both the defer blocks will just add their labels to the defer_labels array and increment defer_count. Then, just through normal execution (without any goto), we end up at the label called done_label, where we set run_defer to true. Because defer_count is 2, we decrement defer_count and jump to defer_labels[1], which is the last defer.

This time, because run_defer is true, we run the deferred expression printf("hello "), decrement defer_count again, and jump to defer_labels[0], which is the first defer.

The first defer runs its expression, printf("world\n"), but because defer_count is now 0, we jump back to done_label. defer_count is of course still 0, so we just exit the block.

The really nice thing about this system is that a failing assert can at any time just say goto done_label, and only the expressions which were deferred before the goto will be executed.
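
An assertion macro only needs something like this to bail out early while still running the right deferred expressions (a simplified sketch; the real Snow assertions also print details about what failed):

#define simple_assert(x) \
	do { \
		if (!(x)) { \
			fprintf(stderr, "assertion failed: %s\n", #x); \
			goto done_label; /* in scope: we're inside the it block */ \
		} \
	} while (0)

Jumping to done_label sets run_defer to 1 and unwinds exactly the defers which were registered before the failure.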

(Note: in the actual implementation in Snow, defer_labels is of course a dynamically allocated array which is realloc'd when necessary. It's also global to avoid an allocation and free for every single test case. I omitted that part because it's not that relevant, and would've made the example code unnecessarily complicated.)

Automatically call all functions created by describe

Update: A bunch of people on Reddit and Hacker News have suggested ways to accomplish this. I ended up using the __attribute__((constructor)) function attribute, which makes a given function execute before the main function. Basically, each describe creates a function called test_##name, and a constructor function called _snow_constructor_##name whose only job is to add test_##name to a global list of functions. Here's the code: https://github.com/mortie/snow/blob/7ee25ebbf0edee519c6eb6d36b82d784b0fdcbfb/snow/snow.h#L393-L421

The describe macro is meant to be used at the top level, outside of functions, because it creates functions. It's basically just this:

#define describe(name, ...) \
	void test_##name() { \
		__VA_ARGS__ \
	}

Calling describe(something, {}) will create a function called test_something. Currently, that function has to be called manually, because no other part of Snow knows what the function is named. If you have used the describe macro to define the functions test_foo, test_bar, and test_baz, the main function will look like this:

snow_main({
	test_foo();
	test_bar();
	test_baz();
})

I would have loved it if snow_main could just know what functions are declared by describe, and automatically call them. I will go over a couple of ways I tried, which eventually turned out to not be possible, and then one way which would definitely work, but which is a little too crazy, even for me.

Static array of function pointers

What if, instead of just declaring functions with describe, we also appended them to an array of function pointers? What if snow.h contained code like this:

void (*described_functions[512])();

#define describe(name, ...) \
	void test_##name() { \
		__VA_ARGS__ \
	} \
	described_functions[__COUNTER__] = &test_##name

__COUNTER__ is a special macro which starts at 0, and is incremented by one every time it's referenced. That means that, assuming nothing else uses __COUNTER__, this solution would have worked, and would have been relatively clean, if only it were valid syntax. Sadly, you can't assign to an array element like that at the top level in C, only inside functions.
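
Incidentally, that's exactly the limitation the __attribute__((constructor)) approach from the update works around: the assignment still happens inside a function, just one which the runtime calls before main. A rough sketch of that idea (simplified; Snow's actual code is linked in the update above):

#include <stdio.h>

void (*described_functions[512])(void);
int described_count = 0;

#define describe(name, ...) \
	void test_##name(void) { \
		__VA_ARGS__ \
	} \
	/* Runs before main and registers the test function */ \
	__attribute__((constructor)) \
	static void constructor_##name(void) { \
		described_functions[described_count++] = &test_##name; \
	}

describe(something, {
	printf("hello from something\n");
})

int main(void) {
	for (int i = 0; i < described_count; ++i)
		described_functions[i]();
	return 0;
}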

Appending to a macro

What if we had a macro which we appended test_##name(); to every time a function is declared by describe? It turns out that this is almost possible using some obscure GCC extensions. I found this solution on StackOverflow:

#define described_functions test_foo();

#pragma push_macro("described_functions")
#undef described_functions
#define described_functions _Pragma("pop_macro(\"described_functions\")") described_functions test_bar();

#pragma push_macro("described_functions")
#undef described_functions
#define described_functions _Pragma("pop_macro(\"described_functions\")") described_functions test_baz();

described_functions // expands to test_foo(); test_bar(); test_baz();

This actually works as a way to append tokens to a macro, at least in GCC. Snow could have used that... except for one problem: you of course can't use #define from within a macro, and we would have needed to do this from within the describe macro. I have searched far and wide for a way, even a weird GCC-specific, possibly pragma-related way, to redefine a macro from within another macro, but I haven't found anything. Close, but no cigar.

The way which actually works

I mentioned that there is actually one way to do it. Before I show you, I need to cover dlopen and dlsym.

void *dlopen(const char *filename, int flags) opens a binary (usually a shared object... usually), and returns a handle. Giving dlopen NULL as the file name gives us a handle to the main program.

void *dlsym(void *handle, const char *symbol) returns a pointer to a symbol (for example a function) in the binary which handle refers to.

We can use dlopen and dlsym like this:

#include <stdio.h>
#include <dlfcn.h>

void foo() {
	printf("hello world\n");
}

int main() {
	void *h = dlopen(NULL, RTLD_LAZY);

	void *fptr = dlsym(h, "foo");
	void (*f)() = fptr;
	f();

	dlclose(h);
}

Compile that code with gcc -Wl,--export-dynamic -o something something.c -ldl (note that -ldl goes after the source file, so the linker can resolve the dl* symbols), run ./something, and you'll see it print hello world to the terminal. That means we can actually call functions dynamically based on an arbitrary string at runtime. (The -Wl,--export-dynamic is necessary to tell the linker to export the symbols, such that they're available to us through dlsym.)

Being able to run functions based on a runtime C string, combined with our friend __COUNTER__, opens up some interesting possibilities. We could write a program like this:

#include <stdio.h>
#include <dlfcn.h>

/* Annoyingly, the concat_ and concat macros are necessary to
 * be able to use __COUNTER__ in an identifier name */
#define concat_(a, b) a ## b
#define concat(a, b) concat_(a, b)

#define describe(...) \
	void concat(test_, __COUNTER__)() { \
		__VA_ARGS__ \
	}

describe({
	printf("Hello from function 0\n");
})

describe({
	printf("Hi from function 1\n");
})

int main() {
	void *h = dlopen(NULL, RTLD_LAZY);
	char symbol[32] = { '\0' };

	for (int i = 0; i < __COUNTER__; ++i) {
		snprintf(symbol, 31, "test_%i", i);
		void *fptr = dlsym(h, symbol);
		void (*f)() = fptr;
		f();
	}

	dlclose(h);
}

Run that through the preprocessor, and we get:

void test_0() {
	{ printf("Hello from function 0\n"); }
}
void test_1() {
	{ printf("Hi from function 1\n"); }
}

int main() {
	void *h = dlopen(NULL, RTLD_LAZY);
	char symbol[32] = { '\0' };

	for (int i = 0; i < 2; ++i) {
		snprintf(symbol, 31, "test_%i", i);
		void *fptr = dlsym(h, symbol);
		void (*f)() = fptr;
		f();
	}

	dlclose(h);
}

That for loop in our main function will first call test_0(), then test_1().

I hope you understand why even though this technically works, it's not exactly something I want to include in Snow ;)
