The tar archive format, its extensions, and why GNU tar extracts in quadratic time

Date: 2022-07-23
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00014-tar.md

(If you're here from Google and just need help with tar being slow: If you trust the tar archive, extract with -P to make tar fast.)

A couple of days ago, I had a 518GiB tar.gz file (1.1 TiB uncompressed) that I had to extract. At first, GNU tar was doing a great job, chewing through the tar.gz at around 100MiB/s. But after a while, it slowed significantly; down to less than a kilobyte per second. pv's time estimate went from a bit over an hour, to multiple hours, to over a day, to almost a week. After giving it some time, and after failing to find anything helpful through Google, I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar. And I was right; before the day was over, I had a working tar extractor, and I had successfully extracted my 1.1TiB tarball.

I will explain why GNU tar is so slow later in this post, but first, let's take a look at:

The original tar file format

Tar is pretty unusual for an archive file format. There's no archive header, no index of files to facilitate seeking, no magic bytes to help the file command and its ilk detect whether a file is a tar archive, no footer, no archive-wide metadata. The only kind of thing in a tar file is a file object.

So, how do these file objects look? Well, they start with a 512-byte file object header which looks like this:

struct file_header {
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	char padding[255];
};

Followed by ceil(file_size / 512) 512-byte blocks of payload (i.e. file contents).

We have most of the attributes we would expect a file object to have: the file path, the mode, the modification time (mtime), the user/group ID, the file size, and the file type. To support symlinks and hard links, there's also a link path.

The original tar file format defines these possible values for the file_type field:

  - '0' (or, in some early implementations, the NUL byte '\0'): regular file
  - '1': hard link
  - '2': symbolic link

Future extensions to tar implement additional file types, among them '5', which represents a directory. Some old tar implementations apparently used a trailing slash '/' in a '0'-type file object to represent directories, at least according to Wikipedia.

You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). For example, a file size of 11 bytes is stored as the string "00000000013". Tar is the only file format I know of which uses base 8 to encode numbers. I don't quite understand why, since octal is neither space-efficient nor human-friendly. When representing numbers in this post, I will write them in decimal (base 10).

To encode a tar archive with one file called "hello.txt" and the content "Hello World", we need two 512-byte blocks:

  1. Bytes 0-511: Header, type='0', file_path="./hello.txt", file_size=11
  2. Bytes 512-1023: "Hello World", followed by 501 zero bytes

In addition, a tar file is supposed to end with two 512-byte blocks of zeroes (1024 zero bytes) to represent an end-of-file marker.
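
To make this concrete, here's a minimal sketch of a tar writer which produces exactly that archive (my own illustration, with error handling omitted). It also has to fill in the header_checksum field, which I haven't talked about: the checksum is the byte sum of the whole header, computed with the checksum field itself filled with spaces, and most implementations write it as six octal digits followed by a NUL and a space.

// minitar.c: a minimal sketch of a tar writer for the "hello.txt" example.
// An illustration only; error handling is omitted.
#include <stdio.h>
#include <string.h>

int main() {
	const char *content = "Hello World";
	size_t size = strlen(content); // 11

	// Block 1: the header
	char header[512] = {0};
	strcpy(header, "./hello.txt");         // file_path (offset 0)
	strcpy(header + 100, "0000644");       // file_mode, octal
	strcpy(header + 108, "0000000");       // owner_user_id
	strcpy(header + 116, "0000000");       // owner_group_id
	sprintf(header + 124, "%011zo", size); // file_size, octal: "00000000013"
	sprintf(header + 136, "%011o", 0U);    // file_mtime
	header[156] = '0';                     // file_type: regular file

	// Compute the checksum with the checksum field treated as 8 spaces,
	// then write it as six octal digits, a NUL, and a space.
	memset(header + 148, ' ', 8);
	unsigned sum = 0;
	for (int i = 0; i < 512; ++i) sum += (unsigned char)header[i];
	sprintf(header + 148, "%06o", sum);
	header[155] = ' ';
	fwrite(header, 1, 512, stdout);

	// Block 2: the payload, zero-padded to 512 bytes
	char block[512] = {0};
	memcpy(block, content, size);
	fwrite(block, 1, 512, stdout);

	// End-of-file marker: 1024 zero bytes
	char zero[512] = {0};
	fwrite(zero, 1, 512, stdout);
	fwrite(zero, 1, 512, stdout);
}

Piping this program's output through tar t should list the single entry ./hello.txt.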

The two big limitations of the original tar format are that paths can't be longer than 100 characters, and files can't be larger than 8GiB (8^11 bytes). Otherwise though, I quite like the simplicity of the format. We'll discuss how various extensions address these limitations later, but first, let's try to implement an extractor:

(Feel free to skip this source code, but you should at least skim the comments)

// tarex.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>

struct file_header {
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	char padding[255];
};

// We don't bother with great error reporting, just abort on error
#define check(x) if (!(x)) abort()

// Utilities to abort on short read/write
#define xfread(ptr, size, f) check(fread(ptr, 1, size, f) == size)
#define xfwrite(ptr, size, f) check(fwrite(ptr, 1, size, f) == size)

// Tar represents all its numbers as octal
size_t parse_octal(char *str, size_t maxlen) {
	size_t num = 0;
	for (size_t i = 0; i < maxlen && str[i] >= '0' && str[i] <= '7'; ++i) {
		num *= 8;
		num += str[i] - '0';
	}

	return num;
}

// Extract one file from the archive.
// Returns 1 if it extracted something, or 0 if it reached the end.
int extract(FILE *f) {
	unsigned char header_block[512];
	xfread(header_block, sizeof(header_block), f);
	struct file_header *header = (struct file_header *)header_block;

	// The end of the archive is represented with blocks of all-zero content.
	// For simplicity, assume that if the file path is empty, the block is all zero
	// and we reached the end.
	if (header->file_path[0] == '\0') {
		return 0;
	}

	// The file path and link path fields aren't always 0-terminated, so we need to copy them
	// into our own buffers, otherwise we break on files with exactly 100 character paths.
	char file_path[101] = {0};
	memcpy(file_path, header->file_path, 100);
	char link_path[101] = {0};
	memcpy(link_path, header->link_path, 100);

	// We need these for later
	size_t file_size = parse_octal(header->file_size, sizeof(header->file_size));
	FILE *out_file = NULL;

	if (header->file_type == '0' || header->file_type == '\0') {
		// A type of '0' means that this is a plain file.
		// Some early implementations also use a NUL character ('\0') instead of an ASCII zero.

		fprintf(stderr, "Regular file: %s\n", file_path);
		out_file = fopen(file_path, "w");
		check(out_file != NULL);

	} else if (header->file_type == '1') {
		// A type of '1' means that this is a hard link.
		// That means we create a hard link at 'file_path' which links to the file at 'link_path'.

		fprintf(stderr, "Hard link: %s -> %s\n", file_path, link_path);
		check(link(link_path, file_path) >= 0);

	} else if (header->file_type == '2') {
		// A type of '2' means that this is a symbolic link.
		// That means we create a symlink at 'file_path' which links to the file at 'link_path'.

		fprintf(stderr, "Symbolic link: %s -> %s\n", file_path, link_path);
		check(symlink(link_path, file_path) >= 0);

	} else if (header->file_type == '5') {
		// A type of '5' means that this is a directory.

		fprintf(stderr, "Directory: %s\n", file_path);
		check(mkdir(file_path, 0777) >= 0);

		// Directories sometimes use the size field, but they don't contain data blocks.
		// Zero out file_size to avoid skipping entries.
		file_size = 0;

	} else {
		// There are other possible fields added by various tar implementations and standards,
		// but we'll ignore those for this implementation.
		fprintf(stderr, "Unsupported file type %c: %s\n", header->file_type, file_path);
	}

	// We have read the header block, now we need to read the payload.
	// If we're reading a file (i.e. if 'out_file' is non-NULL) we will also write the body,
	// but otherwise we'll just skip it.
	char block[512];
	while (file_size > 0) {
		xfread(block, sizeof(block), f);
		size_t n = file_size > 512 ? 512 : file_size;

		file_size -= n;
		if (out_file != NULL) {
			xfwrite(block, n, out_file);
		}
	}

	if (out_file != NULL) {
		check(fclose(out_file) >= 0);
	}

	// Indicate that we have successfully extracted a file object, and are ready to read the next
	return 1;
}

int main() {
	while (extract(stdin));
}

Let's see it in action:

~/tarex $ ls
tarex.c testdir
~/tarex $ gcc -o tarex tarex.c
~/tarex $ tree
.
├── tarex.c
├── tarex
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt

~/tarex $ tar c testdir >testdir.tar
~/tarex $ mkdir extract && cd extract

~/tarex/extract $ ../tarex <../testdir.tar
Directory: testdir/
Symbolic link: testdir/hello-symlink -> hello.txt
Directory: testdir/subdir/
Regular file: testdir/hello.txt
Regular file: testdir/subdir/file.txt

~/tarex/extract $ tree
.
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt

The UStar file format

The first major extension to the tar file format we will look at is the UStar format, which increases the file path length limit to 256 characters and adds some new file types. The header is expanded to this:

struct file_header {
	// Original tar header fields
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	// New UStar fields
	char magic_bytes[6];
	char version[2];
	char owner_user_name[32];
	char owner_group_name[32];
	char device_major_number[8];
	char device_minor_number[8];
	char prefix[155];

	char padding[12];
};

We now have some magic bytes (defined to be "ustar\0" for the UStar format), as well as the owner user/group names. But most importantly, we have a prefix field, which allows up to 256 character file paths: if the prefix is non-empty, the full path is <prefix> '/' <file_path>. (The prefix applies only to the file path; link paths don't use it, so they are still limited to 100 characters.) With UStar, instead of just extracting the bytes from file_path like before, we must construct the file path like this:

void read_path(char dest[257], char path[100], char prefix[155]) {
	// If there's no prefix, use the path directly
	if (prefix[0] == '\0') {
		memcpy(dest, path, 100);
		dest[100] = '\0';
		return;
	}

	// If there is a prefix, the path is: <prefix> '/' <path>
	size_t prefix_len = strnlen(prefix, 155);
	memcpy(dest, prefix, prefix_len);
	dest[prefix_len] = '/';
	memcpy(&dest[prefix_len + 1], path, 100);
	dest[prefix_len + 101] = '\0';
}

int extract(FILE *f) {
	/* ... */

	// The prefix only applies to the file path; the link path is still
	// copied straight out of the 100-byte link_path field like before.
	char file_path[257];
	read_path(file_path, header->file_path, header->prefix);

	/* ... */
}

The original tar format had the file types '0' (or '\0'), '1' and '2', for regular files, hard links and symlinks. UStar defines these additional file types:

  - '3': character device (using the device_major_number and device_minor_number fields)
  - '4': block device (using the device_major_number and device_minor_number fields)
  - '5': directory
  - '6': FIFO (named pipe)
  - '7': contiguous file (rarely used, and usually treated as a regular file)

This is definitely an improvement, but we can still only encode up to 256 character long paths. And that 8GiB file size limit still exists. Which leads us to:

The pax file format

The POSIX.1-2001 standard introduced the pax command line tool, and with it, a new set of extensions to the tar file format. This format is identical to UStar, except that it adds two new file object types: 'x' and 'g'. Both of these types let us define "extended header records", as the spec calls them. Records set with 'x' apply only to the next file, while records set with 'g' apply to all following files.

With this new extended header, we can encode the access and modification times with more precision, user/group IDs above 8^7, file sizes over 8^11, file paths of arbitrary length, and a whole lot more. The records are in the payload of the extended header file object, and use a simple length-prefixed key/value syntax. To represent our "hello.txt" example file with an access time attribute, we need these four 512-byte blocks:

  1. Header, type='x', file_size=30
  2. "30 atime=1658409251.551879906\n", followed by 482 zeroes
  3. Header, type='0', file_path="hello.txt", file_size=11
  4. "Hello World", followed by 501 zero bytes

Interestingly, these extended header records all seem to use decimal (base 10). On the one hand, using base 10 makes sense, but on the other hand, wouldn't it be nice to stick to one way of representing numbers?
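
The record grammar is simple, but it's still a grammar the extractor has to parse. Here's a sketch of a record parser (my own illustration, not code from a real extractor), assuming the whole extended header payload has already been read into a buffer:

#include <stdio.h>
#include <string.h>

// Each record is "<decimal length> <key>=<value>\n", where the length
// counts the entire record: the length digits, the space, the key/value
// pair, and the trailing newline.
void parse_pax_records(char *buf, size_t len) {
	size_t off = 0;
	while (off < len) {
		// Parse the decimal length prefix
		size_t record_len = 0;
		size_t i = off;
		while (i < len && buf[i] >= '0' && buf[i] <= '9') {
			record_len = record_len * 10 + (buf[i] - '0');
			++i;
		}
		if (record_len == 0 || off + record_len > len) break;
		++i; // skip the space after the length

		// The rest of the record is "<key>=<value>\n"
		char *key = &buf[i];
		char *eq = memchr(key, '=', off + record_len - i);
		if (eq == NULL) break;
		*eq = '\0';
		buf[off + record_len - 1] = '\0'; // overwrite the '\n'

		printf("%s = %s\n", key, eq + 1);
		off += record_len;
	}
}

int main() {
	char buf[] = "30 atime=1658409251.551879906\n";
	parse_pax_records(buf, sizeof(buf) - 1); // prints: atime = 1658409251.551879906
}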

Anyways, we can see that the file format has become quite complex now. Just the file path can be provided in any of four different ways:

  1. In the file_path field.
  2. Split across the prefix and file_path fields (UStar).
  3. As a "path" record in a preceding 'x' extended header.
  4. As a "path" record in a preceding 'g' global extended header.

The GNU tar file format

GNU tar has its own file format, called gnu, which is different from the pax format. Like pax, the gnu format is based on UStar, but it has a different way of encoding arbitrary length paths and large file sizes:

  - The 'L' file type: the file object's payload contains the (arbitrarily long) file path of the next file object.
  - The 'K' file type: the file object's payload contains the (arbitrarily long) link path of the next file object.
  - Base-256 file sizes: if the first byte of the file_size field has its high bit set, the rest of the field is read as a big-endian binary number rather than an octal string.
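
As a sketch of how the size encoding could be handled (my own illustration, ignoring negative numbers and overflow), here's a parser which accepts both the octal and the base-256 form:

#include <stdint.h>
#include <stdio.h>

uint64_t parse_size(const unsigned char *field, size_t len) {
	if (field[0] & 0x80) {
		// GNU base-256: the first byte's remaining 7 bits, then the
		// rest of the field, as one big-endian binary number.
		uint64_t num = field[0] & 0x7f;
		for (size_t i = 1; i < len; ++i) {
			num = (num << 8) | field[i];
		}
		return num;
	}

	// Otherwise, a plain octal string like before
	uint64_t num = 0;
	for (size_t i = 0; i < len && field[i] >= '0' && field[i] <= '7'; ++i) {
		num = num * 8 + (field[i] - '0');
	}
	return num;
}

int main() {
	unsigned char octal[12] = "00000000013";
	unsigned char base256[12] = {0x80, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0};
	printf("%llu\n", (unsigned long long)parse_size(octal, 12));   // 11
	printf("%llu\n", (unsigned long long)parse_size(base256, 12)); // 8589934592, i.e. 8GiB
}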

In some ways, I prefer this approach over the pax approach, since it's much simpler; the pax format requires the extractor to parse the record grammar. On the other hand, the pax format is both more space efficient and vastly more flexible.

In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 4 different ways of reading the link path, and 3 different ways of reading the file size.

Whatever happened to the nice and simple format we started out with?

Why GNU tar extracts in quadratic time

Our simple tar extraction implementation has what could be considered a quite serious security bug: It allows people to put files outside the directory we're extracting to. Nothing is stopping an evil archive from containing a file object with file_path="../hello.txt". You might try to fix that by just disallowing file objects from using ".." as a path component, but it's not that simple. Consider the following sequence of file objects:

  1. Symlink, file_path="./foo", link_path=".."
  2. Normal file, file_path="./foo/hello.txt"

We want to allow symlinks which point to their parent directory, since there are completely legitimate use cases for that. We could try to figure out whether a symlink will end up pointing to somewhere outside of the extraction directory, but that gets complicated real fast when you have to consider symlinks to symlinks and hard links to symlinks. It might be possible to do correctly, but it's not the solution GNU tar goes for.

When GNU tar encounters a hard link or symlink with ".." as a path component in its link_path, tar will create a regular file in its place as a placeholder, and put a note about the delayed link in a linked list data structure. When it's done extracting the entire archive, it will go through the whole list of delayed links and replace the placeholders with proper links. So far, so good.

The problem comes when trying to extract a hard link which doesn't contain ".." as a path component in its link_path. GNU tar wants to create such hard links immediately if it can. But it can't create a hard link if the target is occupied by a placeholder file. That means, every time GNU tar wants to create a hard link, it first has to walk the entire linked list of delayed links and see if the target is a delayed link. If the target is a delayed link, the new link must also be delayed.

Your time complexity alarm bells should be ringing by now. For every hard link, we walk the list of all delayed links. But it actually gets worse; for reasons I don't quite understand yet, tar will go through the entire list of delayed links again if it finds that it can create the link immediately. So for all "normal" hard links, it has to go through the entire linked list of delayed links twice.
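
In rough pseudo-C, the pattern looks something like this (a paraphrase of the logic as I understand it, not GNU tar's actual code; defer_link stands in for tar's delayed-link bookkeeping):

#include <string.h>
#include <unistd.h>

struct delayed_link {
	struct delayed_link *next;
	char *placeholder_path; // where the placeholder file was created
	/* ...link target, sources, etc... */
};

static struct delayed_link *delayed_link_head;

static void defer_link(struct delayed_link *dl, const char *file_path) {
	// (Hypothetical: record that 'file_path' must also become a link
	// once 'dl' has been resolved.)
	(void)dl; (void)file_path;
}

void create_hard_link(const char *link_path, const char *file_path) {
	// Walk every delayed link so far to check whether the link target
	// is one of the placeholder files...
	for (struct delayed_link *dl = delayed_link_head; dl; dl = dl->next) {
		if (strcmp(dl->placeholder_path, link_path) == 0) {
			// ...and if it is, this link must be delayed too.
			defer_link(dl, file_path);
			return;
		}
	}

	// An O(n) list walk for every hard link adds up to O(n^2)
	// across the whole archive.
	link(link_path, file_path);
}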

If you're a bit crafty, you can construct a tar archive which GNU tar extracts in precisely O(n^2) time; you just need to alternate between links whose link_path has ".." as a path component and thus get delayed, and "normal" hard links which don't get delayed. If you're a bit unlucky, you might have a totally benign tarball which nevertheless happens to contain a bunch of symlinks which refer to files in a parent directory, followed by a bunch of normal hard links. This is what had happened to me. My tarball happened to contain over 800 000 links with ".." as a path component. It also happened to contain over 5.4 million hard links. Every one of those hard links had to go through the entire list of every hitherto deferred link. No wonder tar got slow.

If you ever find yourself in this situation, pass the --absolute-names (or -P) parameter to tar. Tar's documentation says this about --absolute-names:

Preserve pathnames. By default, absolute pathnames (those that begin with a / character) have the leading slash removed both when creating archives and extracting from them. Also, tar will refuse to extract archive entries whose pathnames contain .. or whose target directory would be altered by a symlink. This option suppresses these behaviors.

You would never guess it from reading the documentation, but when you pass --absolute-names during extraction, tar assumes that the archive is benign and the whole delayed linking mechanism is disabled. Make sure you trust the tar archive though! When extracted with --absolute-names, a malicious archive will be able to put files anywhere it wants.

I'm absolutely certain that it's possible to make GNU tar extract in O(n) without --absolute-names by replacing the linked list with a hash map. But that's an adventure for another time.
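
The idea would be to key the delayed links on their placeholder paths, so the per-link check becomes a single lookup. A sketch, reusing the names from the paraphrase above (hashmap and hashmap_get are a hypothetical string-keyed map, not a real library):

struct hashmap *delayed_by_placeholder_path;

void create_hard_link(const char *link_path, const char *file_path) {
	// One O(1) hash lookup instead of an O(n) list walk
	struct delayed_link *dl = hashmap_get(delayed_by_placeholder_path, link_path);
	if (dl != NULL) {
		defer_link(dl, file_path);
		return;
	}
	link(link_path, file_path);
}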

References

These are the documents I've drawn information from when researching for my tar extractor and this blog post:

If I have represented anything inaccurately in this post, please do correct me.
