Faster virtual machines: Speeding up programming language execution

Date: 2023-01-15
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00015-fast-interpreters.md

In this post, I hope to explore how interpreters are often implemented, what a "virtual machine" means in this context, and how to make them faster.

Note: This post will contain a lot of C source code. Most of it is fairly simple C which should be easy to follow, but some familiarity with the C language is suggested.

What is a (virtual) machine?

For our purposes, a "machine" is anything which can read some sequence of instructions ("code") and act upon them. A Turing machine reads instructions from the cells of a tape and changes its state accordingly. Your CPU is a machine which reads instructions in the form of binary data representing x86 or ARM machine code and modifies its state accordingly. A LISP machine reads instructions in the form of LISP code and modifies its state accordingly.

Your computer's CPU is a physical machine, with all the logic required to read and execute its native machine code implemented as circuitry in hardware. But we can also implement a "machine" to read and execute instructions in software. A software implementation of a machine is what we call a virtual machine. QEMU is an example of a project which implements common CPU instruction sets in software, so we can take native machine code for ARM64 and run it in a virtual ARM64 machine regardless of what architecture our physical CPU implements.

But we don't have to limit ourselves to virtual machines which emulate real CPU architectures. In the world of programming languages, a "virtual machine" is usually used to mean something which takes some language-specific code and executes it.

What is bytecode?

Many programming languages are separated into roughly two parts: the front-end, which parses your textual source code and emits some form of machine-readable code, and the virtual machine, which executes the instructions in this machine-readable code. This machine-readable code that's inteneded to be executed by a virtual machine is usually called "bytecode".

You're probably familiar with this from Java, where the Java compiler produces .class files containing Java bytecode, and the Java Virtual Machine (JVM) executes these .class files. (You may be more familiar with .jar files, which are essentially zip files with a bunch of .class files.)

Python is also an example of a programming language with these two parts. The only difference between Python's approach and Java's approach is that the Python compiler and the Python virtual machine are part of the same executable, and you're not meant to distribute the Python bytecode. But Python also generates bytecode files; the __pycache__ directories and .pyc files Python generates contains Python bytecode. This lets Python avoid compiling your source code to bytecode every time you run a Python script, speeding up startup times.

So how does this "bytecode" look like? Well, it usually has a concept of an "operation" (represented by some numeric "op-code") and "operands" (some fixed numeric argument which somehow modifies the behavior of the instruction). But other than that, it varies wildly between languages.

Note: Sometimes "bytecode" is used interchangeably with any form of code intended to be executed by a virtual machine. Other times, it's used to mean specifically code where an instruction is always encoded using exactly one byte for an "op-code".

Our own bytecode

In this post, we will invent our own bytecode with these characteristics:

This is the set of instructions our bytecode language will have:

00000000: CONSTANT c:
Push 'c' onto the stack.
> push(c);

00000001: ADD:
Pop two values from the stack, push their
sum onto the stack.
> b = pop();
> a = pop();
> push(a + b);

00000010: PRINT:
Pop a value from the stack and print it.
> print(pop());

00000011: INPUT:
Read a value from some external input,
and push it onto the stack.
> push(input())

00000100: DISCARD:
Pop a value from the stack and discard it.
> pop();

00000101: GET offset:
Find the value at the 'offset' from the
top of the stack and push it onto the stack.
> val = stackptr[offset];
> push(val);

0000110: SET offset:
Pop a value from the stack, replace the value
at the 'offset' with the popped value.
> val = pop();
> stackptr[offset] = val;

00000110: CMP:
Compare two values on the stack, push -1 if
the first is smaller than the second, 1 if the
first is bigger than the second, and 0 otherwise.
> b = pop();
> a = pop();
> if (a > b) push(1);
> else if (a < b) push(-1);
> else push(0);

00000111: JGT offset:
Pop the stack, jump relative to the given 'offset'
if the popped value is positive.
> val = pop();
> if (val > 0) instrptr += offset;

00001000: HALT:
Stop execution

I'm sure you can imagine expanding this instruction set with more instructions. Maybe a SUB instruction, maybe more jump instructions, maybe more I/O. If you want, you can play along with this post and expand my code to implement your own custom instructions!

Throughout this blog post, I will be using an example program which multiplies two numbers together. Here's the program in pseudocode:

A = input()
B = input()

Accumulator = 0
do {
	Accumulator = Accumulator + A
	B = B - 1
} while (B > 0)

print(Accumulator)

(This program assumes B is greater than 0 for simplicity.)

Here's that program implemented in our bytecode language:

INPUT // A = input()
INPUT // B = input()

CONSTANT 0 // Accumulator = 0

// Loop body:

// Accumulator + A
GET 0
GET 3
ADD
// Accumulator = <result>
SET 0

// B - 1
GET 1
CONSTANT -1
ADD
// B = <result>
SET 1

// B CMP 0
GET 1
CONSTANT 0
CMP
// Jump to start of loop body if <result> > 0
// We get the value -43 by counting the bytes from
// the first instruction in the loop body.
// Operations are 1 byte, operands are 4 bytes.
JGT -43

// Accumulator
GET 0
// print(<result>)
PRINT

HALT

Note: If you're viewing this in a browser with JavaScript enabled, the above code should be interactive! Press the Step or Run buttons to execute it. The bar on the right represents the stack. The yellow box indicates the current stack pointer, a blinking green box means a value is being read, a blinking red box means a value is being written. The blue rectangle in the code area shows the instruction pointer. You can also edit the code; try your hand at writing your own program!

Here's a link which takes you directly to the interactive virtual machine.

You should take some moments to convince yourself that the bytecode truly reflects the pseudocode. Maybe you can even imagine how you could write a compiler which takes a syntax tree reflecting the source code and produces bytecode? (Hint: Every expression and sub-expression leaves exactly one thing on the stack.)

Implementing a bytecode interpreter

A bytecode interpreter can be basically just a loop with a switch statement. Here's my shot at implementing one in C for the bytecode language we invented:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum op {
	OP_CONSTANT, OP_ADD, OP_PRINT, OP_INPUT, OP_DISCARD,
	OP_GET, OP_SET, OP_CMP, OP_JGT, OP_HALT,
};

void interpret(unsigned char *bytecode, int32_t *input) {
	// Create a "stack" of 128 integers,
	// and a "stack pointer" which always points to the first free stack slot.
	// That means the value at the top of the stack is always 'stackptr[-1]'.
	int32_t stack[128];
	int32_t *stackptr = stack;

	// Create an instruction pointer which keeps track of where in the bytecode we are.
	unsigned char *instrptr = bytecode;

	// Some utility macros, to pop a value from the stack, push a value to the stack,
	// peek into the stack at an offset, and interpret the next 4 bytes as a 32-bit
	// signed integer to read an instruction's operand.
	#define POP() (*(--stackptr))
	#define PUSH(val) (*(stackptr++) = (val))
	#define STACK(offset) (*(stackptr - 1 - offset))
	#define OPERAND() ( \
		((int32_t)instrptr[1] << 0) | \
		((int32_t)instrptr[2] << 8) | \
		((int32_t)instrptr[3] << 16) | \
		((int32_t)instrptr[4] << 24))

	int32_t a, b;

	// This is where we just run one instruction at a time, using a switch statement
	// to figure out what to do in response to each op-code.
	while (1) {
		enum op op = (enum op)*instrptr;
		switch (op) {
		case OP_CONSTANT:
			PUSH(OPERAND());
			// We move past 5 bytes, 1 for the op-code, 4 for the 32-bit operand
			instrptr += 5; break;
		case OP_ADD:
			b = POP();
			a = POP();
			PUSH(a + b);
			// This instruction doesn't have an operand, so we move only 1 byte
			instrptr += 1; break;
		case OP_PRINT:
			a = POP();
			printf("%i\n", (int)a);
			instrptr += 1; break;
		case OP_INPUT:
			PUSH(*(input++));
			instrptr += 1; break;
		case OP_DISCARD:
			POP();
			instrptr += 1; break;
		case OP_GET:
			a = STACK(OPERAND());
			PUSH(a);
			instrptr += 5; break;
		case OP_SET:
			a = POP();
			STACK(OPERAND()) = a;
			instrptr += 5; break;
		case OP_CMP:
			b = POP();
			a = POP();
			if (a > b) PUSH(1);
			else if (a < b) PUSH(-1);
			else PUSH(0);
			instrptr += 1; break;
		case OP_JGT:
			a = POP();
			if (a > 0) instrptr += OPERAND();
			else instrptr += 5;
			break;
		case OP_HALT:
			return;
		}
	}
}

That's it. That's a complete virtual machine for our little bytecode language. Let's give it a spin! Here's a main function which exercises it:

int main(int argc, char **argv) {
	unsigned char program[] = {
		OP_INPUT, OP_INPUT,
		OP_CONSTANT, 0, 0, 0, 0,

		OP_GET, 0, 0, 0, 0,
		OP_GET, 3, 0, 0, 0,
		OP_ADD,
		OP_SET, 0, 0, 0, 0,

		OP_GET, 1, 0, 0, 0,
		OP_CONSTANT, 0xff, 0xff, 0xff, 0xff, // -1 32-bit little-endian (two's complement)
		OP_ADD,
		OP_SET, 1, 0, 0, 0,

		OP_GET, 1, 0, 0, 0,
		OP_CONSTANT, 0, 0, 0, 0,
		OP_CMP,
		OP_JGT, 0xd5, 0xff, 0xff, 0xff, // -43 in 32-bit little-endian (two's complement)

		OP_GET, 0, 0, 0, 0,
		OP_PRINT,

		OP_HALT,
	};
	int32_t input[] = {atoi(argv[1]), atoi(argv[2])};
	interpret(program, input);
}

Note: We use two's complement to represent negative numbers, because that's what the CPU does. A 32-bit number can represent the numbers between 0 and 4'294'967'295. Two's complement is a convention where the numbers between 0 and 2'147'483'647 are treated normally, and the numbers between 2'147'483'648 and 4'294'967'295 represent the numbers between -2'147'483'648 and -1.

Little-endian just means that order of the bytes are "swapped" compared to what you'd expect. For example, to express the number 35799 (10001011'11010111 in binary) as 2 bytes in little-endian, we put the last 8 bits first and the first 8 bits last: unsigned char bytes[] = {0b11010111, 0b10001011}. It's a bit counter-intuitive, but it's how most CPU architectures these days represent numbers larger than one byte.

When I compile and run the full C program with the inputs 3 and 5, it prints 15. Success!

If I instead ask it to calculate 1 * 100'000'000, my laptop (Apple M1 Pro, Apple Clang 14.0.0 with -O3) runs the program in 1.4 seconds. My desktop (AMD R9 5950x, GCC 12.2.0 with -O3) runs the same program in 1.1 seconds. The loop contains 12 instructions, and there are 6 instructions outside of the loop, so a complete run executes 100'000'000*12+6=1'200'000'006 instructions. That means my laptop runs 856 million bytecode instructions per second ("IPS") on average, and my desktop runs 1.1 billion instructions per second.

(Link)
Clang + Apple M1 Pro GCC + AMD R9 5950x
Time IPS Time IPS
Basic bytecode interpreter 1'401ms856M1'096ms1'095M

Note: The actual benchmarked code defines the program variable in a separate translation unit from the main function and interpret function, and link-time optimization is disabled. This prevents the compiler from optimizing based on the knowledge of the bytecode program.

Not bad, but can we do better?

Managing our own jump table

Looking at Godbolt, the assembly generated for our loop + switch is roughly like this:

loop:
	jmp jmp_table[*instrptr]

jmp_table:
	.quad case_op_constant
	.quad case_op_add
	.quad case_op_print
	.quad case_op_discard
	.quad case_op_get
	.quad case_op_set
	.quad case_op_cmp
	.quad case_op_jgt
	.quad case_op_halt

case_op_constant:
	; (code...)
	add instrptr, 5
	jmp loop

case_op_add:
	; (code...)
	add instrptr, 1
	jmp loop

; etc

Note: This isn't real x86 or ARM assembly, but it gives an idea of what's going on without getting into the weeds of assembly syntax.

We can see that the compiler generated a jump table; a table of memory addresses to jump to. At the beginning of each iteration of the loop, it looks up the target address in the jump table based on the opcode at the instruction pointer, then jumps to it. And at the end of executing each switch case, it jumps back to the beginning of the loop. This is fine, but it's a bit unnecessary to jump to the start of the loop just to immediately jump again based on the next op-code. We could just replace the jmp loop with jmp jmp_table[*instrptr] like this:

	jmp jmp_table[*instrptr]

jmp_table:
	.quad case_op_constant
	.quad case_op_add
	.quad case_op_print
	.quad case_op_discard
	.quad case_op_get
	.quad case_op_set
	.quad case_op_cmp
	.quad case_op_jgt
	.quad case_op_halt

case_op_constant:
	; code
	add instrptr, 5
	jmp jmp_table[*instrptr]

case_op_add:
	; code
	add instrptr, 1
	jmp jmp_table[*instrptr]

; etc

This has the advantage of using one less instruction per iteration, but that's negligible; completely predictable jumps such as our jmp loop are essentially free. However, there's a much bigger advantage: the CPU can exploit the inherent predictability of our bytecode instruction stream to improve its branch prediction. For example, a CMP instruction is usually going to be followed by the JGE instruction, so the CPU can start speculatively executing the JGE instruction before it's even done executing the CMP instruction. (At least that's what I believe is happeneing; figuring out why something is as fast or slow as it is, at an instruction-by-instruction level, is incredibly difficult on modern CPUs.)

Sadly, standard C doesn't let us express this style of jump table. But GNU C does! With GNU's Labels as Values extension, we can create our own jump table and indirect goto:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum op {
	OP_CONSTANT, OP_ADD, OP_PRINT, OP_INPUT, OP_DISCARD,
	OP_GET, OP_SET, OP_CMP, OP_JGT, OP_HALT,
};

void interpret(unsigned char *bytecode, int32_t *input) {
	int32_t stack[128];
	int32_t *stackptr = stack;
	unsigned char *instrptr = bytecode;

	#define POP() (*(--stackptr))
	#define PUSH(val) (*(stackptr++) = (val))
	#define STACK(offset) (*(stackptr - 1 - offset))
	#define OPERAND() ( \
		((int32_t)instrptr[1] << 0) | \
		((int32_t)instrptr[2] << 8) | \
		((int32_t)instrptr[3] << 16) | \
		((int32_t)instrptr[4] << 24))

	// Note: This jump table must be synchronized with the 'enum op',
	// so that `jmptable[op]` represents the label with the code for the instruction 'op'
	void *jmptable[] = {
		&&case_constant, &&case_add, &&case_print, &&case_input, &&case_discard,
		&&case_get, &&case_set, &&case_cmp, &&case_jgt, &&case_halt,
	};

	int32_t a, b;
	goto *jmptable[*instrptr];

case_constant:
	PUSH(OPERAND());
	instrptr += 5; goto *jmptable[*instrptr];
case_add:
	b = POP();
	a = POP();
	PUSH(a + b);
	instrptr += 1; goto *jmptable[*instrptr];
case_print:
	a = POP();
	printf("%i\n", (int)a);
	instrptr += 1; goto *jmptable[*instrptr];
case_input:
	PUSH(*(input++));
	instrptr += 1; goto *jmptable[*instrptr];
case_discard:
	POP();
	instrptr += 1; goto *jmptable[*instrptr];
case_get:
	a = STACK(OPERAND());
	PUSH(a);
	instrptr += 5; goto *jmptable[*instrptr];
case_set:
	a = POP();
	STACK(OPERAND()) = a;
	instrptr += 5; goto *jmptable[*instrptr];
case_cmp:
	b = POP();
	a = POP();
	if (a > b) PUSH(1);
	else if (a < b) PUSH(-1);
	else PUSH(0);
	instrptr += 1; goto *jmptable[*instrptr];
case_jgt:
	a = POP();
	if (a > 0) instrptr += OPERAND();
	else instrptr += 5;
	goto *jmptable[*instrptr];
case_halt:
	return;
}

With this interpreter loop, my laptop calculates 1 * 100'000'000 in 898ms, while my desktop does it in 1 second. It's interesting that Clang + M1 is significantly slower than GCC + AMD with the basic interpreter but significantly faster for this custom jump table approach. At least it's a speed-up in both cases.

(Link)
Clang + Apple M1 Pro GCC + AMD R9 5950x
Time IPS Time IPS
Basic bytecode interpreter 1'401ms856M1'096ms1'095M
Custom jump table 898ms1'336M1'011ms1'187M

Getting rid of the switch entirely with tail calls

Both of the implementations so far have essentially been of the form, "Look at the current instruction, and decide what code to run with some kind of jump table". But we don't actually need that. Instead of doing the jump table look-up every time, we could do the look-up once for every instruction before starting execution. Instead of an array of op codes, we could have an array of pointers to some machine code.

The easiest and most standard way to do this would be to have each instruction as its own function, and let that function tail-call the next function. Here's an implementation of that:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

union instr {
	void (*fn)(union instr *instrs, int32_t *stackptr, int32_t *input);
	int32_t operand;
};

#define POP() (*(--stackptr))
#define PUSH(val) (*(stackptr++) = (val))
#define STACK(offset) (*(stackptr - 1 - offset))
#define OPERAND() (instrs[1].operand)

static void op_constant(union instr *instrs, int32_t *stackptr, int32_t *input) {
	PUSH(OPERAND());
	instrs[2].fn(&instrs[2], stackptr, input);
}

static void op_add(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t b = POP();
	int32_t a = POP();
	PUSH(a + b);
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_print(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = POP();
	printf("%i\n", (int)a);
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_input(union instr *instrs, int32_t *stackptr, int32_t *input) {
	PUSH(*(input++));
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_discard(union instr *instrs, int32_t *stackptr, int32_t *input) {
	POP();
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_get(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = STACK(OPERAND());
	PUSH(a);
	instrs[2].fn(&instrs[2], stackptr, input);
}

static void op_set(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = POP();
	STACK(OPERAND()) = a;
	instrs[2].fn(&instrs[2], stackptr, input);
}

static void op_cmp(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t b = POP();
	int32_t a = POP();
	if (a > b) PUSH(1);
	else if (a < b) PUSH(-1);
	else PUSH(0);
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_jgt(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = POP();
	if (a > 0) instrs += instrs[1].operand;
	else instrs += 2;
	instrs[0].fn(&instrs[0], stackptr, input);
}

static void op_halt(union instr *instrs, int32_t *stackptr, int32_t *input) {
	return;
}

This time, we can't just feed our interpreter an array of bytes as the bytecode, since there isn't really "an interpreter", there's just a collection of functions. We can manually create a program containing function pointers like this:

int main(int argc, char **argv) {
	union instr program[] = {
		{.fn = op_input}, {.fn = op_input},

		{.fn = op_constant}, {.operand = 0},

		{.fn = op_get}, {.operand = 0},
		{.fn = op_get}, {.operand = 3},
		{.fn = op_add},
		{.fn = op_set}, {.operand = 0},

		{.fn = op_get}, {.operand = 1},
		{.fn = op_constant}, {.operand = -1},
		{.fn = op_add},
		{.fn = op_set}, {.operand = 1},

		{.fn = op_get}, {.operand = 1},
		{.fn = op_constant}, {.operand = 0},
		{.fn = op_cmp},
		{.fn = op_jgt}, {.operand = -19},

		{.fn = op_get}, {.operand = 0},
		{.fn = op_print},

		{.fn = op_halt},
	};

	int32_t input[] = {atoi(argv[1]), atoi(argv[2])};
	int32_t stack[128];
	program[0].fn(program, stack, input);
}

And that works.

In a real use-case, you would probably want to have some code to automatically generate such an array of union instr based on bytecode, but we'll ignore that for now.

With this approach, my laptop calculates 1 * 100'000'000 in 841ms, while my desktop does it in only 553ms. It's not a huge improvement for the Clang + M1 case, but it's almost twice as fast with GCC + AMD! And compared to the previous approach, it's written in completely standard ISO C99, with the caveat that the compiler must perform tail call elimination. (Most compilers will do this at higher optimization levels, and most compilers let us specify per-function optimization levels with pragmas, so that's not a big issue in practice.)

(Link)
Clang + Apple M1 Pro GCC + AMD R9 5950x
Time IPS Time IPS
Basic bytecode interpreter 1'401ms856M1'096ms1'095M
Custom jump table 898ms1'336M1'011ms1'187M
Tail calls 841ms1'427M553ms2'171M

Note: The timings from the benchmark includes the time it takes to convert the bytecode into this function pointer array form.

Final step: A compiler

All approaches so far have relied on finding ever faster ways to select which source code snippet to run next. As it turns out, the fastest way to do that is to simply put the right source code snippets after each other!

If we have the following bytecode:

CONSTANT 5
INPUT
ADD
PRINT

We can just generate C source code to do what we want:

PUSH(5);

PUSH(INPUT());

b = POP();
a = POP();
PUSH(a + b);

printf("%i\n", (int)POP());

We can then either shell out to GCC/Clang, or link with libclang to compile the generated C code. This also lets us take advantage of those projects's excellent optimizers.

Note: At this point, we don't have a "virtual machine" anymore.

One challenge is how to deal with jumps. The easiest solution from a code generation perspective is probably to wrap all the code in a switch statement in a loop:

int32_t index = 0;
while (1) {
	switch (index) {
	case 0:
		PUSH(5);

	case 5:
		PUSH(INPUT());

	case 6:
		a = POP();
		b = POP();
		PUSH(a + b);

	case 7:
		printf("%i\n", (int)POP());
	}
}

With this approach, a jump to instruction N becomes index = N; break;.

Note: Remember that in C, switch statement cases fall through to the next case unless you explicitly jump to the end with a break. So once the code for instruction 5 is done, we just fall through to instruction 6.

Here's my implementation:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum op {
	OP_CONSTANT, OP_ADD, OP_PRINT, OP_INPUT, OP_DISCARD,
	OP_GET, OP_SET, OP_CMP, OP_JGT, OP_HALT,
};

void write_operand(unsigned char *i32le, FILE *out) {
	fprintf(out, "    operand = %i;\n",
		(int)i32le[0] | (int)i32le[1] << 8 | (int)i32le[2] << 16 | (int)i32le[3] << 24);
}

void compile(unsigned char *bytecode, size_t size, FILE *out) {
	fputs(
		"#include <stdio.h>\n"
		"#include <stdint.h>\n"
		"#include <stdlib.h>\n"
		"\n"
		"int main(int argc, char **argv) {\n"
		"  int32_t stack[128];\n"
		"  int32_t *stackptr = stack;\n"
		"  char **inputptr = &argv[1];\n"
		"\n"
		"#define POP() (*(--stackptr))\n"
		"#define PUSH(val) (*(stackptr++) = (val))\n"
		"#define STACK(offset) (*(stackptr - 1 - offset))\n"
		"\n"
		"  int32_t a, b, operand;\n"
		"  int32_t index = 0;\n"
		"  while (1) switch (index) {\n",
		out);

	for (size_t i = 0; i < size;) {
		fprintf(out, "  case %zi:\n", i);

		enum op op = (enum op)bytecode[i];
		switch (op) {
		case OP_CONSTANT:
			write_operand(&bytecode[i + 1], out);
			fputs("    PUSH(operand);\n", out);
			i += 5; break;

		case OP_ADD:
			fputs(
				"    b = POP();\n"
				"    a = POP();\n"
				"    PUSH(a + b);\n",
				out);
			i += 1; break;

		case OP_PRINT:
			fputs(
				"    a = POP();\n"
				"    printf(\"%i\\n\", (int)a);\n",
				out);
			i += 1; break;

		case OP_INPUT:
			fputs("    PUSH(atoi(*(inputptr++)));\n", out);
			i += 1; break;

		case OP_DISCARD:
			fputs("    POP();\n", out);
			i += 1; break;

		case OP_GET:
			write_operand(&bytecode[i + 1], out);
			fputs(
				"    a = STACK(operand);\n"
				"    PUSH(a);\n",
				out);
			i += 5; break;

		case OP_SET:
			write_operand(&bytecode[i + 1], out);
			fputs(
				"    a = POP();\n"
				"    STACK(operand) = a;\n",
				out);
			i += 5; break;

		case OP_CMP:
			fputs(
				"    b = POP();\n"
				"    a = POP();\n"
				"    if (a > b) PUSH(1);\n"
				"    else if (a < b) PUSH(-1);\n"
				"    else PUSH(0);\n",
				out);
			i += 1; break;

		case OP_JGT:
			write_operand(&bytecode[i + 1], out);
			fprintf(out,
				"    a = POP();\n"
				"    if (a > 0) { index = %zi + operand; break; }\n",
				i);
			i += 5; break;

		case OP_HALT:
			fputs("    return 0;\n", out);
			i += 1; break;
		}
	}

	fputs(
		"  }\n"
		"\n"
		"  abort(); // If we get here, there's a missing HALT\n"
		"}",
		out);
}

If we run our compiler on the bytecode for our multiplication program, it outputs this C code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int32_t stack[128];
  int32_t *stackptr = stack;
  char **inputptr = &argv[1];

  #define POP() (*(--stackptr))
  #define PUSH(val) (*(stackptr++) = (val))
  #define STACK(offset) (*(stackptr - 1 - offset))

  int32_t a, b, operand;
  int32_t index = 0;
  while (1) switch (index) {
  case 0:
    PUSH(atoi(*(inputptr++)));
  case 1:
    PUSH(atoi(*(inputptr++)));
  case 2:
    operand = 0;
    PUSH(operand);
  case 7:
    operand = 0;
    a = STACK(operand);
    PUSH(a);

  /* ... */

  case 49:
    b = POP();
    a = POP();
    if (a > b) PUSH(1);
    else if (a < b) PUSH(-1);
    else PUSH(0);
  case 50:
    operand = -43;
    a = POP();
    if (a > 0) { index = 50 + operand; break; }
  case 55:
    operand = 0;
    a = STACK(operand);
    PUSH(a);
  case 60:
    a = POP();
    printf("%i\n", (int)a);
  case 61:
    return 0;
  }

  abort(); // If we get here, there's a missing HALT
}

If we compile the generated C code with -O3, my laptop runs the 1 * 100'000'000 calculation in 204ms! That's over 4 times faster than the fastest interpreter we've had so far. That also means we're executing our 1'200'000'006 bytecode instructions at 5'882 million instructions per second! Its CPU only runs at 3'220 million CPU clock cycles per second, meaning it's spending significantly less than a clock cycle per bytecode instruction on average. My desktop with GCC is doing even better, executing all the code in 47ms, which means a whopping 25.7 billion instructions per second!

Note that in this particular case, the compiler is able to see that some instructions always happen after each other, which means it can optimize across bytecode instructions. For example, the bytecode contains a sequence GET 1; CONSTANT -1; ADD;, which the compiler is able to prove you won't ever jump into the middle of, so it optimizes out all the implied stack manipulation code; it's optimized into a single sub instruction which subtracts the constant 1 from one register and writes the result to another.

This is kind of an important point. The compiler can generate amazing code, if it can figure out which instructions (i.e switch cases) are potential jump targets. This is information you probably have access to in the source code, so it's worth thinking about how you can design your bytecode such that GCC or Clang can figure it out when looking at your compiler output. One approach could be to add "label" bytecode instructions, and only permit jumping to such a label. With our bytecode, the only jump instruction we have jumps to a known location, since the jump offset is an immediate operand to the instruction. If we added an instruction which reads the jump target from the stack instead, we might quickly get into situations where GCC/Clang has lost track of which instructions can be jump targets, and must therefore make sure not to optimize across instruction boundaries.

We can preventing the compiler from optimizing across instruction boundaries by inserting this code after the case 61: (the code for the HALT instruction):

if (argc > 100) { PUSH(argc); index = argc % 61; break; }

With this modification, every single instruction might be a branch target, so every instruction must make sense in its own right regardless of which instruction was executed before or how the stack looks.

This time, the 1 * 100'000'000 calculation happens in 550ms on my laptop with Clang, which is still not bad. It means we're executing 2'181 million bytecode instructions per second. My desktop is doing even better, at 168ms.

At this point, I got curious about whether it's the CPU or the compiler making the difference, so the next table contains all the benchmarks for both compilers on both systems.

(Link)
Apple M1 Pro AMD R9 5950x
GCC 12.1.0 Clang 14.0.0 GCC 12.2.0 Clang 15.0.2
Basic bytecode interpreter 1'902ms1'402ms 1'135ms2'347ms
Custom jump table 816ms897ms 1'023ms912ms
Tail calls 1'068ms843ms 557ms645ms
Compiler (pessimized) 342ms548ms 172ms302ms
Compiler 71ms205ms 52ms161ms

I have no intelligent commentary on those numbers. They're all over the place. In the basic interpreter case for example, GCC is much faster than Clang on the AMD CPU, but Clang is much faster than GCC on the Apple CPU. It's the opposite in the custom jump table case, where GCC is much master than Clang on the Apple CPU, but Clang is much faster than GCC on the AMD CPU. The overall pattern we've been looking at holds though, for the most part: for any given CPU + compiler combination, every implementation I've introduced is faster than the one before it. The big exception is the tail call version, where the binary compiled by GCC performs horribly on the Apple CPU (even though it performs excellently on the AMD CPU!).

If anything though, this mess of numbers indicates the value of knowing about all the different possible approaches and choosing the right one for the situation. Which takes us to...

Bringing it all together

We have 4 different implementations of the same bytecode , all with different advantages and drawbacks. And even though every instruction does the same thing in every implementation, we have written 4 separate implementations of every instruction.

That seems unnecessary. After all, we know that ADD, in every implementation, will do some variant of this:

b = POP();
a = POP();
PUSH(a + b);
GO_TO_NEXT_INSTRUCTION();

What exactly it means to POP or to PUSH or to go to the next instruction might depend on the implementation, but the core functionality is the same for all of them. We can utilize that regularity to specify the instructions only once in a way that's re-usable across implementations using so-called X macros.

We create a file instructions.x which contains code to define all our instructions:

X(CONSTANT, 1, {
	PUSH(OPERAND());
	NEXT();
})

X(ADD, 0, {
	b = POP();
	a = POP();
	PUSH(a + b);
	NEXT();
})

// etc...

Let's say we want to create an instructions.h which contains an enum op with all the operation types and a const char *op_names[] which maps enum values to strings. We can implement that by doing something like this:

#ifndef INSTRUCTIONS_H
#define INSTRUCTIONS_H

enum op {
#define X(name, has_operand, code...) OP_ ## name,
#include "instructions.x"
#undef X
};

static const char *op_names[] = {
#define X(name, has_operand, code...) [OP_ ## name] = "OP_" #name,
#include "instructions.x"
#undef X
};

#endif

This code might look a bit confusing at first glance, but it makes sense: we have generic descriptions of instructions in the instructions.x file, and then we define a macro called X to extract information from those descriptions. It's basically a weird preprocessor-based application of the visitor pattern. In the above example, we use the instruction definitions twice: once to define the enum op, and once to define the const char *op_names[]. If we run the code through the preprocessor, we get something rouhly like this:

enum op {
OP_CONSTANT,
OP_ADD,
};

const char *op_names[] = {
[OP_CONSTANT] = "OP_CONSTANT",
[OP_ADD] = "OP_ADD",
};

Now let's say we want to write a function which executes an instruction. We could write that function like this:

void execute(enum op op) {
	switch (op) {
#define X(name, has_operand, code...) case OP_ ## name: code break;
#include "instructions.x"
#undef X
	}
}

Which expands to:

void execute(enum op op) {
	switch (op)
	case OP_CONSTANT:
		{
			PUSH(OPERAND());
			NEXT();
		} break;
	case OP_ADD:
		{
			b = POP();
			a = POP();
			PUSH(a + b);
			NEXT();
		} break;
	}
}

Note: We use a variadic argument for the code block because the C preprocessor has annoying splitting rules. Code such as X(FOO, 1, {int32_t a, b;}) would call the macro X with 4 arguments: FOO, 1, {int32_t a, and b;}. Using a variadic argument "fixes" this, because when we expand code in the macro body, the preprocessor will insert a comma between the arguments. You can read about more stupid preprocessor hacks here: https://mort.coffee/home/obscure-c-features/

This is starting to look reasonable, but it doesn't quite work. We haven't defined those PUSH/OPERAND/NEXT/POP macros, nor the a and b variables. We need to be a bit more rigorous about what exactly is expected by the instruction, and what's expected by the environment which the instruction's code is expanded into. So let's design a sort of "contract" between the instruction and the execution environment.

The environment must:

As for the instruction:

With this, we can re-implement our basic bytecode interpreter:

#include "instructions.h"

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

void interpret(unsigned char *bytecode, int32_t *input) {
	int32_t stack[128];
	int32_t *stackptr = stack;
	unsigned char *instrptr = bytecode;

	int instrsize; // Will be initialized later

	#define POP() (*(--stackptr))
	#define PUSH(val) (*(stackptr++) = (val))
	#define STACK(offset) (*(stackptr - 1 - offset))
	#define OPERAND() ( \
		((int32_t)instrptr[1] << 0) | \
		((int32_t)instrptr[2] << 8) | \
		((int32_t)instrptr[3] << 16) | \
		((int32_t)instrptr[4] << 24))
	#define INPUT() (*(input++))
	#define PRINT(val) (printf("%i\n", (int)(val)))
	#define GOTO_RELATIVE(offset) (instrptr += (offset))
	#define NEXT() (instrptr += instrsize)
	#define HALT() return

	int32_t a, b;
	while (1) {
		switch ((enum op)*instrptr) {
#define X(name, has_operand, code...) \
		case OP_ ## name: \
			instrsize = has_operand ? 5 : 1; \
			code \
			break;
#include "instructions.x"
#undef X
		}
	}
}

And that's it! That's our whole generic basic bytecode interpreter, defined using the instruction definitions in instructions.x. And any time we add more bytecode instructions to instructions.x, the instructions are automatically added to the enum op and const char *op_names[] in instructions.h, and they're automatically supported by this new basic interpreter.

I won't deny that this style of code is a bit harder to follow than straight C code. However, I've seen VM with their own custom domain-specific languages and code generators to define instructions, and I find that much harder to follow than this preprocessor-based approach. Even though the C preprocessor is flawed in many ways, it has the huge advantage that C programmers already understand how it works for the most part, and they're used to following code which uses macros and includes. With decent comments in strategic places, I don't think this sort of "abuse" of the C preprocessor is wholly unreasonable. Your mileage may differ though, and my threshold for "too much preprocessor magic" might be set too high.

For completeness, let's amend instructions.x with all the instructions in the bytecode language I defined at the start of this post:

X(CONSTANT, 1, {
	PUSH(OPERAND());
	NEXT();
})

X(ADD, 0, {
	b = POP();
	a = POP();
	PUSH(a + b);
	NEXT();
})

X(PRINT, 0, {
	PRINT(POP());
	NEXT();
})

X(INPUT, 0, {
	PUSH(INPUT());
	NEXT();
})

X(DISCARD, 0, {
	(void)POP();
	NEXT();
})

X(GET, 1, {
	a = STACK(OPERAND());
	PUSH(a);
	NEXT();
})

X(SET, 1, {
	a = POP();
	STACK(OPERAND()) = a;
	NEXT();
})

X(CMP, 0, {
	b = POP();
	a = POP();
	if (a > b) PUSH(1);
	else if (a < b) PUSH(-1);
	else PUSH(0);
	NEXT();
})

X(JGT, 1, {
	a = POP();
	if (a > 0) { GOTO_RELATIVE(OPERAND()); }
	else { NEXT(); }
})

X(HALT, 0, {
	HALT();
})

Implementing the custom jump table variant and the tail-call variant using this X-macro system is left as an exercise to the reader. However, just to show that it's possible, here's the compiler variant implemented generically:

#include "instructions.h"

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

void compile(unsigned char *bytecode, size_t size, FILE *out) {
	fputs(
		"#include <stdio.h>\n"
		"#include <stdint.h>\n"
		"#include <stdlib.h>\n"
		"\n"
		"int main(int argc, char **argv) {\n"
		"  int32_t stack[128];\n"
		"  int32_t *stackptr = stack;\n"
		"  char **inputptr = &argv[1];\n"
		"\n"
		"#define POP() (*(--stackptr))\n"
		"#define PUSH(val) (*(stackptr++) = (val))\n"
		"#define STACK(offset) (*(stackptr - 1 - offset))\n"
		"#define OPERAND() operand\n"
		"#define INPUT() (atoi(*(inputptr++)))\n"
		"#define PRINT(val) printf(\"%i\\n\", (int)(val))\n"
		"#define GOTO_RELATIVE(offset) index += offset; break\n"
		"#define NEXT()\n"
		"#define HALT() return 0\n"
		"\n"
		"  int32_t a, b, operand;\n"
		"  int32_t index = 0;\n"
		"  while (1) switch (index) {\n",
		out);

	for (size_t i = 0; i < size;) {
		fprintf(out, "  case %zi:\n", i);

		enum op op = (enum op)bytecode[i];
		switch (op) {
#define X(name, has_operand, code...) \
		case OP_ ## name: \
			fprintf(out, "    index = %zi;\n", i); \
			i += 1; \
			if (has_operand) { \
				fprintf(out, "    operand = %i;\n", (int)( \
					((int32_t)bytecode[i + 0] << 0) | ((int32_t)bytecode[i + 1] << 8) | \
					((int32_t)bytecode[i + 2] << 16) | ((int32_t)bytecode[i + 3] << 24))); \
				i += 4; \
			} \
			fputs("    " #code "\n", out); \
			break;
#include "instructions.x"
#undef X
		}
	}

	fputs(
		"  }\n"
		"\n"
		"  abort(); // If we get here, there's a missing HALT\n"
		"}",
		out);
}

A word on real-world performance

I thought I should mention that the techniques described in this post won't magically make any interpreted language much faster. The main source of the performance differences we have explored here is due to the overhead involved in selecting which instruction to execute next; the code which runs between the instructions. By reducing this overhead, we're able to make our simple bytecode execute blazing fast. But that's really only because all our instructions are extremely simple.

In the case of something like Python, each instruction might be much more complex to execute. The BINARY_ADD operation, for example, pops two values from the stack, adds them together, and pushes the result onto the stack, much like how our bytecode's ADD operation does. However, our ADD operation knows that the two popped values are 32-bit signed integers. In Python, the popped values may be strings, they may be arrays, they may be numbers, they may be objects with a custom __add__ method, etc. This means that the time it takes to actually execute instructions in Python will dominate to the point that speeding up instruction dispatch is likely insignificant. Optimizing highly dynamic languages like Python kind of requires some form of tracing JIT to stamp out specialized functions which make assumptions about what types their arguments are, which is outside the scope of this post.

But that doesn't mean the speed-up I have shown here is unrealistic. If you're making a language with static types, you can have dedicated fast instructions for adding i32s, adding doubles, etc. And at that point, the optimizations shown in this post will give drastic speed-ups.

Further reading


So those are my thoughts on speeding up virtual machine execution. If you want, you may check out my programming languages Gilia and osyris. Neither makes use of any of the techniques discussed in this post, but playing with Gilia's VM is what got me started down this path of exploring different techniques. If I ever get around to implementing these ideas into Gilia's VM, I'll add a link to the relevant parts of the source code here.

Read More