DRAFT: Making a simple interpreter from scratch, part 2: Syntax trees

(This is part 2 of my series on making an interpreter from scratch. For part 0, click here.)

We now have a tokenizer, so we don't have to deal with individual bytes; we can think of sequences of tokens. However, while identifier{sum} equals identifier{plus} number-literal{10} number-literal{20} is easier to deal with than an unstructured sum = + 10 20, we can't exactly execute it directly.

What we need is a syntax tree. If we were to parse sum = + 10 20 as a program according to the syntax diagram I mentioned in part 0, we'll get this:

Syntax Tree

Now that looks more like something we could execute directly. This part is all about taking a token stream and constructing what's called an abstract syntax tree (AST).

Reading tokens

In part 1, we made a pretty nice felt_tokenize_next function, which takes an input stream and returns the next token it finds. However, using that is not very ergonomic. We would like an abstraction which lets us look a few tokens ahead without having to consume them. Some other utility functions will also come in handy.

src/token-stream.h:

#define FELT_TOKEN_STREAM_MAX_PEAK 4

struct felt_token_stream {
	struct felt_input_stream *input;
	struct felt_token tokens[FELT_TOKEN_STREAM_MAX_PEAK];
	struct felt_token prev;
	size_t len;
};

void felt_token_stream_init(struct felt_token_stream *stream, struct felt_input_stream *input);
void felt_token_stream_free(struct felt_token_stream *stream);

// Peek 'n' tokens forwards, but don't consume anything.
struct felt_token felt_token_stream_peek(struct felt_token_stream *stream, size_t n);

// Consume the next token and return it.
// The returnd token will be freed the next time felt_token_stream_read is called.
struct felt_token felt_token_stream_read(struct felt_token_stream *stream);

This implementation lets us look up to 4 tokens ahead. It's implemented in src/token-stream.c.

Representing our tree

We need to represent our tree in C. I will have a struct for each node, and each struct will have three associated functions: parse, free, and print. Here's the struct and associated functions for an expression:

src/parser.h:

struct felt_ast_expression {
	enum {
		FELT_EXPR_LVALUE,
		FELT_EXPR_LITERAL,
		FELT_EXPR_FUNC_CALL,
		FELT_EXPR_GROUP,
	} kind;
	union {
		struct felt_ast_lvalue_expression lvalue;
		struct felt_ast_litral_expression literal;
		struct felt_ast_func_call_expression func_call;
		struct felt_ast_group_expression group;
	} content;
};

// Parse an expr ast node.
// Returns 0 and fills 'node' with a valid expression node on success.
// Returns -1 and fills 'errtok' with an error token on failure.
int felt_ast_expression_parse(
		struct felt_token_stream *tokens,
		struct felt_ast_expression *node,
		struct felt_token *errtok);
void felt_ast_expression_free(
		struct felt_ast_expression *ast);
void felt_ast_expression_print(
		struct felt_ast_expression *ast, int depth, FILE *out);

The other node types look roughly similar, just with different data in the struct (and obviously different implementations for parse, free and print). I won't include the other 12 node types here, but you can find them in src/parser.h.

We actually have all we need in order to represent our code as a syntax tree now. For example, we could write that sum = + 10 20 expression like this:

// sum = + 20 30
(struct felt_ast_expression) {
	.kind = FELT_AST_EXPR_ASSIGNMENT,
	.expr.assignment = (struct felt_ast_assignment_expression) {

		// sum
		.left = &(struct felt_ast_lvalue_expression) {
			.kind = FELT_AST_LVALUE_IDENTIFIER,
			.expr.identifier = "sum",
		},

		// + 20 30
		.right = &(struct felt_ast_expression) {
			.kind = FELT_AST_EXPR_FUNCTION_CALL,
			.expr.function_call = (struct felt_ast_function_call_expression) {

				// +
				.function = &(struct felt_ast_expression) {
					.kind = FELT_AST_EXPR_LVALUE,
					.expr.lvalue = (struct felt_ast_lvalue_expression) {
						.kind = FELT_AST_LVALUE_IDENTIFIER,
						.expr.identifier = "+",
					},
				},

				// 20 30
				.args = (struct felt_ast_expression[]) {

					// 20
					(struct felt_ast_expression) {
						.kind = FELT_AST_EXPR_LITERAL,
						.expr.literal = (struct felt_ast_literal_expression) {
							.kind = FELT_AST_LITERAL_NUMBER,
							.expr.number.val = 20,
						},
					},

					// 30
					(struct felt_ast_expression) {
						.kind = FELT_AST_EXPR_LITERAL,
						.expr.literal = (struct felt_ast_literal_expression) {
							.kind = FELT_AST_LITERAL_NUMBER,
							.expr.number.val = 30,
						},
					},
				}
			},
		},
	},
}

Obviously, we don't want to do that by hand, so we better make a parser do it for us.

Parsing

As I alluded to earlier, each AST node has its own parse function, so each node is responsible for parsing itself (including child nodes). This kind of parsing is called a recursive descent parser (wiki). While it's a lot of work to implement the parse function for each and every kind of AST node, it's also strangely easy; you just go through each node one by one, and the pieces kind of just fall in place.

The source for this part is available on gitlab.