DRAFT: Making a simple interpreter from scratch, part 2: Syntax trees
(This is part 2 of my series on making an interpreter from scratch. For part 0, click here.)
We now have a tokenizer, so we don't have to deal with individual bytes; we can
think of sequences of tokens. However, while
identifier{sum} equals identifier{plus} number-literal{10} number-literal{20}
is easier to deal with than an unstructured sum = + 10 20
, we can't exactly
execute it directly.
What we need is a syntax tree. If we were to parse sum = + 10 20
as a program
according to the syntax diagram
I mentioned in part 0, we'll get this:
Now that looks more like something we could execute directly. This part is all about taking a token stream and constructing what's called an abstract syntax tree (AST).
Reading tokens
In part 1, we made a pretty nice
felt_tokenize_next
function, which takes an input stream and returns the next
token it finds. However, using that is not very ergonomic. We would like an
abstraction which lets us look a few tokens ahead without having to consume them.
Some other utility functions will also come in handy.
src/token-stream.h:
#define FELT_TOKEN_STREAM_MAX_PEAK 4
struct felt_token_stream {
struct felt_input_stream *input;
struct felt_token tokens[FELT_TOKEN_STREAM_MAX_PEAK];
struct felt_token prev;
size_t len;
};
void felt_token_stream_init(struct felt_token_stream *stream, struct felt_input_stream *input);
void felt_token_stream_free(struct felt_token_stream *stream);
// Peek 'n' tokens forwards, but don't consume anything.
struct felt_token felt_token_stream_peek(struct felt_token_stream *stream, size_t n);
// Consume the next token and return it.
// The returnd token will be freed the next time felt_token_stream_read is called.
struct felt_token felt_token_stream_read(struct felt_token_stream *stream);
This implementation lets us look up to 4 tokens ahead. It's implemented in src/token-stream.c.
Representing our tree
We need to represent our tree in C. I will have a struct for each node, and
each struct will have three associated functions: parse
, free
, and print
.
Here's the struct and associated functions for an expression:
src/parser.h:
struct felt_ast_expression {
enum {
FELT_EXPR_LVALUE,
FELT_EXPR_LITERAL,
FELT_EXPR_FUNC_CALL,
FELT_EXPR_GROUP,
} kind;
union {
struct felt_ast_lvalue_expression lvalue;
struct felt_ast_litral_expression literal;
struct felt_ast_func_call_expression func_call;
struct felt_ast_group_expression group;
} content;
};
// Parse an expr ast node.
// Returns 0 and fills 'node' with a valid expression node on success.
// Returns -1 and fills 'errtok' with an error token on failure.
int felt_ast_expression_parse(
struct felt_token_stream *tokens,
struct felt_ast_expression *node,
struct felt_token *errtok);
void felt_ast_expression_free(
struct felt_ast_expression *ast);
void felt_ast_expression_print(
struct felt_ast_expression *ast, int depth, FILE *out);
The other node types look roughly similar, just with different data in the struct (and obviously different implementations for parse, free and print). I won't include the other 12 node types here, but you can find them in src/parser.h.
We actually have all we need in order to represent our code as a syntax tree
now. For example, we could write that sum = + 10 20
expression like this:
// sum = + 20 30
(struct felt_ast_expression) {
.kind = FELT_AST_EXPR_ASSIGNMENT,
.expr.assignment = (struct felt_ast_assignment_expression) {
// sum
.left = &(struct felt_ast_lvalue_expression) {
.kind = FELT_AST_LVALUE_IDENTIFIER,
.expr.identifier = "sum",
},
// + 20 30
.right = &(struct felt_ast_expression) {
.kind = FELT_AST_EXPR_FUNCTION_CALL,
.expr.function_call = (struct felt_ast_function_call_expression) {
// +
.function = &(struct felt_ast_expression) {
.kind = FELT_AST_EXPR_LVALUE,
.expr.lvalue = (struct felt_ast_lvalue_expression) {
.kind = FELT_AST_LVALUE_IDENTIFIER,
.expr.identifier = "+",
},
},
// 20 30
.args = (struct felt_ast_expression[]) {
// 20
(struct felt_ast_expression) {
.kind = FELT_AST_EXPR_LITERAL,
.expr.literal = (struct felt_ast_literal_expression) {
.kind = FELT_AST_LITERAL_NUMBER,
.expr.number.val = 20,
},
},
// 30
(struct felt_ast_expression) {
.kind = FELT_AST_EXPR_LITERAL,
.expr.literal = (struct felt_ast_literal_expression) {
.kind = FELT_AST_LITERAL_NUMBER,
.expr.number.val = 30,
},
},
}
},
},
},
}
Obviously, we don't want to do that by hand, so we better make a parser do it for us.
Parsing
As I alluded to earlier, each AST node has its own parse function, so each node is responsible for parsing itself (including child nodes). This kind of parsing is called a recursive descent parser (wiki). While it's a lot of work to implement the parse function for each and every kind of AST node, it's also strangely easy; you just go through each node one by one, and the pieces kind of just fall in place.
The source for this part is available on gitlab.