DRAFT: Making a simple interpreter from scratch, part 1: How do we make sense of words?

(This is part 1 of my series on making an interpreter from scratch. For part 0, click here.)

Before I took my first compilers course, I was curious about making programming languages, but I just couldn't imagine how you would build a language with robust syntax like the serious programming languages have. I would make languages which worked by splitting the program on whitespace and treating the whitespace-separated words as "tokens", with hacks to work around constructs which consist of multiple words. The most complete of those projects is a language I called Shoelips.

Well, it turns out that once you know what to do, it's surprisingly easy to at least imagine how to start turning our text into some kind of structured data. All we need to do is split our string into tokens, but, like, intelligently (this is also called lexical analysis or tokenization).

We just need to take a string like this:

sum = + 50 (- 33 10)

and turn it into a series of tokens like this:

identifier{sum} equals identifier{+} number{50}
open-paren identifier{-} number{33} number{10} close-paren

Once we have done that, we can start parsing; for example, we can say that when we're looking for an expression, an identifier token followed by an equals token means that we should start parsing the expression as an assignment.

Tokens

The very first thing we're going to have to figure out is what kinds of tokens we want. I have chosen this list of tokens: string literal, number literal, identifier, open paren, close paren, open brace, close brace, open bracket, close bracket, newline, semicolon, colon, equals, exclamation mark, dot, end-of-file, and error. The end-of-file token will only appear at the end of the input stream, and the error token will signal that there's some issue with the input which prevents us from tokenizing further.

src/token.h:

enum felt_token_kind {
	FELT_TOKEN_STRING_LITERAL,
	FELT_TOKEN_NUMBER_LITERAL,
	FELT_TOKEN_IDENTIFIER,
	FELT_TOKEN_OPEN_PAREN,
	FELT_TOKEN_CLOSE_PAREN,
	FELT_TOKEN_OPEN_BRACE,
	FELT_TOKEN_CLOSE_BRACE,
	FELT_TOKEN_OPEN_BRACKET,
	FELT_TOKEN_CLOSE_BRACKET,
	FELT_TOKEN_NEWLINE,
	FELT_TOKEN_SEMICOLON,
	FELT_TOKEN_COLON,
	FELT_TOKEN_EQUALS,
	FELT_TOKEN_EXCLAMATION_MARK,
	FELT_TOKEN_DOT,
	FELT_TOKEN_EOF,
	FELT_TOKEN_ERROR,
	FELT_TOKEN_COUNT,
};

extern const char *felt_token_kind_names[];

That felt_token_kind_names array just maps a token kind to a string representation for nicer error messages. It's defined in token.c.
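
For reference, here's a sketch of what it might look like; the strings for the kinds which don't show up in this post's output are my best guesses:

const char *felt_token_kind_names[] = {
	[FELT_TOKEN_STRING_LITERAL] = "string-literal",
	[FELT_TOKEN_NUMBER_LITERAL] = "number-literal",
	[FELT_TOKEN_IDENTIFIER] = "identifier",
	[FELT_TOKEN_OPEN_PAREN] = "open-paren",
	[FELT_TOKEN_CLOSE_PAREN] = "close-paren",
	[FELT_TOKEN_OPEN_BRACE] = "open-brace",
	[FELT_TOKEN_CLOSE_BRACE] = "close-brace",
	[FELT_TOKEN_OPEN_BRACKET] = "open-bracket",
	[FELT_TOKEN_CLOSE_BRACKET] = "close-bracket",
	[FELT_TOKEN_NEWLINE] = "newline",
	[FELT_TOKEN_SEMICOLON] = "semicolon",
	[FELT_TOKEN_COLON] = "colon",
	[FELT_TOKEN_EQUALS] = "equals",
	[FELT_TOKEN_EXCLAMATION_MARK] = "exclamation-mark",
	[FELT_TOKEN_DOT] = "dot",
	[FELT_TOKEN_EOF] = "end-of-file",
	[FELT_TOKEN_ERROR] = "error",
};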

Any particular token will need to know what kind of token it is, its line and column number (for error messages), and, in the case of string literals, number literals and identifiers, some content.

src/token.h:

struct felt_token {
	enum felt_token_kind kind;
	int line, column;
	union {
		double number;
		char *string;
	} content;
};

void felt_token_free(struct felt_token *tok);

Since this is C, we need to think about memory management. For tokens, the convention will be that identifier tokens and string literal tokens will set their content.string property to some dynamically allocated buffer, and felt_token_free will free that buffer. For all other kinds of token, no dynamic memory is necessary.
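
To make that convention concrete, here's a sketch of what felt_token_free could look like (the real one in token.c may differ slightly):

#include <stdlib.h>

#include "token.h"

void felt_token_free(struct felt_token *tok) {
	// Only identifiers and string literals own a heap-allocated buffer;
	// every other kind (including error tokens, whose strings are static)
	// has nothing to free.
	if (tok->kind == FELT_TOKEN_IDENTIFIER ||
			tok->kind == FELT_TOKEN_STRING_LITERAL)
		free(tok->content.string);
}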

Reading input

Before we can start tokenizing user input, we have to read user input. I won't bore you with the details, but I made a felt_input_stream which lets us read user input byte by byte from either a C string or a file. Here's the interface:

src/input-stream.h:

struct felt_input_stream {
	/* ... */
	int line, column;
};

void felt_input_stream_init_string(struct felt_input_stream *stream, const char *str);
void felt_input_stream_init_file(struct felt_input_stream *stream, FILE *f);

// Peeks at the next character, but doesn't consume it.
// Returns EOF on end of file or error.
int felt_input_stream_peek(struct felt_input_stream *stream);

// Consumes the next character and returns it.
// Returns EOF on end of file or error.
int felt_input_stream_read(struct felt_input_stream *stream);

// Returns an errno code if the stream failed,
// or 0 if there was no error.
int felt_input_stream_error(struct felt_input_stream *stream);
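
As a quick illustration of the peek/read contract (a hypothetical snippet, not code from the interpreter):

struct felt_input_stream stream;
felt_input_stream_init_string(&stream, "hi");

int a = felt_input_stream_peek(&stream); // 'h': peeking doesn't consume
int b = felt_input_stream_read(&stream); // 'h': reading does
int c = felt_input_stream_read(&stream); // 'i'
int d = felt_input_stream_read(&stream); // EOF from here on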

Tokenizing!

Now that we have both a token struct and a way to read input, we can start actually making our tokenizer. The interface will be super simple; it just reads characters from the input stream and returns a token:

src/tokenize.h:

struct felt_token felt_tokenize_next(struct felt_input_stream *input);

That felt_tokenize_next function is where all the tokenization will happen. However, before we start writing it, I want to write a function to skip whitespace and comments; comments should always be ignored, and most (but not all) whitespace should be ignored too.

I don't want felt to require a semicolon on each line, so we need to tell the later stages of the parser about newlines using newline tokens. To make it easier on ourselves later, I want to collapse a sequence of whitespace and comments into a single newline token if it contains any number of newlines; for example, a comment followed by a couple of blank lines still becomes just one newline token. This makes our skip_whitespace_and_comments function a bit hairy.

src/tokenize.c:

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#include "buffer.h" // the felt_buffer from part 0 (header name assumed)
#include "input-stream.h"
#include "token.h"
#include "tokenize.h"

// Check if a character is whitespace.
// We don't use isspace() because we don't want the user's current locale
// to affect parsing.
static int is_whitespace(int ch) {
	return ch == ' ' || ch == '\n' || ch == '\t' || ch == '\r';
}

// Check if a character could be a valid identifier character.
// Note that equals signs are allowed in identifiers.
static int is_identifier(int ch) {
	return !is_whitespace(ch) && ch != EOF &&
		ch != '(' && ch != ')' && ch != '{' && ch != '}' &&
		ch != '[' && ch != ']' && ch != ';' && ch != ':' &&
		ch != '!' && ch != '.' && ch != '#';
}

// Skip whitespace or comments.
// Returns true if one or more newlines were skipped, false otherwise.
static bool skip_whitespace_and_comments(struct felt_input_stream *input) {
	bool skipped_newline = false;
	int peek;

	while (true) {
		// Skip whitespace
		while (true) {
			peek = felt_input_stream_peek(input);
			if (!is_whitespace(peek))
				break;
			if (peek == '\n')
				skipped_newline = true;
			felt_input_stream_read(input);
		}

		// We've skipped the whitespace;
		// unless the next character is a comment, we're done
		peek = felt_input_stream_peek(input);
		if (peek != '#')
			break;

		// We need to read until a newline,
		// because comments go from the # to the end of the line.
		felt_input_stream_read(input);
		while (true) {
			peek = felt_input_stream_peek(input);
			if (peek == '\n') {
				skipped_newline = true;
				felt_input_stream_read(input);
				break;
			} else if (peek == EOF) {
				return skipped_newline;
			}

			felt_input_stream_read(input);
		}

		// We've skipped a comment, but we need to go around again
		// in case there's more whitespace to skip.
	}

	return skipped_newline;
}

Now we can start writing the actual felt_tokenize_next function. The first thing we have to do is skip any whitespace and comments, returning a newline token right away if we skipped past any newlines, and then read the first non-whitespace character:

src/tokenize.c:

struct felt_token felt_tokenize_next(struct felt_input_stream *input) {
	if (skip_whitespace_and_comments(input))
		return (struct felt_token) {
			FELT_TOKEN_NEWLINE, input->line, input->column, { 0 } };

	// Read the first character of our token,
	// and capture the line/column of the first character of the token
	int ch = felt_input_stream_read(input);
	int line = input->line;
	int column = input->column;

	// On error, we just return an error token with an error message
	if (ch == EOF && felt_input_stream_error(input))
		return (struct felt_token) {
			FELT_TOKEN_ERROR, line, column,
			.content.string = strerror(felt_input_stream_error(input)) };
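
Note that the error token's content.string points at strerror's statically allocated message rather than anything we allocated ourselves, which is exactly why the felt_token_free convention from earlier only frees identifiers and string literals.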

Next, we can get all of the simple one-character tokens and EOF out of the way:

src/tokenize.c:

	// One-character tokens are simple;
	// we already consumed a character from the stream, so we can just
	// return the correct token right away.
	switch (ch) {
	case EOF:
		return (struct felt_token) { FELT_TOKEN_EOF, line, column, { 0 } };
	case '(':
		return (struct felt_token) { FELT_TOKEN_OPEN_PAREN, line, column, { 0 } };
	case ')':
		return (struct felt_token) { FELT_TOKEN_CLOSE_PAREN, line, column, { 0 } };
	case '{':
		return (struct felt_token) { FELT_TOKEN_OPEN_BRACE, line, column, { 0 } };
	case '}':
		return (struct felt_token) { FELT_TOKEN_CLOSE_BRACE, line, column, { 0 } };
	case '[':
		return (struct felt_token) { FELT_TOKEN_OPEN_BRACKET, line, column, { 0 } };
	case ']':
		return (struct felt_token) { FELT_TOKEN_CLOSE_BRACKET, line, column, { 0 } };
	case ';':
		return (struct felt_token) { FELT_TOKEN_SEMICOLON, line, column, { 0 } };
	case ':':
		return (struct felt_token) { FELT_TOKEN_COLON, line, column, { 0 } };
	case '=':
		return (struct felt_token) { FELT_TOKEN_EQUALS, line, column, { 0 } };
	case '!':
		return (struct felt_token) { FELT_TOKEN_EXCLAMATION_MARK, line, column, { 0 } };
	case '.':
		return (struct felt_token) { FELT_TOKEN_DOT, line, column, { 0 } };
	}

Now we only have the three variable-length tokens left: number literals, string literals and identifiers. Since they're variable-length, the dynamically growing felt_buffer we discussed in part 0 will come in handy.

Let's do the string literal first; we can unambiguously determine whether we're looking at a string literal based on whether the current character is a quote.

src/tokenize.c:

	// Strings start with an opening quote.
	if (ch == '"') {
		struct felt_buffer buf;
		felt_buffer_init(&buf);

		// We just read characters until we find the closing quote
		while (felt_input_stream_peek(input) != '"') {
			ch = felt_input_stream_read(input);

			// Error handling is a bit noisy but meh
			if (ch == EOF && felt_input_stream_error(input)) {
				felt_buffer_free(&buf);
				return (struct felt_token) {
					FELT_TOKEN_ERROR, line, column,
					.content.string = strerror(felt_input_stream_error(input)) };
			} else if (ch == EOF) {
				felt_buffer_free(&buf);
				return (struct felt_token) {
					FELT_TOKEN_ERROR, line, column,
					.content.string = "Unexpected end-of-file while parsing string" };
			}

			// Escape sequence!
			if (ch == '\\') {
				ch = felt_input_stream_read(input);

				// The next round around the loop will deal with EOFs
				if (ch == EOF)
					continue;

				// \n, \r and \t produce newline, carriage return and tab;
				// \x for any x just produces 'x' (this handles \" and \\ too)
				char actual = (char)ch;
				if (ch == 'n')
					actual = '\n';
				else if (ch == 'r')
					actual = '\r';
				else if (ch == 't')
					actual = '\t';

				// felt_buffer_push can fail if realloc fails.
				if (felt_buffer_push(&buf, &actual, 1) < 0) {
					felt_buffer_free(&buf);
					return (struct felt_token) {
						FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
				}

			}

			// Not escape sequence
			else {
				char c = (char)ch;
				if (felt_buffer_push(&buf, &c, 1) < 0) {
					felt_buffer_free(&buf);
					return (struct felt_token) {
						FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
				}
			}
		}

		// Either we returned with an error above, or we've reached this point
		// with a complete string literal in our buffer. Since we only peeked
		// to see that the next character is a quote, we still need to consume it.
		felt_input_stream_read(input);

		// We need to push a null byte, such that our buffer is a valid string.
		if (felt_buffer_push(&buf, "", 1) < 0) {
			felt_buffer_free(&buf);
			return (struct felt_token) {
				FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
		}

		// Finally, we can return a string literal and release the buffer from the felt_buffer.
		return (struct felt_token) {
			FELT_TOKEN_STRING_LITERAL, line, column,
			.content.string = (char *)felt_buffer_release(&buf) };
	}

That's kind of long, but it's not too complicated; we just read characters into a buffer until we reach a closing quote, with some extra logic for escape sequences (so the source text "a\tb" ends up as the three characters 'a', tab, 'b').

Now we only need to parse identifiers and numbers. However, in order to do that, we need to be able to differentiate between identifiers and numbers. Most languages just say that identifiers can't start with a digit, which makes this very easy: if it starts with a digit, it's a number, otherwise it's an identifier. However, I want to be able to use numbers at the beginning of my identifiers, because not being able to do that has annoyed me in other languages.

What we'll do is first read in something which might be either a number literal or an identifier, and then, once we've read it all in, decide which one it is.

src/tokenize.c:

	// If we've reached this point, we must be reading either an identifier or a number.
	// Let's read it all in, and then decide which one it is.
	else {
		struct felt_buffer buf;
		felt_buffer_init(&buf);

		// We can push the current character right away,
		// since identifiers don't start with a special symbol like strings do
		char c = (char)ch;
		if (felt_buffer_push(&buf, &c, 1) < 0) {
			felt_buffer_free(&buf);
			return (struct felt_token) {
				FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
		}

		// We just read characters until we find something we don't like
		while (is_identifier(felt_input_stream_peek(input))) {
			ch = felt_input_stream_read(input);

			// We only care about EOF if it's an error here;
			// non-error EOFs are expected.
			if (ch == EOF && felt_input_stream_error(input)) {
				felt_buffer_free(&buf);
				return (struct felt_token) {
					FELT_TOKEN_ERROR, line, column,
					.content.string = strerror(felt_input_stream_error(input)) };
			}

			// We don't need to do any processing like we did with strings.
			c = (char)ch;
			if (felt_buffer_push(&buf, &c, 1) < 0) {
				felt_buffer_free(&buf);
				return (struct felt_token) {
					FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
			}
		}

		// Push the null byte...
		if (felt_buffer_push(&buf, "", 1) < 0) {
			felt_buffer_free(&buf);
			return (struct felt_token) {
				FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
		}

		// Then release it.
		char *str = (char *)felt_buffer_release(&buf);

		// Here's where we need to figure out if our value is a number or an identifier.
		// We'll let parse_number figure it out.
		double num;
		if (parse_number(str, &num)) {
			free(str);
			return (struct felt_token) {
				FELT_TOKEN_NUMBER_LITERAL, line, column,
				.content.number = num };
		} else {
			return (struct felt_token) {
				FELT_TOKEN_IDENTIFIER, line, column,
				.content.string = str };
		}
	}
}

That should basically be it. The only thing we're lacking now is a parse_number implementation. Because of how we've implemented this, we could make that function as fancy as we want; we could parse numbers starting with 0x as hex and 0b as binary, maybe even add some way of specifying custom radixes within the number literal. However, I'm just going to use strtod for now for simplicity. strtod isn't optimal: it's locale dependent, and it accepts strings like inf and nan which would otherwise be perfectly good identifiers.

This is all that's necessary for a really basic parse_number (since felt_tokenize_next calls it, it needs to be declared or defined earlier in the file):

src/tokenize.c:

// Parse a string as a number.
// Returns true if the entire string could be parsed as a number,
// false otherwise.
static bool parse_number(const char *str, double *num) {
	char *end = NULL;
	*num = strtod(str, &end);

	// An end pointer pointing to the zero terminator
	// means the entire string was consumed.
	return *end == '\0';
}
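
For example, parse_number("42", &num) consumes the whole string and returns true, while parse_number("42nd", &num) leaves end pointing at the 'n' and returns false, so 42nd stays an identifier. This is also why the + from our example at the top becomes an identifier: strtod can't parse a lone + as a number.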

And we should now have a working tokenizer! Let's change our main.c to test it out:

src/main.c:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#include "input-stream.h"
#include "token.h"
#include "tokenize.h"

int main(int argc, char **argv) {
	if (argc != 2) {
		printf("Usage: %s <code>\n", argv[0]);
		return EXIT_FAILURE;
	}

	struct felt_input_stream input;
	felt_input_stream_init_string(&input, argv[1]);

	printf("%s:\n\n", argv[1]);

	// Just loop through tokens and print them
	// until we reach the EOF or an error
	int count = 0;
	while (true) {
		count += 1;
		struct felt_token token = felt_tokenize_next(&input);
		enum felt_token_kind kind = token.kind;
		printf("%s", felt_token_kind_names[kind]);

		if (kind == FELT_TOKEN_STRING_LITERAL || kind == FELT_TOKEN_IDENTIFIER) {
			printf("{%s}", token.content.string);
		} else if (kind == FELT_TOKEN_NUMBER_LITERAL) {
			printf("{%.03f}", token.content.number);
		} else if (kind == FELT_TOKEN_ERROR) {
			printf("{%s} :(", token.content.string);
		}

		felt_token_free(&token);

		if (kind == FELT_TOKEN_ERROR || kind == FELT_TOKEN_EOF) {
			printf("\n");
			break;
		}

		// Print a newline every few tokens
		if (count % 5 == 0) {
			printf("\n");
		} else {
			printf(" ");
		}
	}

	return EXIT_SUCCESS;
}

vidarr ~/dev/felt-lang interpreter-from-scratch-part-1 $ make
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic  -o obj/input-stream.o -c src/input-stream.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic  -o obj/buffer.o -c src/buffer.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic  -o obj/main.o -c src/main.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic  -o obj/token.o -c src/token.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic  -o obj/tokenize.o -c src/tokenize.c
cc  -o felt obj/input-stream.o obj/buffer.o obj/main.o obj/token.o obj/tokenize.o

vidarr ~/dev/felt-lang interpreter-from-scratch-part-1 $ ./felt 'sum = + 50 (- 33 10)'
sum = + 50 (- 33 10):

identifier{sum} equals identifier{+} number-literal{50.000} open-paren
identifier{-} number-literal{33.000} number-literal{10.000} close-paren end-of-file

Awesome, sum = + 50 (- 33 10) becomes the list of tokens we were looking for. Feel free to download the repository and play around with the tokenizer yourself.


Part 2: Syntax trees

The source for this part is available on GitLab.
