DRAFT: Making a simple interpreter from scratch, part 1: How do we make sense of words?
(This is part 1 of my series on making an interpreter from scratch. For part 0, click here.)
Before I took my first compilers course, I was curious about making programming languages, but I couldn't imagine how you would build one with robust syntax like the serious programming languages have. I would try to make languages which worked by splitting the program on whitespace and treating whitespace-separated words as "tokens", with hacks to work around constructs consisting of multiple words. The most complete of those projects is a language I called Shoelips.
Well, it turns out that once you know what to do, it's surprisingly easy to at least imagine how we can start to turn our text into some kind of structured data. All we need to do is split our string into tokens, but, like, intelligently (this is also called lexical analysis or tokenization).
We just need to take a string like this:
sum = + 50 (- 33 10)
and turn it into a series of tokens like this:
identifier{sum} equals identifier{+} number{50}
open-paren identifier{-} number{33} number{10} close-paren
Once we have done that, we can start parsing; for example, we can say that when we're looking for an expression, an identifier token followed by an equals token means that we should start parsing the expression as an assignment.
Tokens
The very first thing we're going to have to figure out is what kinds of tokens we want. I have chosen this list of tokens: string literal, number literal, identifier, open paren, close paren, open brace, close brace, open bracket, close bracket, newline, semicolon, colon, equals, exclamation mark, dot, end-of-file, and error. The end-of-file token will only appear at the end of the input stream, and the error token will signal that there's some issue with the input which prevents us from tokenizing further.
src/token.h:
enum felt_token_kind {
FELT_TOKEN_STRING_LITERAL,
FELT_TOKEN_NUMBER_LITERAL,
FELT_TOKEN_IDENTIFIER,
FELT_TOKEN_OPEN_PAREN,
FELT_TOKEN_CLOSE_PAREN,
FELT_TOKEN_OPEN_BRACE,
FELT_TOKEN_CLOSE_BRACE,
FELT_TOKEN_OPEN_BRACKET,
FELT_TOKEN_CLOSE_BRACKET,
FELT_TOKEN_NEWLINE,
FELT_TOKEN_SEMICOLON,
FELT_TOKEN_COLON,
FELT_TOKEN_EQUALS,
FELT_TOKEN_EXCLAMATION_MARK,
FELT_TOKEN_DOT,
FELT_TOKEN_EOF,
FELT_TOKEN_ERROR,
FELT_TOKEN_COUNT,
};
extern const char *felt_token_kind_names[];
The felt_token_kind_names array just maps a token kind to a string representation for nicer error messages. I won't include it here, but it's defined in token.c.
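For illustration, here's one way that array could look; the exact strings are my assumption (apart from the few that show up in the tokenizer output at the end of this post), so treat this as a sketch rather than the real token.c:

```c
#include <string.h>

/* The enum is copied from src/token.h; the strings below are
 * assumed, not the real definitions from token.c. */
enum felt_token_kind {
    FELT_TOKEN_STRING_LITERAL, FELT_TOKEN_NUMBER_LITERAL,
    FELT_TOKEN_IDENTIFIER, FELT_TOKEN_OPEN_PAREN, FELT_TOKEN_CLOSE_PAREN,
    FELT_TOKEN_OPEN_BRACE, FELT_TOKEN_CLOSE_BRACE, FELT_TOKEN_OPEN_BRACKET,
    FELT_TOKEN_CLOSE_BRACKET, FELT_TOKEN_NEWLINE, FELT_TOKEN_SEMICOLON,
    FELT_TOKEN_COLON, FELT_TOKEN_EQUALS, FELT_TOKEN_EXCLAMATION_MARK,
    FELT_TOKEN_DOT, FELT_TOKEN_EOF, FELT_TOKEN_ERROR, FELT_TOKEN_COUNT,
};

const char *felt_token_kind_names[FELT_TOKEN_COUNT] = {
    [FELT_TOKEN_STRING_LITERAL]   = "string-literal",
    [FELT_TOKEN_NUMBER_LITERAL]   = "number-literal",
    [FELT_TOKEN_IDENTIFIER]       = "identifier",
    [FELT_TOKEN_OPEN_PAREN]       = "open-paren",
    [FELT_TOKEN_CLOSE_PAREN]      = "close-paren",
    [FELT_TOKEN_OPEN_BRACE]       = "open-brace",
    [FELT_TOKEN_CLOSE_BRACE]      = "close-brace",
    [FELT_TOKEN_OPEN_BRACKET]     = "open-bracket",
    [FELT_TOKEN_CLOSE_BRACKET]    = "close-bracket",
    [FELT_TOKEN_NEWLINE]          = "newline",
    [FELT_TOKEN_SEMICOLON]        = "semicolon",
    [FELT_TOKEN_COLON]            = "colon",
    [FELT_TOKEN_EQUALS]           = "equals",
    [FELT_TOKEN_EXCLAMATION_MARK] = "exclamation-mark",
    [FELT_TOKEN_DOT]              = "dot",
    [FELT_TOKEN_EOF]              = "end-of-file",
    [FELT_TOKEN_ERROR]            = "error",
};
```

Using designated initializers keeps the array in sync with the enum even if entries are later reordered.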
Any particular token will need to know what kind of token it is, its line and column number (for error messages), and in the case of string literals, number literals and identifiers, some content.
src/token.h:
struct felt_token {
enum felt_token_kind kind;
int line, column;
union {
double number;
char *string;
} content;
};
void felt_token_free(struct felt_token *tok);
Since this is C, we need to think about memory management. For tokens, the convention will be that identifier tokens and string literal tokens will set their content.string property to some dynamically allocated buffer, and felt_token_free will free that buffer. For all other kinds of token, no dynamic memory is necessary.
Reading input
Before we can start tokenizing user input, we have to read user input. I won't bore you with the details, but I made a felt_input_stream which lets us read user input byte by byte from either a C string or a file. Here's the interface:
src/input-stream.h:
struct felt_input_stream {
/* ... */
int line, column;
};
void felt_input_stream_init_string(struct felt_input_stream *stream, const char *str);
void felt_input_stream_init_file(struct felt_input_stream *stream, FILE *f);
// Peek the next character, but don't consume it.
// Return EOF on end of file or error.
int felt_input_stream_peek(struct felt_input_stream *stream);
// Consume the next character and return it.
// Returns EOF on end of file or error.
int felt_input_stream_read(struct felt_input_stream *stream);
// Returns an errno code if the stream failed,
// or 0 if there was no error.
int felt_input_stream_error(struct felt_input_stream *stream);
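The struct's internals are elided above, so to show how peek/read and the line/column bookkeeping could work, here's a hypothetical string-only implementation. The sketch_ names and the exact line/column semantics (column counts the character just consumed; a newline resets it) are my own assumptions, not the real input-stream.c:

```c
#include <stdio.h>

/* Hypothetical string-backed stream; the real felt_input_stream
 * also supports FILE * input and error reporting. */
struct stream_sketch {
    const char *str;
    size_t pos;
    int line, column;
};

static void sketch_init_string(struct stream_sketch *s, const char *str) {
    s->str = str;
    s->pos = 0;
    s->line = 1;
    s->column = 0;
}

/* Look at the next character without consuming it; EOF at end. */
static int sketch_peek(struct stream_sketch *s) {
    unsigned char ch = (unsigned char)s->str[s->pos];
    return ch == '\0' ? EOF : ch;
}

/* Consume the next character, updating line/column so they
 * describe the character that was just read. */
static int sketch_read(struct stream_sketch *s) {
    int ch = sketch_peek(s);
    if (ch == EOF)
        return EOF;
    s->pos++;
    if (ch == '\n') {
        s->line++;
        s->column = 0;
    } else {
        s->column++;
    }
    return ch;
}
```

The one-character lookahead is all the tokenizer needs; it never backtracks further than a single peek.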
Tokenizing!
Now that we have both a token struct and a way to read input, we can start actually making our tokenizer. The interface will be super simple: it'll just read characters from the input stream and return a token:
src/tokenize.h:
struct felt_token felt_tokenize_next(struct felt_input_stream *input);
That felt_tokenize_next function is where all the tokenization will happen. However, before we start writing it, I want to write a function to skip whitespace and comments; comments should be ignored entirely, and most whitespace should be ignored too.
I don't want felt to require a semicolon on each line, so we need to tell the later stages of the parser about newlines using newline tokens. To make it easier on ourselves later, I want to collapse a sequence of whitespace and comments into a single newline token if it contains any number of newlines. This makes our skip_whitespace_and_comments function a bit hairy.
src/tokenize.c:
// Check if a character is whitespace.
// We don't use isspace() because we don't want the user's current locale
// to affect parsing.
static int is_whitespace(int ch) {
return ch == ' ' || ch == '\n' || ch == '\t' || ch == '\r';
}
// Check if a character could be a valid identifier character.
// Note that equals signs are allowed in identifiers.
static int is_identifier(int ch) {
return !is_whitespace(ch) && ch != EOF &&
ch != '(' && ch != ')' && ch != '{' && ch != '}' &&
ch != '[' && ch != ']' && ch != ';' && ch != ':' &&
ch != '!' && ch != '.' && ch != '#';
}
// Skip whitespace or comments.
// Returns true if one or more newlines were skipped, false otherwise.
static bool skip_whitespace_and_comments(struct felt_input_stream *input) {
bool skipped_newline = false;
int peek;
while (true) {
// Skip whitespace
while (true) {
peek = felt_input_stream_peek(input);
if (!is_whitespace(peek))
break;
if (peek == '\n')
skipped_newline = true;
felt_input_stream_read(input);
}
// We've skipped the whitespace;
// unless the next character is a comment, we're done
peek = felt_input_stream_peek(input);
if (peek != '#')
break;
// We need to read until a newline,
// because comments go from the # to the end of the line.
felt_input_stream_read(input);
while (true) {
peek = felt_input_stream_peek(input);
if (peek == '\n') {
skipped_newline = true;
felt_input_stream_read(input);
break;
} else if (peek == EOF) {
return skipped_newline;
}
felt_input_stream_read(input);
}
// We've skipped a comment, but we need to go around again
// in case there's more whitespace to skip.
}
return skipped_newline;
}
Now we can start writing the actual felt_tokenize_next function. The first thing it has to do is skip any whitespace and comments we may have encountered, and then read the first non-whitespace character:
src/tokenize.c:
struct felt_token felt_tokenize_next(struct felt_input_stream *input) {
if (skip_whitespace_and_comments(input))
return (struct felt_token) {
FELT_TOKEN_NEWLINE, input->line, input->column, { 0 } };
// Read the first character of our token,
// and capture the line/column of the first character of the token
int ch = felt_input_stream_read(input);
int line = input->line;
int column = input->column;
// On error, we just return an error token with an error message
if (ch == EOF && felt_input_stream_error(input))
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column,
.content.string = strerror(felt_input_stream_error(input)) };
Next, we can get all of the simple one-character tokens and EOF out of the way:
src/tokenize.c:
// One-character tokens are simple;
// we already consumed a character from the stream, so we can just
// return the correct token right away.
switch (ch) {
case EOF:
return (struct felt_token) { FELT_TOKEN_EOF, line, column, { 0 } };
case '(':
return (struct felt_token) { FELT_TOKEN_OPEN_PAREN, line, column, { 0 } };
case ')':
return (struct felt_token) { FELT_TOKEN_CLOSE_PAREN, line, column, { 0 } };
case '{':
return (struct felt_token) { FELT_TOKEN_OPEN_BRACE, line, column, { 0 } };
case '}':
return (struct felt_token) { FELT_TOKEN_CLOSE_BRACE, line, column, { 0 } };
case '[':
return (struct felt_token) { FELT_TOKEN_OPEN_BRACKET, line, column, { 0 } };
case ']':
return (struct felt_token) { FELT_TOKEN_CLOSE_BRACKET, line, column, { 0 } };
case ';':
return (struct felt_token) { FELT_TOKEN_SEMICOLON, line, column, { 0 } };
case ':':
return (struct felt_token) { FELT_TOKEN_COLON, line, column, { 0 } };
case '=':
return (struct felt_token) { FELT_TOKEN_EQUALS, line, column, { 0 } };
case '!':
return (struct felt_token) { FELT_TOKEN_EXCLAMATION_MARK, line, column, { 0 } };
case '.':
return (struct felt_token) { FELT_TOKEN_DOT, line, column, { 0 } };
}
Now we only have the three variable-length tokens left: number literals, string literals and identifiers. Since they're variable-length, the variable-length buffer we discussed in part 0 will come in handy.
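As a refresher, here's roughly the felt_buffer interface we'll rely on. This is my own minimal re-implementation under assumed semantics (push returns a negative value on allocation failure; release hands ownership of the bytes to the caller), so see part 0 for the real one:

```c
#include <stdlib.h>
#include <string.h>

/* A minimal growable byte buffer sketching the felt_buffer interface. */
struct buffer_sketch {
    char *data;
    size_t len, cap;
};

static void buffer_sketch_init(struct buffer_sketch *buf) {
    buf->data = NULL;
    buf->len = 0;
    buf->cap = 0;
}

/* Append n bytes from src; returns 0 on success, -1 if realloc fails. */
static int buffer_sketch_push(struct buffer_sketch *buf, const void *src, size_t n) {
    if (buf->len + n > buf->cap) {
        size_t cap = buf->cap ? buf->cap * 2 : 16;
        while (cap < buf->len + n)
            cap *= 2;
        char *data = realloc(buf->data, cap);
        if (!data)
            return -1;
        buf->data = data;
        buf->cap = cap;
    }
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
    return 0;
}

/* Hand the accumulated bytes to the caller, who now owns them. */
static void *buffer_sketch_release(struct buffer_sketch *buf) {
    void *data = buf->data;
    buffer_sketch_init(buf);
    return data;
}

/* Free a buffer whose bytes were never released. */
static void buffer_sketch_free(struct buffer_sketch *buf) {
    free(buf->data);
    buffer_sketch_init(buf);
}
```

The release-versus-free split is what lets a token take ownership of the bytes on success while error paths still clean up.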
Let's do the string literal first; we can unambiguously determine whether we're looking at a string literal based on whether the current character is a quote.
src/tokenize.c:
// Strings start with an opening quote.
if (ch == '"') {
struct felt_buffer buf;
felt_buffer_init(&buf);
// We just read characters until we find the closing quote
while (felt_input_stream_peek(input) != '"') {
ch = felt_input_stream_read(input);
// Error handling is a bit noisy but meh
if (ch == EOF && felt_input_stream_error(input)) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column,
.content.string = strerror(felt_input_stream_error(input)) };
} else if (ch == EOF) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column,
.content.string = "Unexpected end-of-file while parsing string" };
}
// Escape sequence!
if (ch == '\\') {
ch = felt_input_stream_read(input);
// The next round around the loop will deal with EOFs
if (ch == EOF)
continue;
// \n, \r and \t produce newline, carriage return and tab;
// \x for any x just produces 'x' (this handles \" and \\ too)
char actual = (char)ch;
if (ch == 'n')
actual = '\n';
else if (ch == 'r')
actual = '\r';
else if (ch == 't')
actual = '\t';
// felt_buffer_push can fail if realloc fails.
if (felt_buffer_push(&buf, &actual, 1) < 0) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
}
}
// Not escape sequence
else {
char c = (char)ch;
if (felt_buffer_push(&buf, &c, 1) < 0) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
}
}
}
// If there was an error, we have already returned; otherwise, we've
// reached this point with a proper string literal in our buffer. Since
// we just peeked to see that the next character is a quote, we need to
// consume it.
felt_input_stream_read(input);
// We need to push a null byte, such that our buffer is a valid string.
if (felt_buffer_push(&buf, "", 1) < 0) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
}
// Finally, we can return a string literal and release the buffer from the felt_buffer.
return (struct felt_token) {
FELT_TOKEN_STRING_LITERAL, line, column,
.content.string = (char *)felt_buffer_release(&buf) };
}
That's kind of long, but it's not too complicated; we just read characters into a buffer until we reach a closing quote, with some extra logic for escape sequences.
Now we only need to parse identifiers and numbers. However, in order to do that, we need to be able to differentiate between identifiers and numbers. Most languages just say that identifiers can't start with a digit, which makes this very easy: if it starts with a digit, it's a number, otherwise it's an identifier. However, I want to be able to use numbers at the beginning of my identifiers, because not being able to do that has annoyed me in other languages.
What we'll do is first read in something which might be either a number literal or an identifier, and then, once we've read it all in, decide which one it is.
src/tokenize.c:
// If we've reached this point, we must be reading either an identifier or a number.
// Let's read it all in, and then decide which one it is.
else {
struct felt_buffer buf;
felt_buffer_init(&buf);
// We can push the current character right away,
// since identifiers don't start with a special symbol like strings do
char c = (char)ch;
if (felt_buffer_push(&buf, &c, 1) < 0) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
}
// We just read characters until we find something we don't like
while (is_identifier(felt_input_stream_peek(input))) {
ch = felt_input_stream_read(input);
// We only care about EOF if it's an error here;
// non-error EOFs are expected.
if (ch == EOF && felt_input_stream_error(input)) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column,
.content.string = strerror(felt_input_stream_error(input)) };
}
// We don't need to do any processing like we did with strings.
c = (char)ch;
if (felt_buffer_push(&buf, &c, 1) < 0) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
}
}
// Push the null byte...
if (felt_buffer_push(&buf, "", 1) < 0) {
felt_buffer_free(&buf);
return (struct felt_token) {
FELT_TOKEN_ERROR, line, column, .content.string = strerror(errno) };
}
// Then release it.
char *str = (char *)felt_buffer_release(&buf);
// Here's where we need to figure out if our value is a number or an identifier.
// We'll let parse_number figure it out.
double num;
if (parse_number(str, &num)) {
free(str);
return (struct felt_token) {
FELT_TOKEN_NUMBER_LITERAL, line, column,
.content.number = num };
} else {
return (struct felt_token) {
FELT_TOKEN_IDENTIFIER, line, column,
.content.string = str };
}
}
}
That should basically be it. The only thing we're lacking now is a parse_number implementation. Because of how we've implemented this, we could make that function as fancy as we want; we could parse numbers starting with 0x as hex and 0b as binary, maybe even have some feature for specifying custom radixes within the number literal. However, I'm going to just use strtod for now for simplicity, even though strtod isn't ideal because it's locale-dependent.
This is all that's necessary for a really basic parse_number:
src/tokenize.c:
// Parse a string as a number.
// Returns true if the entire string could be parsed as a number,
// false otherwise.
static bool parse_number(const char *str, double *num) {
char *end = NULL;
*num = strtod(str, &end);
// An end pointer pointing to the zero terminator
// means the entire string was consumed.
return *end == '\0';
}
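To sanity-check the disambiguation rule, here's the same logic as a standalone helper (renamed is_number_word to make clear it's a copy for illustration): words like 2fast and a lone - stay identifiers because strtod stops early, while 1e3 is a full number. One quirk worth knowing: since C99, strtod also accepts hex input like 0x10, so such words become number literals too.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Same rule as parse_number in tokenize.c: a word is a number
 * literal only if strtod consumes the entire string.
 *   "50"    -> number 50.0
 *   "1e3"   -> number 1000.0
 *   "2fast" -> identifier (strtod stops at 'f')
 *   "-"     -> identifier (no digits at all)
 */
static bool is_number_word(const char *str, double *num) {
    char *end = NULL;
    *num = strtod(str, &end);
    return *end == '\0';
}
```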
And we should now have a working tokenizer! Let's change our main.c to test it out:
src/main.c:
int main(int argc, char **argv) {
if (argc != 2) {
printf("Usage: %s <code>\n", argv[0]);
return EXIT_FAILURE;
}
struct felt_input_stream input;
felt_input_stream_init_string(&input, argv[1]);
printf("%s:\n\n", argv[1]);
// Just loop through tokens and print them
// until we reach the EOF or an error
int count = 0;
while (true) {
count += 1;
struct felt_token token = felt_tokenize_next(&input);
enum felt_token_kind kind = token.kind;
printf("%s", felt_token_kind_names[kind]);
if (kind == FELT_TOKEN_STRING_LITERAL || kind == FELT_TOKEN_IDENTIFIER) {
printf("{%s}", token.content.string);
} else if (kind == FELT_TOKEN_NUMBER_LITERAL) {
printf("{%.03f}", token.content.number);
} else if (kind == FELT_TOKEN_ERROR) {
printf("{%s} :(", token.content.string);
}
felt_token_free(&token);
if (kind == FELT_TOKEN_ERROR || kind == FELT_TOKEN_EOF) {
printf("\n");
break;
}
// Print a newline every few tokens
if (count % 5 == 0) {
printf("\n");
} else {
printf(" ");
}
}
return EXIT_SUCCESS;
}
vidarr ~/dev/felt-lang interpreter-from-scratch-part-1 $ make
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic -o obj/input-stream.o -c src/input-stream.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic -o obj/buffer.o -c src/buffer.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic -o obj/main.o -c src/main.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic -o obj/token.o -c src/token.c
cc -Wall -Wextra -Wno-unused-parameter -Wpedantic -o obj/tokenize.o -c src/tokenize.c
cc -o felt obj/input-stream.o obj/buffer.o obj/main.o obj/token.o obj/tokenize.o
vidarr ~/dev/felt-lang interpreter-from-scratch-part-1 $ ./felt 'sum = + 50 (- 33 10)'
sum = + 50 (- 33 10):
identifier{sum} equals identifier{+} number-literal{50.000} open-paren
identifier{-} number-literal{33.000} number-literal{10.000} close-paren end-of-file
Awesome, sum = + 50 (- 33 10) becomes the list of tokens we were looking for.
Feel free to download the repository and play around with the tokenizer yourself. The source for this part is available on GitLab.