Crafting Interpreters
below are some of the notes i am taking while reading through robert nystrom’s crafting interpreters book.
chapter 2: a map of the territory
the long route.
front end
- scanning (lexing) / lexical analysis - takes a stream of characters and chunks them into tokens
- parsing - the tokens acquire grammar through parsing, producing an abstract syntax tree (AST)
- static analysis
- binding / resolution - for every identifier, find where that name is defined and wire the two together
- type checking
- after all this, new insights/information is available, but we have to store it somewhere:
- as attributes on the AST
- store it in a lookup table (a symbol table)
- create an intermediate representation (IR)
intermediate representations
- IRs make it easier to target multiple architectures
- styles of IRs
- continuation-passing style (CPS)
- used in the first version of Scheme- every function takes an extra argument that is a function (a “continuation”) and the function does the equivalent of “return” by calling the continuation function with the result value (what you would normally return). this is more common in functional and logic style languages.
- static single-assignment (SSA) - every variable is assigned exactly once, which makes many optimizations easier. used by LLVM for example.
- three-address code (TAC / 3AC) - breaks complex expressions down into a sequence of instructions with at most three operands each - which looks sort of like assembly (since it can have labeled jumps/GOTOs) - see the sketch after this list
- GCC has its own well-known IRs (GIMPLE and RTL)
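to make the three-address code idea concrete, here's a rough sketch in rust (my own illustration, not anything from the book) of a = b * c + d lowered into instructions with at most three operands each:

// hypothetical three-address-code instructions, just to show the shape:
// each instruction has a destination and at most two operands
#[derive(Debug)]
enum Instr {
    Mul { dest: String, lhs: String, rhs: String },
    Add { dest: String, lhs: String, rhs: String },
    Copy { dest: String, src: String },
}

fn main() {
    // a = b * c + d, lowered into three-operand steps using temporaries
    let lowered = vec![
        Instr::Mul { dest: "t1".into(), lhs: "b".into(), rhs: "c".into() },
        Instr::Add { dest: "t2".into(), lhs: "t1".into(), rhs: "d".into() },
        Instr::Copy { dest: "a".into(), src: "t2".into() },
    ];
    for instr in &lowered {
        println!("{:?}", instr);
    }
}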
middle end
optimizations
- LLVM documents a list of its optimizations (passes)
- examples
- constant folding / constant propagation - calculate or replace expressions with the equivalent (e.g. a = 1 + 2 becomes a = 3) - see the sketch after this list
- common subexpression elimination - re-using common calculations instead of repeating the calculation
- loop-invariant code motion - also called hoisting or scalar promotion - moving statements/expression outside of a loop (so that they are not repeated) in a way that does not affect the semantics of the program
- global value numbering - finding duplicate values and then eliminating the duplicates (e.g. multiple versions of the same variable)
- strength reduction - replacing expensive (“strong”) operations with less expensive ones (“weaker”) - e.g. replacing a multiplication with an addition (in a way where the resulting values are equivalent)
- scalar replacement of aggregates - breaks up allocations of an aggregate type (like a struct) into individual allocations for its fields
- dead-code elimination
- loop unrolling - replacing a loop with the “unrolled form”, which allows you to remove end of loop tests as well as loop indices
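a tiny sketch of what constant folding could look like over an expression tree (my own rust example, not the book's code):

// minimal expression tree, just enough to show folding a constant subtree
#[derive(Debug)]
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
}

// fold Add(Num, Num) into a single Num; leave anything else alone
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(lhs, rhs) => match (fold(*lhs), fold(*rhs)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

fn main() {
    // a = 1 + 2 becomes a = 3
    let folded = fold(Expr::Add(Box::new(Expr::Num(1.0)), Box::new(Expr::Num(2.0))));
    println!("{:?}", folded); // Num(3.0)
}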
back end
code generation
- after all optimizations, we generate actual machine code (assembly-like stuff). however, we have to decide whether we're targeting a real machine (e.g. x86 machine code) or a virtual machine like Uxn, in which case we generate "p-code", more commonly referred to as "bytecode"
- if you create bytecode, you need to then create a translator or mini-compiler that translates from the bytecode to the actual machine code, in which bytecode is really just another IR.
- alternatively, you can write a virtual machine, which emulates a hypothetical chip which supports the bytecode
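to make the "emulate a hypothetical chip" idea concrete, here's a minimal sketch of a stack-based vm dispatching over made-up bytecode ops (my own example, not from the book):

// a made-up bytecode: push a constant, or add the top two stack values
enum Op {
    Push(f64),
    Add,
}

// the "hypothetical chip": a loop that dispatches on each op
fn run(program: &[Op]) -> Option<f64> {
    let mut stack: Vec<f64> = Vec::new();
    for op in program {
        match op {
            Op::Push(value) => stack.push(*value),
            Op::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
        }
    }
    stack.pop()
}

fn main() {
    // equivalent of evaluating 1 + 2
    let result = run(&[Op::Push(1.0), Op::Push(2.0), Op::Add]);
    println!("{:?}", result); // Some(3.0)
}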
runtime
- execution time - actually being able to run the program - either by starting up a VM (like the python interpreter) - plus language services like "instance of" runtime checks or type checks, which are bundled as part of the runtime (or interpreter)
shortcuts
- single pass compilers - interleave parsing, analysis, and code generation so code is emitted directly from the parser - Pascal and C were designed around this limitation (due to memory costs) - e.g. note how both have forms of forward declaration - also see syntax-directed translation
- tree-walk interpreters - begin executing right after you parse the source into an AST - evaluating each node as it walks the tree (thus tree-walking) - tends to be slow, so rarely used for general-purpose languages (a notable exception being early versions of Ruby / MRI)
- transpilers - instead of targeting an IR or machine code, produce source code in another language
- just-in-time compilation - compile to native machine code when loading the program, then profile it as it runs to figure out which slow areas are worth recompiling and optimizing further (HotSpot JVM, LuaJIT)
chapter 3: lox, the language
- C style syntax
- dynamic typing
- automatic memory management
- reference counting
- tracing garbage collection
- data types: booleans, numbers, strings, nil
- expressions: operands, infix/prefix/postfix operators, as well as mixfix operators (like foo ? one : two)
- comparison, equality, logical operators
- execution blocks (scoping), closures, classes
- dynamic dispatch - runtime lookup of the method
- not static dispatch - compile time method lookup
- javascript style “this”
- field assignment creates the field if it's missing
- inheritance with <
- zero standard library
- expressions versus statements - “everything is an expression” is like Lisp - but then you need to decide what to do with every “statement like” expression - like implicit returns
chapter 4: scanning
- source code -> tokens
- also called “lexing” or “lexical analysis”
- terms are basically interchangeable
- creating a REPL (read-eval-print-loop)
- scanning creates lexemes, then bundles each one with extra info to create a token:
- token type
- literal values - like a number
- location information (error handling), e.g. offset or line
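in rust, the token bundle could look something like this (a sketch with my own field names, not the book's java fields):

// what a token might carry: its type, the raw lexeme, an optional literal
// value, and where it came from for error reporting
#[derive(Debug)]
enum TokenType {
    LeftParen,
    Number,
    Identifier,
    // ... one variant per kind of token
}

#[derive(Debug)]
struct Token {
    token_type: TokenType,
    lexeme: String,       // the raw text, e.g. "123"
    literal: Option<f64>, // the parsed value for number literals
    line: usize,          // location info for error handling
}

fn main() {
    let token = Token {
        token_type: TokenType::Number,
        lexeme: "123".to_string(),
        literal: Some(123.0),
        line: 1,
    };
    println!("{:?}", token);
}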
- the rest of the chapter goes on to just focus on actually writing a simple scanner.
the scanner in the book is written in java, and it uses a mutable array that it pushes tokens onto as they're scanned. it also keeps track of where it is in the string with a simple pointer. however, this doesn't work as easily in rust, where strings are utf-8 (which may use multiple bytes per grapheme/character). in order to index into the String we need to either use a Chars iterator or use char_indices on the String to figure out where a char maps to a particular byte. I could probably have gotten away with using the exact same approach - because all the identifiers and keywords in the language are single byte utf-8 characters (i.e. ascii). however - this part of the book gave me a chance to dig more into the distinction between str/string/bytes/chars/graphemes in rust. i ended up implementing a version of the scanner that uses Peekable - and that was enough to keep track of the state of the iterator and advance when needed.
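roughly, the kind of Peekable-based cursor i ended up with looks like this (a simplified sketch with made-up names, not the exact code):

use std::iter::Peekable;
use std::str::Chars;

// a cursor over the source that can look at the next char without consuming it
struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Cursor { chars: source.chars().peekable() }
    }

    // consume the next char only if it matches, e.g. to turn '=' + '=' into "=="
    fn advance_if(&mut self, expected: char) -> bool {
        if self.chars.peek() == Some(&expected) {
            self.chars.next();
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cursor = Cursor::new("==");
    cursor.chars.next(); // consume the first '='
    println!("{}", cursor.advance_if('=')); // true: the lexeme was "=="
}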
chapter 5: representing code
- when representing code, we need to create a context free grammar
- contrasting lexical grammars to syntactic grammars
- alphabet -> characters (lexical) -> tokens (syntactic)
- string -> lexeme/token (lexical) -> expression (syntactic)
- implemented by -> scanner (lexical) -> parser (syntactic)
- formal grammars specify which are valid strings and which aren’t
- strings created from the grammar: derivations
- rules == “productions” since they produce strings in the grammar
- each rule has a head -> its name, and a body, which describes what it generates
- terminal versus nonterminal symbols
- example of backus-naur form (BNF)
lox grammar:
- literals - numbers, strings, booleans, nil
- unary expressions: a prefix like ! to be logical not, and - to negate a number
- binary expressions: arithmetic infixes (+, -, *, /) and logic (==, !=, etc)
- parentheses: wrapped around expressions
- eventually creating an abstract syntax tree
- parse trees & ASTs are ways for the parser/interpreter to communicate
- expression problem
- object oriented code makes adding new types easy (new class), but adding new functionalities annoying (add new method to every class)
- functional code makes adding new behavior easier (new function), but adding new types is annoying (add new type to all pattern matching)
- visitor pattern - approximating functional style in an OOP language
- chapter ends by writing a class to generate data structs/classes for expression types
- each class is a java class, with just a few fields defined, like a python dataclass
- however in rust, we wouldn’t do this - we’re going to use boxes (to store data on the heap) to enable recursive types
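a sketch of what that could look like (my own variant names, not a direct translation of the book's java classes):

// each variant owns its children through Box so the recursive type has a known size
#[derive(Debug)]
enum Expr {
    Literal(f64),
    Unary { operator: char, right: Box<Expr> },
    Binary { left: Box<Expr>, operator: char, right: Box<Expr> },
    Grouping(Box<Expr>),
}

fn main() {
    // -(1 + 2) as a tree
    let expr = Expr::Unary {
        operator: '-',
        right: Box::new(Expr::Grouping(Box::new(Expr::Binary {
            left: Box::new(Expr::Literal(1.0)),
            operator: '+',
            right: Box::new(Expr::Literal(2.0)),
        }))),
    };
    println!("{:?}", expr);
}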
chapter 6: parsing expressions
example lox grammar:
expression → literal
| unary
| binary
| grouping ;
literal → NUMBER | STRING | "true" | "false" | "nil" ;
grouping → "(" expression ")" ;
unary → ( "-" | "!" ) expression ;
binary → expression operator expression ;
operator → "==" | "!=" | "<" | "<=" | ">" | ">="
| "+" | "-" | "*" | "/" ;
- ambiguity in parsing - we have to choose associativity
- consider order of operations
- precedence: what is evaluated first when you have different operators
- associativity: what is evaluated first in a series of identical operators
- e.g. left associative: 5 - 3 - 1 evaluates as (5 - 3) - 1
- some operators are non-associative
- consider assignment: a = 5 + 1 parses as a = (5 + 1)
recursive descent parsing:
grammar rule    <->   precedence
--------------------------------
top:     equality         lower
         comparison
         addition
         multiplication
bottom:  unary            higher
- descent because you’re going “down”
- top down parsers: lowest precedence encountered first
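the "descending" shape is easiest to see as a chain of functions where each precedence level calls the next-higher-precedence one - here's a rough sketch over single digits, '+' and '*' (my own simplification; the real parser works over tokens and covers the whole grammar):

use std::iter::Peekable;
use std::str::Chars;

// lowest precedence level: handles '+', deferring to multiplication first
fn addition(chars: &mut Peekable<Chars>) -> f64 {
    let mut value = multiplication(chars);
    while chars.peek() == Some(&'+') {
        chars.next();
        value += multiplication(chars);
    }
    value
}

// higher precedence: handles '*', deferring to primary first
fn multiplication(chars: &mut Peekable<Chars>) -> f64 {
    let mut value = primary(chars);
    while chars.peek() == Some(&'*') {
        chars.next();
        value *= primary(chars);
    }
    value
}

// highest precedence: a single digit literal
fn primary(chars: &mut Peekable<Chars>) -> f64 {
    let c = chars.next().expect("expected a digit");
    c.to_digit(10).expect("expected a digit") as f64
}

fn main() {
    // multiplication binds tighter because addition() defers to it first
    let mut input = "1+2*3".chars().peekable();
    println!("{}", addition(&mut input)); // 7
}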
the parser in the book is written almost identically to the scanner, however it needs a lot of conditional matching. this sort of works with Peekable again in rust, however, it ends up causing some problems when you need to check more than one token forward at a time. i think there was probably a way to do it while still satisfying the borrow checker, but the code started to look ugly and copying the code from the book was simple enough. one other thing that came up while i was doing this was noticing the lack of variadic arguments in rust. it seems simple enough to just pass a vector of the argument type in question, but i still found it surprising. another thing i learned about here was slightly more complex destructuring in rust:
match c {
    // match a specific field value
    Token { token_type: TokenType::Foo, .. } => println!("match struct with field value"),
    // match either of two values in the field
    Token { token_type: TokenType::Foo | TokenType::Bar, .. } => println!("match foo or bar field values"),
    // match any Token, ignoring its fields
    Token { .. } => println!("match struct"),
}
in the book’s implementation, java’s lack of builtin pattern matching becomes really obvious, and a ton of if branches that just try to make a match doesn’t feel like very idiomatic rust. however, at the end of the day, it’s ok to do the simpler thing in this case, since i think the complexity can still be completely hidden from the exported api by putting the entire parser and its state behind a single static function.
chapter 7: evaluating expressions
- values are created by literals
- stored in variables - which are objects in lox
- in java, this is implemented (dynamic typing) via Object
- in rust, we’ll use an enum with a variant for each kind of value (see the sketch after this list)
- java will handle memory etc via a gc
- in rust we have our borrow checker
- literals are values - responsibility of the parser
- other values (computations) are the responsibility of runtime
- interpreter does post-order traversal - each node evaluates its children first
- then we handle runtime errors
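a sketch of the enum-of-values approach (my own names, assuming the usual lox data types):

// one variant per lox data type; the interpreter passes these around at runtime
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Nil,
    Boolean(bool),
    Number(f64),
    Str(String),
}

fn main() {
    // evaluating a literal just wraps it in the matching variant
    let values = vec![Value::Nil, Value::Boolean(true), Value::Number(1.5), Value::Str("hi".into())];
    for v in &values {
        println!("{:?}", v);
    }
}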
chapter 8: statements & state
- adding statements - an expression statement lets an expression go where a statement is expected
- then we add print statements, mostly for didactic reasons
- then we add variable declarations, which are statements
- we need to handle global state - a big hashmap on the interpreter for variable values (see the sketch after this list)
- l-value versus r-value (classical terms) - in var a = a + 1, the a being assigned to is the l-value, and the right side a + 1 is the r-value
- we now have creation, reading, and mutation of global variables
- lox does not have dynamic scope for variables, but methods/fields on objects are looked up dynamically
- lexical scope shows where a scope begins and ends - just from reading the code
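the global state part could look roughly like this (a minimal sketch using f64 instead of a full value enum, and ignoring nested scopes):

use std::collections::HashMap;

// maps variable names to their current values; lives on the interpreter
struct Environment {
    values: HashMap<String, f64>, // f64 stands in for a full Value enum here
}

impl Environment {
    fn new() -> Self {
        Environment { values: HashMap::new() }
    }

    // var a = 1; always (re)defines the name
    fn define(&mut self, name: &str, value: f64) {
        self.values.insert(name.to_string(), value);
    }

    // reading an undefined variable is a runtime error, hence Option here
    fn get(&self, name: &str) -> Option<f64> {
        self.values.get(name).copied()
    }

    // a = 2; assignment only succeeds if the variable already exists
    fn assign(&mut self, name: &str, value: f64) -> bool {
        match self.values.get_mut(name) {
            Some(slot) => { *slot = value; true }
            None => false,
        }
    }
}

fn main() {
    let mut env = Environment::new();
    env.define("a", 1.0);
    env.assign("a", 2.0);
    println!("{:?}", env.get("a")); // Some(2.0)
}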
chapter 9: control flow
- this chapter needs fewer notes, as implementing the new constructs mostly builds on existing concepts
- now we have or, and, while, for, and if working