Crafting Interpreters
below are some of the notes i am taking while reading through robert nystrom’s crafting interpreters book.
chapter 2: a map of the territory
the long route.
front end
- scanning (lexing) / lexical analysis - takes a stream of characters and chunks them into tokens
- parsing - the tokens acquire grammar through parsing, producing an abstract syntax tree (AST)
- static analysis
- binding / resolution - for every identifier, find where that name is defined and wire the two together
- type checking
- after all this, new insights/information is available, but we have to store it somewhere:
- as attributes on the AST
- store it in a lookup table (a symbol table)
- create an intermediate representation (IR)
intermediate representations
- IRs make it easier to target multiple architectures
- styles of IRs
- continuation-passing style (CPS)
- used in the first version of Scheme- every function takes an extra argument that is a function (a “continuation”) and the function does the equivalent of “return” by calling the continuation function with the result value (what you would normally return). this is more common in functional and logic style languages.
- static single-assignment (SSA) - every variable is assigned exactly once, which makes many optimizations easier. used by LLVM for example.
- three-address code (TAC / 3AC) - breaks complex expressions down into a sequence of instructions with at most three operands each - which looks sort of like assembly (since it can have labeled jumps/GOTOs) - see the sketch after this list
- GCC has its own well-known IRs (GIMPLE and RTL)
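to make the three-address code idea concrete, here's a rough sketch in rust (my own illustration, not anything from the book) of a = b * c + d lowered into instructions with at most three operands each:

// hypothetical three-address-code instructions, just to show the shape:
// each instruction has a destination and at most two operands
#[derive(Debug)]
enum Instr {
    Mul { dest: String, lhs: String, rhs: String },
    Add { dest: String, lhs: String, rhs: String },
    Copy { dest: String, src: String },
}

fn main() {
    // a = b * c + d, lowered into three-operand steps using temporaries
    let lowered = vec![
        Instr::Mul { dest: "t1".into(), lhs: "b".into(), rhs: "c".into() },
        Instr::Add { dest: "t2".into(), lhs: "t1".into(), rhs: "d".into() },
        Instr::Copy { dest: "a".into(), src: "t2".into() },
    ];
    for instr in &lowered {
        println!("{:?}", instr);
    }
}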
middle end
optimizations
- LLVM documents a list of its optimizations (passes)
- examples
- constant folding / constant propagation - calculate or replace expressions with the equivalent (e.g. a = 1 + 2 becomes a = 3) - see the sketch after this list
- common subexpression elimination - re-using common calculations instead of repeating the calculation
- loop-invariant code motion - also called hoisting or scalar promotion - moving statements/expression outside of a loop (so that they are not repeated) in a way that does not affect the semantics of the program
- global value numbering - finding duplicate values and then eliminating the duplicates (e.g. multiple versions of the same variable)
- strength reduction - replacing expensive (“strong”) operations with less expensive ones (“weaker”) - e.g. replacing a multiplication with an addition (in a way where the resulting values are equivalent)
- scalar replacement of aggregates - breaks up allocations of an aggregate type (like a struct) into individual allocations for its fields
- dead-code elimination
- loop unrolling - replacing a loop with the “unrolled form”, which allows you to remove end of loop tests as well as loop indices
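a tiny sketch of what constant folding could look like over an expression tree (my own rust example, not the book's code):

// minimal expression tree, just enough to show folding a constant subtree
#[derive(Debug)]
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
}

// fold Add(Num, Num) into a single Num; leave anything else alone
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(lhs, rhs) => match (fold(*lhs), fold(*rhs)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

fn main() {
    // a = 1 + 2 becomes a = 3
    let folded = fold(Expr::Add(Box::new(Expr::Num(1.0)), Box::new(Expr::Num(2.0))));
    println!("{:?}", folded); // Num(3.0)
}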
back end
code generation
- after all optimizations, we generate actual machine code (assembly-like stuff). however, we have to decide whether we're targeting a real machine (e.g. x86 machine code) or a virtual machine like Uxn, in which case we generate "p-code", more commonly referred to as "bytecode"
- if you create bytecode, you need to then create a translator or mini-compiler that translates from the bytecode to the actual machine code, in which bytecode is really just another IR.
- alternatively, you can write a virtual machine, which emulates a hypothetical chip which supports the bytecode
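to make the "emulate a hypothetical chip" idea concrete, here's a minimal sketch of a stack-based vm dispatching over made-up bytecode ops (my own example, not from the book):

// a made-up bytecode: push a constant, or add the top two stack values
enum Op {
    Push(f64),
    Add,
}

// the "hypothetical chip": a loop that dispatches on each op
fn run(program: &[Op]) -> Option<f64> {
    let mut stack: Vec<f64> = Vec::new();
    for op in program {
        match op {
            Op::Push(value) => stack.push(*value),
            Op::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
        }
    }
    stack.pop()
}

fn main() {
    // equivalent of evaluating 1 + 2
    let result = run(&[Op::Push(1.0), Op::Push(2.0), Op::Add]);
    println!("{:?}", result); // Some(3.0)
}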
runtime
- execution time - actually being able to run the program - either by starting up a VM (like the python interpreter) - plus language services like "instance of" runtime checks or type checks, which are bundled as part of the runtime (or interpreter)
shortcuts
- single pass compilers - interleave parsing, analysis, and code generation so code is emitted directly from the parser - Pascal and C were designed around this limitation (due to memory costs) - e.g. note how both have forms of forward declaration - also see syntax-directed translation
- tree-walk interpreters - begin executing right after you parse the source into an AST - evaluating each node as it walks the tree (thus tree-walking) - tends to be slow, so rarely used for general-purpose languages (a notable exception being early versions of Ruby / MRI)
- transpilers - instead of targeting an IR or machine code, produce source code in another language
- just-in-time compilation - compile to native machine code when loading the program, then profile it as it runs to figure out which slow areas are worth recompiling and optimizing further (HotSpot JVM, LuaJIT)
chapter 3: lox, the language
- C style syntax
- dynamic typing
- automatic memory management
- reference counting
- tracing garbage collection
- data types: booleans, numbers, strings, nil
- expressions: operands, infix/prefix/postfix operators, as well as mixfix operators (like foo ? one : two)
- comparison, equality, logical operators
- execution blocks (scoping), closures, classes
- dynamic dispatch - runtime lookup of the method
- not static dispatch - compile time method lookup
- javascript style “this”
- field assignment creates the field if it's missing
- inheritance with <
- zero standard library
- expressions versus statements - “everything is an expression” is like Lisp - but then you need to decide what to do with every “statement like” expression - like implicit returns
chapter 4: scanning
- source code -> tokens
- also called “lexing” or “lexical analysis”
- terms are basically interchangeable
- creating a REPL (read-eval-print-loop)
- scanning creates lexemes, then bundles each one with extra info to create a token:
- token type
- literal values - like a number
- location information (error handling), e.g. offset or line
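in rust, the token bundle could look something like this (a sketch with my own field names, not the book's java fields):

// what a token might carry: its type, the raw lexeme, an optional literal
// value, and where it came from for error reporting
#[derive(Debug)]
enum TokenType {
    LeftParen,
    Number,
    Identifier,
    // ... one variant per kind of token
}

#[derive(Debug)]
struct Token {
    token_type: TokenType,
    lexeme: String,       // the raw text, e.g. "123"
    literal: Option<f64>, // the parsed value for number literals
    line: usize,          // location info for error handling
}

fn main() {
    let token = Token {
        token_type: TokenType::Number,
        lexeme: "123".to_string(),
        literal: Some(123.0),
        line: 1,
    };
    println!("{:?}", token);
}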
- the rest of the chapter goes on to just focus on actually writing a simple scanner.
the scanner in the book is written in java, and it uses a mutable array that it pushes tokens onto as they're scanned. it also keeps track of where it is in the string with a simple pointer. however, this doesn't work as easily in rust, where strings are utf-8 (which may use multiple bytes per grapheme/character). in order to index into the String we need to either use a Chars iterator or use char_indices on the String to figure out where a char maps to a particular byte. I could probably have gotten away with using the exact same approach - because all the identifiers and keywords in the language are single byte utf-8 characters (i.e. ascii). however - this part of the book gave me a chance to dig more into the distinction between str/string/bytes/chars/graphemes in rust. i ended up implementing a version of the scanner that uses Peekable - and that was enough to keep track of the state of the iterator and advance when needed.
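roughly, the kind of Peekable-based cursor i ended up with looks like this (a simplified sketch with made-up names, not the exact code):

use std::iter::Peekable;
use std::str::Chars;

// a cursor over the source that can look at the next char without consuming it
struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Cursor { chars: source.chars().peekable() }
    }

    // consume the next char only if it matches, e.g. to turn '=' + '=' into "=="
    fn advance_if(&mut self, expected: char) -> bool {
        if self.chars.peek() == Some(&expected) {
            self.chars.next();
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cursor = Cursor::new("==");
    cursor.chars.next(); // consume the first '='
    println!("{}", cursor.advance_if('=')); // true: the lexeme was "=="
}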
chapter 5: representing code
- when representing code, we need to create a context free grammar
- contrasting lexical grammars to syntactic grammars
- alphabet -> characters (lexical) -> tokens (syntactic)
- string -> lexeme/token (lexical) -> expression (syntactic)
- implemented by -> scanner (lexical) -> parser (syntactic)
- formal grammars specify which are valid strings and which aren’t
- strings created from the grammar: derivations
- rules == “productions” since they produce strings in the grammar
- each rule has a head -> its name, and a body, which describes what it generates
- terminal versus nonterminal symbols
- example of backus-naur form (BNF)
lox grammar:
- literals - numbers, strings, booleans, nil
- unary expressions: a prefix like ! to be logical not, and - to negate a number
- binary expressions: arithmetic infixes (+, -, *, /) and logic (==, !=, etc)
- parentheses: wrapped around expressions
- eventually creating an abstract syntax tree
- parse trees & ASTs are ways for the parser/interpreter to communicate
- expression problem
- object oriented code makes adding new types easy (new class), but adding new functionalities annoying (add new method to every class)
- functional code makes adding new behavior easier (new function), but adding new types is annoying (add new type to all pattern matching)
- visitor pattern - approximating functional style in an OOP language
- chapter ends by writing a class to generate data structs/classes for expression types
- each class is a java class, with just a few fields defined, like a python dataclass
- however in rust, we wouldn’t do this - we’re going to use boxes (to store data on the heap) to enable recursive types
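a sketch of what that could look like (my own variant names, not a direct translation of the book's java classes):

// each variant owns its children through Box so the recursive type has a known size
#[derive(Debug)]
enum Expr {
    Literal(f64),
    Unary { operator: char, right: Box<Expr> },
    Binary { left: Box<Expr>, operator: char, right: Box<Expr> },
    Grouping(Box<Expr>),
}

fn main() {
    // -(1 + 2) as a tree
    let expr = Expr::Unary {
        operator: '-',
        right: Box::new(Expr::Grouping(Box::new(Expr::Binary {
            left: Box::new(Expr::Literal(1.0)),
            operator: '+',
            right: Box::new(Expr::Literal(2.0)),
        }))),
    };
    println!("{:?}", expr);
}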
chapter 6: parsing expressions
example lox grammar:
expression → literal
| unary
| binary
| grouping ;
literal → NUMBER | STRING | "true" | "false" | "nil" ;
grouping → "(" expression ")" ;
unary → ( "-" | "!" ) expression ;
binary → expression operator expression ;
operator → "==" | "!=" | "<" | "<=" | ">" | ">="
| "+" | "-" | "*" | "/" ;
- ambiguity in parsing - we have to choose associativity
- consider order of operations
- precedence: what is evaluated first when you have different operators
- associativity: what is evaluated first in a series of identical operators
- e.g. left associative: 5 - 3 - 1 evaluates as (5 - 3) - 1
- some operators are non-associative
- consider assignment: a = 5 + 1 parses as a = (5 + 1)
recursive descent parsing:
grammar rule    <->   precedence
--------------------------------
top:     equality         lower
         comparison
         addition
         multiplication
bottom:  unary            higher
- descent because you’re going “down”
- top down parsers: lowest precedence encountered first
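the "descending" shape is easiest to see as a chain of functions where each precedence level calls the next-higher-precedence one - here's a rough sketch over single digits, '+' and '*' (my own simplification; the real parser works over tokens and covers the whole grammar):

use std::iter::Peekable;
use std::str::Chars;

// lowest precedence level: handles '+', deferring to multiplication first
fn addition(chars: &mut Peekable<Chars>) -> f64 {
    let mut value = multiplication(chars);
    while chars.peek() == Some(&'+') {
        chars.next();
        value += multiplication(chars);
    }
    value
}

// higher precedence: handles '*', deferring to primary first
fn multiplication(chars: &mut Peekable<Chars>) -> f64 {
    let mut value = primary(chars);
    while chars.peek() == Some(&'*') {
        chars.next();
        value *= primary(chars);
    }
    value
}

// highest precedence: a single digit literal
fn primary(chars: &mut Peekable<Chars>) -> f64 {
    let c = chars.next().expect("expected a digit");
    c.to_digit(10).expect("expected a digit") as f64
}

fn main() {
    // multiplication binds tighter because addition() defers to it first
    let mut input = "1+2*3".chars().peekable();
    println!("{}", addition(&mut input)); // 7
}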
the parser in the book is written almost identically to the scanner, however it needs a lot of conditional matching. this sort of works with Peekable again in rust, however, it ends up causing some problems when you need to check more than one token forward at a time. i think there was probably a way to do it while still satisfying the borrow checker, but the code started to look ugly and copying the code from the book was simple enough. one other thing that came up while i was doing this was noticing the lack of variadic arguments in rust. it seems simple enough to just pass a vector of the argument type in question, but i still found it surprising. another thing i learned about here was slightly more complex destructuring in rust:
match c {
    // match a specific field value
    Token { token_type: TokenType::Foo, .. } => println!("match struct with field value"),
    // match either of two values in the field
    Token { token_type: TokenType::Foo | TokenType::Bar, .. } => println!("match foo or bar field values"),
    // match any Token, ignoring its fields
    Token { .. } => println!("match struct"),
}
in the book’s implementation, java’s lack of builtin pattern matching becomes really obvious, and a ton of if branches that just try to make a match doesn’t feel like very idiomatic rust. however, at the end of the day, it’s ok to do the simpler thing in this case, since i think the complexity can still be completely hidden from the exported api by putting the entire parser and its state behind a single static function.
chapter 7: evaluating expressions
- values are created by literals
- stored in variables - which are objects in lox
- in java, this is implemented (dynamic typing) via Object
- in rust, we’ll use an enum with a variant for each kind of value (see the sketch after this list)
- java will handle memory etc via a gc
- in rust we have our borrow checker
- literals are values - responsibility of the parser
- other values (computations) are the responsibility of runtime
- interpreter does post-order traversal - each node evaluates its children first
- then we handle runtime errors
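a sketch of the enum-of-values approach (my own names, assuming the usual lox data types):

// one variant per lox data type; the interpreter passes these around at runtime
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Nil,
    Boolean(bool),
    Number(f64),
    Str(String),
}

fn main() {
    // evaluating a literal just wraps it in the matching variant
    let values = vec![Value::Nil, Value::Boolean(true), Value::Number(1.5), Value::Str("hi".into())];
    for v in &values {
        println!("{:?}", v);
    }
}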
chapter 8: statements & state
- adding statements - an expression statement lets an expression go where a statement is expected
- then we add print statements, mostly for didactic reasons
- then we add variable declarations, which are statements
- we need to handle global state - a big hashmap on the interpreter for variable values (see the sketch after this list)
- l-value versus r-value (classical terms) - in var a = a + 1, the a being assigned to is the l-value, and the right side a + 1 is the r-value
- we now have creation, reading, and mutation of global variables
- lox does not have dynamic scope for variables, but methods/fields on objects are looked up dynamically
- lexical scope shows where a scope begins and ends - just from reading the code
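the global state part could look roughly like this (a minimal sketch using f64 instead of a full value enum, and ignoring nested scopes):

use std::collections::HashMap;

// maps variable names to their current values; lives on the interpreter
struct Environment {
    values: HashMap<String, f64>, // f64 stands in for a full Value enum here
}

impl Environment {
    fn new() -> Self {
        Environment { values: HashMap::new() }
    }

    // var a = 1; always (re)defines the name
    fn define(&mut self, name: &str, value: f64) {
        self.values.insert(name.to_string(), value);
    }

    // reading an undefined variable is a runtime error, hence Option here
    fn get(&self, name: &str) -> Option<f64> {
        self.values.get(name).copied()
    }

    // a = 2; assignment only succeeds if the variable already exists
    fn assign(&mut self, name: &str, value: f64) -> bool {
        match self.values.get_mut(name) {
            Some(slot) => { *slot = value; true }
            None => false,
        }
    }
}

fn main() {
    let mut env = Environment::new();
    env.define("a", 1.0);
    env.assign("a", 2.0);
    println!("{:?}", env.get("a")); // Some(2.0)
}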
chapter 9: control flow
- this chapter needs fewer notes, as implementing the new constructs mostly builds on existing concepts
- now we have or, and, while, for, and if working