Fall 2025 notes
TODO:
- at end of quarter, send out AI survey form (create gradescope assignment for them to submit it)
lecture timing
- week 0.2 (quarter starts on thursday): up to (not including) ‘compiler overview::things we’re leaving out’
- week 1.1: through end of ‘lexing’
- week 1.2: up to (not including) ‘parsing::recursive descent and LL(1)::LL(1) recursive descent’
- week 2.1: up to (not including) ‘parsing::building the AST’ [ASSIGN-1 OUT]
- week 2.2: through end of parsing
- week 3.1:
- week 3.2:
- week 4.1:
- week 4.2:
- week 5.1:
- week 5.2:
- week 6.1:
- week 6.2:
- week 7.1: (holiday)
- week 7.2:
- week 8.1:
- week 8.2:
- week 9.1:
- week 9.2: (holiday)
- week 10.1:
- week 10.2:
TODO: tentative assignment timing
TODO: maybe decrease type checking, increase some of the others by making assignment due dates other than on class days
- week 2.1 assign-1 (lexing and parsing) – 14 days
- week 4.1 assign-2 (type checking) – 14 days
- week 6.1 assign-3 (lowering) – 9 days
- week 7.2 assign-4 (codegen) – 13 days
- week 9.* assign-5 (gc) – 9 days
grading scale
percentage | letter grade |
---|---|
98–100 | A+ |
94–97 | A |
90–93 | A- |
87–89 | B+ |
84–86 | B |
80–83 | B- |
77–79 | C+ |
74–76 | C |
70–73 | C- |
67–69 | D+ |
64–66 | D |
60–63 | D- |
00–59 | F |
misc notes
this quarter i’m replacing the lir optimization assignment with a garbage collection assignment, and so i’ve moved the lecture material around accordingly (still covering optimization, but later and in less depth than last time)
i’m still not covering register allocation, mostly due to lack of time; if i do decide to cover it then i need to decide what to remove (and also create the lecture notes for it—see what i can crib from mehmet)
be sure to tell the students that when they access JSON files for their assignments (i.e., reading the input for the verification, lowering, and codegen assignments) they should be sure to use references and not deep copying (e.g.,
json& funcs = value["functions"]
not
json funcs = value["functions"]
). they also probably want to create a new data structure (AST or LIR) from the JSON rather than only using the JSON itself.
gradescope
sometimes student submissions give corrupted output (e.g., not utf-8), which causes DiffOutputFiles to fail and thus the entire grading script; figure out a way to tolerate these kinds of failures and treat them as incorrect outputs (maybe add a specific test for this problem in DiffOutputFiles)
students get confused by the failure output: we report the total number of tests passed out of the total number of tests, and they read that as the number of the failing test (even though the same message gives them the name of the test); this matters because that number is how they identify the test when they contact us, which is confusing; maybe we can tweak the failure message to make it more clear?
assign-1
copy-pasting the failed program from the grader output doesn’t copy the special characters (e.g., \r, \t, etc); the students need to be aware of this possibility
we give the students the LL(1) version of the grammar so they never actually have to use the grammar refactoring rules; i think this is fair given all the other stuff they have to do, but maybe think about giving quizzes (graded or not) where they have to refactor some simple grammars
since test cases are generated as ASTs and then printed out as source code, the test cases have more regular syntax than the grammar allows—for example, toplevel constructs will always be in the same order, every function will have at most one let, and arithmetic ops are all parenthesized to make precedence explicit; to allow for testing the student parsers more thoroughly it would be good to create a printer that randomizes these things
- for now i’ve decided that this isn’t worth fixing; maybe later if i have time
course logistics
introduce myself and TAs
- hitomi
- sarah
course slack:
invite will be sent to all registered students
all communication goes through slack: announcements, assignments, questions, etc
be sure to stay on top of slack notifications; set up your preferences accordingly
office hours: TBA, will be mix of in-person and remote and at different days and times
discussion sections: no new material; basically organized office hours
- NOTE: need to close the first discussion section (9am); the 3 students enrolled in it should see me after lecture
assessment:
no exams
5 projects (going into finals week)
- 1–3 person teams
- weighted evenly, must do all 5
- late days allowed, but with grade penalty (-11% for one week late, -31% if before end of quarter)
- one assignment can be turned in late with no penalty (automatically whichever helps you most)
- this late penalty scheme is intended to give all the flexibility you need, please don’t ask for more late days
projects in C++, done from scratch—no skeleton code given
if you’re using an online repo (e.g., github) make sure it’s private! sharing your code publicly is considered a violation of academic integrity (as is using code that others may have shared)—projects should be your own work
what about AI?
it’s ok to use AI for your projects
i advise against it:
the point of the projects is to really learn how compilers work; using AI prevents learning
i don’t think it will work very well
the only thing i require is that you disclose AI use and how you used it
- there is no penalty
- this is for my information so i can learn how AI might impact things
important: this is a work-intensive course that requires a lot of coding—be prepared! don’t take other heavy-workload courses at the same time (like 170)
please ask questions and correct my mistakes!
questions are good: they let me know when i need to do a better job of explaining things and slow the pace of the lecture if it’s going too fast
i’m only human, and sometimes when i’m writing things down my wires get crossed and i make mistakes…i’d much prefer that they’re caught right away instead of going into people’s notes and causing confusion later
intro to course
this course is about compilers, which you’ve been using for years now
you put C++ (or whatever) source code in, and get an executable out (or error messages)
up to now it’s just been a black box that works by magic
but how does it work? and why do we do it that way?
motivation
computer chips are designed with an instruction set architecture (ISA); an ISA is a set of binary values that correspond to certain actions by the computer (e.g., take two registers, add their values, and put the result in another register)
different architectures use different ISAs (e.g., x86, ARM, RISC-V, MIPS) but they are all the same basic idea
ISA instructions are also called machine code, in contrast to source code
humans can’t program directly in machine code except for very small, limited cases; even when we use a somewhat higher level of abstraction (assembly language) it’s still really painful
so we use modern, high-level programming languages like C, C++, Java, Python, JavaScript, etc (there are dozens of widely-used languages, thousands of existing languages)
- this is important: programming languages are for humans, not computers
but to the computer, source code is just text—it might as well be shakespeare, it has no meaning that the computer can understand or execute
- so we still need to be able to translate source code into machine code
source code: [see OneNote]
"x = x + 2;"
- what the computer sees (in ASCII code):
['x', ' ', '=', ' ', 'x', ' ', '+', ' ', '2', ';']
- translated to assembly:
mov eax, [esp+4]
add eax, 2
mov [esp+4], eax
- translated to machine code:
[0x8B, 0x44, 0x24, 0x04, 0x83, 0xC0, 0x02, 0x89, 0x44, 0x24, 0x04]
besides the fact that source code is just text, it also uses abstractions that are meaningless to the computer
classes, objects, structs, functions, if statements, while and for loops, variables…none of these are things that the computer knows about or understands
programming languages are specifically designed to give the human programmers useful abstractions to describe data and computation
- different languages use different kinds of abstractions (e.g., object-oriented languages vs functional languages vs logic programming languages)
so somehow we need to take a program in source code (with whatever abstractions that particular language uses) and turn it into a stream of ISA instructions (without any of those abstractions) that the computer can actually execute
approach 1: ahead-of-time compilers
take source code and translate it into machine code (e.g., gcc or clang)
we call this ‘ahead of time’ because it happens before the program is ever executed
approach 2: interpreters
take source code and have an interpreter take each statement and expression and implicitly translate it into machine code on the fly (e.g., cpython for python)
it’s implicit because the mapping from source code to machine code isn’t present explicitly in the interpreter—an interpreter for language A is itself written in some source language B, and it looks at the source code for A and implements it in terms of B
but how was the interpreter itself made executable? a compiler!
interpreting source code is much slower than executing compiled code, because the interpreter itself is a program that has to run in addition to the program it is interpreting; on the other hand, interpreters don’t need to wait for compilation to start executing source code, so they are faster to use in a “write-execute” cycle
approach 3: hybrid
use an interpreter, but during execution also in parallel compile code to machine instructions (e.g., v8 engine for javascript)
this is called ‘just in time’ compilation, or JIT compilation for short
get the fast startup time of interpreters but still a good part of the performance of compiled code (though still slower than ahead-of-time compilation)
some language implementations do all three, like Java and any other language that targets the Java Virtual Machine (JVM), or .NET languages that target the .NET runtime
ahead-of-time compilation to java bytecode
bytecode interpreted by JVM
JVM also JIT compiles the bytecode for better performance
so no matter what language you’re using, compilers are involved in some way; there is no modern programming without compilers
it’s important to note that languages are not compiled or interpreted—a language is implemented with a compiler or interpreter
- there are interpreters for C and C++
- there are compilers for python
programming languages are how humans express computation: we can’t compute without them, and we can’t make them work without compilers—as computer scientists and/or developers, there shouldn’t be any magic here
goals for this course:
dispel the “magic” and get a basic understanding of how source code turns into executable code
not just relevant to compilers for mainstream programming languages like C++, etc: parsing log or configuration files, creating domain-specific languages, etc
show how some of the computational theory from 138 has practical applications
give experience writing (relatively) large and complex software from scratch
still much smaller and simpler than mainstream compilers that have had hundreds of people working on them for many years
you won’t be a compiler expert after this course, but you’ll know enough to have an idea of how compilers work and a foundation you can use to learn more if you want
compiler overview
compilation
- to put compilation in context:
[source + macros] --PREPROCESSOR--> [source code] --COMPILER--> assembly --ASSEMBLER-->
[relocatable machine code] --LINKER--> executable machine code
we often call this whole process “compilation”, but technically the compiler part is from source code to assembly
- compilers like gcc, clang, etc automatically invoke the preprocessor, assembler, and linker as third-party executables; you can configure which ones they use
traditionally compilers are broken into three pieces:
- front end: responsible for syntax and validation
- “middle” end: responsible for optimization
- back end: responsible for generating machine code
[source code] -[front end]→ [internal representation] -[back end]→ [assembly]
⇵
[middle end]
front end
responsible for three things:
- lexer (aka scanner, tokenizer)
- parser
- validation (type checking, etc)
lexer
the lexer takes source code (as a single string) and turns it into a sequence of tokens
lexer is short for “lexical analyzer”
a token is a “word” in our programming language
the lexer categorizes substrings of the input into classes of recognized things in our language
lexing example [see OneNote]
// as an example
let sum:int = 0, total:int = 5, x:int = 2;
while sum < total {
sum = sum + x * 10;
}
- as a string:
['/', '/', ' ', 'a', 's', ' ', 'a', 'n', ' ', 'e', 'x', 'a', 'm', 'p', 'l', 'e',
'\n', 'l', 'e', 't', ' ', 's', 'u', 'm', ':', 'i', 'n', 't', ' ', '=', ' ', '0', ',',
' ', 't', 'o', 't', 'a', 'l', ':', 'i', 'n', 't', ' ', '=', ' ', '5', ',', ' ',
'x', ':', 'i', 'n', 't', ' ', '=', ' ', '2', ';', '\n', 'w', 'h', 'i', 'l', 'e',
' ', 's', 'u', 'm', ' ', '<', ' ', 't', 'o', 't', 'a', 'l', ' ', '{', '\n', ' ',
' ', 's', 'u', 'm', ' ', '=', ' ', 's', 'u', 'm', ' ', '+', ' ', 'x', ' ', '*',
' ', '1', '0', ';', '\n', '}']
- after lexing:
[LET, VAR(sum), COLON, TYPE(int), EQ, NUM(0), COMMA, VAR(total), COLON, TYPE(int),
EQ, NUM(5), COMMA, VAR(x), COLON, TYPE(int), EQ, NUM(2), SEMICOLON, WHILE, VAR(sum),
LT, VAR(total), OPEN_BRACE, VAR(sum), EQ, VAR(sum), PLUS, VAR(x), MUL, NUM(10),
SEMICOLON, CLOSE_BRACE]
notice that lexing eliminated the comments and whitespace, what’s left are the building blocks of the source code
the tokenizer recognizes that, e.g., let is a keyword instead of a variable, that sum is a variable, that 10 is a number, etc
some tokens carry extra information (e.g., for VAR what the name is); these are ignored by the parser but necessary for later stages
how does the lexer know what the tokens should be?
the tokens are specified by the language designer as regular expressions
these are converted into NFA (or DFA) and used to process the string into tokens
the lexer classifies sequences of characters into elements of our language (i.e., tokens), but still leaves those tokens as a flat sequence without any structure
- we know at this point that the source code only contains valid tokens, but we still don’t know if it’s syntactically correct
parser
the parser has two jobs:
verify that the token stream represents a syntactically correct program in our language
build an internal representation of that program for the compiler to operate on
verifying syntactic correctness:
the syntax of the language is specified by the language designer as a CFG, with the tokens as the alphabet
to show a program is syntactically correct we have to parse the tokens and show that a derivation exists from the start symbol to the token stream
- recall from 138 ‘derivation trees’, aka ‘parse trees’, aka ‘syntax tree’
the internal representation usually isn’t the actual derivation tree because it contains a lot of things we don’t care about when compiling (e.g., punctuation); instead we usually use an abstract syntax tree (AST) that keeps the structure of the derivation tree but only keeps the important information
parsing example [see OneNote; draw tree]
LET
- VAR(sum), TYPE(int), NUM(0)
- VAR(total), TYPE(int), NUM(5)
- VAR(x), TYPE(int), NUM(2)
WHILE
- Guard
- LT
- Left: VAR(sum)
- Right: VAR(total)
- Body
- Assignment
- Left: VAR(sum)
- Right:
- PLUS
- Left: VAR(sum)
- Right:
- MUL
- Left: VAR(x)
- Right: NUM(10)
notice that the punctuation like colons, semicolons, braces, etc are gone
they were necessary in the concrete grammar to figure out the correct AST structure (where does a statement end; what statements belong to the while body; etc)
they aren’t needed now because the AST structure makes them superfluous
notice that the AST structure also makes precedence explicit for the arithmetic
we know at this point that the program is syntactically correct, but we still don’t know if it’s valid
validation
just because a program is syntactically correct doesn’t mean it’s a valid program
- analogy with English: “colorless green ideas sleep furiously” (from chomsky)
example of a syntactically correct but invalid program:
let x:int = 0, y:int = 10;
x = new int;
y = x * y;
here x is an integer but we’re storing a pointer value (i.e., an address); multiplying by an address doesn’t make sense
we need to make several checks, but perhaps the most important is type checking
- everything is declared with a proper type
- anything declared as a given type is only used in ways that make sense for that type
at this point we know that the program is a valid program and we can go to the middle end
middle end
the primary job of the so-called middle-end is to optimize the code in order to make it more efficient
- we often lower the AST into a simpler, lower-level intermediate representation (IR) that is easier to reason about and exposes more optimization opportunities
let’s look at some examples (to keep things simple we’ll just modify the source code itself instead of showing the IR):
original [see OneNote]
let sum:int = 0, total:int = 5, x:int = 2;
while sum < total {
sum = sum + x * 10;
}
- loop invariant code motion
let sum:int = 0, total:int = 5, x:int = 2, tmp:int;
tmp = x * 10;
while sum < total {
sum = sum + tmp;
}
- constant folding
let sum:int = 0, total:int = 5, x:int = 2, tmp:int;
tmp = 20;
while sum < 5 {
sum = sum + 20;
}
- dead code elimination
let sum:int = 0;
while sum < 5 {
sum = sum + 20;
}
we call this process “optimization”, but there isn’t really a sense of “optimal” code meaning the best possible code—just better than the original
- the more optimizations we apply the better the result (with diminishing returns) but the longer compilation takes
once the code has been optimized as much as desired we’re done with the middle-end
back end
the primary job of the back end is to translate the IR into whatever ISA we’re targeting (we’ll be targeting x86-64 in this class)
a secondary job is to apply ISA-specific optimizations
the particular ISA-specific optimization we’ll look at is register allocation
step 1: translate the IR into assembly assuming an infinite number of registers are available
- i’ll use a pseudo assembly to keep things simple
EXAMPLE using original source code and assuming each variable corresponds to its own register [see OneNote]
MOV #0, sum // move constant into register
MOV #5, total
MOV #2, x
L1: CMP sum, total // set condition flag
JLT L2 // jump if less-than
EXIT
L2: MOV #10, tmp
MUL x, tmp // tmp = x * tmp
ADD sum, tmp // tmp = sum + tmp
MOV tmp, sum // sum = tmp
JMP L1
step 2: register allocation
in reality no ISA has an infinite number of registers; we need to map an arbitrary number of variables into a fixed number of registers
no instruction will use more than two variables; the easy fix is to always use the following process so that we only ever need two registers:
- load the first variable from memory to a register
- load the second variable from memory to a register
- perform the operation
- store the result to memory
memory is really slow, this has a huge performance penalty; we want to keep things in registers as much as possible and go to memory as little as possible
register allocation gives us a smarter way to assign variables to registers so that we have to go to memory as few times as possible
runtime
a compiled program usually still needs runtime support to actually execute
C/C++: the runtime standard library provides things like malloc/free and new/delete, dealing with I/O, etc
Java, Javascript, etc: the runtime provides garbage collection, etc
this matters because the compiler needs to know about the runtime in order to emit the correct code
our language will be garbage collected, and so the compiler will need to emit code that gives the garbage collector the information it needs to do its job
- you’ll also be implementing the garbage collector itself
putting it all together
[source code] --LEXER--> [token stream] --PARSER--> [AST] --LOWERING--> [LIR] --CODEGEN--> [x86-64]
⇵ ⇵ ⇵
VALIDATION OPTIMIZATION REGISTER ALLOC.
executable machine code + RUNTIME == execution
what we’ll be doing
we will be covering most of these pieces, and you will be implementing them for your projects
- lexer
- parser
- validation
- lowering
- codegen
- garbage collection
we’ll leave out the following for time (and because we can get a working compiler and runtime without them)
- optimization
- register allocation
the language we’ll be implementing is a C-like language
- ints and arithmetic
- pointers, arrays, and dynamic memory allocation
- structs
- functions
- assignment, conditionals, and loops
note that these language features are sufficient to implement things like object-oriented and functional languages
- an object is a struct + function pointers
- so is a closure
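a tiny C++ illustration of this point; the names here are made up just to show the shape (captured state in a struct plus a function that takes the struct as an explicit “self”):
#include <cstdio>

struct Counter {
    int count;                            // the "fields" / captured state
    void (*increment)(Counter*);          // the "method" as a function pointer
};

void counter_increment(Counter* self) { self->count += 1; }

int main() {
    Counter c{0, counter_increment};
    c.increment(&c);                      // explicit `self`, like a method call
    c.increment(&c);
    std::printf("%d\n", c.count);         // prints 2
}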
things we’re leaving out
compiler implementations often have other goals in addition to just emitting executable code, like:
- fast compilation
- optimal instruction selection
- good error messages
- IDE support
- debugger support
- testing and/or verified compilation
we won’t discuss these, but they are important for real-world compilers
final thoughts
adding language features can make every part of the compiler more complicated; language designers have to be careful to balance useful language features with how complicated they make the language implementation
building a compiler is easy as long as (1) you control the language definition and can make it (and its syntax) simple and easy to handle; (2) you don’t care how efficient the compiler itself is; (3) you don’t care how efficient the resulting executable is; and (4) you don’t care about having good error messages
i have carefully designed the language we’re compiling to be mostly straightforward, with only a few small complications to give you a taste of complexity
of course, for real compilers we want all of these things, and building a real compiler for a language you don’t control is very difficult
introducing cflat
[show example programs] [see OneNote]
- see cflat-docs/examples/linked-list.cb
- see cflat-docs/examples/matrices.cb
[show grammar] [see OneNote]
- see cflat-docs/syntax.md
lexing
- [remind students what the lexer does; show example in OneNote]
how do we define the language tokens?
our first step is to define the tokens, i.e., the elements of our language
these come from our grammar: keywords, symbols, punctuation, constants, identifiers, etc
- [see grammar to point out tokens]
- [see OneNote for token list]
we specify what each kind of token looks like using regular expressions
- we’re using somewhat extended regular expression notation, but it maps easily to the standard expressions you learned in 138
r+ ⟶ rr*
r? ⟶ r | ϵ
[abc] ⟶ a | b | c
[a-z] ⟶ a | b | ... | z
[^ab] ⟶ Σ - {a, b}
- we also need to specify what whitespace and comments look like (we cheated a bit in the grammar by using ‘-’ in the c-comment regex; this is technically regular but can be very expensive to compute)
whitespace = (' ' | '\t' | '\n' | '\r')+
c++-comment = `//`[^'\n']*
-- notice we don't require '\n' at the end, which handles comments that go to
EOF without terminating
c-comment = `/*`([^`*`] | `*`⋆[^`*/`])⋆`*`+/
-- using ⋆ for kleene star and * for the literal asterisk character
skip = (whitespace | c++-comment | c-comment)*
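a quick, self-contained way to sanity-check these definitions is to write them as C++ std::regex patterns (this is just for experimenting, not how you’d implement the actual lexer):
#include <cassert>
#include <regex>
#include <string>

int main() {
    // same definitions as above, written in ECMAScript regex syntax
    std::regex whitespace(R"([ \t\n\r]+)");
    std::regex cpp_comment(R"(//[^\n]*)");
    std::regex c_comment(R"(/\*([^*]|\*+[^*/])*\*+/)");

    assert(std::regex_match("   \t\n", whitespace));
    assert(std::regex_match("// to end of line", cpp_comment));
    assert(std::regex_match("/* a C-style comment */", c_comment));
    assert(!std::regex_match("/* unterminated", c_comment));
    return 0;
}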
notice that some tokens should carry extra information (the lexer and parser don’t actually care about this extra info, but we’ll need it for the rest of the compiler)
if we have a Num, what were the digits? if we have an Id, what were the characters?
we can represent this information as indices in the input string (example coming shortly)
how do we extract tokens (aka ‘tokenize’)?
now we know what our tokens are and what they look like; the next step is to map the actual characters in the input stream to a sequence of tokens
- the sequence of characters that make up a token is called a lexeme
example: “foo = bar + 01”
lexeme token
------ -----
foo Id("foo") // Id(0,3)
= Gets
bar Id("bar") // Id(6,3)
+ Plus
01 Num("01") // Num(12,2)
we know how to recognize regular languages: convert a regular expression to an NFA, then simulate the NFA on some input
there are two ways we could go when implementing this idea:
(1) have explicit Regex and NFA data structures and implement Regex-->NFA conversion and NFA simulation in code
(2) manually figure out the NFA and hardcode its simulation using control flow (basically, a switch statement inside a for loop)
i’m going to explain things using (1) because it is the most general method and makes the underlying theory clear
- it’s also what you would use if you were implementing a lexer generator (where there is no fixed set of tokens) or if you were designing the language and the set of tokens may change over time
i recommend that you use (2) for your actual project when it is assigned because it is simpler and easier, assuming you have a fixed, predetermined set of tokens as we do for cflat
- i have asked the TAs to explain how that works in more detail during the next discussion section
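to give a flavor of approach (2) before the approach (1) example below, here’s a minimal hand-coded lexer in C++ (a switch inside a loop); the token set and names here are made up for illustration and are not the cflat tokens:
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

enum class Tok { Let, While, Id, Num, Plus, Star, Lt, Gets, Semi, LBrace, RBrace };

struct Token {
    Tok kind;
    std::size_t start, len;   // indices into the input, so we can recover the lexeme
};

std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }   // skip
        std::size_t start = i;
        if (std::isalpha(static_cast<unsigned char>(c))) {                    // Id or keyword
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            std::string word = src.substr(start, i - start);
            Tok kind = (word == "let")   ? Tok::Let
                     : (word == "while") ? Tok::While
                     :                     Tok::Id;   // keywords beat identifiers (priority)
            out.push_back({kind, start, i - start});
        } else if (std::isdigit(static_cast<unsigned char>(c))) {             // Num (maximal munch)
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({Tok::Num, start, i - start});
        } else {
            switch (c) {                                                       // one-character tokens
                case '+': out.push_back({Tok::Plus,   start, 1}); break;
                case '*': out.push_back({Tok::Star,   start, 1}); break;
                case '<': out.push_back({Tok::Lt,     start, 1}); break;
                case '=': out.push_back({Tok::Gets,   start, 1}); break;
                case ';': out.push_back({Tok::Semi,   start, 1}); break;
                case '{': out.push_back({Tok::LBrace, start, 1}); break;
                case '}': out.push_back({Tok::RBrace, start, 1}); break;
                default:  std::cerr << "unexpected character: " << c << "\n"; return out;
            }
            ++i;
        }
    }
    return out;
}

int main() {
    for (Token t : lex("while sum < total { sum = sum + x * 10; }"))
        std::cout << t.start << ":" << t.len << " ";
    std::cout << "\n";
}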
EXAMPLE [see OneNote]
RE
| EmptyLang
| EmptyString
| CharRange { lower: char, upper: char } // e.g., [a-z]
| Star { r: RE }
| Concat { first: RE, second: RE }
| Union { left: RE, right: RE }
// for regex: `/*`([^`*`] | `*`⋆[^`*/`])⋆`*`+/
re = RE::char_seq("/*").then(
RE::anything_but('*').or(
RE::char('*').star().then(RE::anything_but(['*','/'])
).star()
)
.then(RE::char('*').plus())
.then(RE::char('/')))
NFA
- states: set<State>
- transitions: map<State, vector<(option<RE::CharRange>, set<State>)>>
- init: State
- accepting: set<State>
nfa = re.convert_to_nfa()
nfa.accepts("/* this is a comment */");
but we don’t have a single regular expression and a single lexeme, we have a set of regular expressions corresponding to different lexemes and an input that contains a sequence of lexemes
tag each accepting state of each nfa with the token it maps to (or Skip if it maps to a comment or whitespace), then union all the nfas together into a single nfa
this handles the fact that we’re looking for multiple lexemes: when we get a match, look at the accepting state to see what token should be produced
keep walking through the input emitting tokens as we find them, starting at the position after the previous match, until we’ve processed the entire input
EXAMPLE/EXERCISE
suppose we have:
- TkAB(chars) = a+b
- TkBA(chars) = b+a
- skip = c+
[draw NFA]
process input "aabbaccabcbba"
token stream: [TkAB(0..2), TkBA(3..4), TkAB(7..8), TkBA(10..12)]
still a problem: ambiguity between the token regular expressions
consider the following examples:
- “ab” ⟶ [Id(ab)] vs [Id(a), Id(b)]
- “whilex” ⟶ [While, Id(x)] vs [Id(whilex)]
- “while” ⟶ [While] vs [Id(while)]
how do we figure out which token to emit if there are multiple possibilities?
solution 1: maximal munch
take the longest possible match
when lexing, record the last accepting state reached but keep on going until matching fails—then backtrack to the last accepted character
this handles the first two examples:
- “ab” ⟶ [Id(ab)]
- “whilex” ⟶ [Id(whilex)]
solution 2: priorities
maximal munch doesn’t help if both possible lexemes are the same size (like our third example, “while”)
we give priorities to the different accepting states, and if there are multiple accepting states then we pick the one with highest priority
for our language we’ll say that keywords have priority over identifiers; this handles the third example:
- “while” ⟶ [While]
EXAMPLE/EXERCISE
suppose we have:
- TkAB(chars) = a*b*
- TkBA(chars) = b*a*
- with priority(TkAB) > priority(TkBA)
[draw NFA]
process input "bbaabbbb"
token stream: [TkBA(0..3), TkAB(4..7)]
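a self-contained way to check this exercise in C++, using std::regex greedy matches as a stand-in for the NFA simulation: try every token’s regex at the current position, keep the longest match (maximal munch), and break ties by listing the rules in priority order (this is just for illustration, not the recommended project approach):
#include <cstddef>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

struct TokenRule { std::string name; std::regex re; };   // listed in priority order

int main() {
    std::vector<TokenRule> rules = {
        {"TkAB", std::regex("a*b*")},
        {"TkBA", std::regex("b*a*")},
        {"skip", std::regex("c+")},
    };
    std::string input = "bbaabbbb";

    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t best_len = 0;
        std::string best_name;
        for (const auto& rule : rules) {
            std::smatch m;
            // match_continuous anchors the match at `pos`
            if (std::regex_search(input.cbegin() + pos, input.cend(), m, rule.re,
                                  std::regex_constants::match_continuous)
                && m.length(0) > static_cast<std::ptrdiff_t>(best_len)) {
                best_len = m.length(0);            // strictly longer wins; on a tie we
                best_name = rule.name;             // keep the earlier (higher-priority) rule
            }
        }
        if (best_len == 0) { std::cerr << "lexing error at " << pos << "\n"; break; }
        if (best_name != "skip")
            std::cout << best_name << "(" << pos << ".." << pos + best_len - 1 << ")\n";
        pos += best_len;
    }
    // prints TkBA(0..3) then TkAB(4..7), matching the answer above
}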
final thoughts
lexing can be easy if we define our syntax carefully, but full of pitfalls if we’re careless
in general it’s good to reduce ambiguity as much as possible; which token a lexeme corresponds to shouldn’t depend on what context it’s used in
famous C++ example: vector<set<int>> is a syntax error because >> is read as a bitshift operator, so you need to write vector<set<int> >
- this was fixed in C++11, which special-cases >> when it closes nested template argument lists
why regular expressions, why not context-free grammars?
we can use CFGs (they can do anything REs can do), and there are examples where people have done so (see “scannerless parsers”)
in general lexing using regular expressions makes the parser a lot simpler and more efficient
it does also explain why you can’t nest c-style comments: proper nesting is context-free, you can’t describe it using regular expressions
parsing
overview
we describe the syntax of our language as a context-free grammar, with tokens as the terminal symbols of our alphabet
- [show cflat-docs/syntax.md again]
our task is to prove that the given source code is really a member of our language. given a CFG and a string, we can prove that the string is a member of the CFG’s language using a derivation tree (aka parse tree, concrete syntax tree)
grammar:
E ::= [1] Id | [2] E Op E | [3] ( E )
Op ::= [4] + | [5] *
string: "(x + y) * z"
tokens: [OpenParen, Id(x), Plus, Id(y), CloseParen, Mul, Id(z)]
-- for clarity we'll abbreviate the tokens using the corresponding characters
derivation tree [arrange as tree]:
E -[2]→ E Op E
-[3]→ ( E ) Op E
-[2]→ ( E Op E ) Op E
-[1]→ ( x Op E ) Op E
-[4]→ ( x + E ) Op E
-[1]→ ( x + y ) Op E
-[5]→ ( x + y ) * E
-[1]→ ( x + y ) * z
the internal nodes of the tree are nonterminals, with the root being the grammar start symbol; the leaves of the tree are the tokens (in order)
if we can create such a derivation tree, then the token stream represents a syntactically correct program
the derivation i went through in the example is called a leftmost derivation because i always expanded the leftmost leaf in the derivation tree
we could expand the leaves in any order and get the same tree, but leftmost is a natural order to use when we’re processing the input from the beginning to the end—we’re matching the beginning of the input before matching the rest of the input
this is the essence of parsing: proving that an input belongs to the language by constructing a derivation tree
as discussed previously we rarely actually construct the entire derivation tree explicitly, instead we construct an abstract syntax tree that is more useful for compilation
however, IDEs do construct the derivation tree so that they can do things like syntax highlighting, etc
ambiguity
using the same grammar, we can give a derivation tree for the input x + y + x * z + z; in fact, we can give several
the biggest problem we face in parsing is one we also had to deal with in lexing: ambiguity
- a grammar is ambiguous if there is at least one input in the language that has more than one parse tree
this is a problem because the parse tree determines how a program will be executed
consider 1 + 2 * 3 and the difference between (1 + 2) * 3 and 1 + (2 * 3) (draw as trees)
the syntactic structure of the program influences its meaning, the same as in english
consider “she saw the man with a telescope”; what does it mean?
arithmetic expressions are an easy example, but it can come up in lots of places—a famous example is “dangling elses”
stmt ::= `if` exp `then` stmt `else` stmt
| `if` exp `then` stmt
| x `=` exp
// code
a = false;
c = 1;
if a then
if b then c = 2;
else c = 3;
what is the value of c?
remember that whitespace is not significant, we skip it during lexing
[show two parse trees, one where the else is attached to the first if and one where it is attached to the second]
so we want to avoid grammars that are ambiguous
unfortunately, determining whether a grammar is ambiguous is undecidable
but we can determine whether a grammar is deterministic
deterministic implies unambiguous, though unambiguous does not imply deterministic
recall that a PDA is an NFA + a stack, a DPDA is a DFA + a stack, and DPDA are strictly less expressive than PDA (unlike regular languages where DFA and NFA are equally expressive)
a PDA is deterministic if, from each state, the current input and the current top of the stack allows for at most one transition
PDAs and CFGs are equivalent (we can transform each into the other)
since DPDA are less expressive than PDA, there must be CFG that cannot be expressed as DPDA; these include all ambiguous grammars (and also some non-ambiguous grammars)
a grammar that can be expressed with a DPDA is a deterministic grammar, and there is an algorithm that decides whether this is true
it’s important to note that ambiguity is a property of a grammar, not a formal language
the same formal language can have multiple grammars, some of which are ambiguous and some of which are not
there exist formal languages that only have ambiguous grammars, but they don’t arise for realistic programming languages
this gives us a solution: if the original grammar for our programming language syntax is not deterministic, transform it into one that is (while still describing the same syntax)
this isn’t always possible, but again for realistic programming languages it tends to be
if it isn’t, then we have to change our programming language syntax until it is
example (dangling elses again)
stmt ::= `if` exp `then` withelse `else` stmt
| `if` exp `then` stmt
| x `=` exp
withelse ::= `if` exp `then` withelse `else` withelse
| x `=` exp
this version of the grammar recognizes exactly the same set of inputs, but only allows for a single parse tree
- it requires any else to be associated with the “nearest” if
of course, we can always just change our programming language to make it obviously deterministic
- require that expressions use parentheses everywhere
- require that braces be used everywhere
- etc
this makes implementing the compiler easy, but can make programmers frustrated and annoyed
it’s often true that we can either make things convenient/nice for the programmer but complicated for the compiler implementor or vice-versa
usually we should default to making things nice for the programmer even if it makes our lives difficult as compiler implementors
parsing strategies
we know what the goal of parsing is, but how do we do it? there are actually a large number of strategies with different tradeoffs in complexity, expressiveness, performance, ease of use, etc
let’s take the same example we used earlier
grammar:
E ::= [1] Id | [2] E Op E | [3] ( E )
Op ::= [4] + | [5] *
string: "(x + y) * z"
we can divide all of the strategies into two basic categories: top-down and bottom-up
TOP-DOWN: start from the root of the derivation tree, i.e., the grammar start symbol (E in this case) and work our way down to the leaves, selecting productions for each node that will result in the leaves matching the input string
this seems like it involves a lot of guess-work, but clever versions of the approach can be very efficient
- [show top-down approach for example]
BOTTOM-UP: start from the leaves of the derivation tree, i.e., the input, and work our way up the tree selecting productions in reverse
- [show bottom-up approach for example]
besides top-down vs bottom-up, parsing strategies can also be divided by what class of grammars they work for
there are strategies that work for any grammar, including ambiguous ones (e.g., CYK, Earley, GLL, GLR, etc); they are O(n³) and, for ambiguous inputs, result in a parse forest instead of a parse tree
there are strategies that work for any deterministic grammar (e.g., LR, a bottom-up strategy); these are linear but complex to implement (usually we resort to automatically generating an LR parser instead of coding it by hand)
there are strategies that work for a subset of deterministic grammars (e.g., LL(k), a top-down strategy); these are also linear and, even though they are less expressive, they have other nice properties
LL(k) parsers can be handwritten, giving very fine control over parsing and allowing for easier error recovery and error messages
a popular choice for mainstream compilers, including gcc and clang
this is what we’ll be implementing for our language
recursive descent and LL(1)
intro
LL(1) is an example of a top-down parser, and a common strategy for these types of parsers (which we’ll be using) is recursive descent
- there are other strategies, but recursive descent is the most popular
we’ll go over a naive recursive descent strategy first, then show how an LL(1) grammar can make it much more efficient
to illustrate the ideas we’ll use the following simple grammar
S ::= aSa | bSb | c
naive recursive descent
from 138 we know that we can transform any CFG into a PDA
- there are several possible transformations, but we’ll use the one that yields a top-down strategy
q0 -[ϵ, ϵ⟶S$]→ q1 -[ϵ, $⟶ϵ]→ q2 (accepting)
⇵
[ϵ, S⟶aSa]
[ϵ, S⟶bSb]
[ϵ, S⟶c]
[a, a⟶ϵ]
[b, b⟶ϵ]
[c, c⟶ϵ]
this is not a DPDA because there are 3 possible transitions if S is on top of the stack
- for now we’ll assume we have some oracle that lets us always pick the right transition to use
let’s see what happens with input abcba; pay particular attention to the stack
ϵ ⟶ S$ ⟶ aSa$ ⟶ ̸aSa$ ⟶ ̸abSba$ ⟶ ̸a̸bSba$ ⟶ ̸a̸bcba$ ⟶ ̸a̸b̸cba$ ⟶ ̸a̸b̸c̸ba$
⟶ ̸a̸b̸c̸b̸a$ ⟶ ̸a̸b̸c̸b̸a̸$
the stack is going through a top-down derivation for the input
when the top of the stack is S we expand it by pushing a production
when the top of the stack is a terminal we match it against the input
starting from S we work our way to the entire input string, and because of how the stack works we’re getting a leftmost derivation
this approach gives us what we want, but relies on having an oracle; what if we don’t have one?
then we need to guess, and if we guess wrong then we need to backtrack, i.e., if the PDA rejects we need to rollback to the last place we guessed and guess something different
this can be exponential if we always guess wrong at first; we’ll see soon how an LL(1) grammar can prevent this problem from happening
how would we implement this idea in code?
the PDA stack is what’s keeping track of the derivation
we don’t need to actually translate the grammar to a PDA, we can use the computer’s implicit function stack [remind them how a function stack works]
in other words, we use recursion
here’s how it works for our example in pseudocode [see OneNote]
S ::= aSa | bSb | c
fn S() {
saved_input = curr_input;
try { // aSa
curr_input.consume('a');
S();
curr_input.consume('a');
}
else try { // bSb
curr_input = saved_input;
curr_input.consume('b');
S();
curr_input.consume('b');
}
else try { // c
curr_input = saved_input;
curr_input.consume('c')
}
else {
curr_input = saved_input;
fail
}
}
[show what happens for input abcba]
notice that the recursive function stack is mimicking what happened with the PDA stack—it’s tracking the derivation
- when we make a recursive call we save our current place in the code, process the call, then resume from where we left off—e.g., the stack lets us remember that after we match an a and then try to match S(), we still need to match another a
here’s the general idea for any grammar (a runnable C++ sketch of this recipe for the example grammar follows below)
create a set of mutually recursive functions, one per nonterminal
each function A() will have a case for each rule A ::= α in the grammar
when A() is called it tries each case in turn until it succeeds or runs out of cases and fails
suppose we have rules A ::= α₁α₂...αₙ | β₁β₂...βₘ, where each αᵢ and βᵢ may be either a terminal (i.e., a character) or a nonterminal
then we create a function A() that tries the first rule, then if it doesn’t work it tries the second rule, then if that doesn’t work it fails
for each rule we look at αᵢ (or βᵢ depending on which rule we’re trying) and:
- if αᵢ is a terminal, try to consume matching characters from the input; if successful go to αᵢ₊₁, else fail this rule
- if αᵢ is a nonterminal, call the corresponding function; if the function returns successfully then go to αᵢ₊₁, else fail this rule
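here’s what that recipe looks like as actual C++ for S ::= aSa | bSb | c; this is a minimal sketch (the Parser struct and helper names are made up, not project code), and like the pseudocode above it only does the simple local backtracking this grammar needs:
#include <cstddef>
#include <iostream>
#include <string>

struct Parser {
    std::string input;
    std::size_t pos = 0;

    // try to consume one expected character; report success/failure
    bool consume(char c) {
        if (pos < input.size() && input[pos] == c) { ++pos; return true; }
        return false;
    }

    // S ::= aSa | bSb | c, trying each rule in turn and resetting on failure
    bool S() {
        std::size_t saved = pos;
        if (consume('a') && S() && consume('a')) return true;   // aSa
        pos = saved;
        if (consume('b') && S() && consume('b')) return true;   // bSb
        pos = saved;
        if (consume('c')) return true;                           // c
        pos = saved;
        return false;
    }
};

int main() {
    Parser p{"abcba"};
    bool ok = p.S() && p.pos == p.input.size();   // must also consume all input
    std::cout << (ok ? "accepted" : "rejected") << "\n";
}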
LL(1) recursive descent
backtracking kills performance, and for some grammars recursion can result in non-termination—how can we make this strategy efficient?
- that’s where an LL(1) grammar comes in
consider the PDA we created for our example grammar and note that we could make it a DPDA by adding lookahead
q0 -[ϵ, ϵ⟶S$]→ q1 -[ϵ, $⟶ϵ]→ q2 (accepting)
|↑
[a, ϵ⟶ϵ] || [ϵ, a⟶ϵ]
↓|
qA
⇵
[ϵ, S⟶aSa]
[repeat construction for qB, qC]
let’s see what happens with input abcba [same as before, except not relying on an oracle]
it’s completely deterministic, no guessing involved
what does that mean for our recursive descent algorithm? it means no backtracking is necessary [see OneNote]
fn S() {
if next token is 'a' { // aSa
input.consume('a');
S();
input.consume('a');
}
else if next token is 'b' { // bSb
input.consume('b');
S();
input.consume('b');
}
else if next token is 'c' { // c
input.consume('c');
}
else fail
}
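the same grammar as real C++, now with one token of lookahead and no backtracking (again a sketch with made-up helper names, not project code):
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

struct Parser {
    std::string input;
    std::size_t pos = 0;

    char peek() const { return pos < input.size() ? input[pos] : '\0'; }

    void consume(char c) {
        if (peek() != c) throw std::runtime_error("parse error");
        ++pos;
    }

    // S ::= aSa | bSb | c
    void S() {
        switch (peek()) {                // one token of lookahead picks the rule
            case 'a': consume('a'); S(); consume('a'); break;
            case 'b': consume('b'); S(); consume('b'); break;
            case 'c': consume('c'); break;
            default:  throw std::runtime_error("parse error");
        }
    }
};

int main() {
    Parser p{"abcba"};
    try {
        p.S();
        if (p.pos != p.input.size()) throw std::runtime_error("trailing input");
        std::cout << "accepted\n";
    } catch (const std::runtime_error&) {
        std::cout << "rejected\n";
    }
}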
there is a special case we need to worry about: what if a nonterminal has an ϵ production?
- just treat it as the default case, i.e., the thing to do if none of the other cases are true
new example grammar
A ::= xy | yBz
B ::= wy | ϵ
- code
fn A() {
if next token is 'x' {
input.consume('x');
input.consume('y');
}
else if next token is 'y' {
input.consume('y');
B();
input.consume('z');
}
else fail
}
fn B() {
if next token is 'w' {
input.consume('w');
input.consume('y');
} else {
// don't consume anything, but still return success
}
}
the key is that by looking at the next token we always know exactly what rule we should be trying to match (and if the next token isn’t one we expect then we return an error)
a grammar s.t. we can use k tokens of lookahead to make the recursive descent algorithm completely deterministic is an LL(k) grammar
this isn’t true of all grammars, or even all deterministic grammars, but we can often make it work for real programming languages
for us, k = 1 (and for most real programming languages)
why “LL(k)”?
- we’re processing the input from Left to right
- we’re tracking a Leftmost derivation
- we’re using k tokens of lookahead
exercise
- given the following grammar
S ::= aPb | Qc | cRd | TcP
P ::= QR | TR | ε
Q ::= fR | b
R ::= d | gbc
T ::= ea | Ra
write pseudocode for a recursive descent LL(1) parser; remember:
for each nonterminal there is a function
the function contains a case for each production rule for that nonterminal
we use one token of look-ahead to determine which case to execute
trace the parser for inputs:
- afgbcdb [PASS]
- daceagbc [PASS]
SOLUTION
fn S() {
if next token is a {
input.consume('a');
P();
input.consume('b');
}
else if next token is in {f, b} {
Q();
input.consume('c');
}
else if next token is c {
input.consume('c');
R();
input.consume('d');
}
else if next token is in {e, d, g} {
T();
input.consume('c');
P();
}
else fail
}
fn P() {
if next token is in {f, b} {
Q();
R();
}
else if next token is in {e, d, g} {
T();
R();
}
}
fn Q() {
if next token is f {
input.consume('f');
R();
}
else if next token is b {
input.consume('b');
}
else fail
}
fn R() {
if next token is d {
input.consume('d');
}
else if next token is g {
input.consume('gbc');
}
else fail
}
fn T() {
if next token is e {
input.consume('ea');
}
else if next token is in {d, g} {
R();
input.consume('a');
}
else fail
}
formalizing LL(1)
intro
how do we know if a grammar is LL(1) and thus suitable for implementing in this way? there are two criteria:
no left recursion
deterministic look-ahead (using 1 token)
if the grammar meets these criteria, it is LL(1)
left recursion
a grammar is left recursive if there exists a derivation A -->* Aα for some nonterminal A
this means that when matching A, we recursively need to match A again without consuming any input
if we’re using recursion, this means that we’re caught in an infinite recursive cycle
example
S ::= Sa | b
fn S() {
  saved_input = curr_input;
  try { // Sa
    S();                    // recurse before consuming any input...
    curr_input.consume('a');
  }
  else try { // b
    curr_input = saved_input;
    curr_input.consume('b');
  }
}
input: "baa": trying the rule Sa calls S() again before consuming anything, which tries Sa again, and so on forever
- in order to allow a recursive descent parsing strategy, we have to forbid left-recursive grammars
deterministic look-ahead
our intuition is that for any given nonterminal A that has multiple rules, looking at the next token in the input is sufficient to determine which rule we need to pick
- we call a grammar for which this is true a predictive grammar
our simple example from earlier:
S ::= aSa | bSb | c
- if we have an S symbol and need to expand it with one of these rules, by looking at the next token in the input we can tell which rule to use
in general it may be a bit more complicated to figure out the symbols to use; take the grammar from the previous exercise, where some rules (e.g., for P) themselves start with nonterminals
S ::= aPb | Qc | cRd | TcP
P ::= QR | TR | ε
Q ::= fR | b
R ::= d | gbc
T ::= ea | Ra
- we can formalize this intuition; first we’ll do so assuming there are no ϵ rules, then we’ll add in ϵ rules (which make things a bit more complicated)
look-ahead without ϵ
- let α, β be strings of terminals (T) and nonterminals (N), then:
FIRST(α) = { t ∈ T | α -->* tβ }
in other words, FIRST(α) is the set of terminals that can be derived from α with 0 or more applications of the grammar production rules
- there is an algorithm to compute FIRST sets (given in the textbook), but for many grammars it’s easy enough to compute manually by inspection
example
S ::= AB
A ::= xBw | yBz | Bwz
B ::= 0 | 1
FIRST(S) = { x, y, 0, 1 }
FIRST(A) = { x, y, 0, 1 }
FIRST(B) = { 0, 1 }
FIRST(xBw) = { x }
FIRST(yBz) = { y }
FIRST(Bwz) = { 0, 1 }
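the FIRST computation is a simple fixpoint; here’s a C++ sketch for grammars without ϵ rules (the representation and names here are my own choices, not the textbook’s), run on the example grammar above:
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using Symbol = std::string;                                  // terminals and nonterminals
using Production = std::vector<Symbol>;                      // a right-hand side
using Grammar = std::map<Symbol, std::vector<Production>>;   // nonterminal -> its rules

std::map<Symbol, std::set<char>> first_sets(const Grammar& g) {
    std::map<Symbol, std::set<char>> first;
    bool changed = true;
    while (changed) {                            // iterate until nothing changes
        changed = false;
        for (const auto& [nt, rules] : g) {
            for (const auto& rhs : rules) {
                const Symbol& x = rhs.front();   // no ϵ rules, so rhs is nonempty
                std::set<char> add;
                if (!g.count(x)) add.insert(x[0]);   // x is a terminal
                else add = first[x];                 // x is a nonterminal
                for (char t : add)
                    changed |= first[nt].insert(t).second;
            }
        }
    }
    return first;
}

int main() {
    // S ::= AB, A ::= xBw | yBz | Bwz, B ::= 0 | 1
    Grammar g = {
        {"S", {{"A", "B"}}},
        {"A", {{"x", "B", "w"}, {"y", "B", "z"}, {"B", "w", "z"}}},
        {"B", {{"0"}, {"1"}}},
    };
    for (const auto& [nt, f] : first_sets(g)) {
        std::cout << "FIRST(" << nt << ") = { ";
        for (char t : f) std::cout << t << ' ';
        std::cout << "}\n";
    }
}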
if a grammar is LL(1), then for any nonterminal A ::= α | β it must be true that FIRST(α) is disjoint from FIRST(β)
- thus, we just need to see if the next token belongs to FIRST(α) or FIRST(β) to determine which rule to use
look-ahead with ϵ
- example
A ::= xBA | f
B ::= xwB | ϵ
FIRST(A) = { x, f }
FIRST(B) = { x }
FIRST(xBA) = { x }
FIRST(f) = { f }
FIRST(xwB) = { x }
FIRST(ϵ) = {}
at first it seems things are OK: for both A and B the FIRST sets of the production rules are disjoint—but consider the input xxf
A is the start symbol, so we begin matching there
- look-ahead = x, so we use the first rule xBA and consume x
- now we’re matching BA
- look-ahead = x, so we use the first rule xwB and consume x
- now we’re matching wBA
- ERROR: next input symbol is f, which doesn’t match w
what happened? the correct way to parse input xxf would be:
- expand A to xBA
- consume x
- expand B to ϵ
- expand A to xBA
- consume x
- expand B to ϵ
- expand A to f
- consume f
but we can’t figure this out just looking at the next token because the ϵ allows us to “throw away” a nonterminal at any time (by picking the ϵ rule when expanding it)
- therefore looking at the next token doesn’t always tell us what the correct thing to do is
enter FOLLOW sets: for any nonterminal A, FOLLOW(A) is the set of terminals that can immediately follow any expansion of A
that is, for all production rules αAβ, FIRST(β) is included in FOLLOW(A); and if A can end a rule B ::= αA (or everything after A can derive ϵ), then FOLLOW(B) is included in FOLLOW(A)
note that FOLLOW is only for nonterminals, whereas FIRST was for any string of terminals or nonterminals
there is another algorithm for computing FOLLOW sets (in the textbook), but again it’s often easy enough to manually compute it by inspection
A ::= xBA | f
B ::= xwB | ϵ
FOLLOW(A) = {}
FOLLOW(B) = FIRST(A) = { x, f }
if a grammar is LL(1), then for any nonterminal A ::= α | β it must be true that if α -->* ϵ, then FIRST(β) and FOLLOW(A) are disjoint
in other words, by looking at the next token we can decide whether we should expand A or throw it away using ϵ
when matching A, if the next token is in FIRST(α) expand to α, else if the next token is in FIRST(β) expand to β, else assume we’re using the ϵ and match without consuming any input
this is exactly the strategy we used when making our recursive descent parser—this property of LL(1) grammars is why it’s guaranteed to work correctly
based on this requirement, the example grammar is not LL(1) because FIRST(xwB) ∩ FOLLOW(B) = { x }, so they are not disjoint
to sum up:
FIRST sets tell us which production rule to use based on the look-ahead, and for an LL(1) grammar this is unambiguous
FOLLOW sets tell us whether using the FIRST sets this way will actually work in the presence of ϵ rules, and for an LL(1) grammar it must be true that they will
exercise
- compute the FIRST and FOLLOW sets of the following grammar and explain all the reasons why it is not predictive
A ::= xBy | Bx | zCw
B ::= wB | ε
C ::= wC | DB
D ::= yD | ε
- solution
    FIRST    FOLLOW
--- -------- --------
A   w,x,z    ∅
B   w        w,x,y
C   w,y      w
D   y        w
(w ∈ FOLLOW(B) because B ends the rule C ::= DB, so FOLLOW(C) ⊆ FOLLOW(B))
for A: FIRST(xBy) ∩ FIRST(Bx) = {x}
for B: B -->* ε and FIRST(wB) ∩ FOLLOW(B) = {w}
for C: FIRST(wC) ∩ FIRST(DB) = {w}
for C: DB -->* ε and FIRST(wC) ∩ FOLLOW(C) = {w}
what about regular expressions in the production rules?
remember that the regular expressions are just convenient short-hand; expanding them back to “standard” CFGs helps understand how to deal with them
if the grammar is LL(1), then we can always determine how to handle them based on the look-ahead
dealing with ?: remember that r? is just r | ϵ
- in the following example, if the grammar is LL(1) then α₂ and α₃ must begin with different terminals and so we can immediately tell whether to consume α₂ or α₃
// using ?
A ::= α₁ α₂? α₃
// we don't have to actually translate to this version, but it makes
// clear how to process the ?
A ::= α₁ B α₃
B ::= α₂ | ϵ
dealing with *: this means that the expression can happen any number of times (including 0)
in the example below, if the grammar is LL(1) then α₂ and α₃ must begin with different terminals and so we can immediately tell whether to consume α₂ or α₃—then we keep asking that question in a loop until the answer is to consume α₃
note that we had to translate the * using right recursion since LL(1) grammars can’t have left recursion, which automatically makes the parse right-associative; this is a problem if we actually want it to be left-associative, because then we’d have to parse and then modify the AST afterwards
it’s easiest to not translate the grammar and just parse using the * operator directly per the below bullet
if we’re directly using the * in our parser instead of translating then we just use a loop to get a vector of α₂, then make it left- or right-associative when we put it into the AST depending on what we have specified for the language (there’s a small sketch of the left-associative version after the + bullet below)
// using *
A ::= α₁ α₂* α₃
// we don't have to actually translate to this version, but it makes
// clear how to process the *
A ::= α₁ B α₃
B ::= α₂B | ϵ
- dealing with +: remember that r+ is just rr*, so we can deal with it just like r* except we require that there’s at least one α₂
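a tiny self-contained sketch of the “loop, then fold” idea for something like E ::= id (+ id)*, building a left-associative result; everything here is made up for the demo, and it just prints the shape of the AST:
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string input = "x+y+z";
    std::size_t pos = 0;

    // collect the operands with a loop (this is the `*` in the grammar)
    std::vector<char> ids;
    ids.push_back(input[pos++]);          // first id
    while (pos < input.size() && input[pos] == '+') {
        ++pos;                            // consume '+'
        ids.push_back(input[pos++]);      // next id
    }

    // fold left-associatively: ((x + y) + z)
    std::string ast(1, ids[0]);
    for (std::size_t i = 1; i < ids.size(); ++i)
        ast = "(Add " + ast + " " + std::string(1, ids[i]) + ")";

    std::cout << ast << "\n";             // prints (Add (Add x y) z)
}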
building the AST
recall that the derivation tree (aka parse tree, concrete syntax tree) contains more information than we really need after parsing
e.g., punctuation, parentheses, braces, etc, as well as the exact sequence of expanded nonterminals used to build the tree
it shows how the program was parsed, but all we need after parsing is the underlying structure of the program
this underlying structure is called the abstract syntax tree (AST)
example grammar (arithmetic expressions with precedence to enforce LL(1))
E ::= FX
X ::= + FX | ε
F ::= GY
Y ::= * GY | ε
G ::= (E) | id
- concrete syntax tree for x + y * z (draw as tree)
E ⟶ [FX]
⟶ [GY]X
⟶ [id(x)]YX
⟶ id(x)[ϵ]X
⟶ id(x)[+FX]
⟶ id(x)+[GY]X
⟶ id(x)+[id(y)]YX
⟶ id(x)+id(y)[*GY]X
⟶ id(x)+id(y)*[id(z)]YX
⟶ id(x)+id(y)*id(z)[ϵ]X
⟶ id(x)+id(y)*id(z)[ϵ]
- abstract syntax tree (draw as tree)
(Add
Id(x)
(Mul
Id(y)
Id(z)
)
)
it’s called the abstract syntax tree because it abstracts out the unimportant (now that parsing is over) information from the concrete parse tree
we often directly compute the AST during parsing rather than creating a concrete parse tree and then abstracting it
we do this by inserting the AST construction logic directly into the functions doing the parsing
instead of functions just consuming the input and returning nothing, they’ll consume the input and return AST sub-trees
when a function calls another function, it takes the returned AST sub-tree and adds it to the AST tree it’s creating
so we need to define the AST data structure, then insert the appropriate logic into our parsing functions
example: for the grammar above, we could use the following data structure as the AST
AST
| Id { name: string }
| Add { left: AST, right: AST }
| Mul { left: AST, right: AST }
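one possible way to write that AST in C++, sketched with std::variant and unique_ptr (this is just one option; the project can represent the AST however you like):
#include <memory>
#include <string>
#include <variant>

struct AST;
using ASTPtr = std::unique_ptr<AST>;

struct Id  { std::string name; };
struct Add { ASTPtr left, right; };
struct Mul { ASTPtr left, right; };

struct AST {
    std::variant<Id, Add, Mul> node;
};

// helper constructors keep the parser code readable
ASTPtr make_id(std::string name) {
    return std::make_unique<AST>(AST{Id{std::move(name)}});
}
ASTPtr make_add(ASTPtr l, ASTPtr r) {
    return std::make_unique<AST>(AST{Add{std::move(l), std::move(r)}});
}
ASTPtr make_mul(ASTPtr l, ASTPtr r) {
    return std::make_unique<AST>(AST{Mul{std::move(l), std::move(r)}});
}

int main() {
    // the AST for x + y * z from the example above
    ASTPtr tree = make_add(make_id("x"), make_mul(make_id("y"), make_id("z")));
    (void)tree;
}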
- the recursive descent parser for the grammar [see OneNote]
fn E() {
F();
X();
}
fn X() {
if next token is `+` {
input.consume('+');
F();
X();
}
}
fn F() {
G();
Y();
}
fn Y() {
if next token is '*' {
input.consume('*');
G();
Y();
}
}
fn G() {
if next token is `(` {
input.consume('(');
E();
input.consume(')');
}
else if next token is id(name) {
input.consume(id);
}
else fail
}
- modified to produce an AST [see OneNote]
fn E() -> AST {
a = F();
b = X();
if b is nil { return a; }
else { return AST::Add(a, b); }
}
fn X() -> AST {
if next token is `+` {
input.consume('+');
a = F();
b = X();
if b is nil { return a; }
else { return AST::Add(a, b); }
}
return nil;
}
fn F() -> AST {
a = G();
b = Y();
if b is nil { return a; }
else { return AST::Mul(a, b); }
}
fn Y() -> AST {
if next token is '*' {
input.consume('*');
a = G();
b = Y();
if b is nil { return a; }
else { return AST::Mul(a, b); }
}
return nil;
}
fn G() -> AST {
if next token is `(` {
input.consume('(');
ast = E();
input.consume(')');
return ast;
}
else if next token is id(name) {
input.consume(id);
return AST::Id(name);
}
else fail
}
- example (or exercise): given the grammar and parser below, modify the parser to return an AST as also defined below and run it on
payawazaxa
[see OneNote]
ambiguous grammar (think of w, x, y, z as binary operators, p as a unary operator, and a as a constant):
A ::= AyB | AzB | B
B ::= BwC | BxC | C
C ::= pC | a
LL(1) grammar:
A ::= BX
X ::= yBX | zBX | ε
B ::= CY
Y ::= wCY | xCY | ε
C ::= pC | a
AST
| a
| w { op1: AST, op2: AST }
| x { op1: AST, op2: AST }
| y { op1: AST, op2: AST }
| z { op1: AST, op2: AST }
| p { op: AST }
recursive descent parser:
fn A() {
B();
X();
}
fn X() {
if next token is 'y' or 'z' {
if next token is 'y' { input.consume('y'); }
else { input.consume('z'); }
B();
X();
}
}
fn B() {
C();
Y();
}
fn Y() {
if next token is 'w' or 'x' {
if next token is 'w' { input.consume('w'); }
else { input.consume('x'); }
C();
Y();
}
}
fn C() {
if next token is 'p' {
input.consume('p');
C();
}
else if next token is 'a' {
input.consume('a');
}
else fail
}
modified parser (SOLUTION)
fn A() -> AST {
a = B();
b = X();
if X matched 'y' { return AST::y(a, b); }
else if X matched 'z' { return AST::z(a, b); }
else { return a; }
}
fn X() -> AST {
if next token is 'y' or 'z' {
if next token is 'y' { input.consume('y'); }
else { input.consume('z'); }
a = B();
b = X();
if X matched 'y' { return AST::y(a, b); }
else if X matched 'z' { return AST::z(a, b); }
else { return a; }
}
return nil;
}
fn B() -> AST {
a = C();
b = Y();
if Y matched 'w' { return AST::w(a, b); }
else if Y matched 'x' { return AST::x(a, b); }
else { return a; }
}
fn Y() -> AST {
if next token is 'w' or 'x' {
if next token is 'w' { input.consume('w'); }
else { input.consume('x'); }
a = C();
b = Y();
if Y matched 'w' { return AST::w(a, b); }
else if Y matched 'x' { return AST::x(a, b); }
else { return a; }
}
return nil;
}
fn C() -> AST {
if next token is 'p' {
input.consume('p');
return AST::p(C());
}
else if next token is 'a' {
input.consume('a');
return AST::a;
}
else fail
}
AST for payawazaxa
y(
p(a),
z(
w(a, a),
x(a, a)
)
)
what is the AST for cflat? [see OneNote]
- see cflat-docs/ast.md
transforming a grammar to LL(1)
intro
how do we take a grammar that isn’t LL(1) and turn it into a grammar that is LL(1)?
- for example, our cflat grammar is not LL(1) and so we cannot use it for a recursive descent LL(1) parser as-is
there are three main things we need to worry about
precedence: when there are multiple ways to parse a given string, we need to refactor the grammar to enforce a single possible parse
left-recursion: a grammar that has nonterminals in the “wrong” places can prevent recursive descent parsers from ever terminating, so we need to refactor the grammar to make sure this doesn’t happen
look-ahead: we need to make sure that the grammar is deterministic using a given amount of look-ahead
establishing precedence
if there are multiple ways to parse a given string, we need to force the grammar to allow only a single parse
we do this by establishing precedence: we enforce that one parse is preferable and should always be used if possible
example ambiguous grammar
E ::= id | E + E | E - E | E * E | E / E | (E)
conventionally multiplication should have precedence over addition
- given a + b * c, we should treat it as a + (b * c)
there are several strategies that have been developed to enforce precedence in a grammar, but we’ll use the classical solution that modifies the grammar itself to enforce the desired precedence
method:
decide on the levels of precedence, e.g., {`()`} > {`*`, `/`} > {`+`, `-`}
create a nonterminal for each level of precedence (we can reuse the original nonterminal for the lowest precedence level)
factor out the operations into the appropriate nonterminal for their level of precedence
example:
G = {`()`}
F = {`*`,`/`}
E = {`+`,`-`}
E ::= E + F | E - F | F
F ::= F * G | F / G | G
G ::= (E) | id
notice the following properties:
each nonterminal rule that applies a binary operator has one operand that is the same nonterminal again and the other is the next highest precedence level nonterminal
this allows the expression to have multiple of the same precedence level operators in a row
choosing which side is which controls the associativity of the operator: having the same nonterminal on the left makes the operator left associative; having it on the right makes the operator right associative
each nonterminal (except the highest precedence, `G`) allows for applying an operator at that level of precedence or directly falling through to the next level
- this allows the expression to not have any operators at the lower precedence level, e.g., an expression that doesn’t have `+` in it
the base cases for expressions (e.g., `id`) are always at the highest level of precedence
- this allows the expression to just be an identifier without any operators at all
examples
x + y * z ==> x + (y * z)
x * y + z ==> (x * y) + z
x * y * z ==> (x * y) * z
(x + y) * z ==> (x + y) * z
we’ve looked at this using arithmetic operators as our examples, but it applies in many other situations, e.g.:
relational and logical operators:
!(x < y) && z < y
subscript operators:
a + b[i]
type casts:
(double)a / b
exercise: rewrite the following grammar assuming that all operators are left-associative and using the following precedence levels:
- {`p`} > {`w`, `x`} > {`y`, `z`}
A ::= A R A | p A | a
R ::= w | x | y | z
- solution
A ::= A y B | A z B | B
B ::= B w C | B x C | C
C ::= p C | a
removing left-recursion
intro
we saw before that left recursion causes non-termination in a recursive descent parser; how can we remove it?
we can define two kinds of left recursion
direct: there is a production of the form
A ::= Aα
indirect: there is a set of mutually recursive productions that allow a left-recursive derivation (possibly involving an ϵ rule)
example 1: indirect left recursion
A ::= B | alice
B ::= C | bob
C ::= A charlie
consider A --> B --> C --> A charlie
- example 2: indirect left recursion with ϵ
A ::= B | alice
B ::= C | bob
C ::= DA charlie
D ::= dave | ε
consider A --> B --> C --> DA charlie --> A charlie
removing direct left recursion
this is easy to fix: any left-recursive production can be changed to an equivalent right-recursive production as follows
given:
A ::= Aα | β | γ
s.t. α, β, γ don’t start with `A`
- this grammar says that there can be β or γ then a sequence of αs
transformed:
A ::= βB | γB
B ::= αB | ϵ
- this grammar says the same thing, but using right-recursion
any left-recursive rule must have some non-recursive base case (β and γ in the above example), otherwise the recursion would never terminate
- we just rearrange those base cases to use right-recursion instead of left-recursion
what if there are multiple left-recursive rules for `A`?
there must still be at least one base case that we can use
example:
A ::= Aα | Aβ | γ
transformed:
A ::= γB
B ::= αB | βB | ϵ
example
E ::= E+F | E-F | F
F ::= (E) | id
- transformed
E ::= FG
F ::= (E) | id
G ::= +FG | -FG | ϵ
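as a quick implementation-level sanity check, here is a sketch (hypothetical `Parser` helper with single-character tokens, not assignment code) contrasting the two grammars: translating the left-recursive `E ::= E+F | ...` rule naively gives a function that calls itself before consuming any input, while the transformed grammar only recurses after consuming a token:

```rust
struct Parser { toks: Vec<char>, pos: usize }

impl Parser {
    fn peek(&self) -> Option<char> { self.toks.get(self.pos).copied() }

    fn consume(&mut self, c: char) {
        assert_eq!(self.peek(), Some(c), "parse error");
        self.pos += 1;
    }

    // naive translation of the left-recursive rule E ::= E+F | ...:
    // it calls itself before consuming a single token, so it never terminates
    #[allow(dead_code, unconditional_recursion)]
    fn e_bad(&mut self) {
        self.e_bad();
        self.consume('+');
        self.f();
    }

    // transformed grammar: E ::= FG, G ::= +FG | -FG | ϵ
    // every recursive call now happens only after consuming at least one token
    fn e(&mut self) { self.f(); self.g(); }

    fn g(&mut self) {
        if let Some(op) = self.peek() {
            if op == '+' || op == '-' {
                self.consume(op);
                self.f();
                self.g();
            }
        }
    }

    // F ::= (E) | id   (single-letter identifiers for simplicity)
    fn f(&mut self) {
        match self.peek() {
            Some('(') => { self.consume('('); self.e(); self.consume(')'); }
            Some(c) if c.is_ascii_alphabetic() => self.consume(c),
            _ => panic!("parse error"),
        }
    }
}

fn main() {
    let mut p = Parser { toks: "(x+y)-z".chars().collect(), pos: 0 };
    p.e();
    assert_eq!(p.pos, p.toks.len()); // the whole input was consumed
    println!("parsed");
}
```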
removing indirect left recursion
there are several strategies for removing indirect left recursion, but the conceptually simplest is to just inline productions to turn indirect recursion into direct recursion and then apply the transformation from earlier
example
A ::= B | alice
B ::= C | bob
C ::= A charlie
A ::= C | bob | alice
B ::= C | bob
C ::= A charlie
A ::= A charlie | bob | alice
B ::= C | bob
C ::= A charlie
A ::= bob D | alice D
D ::= charlie D | ϵ
B ::= C | bob
C ::= A charlie
note that now `B` and `C` are not reachable from the start symbol `A` and can be removed; in general that may or may not happen
this strategy also works for dealing with ϵ
A ::= B | alice
B ::= C | bob
C ::= DA charlie
D ::= dave | ϵ
A ::= C | bob | alice
B ::= C | bob
C ::= DA charlie
D ::= dave | ϵ
A ::= DA charlie | bob | alice
B ::= C | bob
C ::= DA charlie
D ::= dave | ϵ
A ::= dave A charlie | A charlie | bob | alice
B ::= C | bob
C ::= DA charlie
D ::= dave | ϵ
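for completeness (this step is left implicit in the notes): applying the direct left-recursion transformation from above to the resulting `A` rule, using a fresh nonterminal `E`, gives
A ::= dave A charlie E | bob E | alice E
E ::= charlie E | ϵ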
left-factoring for predictability
after establishing precedence and removing left recursion, the grammar still may not be predictive
left factoring is a transformation that can possibly make it predictive (but doesn’t work for all grammars)
example: this grammar is not predictive because all three rules start with
id
E ::= id | id[E] | id(E)
- but we can make it predictive by factoring out the
id
E ::= id F
F ::= [E] | (E) | ϵ
this is the left-factoring transformation; in general terms:
A ::= αβ₁ | αβ₂ | ... | αβₙ
becomes
A ::= αB
B ::= β₁ | β₂ | ... | βₙ
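a sketch of what left-factoring buys us in the parser (hypothetical `Parser` helper with single-character tokens, not assignment code): after factoring, a single token of look-ahead decides between `[E]`, `(E)`, and ε:

```rust
struct Parser { toks: Vec<char>, pos: usize }

impl Parser {
    fn peek(&self) -> Option<char> { self.toks.get(self.pos).copied() }

    fn consume(&mut self, c: char) {
        assert_eq!(self.peek(), Some(c), "parse error");
        self.pos += 1;
    }

    // E ::= id F   (single-letter identifiers for simplicity)
    fn e(&mut self) {
        match self.peek() {
            Some(c) if c.is_ascii_alphabetic() => self.consume(c),
            _ => panic!("parse error"),
        }
        self.f();
    }

    // F ::= [E] | (E) | ε  -- one token of look-ahead picks the alternative
    fn f(&mut self) {
        match self.peek() {
            Some('[') => { self.consume('['); self.e(); self.consume(']'); }
            Some('(') => { self.consume('('); self.e(); self.consume(')'); }
            _ => {} // ε: consume nothing and let the caller continue
        }
    }
}

fn main() {
    for input in ["x", "a[i]", "f(g(y))"] {
        let mut p = Parser { toks: input.chars().collect(), pos: 0 };
        p.e();
        assert_eq!(p.pos, p.toks.len());
        println!("{input}: parsed");
    }
}
```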
a final note [only need to cover if someone asks about it]
you might have learned in 138 that we can transform any grammar to remove ϵ rules, which might seem to make FOLLOW sets unnecessary
however, this transformation can make the grammar non-predictive, which we would need to fix using left-factoring
but left-factoring may re-introduce the ϵ rules, so we have to deal with them
validation
what is validation
after parsing we know that the program is syntactically correct, but that doesn’t necessarily mean that it’s valid
what does “valid” mean? whatever the language designer decides it means
there are some common, trivial rules that are easy to check, like:
- there is a
main
function (or whatever entry point the language has) - no two functions have the same name
- basically any purely syntactic check…
non-trivial validation can be done in several different ways
cflat has a static type system, like C, C++, Java, etc; this means non-trivial validation is done by the compiler
other languages have dynamic type systems, where non-trivial validation happens at runtime
we’ll talk more about the pros and cons of this choice later
what is a static type system
- recall the following example
let x:int = 0, y:int = 10;
x = new int;
y = x * y;
this program doesn’t make sense because it multiplies a pointer by an integer; other examples of things that don’t make sense are: trying to access an integer as if it were a struct, trying to index into a function as if it were an array, etc
- we can’t compile these nonsensical programs because there aren’t any sensible assembly instructions that we can translate them to
a type system is a framework that allows us to precisely specify what things “make sense” for our language and what things do not
- a type checker uses the type system in order to filter programs to only allow those that make sense
a type describes a set of values along with what operations are valid for those values
- `int` describes the set of integers; allowed to compare and do arithmetic
- `&int` describes the set of pointers to integers; allowed to dereference
- `[int]` describes the set of arrays with integer elements; allowed to index into
- etc
saying that expression `e` has type `T` means that when `e` is evaluated at runtime, the resulting value will belong to the set described by `T`
- `1 + 2` has type `int` because its value is `3`, which is an integer
- `new int` has type `&int` because its value is a pointer to a location in the heap that holds an integer
the type system provides a set of rules that allow us to assign a type to every element of the AST as long as that type is compatible with the operations being done by/on that element
for example, the rules would not allow us to assign a type to `x * y` in the example above because `*` is not a valid operation on pointers
note that type systems are language-specific; we could invent a new language where it does make sense to multiply a pointer by an integer, in which case its type system would allow it
type systems are covered in detail in CS 162, so i’m only going over the very basics here; i recommend 162 to learn more
how do type systems work
a type system is defined by:
the set of types that the language recognizes
the typing rules that describe how to assign types to elements of the AST
the set of types is usually described via a CFG (or something equivalent)
- note that there are usually an infinite number of possible types
EXAMPLE
τ ∈ Type ::= int | ref τ | τ... → τ
[sequences of types may use vector notation or overlines instead of ellipses]
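in an implementation, a type CFG like this usually becomes a recursive data type; a minimal Rust sketch (the names `Type`, `Ref`, and `Fun` are mine, not cflat’s actual representation):

```rust
// τ ∈ Type ::= int | ref τ | τ... → τ
#[derive(Debug)]
enum Type {
    Int,
    Ref(Box<Type>),
    Fun { params: Vec<Type>, ret: Box<Type> },
}

fn main() {
    // the type (int, ref int) → int, written using the grammar above
    let t = Type::Fun {
        params: vec![Type::Int, Type::Ref(Box::new(Type::Int))],
        ret: Box::new(Type::Int),
    };
    println!("{t:?}");
}
```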
the typing rules consist of:
typing judgements that state that a particular typing fact holds (e.g., that this element of the AST has this type)
inference rules that state how we can conclude whether a particular judgement is true
there is a very strong connection between type systems and formal logic
judgements and inference rules come from the natural deduction style of formal logic
the curry-howard correspondence states that types are propositions and programs are proofs; from this perspective a type checker is a proof verifier
a typing judgement states that a typing fact holds, and looks something like
Γ ⊢ e : τ
where:
- `Γ` is a list of typing facts about the variables in scope, called the context
- `e` is the expression we’re typing
- `τ` is the type of `e` that the judgement is asserting
EXAMPLE:
x:int, y:ref int ⊢ x + *y : int
- we needed the context because `x + *y` by itself says nothing about the types of `x` or `y`, so it’s impossible to determine what the type of `x + *y` is without that additional information
an inference rule explains how to prove that a judgement is correct; it is essentially an “if…then…” rule and looks like this:
EXAMPLE
Γ ⊢ e : ref τ
-------------
Γ ⊢ *e : τ
- from this example rule we are allowed to conclude:
x : ref int ⊢ *x : int
x : ref ref int ⊢ *x : ref int
x : ref ref int ⊢ **x : int
an inference rule may have 0 or more premises; all premises must be true in order to prove that the conclusion is true
- 0 premises means it’s always true, i.e., an axiom
a type system consists of a set of inference rules that can refer to each other recursively, along with at least one non-recursive rule as a base case
EXAMPLE
Γ(x) = τ
---------- [VAR]
Γ ⊢ x : τ
Γ ⊢ e : ref τ
------------- [REF]
Γ ⊢ *e : τ
Γ ⊢ e1 : int Γ ⊢ e2 : int
--------------------------- [ADD]
Γ ⊢ e1 + e2 : int
from these example rules we can now prove that
x:int, y:ref int ⊢ x + *y : int
- [draw expression as AST, type each node using the above rules]
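spelling out that proof (with Γ = x:int, y:ref int) as a derivation tree in the same notation, read bottom-up: [ADD] needs both premises, each of which follows from [VAR] (and [REF] for the right-hand side):
Γ(x) = int           Γ(y) = ref int
----------- [VAR]    --------------- [VAR]
Γ ⊢ x : int          Γ ⊢ y : ref int
                     --------------- [REF]
                     Γ ⊢ *y : int
------------------------------------ [ADD]
Γ ⊢ x + *y : int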
inference rules allow us to conclude typing judgements, and are the only way to conclude typing judgements…if there is an expression s.t. no inference rule applies, then we cannot assign a type to that expression and it is a type error
EXAMPLE:
`x : int ⊢ *x` is ill-typed, as is `x:int, y:ref int ⊢ x + y`
turning a type system into a type checker implementation can range from straightforward to really complicated depending on the type system itself
in our case it will be straightforward
turn each inference rule into a separate function
each function takes an expression of the kind given in the conclusion of the rule and returns the type of that expression
if the rule has premises, the function makes calls to other functions passing the context and appropriate sub-expression as arguments
if there is no applicable rule then we return a type error
EXAMPLE
fn type(Γ : Context, e : Exp) -> Type {
if e is a variable then var(Γ, e)
else if e is a dereference then ref(Γ, e)
else if e is an addition then add(Γ, e)
}
fn var(Γ : Context, x : Variable) -> Type {
if Γ contains x then Γ(x)
else fail
}
fn ref(Γ : Context, e : Exp) -> Type {
if type(Γ, e) = ref τ then τ
else fail
}
fn add(Γ : Context, e1 + e2 : Exp) -> Type {
if type(Γ, e1) = int and type(Γ, e2) = int then int
else fail
}
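a minimal runnable Rust version of that sketch, covering just the three example rules (the `Type`/`Exp`/`Context` names are mine, the per-rule helper functions are collapsed into one match, and cflat’s real checker has many more cases):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Type { Int, Ref(Box<Type>) }

enum Exp {
    Var(String),
    Deref(Box<Exp>),
    Add(Box<Exp>, Box<Exp>),
}

type Context = HashMap<String, Type>;

// one case per inference rule; if no rule applies, report a type error
fn type_of(gamma: &Context, e: &Exp) -> Result<Type, String> {
    match e {
        // [VAR]: Γ(x) = τ  ⟹  Γ ⊢ x : τ
        Exp::Var(x) => gamma
            .get(x)
            .cloned()
            .ok_or_else(|| format!("unbound variable {x}")),
        // [REF]: Γ ⊢ e : ref τ  ⟹  Γ ⊢ *e : τ
        Exp::Deref(inner) => match type_of(gamma, inner)? {
            Type::Ref(t) => Ok(*t),
            t => Err(format!("dereferencing a non-reference ({t:?})")),
        },
        // [ADD]: Γ ⊢ e1 : int  and  Γ ⊢ e2 : int  ⟹  Γ ⊢ e1 + e2 : int
        Exp::Add(e1, e2) => {
            if type_of(gamma, e1)? == Type::Int && type_of(gamma, e2)? == Type::Int {
                Ok(Type::Int)
            } else {
                Err("both operands of + must be int".to_string())
            }
        }
    }
}

fn main() {
    // Γ = x:int, y:ref int
    let mut gamma = Context::new();
    gamma.insert("x".to_string(), Type::Int);
    gamma.insert("y".to_string(), Type::Ref(Box::new(Type::Int)));

    // x + *y is well-typed: prints Ok(Int)
    let good = Exp::Add(
        Box::new(Exp::Var("x".to_string())),
        Box::new(Exp::Deref(Box::new(Exp::Var("y".to_string())))),
    );
    println!("{:?}", type_of(&gamma, &good));

    // *x is ill-typed: prints Err("dereferencing a non-reference (Int)")
    let bad = Exp::Deref(Box::new(Exp::Var("x".to_string())));
    println!("{:?}", type_of(&gamma, &bad));
}
```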
how do we come up with a type system for a language?
it is language-dependent (the examples above are for a made-up language)
the language designer must decide what types they want in the language based on the values and operators present in the language
they must use the semantics of the language (i.e., how programs behave when they are executed) in order to create the appropriate inference rules
ideally they would prove that their type system is sound, i.e., a well-typed program should always behave meaningfully wrt its semantics
this doesn’t mean that a well-typed program is always correct wrt what the programmer intended, just that whatever it does is always well-defined according to the language semantics
if `x` and `y` are int and i write `x < y` when i should have used `x <= y`, that is not a type error even if it is a logic error, because `<` is well-defined for integers
if i write `x(42)` when `x` is an int instead of a function then it should be a type error
that is, a sound type system should catch all bugs that count as type errors across all programs, no matter what they’re trying to do
categorizing type systems
type systems can be classified along a number of dimensions; we’ll list some of the important ones and discuss their pros and cons
i have my own biases, but the point isn’t to convince you which choices are the right ones, it’s to give you a foundation so that you can decide for yourselves
static vs dynamic
a language’s type system can be static or dynamic
- static: types are checked by the compiler
- dynamic: types are checked at runtime
benefits of static
static type systems can guarantee the absence of entire classes of runtime errors
potential errors are detected at compile time instead of lurking in production code waiting to be triggered by a user
type annotations can serve as useful code documentation (that is automatically checked by the compiler so can’t go stale)
static type information allows the compiler to do a much better job optimizing the code and results in much faster executables
drawbacks of static:
determining exact program behavior is undecidable; in order to guarantee that programs don’t behave badly, we must necessarily flag some programs that wouldn’t behave badly (example: `let x:int; if false { x = *x }`)
- we can make type systems more powerful and expressive to recognize more programs as correct, but type checking becomes more expensive and complicated; we can easily move into the realm of undecidable type checking
advanced type systems can also be complicated for programmers to use and require lots of expertise
benefits of dynamic:
faster, more flexible development because the programmer doesn’t need to satisfy the type checker
easier for the language designer because they don’t need to create a type system or implement a type checker
drawbacks of dynamic:
big performance penalty; programs in dynamic languages are generally much slower than comparable programs in static languages
errors don’t show up until runtime (e.g., after the program has been put in production and is being used by customers)
we can mitigate this somewhat with extensive testing, but static type systems catch all typing errors and testing can only catch a finite set of typing errors (those covered by the tests)
as an anecdote, my grading script for assign-1, written in python, had a bug in it for exactly this reason: i forgot to qualify the name of a function appropriately and it wasn’t caught until the autograder ran the script when grading a student’s submission
explicit vs implicit
a static type system can be explicit or implicit
- explicit: type annotations are part of the language syntax
- implicit: type annotations are optional
EXAMPLE
fn mul(x: int) -> int { return x * x; }
VS
fn mul(x) { return x * x; }
some people conflate static typing with explicit typing because that’s all they’re familiar with, but they are not the same thing
languages with implicit types allow the programmer to omit type annotations, so the program looks just like a dynamically-typed one; the compiler must then infer what the types should be in order to perform type checking
obviously the more expressive the type system the more difficult and expensive this process is
however, it also really helps with the complexity of advanced type systems
the benefit of explicit type systems is that the programmer writes the types, so the types are immediately evident, adding documentation and making the code easier to understand
the drawback of implicit type systems is that the programmer does not write the types, so the types are not evident, lacking documentation and making the code harder to understand
- however, this is mitigated by modern IDEs which can do type inference and automatically add the types to the code’s display
a number of modern languages take a mid-point, where the programmer must annotate some type information (e.g., the types of function parameters and function return values) but the rest can be inferred by the compiler
- c++ added the `auto` keyword that instructs the compiler to infer an appropriate type, which is really useful for complicated types with lots of templating (though it’s rather rudimentary compared to other implicitly-typed languages)
strong vs weak
a type system can be strong or weak
- strong: provides guarantees (a well-typed program “can’t go wrong”, i.e., try to execute nonsensical operations)
- weak: no guarantees (a well-typed program can “go wrong”)
in other words, a weak type system can “lie to you”, telling you that the program is well-typed (and so anything it does should make sense) but in reality the program execution could try to do nonsensical things
C and C++ are the most well-known and widely-used languages with weak type systems
not surprisingly, they are also the languages that cause the most problems with bugs and security vulnerabilities
the terms “strong” and “weak” are a bit controversial, and some people use them to mean different things…when discussing them with others, it’s good to make sure everyone is using the terms the same way or it will get really confusing
strong guarantees seem obviously preferable to weak guarantees; why would we design a language with a weak type system?
the essential reason is performance, which comes up in a number of ways
an important example is dealing with memory
garbage collection requires lots of runtime resources, causes unpredictable performance, and requires high memory usage (all of this was especially true back when C and C++ were being developed; it’s less true today but still true)
- until recently, all strongly-typed languages had to use garbage collection
if we don’t want that overhead, then (at least until recently) the only alternative was to allow the programmer to manage memory manually (e.g., `free`/`delete`)
but programmers often get manual memory management wrong, and that leads to memory unsafety (e.g., use-after-free errors), and that memory unsafety violates the guarantees provided by the type system
these errors also lead to lots of security vulnerabilities
allowing memory safety errors automatically leads to weak type systems, because those errors can be used to execute any operation we want regardless of what the type system says
the motivation behind the Rust programming language was to allow low-level systems programming without garbage collection
its designers were basically fed up with c++ and wanted a safe alternative
the key thing that makes this possible is a linear type system, which is based on linear logic; both ideas have been in academia for decades
Rust is the first mainstream language to adopt the ideas and put them into practice in a production-quality language
linear types allow the compiler to track memory ownership and, essentially, automatically insert `free`/`delete` where necessary in a way that is guaranteed to be correct
it isn’t just memory; another performance-based example is signed integer overflow, which is undefined behavior in C/C++
- constantly checking every arithmetic operation would be expensive, so we don’t bother and leave it up to the programmer to prevent overflow
C and C++ have literally hundreds of different kinds of undefined behaviors given in the language specs…do you know them all?
why use a static type system at all if it is lying to you? again because of performance: having static types lets the compiler do a much better job of optimizing code
technically a dynamically-typed programming language could use a weak type system, but there isn’t much point: at runtime there is enough type information available that being weak would be a weird choice
- mostly we talk about static type systems being strong or weak
the cflat type system
cflat has a simple, static, explicit, strong type system; what does this mean?
- simple: no polymorphism, dependent types, substructural types, etc
- static: types are checked at compile-time
- explicit: the programmer explicitly annotates type information in the code
- strong: a well-typed program cannot perform invalid operations
[go over `cflat-docs/validation.pdf`]
rule ARRAY: we don’t check that the index is within bounds; this will be a dynamic check that we add during codegen
rule FIELD: notice that this implies that accessing a struct pointer automatically dereferences it, like java and unlike c++
EXAMPLE [see OneNote]
struct foo { f1: [int] }
fn main() -> int {
let a:&foo;
a = new foo;
a.f1 = [int; 1];
return *bar(a.f1[0]);
}
fn bar(p: int) -> &int {
let q: &int;
q = new int;
if p < 10 { *q = p; return q; }
else { return nil; }
}
Program
- structs = { Struct("foo", { Decl("f1", Array(Int)) }) }
- externs = {}
- functions = {
Function(
name = "main"
params = []
rettyp = Int
locals = { Decl("a", Ptr(Struct("foo"))) }
stmts = [
Assign(
Id("a"),
NewSingle(Struct("foo"))
),
Assign(
FieldAccess(Id("a"), "f1"),
NewArray(Int, Num(1))
),
Return(
Deref(
FunCall(
Id("bar"),
ArrayAccess(
FieldAccess(Id("a"), "f1"),
Num(0)
)
)
)
)
]
),
Function(
name = "bar"
params = [ Decl("p", Int) ]
rettyp = Ptr(Int)
locals = { Decl("q", Ptr(Int)) }
stmts = [
Assign(
Id("q"),
NewSingle(Int)
),
If(
BinOp(Lt, Id("p"), Num(10)),
[
Assign(
Deref(Id("q")),
Id("p")
),
Return(Id("q"))
],
[Return(Nil)]
)
]
)
}
front-end recap
let’s review the steps we needed to go through to define our compiler front-end
start with a grammar for the concrete syntax of the language
determine tokens and their regex descriptions
define token data structure
translate token descriptions to NFA and use maximal munch and prioritization to remove any ambiguity
implement lexer to translate from source code to token stream
check if grammar is LL(1), if not:
decide on precedence levels and factor grammar to enforce them
eliminate any left recursion
check if grammar is predictive, apply left-factoring if necessary
if grammar is still not LL(1), we need to change the concrete syntax
define the AST data structure
translate grammar to recursive descent parser
instrument parser functions to build AST
validate AST
i want to emphasize that we’ve simplified some issues that happen in real-world languages like C, C++, Java, etc
- these often require some creative thinking to get them to fit into the frontend framework we’ve covered here, mostly because of ambiguity in the lexemes and/or grammar or because lexemes aren’t strictly regular