Compiling Techniques Notes
Table of Contents
High level view of a compiler
Takes source code and produces machine code, often creates errors instead.
- Must recognise legal (and illegal) code
- Must generate correct code
- Must produce understandable errors
Two pass compiler
Uses an intermediate representation (IR). Maps front end to IR and IR to machine code. Admits multiple front ends and multiple passes. Front is \(O(n)\) or \(O(n\log n)\) while back-end is NP-complete.
Common fallacy is front ends and back-ends can be swapped for every system. This is the Holy Grail of compiler research, and is currently still theory.
Front end chain
Lexer
Consists of a Scanner and Tokeniser. Scanner reads in text and tokeniser recognises words from.
Example: \(x=y+2\) becomes IDENTIFIER(x) EQUAL IDENTIFIER(y) PLUS NUMBER(2)
Parser
Recognises context free syntax and report errors. Easy to build hand coded parsers.
Semantic Analyser
Guides context sensitive (semantic) analysis
IR generator
Generates the IR for the rest of the compiler
Abstract Syntax Tree (AST)
Summarises grammatical structure
Back end
Translate IR to target machine code. Choose instructions for IR chain. Ensure conformance with system architecture.
Instruction selection
Pattern matching. Produce fast, compact code.
Register allocation
Limited registers, need to manage a limited set of resources. Can change instruction choices and insert LOAD and STORE. Optimal is NP-Complete as is usually a graph colouring problem.
Instruction scheduling
Avoid hardware stalls and interlocks. Use all functional units productively. Optimal is NP-Complete.
Middle end
Code improvement and optimisation. Analyse and improve IR. Aim to reduce the running time (or space or power) of compiled code, while preserving meaning.
Optimiser
Multiple passes through the optimiser to try and improve performance.
Restructurer
Translate high level IR to low level IR. Can be useful for "high level" optimisation.
The Lexer
Maps character stream into words - basic syntax units. Typical tokens: number, identifier, +, -, new, while, for. Scanner removes white-space and comments.
Context-free language
Specified in a grammar, e.g. SheepNoise \(\to\) SheepNoise baa | baa.
Regular Expressions
Represented using an augmented BNF notation Expression for all integers: ['-'] (('1'|…|'9') ('0' | … | '9')+ | '0')