Foundations of Software Design
Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs
Marti Hearst, Fall 2002

How Do Computers Work (Revisited)?
Bits & Bytes
Binary Numbers
Number Systems
Orders of Magnitude
Gates
Boolean Logic
Circuits
CPU
Machine Instructions
Assembly Language
Programming Languages
Address Space
Code vs. Data
Compiler

Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Compiler
• What is a compiler?
– A recognizer (of some source language L).
– A translator (of programs written in L into programs written in some object or target language L').
• A compiler is itself a program, written in some host language.
• Operates in phases.
Converting Java to Byte Code
• When you compile a Java program, javac produces byte codes (stored in the class file).
• javac does not convert the byte codes to machine code.
• Instead, the JVM executes them (by interpreting them, possibly with just-in-time compilation) when you run the program called java.
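
C code -> translated by the C compiler (gcc or cc) -> Assembly Language -> Machine Code
Java code -> translated by the Java compiler (javac) -> Byte code (class file) -> run in the Java Virtual Machine
– A JIT compiler, if the JVM has one, translates byte code to machine code at run time
– The JVM is created once
– Each individual program is loaded & run in the JVM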

Compiler Compilers
• Which came first: the compiler or the program?
– The very first one has to be written in assembly language!
– This is why most programming languages today start with a compiler written in C
• After you have created the first compiler for a given language, say Java, then you …
• Use that compiler to compile a new compiler written in the language itself!!

Compiling Your Compiler
1. Write the first Java compiler using C (javac in C); compile it using gcc.
2. Write the second Java compiler using Java (javac in Java); compile it using javac.
3. Write other Java programs; compile them using javac.

Compiler in more detail
Lexical analyzer (scanner)
Syntax analyzer (parser)
Semantic analyzer
Intermediate Code Generator
Optimizer
Code Generator

The Scanner
• Task:
– Translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes).
• How it's done
– Specify lexemes using Regular Expressions
– Convert these Regular Expressions into Finite Automata

Lexemes and Tokens
Here are some Java lexemes and the corresponding tokens:
Lexeme   Token
;        SEMI-COLON
=        ASSIGN
index    IDENT
tmp      IDENT
37       INT-LIT
102      INT-LIT
Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers).
Given the source code:
position = initial + rate * 60 ;
a Java scanner would return the following sequence of tokens:
IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON
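The lexeme-to-token mapping above can be sketched with Java's regular-expression library. This is only an illustration, not how a real Java scanner is built: the class name TinyScanner, the method scan, and the group names are hypothetical, and only the handful of lexemes from the example are recognized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyScanner {
    // One named group per token kind. Only the lexemes used in the example are covered.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)"
        + "|(?<INTLIT>[0-9]+)"
        + "|(?<ASSIGN>=)"
        + "|(?<PLUS>\\+)"
        + "|(?<TIMES>\\*)"
        + "|(?<SEMICOLON>;))");

    // Regex group name -> token name as written on the slide.
    private static final String[][] KINDS = {
        {"IDENT", "IDENT"}, {"INTLIT", "INT-LIT"}, {"ASSIGN", "ASSIGN"},
        {"PLUS", "PLUS"}, {"TIMES", "TIMES"}, {"SEMICOLON", "SEMI-COLON"}
    };

    // Groups characters into lexemes and returns the token name for each one.
    static List<String> scan(String source) {
        String text = source.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                // No token pattern matches here: a lexical error.
                throw new IllegalArgumentException("Lexical error at offset " + pos);
            }
            for (String[] kind : KINDS) {
                if (m.group(kind[0]) != null) {
                    tokens.add(kind[1]);
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("position = initial + rate * 60 ;"));
    }
}
```

Note that the scanner happily accepts position = * 5 ; as well: every lexeme is a legal token, and rejecting that sequence is the parser's job.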

The Scanner
• Also called the Lexer
• How it works:
– Reads characters from the source program.
– Groups the characters into lexemes (sequences of characters that "go together").
– Each lexeme corresponds to a token;
  • the scanner returns the next token (plus maybe some additional information) to the parser.
– The scanner may also discover lexical errors (e.g., erroneous characters).
• The definitions of what is a lexeme, token, or bad character all depend on the source language.

Two kinds of Automata
Deterministic (DFA):
– No state has more than one outgoing edge with the same label.
Non-Deterministic (NFA):
– States may have more than one outgoing edge with the same label.
– Edges may be labeled with ε (epsilon), the empty string.
– The automaton can take an epsilon transition without looking at the current input character.
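To make the NFA rules concrete, here is a minimal sketch of simulating an NFA by tracking the set of states it could be in. Everything named here (TinyNfa, the EPSILON marker, the example automaton for the language {a, ab}) is a hypothetical illustration rather than part of the lecture.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TinyNfa {
    // Hypothetical marker for epsilon edges; assumes '\0' never appears in the input.
    static final char EPSILON = '\0';

    // edges.get(state).get(label) is the set of target states.
    private final Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();
    private final Set<Integer> accepting = new HashSet<>();

    void addEdge(int from, char label, int to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(label, k -> new HashSet<>())
             .add(to);
    }

    void addAccepting(int state) { accepting.add(state); }

    private Set<Integer> targets(int state, char label) {
        Map<Character, Set<Integer>> out = edges.get(state);
        if (out == null || !out.containsKey(label)) return new HashSet<>();
        return out.get(label);
    }

    // All states reachable from the given states using only epsilon edges.
    private Set<Integer> epsilonClosure(Set<Integer> states) {
        Set<Integer> closure = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (int next : targets(work.pop(), EPSILON)) {
                if (closure.add(next)) work.push(next);
            }
        }
        return closure;
    }

    boolean accepts(int start, String input) {
        Set<Integer> current = new HashSet<>();
        current.add(start);
        current = epsilonClosure(current);
        for (char c : input.toCharArray()) {
            Set<Integer> moved = new HashSet<>();
            for (int s : current) moved.addAll(targets(s, c));
            current = epsilonClosure(moved);
        }
        for (int s : current) {
            if (accepting.contains(s)) return true;
        }
        return false;
    }

    // Example NFA for {a, ab}: state 0 has two epsilon edges, and both branches read 'a'.
    static TinyNfa example() {
        TinyNfa nfa = new TinyNfa();
        nfa.addEdge(0, EPSILON, 1);
        nfa.addEdge(0, EPSILON, 3);
        nfa.addEdge(1, 'a', 2);
        nfa.addEdge(3, 'a', 4);
        nfa.addEdge(4, 'b', 5);
        nfa.addAccepting(2);
        nfa.addAccepting(5);
        return nfa;
    }

    public static void main(String[] args) {
        System.out.println(example().accepts(0, "ab"));
    }
}
```

Tracking a set of states is also the idea behind converting an NFA to a DFA: each distinct set becomes one DFA state.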

Regular Expressions to Finite Automata
• Generating a scanner
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA
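As a rough idea of what a table-driven DFA looks like, the sketch below hand-codes the table for one small token: a Java decimal numeral (a single 0, or a digit 1-9 followed by any digits). The class name, state numbering, and character classes are assumptions made for this illustration; a scanner generator would produce a much larger table automatically.

```java
class DecimalNumeralDfa {
    // Character classes (columns): 0 = '0', 1 = '1'..'9', 2 = anything else.
    // States (rows): 0 = start, 1 = saw a lone "0", 2 = saw a nonzero-led number, 3 = dead.
    private static final int[][] NEXT = {
        {1, 2, 3},
        {3, 3, 3},
        {2, 2, 3},
        {3, 3, 3},
    };
    private static final boolean[] ACCEPTING = {false, true, true, false};

    private static int charClass(char c) {
        if (c == '0') return 0;
        if (c >= '1' && c <= '9') return 1;
        return 2;
    }

    // Runs the DFA: exactly one table lookup per input character.
    static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][charClass(input.charAt(i))];
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        System.out.println(matches("1996") + " " + matches("007"));
    }
}
```

Because no state has two outgoing edges with the same label, the loop never has to guess or backtrack.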

BNF
• Backus-Naur form, Backus-Normal form
– A set of rules (or productions)
– Each of which expresses the ways symbols of the language can be grouped together
  • Non-terminals are written upper-case
  • Terminals are written lower-case
  • The start symbol is the left-hand side of the first production
• The rules for a context-free grammar (CFG) are often referred to as its BNF

Java Identifier Definition
Described in the Java specification:
– http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591
– “An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
– An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).”
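The quoted definition can be checked directly with methods from the Java standard library. This is a sketch under two assumptions: the class name IdentifierCheck is hypothetical, and the check works char by char, so it ignores supplementary Unicode characters. SourceVersion.isKeyword follows the rules of the latest Java version known to the running JDK, which may differ slightly from the second-edition specification cited above.

```java
import javax.lang.model.SourceVersion;

class IdentifierCheck {
    static boolean isIdentifier(String s) {
        // The first character must be a Java letter.
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // The rest may be Java letters or Java digits.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        // Keywords, boolean literals, and null are spelled like identifiers but are excluded.
        return !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"position", "tmp2", "2fast", "while", "null"}) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```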
Java Integer Literals
• An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8)
• Examples: 0 2 0372 0xDadaCafe 1996 0x00FF00FF
(opt means optional)

Defining Java Decimal Numerals
A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer.
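The three integer-literal forms can be written as regular expressions, which is how they would be specified to a scanner generator. The sketch below is an illustration only: the class and method names are hypothetical, the optional l or L type suffix is included, and later language additions (binary literals, underscores in digits) are deliberately left out.

```java
import java.util.regex.Pattern;

class IntegerLiteralKinds {
    // A lone 0, or a digit 1-9 followed by any digits; optional type suffix.
    private static final Pattern DECIMAL = Pattern.compile("(0|[1-9][0-9]*)[lL]?");
    // 0x or 0X followed by at least one hexadecimal digit.
    private static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+[lL]?");
    // A leading 0 followed by at least one octal digit.
    private static final Pattern OCTAL = Pattern.compile("0[0-7]+[lL]?");

    static String classify(String lexeme) {
        if (DECIMAL.matcher(lexeme).matches()) return "decimal";
        if (HEX.matcher(lexeme).matches()) return "hexadecimal";
        if (OCTAL.matcher(lexeme).matches()) return "octal";
        return "not an integer literal";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "2", "0372", "0xDadaCafe", "1996", "0x00FF00FF"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

A lexeme such as 09 matches none of the three patterns: the leading 0 rules out decimal, and 9 is not an octal digit.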

Defining Floating-Point Literals
A floating-point literal has the following parts: a whole-number part, a decimal point (represented by an ASCII period character), a fractional part, an exponent, and a type suffix. The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer.

From the Lucene HTML Scanner

The Functionality of the Parser
• Input: sequence of tokens from lexical analysis
• Output: parse tree of the program
– parse tree is generated if the input is a legal program
– if input is an illegal program, syntax errors are issued
• Note:
– Instead of parse tree, some parsers produce directly:
  • abstract syntax tree (AST) + symbol table, or
  • intermediate code, or
  • object code

Parser vs. Scanner
Phase    Input                 Output
Scanner  String of characters  String of tokens
Parser   String of tokens      Parse tree

The Parser
• Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
• Finds syntax errors.
– Example
  • position = * 5 ;
– corresponds to the sequence of tokens: IDENT ASSIGN TIMES INT-LIT SEMI-COLON
– All are legal tokens, but that sequence of tokens is erroneous.
• Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
• Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

What must the parser do?
1. Recognizer: not all strings of tokens are programs
– must distinguish between valid and invalid strings of tokens
2. Translator: must expose program structure
  • e.g., associativity and precedence
  • must return the parse tree
We need:
– A language for describing valid strings of tokens
  • context-free grammars
  • (analogous to regular expressions in the scanner)
– A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
  • the parser
  • (analogous to the state machine in the scanner)

Parser Example
position = initial + rate * 60 ;
=
  position
  +
    initial
    *
      rate
      60
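One common way to build such a tree by hand is recursive descent, with one method per grammar rule. The sketch below assumes a small grammar invented for this example (STMT -> ident = EXPR ; with EXPR -> TERM { + TERM }, TERM -> FACTOR { * FACTOR }, FACTOR -> ident | intlit), assumes lexemes are separated by whitespace, and prints the tree in prefix form. The class TinyParser and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class TinyParser {
    private final List<String> lexemes;
    private int pos = 0;

    TinyParser(String source) {
        // Assumption: lexemes are separated by whitespace, as in the example statement.
        this.lexemes = Arrays.asList(source.trim().split("\\s+"));
    }

    private String peek() {
        return pos < lexemes.size() ? lexemes.get(pos) : "<end>";
    }

    private void expect(String lexeme) {
        if (!peek().equals(lexeme)) {
            throw new IllegalStateException(
                "Syntax error: expected '" + lexeme + "' but found '" + peek() + "'");
        }
        pos++;
    }

    // STMT -> ident = EXPR ;
    String parseStatement() {
        String target = peek();
        if (!target.matches("[A-Za-z_]\\w*")) {
            throw new IllegalStateException("Syntax error: expected identifier but found '" + target + "'");
        }
        pos++;
        expect("=");
        String value = parseExpr();
        expect(";");
        if (pos != lexemes.size()) {
            throw new IllegalStateException("Syntax error: unexpected '" + peek() + "'");
        }
        return "(= " + target + " " + value + ")";
    }

    // EXPR -> TERM { + TERM }, left-associative.
    private String parseExpr() {
        String left = parseTerm();
        while (peek().equals("+")) {
            pos++;
            left = "(+ " + left + " " + parseTerm() + ")";
        }
        return left;
    }

    // TERM -> FACTOR { * FACTOR }; sits below EXPR, so * binds tighter than +.
    private String parseTerm() {
        String left = parseFactor();
        while (peek().equals("*")) {
            pos++;
            left = "(* " + left + " " + parseFactor() + ")";
        }
        return left;
    }

    // FACTOR -> ident | intlit
    private String parseFactor() {
        String lexeme = peek();
        if (!lexeme.matches("[A-Za-z_]\\w*|[0-9]+")) {
            throw new IllegalStateException(
                "Syntax error: expected identifier or integer but found '" + lexeme + "'");
        }
        pos++;
        return lexeme;
    }

    public static void main(String[] args) {
        System.out.println(new TinyParser("position = initial + rate * 60 ;").parseStatement());
    }
}
```

The nesting of the grammar rules is what encodes precedence: rate * 60 is grouped first because TERM is parsed inside EXPR. Feeding the parser position = * 5 ; raises a syntax error even though every lexeme is a legal token.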

The Semantic Analyzer
• The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors.
• Annotates and/or changes the abstract syntax tree
– (e.g., it might annotate each node that represents an expression with its type).
– Example with before and after:
Before:
=
  position
  +
    initial
    *
      rate
      60
After:
= (float)
  position (float)
  + (float)
    initial (float)
    * (float)
      rate (float)
      int-to-float() (float)
        60 (int)
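The before-and-after trees can be reproduced with a small type-annotation pass. This is a sketch under several assumptions: only the types int and float exist, position, initial, and rate are declared as float (as the annotated tree implies), and the class names TypeChecker and Node are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TypeChecker {
    // Minimal AST node: leaf (no children), conversion (left only), or binary operator.
    static class Node {
        final String label;
        final Node left;
        final Node right;
        String type; // filled in by the analyzer

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            if (left == null) return label + ":" + type;
            if (right == null) return label + "(" + left + "):" + type;
            return "(" + label + " " + left + " " + right + "):" + type;
        }
    }

    // Symbol table: variable name -> declared type.
    private final Map<String, String> symbols = new HashMap<>();

    void declare(String name, String type) { symbols.put(name, type); }

    // Annotates each node with its type; returns the (possibly rewritten) subtree.
    // Expects leaves and binary operator nodes as input.
    Node check(Node n) {
        if (n.left == null) {
            if (n.label.matches("[0-9]+")) {
                n.type = "int";
            } else if (symbols.containsKey(n.label)) {
                n.type = symbols.get(n.label);
            } else {
                throw new IllegalStateException("Undeclared variable: " + n.label);
            }
            return n;
        }
        Node l = check(n.left);
        Node r = check(n.right);
        if (!l.type.equals(r.type)) {
            if (n.label.equals("=") && l.type.equals("int")) {
                throw new IllegalStateException("Cannot assign float to int variable " + l.label);
            }
            // Mixed int/float operands: wrap the int side in a conversion node.
            if (l.type.equals("int")) l = toFloat(l); else r = toFloat(r);
        }
        Node result = new Node(n.label, l, r);
        result.type = l.type;
        return result;
    }

    private static Node toFloat(Node intNode) {
        Node conversion = new Node("int-to-float", intNode, null);
        conversion.type = "float";
        return conversion;
    }

    // Builds and checks the tree for: position = initial + rate * 60 ;
    static Node example() {
        TypeChecker checker = new TypeChecker();
        for (String v : new String[] {"position", "initial", "rate"}) {
            checker.declare(v, "float");
        }
        Node times = new Node("*", new Node("rate", null, null), new Node("60", null, null));
        Node plus = new Node("+", new Node("initial", null, null), times);
        return checker.check(new Node("=", new Node("position", null, null), plus));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

The same pass also catches one of the static semantic errors mentioned earlier: a variable missing from the symbol table is reported as undeclared.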

Intermediate Code Generator
The intermediate code generator translates from abstract-syntax tree to intermediate code.
– One possibility is 3-address code.
– Here's an example of 3-address code for the abstract-syntax tree shown above:
temp1 = int-to-float(60)
temp2 = rate * temp1
temp3 = initial + temp2
position = temp3
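A simple way to produce 3-address code like this is a post-order walk of the tree that stores each intermediate result in a fresh temporary. The sketch below is one possible approach, not necessarily the one used by any particular compiler; ThreeAddressGen, Node, and the temp naming scheme are assumptions chosen to match the example.

```java
import java.util.ArrayList;
import java.util.List;

class ThreeAddressGen {
    // Minimal AST node: leaf (no children), conversion (left only), or binary operator.
    static class Node {
        final String label;
        final Node left;
        final Node right;

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    private String newTemp() { return "temp" + (++temps); }

    // Emits code for a subtree and returns the name that holds its value.
    private String emit(Node n) {
        if (n.left == null) {
            return n.label; // leaf: a variable or literal is used directly
        }
        if (n.right == null) { // conversion node such as int-to-float
            String arg = emit(n.left);
            String t = newTemp();
            code.add(t + " = " + n.label + "(" + arg + ")");
            return t;
        }
        String l = emit(n.left);
        String r = emit(n.right);
        String t = newTemp();
        code.add(t + " = " + l + " " + n.label + " " + r);
        return t;
    }

    // Expects an assignment node: the left child is the target variable.
    List<String> generate(Node assignment) {
        String value = emit(assignment.right);
        code.add(assignment.left.label + " = " + value);
        return code;
    }

    // Generates code for the annotated tree of: position = initial + rate * 60 ;
    static List<String> example() {
        Node sixty = new Node("int-to-float", new Node("60", null, null), null);
        Node times = new Node("*", new Node("rate", null, null), sixty);
        Node plus = new Node("+", new Node("initial", null, null), times);
        Node assign = new Node("=", new Node("position", null, null), plus);
        return new ThreeAddressGen().generate(assign);
    }

    public static void main(String[] args) {
        for (String line : example()) {
            System.out.println(line);
        }
    }
}
```

Each emitted line has at most one operator and at most three names, which is where the term 3-address code comes from.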

The Optimizer
• Examines the program and rewrites it in ways that preserve the meaning but are more efficient.
• Incredibly complex programs and algorithms
• Example
– Move the declaration of temp outside the loop so it isn't re-declared every time the loop is executed
– Change 2*5 to 10 since it is a constant (no need to do an expensive multiply at run time)
– If we removed the line with temp, the program might even skip the loop altogether
  • You can see in advance that count ends up = 30
int count = 0;
for (int j = 0; j < 2*5; j++) {
    int temp = j + 1;
    count += 3;
}
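The rewrites described above can be written out by hand to confirm that they preserve the result. The sketch below shows the loop at three stages; the class and method names are hypothetical, and which of these steps a real compiler actually performs depends on the compiler.

```java
class LoopOptimizationDemo {
    // The loop as written in the example.
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1; // computed but never used
            count += 3;
        }
        return count;
    }

    // After folding the constant 2*5 into 10 and removing the unused temp.
    static int afterFoldingAndCleanup() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    // With a fixed trip count and a constant increment, the loop reduces to 10 * 3.
    static int fullyReduced() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() + " " + afterFoldingAndCleanup() + " " + fullyReduced());
    }
}
```

All three versions return the same value, which is the property every optimization must keep.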

The Code Generator
• The code generator generates object code from (optimized) intermediate code.
LOADF rate,R1
MULF #60.0,R1
LOADF initial,R2
ADDF R2,R1
STOREF R1,position

Tools
• Scanner Generator
– Used to create a scanner automatically
– Input:
  • a regular expression for each token to be recognized
– Output:
  • a finite state machine
– Examples:
  • lex or flex (produce C code), or JLex (produces Java)
• Compiler Compilers
  • yacc (produces C) or JavaCC (produces Java, also has a scanner generator).

From the Lucene HTML Parser

Graphs / Networks
Slide adapted from Goodrich & Tamassia
What is a Graph?
Next Time
• Graph Traversal
• Directed Graphs (digraphs)
• DAGs
• Weighted Graphs