Intro

作者: 官子寒 | 来源:发表于2020-04-15 15:45 被阅读0次

Intro
Intro
Intro
Intro
《The intro》
Intro
Intro
Intro
Intro
Intro

1. Intro

1.1 Compilers and Interpreters

二者的结构不同，一个是直接将程序和数据进行解释执行，而编译器是将程序转化为机器课执行的字节码，再输入数据产生输出

FORTRAN - 第一款编译语言

1.2 Phases of Compilers

Lexicl Analysis: divide the programs text into "words" or "tokens"
Parsing: divide the programs text into "words" or "tokens"
Sementic Analysis: perform limited Sementic Analysis; peform semantic checks
Optimization: to reduce memory, run faster and ...
Code generation:

2. Lexical Analysis

2.1 Lexical Analysis

transform the program text into strings
Classify the strings with Token Classes：Identifiers，keywords，numbers...
- Identifier: Strings of letters, starting with a letter
- Integer: A non-empty string of digits
- Keyword
- Whitespace: blanks

Token<Token Name, Attribute>

Token Name: 标记名
Attribute: 属性，通常用一个id表示，指向Symbol Table中的信息

2.1.1 Input Buffering

Lookahead: requires to decide where one token starts and ends
有时需要向前多看一个或多个输入，因此必须要采用Input Buffering

2.2 Regular Languages

Regular Expressions: 正则表达式

$L(e) \rightarrow m$ : meaning function maps to its syntax to semantics

多对一的映射关系

eg：

Integer: a non-empty string of digits $digit digit^* = digit^+$
Identifier: Strings of letters or digits, starting with a letter

useful for describing many languages
regular language is a language specification

2.3 匹配

2.3.1 匹配规则

Maximal Much: choose the longer one when either the shorter one and the longer one is a feasible string
Priority First
Match none: put error last in the priority

2.4 Finite Automata

accept: 1) end of state 2) in accpet state

2.4.1 RegularExpression to NFA

2.4.2 NFA to DFA

$\epsilon- closure$

An NFA can be many different states at a time
DFA的每个状态都是一个由NFA中的状态构成的集合，即NFA状态集合的一个子集。

NFA转DFA的子集构造(subset construction)算法
 什么是NFA(不确定的有穷自动机)和DFA(确定的有穷自动机)
从NFA到DFA

2.4.3 Transition Table

可以用数组或者链表的方式存储

3. Parsing

3.1 Limitations of Regular Language

????

3.2 Parsing VS. Lexical Analysis

Phase	Input	Output
$Lexer$	String of Characters	String of tokens
$Parser$	String of tokens	Parse Tree

3.3 Context-Free Gramars

A set of Termials $T$
A set of Non-Termials $N$
A start Symbol $S \in N$
A set of Productions $X \rightarrow Y1 Y2...$
$X \in N$ $Y \in T or N or \epsilon$

Termials are permanent
non-termials are in lower case while termials are all caps

3.4 Derivation

3.4.1 Parse Tree

解析树的叶子节点都是Termial节点，内部节点都是non-terminal节点
分类：left-most，right-most 每次解析最左/右边的non-terminal节点，二者解析出来的解析树是一样的

3.5 Ambiguity

当出现多个left-most或者right-most解析树的时候，就出现了Ambiguity
Ambiguity is BAD

3.6 Error Handling

Purpose of the Compilers:

To detect non-valid prorams
To transform the valid ones

3.6.1 Ways of Error Handling

1. 出现错误的情况

栈顶是终结符时，与输入不匹配
栈顶是非终结符时，在预测分析表中没有与输入相匹配的

Panic Mode：忽略输入中的一些符号，直到输入中出现由设计者选择的同步词法单元集合中的某个词法单元
Error Productions: specify the errors in the grammar
不足：让语法很复杂
Automatic Local or Global Corretion: find a correct nearby program and try token insertions and deletions
不足：
- 很难实现
- 降低解析速度
- 附近这个概念不是很精确

3.7 Abstract Syntax Tree

compress all the junks in the parse tree

3.8 Parsing Algorithms

3.8.1 Recursive Descent Parsing

Problem:

left recursive

right recursive
规范规约：最左规约
规范推导：最右推导

3.8.2 自顶向下方法的问题

问题1：当同一非终结符的多个候选式存在多个共同前缀的时候，会产生回溯问题
问题2：无限循环，原因是左递归文法会形成循环

将含有 $A \rightarrow A\alpha$ 格式的产生的文法定义为直接左递归的文法
两步及两步以上推导的文法叫做间接左递归文法

1.消除直接左递归

实际上就是将左递归转化为右递归
代价：引入了非终结符和空表达式

2.消除间接左递归

代入法，转化为直接左递归

3.提取公共前缀

3.9 Predictive Parsing

减少回溯
递归下降分析的一个特例，通过在输入中向前看固定个数符号来选择正确的A-产生式
也称为 $LL(k)$ 生成器

3.9.1 定义

accept LL(K) grammars

传统的production式的语法不能满足Predictive Parsing的需求，提出了left factor
在每一步推导过程中根据当前的最左非终结符A和当前的输入符号a，选择正确的A-产生式
S_文法：每个产生式的右部都以终结符开始；同一非终结符的各个候选式的首终结符都不同；不包含 $\epsilon$ 空产生式

$\epsilon$ 空产生式:

空产生式的使用取决于其后面能紧跟哪些终结符
q_文法
- 相较于S_文法，功能更加强大，因为允许产生式的右部为空串

3.9.2 First Sets & Follow Sets

First Sets和Follow Sets教程

3.9.3 LL1语法

If any entry has multiple moves, then it's not LL1
- not left factored
- left recursive
- ambiguous
- other

3.10 Bottom-Up Parsing

the productions trace a rightmost derivation
Top-down: 从上至下，直到找到所有的terminal node
可以看成是将输入串 $w$ 规约为文法开始符号 $S$ 的过程

3.10.1 Shift-reduce 移入-规约分析

Reduce: unexamined

Shift

3.10.2 Handles 句柄

一旦句柄在栈中形成，就立即将其规约
规范句型 = 栈内符号串 + 剩余输入
自底向上的关键问题是如何识别句柄
When to shift and When to reduce?

Recognizing Handles
- 最左直接短语：高度为2的树的边缘，句柄就是最左直接短语
LR分析法
- 基本原理：根据这些状态来构造自动机进行句柄的识别
- LR(0)项目：右部某位置标有原点的产生式称为相应文法的一个项目
- 增广文法(Augmented Grammar)：使得文法开始符仅出现在一个产生式的左部
- 后继项目
- 移入/规约冲突规约/规约冲突：不是所有的CFG都叫做LR(0)文法
- 如何消解冲突：SLR，LR(1)
- 其实这一部分感觉就是用LR(0)文法来构造Action和Goto表，这样就可以使用LR自动机来