浅析Python解释器的设计（一）

作者: 51reboot | 来源:发表于2018-02-09 10:32 被阅读130次

浅析Python解释器的设计（一）
浅析Python解释器的设计（一）
浅析Python解释器的设计（二）
02-Python解释器
Python基础
2. 使用 Python 解释器
运行Python程序的三种方式
Python基础手册3——Python解释器
[python_doc]python_3.7_chapter.2
python 基础 - 简介

一些铺垫（扯淡）

历史上，在Python 2.4以及之前的版本，py代码的执行，也就是从源码到bytecode分为两步：

解析py源码成为分析树 (Parser/pgen.c)
基于分析树优化缩减bytecode (Python/compile.c)

也是由于Guido van Rossum大叔的历史局限性，Python当年就被实现成这么一个样子了，运行的也挺好。后来随着LLVM党的崛起，大家逐渐认识到这种模式的局限性，后面还会说。

正经说起来，一个标准的“现代”解释器（编译器）应该是“介事儿”的：

把py源码解析成分析树 (Parser/pgen.c)
把分析树转化成抽象语法树AST（Abstract Syntax Tree） (Python/ast.c)
把AST转化成CFG（Control Flow Graph） (Python/compile.c)
基于Control Flow Graph优化生成bytecode (Python/compile.c)

从Python 2.5开始，CPython“改邪归正”走上了现代编译器的康庄大道。简单来说就是把原先的第二步拆解成3个步骤。后面我们会着重简要讲一下后面三步的过程：

这里不会阐述太多关于语法解析如何工作的内容，除了必须的有助于我们了解编译过程的内容。如果想要了解Python语法解析的细节，还是建议你去直接阅读CPython的源码。

语法分析树

2.5版以后的Python的语法解释器是基于“龙书”的LL(1) parser的一个标准的实现。Python的语法定义可以在CPython源码的Grammar/Grammar （https://github.com/python/cpython/blob/master/Grammar/Grammar）看到：

# Grammar for Python# NOTE WELL: You should also follow all the steps listed at# https://docs.python.org/devguide/grammar.html# Start symbols for the grammar:#       single_input is a single interactive statement;#       file_input is a module or sequence of commands read from an input file;#       eval_input is the input for the eval() functions.# NB: compound_stmt in single_input is followed by extra NEWLINE!single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
file_input: (NEWLINE | stmt)* ENDMARKER
eval_input: testlist NEWLINE* ENDMARKER

decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
decorators: decorator+
decorated: decorators (classdef | funcdef | async_funcdef)async_funcdef: ASYNC funcdef
funcdef: 'def' NAME parameters ['->' test] ':' suite

parameters: '(' [typedargslist] ')'typedargslist: (tfpdef ['=' test] (',' tfpdef ['=' test])* [',' [
        '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]]
      | '**' tfpdef [',']]]
  | '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]]
  | '**' tfpdef [','])tfpdef: NAME [':' test]varargslist: (vfpdef ['=' test] (',' vfpdef ['=' test])* [',' [
        '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]]
      | '**' vfpdef [',']]]
  | '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]]
  | '**' vfpdef [','])vfpdef: NAME

stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) |
                     ('=' (yield_expr|testlist_star_expr))*)annassign: ':' test ['=' test]testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']augassign: ('+=' | '-=' | '*=' | '@=' | '/=' | '%=' | '&=' | '|=' | '^=' |
            '<<=' | '>>=' | '**=' | '//=')# For normal and annotated assignments, additional restrictions enforced by the interpreterdel_stmt: 'del' exprlist
pass_stmt: 'pass'flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'continue_stmt: 'continue'return_stmt: 'return' [testlist]yield_stmt: yield_expr
raise_stmt: 'raise' [test ['from' test]]import_stmt: import_name | import_from
import_name: 'import' dotted_as_names# note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSISimport_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+)
              'import' ('*' | '(' import_as_names ')' | import_as_names))import_as_name: NAME ['as' NAME]dotted_as_name: dotted_name ['as' NAME]import_as_names: import_as_name (',' import_as_name)* [',']dotted_as_names: dotted_as_name (',' dotted_as_name)*
dotted_name: NAME ('.' NAME)*
global_stmt: 'global' NAME (',' NAME)*
nonlocal_stmt: 'nonlocal' NAME (',' NAME)*
assert_stmt: 'assert' test [',' test]compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]while_stmt: 'while' test ':' suite ['else' ':' suite]for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]try_stmt: ('try' ':' suite           ((except_clause ':' suite)+                       ['else' ':' suite]
           ['finally' ':' suite] |
           'finally' ':' suite))with_stmt: 'with' with_item (',' with_item)*  ':' suite
with_item: test ['as' expr]# NB compile.c makes sure that the default except clause is lastexcept_clause: 'except' [test ['as' NAME]]suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

test: or_test ['if' or_test 'else' test] | lambdef
test_nocond: or_test | lambdef_nocond
lambdef: 'lambda' [varargslist] ':' testlambdef_nocond: 'lambda' [varargslist] ':' test_nocond
or_test: and_test ('or' and_test)*
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)*# <> isn't actually a valid comparison operator in Python. It's here for the# sake of a __future__ import described in PEP 401 (which really works :-)comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'star_expr: '*' expr
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<'|'>>') arith_expr)*
arith_expr: term (('+'|'-') term)*
term: factor (('*'|'@'|'/'|'%'|'//') factor)*
factor: ('+'|'-'|'~') factor | power
power: atom_expr ['**' factor]atom_expr: [AWAIT] atom trailer*
atom: ('(' [yield_expr|testlist_comp] ')' |
       '[' [testlist_comp] ']' |
       '{' [dictorsetmaker] '}' |
       NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] )trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
subscriptlist: subscript (',' subscript)* [',']subscript: test | [test] ':' [test] [sliceop]sliceop: ':' [test]exprlist: (expr|star_expr) (',' (expr|star_expr))* [',']testlist: test (',' test)* [',']dictorsetmaker: ( ((test ':' test | '**' expr)
                   (comp_for | (',' (test ':' test | '**' expr))* [','])) |
                  ((test | star_expr)
                   (comp_for | (',' (test | star_expr))* [','])) )classdef: 'class' NAME ['(' [arglist] ')'] ':' suite

arglist: argument (',' argument)*  [',']# The reason that keywords are test nodes instead of NAME is that using NAME# results in an ambiguity. ast.c makes sure it's a NAME.# "test '=' test" is really "keyword '=' test", but we have no such token.# These need to be in a single rule to avoid grammar that is ambiguous# to our LL(1) parser. Even though 'test' includes '*expr' in star_expr,# we explicitly match '*' here, too, to give it proper precedence.# Illegal combinations and orderings are blocked in ast.c:# multiple (test comp_for) arguments are blocked; keyword unpackings# that precede iterable unpackings are blocked; etc.argument: ( test [comp_for] |
            test '=' test |
            '**' test |
            '*' test )comp_iter: comp_for | comp_if
comp_for: [ASYNC] 'for' exprlist 'in' or_test [comp_iter]comp_if: 'if' test_nocond [comp_iter]# not used in grammar, but may appear in "node" passed from Parser to Compilerencoding_decl: NAME

yield_expr: 'yield' [yield_arg]yield_arg: 'from' test | testlist

各种token值的定义都在Include/graminit.h（https://github.com/python/cpython/blob/master/Include/graminit.h）。组成语法解析树的node * structs定义在Include/node.h（https://github.com/python/cpython/blob/master/Include/node.h）。

从node结构体里查询是由一组定义在 Include/token.h 的一组宏实现的：

CHILD(node *, int)

Returns the nth child of the node using zero-offset indexing

RCHILD(node *, int)

Returns the nth child of the node from the right side; use negative numbers!

NCH(node *)

Number of children the node has

STR(node *)

String representation of the node; e.g., will return : for a COLON token

TYPE(node *)

The type of node as specified in Include/graminit.h

REQ(node *, TYPE)

Assert that the node is the type that is expected

LINENO(node *)

retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c

我们可以通过while语法的定义了解到上面所有的示例：

while_stmt: 'while' test ':' suite ['else' ':' suite]

这段定义表示这个节点将会有如下等式成立：

TYPE(node) == while_stmt；

子节点的编号将会是4或者7，取决于'while'后面是否有一个'else'声明。通过下面的代码我们可以确定后面的第一个 ':' 是否存在以及代表什么。

(REQ(CHILD(node, 2), COLON)

抽象语法树 (AST)

抽象语法树（AST）是程序结构的一种高抽象层次的表达，有了它我们并再不需要源代码的存在了。它可以认为是源代码等价的一种抽象表达。CPython采用Zephyr抽象语法定义语言（ASDL）（http://www.cs.princeton.edu/research/techreps/TR-554-97）来描述AST。

Python的AST定义可以在源代码的Parser/Python.asdl（https://github.com/python/cpython/blob/master/Parser/Python.asdl）找到:

-- ASDL's 7 builtin types are:-- identifier, int, string, bytes, object, singleton, constant---- singleton: None, True or False-- constant can be None, whereas None means "no value" for object.module Python{
    mod = Module(stmt* body, string? docstring)
        | Interactive(stmt* body)
        | Expression(expr body)

        -- not really an actual node but useful in Jython's typesystem.
        | Suite(stmt* body)

    stmt = FunctionDef(identifier name, arguments args,
                       stmt* body, expr* decorator_list, expr? returns,
                       string? docstring)
          | AsyncFunctionDef(identifier name, arguments args,
                             stmt* body, expr* decorator_list, expr? returns,
                             string? docstring)

          | ClassDef(identifier name,
             expr* bases,
             keyword* keywords,
             stmt* body,
             expr* decorator_list,
             string? docstring)
          | Return(expr? value)

          | Delete(expr* targets)
          | Assign(expr* targets, expr value)
          | AugAssign(expr target, operator op, expr value)
          -- 'simple' indicates that we annotate simple name without parens
          | AnnAssign(expr target, expr annotation, expr? value, int simple)

          -- use 'orelse' because else is a keyword in target languages
          | For(expr target, expr iter, stmt* body, stmt* orelse)
          | AsyncFor(expr target, expr iter, stmt* body, stmt* orelse)
          | While(expr test, stmt* body, stmt* orelse)
          | If(expr test, stmt* body, stmt* orelse)
          | With(withitem* items, stmt* body)
          | AsyncWith(withitem* items, stmt* body)

          | Raise(expr? exc, expr? cause)
          | Try(stmt* body, excepthandler* handlers, stmt* orelse, stmt* finalbody)
          | Assert(expr test, expr? msg)

          | Import(alias* names)
          | ImportFrom(identifier? module, alias* names, int? level)

          | Global(identifier* names)
          | Nonlocal(identifier* names)
          | Expr(expr value)
          | Pass | Break | Continue

          -- XXX Jython will be different
          -- col_offset is the byte offset in the utf8 string the parser uses
          attributes (int lineno, int col_offset)

          -- BoolOp() can use left & right?
    expr = BoolOp(boolop op, expr* values)
         | BinOp(expr left, operator op, expr right)
         | UnaryOp(unaryop op, expr operand)
         | Lambda(arguments args, expr body)
         | IfExp(expr test, expr body, expr orelse)
         | Dict(expr* keys, expr* values)
         | Set(expr* elts)
         | ListComp(expr elt, comprehension* generators)
         | SetComp(expr elt, comprehension* generators)
         | DictComp(expr key, expr value, comprehension* generators)
         | GeneratorExp(expr elt, comprehension* generators)
         -- the grammar constrains where yield expressions can occur
         | Await(expr value)
         | Yield(expr? value)
         | YieldFrom(expr value)
         -- need sequences for compare to distinguish between
         -- x < 4 < 3 and (x < 4) < 3
         | Compare(expr left, cmpop* ops, expr* comparators)
         | Call(expr func, expr* args, keyword* keywords)
         | Num(object n) -- a number as a PyObject.
         | Str(string s) -- need to specify raw, unicode, etc?
         | FormattedValue(expr value, int? conversion, expr? format_spec)
         | JoinedStr(expr* values)
         | Bytes(bytes s)
         | NameConstant(singleton value)
         | Ellipsis
         | Constant(constant value)

         -- the following expression can appear in assignment context
         | Attribute(expr value, identifier attr, expr_context ctx)
         | Subscript(expr value, slice slice, expr_context ctx)
         | Starred(expr value, expr_context ctx)
         | Name(identifier id, expr_context ctx)
         | List(expr* elts, expr_context ctx)
         | Tuple(expr* elts, expr_context ctx)

          -- col_offset is the byte offset in the utf8 string the parser uses
          attributes (int lineno, int col_offset)

    expr_context = Load | Store | Del | AugLoad | AugStore | Param

    slice = Slice(expr? lower, expr? upper, expr? step)
          | ExtSlice(slice* dims)
          | Index(expr value)

    boolop = And | Or

    operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift
                 | RShift | BitOr | BitXor | BitAnd | FloorDiv

    unaryop = Invert | Not | UAdd | USub

    cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn

    comprehension = (expr target, expr iter, expr* ifs, int is_async)

    excepthandler = ExceptHandler(expr? type, identifier? name, stmt* body)
                    attributes (int lineno, int col_offset)

    arguments = (arg* args, arg? vararg, arg* kwonlyargs, expr* kw_defaults,
                 arg? kwarg, expr* defaults)

    arg = (identifier arg, expr? annotation)
           attributes (int lineno, int col_offset)

    -- keyword arguments supplied to call (NULL identifier for **kwargs)
    keyword = (identifier? arg, expr value)

    -- import name with optional 'as' alias.
    alias = (identifier name, identifier? asname)

    withitem = (expr context_expr, expr? optional_vars)}

每种AST节点（声明，表达式，一些特殊类型例如：列表推导式、异常处理handler）都被ASDL描述定义。大多数的AST定义都对应一种特定的源码结构，例如一个'if'定义或者一个属性查找。这些定义都是和具体实现Python的语言无关的，也就是说无论CPython、PyPy、Jyphon甚至IronPython，都是用的同样的ASDL。

下面的Python ASDL结构展示了Python的函数定义相关语法结构和实现：

module Python{
      stmt = FunctionDef(identifier name, arguments args, stmt* body,
                          expr* decorators)
            | Return(expr? value) | Yield(expr value)
            attributes (int lineno)}

上面的示例描述了3种不同的语法结构：def函数定义、return声明、yield声明。这三种语法结构的语法声明被用 '|' 分割。它们接受的参数类型和个数各不相同。

每个参数定义前面还带了不同的修饰符（这点和正则很像）来说明需要多少个参数：

'?'说明这个参数是可有可无的（0~1个）；
'*'说明这个参数的数量是0或者更多（≥ 0）；
没有修饰符表示这个参数是必须且只能有一个的。

例如，函数定义（def）接受一个 identifier 作为函数名，'arguments'作为参数，0或者多个stmt作为函数体，0或者多个expr作为装饰器。

请注意这里的'arguments'，不像大家可能会认为的和stmt一样，它是作为一个node类型会由一个单独的AST定义所表示。

所有的这三种语法结构类型都有一个 'attributes' 参数，所以 'attributes' 前面没有 '|' 。

上面的ASDL会被翻译生成如下的C语言结构体类型：

typedef struct _stmt *stmt_ty;struct _stmt {
      enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
      union {
              struct {
                      identifier name;
                      arguments_ty args;
                      asdl_seq *body;
              } FunctionDef;

              struct {
                      expr_ty value;
              } Return;

              struct {
                      expr_ty value;
              } Yield;
      } v;
      int lineno;
 }

一同被生成的还有一些构造函数，它们会给这些语法结构分配一个stmt_ty结构体，并附带一些必要的初始化。'kind' enum字段表明下面声明的是哪种union。FunctionDef() 的构造函数将会吧'kind'设置成 FunctionDef_kind 并且初始化 'name', 'args', 'body', 和 'attributes' 字段。

未完待续

浅析Python解释器的设计（一）
一些铺垫（扯淡）历史上，在Python 2.4以及之前的版本，py代码的执行，也就是从源码到bytecode分为...
浅析Python解释器的设计（一）
一些铺垫（扯淡）历史上，在Python 2.4以及之前的版本，py代码的执行，也就是从源码到bytecode分为...
浅析Python解释器的设计（二）
从现代编译器的角度看，解释器和编译器的边界已经相当的模糊。我们后面的讨论说到的编译器就是Python的解释器，没有...
02-Python解释器
目标解释器的作用下载Python解释器安装Python解释器一. 解释器的作用 Python解释器作用：运...
Python基础
Python基础 Python是一门多范式编程语言。 Python的执行解释器有C语言解释器，JAVA解释器，等...
2. 使用 Python 解释器
2. 使用 Python 解释器 2.1. 调用 Python 解释器 Python 解释器通常被安装在目标机器的...
运行Python程序的三种方式
1、python和python3解释器 1) python解释器 2) python3解释器 2、交互式运行 1)...
Python基础手册3——Python解释器
一、Python解释器我们编写的Python代码都要放在Python解释器上运行，解释器是代码与计算机硬件之间的...
[python_doc]python_3.7_chapter.2
第二章使用python解释器 2.1调用python解释器 python解释器一般被存放在 /usr/local...
python 基础 - 简介
python 简介前言一种面向对象的解释型计算机程序设计语言解释类：解释器通过解释器来翻译，类似于同声传译人...

浅析Python解释器的设计（一）

一些铺垫（扯淡）

语法分析树

抽象语法树 (AST)

相关文章