
A Brief Look at the Design of the Python Interpreter (Part 1)

Author: 51reboot | Published 2018-02-09 10:32

    Some background (and a bit of rambling)

    Historically, in Python 2.4 and earlier, running .py code, that is, going from source code to bytecode, took two steps:

    1. Parse the source code into a parse tree (Parser/pgen.c)

    2. Emit bytecode based on the parse tree (Python/compile.c)

    Given the historical constraints Guido van Rossum was working under, Python was simply implemented that way, and it ran well enough. Later, as compiler technology matured (not least with the rise of the LLVM crowd), the limitations of this scheme became widely recognized; we will come back to them later.

    Done "properly", a standard, modern interpreter (compiler) goes through the following stages:

    1. Parse the source code into a parse tree (Parser/pgen.c)

    2. Transform the parse tree into an Abstract Syntax Tree (AST) (Python/ast.c)

    3. Transform the AST into a Control Flow Graph (CFG) (Python/compile.c)

    4. Emit bytecode based on the Control Flow Graph (Python/compile.c)

    Starting with Python 2.5, CPython mended its ways and joined the ranks of modern compilers. In short, the original second step was split into three. Below we briefly walk through those last three stages:

    We will not go into how parsing itself works beyond what is needed to understand the compilation process. If you want the details of Python's parser, you are better off reading the CPython source directly.
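
    To get a feel for these stages from Python itself, the standard-library ast and dis modules expose the AST and the final bytecode (the parse tree and the CFG stay internal to CPython). The following is a minimal sketch; the sample source and the printed results are illustrative, and exact node and opcode names vary by version:

    # Observing source -> AST -> bytecode using only the standard library.
    import ast
    import dis

    source = "x = 1 + 2\n"

    # Stage 2: source -> AST (the parse tree from stage 1 is not exposed here)
    tree = ast.parse(source, mode="exec")
    print(ast.dump(tree))        # Module(body=[Assign(...)], ...)

    # Stages 3 and 4: AST -> CFG -> bytecode, all hidden inside compile()
    code = compile(tree, "<example>", "exec")
    dis.dis(code)                # e.g. LOAD_CONST 3 (1 + 2 already folded), STORE_NAME x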

    The parse tree

    Python's parser is a fairly standard implementation of the LL(1) parser described in the "Dragon Book". Python's grammar is defined in Grammar/Grammar (https://github.com/python/cpython/blob/master/Grammar/Grammar) in the CPython source:

    # Grammar for Python

    # NOTE WELL: You should also follow all the steps listed at
    # https://docs.python.org/devguide/grammar.html

    # Start symbols for the grammar:
    #       single_input is a single interactive statement;
    #       file_input is a module or sequence of commands read from an input file;
    #       eval_input is the input for the eval() functions.
    # NB: compound_stmt in single_input is followed by extra NEWLINE!
    single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
    file_input: (NEWLINE | stmt)* ENDMARKER
    eval_input: testlist NEWLINE* ENDMARKER

    decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
    decorators: decorator+
    decorated: decorators (classdef | funcdef | async_funcdef)

    async_funcdef: ASYNC funcdef
    funcdef: 'def' NAME parameters ['->' test] ':' suite

    parameters: '(' [typedargslist] ')'
    typedargslist: (tfpdef ['=' test] (',' tfpdef ['=' test])* [',' [
            '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]]
          | '**' tfpdef [',']]]
      | '*' [tfpdef] (',' tfpdef ['=' test])* [',' ['**' tfpdef [',']]]
      | '**' tfpdef [','])
    tfpdef: NAME [':' test]
    varargslist: (vfpdef ['=' test] (',' vfpdef ['=' test])* [',' [
            '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]]
          | '**' vfpdef [',']]]
      | '*' [vfpdef] (',' vfpdef ['=' test])* [',' ['**' vfpdef [',']]]
      | '**' vfpdef [','])
    vfpdef: NAME

    stmt: simple_stmt | compound_stmt
    simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
    small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
                 import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
    expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) |
                         ('=' (yield_expr|testlist_star_expr))*)
    annassign: ':' test ['=' test]
    testlist_star_expr: (test|star_expr) (',' (test|star_expr))* [',']
    augassign: ('+=' | '-=' | '*=' | '@=' | '/=' | '%=' | '&=' | '|=' | '^=' |
                '<<=' | '>>=' | '**=' | '//=')
    # For normal and annotated assignments, additional restrictions enforced by the interpreter
    del_stmt: 'del' exprlist
    pass_stmt: 'pass'
    flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
    break_stmt: 'break'
    continue_stmt: 'continue'
    return_stmt: 'return' [testlist]
    yield_stmt: yield_expr
    raise_stmt: 'raise' [test ['from' test]]
    import_stmt: import_name | import_from
    import_name: 'import' dotted_as_names
    # note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS
    import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+)
                  'import' ('*' | '(' import_as_names ')' | import_as_names))
    import_as_name: NAME ['as' NAME]
    dotted_as_name: dotted_name ['as' NAME]
    import_as_names: import_as_name (',' import_as_name)* [',']
    dotted_as_names: dotted_as_name (',' dotted_as_name)*
    dotted_name: NAME ('.' NAME)*
    global_stmt: 'global' NAME (',' NAME)*
    nonlocal_stmt: 'nonlocal' NAME (',' NAME)*
    assert_stmt: 'assert' test [',' test]

    compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
    async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
    if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
    while_stmt: 'while' test ':' suite ['else' ':' suite]
    for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
    try_stmt: ('try' ':' suite
               ((except_clause ':' suite)+
                ['else' ':' suite]
                ['finally' ':' suite] |
               'finally' ':' suite))
    with_stmt: 'with' with_item (',' with_item)*  ':' suite
    with_item: test ['as' expr]
    # NB compile.c makes sure that the default except clause is last
    except_clause: 'except' [test ['as' NAME]]
    suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

    test: or_test ['if' or_test 'else' test] | lambdef
    test_nocond: or_test | lambdef_nocond
    lambdef: 'lambda' [varargslist] ':' test
    lambdef_nocond: 'lambda' [varargslist] ':' test_nocond
    or_test: and_test ('or' and_test)*
    and_test: not_test ('and' not_test)*
    not_test: 'not' not_test | comparison
    comparison: expr (comp_op expr)*
    # <> isn't actually a valid comparison operator in Python. It's here for the
    # sake of a __future__ import described in PEP 401 (which really works :-)
    comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
    star_expr: '*' expr
    expr: xor_expr ('|' xor_expr)*
    xor_expr: and_expr ('^' and_expr)*
    and_expr: shift_expr ('&' shift_expr)*
    shift_expr: arith_expr (('<<'|'>>') arith_expr)*
    arith_expr: term (('+'|'-') term)*
    term: factor (('*'|'@'|'/'|'%'|'//') factor)*
    factor: ('+'|'-'|'~') factor | power
    power: atom_expr ['**' factor]
    atom_expr: [AWAIT] atom trailer*
    atom: ('(' [yield_expr|testlist_comp] ')' |
           '[' [testlist_comp] ']' |
           '{' [dictorsetmaker] '}' |
           NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
    testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] )
    trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
    subscriptlist: subscript (',' subscript)* [',']
    subscript: test | [test] ':' [test] [sliceop]
    sliceop: ':' [test]
    exprlist: (expr|star_expr) (',' (expr|star_expr))* [',']
    testlist: test (',' test)* [',']
    dictorsetmaker: ( ((test ':' test | '**' expr)
                       (comp_for | (',' (test ':' test | '**' expr))* [','])) |
                      ((test | star_expr)
                       (comp_for | (',' (test | star_expr))* [','])) )

    classdef: 'class' NAME ['(' [arglist] ')'] ':' suite

    arglist: argument (',' argument)*  [',']

    # The reason that keywords are test nodes instead of NAME is that using NAME
    # results in an ambiguity. ast.c makes sure it's a NAME.
    # "test '=' test" is really "keyword '=' test", but we have no such token.
    # These need to be in a single rule to avoid grammar that is ambiguous
    # to our LL(1) parser. Even though 'test' includes '*expr' in star_expr,
    # we explicitly match '*' here, too, to give it proper precedence.
    # Illegal combinations and orderings are blocked in ast.c:
    # multiple (test comp_for) arguments are blocked; keyword unpackings
    # that precede iterable unpackings are blocked; etc.
    argument: ( test [comp_for] |
                test '=' test |
                '**' test |
                '*' test )

    comp_iter: comp_for | comp_if
    comp_for: [ASYNC] 'for' exprlist 'in' or_test [comp_iter]
    comp_if: 'if' test_nocond [comp_iter]

    # not used in grammar, but may appear in "node" passed from Parser to Compiler
    encoding_decl: NAME

    yield_expr: 'yield' [yield_arg]
    yield_arg: 'from' test | testlist
    

    The numeric values for the grammar's node types are defined in Include/graminit.h (https://github.com/python/cpython/blob/master/Include/graminit.h). The node * structs that the parse tree is built from are defined in Include/node.h (https://github.com/python/cpython/blob/master/Include/node.h).

    Querying data from the node structs is done through a set of macros defined in Include/node.h:

    • CHILD(node *, int)

    Returns the nth child of the node using zero-offset indexing

    • RCHILD(node *, int)

    Returns the nth child of the node from the right side; use negative numbers!

    • NCH(node *)

    Number of children the node has

    • STR(node *)

    String representation of the node; e.g., will return : for a COLON token

    • TYPE(node *)

    The type of node as specified in Include/graminit.h

    • REQ(node *, TYPE)

    Assert that the node is the type that is expected

    • LINENO(node *)

    Retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c

    The grammar rule for the while statement shows how all of the above fits together:

    while_stmt: 'while' test ':' suite ['else' ':' suite]

    This rule tells us that for such a node the following holds:

    TYPE(node) == while_stmt;

    The node will have either 4 or 7 children, depending on whether the 'while' is followed by an 'else' clause. The check below verifies that the node's third child really is the first ':' token:

    REQ(CHILD(node, 2), COLON)
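
    In CPython versions that still ship them (the parser and symbol modules were removed in 3.10), the same parse tree can also be inspected from Python. The following is a rough sketch, not the C API; symbol and token are used to translate the numeric node types that graminit.h defines on the C side:

    import parser    # deprecated, removed in Python 3.10
    import symbol
    import token

    st = parser.suite("while x:\n    pass\n")
    tree = parser.st2tuple(st)   # nested tuples: (numeric type, children...)

    def name_of(num):
        # Nonterminal numbers live in symbol, terminal numbers in token
        return symbol.sym_name.get(num) or token.tok_name.get(num, num)

    def walk(node, depth=0):
        print("  " * depth + str(name_of(node[0])))
        for child in node[1:]:
            if isinstance(child, tuple):
                walk(child, depth + 1)

    walk(tree)   # the output contains a while_stmt node with 4 children (no 'else')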

    The Abstract Syntax Tree (AST)

    An abstract syntax tree (AST) is a high-level representation of a program's structure; once you have it, you no longer need the source code. It can be regarded as an abstract representation equivalent to the source. CPython uses the Zephyr Abstract Syntax Definition Language (ASDL) (http://www.cs.princeton.edu/research/techreps/TR-554-97) to describe its AST.

    Python's AST definition can be found in Parser/Python.asdl (https://github.com/python/cpython/blob/master/Parser/Python.asdl) in the source tree:

    -- ASDL's 7 builtin types are:
    -- identifier, int, string, bytes, object, singleton, constant
    --
    -- singleton: None, True or False
    -- constant can be None, whereas None means "no value" for object.

    module Python
    {
        mod = Module(stmt* body, string? docstring)
            | Interactive(stmt* body)
            | Expression(expr body)
    
            -- not really an actual node but useful in Jython's typesystem.
            | Suite(stmt* body)
    
        stmt = FunctionDef(identifier name, arguments args,
                           stmt* body, expr* decorator_list, expr? returns,
                           string? docstring)
              | AsyncFunctionDef(identifier name, arguments args,
                                 stmt* body, expr* decorator_list, expr? returns,
                                 string? docstring)
    
              | ClassDef(identifier name,
                 expr* bases,
                 keyword* keywords,
                 stmt* body,
                 expr* decorator_list,
                 string? docstring)
              | Return(expr? value)
    
              | Delete(expr* targets)
              | Assign(expr* targets, expr value)
              | AugAssign(expr target, operator op, expr value)
              -- 'simple' indicates that we annotate simple name without parens
              | AnnAssign(expr target, expr annotation, expr? value, int simple)
    
              -- use 'orelse' because else is a keyword in target languages
              | For(expr target, expr iter, stmt* body, stmt* orelse)
              | AsyncFor(expr target, expr iter, stmt* body, stmt* orelse)
              | While(expr test, stmt* body, stmt* orelse)
              | If(expr test, stmt* body, stmt* orelse)
              | With(withitem* items, stmt* body)
              | AsyncWith(withitem* items, stmt* body)
    
              | Raise(expr? exc, expr? cause)
              | Try(stmt* body, excepthandler* handlers, stmt* orelse, stmt* finalbody)
              | Assert(expr test, expr? msg)
    
              | Import(alias* names)
              | ImportFrom(identifier? module, alias* names, int? level)
    
              | Global(identifier* names)
              | Nonlocal(identifier* names)
              | Expr(expr value)
              | Pass | Break | Continue
    
              -- XXX Jython will be different
              -- col_offset is the byte offset in the utf8 string the parser uses
              attributes (int lineno, int col_offset)
    
              -- BoolOp() can use left & right?
        expr = BoolOp(boolop op, expr* values)
             | BinOp(expr left, operator op, expr right)
             | UnaryOp(unaryop op, expr operand)
             | Lambda(arguments args, expr body)
             | IfExp(expr test, expr body, expr orelse)
             | Dict(expr* keys, expr* values)
             | Set(expr* elts)
             | ListComp(expr elt, comprehension* generators)
             | SetComp(expr elt, comprehension* generators)
             | DictComp(expr key, expr value, comprehension* generators)
             | GeneratorExp(expr elt, comprehension* generators)
             -- the grammar constrains where yield expressions can occur
             | Await(expr value)
             | Yield(expr? value)
             | YieldFrom(expr value)
             -- need sequences for compare to distinguish between
             -- x < 4 < 3 and (x < 4) < 3
             | Compare(expr left, cmpop* ops, expr* comparators)
             | Call(expr func, expr* args, keyword* keywords)
             | Num(object n) -- a number as a PyObject.
             | Str(string s) -- need to specify raw, unicode, etc?
             | FormattedValue(expr value, int? conversion, expr? format_spec)
             | JoinedStr(expr* values)
             | Bytes(bytes s)
             | NameConstant(singleton value)
             | Ellipsis
             | Constant(constant value)
    
             -- the following expression can appear in assignment context
             | Attribute(expr value, identifier attr, expr_context ctx)
             | Subscript(expr value, slice slice, expr_context ctx)
             | Starred(expr value, expr_context ctx)
             | Name(identifier id, expr_context ctx)
             | List(expr* elts, expr_context ctx)
             | Tuple(expr* elts, expr_context ctx)
    
              -- col_offset is the byte offset in the utf8 string the parser uses
              attributes (int lineno, int col_offset)
    
        expr_context = Load | Store | Del | AugLoad | AugStore | Param
    
        slice = Slice(expr? lower, expr? upper, expr? step)
              | ExtSlice(slice* dims)
              | Index(expr value)
    
        boolop = And | Or
    
        operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift
                     | RShift | BitOr | BitXor | BitAnd | FloorDiv
    
        unaryop = Invert | Not | UAdd | USub
    
        cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn
    
        comprehension = (expr target, expr iter, expr* ifs, int is_async)
    
        excepthandler = ExceptHandler(expr? type, identifier? name, stmt* body)
                        attributes (int lineno, int col_offset)
    
        arguments = (arg* args, arg? vararg, arg* kwonlyargs, expr* kw_defaults,
                     arg? kwarg, expr* defaults)
    
        arg = (identifier arg, expr? annotation)
               attributes (int lineno, int col_offset)
    
        -- keyword arguments supplied to call (NULL identifier for **kwargs)
        keyword = (identifier? arg, expr value)
    
        -- import name with optional 'as' alias.
        alias = (identifier name, identifier? asname)
    
        withitem = (expr context_expr, expr? optional_vars)
    }
    

    Every kind of AST node (statements, expressions, and a few specialized types such as comprehensions and exception handlers) is defined by the ASDL. Most AST definitions correspond to a particular source construct, such as an 'if' statement or an attribute lookup. The definitions are independent of the language used to implement Python; CPython, PyPy, Jython, and even IronPython can all work from the same ASDL.
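
    As a quick illustration of how these ASDL definitions map onto real nodes, the ast module exposes the same structure from Python (field spellings can differ slightly between versions; the dump below is abridged):

    import ast

    print(ast.dump(ast.parse("if x:\n    y.z\n")))
    # Module(body=[If(test=Name(id='x', ctx=Load()),
    #                 body=[Expr(value=Attribute(value=Name(id='y', ctx=Load()),
    #                                            attr='z', ctx=Load()))],
    #                 orelse=[])])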

    The following ASDL fragment shows the structures involved in defining a function:

    module Python
    {
        stmt = FunctionDef(identifier name, arguments args, stmt* body,
                           expr* decorators)
             | Return(expr? value) | Yield(expr value)
             attributes (int lineno)
    }
    

    The fragment above describes three different kinds of statement: function definitions (def), return statements, and yield statements. Their definitions are separated by '|', and each one takes a different number and kind of arguments.

    Each argument is preceded by a modifier (much as in regular expressions) that says how many values it takes:

    • '?' means the argument is optional (zero or one value);

    • '*' means the argument takes zero or more values;

    • no modifier means the argument is required and takes exactly one value.

    For example, a function definition (def) takes an identifier for its name, 'arguments' for its parameters, zero or more stmt for its body, and zero or more expr for its decorators.

    Note that 'arguments', which is itself a node type, is represented by a single AST node defined elsewhere in the ASDL, not, as one might assume by analogy with stmt, by a sequence of nodes.

    All three of these statement kinds share the 'attributes' argument, which is why it is not preceded by a '|'.
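
    This description can be checked directly against the ast module. A small sketch follows; note that the current Python.asdl spells the decorator field decorator_list rather than decorators, and the exact lineno reported for decorated functions differs slightly across versions:

    import ast

    src = "@wraps\ndef f(a, b=1):\n    return a + b\n"
    fn = ast.parse(src).body[0]

    print(type(fn).__name__)          # FunctionDef
    print(fn.name, len(fn.body))      # f 1   (one stmt in the body)
    print([type(d).__name__ for d in fn.decorator_list])   # ['Name']
    print(type(fn.args).__name__)     # arguments: a single node, not a sequence
    print(fn.lineno)                  # the 'attributes (int lineno)' field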

    The ASDL above is translated into C struct types such as the following:

    typedef struct _stmt *stmt_ty;

    struct _stmt {
          enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
          union {
                  struct {
                          identifier name;
                          arguments_ty args;
                          asdl_seq *body;
                  } FunctionDef;
    
                  struct {
                          expr_ty value;
                  } Return;
    
                  struct {
                          expr_ty value;
                  } Yield;
          } v;
          int lineno;
     };
    

    Along with these types, a set of constructor functions is generated; they allocate a stmt_ty struct and perform the necessary initialization. The 'kind' enum field records which member of the union is in use. The FunctionDef() constructor, for instance, sets 'kind' to FunctionDef_kind and initializes the 'name', 'args', 'body', and 'attributes' fields.
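
    On the Python side, the ast node classes play a loosely analogous role: the class itself stands in for the 'kind' tag, and its constructor accepts the same fields. A hand-built example (illustrative only, not the C API):

    import ast

    # Build the statement `answer = 42` without ever parsing that source text
    assign = ast.Assign(
        targets=[ast.Name(id="answer", ctx=ast.Store())],
        value=ast.Constant(value=42),
    )

    module = ast.parse("")            # an empty Module with the right fields for this version
    module.body.append(assign)
    ast.fix_missing_locations(module) # fill in the lineno/col_offset attributes

    ns = {}
    exec(compile(module, "<built by hand>", "exec"), ns)
    print(ns["answer"])               # 42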


    To be continued.
