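Python Source Code Analysis: The PyStringObject Object and STR(

Author: 敬贤icode121 | Published 2019-02-02 17:31

Introduction

We all know the appeal of Python's str object: it makes the language easier for beginners to pick up. You can quickly slice, concatenate, and repeat strings, and the type itself ships with a large set of methods for transforming strings.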
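The operations mentioned in the introduction can be tried directly in the interpreter. The following is a small sketch; the sample value "Python" is made up for illustration, and the snippet is written so that it runs on Python 3 as well as on the Python 2 series whose source is discussed here.

```python
# Illustrative value only; any string behaves the same way.
s = "Python"

sliced = s[1:4]        # slicing: the characters at index 1, 2 and 3
joined = s + " str"    # concatenation builds a new string
repeated = s[:2] * 3   # repetition ("multiplying" a string)
upper = s.upper()      # one of the many built-in methods

print(sliced)
print(joined)
print(repeated)
print(upper)
```

None of these operations changes s itself; each one produces a new string object.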

Methods of the str object

static PyMethodDef
string_methods[] = {
    /* Counterparts of the obsolete stropmodule functions; except
       string.maketrans(). */
    {"join", (PyCFunction)string_join, METH_O, join__doc__},
    {"split", (PyCFunction)string_split, METH_VARARGS, split__doc__},
    {"rsplit", (PyCFunction)string_rsplit, METH_VARARGS, rsplit__doc__},
    {"lower", (PyCFunction)string_lower, METH_NOARGS, lower__doc__},
    {"upper", (PyCFunction)string_upper, METH_NOARGS, upper__doc__},
    {"islower", (PyCFunction)string_islower, METH_NOARGS, islower__doc__},
    {"isupper", (PyCFunction)string_isupper, METH_NOARGS, isupper__doc__},
    {"isspace", (PyCFunction)string_isspace, METH_NOARGS, isspace__doc__},
    {"isdigit", (PyCFunction)string_isdigit, METH_NOARGS, isdigit__doc__},
    {"istitle", (PyCFunction)string_istitle, METH_NOARGS, istitle__doc__},
    {"isalpha", (PyCFunction)string_isalpha, METH_NOARGS, isalpha__doc__},
    {"isalnum", (PyCFunction)string_isalnum, METH_NOARGS, isalnum__doc__},
    {"capitalize", (PyCFunction)string_capitalize, METH_NOARGS,
     capitalize__doc__},
    {"count", (PyCFunction)string_count, METH_VARARGS, count__doc__},
    {"endswith", (PyCFunction)string_endswith, METH_VARARGS,
     endswith__doc__},
    {"partition", (PyCFunction)string_partition, METH_O, partition__doc__},
    {"find", (PyCFunction)string_find, METH_VARARGS, find__doc__},
    {"index", (PyCFunction)string_index, METH_VARARGS, index__doc__},
    {"lstrip", (PyCFunction)string_lstrip, METH_VARARGS, lstrip__doc__},
    {"replace", (PyCFunction)string_replace, METH_VARARGS, replace__doc__},
    {"rfind", (PyCFunction)string_rfind, METH_VARARGS, rfind__doc__},
    {"rindex", (PyCFunction)string_rindex, METH_VARARGS, rindex__doc__},
    {"rstrip", (PyCFunction)string_rstrip, METH_VARARGS, rstrip__doc__},
    {"rpartition", (PyCFunction)string_rpartition, METH_O,
     rpartition__doc__},
    /* ... remaining entries omitted ... */
    {NULL,     NULL}                         /* sentinel */
};

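The names in this table largely match what dir(str) reports. A brief look at the format of one entry:
{"lower", (PyCFunction)string_lower, METH_NOARGS, lower__doc__}

"lower" is the name exposed to Python code; it can be thought of as a wrapper around string_lower.
string_lower is the C function in the source that the name lower maps to.
METH_NOARGS means the method takes no arguments, so no PyArg_ParseTuple argument parsing is performed. The other calling-convention flags are described in the official documentation; later we will write Python extensions in plain C.
lower__doc__ is the method's docstring (a distinctive Python feature: it can be read through the object's __doc__ attribute). Note that __doc__ must be read from the method itself; 'xx'.lower().__doc__ would return the docstring of the resulting str object instead.
>>> 'xx'.lower.__doc__
'S.lower() -> string\n\nReturn a copy of the string S converted to lowercase.'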
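The relationship between the method table and dir(str) can be checked from Python. This is only a sketch: the list table_names is copied by hand from the excerpt above (it is not read from the C source), and the snippet assumes an interpreter where all of these methods still exist on str, which holds for Python 2.7 and Python 3.

```python
# Names copied by hand from the string_methods[] excerpt above.
table_names = ["join", "split", "rsplit", "lower", "upper", "islower",
               "isupper", "isspace", "isdigit", "istitle", "isalpha",
               "isalnum", "capitalize", "count", "endswith", "partition",
               "find", "index", "lstrip", "replace", "rfind", "rindex",
               "rstrip", "rpartition"]

# Every name registered in the table should be visible on the type.
missing = [name for name in table_names if name not in dir(str)]
print(missing)

# The docstring lives on the method object, not on its return value.
method_doc = "xx".lower.__doc__      # docstring of the lower method
result_doc = "xx".lower().__doc__    # docstring of str, because lower() returns a str
print(method_doc == str.lower.__doc__)
print(result_doc == str.__doc__)
```

An empty missing list means each table entry shown above is reachable as an attribute of str.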

We pick two methods for source analysis:

1. swapcase
  {"swapcase", (PyCFunction)string_swapcase, METH_NOARGS, swapcase__doc__},

static PyObject *
string_swapcase(PyStringObject *self)
{
    char *s = PyString_AS_STRING(self), *s_new; // pointer to the character buffer of self
    Py_ssize_t i, n = PyString_GET_SIZE(self);
    PyObject *newobj;

    newobj = PyString_FromStringAndSize(NULL, n);
    if (newobj == NULL)
        return NULL;
    s_new = PyString_AsString(newobj); // pointer to the character buffer (ob_sval) of the new PyStringObject
    for (i = 0; i < n; i++) { // walk every character and swap its case
        int c = Py_CHARMASK(*s++);
        if (islower(c)) { // lowercase becomes uppercase
            *s_new = toupper(c);
            // #define _toupper(_Char)    ( (_Char)-'a'+'A' )
        }
        else if (isupper(c)) { // uppercase becomes lowercase
            *s_new = tolower(c);
        }
        else
            *s_new = c;
        s_new++;
    }
    return newobj;
}

So the method returns a copy of the string with the case of every letter swapped.
'xXXx'.swapcase() -> 'XxxX'
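The loop in string_swapcase can be mirrored in a few lines of Python. This is a sketch under one stated assumption: it only handles ASCII letters, whereas the C functions islower, isupper, toupper and tolower depend on the current locale. The function name swapcase_ascii is hypothetical.

```python
def swapcase_ascii(s):
    """Sketch of the string_swapcase loop, restricted to ASCII letters."""
    out = []                       # plays the role of the new buffer s_new
    for ch in s:
        if "a" <= ch <= "z":       # corresponds to islower(c)
            out.append(chr(ord(ch) - ord("a") + ord("A")))
        elif "A" <= ch <= "Z":     # corresponds to isupper(c)
            out.append(chr(ord(ch) - ord("A") + ord("a")))
        else:
            out.append(ch)         # everything else is copied unchanged
    return "".join(out)            # a new object; s itself is untouched


print(swapcase_ascii("xXXx"))
```

As in the C version, the input is never modified; the result is built in a separate buffer and returned as a new object.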

2. replace
  {"replace", (PyCFunction)string_replace, METH_VARARGS, replace__doc__}

// the __doc__ declaration, accessible through str.replace.__doc__
PyDoc_STRVAR(replace__doc__,
"S.replace(old, new[, count]) -> string\n\
\n\
Return a copy of string S with all occurrences of substring\n\
old replaced by new.  If the optional argument count is\n\
given, only the first count occurrences are replaced.");

// function body
static PyObject *
string_replace(PyStringObject *self, PyObject *args)
{
    Py_ssize_t count = -1;
    PyObject *from, *to;
    const char *from_s, *to_s;
    Py_ssize_t from_len, to_len;
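    // A series of checks and conversions (omitted here).
    // If an argument is unicode, the function returns PyUnicode_Replace; we do not go into that here.
    // Check whether the arguments are str or unicode.
    PyObject_AsCharBuffer(from, &from_s, &from_len);
    PyObject_AsCharBuffer(to, &to_s, &to_len);
    return (PyObject *)replace((PyStringObject *) self,from_s, from_len,to_s, to_len, count);
    // The final replace() picks a different routine depending on from_len and to_len, for example when from_len == 0: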
/* insert the 'to' string everywhere.   */
/*    >>> "Python".replace("", ".")     */
/*    '.P.y.t.h.o.n.'                   */
}

>>> 'AAAAX'.replace('A','x',2) -> 'xxAAX'
>>> "".replace("", "A") == "A"
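The from_len == 0 case quoted in the comment above can be modelled in Python. The sketch below is an illustration of the behaviour, not a transcription of the C routine; the function name replace_interleave and its parameters are chosen for this example.

```python
def replace_interleave(s, to, count=-1):
    """Sketch of replacing the empty string: insert `to` around the characters of s."""
    slots = len(s) + 1              # one slot before each character plus one at the end
    if count < 0 or count > slots:
        count = slots               # a negative count means "replace everywhere"
    if count == 0:
        return s                    # nothing to do
    pieces = [to]                   # the first insertion always happens
    for i in range(count - 1):
        pieces.append(s[i])         # copy one original character ...
        pieces.append(to)           # ... followed by one insertion
    pieces.append(s[count - 1:])    # copy the untouched tail
    return "".join(pieces)


print(replace_interleave("Python", "."))
print(replace_interleave("Python", ".", 2))
print(replace_interleave("", "A"))
```

The count argument limits the number of insertions, in the same way that it limits the number of replacements in the ordinary case.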

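The source is long and full of conditional branches. What it shows is that even a simple string is backed by a large number of native functions. As the saying goes, one generation plants the trees and the next enjoys the shade. If Python seems simple, that is because you have not seen the power behind that simplicity, just as you see China at peace without seeing the soldiers on the frontier who quietly defend the homeland.
With the methods of str covered, why does a str object also support sequence operations? For example: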
assert 'x' in 'xXXX'

string_as_sequence

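Back to PyString_Type.
There we find:
&string_as_sequence, /* tp_as_sequence */
This slot is why a str object behaves as flexibly as any other sequence. PyString_Type is a PyTypeObject structure (an instance of PyType_Type), and its tp_as_sequence slot points to string_as_sequence. When a PyStringObject is instantiated, the initialization macros (a long chain of #define) store &PyString_Type in the object's type pointer, and that is how every PyStringObject becomes associated with the string_as_sequence table.

Here is the string_as_sequence structure: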

static PySequenceMethods string_as_sequence = {
    (lenfunc)string_length, /*sq_length*/
    (binaryfunc)string_concat, /*sq_concat*/
    (ssizeargfunc)string_repeat, /*sq_repeat*/
    (ssizeargfunc)string_item, /*sq_item*/
    (ssizessizeargfunc)string_slice, /*sq_slice*/
    0,                  /*sq_ass_item*/
    0,                  /*sq_ass_slice*/
    (objobjproc)string_contains /*sq_contains*/
};
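The same slot table can be pictured at the Python level, where each special method plays the role of one sq_* entry. The class below is a hypothetical illustration (the name Letters and its internals are not from the CPython source); it only wraps an ordinary string.

```python
class Letters(object):
    """Hypothetical sequence type; each method corresponds to one sq_* slot."""

    def __init__(self, text):
        self._text = text

    def __len__(self):                # sq_length
        return len(self._text)

    def __add__(self, other):         # sq_concat
        return Letters(self._text + other._text)

    def __mul__(self, n):             # sq_repeat
        return Letters(self._text * n)

    def __getitem__(self, index):     # sq_item (also used for slicing here)
        return self._text[index]

    def __contains__(self, item):     # sq_contains
        return item in self._text

    # No __setitem__ is defined: like sq_ass_item = 0 in the C table,
    # item assignment is simply not supported.


seq = Letters("xXXX")
print(len(seq))
print(seq[0])
print("x" in seq)
print(len(seq * 2))
```

Leaving out __setitem__ mirrors the two zero entries in string_as_sequence: assigning to seq[0] raises TypeError, just as it does for a real str.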

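Put more plainly:

When the Python virtual machine executes the bytecode for an in operation and finds that the right-hand operand is a PyStringObject, it follows PyString_Type -> tp_as_sequence, that is &string_as_sequence, and calls its string_contains function.

To verify this, here is a short digression:

Exploring the in bytecode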

Disassembly shows that in compiles to COMPARE_OP (#define COMPARE_OP 107 /* Comparison operator */, that is decimal 107, or 0x6b).
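The disassembly can be reproduced with the standard dis module. This sketch assumes Python 3.4 or later, because it uses dis.get_instructions; on those interpreters the opcode may differ from the one described in this article, since newer CPython releases use a dedicated CONTAINS_OP instead of COMPARE_OP. The variable names a and b are placeholders.

```python
import dis

# Compile a membership test; a and b are placeholder names.
code = compile("a in b", "<example>", "eval")

# Collect the opcode names of the compiled expression.
opnames = [ins.opname for ins in dis.get_instructions(code)]
print(opnames)

# Older interpreters emit COMPARE_OP, newer ones CONTAINS_OP.
uses_membership_opcode = any(
    name in ("COMPARE_OP", "CONTAINS_OP") for name in opnames
)
print(uses_membership_opcode)

# The compiled code still evaluates the membership test as usual.
result = eval(code, {"a": "x", "b": "xXXX"})
print(result)
```

On the Python 2 series analysed here, running dis.dis on the same expression prints a COMPARE_OP instruction whose argument selects the in comparison.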

In ceval.c:

        TARGET(COMPARE_OP)
        {
            w = POP(); // pop the top of the stack: the right-hand operand of in
            v = TOP(); // the left-hand operand of in
            if (PyInt_CheckExact(w) && PyInt_CheckExact(v)) { // fast path when both operands are ints
                /* INLINE: cmp(int, int) */
                register long a, b;
                register int res;
                a = PyInt_AS_LONG(v);
                b = PyInt_AS_LONG(w);
                switch (oparg) {
                case PyCmp_LT: res = a <  b; break;
                case PyCmp_LE: res = a <= b; break;
                ....
                default: goto slow_compare; // any other operator, such as in, takes the slow path
                }
                x = res ? Py_True : Py_False;
                Py_INCREF(x);
            }
            else {
              slow_compare: 
                x = cmp_outcome(oparg, v, w);
            }

The operands taken from the value stack here are PyStringObject objects, not ints, so execution goes to cmp_outcome(oparg, v, w):

static PyObject *
cmp_outcome(int op, register PyObject *v, register PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS:    // the is operator
        res = (v == w);
        break;
    case PyCmp_IS_NOT:   // the is not operator
        res = (v != w);
        break;
    case PyCmp_IN:       // the in operator
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN:    // the not in operator
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        res = !res;
        break;
    case PyCmp_EXC_MATCH:
            ......

Next, follow PySequence_Contains(w, v)
[abstract.c]

/* Return -1 if error; 1 if ob in seq; 0 if ob not in seq.
 * Use sq_contains if possible, else defer to _PySequence_IterSearch().
 */
int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    if (PyType_HasFeature(seq->ob_type, Py_TPFLAGS_HAVE_SEQUENCE_IN)) {
        PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
        if (sqm != NULL && sqm->sq_contains != NULL)
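            return (*sqm->sq_contains)(seq, ob); // exactly as predicted
    }
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS); // fallback when sq_contains is missing
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

The call made is seq->ob_type->tp_as_sequence->sq_contains.
This confirms that the mechanism works as described.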
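The two paths named in the header comment of PySequence_Contains (use sq_contains if possible, otherwise search by iteration) are also visible from Python. The classes below are hypothetical and exist only for this illustration: one defines __contains__, the other only __iter__.

```python
class WithContains(object):
    """Has the Python-level counterpart of sq_contains."""

    def __contains__(self, item):
        return item == "x"


class IterOnly(object):
    """No __contains__, so membership falls back to iterating."""

    def __init__(self):
        self.steps = 0              # counts how many items were produced

    def __iter__(self):
        for item in ["a", "b", "x"]:
            self.steps += 1
            yield item


direct = "x" in WithContains()      # answered by __contains__

searched = IterOnly()
found = "b" in searched             # answered by walking the iterator
print(direct, found, searched.steps)

exhausted = IterOnly()
not_found = "z" in exhausted        # walks all three items, then gives up
print(not_found, exhausted.steps)
```

The step counter shows that the fallback stops at the first match and only visits every item when the value is absent.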

Back to string_contains, the function PyString_Type uses for this sequence operation:
It is a wrapper around stringlib_contains_obj, which in turn calls stringlib_find.

Py_LOCAL_INLINE(Py_ssize_t)
stringlib_find(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
               const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
               Py_ssize_t offset)
{
    Py_ssize_t pos;

    if (str_len < 0)
        return -1;
    if (sub_len == 0)
        return offset;

    pos = fastsearch(str, str_len, sub, sub_len, -1, FAST_SEARCH);  // the call that ultimately implements contains

    if (pos >= 0)
        pos += offset;

    return pos;
}

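In the end, deciding 'xxx' in 'yyy' is left to fastsearch, which is essentially a substring search in C; roughly 80% of the code along the way, though, is type checking.
If the substring is found, pos >= 0; otherwise pos is -1.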
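The return conventions of stringlib_find can be sketched in Python. A naive scan stands in for fastsearch here, which is considerably more elaborate in the real source; the names find_sketch and contains_sketch are hypothetical. The str_len < 0 check from the C code is left out because a Python string cannot have a negative length.

```python
def find_sketch(text, sub, offset=0):
    """Naive stand-in for stringlib_find; returns an index or -1."""
    if len(sub) == 0:
        return offset                   # an empty pattern matches immediately
    last_start = len(text) - len(sub)
    for pos in range(last_start + 1):   # empty range when sub is longer than text
        if text[pos:pos + len(sub)] == sub:
            return pos + offset         # found: the result is >= 0
    return -1                           # not found


def contains_sketch(text, sub):
    """Membership is just 'find did not return -1'."""
    return find_sketch(text, sub) != -1


print(find_sketch("xXXX", "XX"))
print(find_sketch("yyy", "xxx"))
print(contains_sketch("xXXX", "x"))
```

A match at the very beginning returns 0, which is why the test for success has to be pos >= 0 rather than pos > 0.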

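Summary

Have you fully mastered Python strings at the language level? After the details in this chapter, you should have a deeper understanding of the mechanisms and principles behind them.

@ 敬贤. Source code is the bricks; a high-level language is only the finished product.

Contributions welcome: 787518771@qq.com

Saturday, 2018-06-23 14:05:05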
