BYR Achieve · 镜像论坛

# 关于python字符串`in`的时间复杂度室友说他一直很困惑，搞不清楚如下代码的时间复杂度 ```python >>> 'aa' in 'qweaaqwe' True ``` 这种用法还真的没有使用过，只能查看源代码了。现在python放在了github上，不用自己下载源代码了，大大方便我们查看。我们知道，python的`str`对象是本质是`PyUnicode`，首先在[unicodeobject.c](https://github.com/python/cpython/blob/master/Objects/unicodeobject.c)中，查看`PyUnicode_Type`的实现。 ```c PyTypeObject PyUnicode_Type = { PyVarObject_HEAD_INIT(&PyType_Type, 0) "str", /* tp_name */ sizeof(PyUnicodeObject), /* tp_size */ 0, /* tp_itemsize */ /* Slots */ (destructor)unicode_dealloc, /* tp_dealloc */ 0, /* tp_print */ 0, /* tp_getattr */ 0, /* tp_setattr */ 0, /* tp_reserved */ unicode_repr, /* tp_repr */ &unicode_as_number, /* tp_as_number */ &unicode_as_sequence, /* tp_as_sequence */ &unicode_as_mapping, /* tp_as_mapping */ (hashfunc) unicode_hash, /* tp_hash*/ ...... }; ``` 而`in`最终则是tp_as_sequence，序列协议中的一部分。 ```c static PySequenceMethods unicode_as_sequence = { (lenfunc) unicode_length, /* sq_length */ PyUnicode_Concat, /* sq_concat */ (ssizeargfunc) unicode_repeat, /* sq_repeat */ (ssizeargfunc) unicode_getitem, /* sq_item */ 0, /* sq_slice */ 0, /* sq_ass_item */ 0, /* sq_ass_slice */ PyUnicode_Contains, /* sq_contains */ }; ``` 正如普通函数的magic方法__contains__是in的实现一样，`unicode_as_sequence`中的PyUnicode_Contains函数正是str中实现的本尊。看到了 ```c int PyUnicode_Contains(PyObject *str, PyObject *substr) { int kind1, kind2; void *buf1, *buf2; Py_ssize_t len1, len2; int result; if (!PyUnicode_Check(substr)) { PyErr_Format(PyExc_TypeError, "'in <string>' requires string as left operand, not %.100s", Py_TYPE(substr)->tp_name); return -1; } ...... ``` 尝试一下 ``` >>> 1 in 'qwe' Traceback (most recent call last): File "<pyshell#2>", line 1, in <module> 1 in 'qwe' TypeError: 'in <string>' requires string as left operand, not int ``` 果然报错。到这里： ``` ...... len2 = PyUnicode_GET_LENGTH(substr); ...... if (len2 == 1) { Py_UCS4 ch = PyUnicode_READ(kind2, buf2, 0); result = findchar((const char *)buf1, kind1, len1, ch, 1) != -1; return result; } ``` 如果是in前面的substr长度为一，就调用findchar。之后 ```c kind1 = PyUnicode_KIND(str); switch (kind1) { case PyUnicode_1BYTE_KIND: result = ucs1lib_find(buf1, len1, buf2, len2, 0) != -1; break; case PyUnicode_2BYTE_KIND: result = ucs2lib_find(buf1, len1, buf2, len2, 0) != -1; break; case PyUnicode_4BYTE_KIND: result = ucs4lib_find(buf1, len1, buf2, len2, 0) != -1; break; default: result = -1; assert(0); } ``` 归根结底是调用result = ucs?lib_find(buf1, len1, buf2, len2, 0) != -1; 而看这个文件的开头有 ```c /* Compilation of templated routines */ #include "stringlib/asciilib.h" #include "stringlib/fastsearch.h" #include "stringlib/partition.h" #include "stringlib/split.h" #include "stringlib/count.h" #include "stringlib/find.h" #include "stringlib/find_max_char.h" #include "stringlib/localeutil.h" #include "stringlib/undef.h" #include "stringlib/ucs1lib.h" #include "stringlib/fastsearch.h" #include "stringlib/partition.h" #include "stringlib/split.h" #include "stringlib/count.h" #include "stringlib/find.h" #include "stringlib/replace.h" #include "stringlib/find_max_char.h" #include "stringlib/localeutil.h" #include "stringlib/undef.h" #include "stringlib/ucs2lib.h" #include "stringlib/fastsearch.h" #include "stringlib/partition.h" #include "stringlib/split.h" #include "stringlib/count.h" #include "stringlib/find.h" #include "stringlib/replace.h" #include "stringlib/find_max_char.h" #include "stringlib/localeutil.h" #include "stringlib/undef.h" #include "stringlib/ucs4lib.h" #include "stringlib/fastsearch.h" #include "stringlib/partition.h" #include "stringlib/split.h" #include "stringlib/count.h" #include "stringlib/find.h" #include "stringlib/replace.h" #include "stringlib/find_max_char.h" #include "stringlib/localeutil.h" #include "stringlib/undef.h" #include "stringlib/unicodedefs.h" #include "stringlib/fastsearch.h" #include "stringlib/count.h" #include "stringlib/find.h" #include "stringlib/undef.h" ``` 通过一堆宏定义，实现c++中的模版，这里直接跳过最终调用[`FASTSEARCH`](https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h) ```c } else { /* FAST_RSEARCH */ /* create compressed boyer-moore delta 1 table */ /* process pattern[0] outside the loop */ STRINGLIB_BLOOM_ADD(mask, p[0]); /* process pattern[:0:-1] */ for (i = mlast; i > 0; i--) { STRINGLIB_BLOOM_ADD(mask, p[i]); if (p[i] == p[0]) skip = i - 1; } for (i = w; i >= 0; i--) { if (s[i] == p[0]) { /* candidate match */ for (j = mlast; j > 0; j--) if (s[i+j] != p[j]) break; if (j == 0) /* got a match! */ return i; /* miss: check if previous character is part of pattern */ if (i > 0 && !STRINGLIB_BLOOM(mask, s[i-1])) i = i - m; else i = i - skip; } else { /* skip: check if previous character is part of pattern */ if (i > 0 && !STRINGLIB_BLOOM(mask, s[i-1])) i = i - m; } } } if (mode != FAST_COUNT) return -1; return count; } ``` `FASTSEARCH`中的实现，看注释是使用`boyer-moore`算法。阮一峰有一篇很好的[文章](http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html) ### 总结 CPython的字符串是使用c实现，而写python的那帮人肯定要把算法实现优化。像这种项目，弄一个O(n^2)的实现上来还能不被嘲讽？ __by 长恨安歌 2017年6月27日__

【心得】关于python字符串`in`的时间复杂度