您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

On the implementation of Python list

編輯：Python

To know what it is, we must know what it is ,python There are really not many container objects in , Usually, we will use the corresponding containers according to our needs with great peace of mind , For variable length data list, Want to reuse set, Want to match quickly dict, For character processing str, But why can this effect be achieved ？ For example, we use list When , Know that this thing can store various formats at will , Save integer type 、 floating-point 、 character string 、 You can even nest list Other containers , The underlying principle is implemented by arrays , Or a chain list ？ Like our dictionary , Is the bottom layer an array or something else ？ If it's something else like a hash table , Then how to realize the sequence arrangement of input data ？ This time, we might as well analyze it layer by layer , Make a deduction . You can't chew too much , This time, let's start with list Analyze

sketch

The name is easily associated with other languages （C++、Java etc. ） The linked list in the standard library is confused , But in fact CPython The list of is not a list at all （ That's a little windy , It may be easier to understand in English ：python Medium list Not what we learn list）, stay CPython in , The list is implemented as A variable length array .

Look at the details ,Python The list in is made up of A contiguous array of references to other objects , The pointer to this array and its length are stored in a list header structure . It means , Every time you add or delete an element , An array made up of references requires this size （ Redistribution ）. In the process of implementation ,Python When creating these arrays, the exponential allocation method is adopted , As a result, it is not necessary to change the size of the array every time , But because of this, the average complexity of adding or removing elements is low .

The result of this method is on the common linked list “ The price is very small ” Some of the other operations in Python The computational complexity is relatively high .

utilize list.insert(i,item) Method to insert an element anywhere —— Complexity O(N)
utilize list.pop(i) or list.remove(value) Delete an element —— Complexity O(N)

The source code parsing

Let's take a look at list Source code of implementation , Source juice source flavor , Make a detailed evaluation . Let's find out first list Multiple inheritance from MutableSequence and Generic. Then we can read ,list Implementation of related embedded functions of , Such as append、pop、extend、insert And so on are actually realized through inheritance , Then we have to look for it MutableSequence and Generic The underlying implementation of these two classes , Only after solving these two classes , We can answer why list You can dynamically add data , And the complexity of deletion and insertion is not so good .

class list(MutableSequence[_T], Generic[_T]):
@overload
def __init__(self) -> None: ...
@overload
def __init__(self, iterable: Iterable[_T]) -> None: ...
if sys.version_info >= (3,):
def clear(self) -> None: ...
def copy(self) -> List[_T]: ...
def append(self, object: _T) -> None: ...
def extend(self, iterable: Iterable[_T]) -> None: ...
def pop(self, index: int = ...) -> _T: ...
def index(self, object: _T, start: int = ..., stop: int = ...) -> int: ...
def count(self, object: _T) -> int: ...
def insert(self, index: int, object: _T) -> None: ...
def remove(self, object: _T) -> None: ...
def reverse(self) -> None: ...
if sys.version_info >= (3,):
def sort(self, *, key: Optional[Callable[[_T], Any]] = ..., reverse: bool = ...) -> None: ...
else:
def sort(self, cmp: Callable[[_T, _T], Any] = ..., key: Callable[[_T], Any] = ..., reverse: bool = ...) -> None: ...
def __len__(self) -> int: ...
def __iter__(self) -> Iterator[_T]: ...
def __str__(self) -> str: ...
__hash__: None # type: ignore
@overload
def __getitem__(self, i: int) -> _T: ...
@overload
def __getitem__(self, s: slice) -> List[_T]: ...
@overload
def __setitem__(self, i: int, o: _T) -> None: ...
@overload
def __setitem__(self, s: slice, o: Iterable[_T]) -> None: ...
def __delitem__(self, i: Union[int, slice]) -> None: ...
if sys.version_info < (3,):
def __getslice__(self, start: int, stop: int) -> List[_T]: ...
def __setslice__(self, start: int, stop: int, o: Sequence[_T]) -> None: ...
def __delslice__(self, start: int, stop: int) -> None: ...
def __add__(self, x: List[_T]) -> List[_T]: ...
def __iadd__(self: _S, x: Iterable[_T]) -> _S: ...
def __mul__(self, n: int) -> List[_T]: ...
def __rmul__(self, n: int) -> List[_T]: ...
if sys.version_info >= (3,):
def __imul__(self: _S, n: int) -> _S: ...
def __contains__(self, o: object) -> bool: ...
def __reversed__(self) -> Iterator[_T]: ...
def __gt__(self, x: List[_T]) -> bool: ...
def __ge__(self, x: List[_T]) -> bool: ...
def __lt__(self, x: List[_T]) -> bool: ...
def __le__(self, x: List[_T]) -> bool: ...

MutableSequence

This class actually comes from collections.abc.MutableSequence, In fact, it is the method of variable sequence in the so-called Abstract basic class .

Python There are two kinds of sequences of , Variable sequence and immutable sequence and provide two base classes for them Sequence and MutableSequence, These two base classes exist in built-in modules collections.abc in , And other common classes such as int、list And so on , These two base classes are Abstract base class . This involves a new concept of abstract base classes , What is an abstract base class ？

For abstract base classes , There is no need to pay too much attention at present , Just know that the abstract base class refers to Cannot instantiate a class that produces an instance object , We will discuss abstract base classes later .

Sequence and MutableSequence Are two abstract base classes , Therefore, these two classes cannot instantiate to generate instance objects , Then Sequence and MutableSequence What else do the two abstract base classes do ？

In fact, the role of an abstract base class is not to instantiate an instance object , Its function is more like defining a rule , Or the official saying is agreement , So when we want to create this type of object in the future , Require compliance with such rules or agreements . Now we need to understand the protocols of sequence types , It takes learning abc Module Sequence and MutableSequence Two classes .

Sequence and MutableSequence The inheritance relationship between the two classes is as follows ：

Sequence inheritance relationship .png

Bold in the figure indicates the abstract base class , Italics indicate abstract methods , It can be understood as There is no specific implementation method , The rest are the methods implemented in the abstract base class .

You can see , The inheritance relationship is not complicated , But there's a lot of information , This diagram should be kept in mind , Because this is very important for understanding sequence types . We see , Variable sequence MutableSequence Class inherits from immutable sequence Sequence class ,Sequence Class inherits two more classes Reversible and Collection,Collection And inherit from Container、 Iterable、Sized Three abstract base classes . Through this inheritance diagram , We should at least know , For standard immutable sequence types Sequence, At least the following methods should be implemented （ Follow these agreements ）：

__contains__,__iter__,__len__,__reversed__,__getitem__,index,count

What exactly do these methods mean ？ In front of list We can peek into the implementation source code of ：

Realized __contains__ Method , Means list Member operations can be performed , That is to use in and not in The effect of
Realized __iter__ Method , signify list Is an iterative object , Can be done for loop 、 unpacking 、 Generator expression and other operations
Realized __len__ Method , This means that you can use built-in functions len(). meanwhile , When judging a list When the Boolean value of , If list It didn't come true __bool__ Method , Will also try to call __len__ Method
Realized __reversed__ Method , It means that the reverse operation can be realized
Realized __getitem__ Method , This means that you can index and slice
Realized index and count Method , Then it means that the index and statistical frequency can be obtained according to conditions .

The standard Sequence Type declares the above method , This means inheriting from Sequence Subclasses of , The object generated by its instantiation will be an iteratable object 、 have access to for loop 、 unpacking 、 Generator Expressions 、in、not in、 Indexes 、 section 、 Flip and many other operations . It also shows that , If we say that an object is an immutable sequence , Implies that the object is an iteratable object 、 have access to for loop 、.......

For standard variable sequences MutableSequence, We found that , In addition to implementing several methods in immutable sequences , At least the following methods need to be implemented （ Follow these agreements ）：

__setitem__,__delitem__,insert,append,extend,pop,remove,__iadd__

What do these methods mean ？ The same to Python Built in type list Take an example to illustrate ：

Realized __setitem__ Method , You can modify the elements in the list , Such as a = [1,2], Code a[0]=2 Is calling this method
Realized __delitem__,pop,remove Method , You can delete the elements in the list , Such as a = [1,2], Code del a[0] That's calling __delitem__ Method
Realized insert,append,extend Method , You can insert elements into the sequence
Realized __iadd__ Method , List can be assigned incrementally

That is to say , For standard variable sequence types , In addition to performing immutable type query operations , The instance objects of its subclasses can execute Additions and deletions The operation of .

Abstract base class Sequence and MutableSequence It declares which methods should be implemented for a sequence type , Obviously , If a class inherits directly from Sequence class , The interior is also overloaded Sequence Seven methods in , So obviously this class must be a sequence type ,MutableSequence The same is true for subclasses of . Such is the case , But when we look at the list list、 Character sequence str、 Tuples tuple When the inheritance chain of , Found in its mro There is no Sequence and MutableSequence class , in other words , These built-in types do not directly inherit from these two abstract base classes , So why should we say at the beginning of the article that they are all sequence types ？

>>> list.__mro__
(<class 'list'>, <class 'object'>)
>>> tuple.__mro__
(<class 'tuple'>, <class 'object'>)
>>> str.__mro__
(<class 'str'>, <class 'object'>)
>>> dict.__mro__
(<class 'dict'>, <class 'object'>)

Actually ,Python One of them is called The duck type Programming style . In this style , We don't pay much attention to the type of an object , It inherits from that type , It's about what it can do , Those methods are defined . If a thing looks like a duck , Walk like a duck , It sounds like a duck , So he's a duck .

Under this thought , If a class does not inherit directly from Sequence, But the internal implementation __contains__、__iter__、__len__、__reversed__、__getitem__、index,count Several ways , We can call it an immutable sequence . You don't even have to be so strict , Maybe you just need to implement __len__,__getitem__ Two methods can be called immutable sequence types . The same is true for variable sequences .

The idea of duck type runs through Python Object oriented programming is always .

Generic

This class is actually the implementation of generics , From the comments you can see , This is also an abstract base class , It is essentially used to implement multi type parameter input . For example list We can both deposit int, It can be str, It can also be list, It can also be dict Many different types of elements , This is essentially dependent on the inheritance of this class .

class Generic:
"""Abstract base class for generic types.
A generic type is typically declared by inheriting from
this class parameterized with one or more type variables.
For example, a generic mapping type might be defined as::
class Mapping(Generic[KT, VT]):
def __getitem__(self, key: KT) -> VT:
...
# Etc.
This class can then be used as follows::
def lookup_name(mapping: Mapping[KT, VT], key: KT, default: VT) -> VT:
try:
return mapping[key]
except KeyError:
return default
"""
__slots__ = ()
_is_protocol = False
@_tp_cache
def __class_getitem__(cls, params):
if not isinstance(params, tuple):
params = (params,)
if not params and cls is not Tuple:
raise TypeError(
f"Parameter list to {cls.__qualname__}[...] cannot be empty")
msg = "Parameters to generic types must be types."
params = tuple(_type_check(p, msg) for p in params)
if cls in (Generic, Protocol):
# Generic and Protocol can only be subscripted with unique type variables.
if not all(isinstance(p, TypeVar) for p in params):
raise TypeError(
f"Parameters to {cls.__name__}[...] must all be type variables")
if len(set(params)) != len(params):
raise TypeError(
f"Parameters to {cls.__name__}[...] must all be unique")
else:
# Subscripting a regular Generic subclass.
_check_generic(cls, params, len(cls.__parameters__))
return _GenericAlias(cls, params)
def __init_subclass__(cls, *args, **kwargs):
super().__init_subclass__(*args, **kwargs)
tvars = []
if '__orig_bases__' in cls.__dict__:
error = Generic in cls.__orig_bases__
else:
error = Generic in cls.__bases__ and cls.__name__ != 'Protocol'
if error:
raise TypeError("Cannot inherit from plain Generic")
if '__orig_bases__' in cls.__dict__:
tvars = _collect_type_vars(cls.__orig_bases__)
# Look for Generic[T1, ..., Tn].
# If found, tvars must be a subset of it.
# If not found, tvars is it.
# Also check for and reject plain Generic,
# and reject multiple Generic[...].
gvars = None
for base in cls.__orig_bases__:
if (isinstance(base, _GenericAlias) and
base.__origin__ is Generic):
if gvars is not None:
raise TypeError(
"Cannot inherit from Generic[...] multiple types.")
gvars = base.__parameters__
if gvars is not None:
tvarset = set(tvars)
gvarset = set(gvars)
if not tvarset <= gvarset:
s_vars = ', '.join(str(t) for t in tvars if t not in gvarset)
s_args = ', '.join(str(g) for g in gvars)
raise TypeError(f"Some type variables ({s_vars}) are"
f" not listed in Generic[{s_args}]")
tvars = gvars
cls.__parameters__ = tuple(tvars)

Suddenly look back

As a common data structure , It is used as an array in many scenarios , I may feel that list It is nothing more than a dynamic array , It's like C++ Medium vector perhaps Go Medium slice equally . From the implementation of the source code , We can also see that list Inherit MutableSequence And it has the effect of generics , But it can be asserted that list Is it a dynamic array ？

Let's consider a simple question ,Python Medium list Allows us to store different types of data , Since the types are different , The memory footprint is different , How are data objects of different sizes " Deposit in " In the array ？

We can store a string in the array respectively , A plastic surgery , And a dictionary object , If it is an array implementation , You need to store the data in the adjacent memory space , And index access becomes a very difficult thing , After all, we can't guess the size of each element , Thus, the desired element position cannot be located .

List elements .png

Is it realized through the linked list structure ? After all, linked lists support dynamic adjustment , With the help of pointers, different types of data can be referenced , For example, the linked list structure in the following figure . But in this case, when using subscript index data , You need to rely on traversal to find ,O(n) The time complexity of access efficiency is too low .

But for the use of linked lists , The system overhead is also large , After all, in addition to maintaining local data pointers for each data item , And maintain a next The pointer , Therefore, additional allocation is required 8 Bytes of data , At the same time, the dispersity of the linked list makes it impossible to use it like an array CPU Cache to efficiently execute data reading and writing .

Linked list storage logic .png

Let's discuss it again at this time list The internal implementation of this structure , My next deduction is based on CPython To the , The implementation syntax of different languages should be different , But the ideas are similar .

Python Medium list The implementation of data structure is simpler and purer than expected , The array memory continuity access mode is reserved , Only each node does not store actual data , But the corresponding data The pointer , Store and access data items in the form of an array of pointers , The corresponding structure is shown in the following figure ：

The details of the implementation can be seen from its Python Source code found in , The definition is as follows ：

typedef struct {
PyObject_VAR_HEAD
PyObject **ob_item;
Py_ssize_t allocated;
} PyListObject;

Inside list The implementation of is a C Structure , In this structure ob_item It's an array of Pointers , Stored pointer data of all objects ,allocated Is the amount of memory allocated , PyObject_VAR_HEAD Is a macro extension that contains more extended properties for managing arrays , Such as reference count and array size .

The dynamic array

Since it is a dynamic array , There is bound to be a problem , That is, how to manage capacity , Most programming languages use dynamic adjustment strategies for such structures , That is, when the storage capacity reaches a certain threshold , Expand capacity , When the storage capacity is below a certain threshold , Reduce capacity . It's simple , But it's not that easy to implement , When is the expansion , How much , When to perform recycling , How much free capacity must be reclaimed each time , These are all issues that need to be clarified in the implementation process .

about Python in list The dynamic adjustment rule procedure is defined as follows ： When the additional data capacity is full , Calculate the reallocated space size in the following way , Create a new array , And copy all the data into the new array . This is a strategy with relatively slow data growth , When reclaiming, the policy is executed when the capacity is half idle , Get the new reduced capacity size . In fact, this method is very similar to TCP Sliding window mechanism

new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
new_allocated += newsize
// 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, …

Suppose we use one of the simplest strategies ： Double the excess capacity , Less than half the capacity is reduced by times . What's wrong with this strategy ？ Imagine an insert when the capacity is full , The element is deleted , Alternate multiple times , The array data will be copied and recycled as a whole , There is no performance to speak of .

append

Next , Let's see list Several common operations of data structure . The first is list On the implementation append The operation of , This function adds elements to list Tail of . Notice that this is Pointer data is appended to the tail , Not the actual element .

test = list()
test.append("hello yerik")

Add a string to the list ：test.append("hello yerik") What happened ？ In fact, the underlying C function app1().

arguments: list object, new element
returns: 0 if OK, -1 if not
app1:
n = size of list
call list_resize() to resize the list to size n+1 = 0 + 1 = 1
list[n] = list[0] = new element
return 0

For an empty list, The size of the array is 0, To be able to insert elements , We need to expand the capacity of the array , Adjust the size according to the above calculation formula . For example, there is only one element , that newsize = 1, Calculated new_allocated = 3 + 1 = 4 , After successfully inserting the element , We don't need to reallocate new space until we insert the fifth element , To avoid frequent calls list_resize() function , Improve program performance .

python list append.png

We try to continue adding more elements to the list , When we insert elements "abc" When , Its internal array size is not large enough to hold the element , Execute a new round of dynamic capacity expansion , here newsize = 5 , new_allocated = 3 + 5 = 8

>>> test.append(520)
>>> test.append(dict())
>>> test.append(list())
>>> test.append("abc")
>>> test
['hello yerik', 520, {}, [], 'abc']

The data storage space distribution after inserting is shown in the following figure ：

python list append2.png

insert

In the list offset 2 Insert the new element in the position of , Integers 5：test.insert(1,2.33333333), Internal calls ins1() function .

arguments: list object, where, new element
returns: 0 if OK, -1 if not
ins1:
resize list to size n+1 = 5 -> 4 more slots will be allocated
starting at the last element up to the offset where, right shift each element
set new element at offset where
return 0

python Realized insert Function takes two parameters , The first is to specify the insertion location , The second is the element object . The middle insertion will cause the element after the position to be shifted , Since it is a stored pointer, the actual element does not need to be shifted , Just move the pointer .

>>> test.insert(2,2.33333333)
>>> test
['hello yerik', 520, 2.33333333, {}, [], 'abc']

python list insert.png

Insert the element as a string object , Create the string and get its pointer （ptr5）, Store it in the index as 2 In the array position of , And move the other subsequent elements to a position respectively ,insert Function call completed . It is precisely because of the need to “ Check the expansion ” Why , As a result, the complexity of the operation reaches O(n), Instead of the linked list O(1)

pop

Take out the last element of the list namely l.pop(), Called listpop() function . stay listpop() The function is called list_resize function , If the element usage is less than half , The idle capacity is recycled .

arguments: list object
returns: element popped
listpop:
if list empty:
return null
resize list with size 5 - 1 = 4. 4 is not less than 8/2 so no shrinkage
set list object size to 4
return last element

In the list pop The average complexity of the operation is O(1). However, the storage space size may need to be modified , This leads to increased complexity

>>> test.pop()
'abc'
>>> test.pop()
[]
>>> test.pop()
{}
>>> test
['hello yerik', 520, 2.33333333]

The element at the end is recycled , Pointer clear , At this time, the length is 5, Capacity of 8, Therefore, there is no need to implement any recycling strategy . When we continue to execute three times pop Make its length become 3 after , At this time, the usage is less than half of the capacity , A recycling policy needs to be implemented . The recycling method is also to use the above formula for processing , For example, the new size here is 3, Then the returned capacity is 3+3 = 6 , Not all free space is reclaimed .

python list pop.png

pop The operation of also needs to be checked and reduced , Therefore, the complexity is O(n)

Remove

remove The function specifies the element to be deleted , This element can be anywhere in the list . So every time remove Must be First, traverse the data items in turn , Match , Until the corresponding element position is found . Deleting may result in the migration of some elements .Remove The overall time complexity of the operation is O(n).

python list remove.png

>>> test
['hello yerik', 520, 2.33333333]
>>> test.remove(520)
>>> test
['hello yerik', 2.33333333]

In fact, for Python Dynamic adjustment of data structure like list , It also exists in other languages , But you may not realize it in your daily use , Understand the dynamic adjustment rules , We can allocate enough space by, for example, manually , To reduce the migration cost caused by its dynamic allocation , Make the program run more efficiently .

In addition, if you know in advance that the data types stored in the list are the same , For example, they are all integer or character types , Consider using arrays library , perhaps numpy library , Both provide a more direct array memory storage model , Instead of the above pointer reference model , Therefore, it will be more efficient in terms of access and storage efficiency .