To know what it is, we must know what it is ,python There are really not many container objects in , Usually, we will use the corresponding containers according to our needs with great peace of mind , For variable length data list
, Want to reuse set
, Want to match quickly dict
, For character processing str
, But why can this effect be achieved ? For example, we use list
When , Know that this thing can store various formats at will , Save integer type 、 floating-point 、 character string 、 You can even nest list
Other containers , The underlying principle is implemented by arrays , Or a chain list ? Like our dictionary , Is the bottom layer an array or something else ? If it's something else like a hash table , Then how to realize the sequence arrangement of input data ? This time, we might as well analyze it layer by layer , Make a deduction . You can't chew too much , This time, let's start with list
Analyze
The name is easily associated with other languages (C++、Java etc. ) The linked list in the standard library is confused , But in fact CPython
The list of is not a list at all ( That's a little windy , It may be easier to understand in English :python Medium list Not what we learn list), stay CPython in , The list is implemented as A variable length array .
Look at the details ,Python The list in is made up of A contiguous array of references to other objects , The pointer to this array and its length are stored in a list header structure . It means , Every time you add or delete an element , An array made up of references requires this size ( Redistribution ). In the process of implementation ,Python When creating these arrays, the exponential allocation method is adopted , As a result, it is not necessary to change the size of the array every time , But because of this, the average complexity of adding or removing elements is low .
The result of this method is on the common linked list “ The price is very small ” Some of the other operations in Python The computational complexity is relatively high .
list.insert(i,item)
Method to insert an element anywhere —— Complexity O(N)list.pop(i)
or list.remove(value)
Delete an element —— Complexity O(N) Let's take a look at list Source code of implementation , Source juice source flavor , Make a detailed evaluation . Let's find out first list Multiple inheritance from MutableSequence
and Generic
. Then we can read ,list Implementation of related embedded functions of , Such as append、pop、extend、insert And so on are actually realized through inheritance , Then we have to look for it MutableSequence
and Generic
The underlying implementation of these two classes , Only after solving these two classes , We can answer why list You can dynamically add data , And the complexity of deletion and insertion is not so good .
class list(MutableSequence[_T], Generic[_T]): @overload def __init__(self) -> None: ... @overload def __init__(self, iterable: Iterable[_T]) -> None: ... if sys.version_info >= (3,): def clear(self) -> None: ... def copy(self) -> List[_T]: ... def append(self, object: _T) -> None: ... def extend(self, iterable: Iterable[_T]) -> None: ... def pop(self, index: int = ...) -> _T: ... def index(self, object: _T, start: int = ..., stop: int = ...) -> int: ... def count(self, object: _T) -> int: ... def insert(self, index: int, object: _T) -> None: ... def remove(self, object: _T) -> None: ... def reverse(self) -> None: ... if sys.version_info >= (3,): def sort(self, *, key: Optional[Callable[[_T], Any]] = ..., reverse: bool = ...) -> None: ... else: def sort(self, cmp: Callable[[_T, _T], Any] = ..., key: Callable[[_T], Any] = ..., reverse: bool = ...) -> None: ... def __len__(self) -> int: ... def __iter__(self) -> Iterator[_T]: ... def __str__(self) -> str: ... __hash__: None # type: ignore @overload def __getitem__(self, i: int) -> _T: ... @overload def __getitem__(self, s: slice) -> List[_T]: ... @overload def __setitem__(self, i: int, o: _T) -> None: ... @overload def __setitem__(self, s: slice, o: Iterable[_T]) -> None: ... def __delitem__(self, i: Union[int, slice]) -> None: ... if sys.version_info < (3,): def __getslice__(self, start: int, stop: int) -> List[_T]: ... def __setslice__(self, start: int, stop: int, o: Sequence[_T]) -> None: ... def __delslice__(self, start: int, stop: int) -> None: ... def __add__(self, x: List[_T]) -> List[_T]: ... def __iadd__(self: _S, x: Iterable[_T]) -> _S: ... def __mul__(self, n: int) -> List[_T]: ... def __rmul__(self, n: int) -> List[_T]: ... if sys.version_info >= (3,): def __imul__(self: _S, n: int) -> _S: ... def __contains__(self, o: object) -> bool: ... def __reversed__(self) -> Iterator[_T]: ... def __gt__(self, x: List[_T]) -> bool: ... def __ge__(self, x: List[_T]) -> bool: ... def __lt__(self, x: List[_T]) -> bool: ... def __le__(self, x: List[_T]) -> bool: ...
This class actually comes from collections.abc.MutableSequence
, In fact, it is the method of variable sequence in the so-called Abstract basic class .
Python There are two kinds of sequences of , Variable sequence and immutable sequence and provide two base classes for them Sequence
and MutableSequence
, These two base classes exist in built-in modules collections.abc
in , And other common classes such as int
、list
And so on , These two base classes are Abstract base class . This involves a new concept of abstract base classes , What is an abstract base class ?
For abstract base classes , There is no need to pay too much attention at present , Just know that the abstract base class refers to Cannot instantiate a class that produces an instance object , We will discuss abstract base classes later .
Sequence
and MutableSequence
Are two abstract base classes , Therefore, these two classes cannot instantiate to generate instance objects , Then Sequence
and MutableSequence
What else do the two abstract base classes do ?
In fact, the role of an abstract base class is not to instantiate an instance object , Its function is more like defining a rule , Or the official saying is agreement , So when we want to create this type of object in the future , Require compliance with such rules or agreements . Now we need to understand the protocols of sequence types , It takes learning abc Module Sequence
and MutableSequence
Two classes .
Sequence and MutableSequence The inheritance relationship between the two classes is as follows :
Bold in the figure indicates the abstract base class , Italics indicate abstract methods , It can be understood as There is no specific implementation method , The rest are the methods implemented in the abstract base class .
You can see , The inheritance relationship is not complicated , But there's a lot of information , This diagram should be kept in mind , Because this is very important for understanding sequence types . We see , Variable sequence MutableSequence
Class inherits from immutable sequence Sequence
class ,Sequence
Class inherits two more classes Reversible
and Collection
,Collection
And inherit from Container
、 Iterable
、Sized
Three abstract base classes . Through this inheritance diagram , We should at least know , For standard immutable sequence types Sequence
, At least the following methods should be implemented ( Follow these agreements ):
__contains__,__iter__,__len__,__reversed__,__getitem__,index,count
What exactly do these methods mean ? In front of list
We can peek into the implementation source code of :
__contains__
Method , Means list Member operations can be performed , That is to use in
and not in
The effect of __iter__
Method , signify list Is an iterative object , Can be done for
loop 、 unpacking 、 Generator expression and other operations __len__
Method , This means that you can use built-in functions len()
. meanwhile , When judging a list When the Boolean value of , If list It didn't come true __bool__
Method , Will also try to call __len__
Method __reversed__
Method , It means that the reverse operation can be realized __getitem__
Method , This means that you can index and slice index
and count
Method , Then it means that the index and statistical frequency can be obtained according to conditions . The standard Sequence
Type declares the above method , This means inheriting from Sequence
Subclasses of , The object generated by its instantiation will be an iteratable object 、 have access to for loop 、 unpacking 、 Generator Expressions 、in、not in、 Indexes 、 section 、 Flip and many other operations . It also shows that , If we say that an object is an immutable sequence , Implies that the object is an iteratable object 、 have access to for loop 、.......
For standard variable sequences MutableSequence
, We found that , In addition to implementing several methods in immutable sequences , At least the following methods need to be implemented ( Follow these agreements ):
__setitem__,__delitem__,insert,append,extend,pop,remove,__iadd__
What do these methods mean ? The same to Python Built in type list Take an example to illustrate :
__setitem__
Method , You can modify the elements in the list , Such as a = [1,2]
, Code a[0]=2
Is calling this method __delitem__
,pop
,remove
Method , You can delete the elements in the list , Such as a = [1,2]
, Code del a[0]
That's calling __delitem__
Method insert
,append
,extend
Method , You can insert elements into the sequence __iadd__
Method , List can be assigned incrementally That is to say , For standard variable sequence types , In addition to performing immutable type query operations , The instance objects of its subclasses can execute Additions and deletions The operation of .
Abstract base class Sequence
and MutableSequence
It declares which methods should be implemented for a sequence type , Obviously , If a class inherits directly from Sequence
class , The interior is also overloaded Sequence
Seven methods in , So obviously this class must be a sequence type ,MutableSequence
The same is true for subclasses of . Such is the case , But when we look at the list list、 Character sequence str、 Tuples tuple When the inheritance chain of , Found in its mro There is no Sequence and MutableSequence class , in other words , These built-in types do not directly inherit from these two abstract base classes , So why should we say at the beginning of the article that they are all sequence types ?
>>> list.__mro__ (<class 'list'>, <class 'object'>) >>> tuple.__mro__ (<class 'tuple'>, <class 'object'>) >>> str.__mro__ (<class 'str'>, <class 'object'>) >>> dict.__mro__ (<class 'dict'>, <class 'object'>)
Actually ,Python One of them is called The duck type Programming style . In this style , We don't pay much attention to the type of an object , It inherits from that type , It's about what it can do , Those methods are defined . If a thing looks like a duck , Walk like a duck , It sounds like a duck , So he's a duck .
Under this thought , If a class does not inherit directly from Sequence
, But the internal implementation __contains__
、__iter__
、__len__
、__reversed__
、__getitem__
、index
,count
Several ways , We can call it an immutable sequence . You don't even have to be so strict , Maybe you just need to implement __len__
,__getitem__
Two methods can be called immutable sequence types . The same is true for variable sequences .
The idea of duck type runs through Python Object oriented programming is always .
This class is actually the implementation of generics , From the comments you can see , This is also an abstract base class , It is essentially used to implement multi type parameter input . For example list We can both deposit int, It can be str, It can also be list, It can also be dict Many different types of elements , This is essentially dependent on the inheritance of this class .
class Generic: """Abstract base class for generic types. A generic type is typically declared by inheriting from this class parameterized with one or more type variables. For example, a generic mapping type might be defined as:: class Mapping(Generic[KT, VT]): def __getitem__(self, key: KT) -> VT: ... # Etc. This class can then be used as follows:: def lookup_name(mapping: Mapping[KT, VT], key: KT, default: VT) -> VT: try: return mapping[key] except KeyError: return default """ __slots__ = () _is_protocol = False @_tp_cache def __class_getitem__(cls, params): if not isinstance(params, tuple): params = (params,) if not params and cls is not Tuple: raise TypeError( f"Parameter list to {cls.__qualname__}[...] cannot be empty") msg = "Parameters to generic types must be types." params = tuple(_type_check(p, msg) for p in params) if cls in (Generic, Protocol): # Generic and Protocol can only be subscripted with unique type variables. if not all(isinstance(p, TypeVar) for p in params): raise TypeError( f"Parameters to {cls.__name__}[...] must all be type variables") if len(set(params)) != len(params): raise TypeError( f"Parameters to {cls.__name__}[...] must all be unique") else: # Subscripting a regular Generic subclass. _check_generic(cls, params, len(cls.__parameters__)) return _GenericAlias(cls, params) def __init_subclass__(cls, *args, **kwargs): super().__init_subclass__(*args, **kwargs) tvars = [] if '__orig_bases__' in cls.__dict__: error = Generic in cls.__orig_bases__ else: error = Generic in cls.__bases__ and cls.__name__ != 'Protocol' if error: raise TypeError("Cannot inherit from plain Generic") if '__orig_bases__' in cls.__dict__: tvars = _collect_type_vars(cls.__orig_bases__) # Look for Generic[T1, ..., Tn]. # If found, tvars must be a subset of it. # If not found, tvars is it. # Also check for and reject plain Generic, # and reject multiple Generic[...]. gvars = None for base in cls.__orig_bases__: if (isinstance(base, _GenericAlias) and base.__origin__ is Generic): if gvars is not None: raise TypeError( "Cannot inherit from Generic[...] multiple types.") gvars = base.__parameters__ if gvars is not None: tvarset = set(tvars) gvarset = set(gvars) if not tvarset <= gvarset: s_vars = ', '.join(str(t) for t in tvars if t not in gvarset) s_args = ', '.join(str(g) for g in gvars) raise TypeError(f"Some type variables ({s_vars}) are" f" not listed in Generic[{s_args}]") tvars = gvars cls.__parameters__ = tuple(tvars)
As a common data structure , It is used as an array in many scenarios , I may feel that list It is nothing more than a dynamic array , It's like C++ Medium vector perhaps Go Medium slice equally . From the implementation of the source code , We can also see that list Inherit MutableSequence
And it has the effect of generics , But it can be asserted that list Is it a dynamic array ?
Let's consider a simple question ,Python Medium list Allows us to store different types of data , Since the types are different , The memory footprint is different , How are data objects of different sizes " Deposit in " In the array ?
We can store a string in the array respectively , A plastic surgery , And a dictionary object , If it is an array implementation , You need to store the data in the adjacent memory space , And index access becomes a very difficult thing , After all, we can't guess the size of each element , Thus, the desired element position cannot be located .
Is it realized through the linked list structure ? After all, linked lists support dynamic adjustment , With the help of pointers, different types of data can be referenced , For example, the linked list structure in the following figure . But in this case, when using subscript index data , You need to rely on traversal to find ,O(n) The time complexity of access efficiency is too low .
But for the use of linked lists , The system overhead is also large , After all, in addition to maintaining local data pointers for each data item , And maintain a next The pointer
, Therefore, additional allocation is required 8 Bytes of data , At the same time, the dispersity of the linked list makes it impossible to use it like an array CPU Cache to efficiently execute data reading and writing .
Let's discuss it again at this time list
The internal implementation of this structure , My next deduction is based on CPython
To the , The implementation syntax of different languages should be different , But the ideas are similar .
Python Medium list The implementation of data structure is simpler and purer than expected , The array memory continuity access mode is reserved , Only each node does not store actual data , But the corresponding data The pointer , Store and access data items in the form of an array of pointers , The corresponding structure is shown in the following figure :
The details of the implementation can be seen from its Python Source code found in , The definition is as follows :
typedef struct { PyObject_VAR_HEAD PyObject **ob_item; Py_ssize_t allocated; } PyListObject;
Inside list The implementation of is a C Structure , In this structure ob_item
It's an array of Pointers , Stored pointer data of all objects ,allocated
Is the amount of memory allocated , PyObject_VAR_HEAD
Is a macro extension that contains more extended properties for managing arrays , Such as reference count and array size .
Since it is a dynamic array , There is bound to be a problem , That is, how to manage capacity , Most programming languages use dynamic adjustment strategies for such structures , That is, when the storage capacity reaches a certain threshold , Expand capacity , When the storage capacity is below a certain threshold , Reduce capacity . It's simple , But it's not that easy to implement , When is the expansion , How much , When to perform recycling , How much free capacity must be reclaimed each time , These are all issues that need to be clarified in the implementation process .
about Python in list The dynamic adjustment rule procedure is defined as follows : When the additional data capacity is full , Calculate the reallocated space size in the following way , Create a new array , And copy all the data into the new array . This is a strategy with relatively slow data growth , When reclaiming, the policy is executed when the capacity is half idle , Get the new reduced capacity size . In fact, this method is very similar to TCP Sliding window mechanism
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6); new_allocated += newsize // 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, …
Suppose we use one of the simplest strategies : Double the excess capacity , Less than half the capacity is reduced by times . What's wrong with this strategy ? Imagine an insert when the capacity is full , The element is deleted , Alternate multiple times , The array data will be copied and recycled as a whole , There is no performance to speak of .
Next , Let's see list Several common operations of data structure . The first is list On the implementation append The operation of , This function adds elements to list Tail of . Notice that this is Pointer data is appended to the tail , Not the actual element .
test = list() test.append("hello yerik")
Add a string to the list :test.append("hello yerik") What happened ? In fact, the underlying C function app1().
arguments: list object, new element returns: 0 if OK, -1 if not app1: n = size of list call list_resize() to resize the list to size n+1 = 0 + 1 = 1 list[n] = list[0] = new element return 0
For an empty list, The size of the array is 0, To be able to insert elements , We need to expand the capacity of the array , Adjust the size according to the above calculation formula . For example, there is only one element , that newsize = 1, Calculated new_allocated = 3 + 1 = 4 , After successfully inserting the element , We don't need to reallocate new space until we insert the fifth element , To avoid frequent calls list_resize() function , Improve program performance .
We try to continue adding more elements to the list , When we insert elements "abc" When , Its internal array size is not large enough to hold the element , Execute a new round of dynamic capacity expansion , here newsize = 5 , new_allocated = 3 + 5 = 8
>>> test.append(520) >>> test.append(dict()) >>> test.append(list()) >>> test.append("abc") >>> test ['hello yerik', 520, {}, [], 'abc']
The data storage space distribution after inserting is shown in the following figure :
In the list offset 2 Insert the new element in the position of , Integers 5:test.insert(1,2.33333333), Internal calls ins1() function .
arguments: list object, where, new element returns: 0 if OK, -1 if not ins1: resize list to size n+1 = 5 -> 4 more slots will be allocated starting at the last element up to the offset where, right shift each element set new element at offset where return 0
python Realized insert Function takes two parameters , The first is to specify the insertion location , The second is the element object . The middle insertion will cause the element after the position to be shifted , Since it is a stored pointer, the actual element does not need to be shifted , Just move the pointer .
>>> test.insert(2,2.33333333) >>> test ['hello yerik', 520, 2.33333333, {}, [], 'abc']
Insert the element as a string object , Create the string and get its pointer (ptr5), Store it in the index as 2 In the array position of , And move the other subsequent elements to a position respectively ,insert Function call completed . It is precisely because of the need to “ Check the expansion ” Why , As a result, the complexity of the operation reaches O(n), Instead of the linked list O(1)
Take out the last element of the list namely l.pop(), Called listpop() function . stay listpop() The function is called list_resize function , If the element usage is less than half , The idle capacity is recycled .
arguments: list object returns: element popped listpop: if list empty: return null resize list with size 5 - 1 = 4. 4 is not less than 8/2 so no shrinkage set list object size to 4 return last element
In the list pop The average complexity of the operation is O(1). However, the storage space size may need to be modified , This leads to increased complexity
>>> test.pop() 'abc' >>> test.pop() [] >>> test.pop() {} >>> test ['hello yerik', 520, 2.33333333]
The element at the end is recycled , Pointer clear , At this time, the length is 5, Capacity of 8, Therefore, there is no need to implement any recycling strategy . When we continue to execute three times pop Make its length become 3 after , At this time, the usage is less than half of the capacity , A recycling policy needs to be implemented . The recycling method is also to use the above formula for processing , For example, the new size here is 3, Then the returned capacity is 3+3 = 6 , Not all free space is reclaimed .
pop The operation of also needs to be checked and reduced , Therefore, the complexity is O(n)
remove The function specifies the element to be deleted , This element can be anywhere in the list . So every time remove Must be First, traverse the data items in turn , Match , Until the corresponding element position is found . Deleting may result in the migration of some elements .Remove The overall time complexity of the operation is O(n).
>>> test ['hello yerik', 520, 2.33333333] >>> test.remove(520) >>> test ['hello yerik', 2.33333333]
In fact, for Python Dynamic adjustment of data structure like list , It also exists in other languages , But you may not realize it in your daily use , Understand the dynamic adjustment rules , We can allocate enough space by, for example, manually , To reduce the migration cost caused by its dynamic allocation , Make the program run more efficiently .
In addition, if you know in advance that the data types stored in the list are the same , For example, they are all integer or character types , Consider using arrays library , perhaps numpy library , Both provide a more direct array memory storage model , Instead of the above pointer reference model , Therefore, it will be more efficient in terms of access and storage efficiency .