
[Python] NumPy Illustrated: the inner workings of common functions


Selected from Medium. Author: Lev Maximov

Compiled by Jiqizhixin (Synced)

NumPy, a library that supports a wide range of operations on multidimensional arrays and matrices, is an essential tool for many machine learning developers and researchers. This article walks through commonly used NumPy functions and features to help you understand how NumPy manipulates arrays internally.

NumPy is a fundamental library that inspired many popular Python data processing packages, including pandas, PyTorch, TensorFlow, and Keras. Understanding how NumPy works will improve your skills with those libraries too. NumPy code can also run on a GPU with little or no modification.

The central concept of NumPy is the n-dimensional array. The beauty of n-dimensional arrays is that most operations look the same no matter how many dimensions the array has. The one- and two-dimensional cases are somewhat special, though. This article has three parts:

1. Vectors: one-dimensional arrays

2. Matrices: two-dimensional arrays

3. Three dimensions and higher

This article takes Jay Alammar's "A Visual Intro to NumPy" as a starting point, expands on it, and makes a few minor changes.

NumPy arrays vs. Python lists

At first glance, NumPy arrays are similar to Python lists. Both serve as containers with fast element getting and setting, and somewhat slower insertion and removal of elements.

The simplest example where a NumPy array beats a list is arithmetic:
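A minimal sketch of the difference (the values and variable names are made up for illustration):

```python
import numpy as np

# With a plain Python list, element-wise arithmetic needs an explicit loop
# (here a list comprehension).
prices = [1.0, 2.0, 3.0]
doubled_list = [p * 2 for p in prices]

# The same operation on a NumPy array is a single vectorized expression.
prices_arr = np.array(prices)
doubled_arr = prices_arr * 2

print(doubled_list)  # [2.0, 4.0, 6.0]
print(doubled_arr)   # [2. 4. 6.]
```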

Beyond that, NumPy arrays are:

More compact, especially when there is more than one dimension;

Faster than lists when the operation can be vectorized;

Slower than lists when you append elements to the end;

Usually homogeneous: they are fast only when all elements are of one type.

Here O(N) means that the time needed to complete the operation is proportional to the size of the array, and O*(1) (so-called "amortized O(1)") means that the time to complete the operation generally does not depend on the size of the array.

Vectors: one-dimensional arrays

Vector initialization

One way to create a NumPy array is to convert a Python list. The array type is deduced directly from the type of the list elements.

Make sure that the list you pass in is homogeneous, or you'll end up with dtype='object', which hurts speed and leaves you with only the syntactic sugar that NumPy provides.
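As a small illustration of how the dtype is inferred, assuming a few made-up input lists (the exact integer width depends on the platform):

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers -> an integer dtype
floats = np.array([1, 2.5, 3])      # integers are upcast to float64
mixed = np.array([1, None, 3.0])    # no common numeric type -> dtype=object

print(ints.dtype.kind)   # i
print(floats.dtype)      # float64
print(mixed.dtype)       # object
```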

NumPy arrays cannot grow the way Python lists do: no space is reserved at the end of the array for fast appends. So it is common practice either to grow a Python list and convert it to a NumPy array when it is ready, or to preallocate the necessary space with np.zeros or np.empty:
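Both approaches might look like this in practice; the loop body (squares of the index) is only a placeholder for whatever computation produces the elements:

```python
import numpy as np

n = 5

# Option 1: grow a Python list, then convert it once at the end.
squares_list = []
for i in range(n):
    squares_list.append(i * i)
squares_a = np.array(squares_list)

# Option 2: preallocate the array and fill it in place.
squares_b = np.zeros(n, dtype=int)
for i in range(n):
    squares_b[i] = i * i

print(squares_a)  # [ 0  1  4  9 16]
```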

It is often necessary to create an empty array that matches an existing one in shape and element type.

In fact, all functions that create an array filled with a constant value have a _like counterpart:
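A short sketch of the _like family, using an arbitrary 2x2 float array as the template:

```python
import numpy as np

a = np.array([[1.5, 2.5], [3.5, 4.5]])

z = np.zeros_like(a)      # same shape and dtype as a, filled with 0
o = np.ones_like(a)       # filled with 1
e = np.empty_like(a)      # uninitialized contents, same shape and dtype
f = np.full_like(a, 7)    # filled with 7, cast to a's dtype

print(f)
```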

Two NumPy functions initialize an array with a monotonic sequence:

If you need an array of floats such as [0., 1., 2.], you can change the type of the arange output with arange(3).astype(float), but there is a better way. The arange function is type-sensitive: given integer arguments, it generates integers; given a float (for example arange(3.)), it generates floats.

But arange is not especially good at handling floats:

To our eyes 0.1 looks like a finite decimal number, but not to the computer. In binary, 0.1 is an infinite fraction that has to be rounded somewhere, which inevitably introduces an error. For this reason, passing a step with a fractional part to arange is generally a bad idea: you may run into an off-by-one error. You can make the end of the interval fall on a non-integer number of steps (solution1), but that reduces the readability and maintainability of the code. This is where linspace comes in handy. It is immune to rounding errors and always generates the number of elements you ask for. There is a common gotcha with linspace, though: it counts points, not intervals, so the last argument num is usually one more than you would expect. That is why the number in the last example above is 11, not 10.
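The following sketch contrasts the two functions. The interval endpoints are chosen for illustration; the length of the arange result with a fractional step is printed rather than assumed, since it depends on floating-point rounding:

```python
import numpy as np

ints = np.arange(3)         # integer arguments -> integers
floats = np.arange(3.)      # float argument -> floats

# A fractional step makes the element count depend on rounding,
# so the length is printed here rather than taken for granted.
risky = np.arange(0.5, 0.8, 0.1)
print(len(risky))

# linspace takes the number of points: 11 points give 10 intervals of 0.1.
grid = np.linspace(0, 1, 11)
print(len(grid))            # 11
```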

For testing purposes, you often need to generate random arrays:
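One way to do this is with NumPy's Generator interface. This particular API choice, and the fixed seed, are assumptions made for the sketch rather than something the article prescribes:

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # seed fixed only for reproducibility

ints = rng.integers(0, 10, size=3)     # integers in [0, 10)
unit = rng.random(3)                   # floats in [0, 1)
scaled = rng.uniform(1, 10, size=3)    # floats in [1, 10)
normal = rng.normal(5, 2, size=3)      # mean 5, standard deviation 2

print(ints, unit, scaled, normal)
```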

Vector indexing

Once you have data in an array, NumPy offers very flexible ways to get it back out:

All of the indexing methods shown above, except fancy indexing, produce so-called views: they do not store the data themselves, and they reflect any changes made to the original array after it was indexed.

All of these methods, including fancy indexing, are mutable: they allow the contents of the original array to be modified through assignment, as shown above. This feature breaks the habit of copying an array by slicing it:
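The distinction between views and fancy-indexed copies can be checked with a small made-up array:

```python
import numpy as np

a = np.arange(1, 6)        # [1 2 3 4 5]

view = a[1:4]              # basic slice: a view onto a's data
view[0] = 20               # writes through to a

fancy = a[[0, 2, 4]]       # fancy indexing on the right-hand side: a copy
fancy[0] = -1              # does not touch a

a[[3, 4]] = 0              # assigning through a fancy index still modifies a

print(a)                   # [ 1 20  3  0  0]
```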

Python lists vs. NumPy arrays

Another very useful way to get data out of a NumPy array is boolean indexing, which supports all kinds of logical operators:

any and all work much like their Python counterparts, but they do not short-circuit.

Note, however, that Python's chained ("ternary") comparisons such as 3<=a<=5 are not supported here.
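A minimal sketch of boolean indexing on an illustrative array, including the usual replacement for the chained comparison:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1])

mask = a > 5                      # element-wise comparison -> boolean array
big = a[mask]                     # [6 7 6]

# Instead of 3 <= a <= 5, combine two comparisons with & and parentheses.
mid = a[(a >= 3) & (a <= 5)]      # [3 4 5 5 4 3]

has_big = np.any(a > 5)           # True
all_positive = np.all(a > 0)      # True
```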

As shown above, boolean indexing is also writable. Two of its common use cases have dedicated functions: the overloaded np.where function and np.clip. Their meanings are as follows:
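A sketch of both functions on a made-up array; np.where is shown in its one-argument form (returning indices) and its three-argument form (choosing between two values):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1])

idx = np.where(a > 5)[0]          # indices where the condition holds: [5 6 7]
flags = np.where(a > 5, 1, 0)     # 1 where a > 5, otherwise 0
clipped = np.clip(a, 2, 5)        # values below 2 become 2, above 5 become 5

b = a.copy()
b[b > 5] = 0                      # boolean indexing used for assignment
```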

Vector operations

Arithmetic is one of the places where NumPy's speed shines. Vector operators are executed at the C/C++ level, which avoids the cost of slow Python loops. NumPy lets you operate on whole arrays just as you would on ordinary numbers.

As in regular Python, a//b means floor division of a by b (the integer quotient), and x**n means xⁿ.

Just as integers are converted to floats when you add or subtract an integer and a float, scalars are converted to arrays. In NumPy this process is called broadcasting.
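The element-wise operators and scalar broadcasting can be sketched with two small example vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 8, 9])

total = a + b        # [ 5 10 12]
quotient = b // a    # [4 4 3]   floor division, element by element
squares = a ** 2     # [1 4 9]

tripled = a * 3      # the scalar 3 is broadcast to [3 3 3]
shifted = a + 0.5    # integers are promoted to floats: [1.5 2.5 3.5]
```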

Most math functions have NumPy counterparts that can handle vectors:

The scalar product has its own operator:

You don't have to loop to perform trigonometric functions ：
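The vectorized math functions, the scalar-product operator, and the trigonometric functions mentioned above can be sketched as follows (input values are illustrative):

```python
import numpy as np

v = np.array([4.0, 9.0])
roots = np.sqrt(v)                         # [2. 3.]

a = np.array([1, 2])
b = np.array([3, 4])
dot = a @ b                                # 1*3 + 2*4 = 11

angles = np.array([0.0, np.pi / 2])
sines = np.sin(angles)                     # approximately [0. 1.]
```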

Arrays can be rounded as a whole:

floor rounds down, ceil rounds up, and around rounds to the nearest integer (with .5 rounded to the nearest even integer).
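A quick check of the three rounding functions, with inputs picked to show the round-half-to-even behaviour:

```python
import numpy as np

x = np.array([1.1, 1.5, 1.9, 2.5])

down = np.floor(x)       # [1. 1. 1. 2.]
up = np.ceil(x)          # [2. 2. 2. 3.]
nearest = np.around(x)   # [1. 2. 2. 2.]  (1.5 -> 2 and 2.5 -> 2: half goes to even)
```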

NumPy can also do basic statistics:
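For example, on a small made-up vector (var and std use the population formulas by default):

```python
import numpy as np

a = np.array([1, 2, 3, 4])

largest = a.max()     # 4
smallest = a.min()    # 1
total = a.sum()       # 10
mean = a.mean()       # 2.5
var = a.var()         # 1.25
std = a.std()         # square root of the variance
```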

NumPy's sorting functions are less capable than their Python counterparts:

Sorting: Python lists vs. NumPy arrays

In the one-dimensional case, the missing reverse keyword is easy to work around by reversing the result. The two-dimensional case is harder (this feature has been requested).
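A sketch of the one-dimensional workaround, on an illustrative array:

```python
import numpy as np

a = np.array([3, 1, 2])

ascending = np.sort(a)            # returns a sorted copy: [1 2 3]
descending = np.sort(a)[::-1]     # no reverse keyword, so reverse the result: [3 2 1]

a.sort()                          # sorts in place, like list.sort()
```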

Searching for an element in a vector

Unlike Python lists, NumPy arrays have no index method. The feature has been requested for a long time, but it has not been implemented yet.

Python lists vs. NumPy arrays. The square brackets in index() mean that j alone, or both i and j, can be omitted.

One way to find an element is np.where(a==x)[0][0], but it is neither elegant nor fast, because it examines every element of the array even if the target is at the very beginning.

A faster way is to use Numba to accelerate next((i[0] for i, v in np.ndenumerate(a) if v==x), -1).

Once the array is sorted, searching gets much easier: v = np.searchsorted(a, x); return v if v < len(a) and a[v]==x else -1 is very fast, with O(log N) time complexity, but it requires O(N log N) time to sort first.
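The two NumPy-only approaches can be wrapped in small helpers. The function names find_first and find_sorted are hypothetical, introduced here only for the sketch; note the bounds check, since searchsorted may return len(a):

```python
import numpy as np

def find_first(a, x):
    """Index of the first occurrence of x in a, or -1 (scans the whole array)."""
    hits = np.where(a == x)[0]
    return int(hits[0]) if hits.size else -1

def find_sorted(a_sorted, x):
    """Binary search in an already sorted array; returns -1 if x is absent."""
    v = np.searchsorted(a_sorted, x)
    # searchsorted can return len(a_sorted), so check the bound before indexing.
    return int(v) if v < len(a_sorted) and a_sorted[v] == x else -1

print(find_first(np.array([5, 3, 7, 3]), 3))    # 1
print(find_sorted(np.array([1, 3, 5, 7]), 5))   # 2
```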

In fact, speeding up the search by implementing it in C is not the problem. The problem is floating-point comparisons, which do not work out of the box for arbitrary data.

Comparing floating-point numbers

The function np.allclose(a, b) compares arrays of floats within a given tolerance.

How np.allclose(a, b) works. There is no universal way!

np.allclose assumes that all the compared numbers are on a typical scale of 1. For instance, if you work with nanoseconds, you need to divide the default atol value by 1e9: np.allclose(1e-9, 2e-9, atol=1e-17) == False.

math.isclose makes no assumptions about the numbers being compared and instead relies on the user to supply a reasonable abs_tol value (for values on a typical scale of 1, the default np.allclose atol of 1e-8 is good enough): math.isclose(0.1+0.2-0.3, 0.0, abs_tol=1e-8) == True.

Beyond that, np.allclose has some minor issues in its formula for absolute and relative tolerances; for instance, for certain a and b, allclose(a, b) != allclose(b, a). These issues were resolved in the (scalar) function math.isclose, which was introduced later. For more on this, see the floating-point guide (https://floating-point-gui.de/errors/comparison/) and the corresponding NumPy issue on GitHub.
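The behaviours described in this section can be reproduced with a few scalar comparisons. The rtol value in the asymmetry example is chosen only to make the effect visible:

```python
import math
import numpy as np

close_default = np.allclose(0.1 + 0.2, 0.3)            # True

# On a nanosecond scale the default atol=1e-8 hides a factor-of-two difference.
ns_default = np.allclose(1e-9, 2e-9)                   # True
ns_tight = np.allclose(1e-9, 2e-9, atol=1e-17)         # False

# math.isclose has abs_tol=0 by default, so comparisons against 0 need it set.
iso_default = math.isclose(0.1 + 0.2 - 0.3, 0.0)               # False
iso_abs = math.isclose(0.1 + 0.2 - 0.3, 0.0, abs_tol=1e-8)     # True

# allclose scales rtol by the second argument only, so it is not symmetric.
forward = np.allclose(10, 11, rtol=0.095, atol=0)      # True:  1 <= 0.095 * 11
backward = np.allclose(11, 10, rtol=0.095, atol=0)     # False: 1 >  0.095 * 10
```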

Matrices: two-dimensional arrays

NumPy used to have a dedicated matrix class, but it is now deprecated, so this article uses the terms "matrix" and "two-dimensional array" interchangeably.

Matrix initialization syntax is similar to that of vectors:

Double parentheses are required here because the second positional parameter is reserved for the (optional) dtype, which also accepts integers.
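A sketch of matrix initialization, with the shape passed as a tuple (the sizes are arbitrary):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])        # from a nested list: shape (2, 3)

z = np.zeros((3, 2))             # the shape is one argument: a tuple
f = np.full((3, 2), 7)           # 3x2 matrix filled with 7
e = np.eye(3)                    # 3x3 identity matrix

print(m.shape, z.shape)          # (2, 3) (3, 2)
```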

Random matrix generation is also similar to that of vectors:

Two-dimensional indexing syntax is more convenient than that of nested lists:

The "view" label means that no copying is actually done when slicing an array. When the array is modified, the changes are reflected in the slice as well.
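A few indexing forms on an example 3x4 matrix, ending with a write through a column view:

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]

elem = a[1, 2]        # 7, instead of a[1][2] for nested lists
row = a[1, :]         # [5 6 7 8]
block = a[:, 1:3]     # columns 1 and 2 of every row

col = a[:, 2]         # a view of the third column
col[0] = -1           # modifies a[0, 2] as well
```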

The axis argument

In many operations (sum, for example), you need to tell NumPy whether to operate across rows or across columns. To have a universal notation that works for any number of dimensions, NumPy introduces the notion of an axis: the value of the axis argument is the number of the index in question. The first index is axis=0, the second is axis=1, and so on. So in two dimensions, axis=0 is column-wise and axis=1 is row-wise.
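For instance, summing an illustrative 2x3 matrix along each axis:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = a.sum()              # 21: all elements
per_column = a.sum(axis=0)   # [5 7 9]: the first index (rows) is summed away
per_row = a.sum(axis=1)      # [ 6 15]: the second index (columns) is summed away
```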

Matrix arithmetic

In addition to the ordinary operators that act element-wise (such as +, -, *, /, //, and **), there is the @ operator, which computes the matrix product:
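The difference between the element-wise product and the matrix product, on two small example matrices:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[2, 0],
              [0, 2]])

elementwise = a * b    # [[2 0] [0 8]]
matmul = a @ b         # [[2 4] [6 8]]
```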

Having covered scalar-to-array broadcasting in the first part, we can now generalize it: NumPy supports mixed operations between vectors and matrices, and even between two vectors:

Broadcasting in two-dimensional arrays
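A sketch of the three cases (matrix with row vector, matrix with column vector, and row vector with column vector), using made-up values:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
row = np.array([1, 10, 100])         # shape (3,), treated as a row
col = np.array([[1], [2], [3]])      # shape (3, 1), a column

by_row = m * row     # each row of m is multiplied by [1, 10, 100]
by_col = m * col     # row i of m is multiplied by col[i]
outer = row * col    # (3,) with (3, 1) broadcasts to a 3x3 outer product
```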

Row vector and column vector

As the example above shows, row vectors and column vectors are treated differently in two dimensions. This contrasts with the usual NumPy practice of having a single kind of one-dimensional array (for example, a[:,j], the j-th column of a two-dimensional array a, is a one-dimensional array). By default, one-dimensional arrays are treated as row vectors in two-dimensional operations, so when multiplying a matrix by a row vector you can use either shape (n,) or shape (1, n); the result is the same. If you need a column vector, there are several ways to get one from a one-dimensional array, but surprisingly, transposition is not one of them.

Two operations turn a one-dimensional array into a two-dimensional one: reshaping with reshape and indexing with newaxis:

Here the -1 argument tells reshape to calculate one of the dimension sizes automatically, and None in the square brackets is a shortcut for np.newaxis, which adds an empty axis at the designated place.
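The shapes involved can be checked directly on a three-element example vector:

```python
import numpy as np

a = np.array([1, 2, 3])

still_1d = a.T                 # transposing a 1D array changes nothing: shape (3,)

col_reshape = a.reshape(-1, 1) # shape (3, 1); -1 means "work this size out"
row_reshape = a.reshape(1, -1) # shape (1, 3)

col_newaxis = a[:, None]       # shape (3, 1); None is a shortcut for np.newaxis
row_newaxis = a[None, :]       # shape (1, 3)
```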

So NumPy has three kinds of vectors: one-dimensional vectors, two-dimensional row vectors, and two-dimensional column vectors. The figure below shows how to convert between them:

Conversions between one-dimensional vectors, two-dimensional row vectors, and two-dimensional column vectors. By the rules of broadcasting, a one-dimensional array is implicitly interpreted as a two-dimensional row vector, so it is generally unnecessary to convert between the two; the corresponding area is therefore shaded.

Matrix manipulations

There are two main functions for joining arrays:

Both functions work fine when stacking only matrices or only vectors, but when you need to stack a one-dimensional array and a matrix, only vstack works as expected: hstack raises a dimension mismatch error because, as described above, the one-dimensional array is interpreted as a row vector rather than a column vector. The workaround is either to convert it to a column vector or to use the column_stack function, which does it automatically:
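A sketch of the mismatch and both workarounds, with a made-up 2x2 matrix and a two-element vector:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
v = np.array([5, 6])

below = np.vstack((a, v))               # v is treated as a row: shape (3, 2)

try:
    np.hstack((a, v))                   # 2D next to 1D: dimension mismatch
    hstack_failed = False
except ValueError:
    hstack_failed = True

right_a = np.hstack((a, v[:, None]))    # convert v to a column first
right_b = np.column_stack((a, v))       # or let column_stack do it
```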

The inverse of stacking is splitting:
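For example, splitting an illustrative 3x4 matrix at a given column or row index:

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)

left, right = np.hsplit(a, [1])    # split before column 1: shapes (3, 1) and (3, 3)
top, bottom = np.vsplit(a, [2])    # split before row 2: shapes (2, 4) and (1, 4)
```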

A matrix can be replicated in two ways: tile acts like copy-pasting, and repeat acts like collated printing:
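The two replication styles side by side, on a small example matrix:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

tiled = np.tile(a, (2, 3))             # the whole block is pasted 2x3 times: shape (4, 6)
repeated = np.repeat(a, 3, axis=1)     # each column is repeated 3 times in a row

print(tiled[0])       # [1 2 1 2 1 2]
print(repeated[0])    # [1 1 1 2 2 2]
```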

delete removes specific rows and columns:

The inverse of deletion is insertion, namely insert:
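Both functions return new arrays and leave the original untouched. A sketch on an example 3x4 matrix:

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)

no_row1 = np.delete(a, 1, axis=0)          # drop row 1
no_cols = np.delete(a, [1, 3], axis=1)     # drop columns 1 and 3

with_zeros = np.insert(a, 1, 0, axis=1)    # insert a column of zeros before column 1
```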

Like hstack, the append function cannot automatically transpose one-dimensional arrays, so once again you need to reshape the vector, add a dimension, or use column_stack instead:

In fact, if all you need is to add constant values to the borders of the array, the (slightly overcomplicated) pad function should suffice:
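A minimal sketch of pad with a uniform border and with per-side widths (the fill values are arbitrary):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

framed = np.pad(a, 1)     # one layer of zeros on every side: shape (4, 4)

# ((rows before, rows after), (columns before, columns after))
wide = np.pad(a, ((0, 0), (1, 2)), constant_values=9)
```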

Meshgrids

Broadcasting rules make working with meshgrids simpler. Suppose you need the following matrix (but of a very large size):

Creating a matrix the C way and the Python way

Both methods are slow because they use Python loops. The MATLAB way of dealing with such problems is to create a meshgrid:

Creating a meshgrid the MATLAB way

The meshgrid function accepts an arbitrary set of indices as input, mgrid accepts only slices, and indices can only generate complete index ranges. fromfunction calls the provided function just once, with the I and J arguments described above.

But NumPy actually has a better way. There is no need to spend memory on the whole I and J matrices. It is enough to store vectors of the right shape, and broadcasting rules take care of the rest.

Creating a meshgrid the NumPy way

Without the indexing='ij' argument, meshgrid changes the order of the arguments: J, I = np.meshgrid(j, i). This is the 'xy' mode, which is useful for visualizing 3D plots.
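The dense and the broadcast-based approaches can be compared on a small grid. The target matrix A[i, j] = i - j is an assumption chosen for illustration:

```python
import numpy as np

i = np.arange(3)
j = np.arange(4)

# Dense index matrices, MATLAB style.
I, J = np.meshgrid(i, j, indexing='ij')     # both have shape (3, 4)
dense = I - J

# Broadcasting: a column of i values against a row of j values.
sparse = i[:, None] - j[None, :]

# Default 'xy' mode: arguments and results come in swapped order.
J_xy, I_xy = np.meshgrid(j, i)
```

Only two short vectors are kept in memory in the broadcasting version; the full 3x4 result is produced directly.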

Matrix statistics

Just like sum, the functions min, max, argmin, argmax, mean, std, var, and all the other statistical functions accept the axis argument and compute their statistics accordingly:

Three examples of statistical functions. To avoid clashing with Python's built-in min, the corresponding NumPy function is also available as np.amin.

In two and more dimensions, argmin and argmax return the first occurrence of the minimum and maximum, but somewhat inconveniently as a flattened index. To convert it to two coordinates, use the unravel_index function:

Example of using unravel_index
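A sketch of the conversion on an illustrative 2x3 matrix:

```python
import numpy as np

a = np.array([[4, 8, 5],
              [9, 3, 1]])

flat = a.argmin()                            # 5: position in the flattened array
coords = np.unravel_index(flat, a.shape)     # row 1, column 2
row, col = (int(k) for k in coords)

print(flat, (row, col))    # 5 (1, 2)
```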

The all and any functions also accept the axis argument:

Example of using all and any
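For instance, testing positivity per column and per row of a made-up matrix:

```python
import numpy as np

a = np.array([[1, 0, 3],
              [4, 5, 6]])
positive = a > 0

cols_all = positive.all(axis=0)   # [ True False  True]: column 1 contains a zero
rows_all = positive.all(axis=1)   # [False  True]
rows_any = positive.any(axis=1)   # [ True  True]
```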

Matrix sorting

The axis argument is useful for the functions listed above, but it is of little help for sorting:

Sorting with Python lists vs. NumPy arrays

This is usually not what you want when sorting a matrix or a spreadsheet: axis is no substitute for the key argument. Fortunately, NumPy has several helpers that allow sorting by a column, or by several columns if required:

1. a[a[:,0].argsort()] sorts the array by the first column:

Here argsort returns the indices that would sort the original array.

This trick can be repeated, but care must be taken that the next sort does not disturb the result of the previous one:

a = a[a[:,2].argsort()]

a = a[a[:,1].argsort(kind='stable')]

a = a[a[:,0].argsort(kind='stable')]

2. The lexsort function can sort by all the columns in the way described above, but it always operates on rows, and the order of the rows to sort by is reversed (bottom to top), so its usage is a bit contrived, for example:

- a[np.lexsort(np.flipud(a[:,[2,5]].T))] sorts by column 2 first, and then (where the values in column 2 are equal) by column 5.

- a[np.lexsort(np.flipud(a.T))] sorts by all columns in left-to-right order.

Here flipud flips the matrix in the up-down direction (to be precise, along axis=0, the same as a[::-1,...], where the three dots mean "all other dimensions"), so it is flipud, not fliplr, that unexpectedly flips one-dimensional arrays.

3. sort also has an order argument, but it is neither fast nor easy to use if you start with an ordinary (unstructured) array.

4. Doing it in pandas may be a better option, since this particular operation is far more readable and less error-prone there:

- pd.DataFrame(a).sort_values(by=[2,5]).to_numpy() sorts by column 2 first, then by column 5.

- pd.DataFrame(a).sort_values(by=list(range(a.shape[1]))).to_numpy() sorts by all columns in left-to-right order.
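The NumPy-only options from the list above can be sketched on a small made-up matrix (the pandas variants are omitted to keep the example dependency-free):

```python
import numpy as np

a = np.array([[3, 1, 2],
              [1, 2, 9],
              [3, 0, 5],
              [1, 2, 4]])

# Sort rows by the first column; 'stable' keeps the original order of ties.
by_first = a[a[:, 0].argsort(kind='stable')]

# lexsort treats the LAST key as the primary one, hence the flipud.
by_all = a[np.lexsort(np.flipud(a.T))]                  # columns 0, 1, 2 in that order
by_0_then_2 = a[np.lexsort(np.flipud(a[:, [0, 2]].T))]  # column 0, ties broken by column 2

print(by_all)
```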

Three dimensions and higher

When you create a 3D array by reshaping a one-dimensional vector or by converting a nested Python list, the meaning of the indices is (z,y,x). The first index is the number of the plane, followed by the coordinates within that plane:

Illustration of the (z,y,x) order

This index order is convenient, for example, for keeping a batch of grayscale images: a[i] is a shortcut for referring to the i-th image.

But this index order is not universal. When working with RGB images, the (y,x,z) order is usually used: first the two pixel coordinates, and last the color coordinate (RGB in Matplotlib, BGR in OpenCV):

Illustration of the (y,x,z) order

This way it is convenient to refer to a specific pixel: a[i,j] gives the RGB tuple at position (i,j).

So the actual command to create a certain geometry depends on the conventions of your field:

Creating generic 3D arrays and RGB images
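A sketch of both conventions; the sizes (2 planes of 3x4, and a 3x4-pixel image) are arbitrary:

```python
import numpy as np

# Generic (z, y, x) layout: 2 planes, each 3 rows by 4 columns.
planes = np.arange(24).reshape(2, 3, 4)
first_plane = planes[0]                        # shape (3, 4)

# RGB (y, x, z) layout: 3 rows by 4 columns of pixels, 3 color channels each.
image = np.zeros((3, 4, 3), dtype=np.uint8)
image[1, 2] = (255, 0, 0)                      # set one pixel to red
pixel = image[1, 2]                            # shape (3,): the RGB triple
```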

Obviously, functions like hstack, vstack, and dstack are not aware of these conventions. The index order hard-coded into them is (y,x,z), the RGB image order:

Illustration of how NumPy uses the (y,x,z) order when stacking RGB images (only two colors here)

If your data is laid out differently, it is more convenient to stack images with the concatenate command, passing it the explicit index number in the axis argument:

Stacking generic 3D arrays
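The resulting shapes show which axis each function grows (array contents are placeholders):

```python
import numpy as np

a = np.zeros((2, 3, 4))
b = np.ones((2, 3, 4))

along_0 = np.concatenate((a, b), axis=0)    # shape (4, 3, 4)
along_1 = np.concatenate((a, b), axis=1)    # shape (2, 6, 4)
along_2 = np.concatenate((a, b), axis=2)    # shape (2, 3, 8)

# For 3D inputs the stack helpers correspond to fixed axes.
v = np.vstack((a, b))    # same as axis=0
h = np.hstack((a, b))    # same as axis=1
d = np.dstack((a, b))    # same as axis=2
```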

If you are not comfortable thinking in terms of axis numbers, you can convert the array to the form hard-coded into hstack and its relatives:

Illustration of converting an array to the form hard-coded into hstack

This conversion is cheap: no actual copying takes place; it just shuffles the order of the indices on the fly.
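One way to perform such a conversion is np.moveaxis; this particular function is an assumption of the sketch, since the figure that showed the original command is not available here:

```python
import numpy as np

planes = np.arange(24).reshape(2, 3, 4)      # (z, y, x) layout

# Move the plane index to the end: (z, y, x) -> (y, x, z).
as_image = np.moveaxis(planes, 0, -1)        # shape (3, 4, 2)

print(as_image.shape)
print(np.shares_memory(planes, as_image))    # True: a view, nothing is copied
```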

Another operation that shuffles the index order is array transposition. Examining it may make you more comfortable with 3D arrays. Depending on the axis order you have settled on, the actual command to transpose all planes of your array differs: for generic arrays it swaps indices 1 and 2, and for RGB images it swaps indices 0 and 1:

Commands for transposing all planes of a 3D array

Interestingly, the default axes argument of transpose (as well as the only mode of operation of a.T) reverses the index order, which coincides with neither of the index order conventions described above.
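The three variants can be compared by their shapes on an example array:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

generic = a.transpose(0, 2, 1)   # (z, y, x) layout: swap axes 1 and 2 -> shape (2, 4, 3)
rgb = a.transpose(1, 0, 2)       # (y, x, z) layout: swap axes 0 and 1 -> shape (3, 2, 4)
default = a.T                    # all axes reversed                   -> shape (4, 3, 2)
```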

Finally, there is a function that can save you a lot of Python loops when dealing with multidimensional arrays and can make your code cleaner: einsum (Einstein summation):

It sums the arrays along the repeated indices. In this particular example, np.tensordot(a, b, axes=1) would suffice in both cases, but in more complex cases einsum may work faster and is generally easier to both write and read, once you understand the logic behind it.
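As a minimal sketch of the notation (not the example from the original figure), a matrix product written with einsum sums over the repeated index j:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# 'ij,jk->ik': j appears in both inputs and not in the output, so it is summed over.
product = np.einsum('ij,jk->ik', a, b)

# 'ij->i': j is dropped from the output, giving row sums.
row_sums = np.einsum('ij->i', a)        # [ 3 12]

same_matmul = np.array_equal(product, a @ b)
same_tensordot = np.array_equal(product, np.tensordot(a, b, axes=1))
```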

If you want to test your NumPy skills, there is a set of 100 rather challenging exercises on GitHub: https://github.com/rougier/numpy-100.

What is your favorite NumPy function? Please share it with us!

Original article: https://medium.com/better-programming/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d
