NumPy (Numerical Python)
An open-source scientific computing library for Python.
Simpler code: operations work on whole arrays, so matrix code stays concise.
Better performance: more efficient storage and input/output.
NumPy is the foundation of the Python scientific computing stack.
Measure the execution time of a statement (IPython magic):
%timeit statement_to_time
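Outside IPython, the `%timeit` magic is not available; a minimal sketch of the same measurement with the standard-library `timeit` module (the array sizes and the sum comparison are illustrative choices, not from the original notes):

```python
import timeit

# Time a vectorized NumPy sum against a pure-Python sum over the same data.
setup = "import numpy as np; arr = np.arange(10_000); lst = list(range(10_000))"
t_numpy = timeit.timeit("arr.sum()", setup=setup, number=1000)
t_python = timeit.timeit("sum(lst)", setup=setup, number=1000)
print(f"numpy: {t_numpy:.4f}s  python: {t_python:.4f}s")
```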
All elements of an array have the same type.
Creating arrays with np and np.random:
import numpy as np
x = np.array([1,2,3,4,5,6,7,8])
X = np.array(
[
[1,2,3,4],
[5,6,7,8]
]
)
np.arange([start,] stop[, step,], dtype=None)
np.ones / np.zeros / np.empty (uninitialized values) / np.full (specified value), all taking (shape, dtype=None, order='C')
# for example
np.ones(10)
np.ones((2,3))
# ones_like creates an array of ones with the same shape as x
np.ones_like(x)
# full creates an array filled with the specified value
np.full(10, 666)
np.full((2,4), 666)
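The creation functions above can be checked directly; a short runnable sketch of the shapes and fill values:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])

ones = np.ones((2, 3))         # 2x3 array of 1.0
zeros = np.zeros(5)            # five 0.0 values
filled = np.full((2, 4), 666)  # 2x4 array where every element is 666
same_shape = np.ones_like(x)   # ones with the same shape (and dtype) as x

print(ones.shape, zeros.shape, filled.shape, same_shape.shape)
```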
np.random.randn(d0, d1, ..., dn)
# generates an array of the given shape filled with standard-normal random numbers
B = np.random.randn(2,5)
# reshape changes an array's shape
A = np.arange(10).reshape(2, 5)
# reshape directly transforms the shape into 2x5
# Binary operations are applied element by element
Positive indices count from 0, left to right; negative indices count from -1, right to left.
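The randn/reshape calls and the element-wise rule above can be verified in a few lines (variable names follow the notes):

```python
import numpy as np

B = np.random.randn(2, 5)        # standard-normal samples, shape (2, 5)
A = np.arange(10).reshape(2, 5)  # 0..9 laid out as 2 rows of 5
C = A + B                        # binary operations apply element by element

print(B.shape, A.shape, C.shape)
```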
You can index with slices:
X[0][0]
X[0, 0]  # same as above
X[2]     # select the row at index 2
X[:-1]   # select multiple rows (all but the last)
X[:, 2]  # two-dimensional selection: the column at index 2
Modifying a NumPy slice modifies the original array, because slices are views.
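The view behavior of slices is easy to demonstrate, along with `.copy()` as the way to opt out of it:

```python
import numpy as np

x = np.arange(10)
s = x[2:5]   # a slice is a view, not a copy
s[0] = 99    # modifying the slice modifies the original array
print(x)     # the element at index 2 is now 99

c = x[2:5].copy()  # use .copy() when an independent array is needed
c[0] = -1
print(x[2])        # still 99: writing to the copy does not affect x
```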
Indexing with an array of integers is called fancy indexing.
x = np.arange(10)
x[[3, 4, 7]]  # returns array([3, 4, 7])
indexs = np.array([[0, 2], [1, 3]])
x[indexs]
# returns an array with the same shape as the index array, indexed by subscript
X[[0, 2], :]  # select multiple rows
X[:, [0, 2]]  # select multiple columns; the row slice cannot be omitted
X[[0, 2, 3], [1, 3, 4]]  # row and column lists are paired: returns the elements at positions (0,1), (2,3), (3,4)
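A sketch of fancy indexing on a concrete 4x5 array (the shape is chosen here so the paired row/column example is valid):

```python
import numpy as np

x = np.arange(10)
print(x[[3, 4, 7]])  # fancy indexing with a list -> array([3, 4, 7])

X = np.arange(20).reshape(4, 5)
rows = X[[0, 2], :]              # rows 0 and 2, shape (2, 5)
elems = X[[0, 2, 3], [1, 3, 4]]  # elements at (0,1), (2,3), (3,4)
print(rows.shape, elems)
```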
You can filter with boolean conditions:
x = np.arange(10)
x > 5
# returns a boolean array of 10 elements (True or False)
x[x > 5]  # returns the elements greater than 5
x[x < 5] += 20  # in-place update of the selected elements
# Boolean filtering on a two-dimensional array:
X > 5
# X[X > 5] returns a one-dimensional array: boolean indexing flattens the result
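The boolean-filtering steps above, runnable end to end:

```python
import numpy as np

x = np.arange(10)
mask = x > 5
print(mask)     # boolean array of 10 True/False values
print(x[mask])  # array([6, 7, 8, 9])

x[x < 5] += 20  # in-place update of the selected elements
print(x)        # first five elements are now 20..24

X = np.arange(20).reshape(4, 5)
flat = X[X > 5]  # boolean indexing on 2D returns a flattened 1D array
print(flat.ndim)
```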
For aggregations such as sum and mean, axis=0 operates down the rows (one result per column) and axis=1 operates across the columns (one result per row).
Standardization: A = (A - mean(A, axis=0)) / std(A, axis=0)
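A small sketch checking both the axis semantics and the standardization formula (the 3x4 array is an illustrative example):

```python
import numpy as np

A = np.arange(12).reshape(3, 4).astype(float)

col_sums = A.sum(axis=0)  # axis=0 aggregates down the rows -> one value per column
row_sums = A.sum(axis=1)  # axis=1 aggregates across the columns -> one value per row
print(col_sums, row_sums)

# Standardize each column: subtract the column mean, divide by the column std
A_std = (A - A.mean(axis=0)) / A.std(axis=0)
print(A_std.mean(axis=0))  # ~0 for every column
print(A_std.std(axis=0))   # ~1 for every column
```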
import numpy as np
arr = np.random.randint(1, 10000, size=int(1e8))
arr[arr > 5000]
arr[np.newaxis, :]  # add a row dimension
arr[:, np.newaxis]  # add a column dimension
np.expand_dims(arr, axis=0)  # same as arr[np.newaxis, :]
np.reshape(arr, (1, -1))  # reshape to a single row
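The three dimension-adding spellings are equivalent; a sketch on a small array (five elements instead of 1e8, so the shapes are easy to read):

```python
import numpy as np

arr = np.arange(5)  # shape (5,)

row = arr[np.newaxis, :]                # shape (1, 5): added a row dimension
col = arr[:, np.newaxis]                # shape (5, 1): added a column dimension
expanded = np.expand_dims(arr, axis=0)  # same result as arr[np.newaxis, :]
reshaped = np.reshape(arr, (1, 5))      # explicit reshape to one row

print(row.shape, col.shape, expanded.shape, reshaped.shape)
```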
Adding multiple rows or multiple columns:
np.concatenate(array_list, axis=0 or 1)  # merge along the specified axis
np.vstack / np.row_stack(array_list)  # merge data row-wise
np.hstack / np.column_stack(array_list)  # merge data column-wise
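A sketch of the merge functions on two 2x3 arrays, confirming that vstack/hstack match concatenate along axis 0 and 1:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

rows = np.concatenate([a, b], axis=0)  # stack vertically -> shape (4, 3)
cols = np.concatenate([a, b], axis=1)  # stack horizontally -> shape (2, 6)

# vstack/hstack are shortcuts for the same operations
assert (np.vstack([a, b]) == rows).all()
assert (np.hstack([a, b]) == cols).all()
print(rows.shape, cols.shape)
```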
pandas: an open-source Python library for data analysis.
pd.read_csv
pd.read_sql
ratings = pd.read_csv(fpath)
ratings.head()     # view the first few rows
ratings.shape      # view the shape of the data
ratings.columns    # view the list of column names
ratings.index      # view the row index
ratings.dtypes     # view the data type of each column
pru = pd.read_csv(
fpath,
sep = "\t",  # separator
header = None,  # the file has no header row
names = ['pdate', 'pv', 'uv']  # specify the column names
)
puv = pd.read_csv(fpath)
import pymysql
conn = pymysql.connect(
host = '127.0.0.1',
user = 'root',
password = '',
database = 'txy',
charset = 'utf8'
)
mysql_page = pd.read_sql("select * from txy", con = conn)
DataFrame: two-dimensional data; a whole table with multiple rows and columns.
Column index: df.columns
Row index: df.index
Series: one-dimensional data; a single row or column.
A one-dimensional array-like object: a sequence of data (of possibly different types) and an associated array of labels, called the index.
s1 = pd.Series([1, 'a', 5, 2, 7])
# When printed, the index is on the left and the data on the right
s1.index   # get the index
s1.values  # get the data
# Create a Series with a labeled index
s2 = pd.Series([1, 'a', 5.2, 7], index = ['d', 'b', 'a', 'c'])
s3 = pd.Series(sdata_dict)
# the dict keys become the index, the values become the data
s2['a']  # query a single value
s2[['b', 'a']]  # returns a Series
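The three Series constructions above, runnable (the dict contents are an illustrative stand-in for sdata_dict):

```python
import pandas as pd

s1 = pd.Series([1, 'a', 5, 2, 7])                             # default integer index
s2 = pd.Series([1, 'a', 5.2, 7], index=['d', 'b', 'a', 'c'])  # labeled index
s3 = pd.Series({'Ohio': 35000, 'Texas': 71000})               # dict: keys become the index

print(s2['a'])         # single value lookup -> 5.2
print(s2[['b', 'a']])  # multi-label lookup returns a Series
print(s3.index.tolist())
```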
data = {
'state' : ['Ohio', 'Nevada'],
'year' : [2000, 2002],
'pop' : [1.5, 1.7]
}
df = pd.DataFrame(data)
Selecting from a DataFrame yields a Series:
pd.Series
pd.DataFrame
df['year']  # a single column is a Series
df[['year', 'pop']]  # a list of columns is a DataFrame
df.loc[1]  # look up a single row
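A sketch confirming which selections return a Series and which return a DataFrame, using the `data` dict defined above:

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Nevada'],
    'year': [2000, 2002],
    'pop': [1.5, 1.7],
}
df = pd.DataFrame(data)

year = df['year']             # a single column is a Series
subset = df[['year', 'pop']]  # a list of columns is a DataFrame
row = df.loc[1]               # a single row (by label) is also a Series

print(type(year).__name__, type(subset).__name__, type(row).__name__)
```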
df.loc: query by row and column label values; slice intervals are closed on both ends: [start, end]
df.iloc: query by row and column number (position); slices are half-open, as in Python
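The closed-interval behavior of label slicing is the main trap here; a sketch contrasting `.loc` and `.iloc` (the index labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']  # .loc slices are closed: includes BOTH endpoints
by_pos = df.iloc[0:2]       # .iloc slices are half-open, like Python lists

print(len(by_label), len(by_pos))
```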
df.loc[:, "wencha"] = df["bWendu"] - df["yWendu"]
df.loc[:, "wendu_type"] = df.apply(get_wendu_type, axis = 1)
# You can add multiple new columns at the same time
df.assign(
yWendu_huashi = lambda x : x["yWendu"] * 9 / 5 + 32, # could also be a named function
bWendu_huashi = lambda x : x["bWendu"] * 9 / 5 + 32
)
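The three column-adding techniques above in one runnable sketch; the two-row temperature data and the wendu_type rule are illustrative stand-ins for the notes' dataset:

```python
import pandas as pd

df = pd.DataFrame({'bWendu': [30, 20], 'yWendu': [15, 12]})

# One-step .loc assignment of a derived column
df.loc[:, 'wencha'] = df['bWendu'] - df['yWendu']

# apply() with axis=1 calls the function once per row
df.loc[:, 'wendu_type'] = df.apply(
    lambda row: 'hot' if row['bWendu'] > 25 else 'mild', axis=1)

# assign() adds several columns at once and returns a NEW DataFrame
df2 = df.assign(
    yWendu_huashi=lambda x: x['yWendu'] * 9 / 5 + 32,
    bWendu_huashi=lambda x: x['bWendu'] * 9 / 5 + 32,
)
print(df2.columns.tolist())
```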
df['wencha_type'] = ''
df.loc[df["bWendu"] - df["yWendu"] > 10, "wencha_type"] = "big temperature difference"
df.describe()  # summary statistics for all numeric columns
df["fengxiang"].unique()
# lists the distinct values in the column
df["fengxiang"].value_counts()
# counts the occurrences of each value
Correlation coefficient: measures the degree of similarity; 1 is maximum positive correlation, -1 is maximum negative correlation.
Covariance: measures whether two variables move together; positive: they move in the same direction; negative: they move in opposite directions.
df.cov()   # covariance matrix
df.corr()  # correlation matrix
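A sketch with columns constructed to hit the extremes of the correlation coefficient (the column data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],  # y = 2x: perfectly positively correlated
    'z': [4.0, 3.0, 2.0, 1.0],  # z decreases as x increases
})

corr = df.corr()  # correlation matrix, values in [-1, 1]
cov = df.cov()    # covariance matrix

print(round(corr.loc['x', 'y'], 3))  # near 1.0
print(round(corr.loc['x', 'z'], 3))  # near -1.0
```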
Check whether values are null: isnull / notnull
Discard / delete missing values: dropna
Fill in missing values: fillna
st = pd.read_excel("./data.xlsx", skiprows = 2)  # skip the leading blank rows (read_csv cannot read .xlsx)
st.dropna(axis="columns", how="all", inplace=True)
st.fillna({"fraction": 0}, inplace=True)
df[condition]["wencha"] = df["bWendu"] - df["yWendu"]
# equivalent to a get followed by a set: the set writes to a temporary copy, not to df
pandas only allows modify-and-write operations on the source DataFrame, done in one step:
df.loc[condition, "wencha"] = df["bWendu"] - df["yWendu"]  # approach 1: one-step .loc assignment
df_month3 = df[condition].copy()  # approach 2: take an explicit copy first
df_month3["wencha"] = df_month3["bWendu"] - df_month3["yWendu"]
pandas does not allow filtering out a sub-DataFrame first and then modifying and writing to it.
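The two allowed patterns above, runnable on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'bWendu': [30, 20, 10], 'yWendu': [15, 12, 5]})
condition = df['bWendu'] > 15

# Approach 1: assign through .loc on the source DataFrame in one step
df.loc[condition, 'wencha'] = df['bWendu'] - df['yWendu']

# Approach 2: take an explicit copy first, then write to the copy
sub = df[condition].copy()
sub['wencha2'] = sub['bWendu'] - sub['yWendu']

print(df)
print(sub)
```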
inplace: whether to modify the original DataFrame
by: the column(s) to sort by
str.startswith(): tests whether each string starts with the given prefix
str.replace(): replaces substrings
df.set_index("userId", inplace = True, drop = False)
# drop = False keeps the index column among the data columns
df.index
# query the index
pandas automatically optimizes queries according to the index type (a sorted index is fastest)
merge: joins different tables into one on a key, similar to a SQL join.
concat: batch-merges same-format Excel files; adds rows or columns to a DataFrame.
groupby: performs aggregation similar to SQL's GROUP BY.
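A sketch of all three operations on small illustrative tables (the users/ratings data are assumptions, not from the notes):

```python
import pandas as pd

users = pd.DataFrame({'userId': [1, 2], 'name': ['Ann', 'Bob']})
ratings = pd.DataFrame({'userId': [1, 1, 2], 'rating': [5, 3, 4]})

# merge: join two tables on a key, like a SQL JOIN
joined = pd.merge(ratings, users, on='userId')

# concat: append same-format tables row-wise
combined = pd.concat([ratings, ratings], ignore_index=True)

# groupby: aggregate per key, like SQL GROUP BY
mean_rating = joined.groupby('name')['rating'].mean()

print(joined.shape, combined.shape)
print(mean_rating)
```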