NumPy (Numerical Python)
An open-source scientific computing library for Python.
Simpler code: operations work on whole arrays, so matrix code stays concise.
Better performance: more efficient storage and input/output.
NumPy is the foundation of the Python scientific computing stack.
Measure the execution time of a statement (IPython magic):
%timeit statement_to_time
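Outside IPython, the `%timeit` magic is not available; a minimal sketch of the same measurement with the standard-library `timeit` module (the array sizes and the sum comparison are illustrative choices, not from the original notes):

```python
import timeit

# Time a vectorized NumPy sum against a pure-Python sum over the same data.
setup = "import numpy as np; arr = np.arange(10_000); lst = list(range(10_000))"
t_numpy = timeit.timeit("arr.sum()", setup=setup, number=1000)
t_python = timeit.timeit("sum(lst)", setup=setup, number=1000)
print(f"numpy: {t_numpy:.4f}s  python: {t_python:.4f}s")
```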
All elements of an array have the same type.
Creating arrays with np and np.random:
import numpy as np
x = np.array([1,2,3,4,5,6,7,8])
X = np.array(
[
[1,2,3,4],
[5,6,7,8]
]
)
np.arange([start,] stop[, step,], dtype=None)
np.ones / np.zeros / np.empty (uninitialized values) / np.full (specified value), all taking (shape, dtype=None, order='C')
# for example
np.ones(10)
np.ones((2,3))
# ones_like creates an array of ones with the same shape as x
np.ones_like(x)
# full creates an array filled with the specified value
np.full(10, 666)
np.full((2,4), 666)
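The creation functions above can be checked directly; a short runnable sketch of the shapes and fill values:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])

ones = np.ones((2, 3))         # 2x3 array of 1.0
zeros = np.zeros(5)            # five 0.0 values
filled = np.full((2, 4), 666)  # 2x4 array where every element is 666
same_shape = np.ones_like(x)   # ones with the same shape (and dtype) as x

print(ones.shape, zeros.shape, filled.shape, same_shape.shape)
```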
np.random.randn(d0, d1, ..., dn)
# generates an array of the given shape filled with standard-normal random numbers
B = np.random.randn(2,5)
# reshape changes an array's shape
A = np.arange(10).reshape(2, 5)
# reshape directly transforms the shape into 2x5
# Binary operations are applied element by element
Positive indices count from 0, left to right; negative indices count from -1, right to left.
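The randn/reshape calls and the element-wise rule above can be verified in a few lines (variable names follow the notes):

```python
import numpy as np

B = np.random.randn(2, 5)        # standard-normal samples, shape (2, 5)
A = np.arange(10).reshape(2, 5)  # 0..9 laid out as 2 rows of 5
C = A + B                        # binary operations apply element by element

print(B.shape, A.shape, C.shape)
```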
You can index with slices:
X[0][0]
X[0, 0]  # same as above
X[2]     # select the row at index 2
X[:-1]   # select multiple rows (all but the last)
X[:, 2]  # two-dimensional selection: the column at index 2
Modifying a NumPy slice modifies the original array, because slices are views.
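The view behavior of slices is easy to demonstrate, along with `.copy()` as the way to opt out of it:

```python
import numpy as np

x = np.arange(10)
s = x[2:5]   # a slice is a view, not a copy
s[0] = 99    # modifying the slice modifies the original array
print(x)     # the element at index 2 is now 99

c = x[2:5].copy()  # use .copy() when an independent array is needed
c[0] = -1
print(x[2])        # still 99: writing to the copy does not affect x
```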
Indexing with an array of integers is called fancy indexing.
x = np.arange(10)
x[[3, 4, 7]]  # returns array([3, 4, 7])
indexs = np.array([[0, 2], [1, 3]])
x[indexs]
# returns an array with the same shape as the index array, indexed by subscript
X[[0, 2], :]  # select multiple rows
X[:, [0, 2]]  # select multiple columns; the row slice cannot be omitted
X[[0, 2, 3], [1, 3, 4]]  # row and column lists are paired: returns the elements at positions (0,1), (2,3), (3,4)
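A sketch of fancy indexing on a concrete 4x5 array (the shape is chosen here so the paired row/column example is valid):

```python
import numpy as np

x = np.arange(10)
print(x[[3, 4, 7]])  # fancy indexing with a list -> array([3, 4, 7])

X = np.arange(20).reshape(4, 5)
rows = X[[0, 2], :]              # rows 0 and 2, shape (2, 5)
elems = X[[0, 2, 3], [1, 3, 4]]  # elements at (0,1), (2,3), (3,4)
print(rows.shape, elems)
```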
You can filter with boolean conditions:
x = np.arange(10)
x > 5
# returns a boolean array of 10 elements (True or False)
x[x > 5]  # returns the elements greater than 5
x[x < 5] += 20  # in-place update of the selected elements
# Boolean filtering on a two-dimensional array:
X > 5
# X[X > 5] returns a one-dimensional array: boolean indexing flattens the result
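The boolean-filtering steps above, runnable end to end:

```python
import numpy as np

x = np.arange(10)
mask = x > 5
print(mask)     # boolean array of 10 True/False values
print(x[mask])  # array([6, 7, 8, 9])

x[x < 5] += 20  # in-place update of the selected elements
print(x)        # first five elements are now 20..24

X = np.arange(20).reshape(4, 5)
flat = X[X > 5]  # boolean indexing on 2D returns a flattened 1D array
print(flat.ndim)
```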
For aggregations such as sum and mean, axis=0 operates down the rows (one result per column) and axis=1 operates across the columns (one result per row).
Standardization: A = (A - mean(A, axis=0)) / std(A, axis=0)
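A small sketch checking both the axis semantics and the standardization formula (the 3x4 array is an illustrative example):

```python
import numpy as np

A = np.arange(12).reshape(3, 4).astype(float)

col_sums = A.sum(axis=0)  # axis=0 aggregates down the rows -> one value per column
row_sums = A.sum(axis=1)  # axis=1 aggregates across the columns -> one value per row
print(col_sums, row_sums)

# Standardize each column: subtract the column mean, divide by the column std
A_std = (A - A.mean(axis=0)) / A.std(axis=0)
print(A_std.mean(axis=0))  # ~0 for every column
print(A_std.std(axis=0))   # ~1 for every column
```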
import numpy as np
arr = np.random.randint(1, 10000, size=int(1e8))
arr[arr > 5000]
arr[np.newaxis, :]  # add a row dimension
arr[:, np.newaxis]  # add a column dimension
np.expand_dims(arr, axis=0)  # same as arr[np.newaxis, :]
np.reshape(arr, (1, -1))  # reshape to a single row
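The three dimension-adding spellings are equivalent; a sketch on a small array (five elements instead of 1e8, so the shapes are easy to read):

```python
import numpy as np

arr = np.arange(5)  # shape (5,)

row = arr[np.newaxis, :]                # shape (1, 5): added a row dimension
col = arr[:, np.newaxis]                # shape (5, 1): added a column dimension
expanded = np.expand_dims(arr, axis=0)  # same result as arr[np.newaxis, :]
reshaped = np.reshape(arr, (1, 5))      # explicit reshape to one row

print(row.shape, col.shape, expanded.shape, reshaped.shape)
```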
Adding multiple rows or multiple columns:
np.concatenate(array_list, axis=0 or 1)  # merge along the specified axis
np.vstack / np.row_stack(array_list)  # merge data row-wise
np.hstack / np.column_stack(array_list)  # merge data column-wise
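A sketch of the merge functions on two 2x3 arrays, confirming that vstack/hstack match concatenate along axis 0 and 1:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

rows = np.concatenate([a, b], axis=0)  # stack vertically -> shape (4, 3)
cols = np.concatenate([a, b], axis=1)  # stack horizontally -> shape (2, 6)

# vstack/hstack are shortcuts for the same operations
assert (np.vstack([a, b]) == rows).all()
assert (np.hstack([a, b]) == cols).all()
print(rows.shape, cols.shape)
```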
pandas: an open-source Python library for data analysis.
pd.read_csv
pd.read_sql
ratings = pd.read_csv(fpath)
ratings.head()     # view the first few rows
ratings.shape      # view the shape of the data
ratings.columns    # view the list of column names
ratings.index      # view the row index
ratings.dtypes     # view the data type of each column
pru = pd.read_csv(
fpath,
sep = "\t",  # separator
header = None,  # the file has no header row
names = ['pdate', 'pv', 'uv']  # specify the column names
)
puv = pd.read_csv(fpath)
import pymysql
conn = pymysql.connect(
host = '127.0.0.1',
user = 'root',
password = '',
database = 'txy',
charset = 'utf8'
)
mysql_page = pd.read_sql("select * from txy", con = conn)
DataFrame: two-dimensional data; a whole table with multiple rows and columns.
Column index: df.columns
Row index: df.index
Series: one-dimensional data; a single row or column.
A one-dimensional array-like object: a sequence of data (of possibly different types) and an associated array of labels, called the index.
s1 = pd.Series([1, 'a', 5, 2, 7])
# When printed, the index is on the left and the data on the right
s1.index   # get the index
s1.values  # get the data
# Create a Series with a labeled index
s2 = pd.Series([1, 'a', 5.2, 7], index = ['d', 'b', 'a', 'c'])
s3 = pd.Series(sdata_dict)
# the dict keys become the index, the values become the data
s2['a']  # query a single value
s2[['b', 'a']]  # returns a Series
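The three Series constructions above, runnable (the dict contents are an illustrative stand-in for sdata_dict):

```python
import pandas as pd

s1 = pd.Series([1, 'a', 5, 2, 7])                             # default integer index
s2 = pd.Series([1, 'a', 5.2, 7], index=['d', 'b', 'a', 'c'])  # labeled index
s3 = pd.Series({'Ohio': 35000, 'Texas': 71000})               # dict: keys become the index

print(s2['a'])         # single value lookup -> 5.2
print(s2[['b', 'a']])  # multi-label lookup returns a Series
print(s3.index.tolist())
```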
data = {
'state' : ['Ohio', 'Nevada'],
'year' : [2000, 2002],
'pop' : [1.5, 1.7]
}
df = pd.DataFrame(data)
Selecting from a DataFrame yields a Series:
pd.Series
pd.DataFrame
df['year']  # a single column is a Series
df[['year', 'pop']]  # a list of columns is a DataFrame
df.loc[1]  # look up a single row
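A sketch confirming which selections return a Series and which return a DataFrame, using the `data` dict defined above:

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Nevada'],
    'year': [2000, 2002],
    'pop': [1.5, 1.7],
}
df = pd.DataFrame(data)

year = df['year']             # a single column is a Series
subset = df[['year', 'pop']]  # a list of columns is a DataFrame
row = df.loc[1]               # a single row (by label) is also a Series

print(type(year).__name__, type(subset).__name__, type(row).__name__)
```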
df.loc: query by row and column label values; slice intervals are closed on both ends: [start, end]
df.iloc: query by row and column number (position); slices are half-open, as in Python
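The closed-interval behavior of label slicing is the main trap here; a sketch contrasting `.loc` and `.iloc` (the index labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']  # .loc slices are closed: includes BOTH endpoints
by_pos = df.iloc[0:2]       # .iloc slices are half-open, like Python lists

print(len(by_label), len(by_pos))
```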
df.loc[:, "wencha"] = df["bWendu"] - df["yWendu"]
df.loc[:, "wendu_type"] = df.apply(get_wendu_type, axis = 1)
# You can add multiple new columns at the same time
df.assign(
yWendu_huashi = lambda x : x["yWendu"] * 9 / 5 + 32, # could also be a named function
bWendu_huashi = lambda x : x["bWendu"] * 9 / 5 + 32
)
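The three column-adding techniques above in one runnable sketch; the two-row temperature data and the wendu_type rule are illustrative stand-ins for the notes' dataset:

```python
import pandas as pd

df = pd.DataFrame({'bWendu': [30, 20], 'yWendu': [15, 12]})

# One-step .loc assignment of a derived column
df.loc[:, 'wencha'] = df['bWendu'] - df['yWendu']

# apply() with axis=1 calls the function once per row
df.loc[:, 'wendu_type'] = df.apply(
    lambda row: 'hot' if row['bWendu'] > 25 else 'mild', axis=1)

# assign() adds several columns at once and returns a NEW DataFrame
df2 = df.assign(
    yWendu_huashi=lambda x: x['yWendu'] * 9 / 5 + 32,
    bWendu_huashi=lambda x: x['bWendu'] * 9 / 5 + 32,
)
print(df2.columns.tolist())
```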
df['wencha_type'] = ''
df.loc[df["bWendu"] - df["yWendu"] > 10, "wencha_type"] = "big temperature difference"
df.describe()  # summary statistics for all numeric columns
df["fengxiang"].unique()
# lists the distinct values in the column
df["fengxiang"].value_counts()
# counts the occurrences of each value
Correlation coefficient: measures the degree of similarity; 1 is maximum positive correlation, -1 is maximum negative correlation.
Covariance: measures whether two variables move together; positive: they move in the same direction; negative: they move in opposite directions.
df.cov()   # covariance matrix
df.corr()  # correlation matrix
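A sketch with columns constructed to hit the extremes of the correlation coefficient (the column data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],  # y = 2x: perfectly positively correlated
    'z': [4.0, 3.0, 2.0, 1.0],  # z decreases as x increases
})

corr = df.corr()  # correlation matrix, values in [-1, 1]
cov = df.cov()    # covariance matrix

print(round(corr.loc['x', 'y'], 3))  # near 1.0
print(round(corr.loc['x', 'z'], 3))  # near -1.0
```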
Check whether values are null: isnull / notnull
Discard / delete missing values: dropna
Fill in missing values: fillna
st = pd.read_excel("./data.xlsx", skiprows = 2)  # skip the leading blank rows (read_csv cannot read .xlsx)
st.dropna(axis="columns", how="all", inplace=True)
st.fillna({"fraction": 0}, inplace=True)
df[condition]["wencha"] = df["bWendu"] - df["yWendu"]
# equivalent to a get followed by a set: the set writes to a temporary copy, not to df
pandas only allows modify-and-write operations on the source DataFrame, done in one step:
df.loc[condition, "wencha"] = df["bWendu"] - df["yWendu"]  # approach 1: one-step .loc assignment
df_month3 = df[condition].copy()  # approach 2: take an explicit copy first
df_month3["wencha"] = df_month3["bWendu"] - df_month3["yWendu"]
pandas does not allow filtering out a sub-DataFrame first and then modifying and writing to it.
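The two allowed patterns above, runnable on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'bWendu': [30, 20, 10], 'yWendu': [15, 12, 5]})
condition = df['bWendu'] > 15

# Approach 1: assign through .loc on the source DataFrame in one step
df.loc[condition, 'wencha'] = df['bWendu'] - df['yWendu']

# Approach 2: take an explicit copy first, then write to the copy
sub = df[condition].copy()
sub['wencha2'] = sub['bWendu'] - sub['yWendu']

print(df)
print(sub)
```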
inplace: whether to modify the original DataFrame
by: the column(s) to sort by
str.startswith(): tests whether each string starts with the given prefix
str.replace(): replaces substrings
df.set_index("userId", inplace = True, drop = False)
# drop = False keeps the index column among the data columns
df.index
# query the index
pandas automatically optimizes queries according to the index type (a sorted index is fastest)
merge: joins different tables into one on a key, similar to a SQL join.
concat: batch-merges same-format Excel files; adds rows or columns to a DataFrame.
groupby: performs aggregation similar to SQL's GROUP BY.
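A sketch of all three operations on small illustrative tables (the users/ratings data are assumptions, not from the notes):

```python
import pandas as pd

users = pd.DataFrame({'userId': [1, 2], 'name': ['Ann', 'Bob']})
ratings = pd.DataFrame({'userId': [1, 1, 2], 'rating': [5, 3, 4]})

# merge: join two tables on a key, like a SQL JOIN
joined = pd.merge(ratings, users, on='userId')

# concat: append same-format tables row-wise
combined = pd.concat([ratings, ratings], ignore_index=True)

# groupby: aggregate per key, like SQL GROUP BY
mean_rating = joined.groupby('name')['rating'].mean()

print(joined.shape, combined.shape)
print(mean_rating)
```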