June 20, 2022: The three essential tools of Python data analysis: NumPy, pandas, and Matplotlib


0. Introductory case

Worked example

names = [' The Monkey King ', ' Li Yuanfang ', ' White ', ' Judge dee ', ' dharma ']
courses = [' Chinese language and literature ', ' mathematics ', ' English ']
import random
scores = [[random.randrange(60, 101) for _ in range(3)] for _ in range(5)]
print(scores)
# [[100, 97, 76], [91, 79, 66], [84, 78, 71], [87, 81, 96], [73, 71, 88]]
def mean(nums):
    """Calculate the mean."""
    return sum(nums) / len(nums)
def variance(nums):
    """Calculate the variance."""
    mean_value = mean(nums)
    return mean([(num - mean_value) ** 2 for num in nums])
def stddev(nums):
    """Calculate the standard deviation."""
    return variance(nums) ** 0.5
# Compute each student's average score
for idx, name in enumerate(names):
    temp = scores[idx]
    avg_score = mean(temp)
    print(f'{name} average score: {avg_score:.1f} points')
''' The Monkey King  average score: 91.0 points   Li Yuanfang  average score: 78.7 points   White  average score: 77.7 points   Judge dee  average score: 88.0 points   dharma  average score: 77.3 points '''
# Compute the highest score, lowest score, and standard deviation for each course
for idx, course in enumerate(courses):
    temp = [scores[i][idx] for i in range(len(names))]
    max_score, min_score = max(temp), min(temp)
    print(f'{course} highest score: {max_score} points')
    print(f'{course} lowest score: {min_score} points')
    print(f'{course} standard deviation: {stddev(temp):.1f} points')
''' Chinese language and literature  highest score: 100 points   Chinese language and literature  lowest score: 73 points   Chinese language and literature  standard deviation: 8.8 points   mathematics  highest score: 97 points   mathematics  lowest score: 71 points   mathematics  standard deviation: 8.6 points   English  highest score: 96 points   English  lowest score: 66 points   English  standard deviation: 11.1 points '''
# Print each student and their scores, one student per line (sorted by average score from high to low)
results = {name: temp for name, temp in zip(names, scores)}
sorted_keys = sorted(results, key=lambda x: mean(results[x]), reverse=True)
for key in sorted_keys:
    verbal, math, english = results[key]
    print(f'{key}:\t{verbal}\t{math}\t{english}')
''' The Monkey King : 100 97 76    Judge dee : 87 81 96    Li Yuanfang : 91 79 66    White : 84 78 71    dharma : 73 71 88 '''

1. A first look

The three essential tools of Python data analysis

  1. NumPy - Numerical Python - ndarray - stores the data and supports batch (vectorized) operations and processing
  2. pandas - Panel Data Set - Series (one-dimensional data) / DataFrame (two-dimensional data) - wraps up the methods needed for data analysis
  3. matplotlib - statistical charting and visualization

1.1 Numpy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Convert the nested list into an ndarray object
scores = np.array(scores)
print(scores)
''' array([[100, 97, 76], [ 91, 79, 66], [ 84, 78, 71], [ 87, 81, 96], [ 73, 71, 88]]) '''
# View data type 
type(scores)
# numpy.ndarray
# Average along the rows (one value per student)
np.round(scores.mean(axis=1), 1)
# array([91. , 78.7, 77.7, 88. , 77.3])
# Highest score, lowest score, and standard deviation along the columns (one value per course)
scores.max(axis=0)
# array([100, 97, 96])
scores.min(axis=0)
# array([73, 71, 66])
np.round(scores.std(axis=0), 1)
# array([ 8.8, 8.6, 11.1])

1.2 pandas

scores_df = pd.DataFrame(data=scores, columns=courses, index=names)
scores_df

Output (screenshot omitted):

# Calculate the average score 
np.round(scores_df.mean(axis=1), 1)
''' The Monkey King 91.00000 Li Yuanfang 78.66875 White 77.66875 Judge dee 88.00000 dharma 77.33125 dtype: float64 '''
# Add average columns to the table 
scores_df[' average ']= scores_df.mean(axis=1)
scores_df

Output (screenshot omitted):

# write in Excel file 
scores_df.to_excel(' Test results .xlsx')
# Set the font so that Chinese labels render correctly
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
# Render figures as vector graphics (SVG)
%config InlineBackend.figure_format='svg'
# Draw a bar chart
scores_df.plot(kind='bar', y=[' Chinese language and literature ',' mathematics ',' English '])
# Keep the x-axis tick labels horizontal
plt.xticks(rotation=0)
# Save the chart 
plt.savefig(' Score bar .svg')
# Show chart 
plt.show()

Output (screenshot omitted):

# Highest score, lowest score, and standard deviation for each column
scores_df.max()
''' Chinese language and literature 100.0 mathematics 97.0 English 96.0 average 91.0 dtype: float64 '''
scores_df.min()
''' Chinese language and literature 73.0 mathematics 71.0 English 66.0 average 77.3 dtype: float64 '''
scores_df.std()
''' Chinese language and literature 9.874209 mathematics 9.602083 English 12.361230 average 6.461656 dtype: float64 '''

2. Numpy


2.1 Create a one-dimensional array

# Method 1: use the array function to turn a list into an ndarray object
array1 = np.array([1, 2, 10, 20, 100])
array1
# array([ 1, 2, 10, 20, 100])
type(array1)
# numpy.ndarray
# Method 2 : Specify a range to create an array object 
array2 = np.arange(1, 100, 2)
array2
''' array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99]) '''
# Method 3 : Specify the range and the number of elements to create an array object 
array3 = np.linspace(-5, 5, 101)
array3
''' array([-5. , -4.9, -4.8, -4.7, -4.6, -4.5, -4.4, -4.3, -4.2, -4.1, -4. , -3.9, -3.8, -3.7, -3.6, -3.5, -3.4, -3.3, -3.2, -3.1, -3. , -2.9, -2.8, -2.7, -2.6, -2.5, -2.4, -2.3, -2.2, -2.1, -2. , -1.9, -1.8, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1. , -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. ]) '''
# Method 4: create an array object from randomly generated elements
# Random decimal 
array4 = np.random.random(10)
array4
''' array([0.74861994, 0.80263292, 0.54287411, 0.99088428, 0.27465232, 0.4421258 , 0.34908231, 0.39729076, 0.11863797, 0.37728455]) '''
# Random integer (note: without a size argument, randint returns a single int rather than an array)
array5 = np.random.randint(1, 10)
array5
# 5
# Random normal distribution 
array6 = np.random.normal(0, 1, 50000)
array6
''' array([-1.24165108, -0.07314869, -1.37729185, ..., -1.00691177, 0.19568883, 0.43887128]) '''
# Generate histogram 
plt.hist(array6, bins=24)
plt.show()

Output (screenshot omitted):


2.2 Create a 2D array

# Method 1 : adopt array Function to process a nested list into a two-dimensional array 
array8 = np.array(([1, 2, 3], [4, 5, 5], [7, 8, 9]))
array8
''' array([[1, 2, 3], [4, 5, 5], [7, 8, 9]]) '''
# Method 2 : By adjusting the shape of a one-dimensional array to a two-dimensional array 
temp= np.arange(1, 11)
array9 = temp.reshape((5, 2))
array9
''' array([[ 1, 2], [ 3, 4], [ 5, 6], [ 7, 8], [ 9, 10]]) '''
# Method 3: create a two-dimensional array from randomly generated elements
array10 = np.random.randint(60, 101, (5, 3))
array10
''' array([[71, 91, 67], [95, 71, 96], [90, 91, 92], [67, 83, 74], [95, 78, 60]]) '''
# Method 4: create all-zero, all-one, or constant-valued two-dimensional arrays
array11 = np.zeros((5, 4), dtype='i8')
array11
''' array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], dtype=int64) '''
array12 = np.ones((5, 4), dtype='i8')
array12
''' array([[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]], dtype=int64) '''
array13 = np.full((5, 4), 100)
array13
''' array([[100, 100, 100, 100], [100, 100, 100, 100], [100, 100, 100, 100], [100, 100, 100, 100], [100, 100, 100, 100]]) '''
# Method 5: create an identity matrix
array14 = np.eye(5)
array14
''' array([[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], [0., 0., 1., 0., 0.], [0., 0., 0., 1., 0.], [0., 0., 0., 0., 1.]]) '''

2.3 Index and slice of arrays


2.3.1 Common usage

array10
''' array([[71, 91, 67], [95, 71, 96], [90, 91, 92], [67, 83, 74], [95, 78, 60]]) '''
# Take the specified element 
array10[1][2]
# 96
array10[1, 2]
# 96
# Take some elements 
array10[:3, :2]
''' array([[71, 91], [95, 71], [90, 91]]) '''


Summary: when operating on multidimensional arrays, every integer index removes a dimension, whereas slicing keeps the number of dimensions unchanged.
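For instance, with the 5x3 array10 defined above, indexing a single row drops a dimension while slicing one row keeps it:

array10[1]      # integer index drops a dimension -> shape (3,)
array10[1:2]    # slice keeps the dimension -> shape (1, 3)
array10[1].shape, array10[1:2].shape
# ((3,), (1, 3))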

2.3.2 Special index

# Fancy index (fancy index)
array2[[0, 1, 2, -3, -2, -1, -10]]
# array([ 1, 3, 5, 95, 97, 99, 81])
array10[[0, 1, 4], [0, 1, 2]]
# array([71, 71, 60])
# Boolean index 
array1[[False, True, False, True, False]]
# array([ 2, 20], dtype=uint64)
array2[array2 > 80]
# array([81, 83, 85, 87, 89, 91, 93, 95, 97, 99])
array16 = np.arange(1, 10)
array16[array16 % 2 !=0]
# array([1, 3, 5, 7, 9])
array16[(array16 > 5) & (array16 % 2 != 0)]
# array([7, 9])
array16[(array16 > 5) | (array16 % 2 != 0)]
# array([1, 3, 5, 6, 7, 8, 9])
array16[(array16 > 5) | ~(array16 % 2 != 0)]
# array([2, 4, 6, 7, 8, 9])
array10[array10 > 80]
# array([91, 95, 96, 90, 91, 92, 83, 95])

2.5 ndarray Object method

array1 = np.random.randint(10, 50, 10)
array1
# Sum
array1.sum()
np.sum(array1)
# Mean
array1.mean()
np.mean(array1)
# Median
np.median(array1)
# Maximum and minimum
array1.max()
np.amax(array1)
array1.min()
np.amin(array1)
# Range (peak-to-peak)
array1.ptp()
np.ptp(array1)
# Variance
array1.var()
np.var(array1)
# Standard deviation
array1.std()
np.std(array1)
# Cumulative sum
array1.cumsum()
np.cumsum(array1)

2.5.1 Additional notes on the descriptive statistical methods above





2.5.2 Reproducing the same statistics in Excel





2.5.3 Quantiles and outliers

def outliers_by_iqr(t_array, lower_points=0.25, upper_points=0.75, whis=1.5):
    """Detect outliers using the IQR rule."""
    q1, q3 = np.quantile(t_array, [lower_points, upper_points])
    iqr = q3 - q1
    return t_array[(t_array < q1 - whis * iqr) | (t_array > q3 + whis * iqr)]
# Test
array2[-1] = 15
outliers_by_iqr(array2)
# array([15])

2.5.4 Z-score Find outliers

def outliers_by_zscore(array, threshold=3):
    """Detect outliers using the Z-score rule."""
    mu, sigma = array.mean(), array.std()
    return array[np.abs((array - mu) / sigma) > threshold]
# Make the array longer so that the outlier value 500 stands out
# repeat() groups identical elements together; tile() repeats them in the original order
temp = np.repeat(array2[:-1], 100)
# append() and insert() add elements
temp = np.append(temp, array2[-1])
temp = np.insert(temp, 0, array2[-1])
temp
outliers_by_zscore(temp)
# array([500, 500])

2.6 Other methods


2.6.1 all()/any

all() checks whether every element of the array is truthy; any() checks whether at least one element is truthy.
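A minimal sketch (flags is a throwaway array made up for illustration):

flags = np.array([1, 0, 3])
flags.all()    # False, because one element is 0
flags.any()    # True, at least one element is truthy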


2.6.2 astype() Method

Copy an array , And convert the elements in the array to the specified type .

array5 = array3.astype(np.float64)
array5.dtype
# dtype('float64')

2.6.3 dump() and load() Method

dump() saves the array to a file; the saved file can later be loaded back into an array with NumPy's load() function.

Serialization: turning an object into a string (str) or byte string (bytes) —> serialization / pickling

Deserialization: restoring an object from a string or byte string —> unpickling

  • json module: dump / dumps / load / loads —> universal, cross-language

  • pickle module: dump / dumps / load / loads —> Python-specific protocol; other languages cannot deserialize it (a short sketch follows this list)
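For reference, a minimal sketch of the two standard-library modules applied to an ordinary Python object (the sample data is invented):

import json, pickle
data = {'name': 'tom', 'scores': [90, 80]}
text = json.dumps(data)      # cross-language text representation
obj1 = json.loads(text)      # back to a Python object
blob = pickle.dumps(data)    # Python-specific byte string
obj2 = pickle.loads(blob)    # back to a Python object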

# Save
with open('array3', 'wb') as file:
    array3.dump(file)
# Load
with open('array3', 'rb') as file:
    array6 = np.load(file, allow_pickle=True)
array6

2.6.4 fill() Method

fill() fills the array with the given value, i.e. every element of the array becomes that value.
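A minimal sketch (temp is a throwaway array):

temp = np.zeros(5)
temp.fill(7)
temp
# array([7., 7., 7., 7., 7.])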


2.6.5 flatten() Method

Flatten a multidimensional array into a one-dimensional array

array7 = np.arange(1, 11).reshape(5, 2)
array7
''' array([[ 1, 2], [ 3, 4], [ 5, 6], [ 7, 8], [ 9, 10]]) '''
array7.flatten('C')
# array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
array7.flatten('F')
# array([ 1, 3, 5, 7, 9, 2, 4, 6, 8, 10])

2.6.6 nonzero() Method

Returns the indices of the non-zero elements.
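A minimal sketch (temp is a throwaway array):

temp = np.array([0, 3, 0, 5])
temp.nonzero()
# (array([1, 3]),)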


2.6.7 round() Method

Round the elements in the array .
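A minimal sketch (temp is a throwaway array; the argument is the number of decimal places):

temp = np.array([1.26, 3.51])
temp.round(1)
# array([1.3, 3.5])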


2.6.8 sort() Method

array8 = np.random.randint(1, 100, 10)
array8
# array([55, 98, 48, 98, 38, 5, 35, 36, 39, 87])
# Return the new array after sorting 
np.sort(array8)
# array([ 5, 35, 36, 38, 39, 48, 55, 87, 98, 98])
array8
# array([55, 98, 48, 98, 38, 5, 35, 36, 39, 87])
# Sort in place on the original array 
array8.sort()
# No return value 
array8
# array([ 5, 35, 36, 38, 39, 48, 55, 87, 98, 98])

2.6.9 transpose() and swapaxes() Method

Swap axes specified by array .

# For two dimensional arrays ,transpose It is equivalent to the transpose of the matrix 
array7.transpose()
''' array([[ 1, 3, 5, 7, 9], [ 2, 4, 6, 8, 10]]) '''
# swapaxes Swap the specified two axes , Order doesn't matter 
array7.swapaxes(0, 1)
''' array([[ 1, 3, 5, 7, 9], [ 2, 4, 6, 8, 10]]) '''

2.6.10 tolist() Method

Converts the array to a Python list. The original array is not modified.

array7.tolist()
# [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

2.7 Other common functions


2.7.1 np.unique()

Removes duplicate values (the result is sorted).

array9 = np.array([1,1,1,1,2,2,2,3,3,3])
array9
# array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])
np.unique(array9)
# array([1, 2, 3])

2.7.2 stack (stacking)

array10 = np.array([[1,1,1], [2,2,2]])
array10
''' array([[1, 1, 1], [2, 2, 2]]) '''
array11 = np.array([[3,3,3,], [4,4,4,]])
array11
''' array([[3, 3, 3], [4, 4, 4]]) '''
# Stack horizontally (along axis 1)
np.hstack((array10, array11))
''' array([[1, 1, 1, 3, 3, 3], [2, 2, 2, 4, 4, 4]]) '''
# Stack vertically (along axis 0)
np.vstack((array10, array11))
''' array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]) '''
# Stack along a given axis, adding a new dimension
np.stack((array10, array11), axis=0)
''' array([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]) '''
np.stack((array10, array11), axis=1)
''' array([[[1, 1, 1], [3, 3, 3]], [[2, 2, 2], [4, 4, 4]]]) '''

2.7.3 concatenate Merge

array12 = np.concatenate((array10, array11), axis=0)
array12
''' array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]) '''
array13 = np.concatenate((array10, array11), axis=1)
array13
''' array([[1, 1, 1, 3, 3, 3], [2, 2, 2, 4, 4, 4]]) '''

2.7.4 split Split

# Split vertically (along axis 0)
np.vsplit(array12, 2)
''' [array([[1, 1, 1], [2, 2, 2]]), array([[3, 3, 3], [4, 4, 4]])] '''
np.vsplit(array12, 4)
''' [array([[1, 1, 1]]), array([[2, 2, 2]]), array([[3, 3, 3]]), array([[4, 4, 4]])] '''
# Split horizontally (along axis 1)
np.hsplit(array12, 3)
''' [array([[1], [2], [3], [4]]), array([[1], [2], [3], [4]]), array([[1], [2], [3], [4]])] '''
# Split along the specified axis 
np.split(array13, 2)
''' [array([[1, 1, 1, 3, 3, 3]]), array([[2, 2, 2, 4, 4, 4]])] '''
np.split(array13, 3, axis=1)
''' [array([[1, 1], [2, 2]]), array([[1, 3], [2, 4]]), array([[3, 3], [4, 4]])] '''

2.7.5 extract extract

# Extract the elements that satisfy a condition (similar to a Boolean index);
# array14 here has evidently been re-created as a small random-integer array after the identity-matrix example above
np.extract(array14 <= 50, array14)
# array([19, 23, 50])
array14[array14 <= 50]
# array([19, 23, 50])

2.7.6 select Multi criteria screening

# Build a new array from a list of conditions and a matching list of choices (multiple conditions)
np.select([array14 % 2 == 0, array14 % 2 != 0], [array14 / 2, array14 ** 2])
# array([ 361., 9025., 38., 529., 25.])

2.7.7 where Single condition screening

# Build a new array from a single condition: keep the element where it holds, otherwise use the fallback value
np.where(array14 % 2 == 0, array14, 0)
# array([ 0, 0, 76, 0, 50])

2.7.8 resize Resize

Fills the new shape from left to right and top to bottom, cycling through the original elements as needed.

array15 = np.arange(1, 10).reshape((3, 3))
array15
''' array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) '''
# Size the array 
np.resize(array15, (4, 4))
''' array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 1, 2, 3], [4, 5, 6, 7]]) '''

2.7.9 put Replace

# Replace the element with the specified index in the array 
np.put(array15, 5, 100)
array15
''' array([[ 1, 2, 3], [ 4, 5, 100], [ 7, 8, 9]]) '''
np.put(array15, [1, 2], 100)
array15
''' array([[ 1, 100, 100], [ 4, 5, 100], [ 7, 8, 9]]) '''

2.7.10 place Replace

# Replace the elements in the array that meet the conditions 
np.place(array15, array15 == 100, [2, 3])
array15
''' array([[1, 2, 3], [4, 5, 2], [7, 8, 9]]) '''

2.8 Operation of array


2.8.1 Broadcast mechanism

Prerequisites (at least one of these must hold):

  • The trailing dimensions of the two arrays (reading the shape attributes from right to left) are identical.

  • The trailing dimensions differ, but one of them is 1.

When broadcasting applies, an array is repeated along the missing axis, or along the axis whose size is 1, until the two shapes match.

# Satisfies condition 1:
array16 = np.arange(1, 16).reshape(5, 3)
array17 = np.array([[1, 1, 1]])
array16 + array17
''' array([[ 2, 3, 4], [ 5, 6, 7], [ 8, 9, 10], [11, 12, 13], [14, 15, 16]]) '''
# Also satisfies condition 1 (shapes (3, 4, 2) and (4, 2)):
array18 = np.random.randint(1, 18, (3, 4, 2))
array18
''' array([[[ 2, 11], [17, 16], [ 2, 6], [ 3, 13]], [[ 6, 17], [15, 4], [ 2, 2], [ 9, 6]], [[ 8, 4], [ 5, 9], [15, 11], [ 6, 17]]]) '''
array19 = np.random.randint(1, 10, (4, 2))
array19
''' array([[7, 5], [1, 7], [2, 7], [6, 5]]) '''
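The sums themselves are not shown above; a brief sketch of both conditions (array20 is introduced here purely for illustration):

# Condition 1: trailing dimensions match, so (3, 4, 2) + (4, 2) broadcasts over the leading axis
(array18 + array19).shape
# (3, 4, 2)
# Condition 2: one trailing dimension is 1, so (5, 3) + (5, 1) stretches the size-1 axis
array20 = np.arange(5).reshape(5, 1)
(array16 + array20).shape
# (5, 3)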




2.8.2 Dot product operations

$$A \cdot B = \sum a_i b_i$$

$$A \cdot B = |A||B|\cos\theta$$

The result of a dot product is a scalar.

# Find the dot product 
a = np.array([1, 2, 3])
b = np.array([2, 4, 6])
np.dot(a, b)
# 28, namely 1 * 2 + 2 * 4 + 3 * 6 = 28
# Norm (length) of a
# np.linalg is the linear-algebra submodule
np.linalg.norm(a)
# Norm of b
np.linalg.norm(b)
# Cosine of the angle between a and b, a measure of how related they are
np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)
# 1.0

2.9 matrix


2.9.1 Some methods of matrix

# Create a matrix 
m1 = np.matrix('1 2; 3 4')
m1
''' matrix([[1, 2], [3, 4]]) '''
type(m1)
# numpy.matrix
# Get the corresponding array object 
m1.A
''' array([[1, 2], [3, 4]]) '''
# Get the corresponding flattened array object 
m1.A1
''' array([1, 2, 3, 4]) '''
# Get the transposed matrix 
m1.T
''' matrix([[1, 3], [2, 4]]) '''
m1.swapaxes(0, 1)
m1.transpose()
# Both of the above also return the transpose
# Get the inverse matrix 
m1.I
''' matrix([[-2. , 1. ], [ 1.5, -0.5]]) '''

2.9.2 Matrix multiplication


$$A \cdot A^{-1} = I$$

m2 = np.matrix('1 0 2; -1 3 1')
m2
''' matrix([[ 1, 0, 2], [-1, 3, 1]]) '''
m3 = np.mat([[3, 1], [2, 1], [1, 0]])
m3
''' matrix([[3, 1], [2, 1], [1, 0]]) '''
m2 * m3
''' matrix([[5, 1], [4, 2]]) '''
# Create from nested lists or 2D arrays matrix object 
m4 = np.asmatrix(np.arange(1, 10).reshape((3, 3)))
m4
''' matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) '''
# Calculate the determinant of the matrix
np.linalg.det(m4)
# Calculate the rank of the matrix 
np.linalg.matrix_rank(m4)
array2 = m2.A
array2
''' array([[ 1, 0, 2], [-1, 3, 1]]) '''
array3 = m3.A
array3
''' array([[3, 1], [2, 1], [1, 0]]) '''
# Matrix multiplication of array objects 
array2 @ array3
''' array([[5, 1], [4, 2]]) '''
# Inverse matrix 
array1 = m1.A
np.linalg.inv(array1)
''' array([[-2. , 1. ], [ 1.5, -0.5]]) '''

2.9.3 Solving linear equations

$$\begin{cases} x_1 + 2x_2 + x_3 = 8 \\ 3x_1 + 7x_2 + 2x_3 = 23 \\ 2x_1 + 2x_2 + x_3 = 9 \end{cases}$$

$$A \cdot x = b$$

A = np.array([[1, 2, 1], [3, 7, 2], [2, 2, 1]])
b = np.array([8, 23, 9]).reshape(-1, 1)
C = np.hstack((A, b))
C
''' array([[ 1, 2, 1, 8], [ 3, 7, 2, 23], [ 2, 2, 1, 9]]) '''
np.linalg.matrix_rank(A)
# 3
np.linalg.matrix_rank(C)
# 3

$$A \cdot x = b$$
$$A^{-1} \cdot A \cdot x = A^{-1} \cdot b$$
$$I \cdot x = A^{-1} \cdot b$$
$$x = A^{-1} \cdot b$$

# Use the above formula to solve linear equations 
np.linalg.inv(A) @ b
''' array([[1.], [2.], [3.]]) '''
# Solve a system of linear equations 
np.linalg.solve(A, b)
''' array([[1.], [2.], [3.]]) '''

2.10 The correlation coefficient

# Install the machine-learning package first (shell command): pip install scikit-learn
import warnings
warnings.filterwarnings('ignore')
# Note: load_boston was deprecated and removed in scikit-learn 1.2+, so this example needs an older version
from sklearn.datasets import load_boston
# Load the Boston house-price dataset
datasets = load_boston()
# see datasets attribute 
dir(datasets)
# ['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'target']
# Print content description 
print(datasets.DESCR)
# Too much content to show 
# Print influencing factor data 
datasets.data
# Too much content to show 
# View data type 
type(datasets.data)
# numpy.ndarray
# see shape
datasets.data.shape
# (506, 13)
# View the influencing factor name 
datasets.feature_names
''' array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7') '''
# Get house price data 
y = datasets.target
y.shape
# (506,)
# Get crime rate data 
x1 = datasets.data[:, 0]
x1.shape
# (506,)
# Calculate the correlation coefficient between crime rate and house price 
np.corrcoef(x1, y)
''' array([[ 1. , -0.38830461], [-0.38830461, 1. ]]) '''
# Get the LSTAT data (share of lower-status population)
x2 = datasets.data[:, -1]
x2.shape
# (506,)
# Calculate the correlation coefficient between low-income groups and house prices 
np.corrcoef(x2, y)
''' array([[ 1. , -0.73766273], [-0.73766273, 1. ]]) '''
# (similar examples below are omitted)
# Draw a scatter diagram 
plt.scatter(x1, y)

plt.scatter(x2, y)


2.11 K-nearest neighbors

# Get x: average number of rooms, y: house price
x = datasets.data[:, 5]
y = datasets.target
# Store the historical data in a dictionary
history_data = {}
delta = 0
for room, price in zip(x, y):
    if room not in history_data:
        history_data[room] = price
    else:
        # dictionary keys must be unique, so nudge duplicates by a tiny offset
        delta += 0.000001
        history_data[room + delta] = price
len(history_data)
# 506
import heapq
def predicate_by_knn(room_num, k=5):
    # Find the k historical records closest to the target value
    nearest_neighbors = heapq.nsmallest(k, history_data, key=lambda x: (x - room_num) ** 2)
    # Average the prices associated with those k keys
    return round(np.mean([history_data[key] for key in nearest_neighbors]), 2)
# Try the function
predicate_by_knn(6.125)
# 22.12
predicate_by_knn(5.525)
# 13.32

2.12 Monte Carlo simulation

def get_loss(x, y, a, b):
    """Loss function (mean squared error)."""
    return np.mean((a * x + b - y) ** 2)
# Run the Monte Carlo simulation, recording every time a lower loss is found
min_loss = np.inf
best_a, best_b = None, None
for _ in range(10000):
    a, b = np.random.random(2) * 200 - 100
    curr_loss = get_loss(x, y, a, b)
    if curr_loss < min_loss:
        min_loss = curr_loss
        best_a, best_b = a, b
        print(a, b, min_loss)
print(best_a, best_b)
y_hat = best_a * x + best_b
# Plot the data and the fitted line
plt.scatter(x, y)
plt.plot(x, y_hat, color='red', linewidth=8)

# Define a function that predicts a value from the fitted line
def predicate_by_regression(room_num):
    return round(best_a * room_num + best_b, 2)
# Try the function
predicate_by_regression(5.525)
# 15

2.13 gradient descent

# Partial derivative of the loss with respect to a
def partial_a(x, y, a, b):
    return 2 * np.mean((y - a * x - b) * (-x))
# Partial derivative of the loss with respect to b
def partial_b(x, y, a, b):
    return 2 * np.mean(-y + a * x + b)
# Use gradient descent to find a and b
a, b = 50, -50
delta = 0.013
for _ in range(1000):
    a = a - partial_a(x, y, a, b) * delta
    b = b - partial_b(x, y, a, b) * delta
print(get_loss(x, y, a, b))
print(best_a, best_b)  # the Monte Carlo result, for comparison

2.14 NumPy's built-in lstsq function

param1 = np.vstack((x, np.ones(x.size))).T
param1
param2 = y.reshape((-1, 1))
param2
# Least-squares solution
results = np.linalg.lstsq(param1, param2)
results
''' (array([[ 9.10210898], [-34.67062078]]), array([22061.87919621]), 2, array([143.99484122, 2.46656609])) '''
a, b = results[0].flatten()
a, b
# (9.102108981180313, -34.67062077643857)
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color='red', linewidth=8)



3. Pandas


3.1 series


3.1.1 Create objects

# Create a Series object
ser1 = pd.Series(data=[320, 180, 250, 220], index=[f'{x} quarter ' for x in '1234'])
ser1
''' 1 quarter 320 2 quarter 180 3 quarter 250 4 quarter 220 dtype: int64 '''
# Create a Series object from a dictionary
ser2 = pd.Series(data={' First quarter ': 320, ' The two quarter ': 180, ' third quater ': 250, ' In the fourth quarter ': 220})
ser2
''' First quarter 320 The two quarter 180 third quater 250 In the fourth quarter 220 dtype: int64 '''

3.1.2 Indexes

# General index 
ser1['1 quarter ']
# 320
ser2. First quarter
# 320 (attribute-style access only works when the label is a valid identifier; it cannot start with a digit, so it would not work for ser1's '1 quarter ' labels)
# Fancy index 
ser1[['1 quarter ', '3 quarter ']]
''' 1 quarter 320 3 quarter 250 dtype: int64 '''
# Boolean index 
ser1[ser1 >= 200]
''' 1 quarter 320 3 quarter 250 4 quarter 220 dtype: int64 '''

3.1.3 section

ser1[1:3]
''' 2 quarter 180 3 quarter 250 dtype: int64 '''
# Slicing by index label includes both endpoints
ser1['2 quarter ': '3 quarter ']
''' 2 quarter 180 3 quarter 250 dtype: int64 '''

3.1.4 series Object properties

# Get index 
ser2.index
# Index([' First quarter ', ' The two quarter ', ' third quater ', ' In the fourth quarter '], dtype='object')
# Get the value of the index 
ser2.index.values
# array([' First quarter ', ' The two quarter ', ' third quater ', ' In the fourth quarter '], dtype=object)
# get data 
ser2.values
# array([320, 180, 250, 220], dtype=int64)
# Number of elements
ser2.size
# 4
# Are the elements unique?
ser2.is_unique
# True
# Are there any missing values?
ser2.hasnans
# False
ser2[' The two quarter '] = np.nan
ser2
''' First quarter 320.0 The two quarter NaN third quater 250.0 In the fourth quarter 220.0 dtype: float64 '''
ser2.hasnans
# True
ser2[' The two quarter '] = 280
ser2
# (the second-quarter value is 280 again)
# Whether the data is monotonically increasing 
ser2.is_monotonic_increasing
# False
# Whether the data is monotonically decreasing 
ser2.is_monotonic_decreasing
# True

3.1.5 series Object method

# Descriptive statistics: measures of central tendency
# Sum
print(ser2.sum())
# Mean
print(ser2.mean())
# Median 
print(ser2.median())
''' 1070.0 267.5 265.0 '''
# Mode
print(ser2.mode())
''' 0 220.0 1 250.0 2 280.0 3 320.0 dtype: float64 '''
# Descriptive statistics: measures of dispersion
# Maximum and minimum 
print(ser2.max())
print(ser2.min())
# Variance and standard deviation 
print(ser2.var())
print(ser2.std())
print(np.var(ser2))
print(np.std(ser2))
# Lower and upper quartiles
print(ser2.quantile(0.25))
print(ser2.quantile(0.75))
print(np.quantile(ser2, (0.25, 0.5, 0.75)))
''' 320 220 1825.0 42.720018726587654 1368.75 36.99662146737185 242.5 290.0 [242.5 265. 290. ] '''
# Get all descriptive statistics directly 
ser2.describe()
''' count 4.000000 mean 267.500000 std 42.720019 min 220.000000 25% 242.500000 50% 265.000000 75% 290.000000 max 320.000000 dtype: float64 '''

3.1.6 Data cleaning


3.1.6.1 duplicate removal

ser3 = pd.Series(['apple', 'banana', 'apple', 'pitaya', 'apple', 'pitaya', 'durian'])
ser3
''' 0 apple 1 banana 2 apple 3 pitaya 4 apple 5 pitaya 6 durian dtype: object '''
# Remove duplicates ---> returns an ndarray
ser3.unique()
# array(['apple', 'banana', 'pitaya', 'durian'], dtype=object)
# Number of distinct elements
ser3.nunique()
# 4
# Frequency of each element (in descending order of frequency)
ser3.value_counts()
''' apple 3 pitaya 2 banana 1 durian 1 dtype: int64 '''
# Flag duplicated elements
ser3.duplicated()
''' 0 False 1 False 2 True 3 False 4 True 5 True 6 False dtype: bool '''
# De-duplicate with a Boolean index
ser3[~ser3.duplicated()]
''' 0 apple 1 banana 3 pitaya 6 durian dtype: object '''
# Remove duplicates ---> returns a Series
ser3.drop_duplicates()
''' 0 apple 1 banana 3 pitaya 6 durian dtype: object '''
# keep - keep the first or the last occurrence of each duplicate; the default is 'first'
ser3.drop_duplicates(keep='last')
''' 1 banana 4 apple 5 pitaya 6 durian dtype: object '''
# inplace - whether to operate in place
# True ---> modify in place and return nothing ---> None
# False (the default) ---> return a new object with the result ---> Series
ser3.drop_duplicates(keep=False, inplace=True)
ser3
''' 1 banana 6 durian dtype: object '''

3.1.6.2 Missing value

# ser4 is not defined above; it is assumed to be a Series with missing values, e.g.:
ser4 = pd.Series([10, 20, np.nan, 30, np.nan])
# Detect missing values
ser4.isnull()
''' 0 False 1 False 2 True 3 False 4 True dtype: bool '''
# Detect non-missing values
ser4.notnull()
''' 0 True 1 True 2 False 3 True 4 False dtype: bool '''
# Filter non null values by Boolean index 
ser4[ser4.notnull()]
''' 0 10.0 1 20.0 3 30.0 dtype: float64 '''
# Drop entries by index label
ser4.drop(index=2)
''' 0 10.0 1 20.0 3 30.0 4 NaN dtype: float64 '''
ser4.drop(index=[2, 4])
''' 0 10.0 1 20.0 3 30.0 dtype: float64 '''
# Drop missing values (pass inplace=True to drop in place)
ser4.dropna()
''' 0 10.0 1 20.0 3 30.0 dtype: float64 '''
# Fill in empty values 
ser4.fillna(50)
''' 0 10.0 1 20.0 2 50.0 3 30.0 4 50.0 dtype: float64 '''
ser4.fillna(method='ffill')
''' 0 10.0 1 20.0 2 20.0 3 30.0 4 30.0 dtype: float64 '''
ser4.fillna(method='bfill')
''' 0 10.0 1 20.0 2 30.0 3 30.0 4 NaN dtype: float64 '''
# Forward-fill first, then backward-fill, so both ends are covered
ser4.fillna(method='ffill').fillna(method='bfill')

3.1.6.3 Sorting and Top-N values

# Sort by index
# ascending ---> ascending or descending; the default True means ascending
ser1.sort_index(ascending=False)
''' 4 quarter 220 3 quarter 250 2 quarter 180 1 quarter 320 dtype: int64 '''
# Sort values 
ser1.sort_values(ascending=False)
''' 1 quarter 320 3 quarter 250 4 quarter 220 2 quarter 180 dtype: int64 '''
# Top-N
ser1.nlargest(3)
''' 1 quarter 320 3 quarter 250 4 quarter 220 dtype: int64 '''
ser1.nsmallest(2)
''' 2 quarter 180 4 quarter 220 dtype: int64 '''

3.1.6.4 Data processing

# Example 1: batch-format strings with map
ser5 = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
ser5
''' 0 cat 1 dog 2 NaN 3 rabbit dtype: object '''
ser5.map('I am a {}'.format, na_action='ignore')
''' 0 I am a cat 1 I am a dog 2 NaN 3 I am a rabbit dtype: object '''
# example 2: Batch processing of data 
ser6 = pd.Series(np.random.randint(30, 80, 10))
ser6
''' 0 62 1 67 2 65 3 64 4 64 5 31 6 76 7 42 8 54 9 45 dtype: int32 '''
# Any of the following three approaches works
def upgrade(score):
    return score ** 0.5 * 10
# np.round(ser6.map(upgrade), 0)
# np.round(ser6.map(lambda x: x ** .5 * 10), 0)
np.round(ser6.apply(lambda x: x ** .5 * 10), 0)
''' 0 79.0 1 82.0 2 81.0 3 80.0 4 80.0 5 56.0 6 87.0 7 65.0 8 73.0 9 67.0 dtype: float64 '''

Min-max (linear) normalization:

$$X_i' = \frac{X_i - X_{min}}{X_{max} - X_{min}}$$

Zero-mean normalization (standardization):

$$X_i' = \frac{X_i - \mu}{\sigma}$$

# example 3: Normalize the data 
ser7 = pd.Series(data=np.random.randint(1, 10000, 10))
ser7
''' 0 9359 1 1222 2 2843 3 985 4 2478 5 3935 6 6838 7 1999 8 5907 9 9064 dtype: int32 '''
# Linear normalization 
x_min = ser7.min()
x_max = ser7.max()
ser7.map(lambda x: (x - x_min) / (x_max - x_min))
''' 0 1.000000 1 0.028302 2 0.221877 3 0.000000 4 0.178290 5 0.352281 6 0.698949 7 0.121089 8 0.587772 9 0.964772 dtype: float64 '''
# Zero mean normalization 
miu = ser7.mean()
sigma = ser7.std()
ser7.map(lambda x: (x - miu) / sigma)
''' 0 1.646880 1 -1.090183 2 -0.544923 3 -1.169903 4 -0.667699 5 -0.177605 6 0.798885 7 -0.828822 8 0.485722 9 1.547650 dtype: float64 '''

3.1.7 Plotting

ser7 = pd.Series(
    data=np.random.randint(150, 550, 8),
    index=[f'{x} quarter ' for x in '11213434']
)
ser7
''' 1 quarter 322 1 quarter 440 2 quarter 348 1 quarter 256 3 quarter 242 4 quarter 483 3 quarter 401 4 quarter 448 dtype: int32 '''
# Aggregate by quarter before plotting
# level=0 means group by level 0 of the index
temp = ser7.groupby(level=0).sum()
temp
''' 1 quarter 1018 2 quarter 348 3 quarter 643 4 quarter 931 dtype: int32 '''
# Draw a bar chart
temp.plot(kind='bar', color=['#a3f3b4', '#D8F781', 'blue', 'pink'])
plt.xticks(rotation=0)
plt.grid(True, alpha=0.5, axis='y', linestyle=':')
plt.show()

# Draw a pie chart
temp.plot(kind='pie', autopct='%.2f%%', wedgeprops={
    'edgecolor': 'w',
    'width': 0.4,
}, pctdistance=0.8)
# Clear the y-axis label
plt.ylabel('')
plt.show()


3.2 DataFrame


3.2.1 establish DataFrame object


3.2.1.1 Create... Through constructor syntax DataFrame object

stuids = np.arange(1001, 1006)
courses = [' Chinese language and literature ', ' mathematics ', ' English ']
scores = np.random.randint(60, 101, (5, 3))
# Method 1: create a DataFrame from a two-dimensional array or nested lists
df1 = pd.DataFrame(data=scores, columns=courses, index=stuids)
df1
# Method 2: create a DataFrame from a dictionary
scores = {
    ' Chinese language and literature ': [62, 72, 93, 88, 93],
    ' mathematics ': [95, 65, 86, 66, 87],
    ' English ': [66, 75, 82, 69, 82]
}
df2 = pd.DataFrame(data=scores, index=stuids)
df2

3.2.1.2 Read CSV file

# Read CSV File creation DataFrame object 
df4 = pd.read_csv(
'datas/2018 Beijing points settlement data in .csv',
index_col='id', # Specify index columns 
sep='#', # Specify the separator 
quotechar='`', # character used to quote field contents
usecols=['id', 'name', 'score'], # Which columns to read 
nrows=10, # Number of rows read 
skiprows=np.arange(1, 11) # rows to skip
# skiprows=lambda rn: rn > 0 and random.random() < 0.9 # randomly skip about 90% of the data rows
)
df4
# Read CSV File and create an iterator object 
df5_iter = pd.read_csv(
'datas/bilibili.csv',
encoding='gbk',
chunksize=10, # 10 rows per chunk
iterator=True # Create an iterator object 
)
df5_iter
# Pull the next DataFrame chunk from the iterator
next(df5_iter)

3.2.1.3 Read Excel file

df6 = pd.read_excel('datas/ Data of national tourist attractions .xlsx')
df6
# Omit the result presentation 
df7 = pd.read_excel(
'datas/2020 Annual sales figures .xlsx',
sheet_name='data', # Name of the table to read 
header=0, # Specify header position 
nrows=500,
skiprows=np.arange(1, 502)
)
df7
# Omit the result presentation 
df8 = pd.read_excel(
'datas/ Figures of the Three Kingdoms .xlsx',
sheet_name=' All character data ',
header=0,
usecols=np.arange(0, 10)
)
df8
# Omit the result presentation 

3.2.1.4 Reading data from SQL

# establish conn object 
import pymysql
conn = pymysql.connect(host='47.104.31.138', port=3306,
user='guest', password='Guest.618',
database='hrs', charset='utf8mb4')
conn
# <pymysql.connections.Connection at 0x1c469df74f0>
# Or create engine object 
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://guest:Guest.618@47.104.31.138/hrs')
engine
# from engine Object or conn Object to extract data 
dept_df = pd.read_sql('select dno, dname, dloc from tb_dept', conn)
dept_df
# Omit the result presentation 
emp_df = pd.read_sql('select eno, ename, job, sal, comm, dno from tb_emp', engine)
emp_df
# Omit the result presentation 

3.2.2 Data remodeling


3.2.2.1 Table join

from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://guest:Guest.618@47.104.31.138:3306/hrs')
engine
# Get department table 
dept_df = pd.read_sql('select dno, dname, dloc from tb_dept', engine)
dept_df
# Get employee table 
emp_df = pd.read_sql('select eno, ename, job, sal, comm, dno from tb_emp', engine)
emp_df
# Inner join with the DataFrame merge method
emp_df.merge(dept_df, how='inner', on='dno')
# After renaming, the join column no longer has the same name in both tables
dept_df.rename(columns={'dno': 'deptno'}, inplace=True)
# Use left_on and right_on to specify the join keys
temp_df = emp_df.merge(dept_df, how='inner', left_on='dno', right_on='deptno')
temp_df
# The module-level merge function can be used to join tables as well
temp_df = pd.merge(emp_df, dept_df, how='inner',
                   left_on='dno', right_on='deptno')
temp_df
# The join-key columns are not merged automatically in this case, so drop the redundant ones manually
temp_df.drop(columns=['deptno', 'job'], inplace=True)
# Make deptno the index column
dept_df = dept_df.set_index('deptno')
# When the join key on one side is its index, pass right_index=True
pd.merge(emp_df, dept_df, how='inner', left_on='dno', right_index=True)

3.2.2.2 Table concatenation (equivalent to a UNION)

import os
# ignore_index=True means the original row numbers are discarded and the rows are renumbered
df1 = pd.concat([pd.read_csv(f'datas/jobs/{i}') for i in os.listdir('datas/jobs')],
                ignore_index=True)
df1.drop(columns=['uri', 'city'], inplace=True)
df1

3.2.3 Selecting rows and columns

# 1. Column operations
# Get a single column
df1['site']
# Number of distinct companies
df1['company_name'].nunique()
# Remove duplicates
df1['company_name'].drop_duplicates()
# Fancy index: get several columns at once
df1[['company_name', 'site', 'edu']]
# Boolean index
df1[df1['edu'] == ' Undergraduate ']
# Get several rows by index label
df1.loc[[4, 7, 1]]
# Slice rows by integer position
df1.iloc[1:5]
# Slice rows by index label (both endpoints included)
df1.loc[1:5]
# 2. Cell operations
# Access a cell by index label and column name
df1.at[7, 'company_name']
# Access a cell by integer position
df1.iat[7, 0]
# Modify a cell
df1.iat[7, 0] = ' A capitalist of Sichuan University '

Note: for loc vs. iloc and at vs. iat, the versions without the i prefix look data up by row and column labels, while the versions with the i prefix look data up by integer position.
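A minimal sketch of the difference (demo is a throwaway DataFrame):

demo = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
demo.loc['b', 'x']    # by labels -> 20
demo.iloc[1, 0]       # by integer positions -> 20
demo.at['b', 'x']     # single cell, by labels -> 20
demo.iat[1, 0]        # single cell, by integer positions -> 20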


3.2.4 Data cleaning


3.2.4.1 Null processing

# Judge whether it is a null value 
emp_df.isnull()
# Determine whether it is a non null value 
emp_df.notnull()
# Drop rows that contain missing values (i.e. drop along axis 0)
emp_df.dropna()
# Drop columns that contain missing values (i.e. drop along axis 1)
emp_df.dropna(axis=1)
# Fill the missing values of a specific column (each column may need a different strategy, so handle one column at a time)
emp_df['comm'] = emp_df['comm'].fillna(0).astype(np.int64)
emp_df

3.2.4.2 duplicate removal

# Flag duplicated values in the job column
emp_df.duplicated('job')
# duplicate removal 
emp_df.drop_duplicates('job', keep='first')
# Get summary information about a DataFrame (ytb is assumed to be another DataFrame loaded elsewhere)
ytb.info()
# obtain DataFrame The front of the object 10 That's ok 
ytb.head(10)
# obtain DataFrame After the object 10 That's ok 
ytb.tail(10)

Case study :

# Do not limit the maximum number of columns displayed 
pd.set_option('display.max_column', None)
# Read Kobe's shooting data sheet 
kobe = pd.read_csv('datas/Kobe_data.csv', index_col='shot_id')
kobe
# Check the information 
kobe.info()
# Number of distinct games
kobe['game_id'].nunique()
# Find the opponent faced most often
kobe.drop_duplicates('game_id', inplace=True)
ser = kobe['opponent'].value_counts()
ser.index[0]

3.2.4.3 outliers

# Delete by index 
kobe.drop(index=kobe[kobe['opponent']=='BKN'].index)
# Replace specific values (restrict to the target column first so nothing else gets replaced by accident)
kobe['opponent'] = kobe['opponent'].replace(['SEA', 'BKN'], ['OKC', 'NJN'])
# Regular expression substitution 
kobe = kobe.replace('BKN|SEA', '---', regex=True)

3.2.4.4 Time date

# Convert the string column to datetime
kobe['game_date'] = pd.to_datetime(kobe['game_date'])
# Extract the year, quarter, and month from the datetime
kobe['year'] = kobe['game_date'].dt.year
kobe['quarter'] = kobe['game_date'].dt.quarter
kobe['month'] = kobe['game_date'].dt.month

3.2.4.5 character string

# .str exposes the underlying strings of the Series so the usual string methods can be applied
kobe['opponent'] = kobe['opponent'].str.lower()
kobe

3.2.5 Data filtering

# Filter rows whose job_type is "python" or " Data analysis "
jobs_df.query('job_type == "python" or job_type == " Data analysis "')
# Filter rows whose job_name contains "python" or " Data analysis "
jobs_df = jobs_df[jobs_df['job_name'].str.lower().str.contains('python') |
jobs_df['job_name'].str.contains(' Data analysis ')]

3.2.6 Data processing


3.2.6.1 Regular expressions extract data

# Use regular expression capture group to extract data 
temp_df = jobs_df['salary'].str.extract(r'(\d+)[Kk]?-(\d+)[Kk]?')
# Use applymap to convert every element of the DataFrame to int
temp_df = temp_df.applymap(int)
temp_df.info()
# Average along axis 1 to get the midpoint of the salary range
jobs_df['salary'] = temp_df.mean(axis=1)
jobs_df

3.2.6.2 String splitting

# Split site Column 
temp_df = jobs_df['site'].str.split(r'\s', expand=True, regex=True)
temp_df.rename(columns={0: 'city', 1: 'district', 2: 'street'}, inplace=True)
temp_df
# Add the new columns directly
jobs_df[temp_df.columns] = temp_df
jobs_df
# Drop the original column
jobs_df.drop(columns='site', inplace=True)
jobs_df
# Alternatively, join the two frames on their indexes
jobs_df.merge(temp_df, how='inner', left_index=True, right_index=True)

Note: map is a Series method, applymap is a DataFrame method, and apply exists on both Series and DataFrame.
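A minimal sketch of the distinction (demo is a throwaway DataFrame):

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
demo['a'].map(lambda v: v * 10)       # map: element-wise on a Series
demo.applymap(lambda v: v * 10)       # applymap: element-wise on every cell of the DataFrame
demo.apply(lambda col: col.sum())     # apply: column-wise (or row-wise) on a DataFrame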


3.2.6.3 Adjust the order of the columns

# Reorder the columns with reindex
jobs_df.reindex(columns=['company_name', 'city', 'district', 'street', 'salary', 'year', 'edu', 'job_name', 'job_type'])
# Adjust the order of columns with fancy indexes 
jobs_df[['company_name', 'city', 'district', 'street', 'salary', 'year', 'edu', 'job_name', 'job_type']]

3.2.6.4 Delete specified row

# Get the target row by Boolean index and delete 
jobs_df.drop(index=jobs_df[(jobs_df['edu'] == ' high school ') | (jobs_df['edu'] == ' secondary specialized school ')].index,
inplace=True)
jobs_df

3.2.6.5 Replace target cell

jobs_df['min_exp'] = jobs_df['year'].replace(['1 Within years ', ' Experience is unlimited ', ' Fresh graduates '],
['0-1 year ', '0 year ', '0 year ']).str.extract(r'(\d+)')
jobs_df

3.2.6.6 Binning and plotting

# First look at the distribution of the data (luohu_df is assumed to be a score DataFrame loaded earlier)
luohu_df['score'].describe()
# Bin the scores
score_seg = pd.cut(luohu_df['score'], bins=np.arange(90, 130, 5), right=False)
score_seg
# Insert the binned column back into the DataFrame
luohu_df.insert(4, 'score_seg', score_seg)
luohu_df
# Count the rows in each bin
ser2 = luohu_df['score_seg'].value_counts()
ser2
# Draw a bar chart
ser2.plot(kind='bar')
for i, index in enumerate(ser2.index):
    # Arguments: x position, y position, text to display, horizontal alignment
    plt.text(i, ser2[index] + 20, ser2[index], ha='center')
# Rotate the x-axis tick labels by 30 degrees
plt.xticks(rotation=30)

3.2.6.7 Handling categorical (nominal) variables

# Create a dummy-variable matrix (persons_df is assumed to be a DataFrame with an ' occupation ' column)
temp_df = pd.get_dummies(persons_df[' occupation '])
''' Doctor Teachers' Painter The programmer 0 1 0 0 0 1 1 0 0 0 2 0 0 0 1 3 0 0 1 0 4 0 1 0 0 '''
# Write back to the original table 
persons_df[temp_df.columns] = temp_df
persons_df
# Delete the original occupation column 
persons_df.drop(columns=' occupation ', inplace=True)
persons_df

3.2.6.8 Handling ordinal variables

# Use apply with a mapping function to turn the ordinal variable into numbers
def edu_to_value(sc):
    results = {' high school ': 1, ' junior college ': 3, ' Undergraduate ': 5, ' Graduate student ': 10}
    return results.get(sc, 0)
persons_df[' Education '] = persons_df[' Education '].apply(edu_to_value)
persons_df

3.2.7 Data analysis


3.2.7.1 Get descriptive statistics

# Import data 
sales_df = pd.read_excel('datas/2020 Annual sales figures .xlsx', sheet_name='data')
sales_df
# Check the information 
sales_df.info()
# Calculate sales 
sales_df[' sales '] = sales_df[' sales volumes '] * sales_df[' The price is ']
sales_df
# Calculate the average value of sales per order 
sales_df[' sales '].sum() / sales_df[' Sales order '].nunique()
# Get descriptive statistics 
sales_df[' sales volumes '].describe()

3.2.7.2 Sorting and Top-N values

# Sort the DataFrame by sales in descending order (to sort on several keys, pass lists to by and ascending)
sales_df.sort_values(by=[' sales '], ascending=False)
# Top 10 rows by sales
sales_df.nlargest(10, ' sales ')
# Bottom 5 rows by sales
sales_df.nsmallest(5, ' sales ')

3.2.7.3 Group aggregation

# 1. Monthly sales totals for 2020
# Approach that adds a month column first
sales_df[' month '] = sales_df[' Sales date '].dt.month
sales_df.groupby(' month ')[[' sales ']].sum()
# Approach without adding a month column (the resulting index is still named ' Sales date ')
sales_df.groupby(sales_df[' Sales date '].dt.month)[[' sales ']].sum()
# 2. Share of total sales by brand
total_sales = sales_df[' sales '].sum()
ser = sales_df.groupby(' brand ')[' sales '].sum()
ser / total_sales
# Draw a pie chart (no need to compute the percentages yourself; autopct does it)
ser.plot(kind='pie', autopct='%.1f%%', pctdistance=1.3)
# 3. Monthly sales by region
# Method 1: groupby aggregation followed by a pivot
# Multi-level index (sales region and month become the index columns)
temp_df = sales_df.groupby([' Sales area ', ' month '])[[' sales ']].sum()
temp_df
# Reset the index (region and month become ordinary columns again)
temp_df = temp_df.reset_index()
temp_df
# Pivot (specify which column becomes the index, which becomes the columns, and which supplies the values)
temp_df2 = temp_df.pivot(index=' Sales area ', columns=' month ', values=' sales ').fillna(0).applymap(int)
temp_df2
# Method 2: do it in one step
# Build a pivot table ---> aggregate B grouped by A
sales_df.pivot_table(
    index=' Sales area ',
    columns=' month ',
    values=' sales ',
    aggfunc=np.sum,  # several aggregate functions can be used at once by passing a list
    fill_value=0,
    margins=True,  # add a totals row and column
    margins_name=' A total of '  # name of the totals row / column
).applymap(int)
# 4. Sales volume by brand for each distribution channel
pd.pivot_table(
    sales_df, index=' Distribution channel ', columns=' brand ', values=' sales volumes ',
    aggfunc='sum', margins=True, margins_name=' A total of '
)
# 5. Monthly sales volume by price segment
sales_df[' The price is '].describe()
# Method 1: add a label column
# Define the labelling rule
def make_tag(price):
    if price < 200:
        return ' Low-end '
    return ' Middle end ' if price < 470 else ' High-end '
# Create the label column
sales_df[' Price '] = sales_df[' The price is '].apply(make_tag)
sales_df
# Pivot table
temp_df = pd.pivot_table(sales_df, index=' Price ', columns=' month ', values=' sales volumes ', aggfunc='sum')
temp_df
# Reorder the rows
temp_df = temp_df.reindex(index=[' Low-end ', ' Middle end ', ' High-end '])
temp_df
# Method 2: binning with pd.cut
price_seg = pd.cut(sales_df[' The price is '], bins=[0, 200, 470, 1500], right=False)
price_seg
# Pivot table
pd.pivot_table(sales_df, index=price_seg, columns=' month ', values=' sales volumes ', aggfunc='sum')

3.2.8 summary

  1. Core data type :

    • Series - Represents one-dimensional data
    • DataFrame - Represents two-dimensional data
  2. Core methods and functions

    • Data loading: read_csv / read_excel / read_sql —> DataFrame
      • rename: change the names of row-index and column-index labels
      • reset_index: reset the index (drop=True discards the original index column; level=0 turns only the first index level back into an ordinary column)
      • set_index: set the index, specifying which column or columns to use
      • reindex: reorder the index, for rows or columns (a short sketch follows this list)
    • Data filtering:
      • query: conditional expressions / Boolean indexes
      • drop: delete the rows or columns that are not wanted
    • Data reshaping:
      • merge —> how / on / left_on / right_on / left_index / right_index
      • concat
    • Data preparation:
      • Missing values: isna / notna / dropna / fillna
      • Duplicate values: duplicated / drop_duplicates / nunique
      • Outliers: drop / replace
      • Preprocessing: map / apply / applymap / transform / cut / qcut / get_dummies / where / mask
    • Statistics and sorting:
      • Statistics: sum / mean / count / max / min / std / var / cumsum
      • Sorting: sort_index / sort_values / nlargest / nsmallest
    • Pivoting:
      • Grouping: groupby —> GroupBy object —> aggregate functions
      • Pivot tables: pivot_table —> aggregate B grouped by A —> index / columns / values / aggfunc
    • Data presentation:
      • Plotting: plot —> kind —> line / scatter / bar / barh / pie
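A tiny sketch of the index helpers listed above (demo is a throwaway DataFrame):

demo = pd.DataFrame({'dept': ['a', 'a', 'b'], 'sales': [1, 2, 3]})
demo2 = demo.set_index('dept')                 # use the dept column as the index
demo3 = demo2.reset_index()                    # turn the index back into an ordinary column
demo3.rename(columns={'dept': 'department'})   # rename a column label
demo3.reindex(columns=['sales', 'dept'])       # reorder the columns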

3.2.9 Miscellaneous tips

# 1. agg - aggregation
sales_df.groupby(' month ')[[' sales ']].agg(['sum', 'max', 'min', 'count'])
# 2. Stacking / unstacking indexes
# unstack: a narrow table becomes a wide table (one level of the multi-level index moves up into the columns)
# First build the narrow table
temp_df = sales_df.groupby([' Sales area ', ' month '])[[' sales ']].sum()
temp_df
# Unstack into a wide table
temp_df = temp_df.unstack(level=0).fillna(0).applymap(int)
temp_df
# stack: move the column index back onto the row index (the wide table becomes narrow again)
temp_df=temp_df.stack()
temp_df
# 3. Reordering index levels
# Method 1
temp_df.reorder_levels([' month ', ' Sales area '])
# Method 2
temp_df.swaplevel(0, 1)
# 4. Random sampling
# Sample n rows
sales_df.sample(n=200)
# Sample a fraction (here 10%) of the rows
sales_df.sample(frac=0.1).sort_index()
# 5. interpolation 
ser = pd.Series([0, 1, np.nan, 9, 16, np.nan, 36])
ser
# linear interpolation 
ser.interpolate()
# Pad with the previous value
ser.interpolate(method='pad')
# Polynomial (quadratic) interpolation
ser.interpolate(method='polynomial', order=2)
# 6. Exploding compound (list-like) values
temp_df = pd.DataFrame({'A': [[1, 2, 3], 'foo', 10, 20], 'B': [[10, 20, 30], 1, 1, 1]})
temp_df
''' A B 0 [1, 2, 3] [10, 20, 30] 1 foo 1 2 10 1 3 20 1 '''
temp_df.explode(['A', 'B'])
''' A B 0 1 10 0 2 20 0 3 30 1 foo 1 2 10 1 3 20 1 '''
# 7. Shifting data
# Generate the data 
temp_df = sales_df.groupby(' month ')[[' sales ']].sum()
temp_df
# Shift down one row to create a "previous period" column
temp_df[' Sales of last period '] = temp_df[' sales '].shift(1)
temp_df
# Month-over-month growth rate
100 * (temp_df[' sales '] - temp_df[' Sales of last period ']) / temp_df[' Sales of last period ']
# 8. Use pct_change to compute the month-over-month change
def to_percentage(value):
    if np.isnan(value):
        return '---'
    return f'{value * 100:.2f}%'
temp_df.pct_change()[' sales '].map(to_percentage)
# 9. Window calculation 
import pandas_datareader as pdr
# get data 
baidu_df = pdr.get_data_stooq('BIDU', start='2022-1-1', end='2022-5-31')
baidu_df
# Sort 
baidu_df.sort_index(inplace=True)
baidu_df
# Rolling window: mean of every 5 rows
day5_mean = baidu_df['Close'].rolling(5).mean()
day5_mean
# Rolling window: mean of every 10 rows
day10_mean = baidu_df['Close'].rolling(10).mean()
day10_mean
# Plot the two moving averages
plt.figure(figsize=(8, 4), dpi=150)
plt.plot(baidu_df.index, day5_mean, color='orange')
plt.plot(baidu_df.index, day10_mean, color='blue')
plt.show()
# 10. Calculate the correlation coefficient 
# get data 
boston_df = pd.read_csv('datas/boston_house_price.csv', index_col=0)
boston_df
# Calculate covariance 
boston_df.cov()
# Calculate the correlation coefficient of the two columns 
np.corrcoef(boston_df['RM'], boston_df['PRICE'])
# Pairwise correlation matrix (Pearson correlation by default)
temp_df = boston_df.corr(method='pearson')
temp_df
# Calculate the correlation coefficient between a column and other columns 
temp_df[['PRICE']].style.background_gradient('Reds')
# Calculate the skewness 
boston_df['PRICE'].skew()
# Calculate kurtosis 
boston_df['PRICE'].kurt()

Appendix: skewness and kurtosis


# 11. Index types
# RangeIndex: integer range index
# CategoricalIndex: categorical index
# MultiIndex: multi-level (hierarchical) index
# DatetimeIndex: date/time index
# 1. Multi level index 
# Prepare the data 
stu_ids = np.arange(1001, 1006)
sms = [' Midterm ', ' End of term ']
index = pd.MultiIndex.from_product((stu_ids, sms), names=[' Student number ', ' semester '])
index
from random import random
courses = [' Chinese language and literature ', ' mathematics ', ' English ']
scores = np.random.randint(60, 101, (10, 3))
scores_df = pd.DataFrame(data=scores, columns=courses, index=index)
scores_df
# Weighted average for each student: the midterm counts for 25% and the final for 75%
def handel_score(x):
    # Unpack the two values (x.values would give the underlying array)
    a, b = x
    return a * 0.25 + b * 0.75
scores_df.groupby(level=0).agg(handel_score)
# 2. Time date index 
# Create a list of dates by specifying how many points to generate
pd.date_range('2021-1-1', '2021-6-1', periods=21)
# Create a list of dates with a weekly frequency
pd.date_range('2021-1-1', '2021-6-1', freq='W')
# Shift the generated dates by subtracting an offset
pd.date_range('2021-1-1', '2021-6-1', freq='W') - pd.DateOffset(days=2)
# Take one observation per month ('M' means month end)
baidu_df.asfreq('M')
# One observation every 10 days; forward-fill when a date has no data
baidu_df.asfreq('10D', method='ffill')
# Group (resample) the data by time
baidu_df['Volume'].resample('10D').sum()
# Attach a time zone
baidu_df.tz_localize('Asia/Chongqing')

