您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python data analysis science library pandas (statistical analysis and decision)

編輯：Python

One 、Pandas

It is selected by NumPy Design for the foundation , In data analysis Pandas and NumPy These two modules are often used together .
Pandas It is the preferred library for data processing and data analysis .
Pandas It provides a large number of standard data models and functions and methods required for efficient operation of large data sets ,
Is making Python One of the important factors to become an efficient and powerful data analysis tool .
Use Pandas library , You need to import it and name it pd：

import pandas as pd

Two 、pandas data structure

1）Series, One dimensional array with label ;
2）DatetimeIndex, The time series ;
3）DataFrame, Two dimensional table structure with label and variable size ;
4）Panel, Three dimensional array with label and variable size .

3、 ... and 、series object

1)Series：pandas An object that provides a dictionary structure similar to a one-dimensional array , Used to store a row or column of data .
2)Series The internal structure of an object is composed of two interrelated arrays , Which is used to store data （ Instant value ） Yes. value Main array ,
Each element of the main array has a tag associated with it （ Index ）, These tags are stored in another called Index In the array .
3) If the index is not explicitly specified at the time of creation, it is automatically used from 0 Start with a nonnegative integer as the index .
4)pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
data： The data that produces the sequence , It can be ndarray、list Or a dictionary
index： catalog index , The number of index values must be the same as data The length of the data is the same . If you omit index Parameters , Its index 0, 1, 2, ….
dtype： The data type of the sequence element , If you omit pandas Infer data types from data types .
name： The name of the sequence
copy： Copy the data , Generate a copy of the data . The default value is false

''' 1.Series Object creation （1） Create with list Series Series([ data 1, data 2,…],index=[ Indexes 1, Indexes 2,…]) If you omit index, The index is automatically added '''
import pandas as pd
slist1 = pd.Series([1, 2, 3, 4])
# Automatically add index , Index from 0 Start 
slist2 = pd.Series([1,2,3,4],index=['a','b','c','d'])
slist3 = pd.Series([1,2,3,4],[chr(i+65) for i in range(4)])
print(slist1,slist2,slist3,sep='\n')
#(2) Use range establish Series
#Series(range(strat,stop,step),index=[ Indexes 1, Indexes 2,…])
# Omit index, Automatically add index , Index from 0 Start 
srange1 = pd.Series(range(3))
srange2 = pd.Series(range(2,10,3))
srange3 = pd.Series(range(2,10,2),index=[i for i in range(2,10,2) ])
print(srange1,srange2,srange3,sep='\n')
#(3) Use numpy One dimensional array creation Series
import numpy as np
snumpy1 = pd.Series(np.array([1, 2, 3, 4]))
snumpy2 = pd.Series(np.arange(6,10))
snumpy3 = pd.Series(np.linspace(20,30,5))
#20-30 It is divided into 5 A floating point number 
print(snumpy1,snumpy2,snumpy3,sep='\n')
# Created by array Series when , Reference to the object .
aa = np.array([2,3,4,5])
aas = pd.Series(aa)
aas1 = pd.Series(aa,copy=True)
print(aa,aas,sep='\n')
aa[3] = 999
aas[0] = 1000
print(aa,aas,aas1,sep='\n')
#(4) Create with dictionary Series, The key of the dictionary is the index 
sdict1 = pd.Series( {
'Python Programming ':95,' data structure ':90,' Advanced mathematics ':88} )# Tuples 
sdict2 = pd.Series(dict(python Programming =92, Advanced mathematics =89, data structure =96)) # Convert to tuple -> Indexes 
print(sdict1,sdict2,sep='\n')
# 2.Series View the value and index of 
# utilize values and index You can see Series Values and labels 
print(sdict1.index)
print(sdict1.values)
print(slist2.index)
print(sdict2.values)
# 3. Access and modify by index or slice Series Value 
sdict1 = pd.Series({
'python Programming ':85,' data structure ':87,' Advanced mathematics ':78})
print('\n','sdict1 Raw data '.center(20, '='))
print(sdict1)
sdict1['python Programming ']=98
# How to modify data 
print('\n'," modify sdict1['python Programming ']=89 Post data ".center(30, '='))
print(sdict1)
sdict1[' College English ']=87 # If the index does not exist , Then add the data 
print('\n','sdict1 Data after adding '.center(20, '='))
print(sdict1)
# You can also use the default index for modification and access 
sdict1[0:2]=0 # Local modification in the form of slices 
sdict1[3]=999
print('\n','sdict1[0:2]=0 Slice modification sdict1[3]=999 Data after index modification '.center(40, '='),sdict1,sep='\n')
# 4.Series Operations between objects 
#（1） Arithmetic operation and mathematical function operation 
ss = pd.Series(np.linspace(-2,2,4))
print('\n','ss The original data '.center(14, '='),ss,sep='\n')
print('\n',' Yes ss Calculate the absolute value of all data , Retain 3 Decimal place '.center(20, '='),round(abs(ss),3),sep='\n')
print(np.sqrt(abs(ss)))
# round() Method returns a floating-point number x Round the value of . Its syntax :round( x [, n] )
# Parameters ：x -- Numerical expression ;n -- Numerical expression , From the decimal places .
# abs Function to find the absolute value 
print('\n','ss The arithmetic operation of '.center(14, '='))
print(' Add +5:',ss+5,' reduce -5:',ss-5,' ride *5',ss*5,' except /5',ss/5,sep='\n')
# （2） Select the elements that meet the criteria 
ss = pd.Series([np.random.randint(1,50) for i in range(10)])
# Screening is greater than 25 All the numbers of 
ss1 = ss[ss>25]
print('\n',ss,ss1,sep='\n')
# Screening can be 5 Divisible number 
ss2 = ss[ss%5==0]
print(' Screening can be 5 Divisible number \n',ss2)
# select 20 To 30 Number between 
print(' select 20 To 30 Number between '.center(20,'='))
ss3 = ss[(20<=ss) & (ss<=30)]
print(ss3)
# Screen out less than 20 And greater than 40 Number of numbers 
print(' Screen out less than 20 And greater than 40 Number of numbers '.center(20,'='))
ss3 = ss[(40<=ss) | (ss<=20)]
print(ss3)
# （3） Modify the index reset_index(drop=True)
ss_index1 = ss1.reset_index()
print(ss_index1,type(ss_index1))
ss_index2 = ss1.reset_index(drop=True)
# Delete the original index 
print(ss_index2,type(ss_index2))
# 5.Series Object statistical operation 
stu1 = pd.Series({
'python Programming ':85,' data structure ':87,' Advanced mathematics ':96,' College English ':76})
# Find the maximum index and the minimum index 
print('stu1 Index of maximum value '.center(20, '='))
print(stu1.idxmax())
print('stu1 Index of minimum value '.center(20, '='))
print(stu1.idxmin())
# Calculating mean mean()
print('\n mean value ：',stu1.mean())
print('stu1 Larger than average data in '.center(30,'='))
stu_mean=stu1[stu1>stu1.mean()]
print(stu_mean)
# Determine whether an element exists isin()
nn = [i for i in range(80,90)]
nn1 = stu1.isin(nn)
# Its return value is True or False
print(' With bool Value of output 80--89 Data between '.center(30, '='))
print(nn1)
print(' Only the output bool by True The data of '.center(30, '='))
print(nn1[nn1.values==True]) # The output value is True The data of 
nn1_index = nn1[nn1.values==True].index
print(' Output True Index of data '.center(30, '='))
print(nn1_index)
print(list(nn1_index))
# Project analysis 
import pandas as pd
num = pd.Series([np.random.randint(1,10) for i in range(20)])
print('\n primary Series\n',num)
# take Series Delete duplicate elements in , duplicate removal .
print(' take Series De duplication of elements in unique()'.center(30,'='))
print(num.unique())
# Count the number of occurrences of elements value_counts()
print(' Count the number of occurrences of elements value_counts()'.center(30,'='))
print(num.value_counts())
# Sort , Sort by value sort_values(ascending=True), Sort ascending by default , primary Series unchanged 
print(' Sort by value sort_values()'.center(30,'='))
print(num.sort_values())
# Sort by value sort_values(ascending=False), Sort in descending order , primary Series unchanged 
print(' Sort by value sort_values(ascending=False)'.center(30,'='))
print(num.sort_values(ascending=False))

The operation results are as follows ：

0 1
1 2
2 3
3 4
dtype: int64
a 1
b 2
c 3
d 4
dtype: int64
A 1
B 2
C 3
D 4
dtype: int64
0 0
1 1
2 2
dtype: int64
0 2
1 5
2 8
dtype: int64
2 2
4 4
6 6
8 8
dtype: int64
0 1
1 2
2 3
3 4
dtype: int32
0 6
1 7
2 8
3 9
dtype: int32
0 20.0
1 22.5
2 25.0
3 27.5
4 30.0
dtype: float64
[2 3 4 5]
0 2
1 3
2 4
3 5
dtype: int32
[1000 3 4 999]
0 1000
1 3
2 4
3 999
dtype: int32
0 2
1 3
2 4
3 5
dtype: int32
Python Programming 95
data structure 90
Advanced mathematics 88
dtype: int64
python Programming 92
Advanced mathematics 89
data structure 96
dtype: int64
Index([‘Python Programming ’, ‘ data structure ’, ‘ Advanced mathematics ’], dtype=‘object’)
[95 90 88]
Index([‘a’, ‘b’, ‘c’, ‘d’], dtype=‘object’)
[92 89 96]
=sdict1 Raw data =
python Programming 85
data structure 87
Advanced mathematics 78
dtype: int64
= modify sdict1[‘python Programming ’]=89 Post data =
python Programming 98
data structure 87
Advanced mathematics 78
dtype: int64
sdict1 Data after adding =
python Programming 98
data structure 87
Advanced mathematics 78
College English 87
dtype: int64
=sdict1[0:2]=0 Slice modification sdict1[3]=999 Data after index modification =
python Programming 0
data structure 0
Advanced mathematics 78
College English 999
dtype: int64
ss The original data =
0 -2.000000
1 -0.666667
2 0.666667
3 2.000000
dtype: float64
= Yes ss Calculate the absolute value of all data , Retain 3 Decimal place =
0 2.000
1 0.667
2 0.667
3 2.000
dtype: float64
0 1.414214
1 0.816497
2 0.816497
3 1.414214
dtype: float64
=ss The arithmetic operation of ==
Add +5:
0 3.000000
1 4.333333
2 5.666667
3 7.000000
dtype: float64
reduce -5:
0 -7.000000
1 -5.666667
2 -4.333333
3 -3.000000
dtype: float64
ride *5
0 -10.000000
1 -3.333333
2 3.333333
3 10.000000
dtype: float64
except /5
0 -0.400000
1 -0.133333
2 0.133333
3 0.400000
dtype: float64
0 28
1 29
2 33
3 25
4 21
5 46
6 17
7 13
8 13
9 20
dtype: int64
0 28
1 29
2 33
5 46
dtype: int64
Screening can be 5 Divisible number
3 25
9 20
dtype: int64
select 20 To 30 Number between
0 28
1 29
3 25
4 21
9 20
dtype: int64
= Screen out less than 20 And greater than 40 Number of numbers =
5 46
6 17
7 13
8 13
9 20
dtype: int64
index 0
0 0 28
1 1 29
2 2 33
3 5 46 <class ‘pandas.core.frame.DataFrame’>
0 28
1 29
2 33
3 46
dtype: int64 <class ‘pandas.core.series.Series’>
=stu1 Index of maximum value =
Advanced mathematics
=stu1 Index of minimum value =
College English
mean value ： 86.0
stu1 Larger than average data in =
data structure 87
Advanced mathematics 96
dtype: int64
With bool Value of output 80–89 Data between
python Programming True
data structure True
Advanced mathematics False
College English False
dtype: bool
= Only the output bool by True The data of ==
python Programming True
data structure True
dtype: bool
= Output True Index of data =
Index([‘python Programming ’, ‘ data structure ’], dtype=‘object’)
[‘python Programming ’, ‘ data structure ’]
primary Series
0 4
1 7
2 4
3 5
4 1
5 1
6 5
7 8
8 6
9 7
10 4
11 8
12 9
13 1
14 6
15 5
16 6
17 7
18 2
19 1
dtype: int64
take Series De duplication of elements in unique()=
[4 7 5 1 8 6 9 2]
= Count the number of occurrences of elements value_counts()=
1 4
7 3
6 3
5 3
4 3
8 2
9 1
2 1
dtype: int64
Sort by value sort_values()
19 1
13 1
4 1
5 1
18 2
10 4
0 4
2 4
3 5
15 5
6 5
8 6
14 6
16 6
1 7
17 7
9 7
7 8
11 8
12 9
dtype: int64
Sort by value sort_values(ascending=False)
12 9
11 8
7 8
9 7
17 7
1 7
16 6
14 6
8 6
6 5
15 5
3 5
2 4
0 4
10 4
18 2
5 1
4 1
13 1
19 1
dtype: int64