您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python數據分析科學庫——Pandas（統計分析與決策）

編輯：Python

一、Pandas

它是選擇以NumPy為基礎進行設計，在數據分析中Pandas和NumPy這兩個模塊經常是一起使用。
Pandas是數據處理和數據分析首選庫。
Pandas提供了大量標准數據模型和高效操作大型數據集所需要的函數和方法，
是使得Python能夠成為高效且強大的數據分析工具的重要因素之一。
使用Pandas庫時，需要導入它並命名別名為pd：

import pandas as pd

二、pandas數據結構

1）Series，帶標簽的一維數組；
2）DatetimeIndex，時間序列；
3）DataFrame，帶標簽且大小可變的二維表格結構；
4）Panel，帶標簽且大小可變的三維數組。

三、series對象

1)Series：pandas提供類似於一維數組的字典結構的對象，用於存儲一行或一列的數據。
2)Series對象的內部結構是由兩個相互關聯的數組組成，其中用於存放數據（即值）的是value主數組，
主數組的每個元素都有一個與之相關聯的標簽（即索引），這些標簽存儲在另外一個叫做Index的數組中。
3)如果在創建時沒有明確指定索引則會自動使用從0開始的非負整數作為索引。
4)pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
data：產生序列的數據，可以是ndarray、list或字典
index：設置索引，索引值的個數與需與data數據的長度相同。若省略index參數，其索引0, 1, 2, …。
dtype：序列元素的數據類型，若省略pandas以數據的類型推斷數據類型。
name：序列的名稱
copy：復制數據，產生數據副本。默認值是false

''' 1.Series對象的創建 （1）使用列表創建Series Series([數據1, 數據2,…]，index=[索引1, 索引2,…]) 若省略index,則自動添加索引 '''
import pandas as pd
slist1 = pd.Series([1, 2, 3, 4])
#自動添加索引，索引從0開始 
slist2 = pd.Series([1,2,3,4],index=['a','b','c','d'])
slist3 = pd.Series([1,2,3,4],[chr(i+65) for i in range(4)])
print(slist1,slist2,slist3,sep='\n')
#(2) 使用range創建Series
#Series(range(strat,stop,step)，index=[索引1, 索引2,…])
#省略index，自動添加索引，索引從0開始
srange1 = pd.Series(range(3))
srange2 = pd.Series(range(2,10,3))
srange3 = pd.Series(range(2,10,2),index=[i for i in range(2,10,2) ])
print(srange1,srange2,srange3,sep='\n')
#(3) 使用numpy一維數組創建Series
import numpy as np
snumpy1 = pd.Series(np.array([1, 2, 3, 4]))
snumpy2 = pd.Series(np.arange(6,10))
snumpy3 = pd.Series(np.linspace(20,30,5))
#20-30之間分為5個浮點數
print(snumpy1,snumpy2,snumpy3,sep='\n')
#由數組創建Series時，引用了該對象。
aa = np.array([2,3,4,5])
aas = pd.Series(aa)
aas1 = pd.Series(aa,copy=True)
print(aa,aas,sep='\n')
aa[3] = 999
aas[0] = 1000
print(aa,aas,aas1,sep='\n')
#(4) 使用字典創建Series，其中字典的鍵就是索引
sdict1 = pd.Series( {
'Python程序設計':95,'數據結構':90,'高等數學':88} )#元組
sdict2 = pd.Series(dict(python編程=92,高等數學=89,數據結構=96)) #轉換為元組->索引
print(sdict1,sdict2,sep='\n')
# 2.Series 的值與索引的查看
# 利用values和index可查看Series值和標簽
print(sdict1.index)
print(sdict1.values)
print(slist2.index)
print(sdict2.values)
# 3.通過索引或切片訪問與修改Series的值
sdict1 = pd.Series({
'python程序設計':85,'數據結構':87,'高等數學':78})
print('\n','sdict1原始數據'.center(20, '='))
print(sdict1)
sdict1['python程序設計']=98
# 修改數據的方法
print('\n',"修改sdict1['python程序設計']=89後數據".center(30, '='))
print(sdict1)
sdict1['大學英語']=87 #若該索引不存在，則增加該數據
print('\n','sdict1增加後數據'.center(20, '='))
print(sdict1)
# 也可采用默認的索引進行修改與訪問
sdict1[0:2]=0 #采用切片的形式進行局部修改
sdict1[3]=999
print('\n','sdict1[0:2]=0 切片修改sdict1[3]=999索引修改後數據'.center(40, '='),sdict1,sep='\n')
# 4.Series對象間的運算
#（1）算術運算與數學函數運算
ss = pd.Series(np.linspace(-2,2,4))
print('\n','ss原數據'.center(14, '='),ss,sep='\n')
print('\n','對ss所有數據求絕對值,保留3位小數'.center(20, '='),round(abs(ss),3),sep='\n')
print(np.sqrt(abs(ss)))
# round() 方法返回浮點數x的四捨五入值。它的的語法:round( x [, n] )
# 參數：x -- 數值表達式；n -- 數值表達式，表示從小數點位數。
# abs函求絕對值
print('\n','ss的算術運算'.center(14, '='))
print('加+5:',ss+5,'減-5:',ss-5,'乘*5',ss*5,'除/5',ss/5,sep='\n')
# （2）篩選符合條件的元素
ss = pd.Series([np.random.randint(1,50) for i in range(10)])
# 篩選大於25的所有數
ss1 = ss[ss>25]
print('\n',ss,ss1,sep='\n')
# 篩選能被5整除的數
ss2 = ss[ss%5==0]
print('篩選能被5整除的數\n',ss2)
# 篩選出20到30之間的數
print('篩選出20到30之間的數'.center(20,'='))
ss3 = ss[(20<=ss) & (ss<=30)]
print(ss3)
# 篩選出小於20與大於40的數
print('篩選出小於20與大於40的數'.center(20,'='))
ss3 = ss[(40<=ss) | (ss<=20)]
print(ss3)
# （3）修改索引 reset_index(drop=True)
ss_index1 = ss1.reset_index()
print(ss_index1,type(ss_index1))
ss_index2 = ss1.reset_index(drop=True)
#刪除原索引
print(ss_index2,type(ss_index2))
# 5.Series對象統計運算
stu1 = pd.Series({
'python程序設計':85,'數據結構':87,'高等數學':96,'大學英語':76})
#求最大值的索引與最小值的索引
print('stu1最大值的索引'.center(20, '='))
print(stu1.idxmax())
print('stu1最小值的索引'.center(20, '='))
print(stu1.idxmin())
# 求均值 mean()
print('\n均值：',stu1.mean())
print('stu1中大於平均值的數據'.center(30,'='))
stu_mean=stu1[stu1>stu1.mean()]
print(stu_mean)
#判斷元素是否存在 isin()
nn = [i for i in range(80,90)]
nn1 = stu1.isin(nn)
# 它的返回值是 True or False
print('以bool值輸出80--89間的數據'.center(30, '='))
print(nn1)
print('只輸出bool為True的數據'.center(30, '='))
print(nn1[nn1.values==True]) #輸出值為True的數據
nn1_index = nn1[nn1.values==True].index
print('輸出True的數據的索引'.center(30, '='))
print(nn1_index)
print(list(nn1_index))
# 項目分析
import pandas as pd
num = pd.Series([np.random.randint(1,10) for i in range(20)])
print('\n原Series\n',num)
#將Series中的重復元素刪除，去重。
print('將Series中元素去重 unique()'.center(30,'='))
print(num.unique())
#統計元素出現的次數 value_counts()
print('統計元素出現的次數 value_counts()'.center(30,'='))
print(num.value_counts())
#進行排序,按值排序 sort_values(ascending=True),默認按升序排序，原Series不變
print('按值排序 sort_values()'.center(30,'='))
print(num.sort_values())
#按值排序 sort_values(ascending=False),按降序排序，原Series不變
print('按值排序 sort_values(ascending=False)'.center(30,'='))
print(num.sort_values(ascending=False))

運行結果如下：

0 1
1 2
2 3
3 4
dtype: int64
a 1
b 2
c 3
d 4
dtype: int64
A 1
B 2
C 3
D 4
dtype: int64
0 0
1 1
2 2
dtype: int64
0 2
1 5
2 8
dtype: int64
2 2
4 4
6 6
8 8
dtype: int64
0 1
1 2
2 3
3 4
dtype: int32
0 6
1 7
2 8
3 9
dtype: int32
0 20.0
1 22.5
2 25.0
3 27.5
4 30.0
dtype: float64
[2 3 4 5]
0 2
1 3
2 4
3 5
dtype: int32
[1000 3 4 999]
0 1000
1 3
2 4
3 999
dtype: int32
0 2
1 3
2 4
3 5
dtype: int32
Python程序設計 95
數據結構 90
高等數學 88
dtype: int64
python編程 92
高等數學 89
數據結構 96
dtype: int64
Index([‘Python程序設計’, ‘數據結構’, ‘高等數學’], dtype=‘object’)
[95 90 88]
Index([‘a’, ‘b’, ‘c’, ‘d’], dtype=‘object’)
[92 89 96]
=sdict1原始數據=
python程序設計 85
數據結構 87
高等數學 78
dtype: int64
=修改sdict1[‘python程序設計’]=89後數據=
python程序設計 98
數據結構 87
高等數學 78
dtype: int64
sdict1增加後數據=
python程序設計 98
數據結構 87
高等數學 78
大學英語 87
dtype: int64
=sdict1[0:2]=0 切片修改sdict1[3]=999索引修改後數據=
python程序設計 0
數據結構 0
高等數學 78
大學英語 999
dtype: int64
ss原數據=
0 -2.000000
1 -0.666667
2 0.666667
3 2.000000
dtype: float64
=對ss所有數據求絕對值,保留3位小數=
0 2.000
1 0.667
2 0.667
3 2.000
dtype: float64
0 1.414214
1 0.816497
2 0.816497
3 1.414214
dtype: float64
=ss的算術運算==
加+5:
0 3.000000
1 4.333333
2 5.666667
3 7.000000
dtype: float64
減-5:
0 -7.000000
1 -5.666667
2 -4.333333
3 -3.000000
dtype: float64
乘*5
0 -10.000000
1 -3.333333
2 3.333333
3 10.000000
dtype: float64
除/5
0 -0.400000
1 -0.133333
2 0.133333
3 0.400000
dtype: float64
0 28
1 29
2 33
3 25
4 21
5 46
6 17
7 13
8 13
9 20
dtype: int64
0 28
1 29
2 33
5 46
dtype: int64
篩選能被5整除的數
3 25
9 20
dtype: int64
篩選出20到30之間的數
0 28
1 29
3 25
4 21
9 20
dtype: int64
=篩選出小於20與大於40的數=
5 46
6 17
7 13
8 13
9 20
dtype: int64
index 0
0 0 28
1 1 29
2 2 33
3 5 46 <class ‘pandas.core.frame.DataFrame’>
0 28
1 29
2 33
3 46
dtype: int64 <class ‘pandas.core.series.Series’>
=stu1最大值的索引=
高等數學
=stu1最小值的索引=
大學英語
均值： 86.0
stu1中大於平均值的數據=
數據結構 87
高等數學 96
dtype: int64
以bool值輸出80–89間的數據
python程序設計 True
數據結構 True
高等數學 False
大學英語 False
dtype: bool
=只輸出bool為True的數據==
python程序設計 True
數據結構 True
dtype: bool
=輸出True的數據的索引=
Index([‘python程序設計’, ‘數據結構’], dtype=‘object’)
[‘python程序設計’, ‘數據結構’]
原Series
0 4
1 7
2 4
3 5
4 1
5 1
6 5
7 8
8 6
9 7
10 4
11 8
12 9
13 1
14 6
15 5
16 6
17 7
18 2
19 1
dtype: int64
將Series中元素去重 unique()=
[4 7 5 1 8 6 9 2]
=統計元素出現的次數 value_counts()=
1 4
7 3
6 3
5 3
4 3
8 2
9 1
2 1
dtype: int64
按值排序 sort_values()
19 1
13 1
4 1
5 1
18 2
10 4
0 4
2 4
3 5
15 5
6 5
8 6
14 6
16 6
1 7
17 7
9 7
7 8
11 8
12 9
dtype: int64
按值排序 sort_values(ascending=False)
12 9
11 8
7 8
9 7
17 7
1 7
16 6
14 6
8 6
6 5
15 5
3 5
2 4
0 4
10 4
18 2
5 1
4 1
13 1
19 1
dtype: int64