
Pandas File-Reading Performance Optimization Tricks


1. to_pickle performance optimization

While working on the AETA earthquake prediction AI algorithm competition recently, I noticed these lines in its baseline code:

# Take the intersection of the three: stations with a _magn file, a _sound file, and both MagnUpdate & SoundUpdate
usable_stations = _continueable_stations & _set_magn & _set_sound
dump_object(Usable_Station_Path, usable_stations)
print('Merging data:')
for type in ('magn', 'sound'):
    res = []
    for _id in tqdm(usable_stations, desc=f'{type}:'):
        # Data_Folder_Path+str(_id)+f'_{type}.csv' ==> e.g. './data/19_magn.csv'
        # Used_features selects the corresponding feature columns
        _df = pd.read_csv(Data_Folder_Path+str(_id)+f'_{type}.csv')[Used_features[type]]
        res.append(_df)
    final_df = pd.concat(res)
    final_df.to_pickle(Merged_Data_Path[type])  # <-------------- look here
    del final_df

Why does it export the merged result to a pkl file with .to_pickle after reading and concatenating?

Merging data:
magn:: 100%|██████████| 131/131 [11:44<00:00, 5.38s/it]
sound:: 100%|██████████| 131/131 [08:51<00:00, 4.05s/it]

Also note the tqdm(usable_stations, desc=f'{type}:') call: tqdm already appends a ':' after desc on its own, so there is no need to put another ':' inside the desc argument (that is why the log above shows 'magn::').
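A tiny standalone illustration of that desc behaviour (not from the baseline):

from tqdm import tqdm
import time

# tqdm prints the desc followed by ': ' by itself,
# so desc='magn' already renders as 'magn: 100%|...'
for _ in tqdm(range(3), desc='magn'):
    time.sleep(0.1)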

The post 提速百倍的Pandas性能優化方法,讓你的Pandas飛起來! points out that among csv, hdf and pkl files, pkl is the fastest format to read, which is why the baseline caches the merged DataFrame as a pickle so that later runs can skip the slow CSV reads.
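A minimal sketch of that cache-as-pickle pattern, reusing the example CSV path from the comment above ('./data/19_magn.csv'); the pickle path is just a made-up counterpart:

import time
import pandas as pd

csv_path, pkl_path = './data/19_magn.csv', './data/19_magn.pkl'

t0 = time.time()
df = pd.read_csv(csv_path)       # first run: slow CSV parse
print(f'read_csv:    {time.time() - t0:.3f}s')

df.to_pickle(pkl_path)           # cache the parsed DataFrame once

t0 = time.time()
df = pd.read_pickle(pkl_path)    # later runs: fast binary load
print(f'read_pickle: {time.time() - t0:.3f}s')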


2. isin() performance optimization

The baseline also contains these two lines:

# Pull out the AETA station data for this region
local_magn_data = magn_data[magn_data['StationID'].apply(lambda x: x in ID_list)].reset_index(drop=True)

Wow, this is painfully slow!!

Broken down, it means:

idx = magn_data['StationID'].apply(lambda x: x in ID_list)  # True for rows whose StationID is in ID_list, False otherwise
local_magn_data = magn_data[idx]                            # keep those rows
local_magn_data = local_magn_data.reset_index(drop=True)    # reset the index

Following 提速百倍的Pandas性能優化方法,讓你的Pandas飛起來!, we try replacing .apply(lambda x: x in ID_list) with the .isin method:

local_magn_data = magn_data[magn_data['StationID'].isin(ID_list)].reset_index(drop=True)

After the replacement it runs through in about one second when single-stepping. Yes!
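A quick self-contained comparison, with synthetic data standing in for magn_data and ID_list, showing that both versions select exactly the same rows and that .isin is far faster:

import time
import numpy as np
import pandas as pd

# Synthetic stand-ins for magn_data and ID_list
magn_data = pd.DataFrame({'StationID': np.random.randint(0, 200, size=1_000_000)})
ID_list = list(range(0, 200, 3))

t0 = time.time()
slow = magn_data[magn_data['StationID'].apply(lambda x: x in ID_list)].reset_index(drop=True)
print(f'apply: {time.time() - t0:.3f}s')

t0 = time.time()
fast = magn_data[magn_data['StationID'].isin(ID_list)].reset_index(drop=True)
print(f'isin:  {time.time() - t0:.3f}s')

print(slow.equals(fast))  # True: identical selections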


3. Don't use None

res_df[f'{feature}_mean'] = None
res_df[f'{feature}_max'] = None
res_df[f'{feature}_min'] = None
res_df[f'{feature}_max_min'] = None
for i, row in res_df.iterrows():
    endDay = row['Day']
    startDay = endDay - window
    data_se = df[(df['Day'] > startDay) & (df['Day'] <= endDay)][feature]
    res_df[f'{feature}_mean'].iloc[i] = data_se.mean()
    res_df[f'{feature}_max'].iloc[i] = data_se.max()
    res_df[f'{feature}_min'].iloc[i] = data_se.min()
    res_df[f'{feature}_max_min'].iloc[i] = data_se.max() - data_se.min()

This first reserves placeholder columns such as res_df[f'{feature}_mean'], but because they are initialised with None the columns end up with the object dtype when they are filled in later.
So it is better to change it to:

res_df[f'{feature}_mean'] = 0.0
res_df[f'{feature}_max'] = 0.0
res_df[f'{feature}_min'] = 0.0
res_df[f'{feature}_max_min'] = 0.0
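A small demonstration of the dtype difference, using a made-up 'magn_mean' column:

import pandas as pd

res_df = pd.DataFrame({'Day': [1, 2, 3]})

res_df['magn_mean'] = None        # placeholder with None
print(res_df['magn_mean'].dtype)  # object

res_df['magn_mean'] = 0.0         # placeholder with a float
print(res_df['magn_mean'].dtype)  # float64

Numeric operations on an object column fall back to slow Python-level code, so keeping the column float64 from the start avoids that overhead.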

4. Reducing memory

The tricks above speed up reading. They remind me of a way to cut the memory a Pandas DataFrame uses, from the Kaggle notebook [Reducing DataFrame memory size by ~65%]:

Just use the function written there:

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

def reduce_mem_usage(props):
    # Adapted from:
    # https://www.kaggle.com/code/arjanso/reducing-dataframe-memory-size-by-65/notebook
    start_mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # skip strings
            # Print the current column's dtype
            print("******************************")
            print("Column: ", col)
            print("dtype before: ", props[col].dtype)
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            # Integer does not support NA, therefore NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col].fillna(mn - 1, inplace=True)
            # test if the column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            # if the difference is essentially zero, the column can be stored as int
            if result > -0.01 and result < 0.01:
                IsInt = True
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            # if it cannot be stored as an integer, fall back to float32
            else:
                props[col] = props[col].astype(np.float32)
            # Print new column type
            print("dtype after: ", props[col].dtype)
            print("******************************")
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist

Usage:

props = pd.read_csv(r"../input/properties_2016.csv")  # The properties dataset
# props is the memory-reduced DataFrame; NAlist holds the names of columns whose missing values were filled
props, NAlist = reduce_mem_usage(props)
print("_________________")
print("")
print("Warning: the following columns have missing values filled with 'df['column_name'].min() -1': ")
print("_________________")
print("")
print(NAlist)
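As a lighter-weight alternative (not from the referenced notebook), pandas' own pd.to_numeric with downcast= achieves a similar effect while leaving NaN values untouched; a minimal sketch:

import pandas as pd

def downcast_numeric(df):
    # Downcast every numeric column to the smallest dtype pandas will accept.
    for col in df.select_dtypes(include='number').columns:
        if pd.api.types.is_integer_dtype(df[col]):
            target = 'unsigned' if (df[col] >= 0).all() else 'integer'
        else:
            target = 'float'  # float columns (including those with NaN) stay float
        df[col] = pd.to_numeric(df[col], downcast=target)
    return df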

See also:
How to Speed up Pandas by 4x with one line of code
How to Speed Up Pandas Calculations

Appendix

On how np.isfinite is used:

>>> np.isfinite(1)
True
>>> np.isfinite(0)
True
>>> np.isfinite(np.nan)
False
>>> np.isfinite(np.inf)
False
>>> np.isfinite(np.NINF)
False
>>> np.isfinite([np.log(-1.),1.,np.log(0)])
array([False, True, False])
>>> x = np.array([-np.inf, 0., np.inf])
>>> y = np.array([2, 2, 2])
>>> np.isfinite(x, y)
array([0, 1, 0])
>>> y
array([0, 1, 0])

np.isfinite returns True if x is not positive infinity, negative infinity, or NaN; otherwise it returns False. If x is a scalar, the result is a scalar.
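That check matters in reduce_mem_usage above because integer dtypes cannot hold NaN; a tiny example with a hypothetical column:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])
print(np.isfinite(s).all())        # False: the column contains NaN
# s.astype(np.int64)               # would raise: integers cannot represent NaN
print(s.fillna(s.min() - 1).astype(np.int64).tolist())  # [1, 2, 0]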

