Hands-on notes on winsorizing data in Python

Recently I found that winsorizing can write data into cells that were originally empty or held invalid values. Studies traditionally drop null values before winsorizing, but some data sets need their extreme values handled without dropping the nulls, so the winsorizing step cannot be skipped. Here are some notes based on my own experience:

Take data stored in an Excel file as an example:

from scipy.stats.mstats import winsorize
import numpy as np
import pandas as pd
df = pd.read_excel('Excel.xlsx', engine='openpyxl', header=0)
df_list = ["a", "b", "c"]  # names of the columns to winsorize

1: Apply winsorize directly, without considering null and invalid values. The result may fill some null values with data.

# Winsorize the continuous data in each listed column at the 1st and 99th percentiles
for i in df_list:
    df[i] = winsorize(df[i], limits=[0.01, 0.01])
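To see the problem on a small scale, here is a minimal sketch using a made-up ten-element array (the values and the 10% limits are assumptions, chosen only so the effect shows up on a tiny sample). NaN sorts to the end, so it is treated as part of the upper tail and can be overwritten:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data: nine valid values and one missing value
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan])

# Winsorize 10% in each tail without any NaN handling
result = np.ma.getdata(winsorize(data, limits=[0.1, 0.1]))

# Compare the number of NaN entries before and after
print(np.isnan(data).sum(), np.isnan(result).sum())
```

Comparing the two counts shows whether the missing slot survived the call.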

2.1: Mask null and invalid values and winsorize only the remaining values. The result does not change the original null and invalid values.

# np.where(condition, x, y) returns x where condition holds and y otherwise.
# Null positions stay NaN; everything else is winsorized at 1% and 99% with null and invalid values masked.
for i in df_list:
    df[i] = np.where(df[i].isnull(), np.nan, winsorize(np.ma.masked_invalid(df[i]), limits=(0.01, 0.01)))
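The same idea can be checked on a toy array. This is only a sketch: the data and the 10% limits are made up for illustration, and it works on a plain NumPy array rather than a DataFrame column.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data: ten valid values with one NaN in the middle
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, np.nan, 6.0, 7.0, 8.0, 9.0, 10.0])

# Mask NaN/inf so the limits are computed from the valid values only
clipped = winsorize(np.ma.masked_invalid(data), limits=(0.1, 0.1))

# Put NaN back wherever the input was NaN
result = np.where(np.isnan(data), np.nan, np.ma.getdata(clipped))
print(result)
```

The np.where step matters: it guarantees the missing position is NaN in the output regardless of what winsorize wrote into the masked slot.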

2.2: winsorize itself provides a nan_policy parameter, but I did not get this method to work. For reference only:

for i in df_list:
    df[i] = winsorize(df[i], limits=[0.01, 0.01], nan_policy='omit')

3: Mask out null and invalid values and winsorize all the remaining values. The result does not change the original null and invalid values. The difference from method 2 is that in method 3 the length of the data to be winsorized does not change.

for i in df_list:
    # mask is just a boolean index marking which rows are not NaN.
    # For a column [1, NaN, 2], df['A'].isna() gives [False, True, False],
    # and df['A'].notna() gives the opposite, [True, False, True].
    # Such an array is called a mask; it selects specific rows of the DataFrame.
    mask = df[i].notna()
    df.loc[mask, i] = winsorize(df[i].loc[mask], limits=[0.01, 0.01])
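A small self-contained sketch of this boolean-mask approach follows. The one-column frame and the 10% limits are hypothetical, and converting the result with np.ma.getdata before writing it back is my own precaution, not part of the original code:

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical single-column frame with one missing value
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]})

mask = df["a"].notna()                 # True for rows that hold data
valid = df.loc[mask, "a"].to_numpy()   # the ten non-null values

# winsorize returns a masked array; take the plain data before writing back
df.loc[mask, "a"] = np.ma.getdata(winsorize(valid, limits=[0.1, 0.1]))

print(df["a"].tolist())
```

Only the rows selected by the mask are touched, so the column keeps its length and the missing row stays missing.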

In the descriptive statistics that followed, I ran into negative infinity values, so I replaced them with null values.

# If you need to replace infinite values with null values
df = df.replace(-np.inf, np.nan)
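The replacement above only covers negative infinity. If positive infinity can also occur in the data (an assumption here, since I only hit the negative case), a sketch that handles both in one call might look like this, using a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing both infinities
df = pd.DataFrame({"a": [1.0, -np.inf, 3.0], "b": [np.inf, 2.0, 4.0]})

# Replace both +inf and -inf with NaN in one call
df = df.replace([np.inf, -np.inf], np.nan)

print(df.isna().sum().to_dict())
```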

(My thanks to Mr. Zhang, Miss Li, and Miss Sun, who took the trouble to provide me with references!)
