Recently, I found that , When shrinking the tail, the data is filled in the place that is originally empty or invalid value . Traditional research will eliminate null values and then shrink the tail , However, some data sets that do not need to eliminate null values need to eliminate extreme values , Therefore, the tail contraction cannot be omitted . Make some records based on your own operating experience :
To save in Excel Take the data in as an example :
from scipy.stats.mstats import winsorizeimport pandas as pddf = pd.read_excel('Excel.xlsx', engine='openpyxl', header=0)df_list=["a","b","c"]# Column names that need to be shortened
1: Direct application Winsorize, Do not consider null and invalid values , Tailing results may cause some null values to be filled with data
for i in df_list(): df[i]=winsorize(df[i],limits=[0.01, 0.01])# The continuous data in the specified column is 1% and 99% Shrinking tail of (Winsorize) Handle
2.1: Mask null and invalid values , Only for other values Winsorize Handle , The shrinking result does not change the original null and invalid values
for i in df_list(): df[i]=np.where(df[i].isnull(), np.nan, winsorize(np.ma.masked_invalid(df[i]),limits=(0.01,0.01)))#np.where(condition, x, y), Satisfy condition yes x, otherwise y# Here to judge whether the value is null , If yes, it is blank , If not, mask null and invalid values 1% and 99% Tail reduction treatment
2.2:winsorize Provided parameters , But I didn't succeed in this method … For reference only
for i in df_list(): df[i]=winsorize(df[i],limits=[0.01, 0.01], nan_policy='omit')
3: Mask null and invalid values , All values are Winsorize Handle , The shrinking result does not change the original null and invalid values , With the method 2 The difference is the method 3 There is no change in the length of data that needs to be shortened
for i in df_list(): mask = df[i].notna() df.loc[mask,i] = winsorize(df[i].loc[mask],limits=[0.01, 0.01]) # This mask It's just one. bool index, Indicate where nan # For example, a column of data is [1, NaN, 2], If you use df['A'].isnan() What you get is a [False, True, False] Array of # This array is called mask, It can be dataframe Select the specific data in
I encountered the problem of negative infinity in the follow-up descriptive statistics , So replace it with a null value
# If you need to replace infinite value with null value df=df.replace(-np.Inf,np.NaN)
( I would like to thank Mr. Zhang who took the trouble to provide me with reference 、 Miss li 、 Miss sun !)
Reference article :
1.Winsorize But in Python Ignored in nan
2. of numpy.ma.masked_invalid Usage of
3.Python Data analysis - Tail reduction treatment
summary
This is about Python Application in Winsorize This is the end of the article on tail reduction , More about Python application Winsorize Please search the previous articles of SDN or continue to browse the relevant articles below. I hope you will support SDN more in the future !