
Pandas Missing Data Handling: A Complete Guide


Author | Dongge takes off

Source | Dongge takes off

In my spare time I have been reorganizing my notes on data cleaning and data analysis, together with some tips I use regularly. This installment starts with data cleaning and missing value handling.

All data and code are available in my GitHub repository:

https://github.com/xiaoyusmd/PythonDataScience

One: Missing value types

In pandas, missing data is displayed as NaN. There are three ways to represent a missing value: np.nan, None, and pd.NA.

1、np.nan

Missing values have one notorious quirk: np.nan is not equal to any value, not even to itself. An equality comparison between nan and anything else returns False, and arithmetic with nan returns nan.

np.nan == np.nan
>> False

Because of this behavior, after a dataset is read in, missing values are represented as np.nan by default, no matter what type the column holds.

Because nan is a floating-point value in NumPy, an integer column that contains a missing value is converted to float; a string column cannot be converted to float, so it falls back to object dtype ('O'); float columns keep their original type.

type(np.nan)
>> float
pd.Series([1,2,3]).dtype
>> dtype('int64')
pd.Series([1,np.nan,3]).dtype
>> dtype('float64')

Beginners are often confused by the object dtype during data processing: a column looks like plain strings, yet its dtype changes after import. This is usually caused by missing values.
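A quick way to see this (a minimal sketch, assuming numpy and pandas are imported as np and pd): a string column that contains np.nan is stored as object dtype.

pd.Series(['a', np.nan, 'c']).dtype
>> dtype('O')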

In addition, time series have a dedicated missing value, NaT. It is a built-in pandas type that can be regarded as the datetime version of np.nan, and like np.nan it is not equal to itself.

s_time = pd.Series([pd.Timestamp('20220101')]*3)
s_time
>> 0 2022-01-01
   1 2022-01-01
   2 2022-01-01
   dtype: datetime64[ns]
-----------------
s_time[2] = pd.NaT
s_time
>> 0 2022-01-01
   1 2022-01-01
   2 NaT
   dtype: datetime64[ns]

2、None

There is also None. It behaves a little better than nan, because it is at least equal to itself.

None == None
>> True

When placed into a numeric column, it is automatically converted to np.nan.

type(pd.Series([1,None])[1])
>> numpy.float64

It stays unchanged only in object columns. So unless you assign None by hand, it basically does not appear on its own in pandas; you will hardly ever see it.

type(pd.Series([1,None],dtype='O')[1])
>> NoneType

3、NA Scalar

pandas 1.0 and later versions introduce a scalar that represents missing values, pd.NA. It can stand for a missing integer, a missing boolean, or a missing string, and the feature is still experimental.

The developers noticed this problem too: adopting a different missing value for each data type is messy. pd.NA exists to unify them. Its goal is to provide a single missing value indicator that can be used consistently across data types, instead of switching between np.nan, None, and NaT depending on the situation.

s_new = pd.Series([1, 2], dtype="Int64")
s_new
>> 0   1
   1   2
   dtype: Int64
-----------------
s_new[1] = pd.NA
s_new
>> 0    1
   1  <NA>
   dtype: Int64

Similarly, for the nullable boolean and string dtypes, pd.NA leaves the original data type unchanged, which solves the annoyance of everything turning into object.
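For example (a sketch assuming pandas 1.0+ with the nullable "boolean" and "string" extension dtypes), pd.NA keeps each dtype intact:

pd.Series([True, pd.NA], dtype="boolean")
>> 0    True
   1    <NA>
   dtype: boolean
-----------------
pd.Series(["a", pd.NA], dtype="string")
>> 0       a
   1    <NA>
   dtype: string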

Here are examples of pd.NA in some common arithmetic and comparison operations:

#####  Arithmetic operations
#  Add
pd.NA + 1
>> <NA>
-----------
#  Multiplication
"a" * pd.NA
>> <NA>
-----------
#  The following two expressions both evaluate to 1
pd.NA ** 0
>> 1
-----------
1 ** pd.NA
>> 1
#####  Comparison operations
pd.NA == pd.NA
>> <NA>
-----------
pd.NA < 2.5
>> <NA>
-----------
np.log(pd.NA)
>> <NA>
-----------
np.add(pd.NA, 1)
>> <NA>

Two: Missing value detection

Now that we know the forms a missing value can take, we need to detect it. For a DataFrame, the main methods are isnull() and isna() (they are aliases), both of which return boolean True/False values. They can be applied to the whole DataFrame or to a single column.

df = pd.DataFrame({
      'A':['a1','a1','a2','a3'],
      'B':['b1',None,'b2','b3'],
      'C':[1,2,3,4],
      'D':[5,None,9,10]})
#  Treat infinite values as missing values
pd.options.mode.use_inf_as_na = True

1、Detecting missing values in the whole DataFrame

df.isnull()
>> A B C D
0 False False False False
1 False True False True
2 False False False False
3 False False False False

2、Detecting missing values in a single column

df['C'].isnull()
>> 0    False
   1    False
   2    False
   3    False
Name: C, dtype: bool

If you want the non-missing values instead, use notna(); it works the same way, with the opposite result.
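For example, on the same column:

df['C'].notna()
>> 0    True
   1    True
   2    True
   3    True
Name: C, dtype: bool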

Three: Missing value statistics

1、Missing counts by column

Usually we compute missing statistics per column of a DataFrame to see how badly each column is affected; if the missing rate is too high, we drop the column or impute it. Simply apply .sum() to the result of isnull() above. axis defaults to 0: axis=0 aggregates down each column, axis=1 across each row.

##  Missing count by column
df.isnull().sum(axis=0)

2、Missing counts by row

Often we also need to check missing values by row. A row might contain almost no values at all; if such a sample enters a model, it causes serious interference. So both the row and the column missing rates are usually inspected.

This is easy: just set axis=1 in sum().

##  Missing count by row
df.isnull().sum(axis=1)

3、Missing rate

Sometimes we want not just the number of missing values but their proportion, the missing rate. Normally you might divide the counts above by the total number of rows, but there is a small trick that does it in one step.

##  Missing rate by column
df.isnull().sum(axis=0)/df.shape[0]
##  Missing rate (one step)
df.isnull().mean()
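On the sample df above, the one-step version gives the per-column missing rate directly:

df.isnull().mean()
>> A    0.00
   B    0.25
   C    0.00
   D    0.25
dtype: float64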

Four: Missing value filtering

Filtering is done together with loc. Filtering rows and columns that contain missing values looks like this:

#  Filter rows with missing values
df.loc[df.isnull().any(axis=1)]
>> A B C D
1 a1 None 2 NaN
-----------------
#  Filter columns with missing values
df.loc[:,df.isnull().any()]
>> B D
0 b1 5.0
1 None NaN
2 b2 9.0
3 b3 10.0

If you want the rows without any missing value, negate the expression with the ~ operator:

df.loc[~(df.isnull().any(axis=1))]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

These examples use any() to match as soon as a row or column has at least one missing value. You can also use all() to test whether a row or column is missing entirely; if a whole column or row is missing, that variable or sample carries no information for the analysis and is a candidate for deletion, as sketched below.
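Here is a short sketch of the all() variant on the same df (no row or column here is entirely missing, so both results are empty):

#  Columns where every value is missing
df.loc[:, df.isnull().all()]
>> Empty DataFrame
   Columns: []
   Index: [0, 1, 2, 3]
-----------------
#  Rows where every value is missing
df.loc[df.isnull().all(axis=1)]
>> Empty DataFrame
   Columns: [A, B, C, D]
   Index: []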

Five: Missing value filling

Generally there are two ways to deal with missing values: delete them, or keep them and fill them in. Let's start with the filling method, fillna.

#  Fill all missing values in the DataFrame with 0
df.fillna(0)
>> A B C D
0 a1 b1 1 5.0
1 a1 0 2 0.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
--------------
#  Fill missing values in column D with the string '-999' (note the resulting dtype is object)
df.D.fillna('-999')
>> 0       5
   1    -999
   2       9
   3      10
Name: D, dtype: object

It is easy to use, but a few parameters deserve attention.

  • inplace: set fillna(0, inplace=True) to make the fill take effect in place; the original DataFrame is modified.

  • method: set method to fill forward or backward; pad/ffill fill forward, bfill/backfill fill backward, e.g. df.fillna(method='ffill'), or simply df.ffill().

df.ffill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b1 2 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

Each original missing value is filled with the previous value in its column (row 1 of column B, row 1 of column D).
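Backward filling works the same way, taking the next valid value in the column instead:

df.bfill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b2 2 9.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0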

Besides constants and neighboring values, you can also fill with the mean of the column. For example, the mean of the non-missing values in column D is 8, so the missing value is filled with 8.

df.D.fillna(df.D.mean())
>> 0     5.0
   1     8.0
   2     9.0
   3    10.0
Name: D, dtype: float64
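To fill every numeric column with its own mean in one call, you can pass the Series of column means to fillna (a sketch; numeric_only=True keeps the string columns out of the calculation):

#  Fill each numeric column with its own mean
df.fillna(df.mean(numeric_only=True))
>> A B C D
0 a1 b1 1 5.0
1 a1 None 2 8.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0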

Six: Missing value deletion

Deleting missing values is not one-size-fits-all. Whether to drop everything or only what exceeds a certain missing rate depends on your tolerance: real data is bound to have gaps, that cannot be avoided, and sometimes the missingness itself carries meaning. Judge by the situation.

1、Drop directly

#  Drop every row that contains a missing value (the default)
df.dropna()
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

2、Drop rows with missing values

#  Drop rows with missing values (axis=0)
df.dropna(axis=0)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

3、Drop columns with missing values

#  Drop columns with missing values
df.dropna(axis=1)
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
#  Drop rows with missing values in the specified columns only; column C has no missing values, so nothing changes
df.dropna(subset=['C'])
>> A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
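dropna also accepts a thresh parameter, which keeps only the rows (or columns) that have at least that many non-missing values:

#  Keep rows with at least 3 non-missing values
df.dropna(thresh=3)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0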

4、Drop by missing rate

This can be done with filtering. For example, to drop the columns whose missing rate is greater than 0.1, keep the ones below 0.1:

df.loc[:,df.isnull().mean(axis=0) < 0.1]
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
#  Drop rows whose missing rate is greater than 0.1
df.loc[df.isnull().mean(axis=1) < 0.1]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

Seven: Missing values in calculations

If missing values are left untreated, what logic do calculations follow?

Let's look at how missing values behave under various operations.

1、Sum

df
>>A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
---------------
#  Sum all columns
df.sum()
>> A    a1a1a2a3
   C          10
   D          24

As you can see, summation skips missing values.
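If you want a missing value to propagate into the result instead, pass skipna=False:

df.D.sum(skipna=False)
>> nan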

2、Cumulative sum

#  Cumulative sum over column D
df.D.cumsum()
>> 0     5.0
   1     NaN
   2    14.0
   3    24.0
Name: D, dtype: float64
---------------
df.D.cumsum(skipna=False)
>> 0    5.0
   1    NaN
   2    NaN
   3    NaN
Name: D, dtype: float64

cumsum skips NA but keeps the NaN in place in the result. With skipna=False, every value from the first missing one onward becomes NaN.

3、 Count

#  Count non-missing values per column
df.count()
>> A    4
   B    3
   C    4
   D    3
dtype: int64

Missing values are excluded from the count.

4、Groupby

df.groupby('B').sum()
>> C D
B  
b1 1 5.0
b2 3 9.0
b3 4 10.0
---------------
df.groupby('B',dropna=False).sum()
>> C D
B  
b1 1 5.0
b2 3 9.0
b3 4 10.0
NaN 2 0.0

Missing values are ignored by default when grouping. To include them as their own group, set dropna=False. The same parameter appears in other methods such as value_counts, which is handy when you want to see how many values are missing.
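For example, on column B of the sample df (the exact ordering of tied counts may differ across pandas versions):

df.B.value_counts(dropna=False)
>> b1     1
   b2     1
   b3     1
   NaN    1
Name: B, dtype: int64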

These are the common operations on missing values, from the three representations of a missing value through detecting, counting, handling, and computing with them.
