author | Dongge takes off
source | Dongge takes off
In my spare time I have been organizing some data cleaning and data analysis skills, along with tips I use regularly. This time we start with data cleaning and missing value handling.
All the data and code are available in my GitHub repo:
https://github.com/xiaoyusmd/PythonDataScience
In pandas, missing data is displayed as NaN, and there are three ways to represent a missing value: np.nan, None, and pd.NA.
Missing values have one quirky property (a classic pitfall): NaN is not equal to any value, not even itself. Comparing nan with anything, including another nan, returns False.
np.nan == np.nan
>> False
Because of this property, once a dataset is read in, the default missing value is np.nan regardless of the column's dtype.
Since nan is a floating-point type in NumPy, an integer column that contains it is converted to float. A string column cannot be converted to float, so it is merged into object dtype ('O'); float columns stay as they are.
type(np.nan)
>> float
pd.Series([1,2,3]).dtype
>> dtype('int64')
pd.Series([1,np.nan,3]).dtype
>> dtype('float64')
Beginners are often confused by the object dtype when processing data and don't know what it is. It is essentially a character (string) type; when a column's dtype changes like this after import, missing values are usually the cause.
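A quick check makes this concrete: a string column that contains np.nan can only be stored as object dtype.
pd.Series(['a', np.nan]).dtype
>> dtype('O')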
Besides these, there is a dedicated missing value for time series. It stands on its own and is written NaT, a built-in pandas type that can be regarded as the time-series version of np.nan. It, too, is not equal to itself.
s_time = pd.Series([pd.Timestamp('20220101')]*3)
s_time
>> 0 2022-01-01
1 2022-01-01
2 2022-01-01
dtype: datetime64[ns]
-----------------
s_time[2] = pd.NaT
s_time
>> 0 2022-01-01
1 2022-01-01
2 NaT
dtype: datetime64[ns]
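As with np.nan, NaT fails the equality test against itself, which you can verify directly:
pd.NaT == pd.NaT
>> False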
Then there is None. It behaves a little better than nan, because at least it is equal to itself:
None == None
>> True
But once it is put into a numeric column, it is automatically converted to np.nan.
type(pd.Series([1,None])[1])
>> numpy.float64
Only in an object-typed column does it stay unchanged. So unless someone deliberately assigns None, it basically never appears in pandas on its own, which is why you hardly ever see it.
type(pd.Series([1,None],dtype='O')[1])
>> NoneType
pandas 1.0 introduced a new scalar to represent missing values, pd.NA, which can stand in for a missing integer, a missing boolean, or a missing string. The feature is still experimental.
The developers noticed the problem too: using a different missing value for each data type gets messy. pd.NA exists to unify them. Its goal is a single missing value indicator that can be used consistently across dtypes (instead of np.nan, None, or NaT depending on the circumstances).
s_new = pd.Series([1, 2], dtype="Int64")
s_new
>> 0 1
1 2
dtype: Int64
-----------------
s_new[1] = pd.NA
s_new
>> 0 1
1 <NA>
dtype: Int64
Similarly, for boolean and string data pd.NA leaves the original dtype untouched, which solves the annoyance of everything collapsing into object dtype.
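As a quick sketch, the nullable boolean and string extension dtypes (dtype="boolean" and dtype="string", introduced alongside pd.NA in pandas 1.0) hold pd.NA without changing dtype:
pd.Series([True, pd.NA], dtype="boolean")
>> 0 True
1 <NA>
dtype: boolean
-----------------
pd.Series(["a", pd.NA], dtype="string")
>> 0 a
1 <NA>
dtype: string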
Here are some common arithmetic and comparison operations with pd.NA:
##### Arithmetic operations
# Add
pd.NA + 1
>> <NA>
-----------
# Multiplication
"a" * pd.NA
>> <NA>
-----------
# The following two results are 1
pd.NA ** 0
>> 1
-----------
1 ** pd.NA
>> 1
##### Comparison operations
pd.NA == pd.NA
>> <NA>
-----------
pd.NA < 2.5
>> <NA>
-----------
np.log(pd.NA)
>> <NA>
-----------
np.add(pd.NA, 1)
>> <NA>
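Note that because pd.NA == pd.NA returns <NA> rather than True, an equality test cannot detect it; use pd.isna() instead:
pd.isna(pd.NA)
>> True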
Now that we know the forms a missing value can take, we need to know how to detect them. For a DataFrame, the main methods are isnull() and isna(); both return True/False boolean values directly, and both can be applied to the whole DataFrame or to a single column.
df = pd.DataFrame({
'A':['a1','a1','a2','a3'],
'B':['b1',None,'b2','b3'],
'C':[1,2,3,4],
'D':[5,None,9,10]})
# Also treat infinite values as missing
pd.options.mode.use_inf_as_na = True
df.isnull()
>> A B C D
0 False False False False
1 False True False True
2 False False False False
3 False False False False
df['C'].isnull()
>> 0 False
1 False
2 False
3 False
Name: C, dtype: bool
If you want the non-missing values instead, use notna(); it works the same way and returns the opposite result.
Usually we run missing-value statistics on the columns of a DataFrame to see how badly each column is missing, and then delete or interpolate the ones whose missing rate is too high. To do this, simply chain .sum() onto the isnull() result above. The axis parameter defaults to 0: 0 means by column, 1 means by row.
## Column missing statistics
df.isnull().sum(axis=0)
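On the example df above this should give:
df.isnull().sum(axis=0)
>> A 0
B 1
C 0
D 1
dtype: int64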
But very often we also need to check missing values by row. A row may have hardly any values at all, and if such a sample enters the model it causes serious interference. So it is common practice to check and count missing rates in both directions, rows and columns.
The operation is simple: just set axis=1 in sum().
## Row missing statistics
df.isnull().sum(axis=1)
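On the example df, row 1 is the only one with missing values, two of them:
df.isnull().sum(axis=1)
>> 0 0
1 2
2 0
3 0
dtype: int64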
Sometimes you want to know not just the number of missing values but their proportion, i.e. the missing rate. You might think of dividing the counts above by the total number of rows, but there is a small trick that does it in one step.
## Missing rate
df.isnull().sum(axis=0)/df.shape[0]
## Missing rate (one step)
df.isnull().mean()
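On the example df this should give:
df.isnull().mean()
>> A 0.00
B 0.25
C 0.00
D 0.25
dtype: float64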
Filtering is done together with loc. Filtering rows and columns by missing values looks like this:
# Filter rows that contain missing values
df.loc[df.isnull().any(axis=1)]
>> A B C D
1 a1 None 2 NaN
-----------------
# Filter columns with missing values
df.loc[:,df.isnull().any()]
>> B D
0 b1 5.0
1 None NaN
2 b2 9.0
3 b3 10.0
To query the rows and columns without any missing values, negate the expression with the ~ operator:
df.loc[~(df.isnull().any(axis=1))]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
The filters above use any() to match as long as there is at least one missing value. You can also use all() to check whether everything is missing, by row as well as by column; see the check after this paragraph. If an entire column or row consists of missing values, that variable or sample has lost its value for analysis, and you should consider deleting it.
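For example, checking whether any column of df is entirely missing (none is here):
df.isnull().all(axis=0)
>> A False
B False
C False
D False
dtype: bool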
There are generally two ways to handle missing values: delete them, or keep them and fill them in. Let's start with the filling method, fillna.
# Fill all missing values in the DataFrame with 0
df.fillna(0)
>> A B C D
0 a1 b1 1 5.0
1 a1 0 2 0.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
--------------
# Fill the missing values in column D with the string '-999' (note the resulting object dtype)
df.D.fillna('-999')
>> 0 5
1 -999
2 9
3 10
Name: D, dtype: object
It is easy to use, but a few parameters deserve attention.
inplace: set fillna(0, inplace=True) to make the fill take effect in place; the original DataFrame is modified.
method: set method to fill forward or backward. pad/ffill fills forward, bfill/backfill fills backward, e.g. df.fillna(method='ffill'), or simply df.ffill().
df.ffill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b1 2 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
The missing values are filled from the previous value (column B row 1 and column D row 1).
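The backward version works symmetrically; on the example df, the row-1 gaps are filled from row 2:
df.bfill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b2 2 9.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0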
Besides forward/backward filling, you can fill with the mean of the whole column: for example, fill the missing value in column D with the mean of its other non-missing values, which is 8.
df.D.fillna(df.D.mean())
>> 0 5.0
1 8.0
2 9.0
3 10.0
Name: D, dtype: float64
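You can also pass a dict to fillna to fill several columns at once, each with its own value. A small sketch on the example df (the fill value 'missing' for column B is just an arbitrary placeholder):
df.fillna({'B': 'missing', 'D': df.D.mean()})
>> A B C D
0 a1 b1 1 5.0
1 a1 missing 2 8.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0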
Deleting missing values is less clear-cut. Whether you delete everything with a missing value or only what has a high missing rate depends on your tolerance: real data is bound to have missing values, that cannot be avoided, and in some cases a missing value itself carries meaning. It depends on the situation.
# Delete everything with missing values directly
df.dropna()
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
# Delete rows with missing values
df.dropna(axis=0)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
# Delete columns with missing values
df.dropna(axis=1)
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
# Drop rows with missing values in the specified columns; column C has no missing values, so nothing changes
df.dropna(subset=['C'])
>> A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
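dropna also takes a thresh parameter for exactly this tolerance question: keep only the rows with at least that many non-missing values. On the example df, row 1 has only two non-missing values, so thresh=3 drops it:
df.dropna(thresh=3)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0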
The same can be achieved with filtering, for example deleting columns whose missing rate is greater than 0.1 (i.e. keeping those below 0.1):
df.loc[:,df.isnull().mean(axis=0) < 0.1]
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
# Delete rows whose missing rate is greater than 0.1
df.loc[df.isnull().mean(axis=1) < 0.1]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
If missing values are left unhandled, what logic do they follow in calculations? Let's look at how missing values participate in various operations.
1. Summation
df
>>A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
---------------
# Sum all columns
df.sum()
>> A a1a1a2a3
C 10
D 24
As you can see, summation ignores missing values.
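Summation only ignores them by default, though: like cumsum below, sum also accepts skipna=False, in which case any column containing a missing value sums to missing (selecting just the numeric columns here):
df[['C','D']].sum(skipna=False)
>> C 10.0
D NaN
dtype: float64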
2. Cumulative sum
# Cumulative sum over column D
df.D.cumsum()
>> 0 5.0
1 NaN
2 14.0
3 24.0
Name: D, dtype: float64
---------------
df.D.cumsum(skipna=False)
>> 0 5.0
1 NaN
2 NaN
3 NaN
Name: D, dtype: float64
cumsum ignores NA but keeps the missing positions in the result column. With skipna=False it no longer skips: once a missing value is encountered, every result from that point on is missing.
3. Count
# Count non-missing values per column
df.count()
>> A 4
B 3
C 4
D 3
dtype: int64
Missing values are excluded from the count.
4. Groupby aggregation
df.groupby('B').sum()
>> C D
B
b1 1 5.0
b2 3 9.0
b3 4 10.0
---------------
df.groupby('B',dropna=False).sum()
>> C D
B
b1 1 5.0
b2 3 9.0
b3 4 10.0
NaN 2 0.0
Missing values are ignored by default during aggregation. To include them as a group, set dropna=False. The same usage appears elsewhere, e.g. in value_counts; sometimes we do need to look at the count of missing values.
Those are the common operations around missing values: from the three ways a missing value can be represented, through detecting and counting them, to handling them and understanding how they behave in calculations.