
Pandas Missing Data Handling: A Complete Guide


Author | Dongge takes off

Source | Dongge takes off

In my spare time I have been reorganizing my notes on data cleaning and data analysis, together with some tips I use regularly. This installment starts with data cleaning and missing value handling.

All data and code are available in my GitHub repository:

https://github.com/xiaoyusmd/PythonDataScience

One: Missing value types

In pandas, missing data is displayed as NaN. There are three ways to represent a missing value: np.nan, None, and pd.NA.

1、np.nan

Missing values have one notorious quirk: np.nan is not equal to any value, not even to itself. An equality comparison between nan and anything else returns False, and arithmetic with nan returns nan.

np.nan == np.nan
>> False

Because of this behavior, after a dataset is read in, missing values are represented as np.nan by default, no matter what type the column holds.

Because nan is a floating-point value in NumPy, an integer column that contains a missing value is converted to float; a string column cannot be converted to float, so it falls back to object dtype ('O'); float columns keep their original type.

type(np.nan)
>> float
pd.Series([1,2,3]).dtype
>> dtype('int64')
pd.Series([1,np.nan,3]).dtype
>> dtype('float64')

Beginners are often confused by the object dtype during data processing: a column looks like plain strings, yet its dtype changes after import. This is usually caused by missing values.
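A quick way to see this (a minimal sketch, assuming numpy and pandas are imported as np and pd): a string column that contains np.nan is stored as object dtype.

pd.Series(['a', np.nan, 'c']).dtype
>> dtype('O')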

In addition, time series have a dedicated missing value, NaT. It is a built-in pandas type that can be regarded as the datetime version of np.nan, and like np.nan it is not equal to itself.

s_time = pd.Series([pd.Timestamp('20220101')]*3)
s_time
>> 0 2022-01-01
   1 2022-01-01
   2 2022-01-01
   dtype: datetime64[ns]
-----------------
s_time[2] = pd.NaT
s_time
>> 0 2022-01-01
   1 2022-01-01
   2 NaT
   dtype: datetime64[ns]

2、None

There is also None. It behaves a little better than nan, because it is at least equal to itself.

None == None
>> True

When placed into a numeric column, it is automatically converted to np.nan.

type(pd.Series([1,None])[1])
>> numpy.float64

It stays unchanged only in object columns. So unless you assign None by hand, it basically does not appear on its own in pandas; you will hardly ever see it.

type(pd.Series([1,None],dtype='O')[1])
>> NoneType

3、NA Scalar

pandas 1.0 and later versions introduce a scalar that represents missing values, pd.NA. It can stand for a missing integer, a missing boolean, or a missing string, and the feature is still experimental.

The developers noticed this problem too: adopting a different missing value for each data type is messy. pd.NA exists to unify them. Its goal is to provide a single missing value indicator that can be used consistently across data types, instead of switching between np.nan, None, and NaT depending on the situation.

s_new = pd.Series([1, 2], dtype="Int64")
s_new
>> 0   1
   1   2
   dtype: Int64
-----------------
s_new[1] = pd.NA
s_new
>> 0    1
   1  <NA>
   dtype: Int64

Similarly, for the nullable boolean and string dtypes, pd.NA leaves the original data type unchanged, which solves the annoyance of everything turning into object.
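For example (a sketch assuming pandas 1.0+ with the nullable "boolean" and "string" extension dtypes), pd.NA keeps each dtype intact:

pd.Series([True, pd.NA], dtype="boolean")
>> 0    True
   1    <NA>
   dtype: boolean
-----------------
pd.Series(["a", pd.NA], dtype="string")
>> 0       a
   1    <NA>
   dtype: string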

Here are examples of pd.NA in some common arithmetic and comparison operations:

#####  Arithmetic operations
#  Add
pd.NA + 1
>> <NA>
-----------
#  Multiplication
"a" * pd.NA
>> <NA>
-----------
#  The following two expressions both evaluate to 1
pd.NA ** 0
>> 1
-----------
1 ** pd.NA
>> 1
#####  Comparison operations
pd.NA == pd.NA
>> <NA>
-----------
pd.NA < 2.5
>> <NA>
-----------
np.log(pd.NA)
>> <NA>
-----------
np.add(pd.NA, 1)
>> <NA>

Two: Missing value detection

Now that we know the forms a missing value can take, we need to detect it. For a DataFrame, the main methods are isnull() and isna() (they are aliases), both of which return boolean True/False values. They can be applied to the whole DataFrame or to a single column.

df = pd.DataFrame({
      'A':['a1','a1','a2','a3'],
      'B':['b1',None,'b2','b3'],
      'C':[1,2,3,4],
      'D':[5,None,9,10]})
#  Treat infinite values as missing values
pd.options.mode.use_inf_as_na = True

1、Detecting missing values in the whole DataFrame

df.isnull()
>> A B C D
0 False False False False
1 False True False True
2 False False False False
3 False False False False

2、Detecting missing values in a single column

df['C'].isnull()
>> 0    False
   1    False
   2    False
   3    False
Name: C, dtype: bool

If you want the non-missing values instead, use notna(); it works the same way, with the opposite result.
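For example, on the same column:

df['C'].notna()
>> 0    True
   1    True
   2    True
   3    True
Name: C, dtype: bool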

Three: Missing value statistics

1、Missing counts by column

Usually we compute missing statistics per column of a DataFrame to see how badly each column is affected; if the missing rate is too high, we drop the column or impute it. Simply apply .sum() to the result of isnull() above. axis defaults to 0: axis=0 aggregates down each column, axis=1 across each row.

##  Missing count by column
df.isnull().sum(axis=0)

2、Missing counts by row

Often we also need to check missing values by row. A row might contain almost no values at all; if such a sample enters a model, it causes serious interference. So both the row and the column missing rates are usually inspected.

This is easy: just set axis=1 in sum().

##  Missing count by row
df.isnull().sum(axis=1)

3、Missing rate

Sometimes we want not just the number of missing values but their proportion, the missing rate. Normally you might divide the counts above by the total number of rows, but there is a small trick that does it in one step.

##  Missing rate by column
df.isnull().sum(axis=0)/df.shape[0]
##  Missing rate (one step)
df.isnull().mean()
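On the sample df above, the one-step version gives the per-column missing rate directly:

df.isnull().mean()
>> A    0.00
   B    0.25
   C    0.00
   D    0.25
dtype: float64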

Four: Missing value filtering

Filtering is done together with loc. Filtering rows and columns that contain missing values looks like this:

#  Filter rows with missing values
df.loc[df.isnull().any(axis=1)]
>> A B C D
1 a1 None 2 NaN
-----------------
#  Filter columns with missing values
df.loc[:,df.isnull().any()]
>> B D
0 b1 5.0
1 None NaN
2 b2 9.0
3 b3 10.0

If you want the rows without any missing value, negate the expression with the ~ operator:

df.loc[~(df.isnull().any(axis=1))]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

These examples use any() to match as soon as a row or column has at least one missing value. You can also use all() to test whether a row or column is missing entirely; if a whole column or row is missing, that variable or sample carries no information for the analysis and is a candidate for deletion, as sketched below.
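Here is a short sketch of the all() variant on the same df (no row or column here is entirely missing, so both results are empty):

#  Columns where every value is missing
df.loc[:, df.isnull().all()]
>> Empty DataFrame
   Columns: []
   Index: [0, 1, 2, 3]
-----------------
#  Rows where every value is missing
df.loc[df.isnull().all(axis=1)]
>> Empty DataFrame
   Columns: [A, B, C, D]
   Index: []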

Five: Missing value filling

Generally there are two ways to deal with missing values: delete them, or keep them and fill them in. Let's start with the filling method, fillna.

#  Fill all missing values in the DataFrame with 0
df.fillna(0)
>> A B C D
0 a1 b1 1 5.0
1 a1 0 2 0.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
--------------
#  Fill missing values in column D with the string '-999' (note the resulting dtype is object)
df.D.fillna('-999')
>> 0       5
   1    -999
   2       9
   3      10
Name: D, dtype: object

It is easy to use, but a few parameters deserve attention.

  • inplace: set fillna(0, inplace=True) to make the fill take effect in place; the original DataFrame is modified.

  • method: set method to fill forward or backward; pad/ffill fill forward, bfill/backfill fill backward, e.g. df.fillna(method='ffill'), or simply df.ffill().

df.ffill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b1 2 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

Each original missing value is filled with the previous value in its column (row 1 of column B, row 1 of column D).
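Backward filling works the same way, taking the next valid value in the column instead:

df.bfill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b2 2 9.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0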

Besides constants and neighboring values, you can also fill with the mean of the column. For example, the mean of the non-missing values in column D is 8, so the missing value is filled with 8.

df.D.fillna(df.D.mean())
>> 0     5.0
   1     8.0
   2     9.0
   3    10.0
Name: D, dtype: float64
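To fill every numeric column with its own mean in one call, you can pass the Series of column means to fillna (a sketch; numeric_only=True keeps the string columns out of the calculation):

#  Fill each numeric column with its own mean
df.fillna(df.mean(numeric_only=True))
>> A B C D
0 a1 b1 1 5.0
1 a1 None 2 8.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0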

Six: Missing value deletion

Deleting missing values is not one-size-fits-all. Whether to drop everything or only what exceeds a certain missing rate depends on your tolerance: real data is bound to have gaps, that cannot be avoided, and sometimes the missingness itself carries meaning. Judge by the situation.

1、Drop directly

#  Drop every row that contains a missing value (the default)
df.dropna()
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

2、Drop rows with missing values

#  Drop rows with missing values (axis=0)
df.dropna(axis=0)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

3、Drop columns with missing values

#  Drop columns with missing values
df.dropna(axis=1)
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
#  Drop rows with missing values in the specified columns only; column C has no missing values, so nothing changes
df.dropna(subset=['C'])
>> A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
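dropna also accepts a thresh parameter, which keeps only the rows (or columns) that have at least that many non-missing values:

#  Keep rows with at least 3 non-missing values
df.dropna(thresh=3)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0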

4、Drop by missing rate

This can be done with filtering. For example, to drop the columns whose missing rate is greater than 0.1, keep the ones below 0.1:

df.loc[:,df.isnull().mean(axis=0) < 0.1]
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
#  Drop rows whose missing rate is greater than 0.1
df.loc[df.isnull().mean(axis=1) < 0.1]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

Seven: Missing values in calculations

If missing values are left untreated, what logic do calculations follow?

Let's look at how missing values behave under various operations.

1、Sum

df
>>A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
---------------
#  Sum all columns
df.sum()
>> A    a1a1a2a3
   C          10
   D          24

As you can see, summation skips missing values.
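If you want a missing value to propagate into the result instead, pass skipna=False:

df.D.sum(skipna=False)
>> nan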

2、Cumulative sum

#  Cumulative sum over column D
df.D.cumsum()
>> 0     5.0
   1     NaN
   2    14.0
   3    24.0
Name: D, dtype: float64
---------------
df.D.cumsum(skipna=False)
>> 0    5.0
   1    NaN
   2    NaN
   3    NaN
Name: D, dtype: float64

cumsum skips NA but keeps the NaN in place in the result. With skipna=False, every value from the first missing one onward becomes NaN.

3、 Count

#  Count non-missing values per column
df.count()
>> A    4
   B    3
   C    4
   D    3
dtype: int64

Missing values are excluded from the count.

4、Groupby

df.groupby('B').sum()
>> C D
B  
b1 1 5.0
b2 3 9.0
b3 4 10.0
---------------
df.groupby('B',dropna=False).sum()
>> C D
B  
b1 1 5.0
b2 3 9.0
b3 4 10.0
NaN 2 0.0

Missing values are ignored by default when grouping. To include them as their own group, set dropna=False. The same parameter appears in other methods such as value_counts, which is handy when you want to see how many values are missing.
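For example, on column B of the sample df (the exact ordering of tied counts may differ across pandas versions):

df.B.value_counts(dropna=False)
>> b1     1
   b2     1
   b3     1
   NaN    1
Name: B, dtype: int64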

These are the common operations on missing values, from the three representations of a missing value through detecting, counting, handling, and computing with them.
