
Data cleaning: handling missing values and outliers by filling missing values with a regression equation (with Python code)

Editor: Python

How to fill missing values with a regression equation (Python code attached)

1. Background:

During data cleaning we often run into outliers and missing values, and outliers are sometimes treated as missing values. Common ways to handle missing values include deletion, filling with a summary statistic (mean, median, and so on), and filling with values predicted by a regression equation.
Deleting records directly is simple, but when few records are available it shrinks the sample further and may change the original distribution of the response variable, which makes the analysis less accurate. Treating outliers as missing values has the advantage that the information in the other variables can be used to model and fill in the outliers (missing values). This article shows how to estimate and fill outliers and missing values with a regression equation.

2. Application scenarios:

The regression filling method selects several independent variables that can predict the variable with missing values, fits a regression equation, and uses it to estimate the missing values. It makes full use of the information in the original data set, but it has two shortcomings: 1. Although the estimate is unbiased, it ignores the random error, so it underestimates the standard deviation and other measures of uncertainty. 2. It assumes that the variable with missing values is linearly related to the other variables, which is not always true in practice. Statistical tools can help check this, but the judgment often relies on the modeler's practical experience and business knowledge.
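The first shortcoming can be seen in a small simulation. The sketch below uses synthetic data (an assumption made purely for illustration, not the article's data set) and plain NumPy least squares: half of the values are replaced by their fitted values, and the standard deviation of the filled column comes out smaller than that of the true column.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic data (illustrative assumption): y depends linearly on x plus noise.
n = 2000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=3.0, size=n)

# Pretend the second half of y is missing.
missing = np.zeros(n, dtype=bool)
missing[n // 2:] = True

# Fit y = b0 + b1 * x on the complete rows with ordinary least squares.
X_train = np.column_stack([np.ones((~missing).sum()), x[~missing]])
coef, *_ = np.linalg.lstsq(X_train, y[~missing], rcond=None)

# Fill the missing rows with the fitted values (no random error is added).
y_filled = y.copy()
y_filled[missing] = coef[0] + coef[1] * x[missing]

true_std = y.std(ddof=1)
filled_std = y_filled.std(ddof=1)
print(f"std of the true values:    {true_std:.3f}")
print(f"std after regression fill: {filled_std:.3f}")
```

The filled values all lie exactly on the regression line, so the noise around the line is lost for those rows. That is why the spread shrinks.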

3. Methods and steps:

a. Determine the variable (feature column) whose missing values are to be filled

b. Split the original data set:

Based on the variable to be filled, split the original data set into 2 subsets: (1) rows where the variable has no missing value: dataset_train; (2) rows where the variable is missing: dataset_pred.

c. Analyze and test the correlation of the relevant variables:

Use experience to decide which attribute columns are related to the variable to be filled, then use statistical tools on dataset_train to verify the correlation between the selected columns and that variable.

d. Model and predict:

Fit a linear regression model on dataset_train, then apply the fitted model to dataset_pred to estimate the missing values of the variable.

e. Merge and restore the data set:

Merge the two subsets back into one data set, ready for subsequent modeling.
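Steps a, b, d, and e can be sketched as one small function. This is a minimal sketch and not the article's original code: the function name `fill_with_regression`, the toy data, and the use of NumPy least squares in place of a modeling library are all assumptions made for illustration. Step c is left out because it is a judgment call made before fitting. The sketch also assumes that the predictor columns themselves have no missing values.

```python
import numpy as np
import pandas as pd


def fill_with_regression(df, target, predictors):
    """Fill missing values of `target` using a linear regression on `predictors`."""
    # b. Split: rows where the target is present vs. rows where it is missing.
    is_missing = df[target].isna()
    dataset_train = df[~is_missing]
    dataset_pred = df[is_missing]

    # d. Fit ordinary least squares (intercept plus one coefficient per predictor).
    X_train = np.column_stack(
        [np.ones(len(dataset_train)), dataset_train[predictors].to_numpy(dtype=float)]
    )
    y_train = dataset_train[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    # Predict the target for the rows where it was missing.
    X_pred = np.column_stack(
        [np.ones(len(dataset_pred)), dataset_pred[predictors].to_numpy(dtype=float)]
    )
    dataset_pred = dataset_pred.assign(**{target: X_pred @ coef})

    # e. Merge the two subsets and restore the original row order.
    filled = pd.concat([dataset_train, dataset_pred]).sort_index()
    return filled, coef


# Toy data (made up for illustration): y = 1 + 2*x1 + 3*x2 exactly, two values missing.
toy = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, 0.0, 2.0, 1.0, 3.0, 2.0],
})
toy["y"] = 1 + 2 * toy["x1"] + 3 * toy["x2"]
toy.loc[[1, 4], "y"] = np.nan  # a. "y" is the variable to fill

filled, coef = fill_with_regression(toy, "y", ["x1", "x2"])
print(filled)
print("intercept and coefficients:", coef.round(3))
```

Because the toy relationship has no noise, the two filled values match the values that were removed. With real data the filled values are only estimates.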

4. Sample code:

Data set description:
The data set is an excerpt of the original data from a strength-calculation example (the response variable is "strength").
In this example "force" is an important feature, but it has missing values. We fill them with values predicted by a regression equation, as data preparation for building a model that predicts "strength".

Load the data and determine the feature whose missing values need to be filled:
Split the data set:
The "force" feature is missing for "item" 3, 12, 16, and 26. Because the data set is small, we check whether the variable "force" follows a normal distribution.
The test gives a p-value of 0.612, which is greater than 0.05, so normality is not rejected and the feature data can be treated as normally distributed.
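The text does not say which normality test produced this p-value. A common choice for small samples is the Shapiro-Wilk test, available as `scipy.stats.shapiro`. The sketch below assumes that test and runs it on a hypothetical stand-in for the "force" column, since the real data are not reproduced here. Its p-value will therefore differ from the 0.612 reported above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical stand-in for the "force" column (values are made up).
force = pd.Series(rng.normal(loc=40.0, scale=5.0, size=30))
# Positions 2, 11, 15, 25 play the role of items 3, 12, 16, 26.
force.iloc[[2, 11, 15, 25]] = np.nan

# Shapiro-Wilk needs complete data, so drop the missing entries first.
stat, p_value = stats.shapiro(force.dropna())
print(f"W = {stat:.3f}, p = {p_value:.3f}")

# p > 0.05 means no evidence against normality at the 5% level;
# it does not prove that the data are normal.
if p_value > 0.05:
    print("Normality is not rejected.")
else:
    print("Normality is rejected.")
```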
Analyze and test the correlation of the relevant variables:
From practical experience, welding pull force is affected by temperature (temp), time (duration), and solder paste quantity (paste_qty), among other factors, so we build the regression equation from these 3 factors. Before that, we check the correlation between each factor and the variable to be predicted (force).
The results show that each selected factor has some correlation with the "force" attribute.
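One way to run this check is with pandas' Pearson correlation on the rows where "force" is present. The sketch below is an illustration only: the column names follow the text, but the values are randomly generated stand-ins, so the printed correlations say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 30

# Hypothetical stand-in data; only the column names come from the text.
df = pd.DataFrame({
    "temp": rng.uniform(200, 260, size=n),
    "duration": rng.uniform(5, 15, size=n),
    "paste_qty": rng.uniform(1, 3, size=n),
})
df["force"] = (
    10 + 0.2 * df["temp"] - 0.3 * df["duration"] + 4 * df["paste_qty"]
    + rng.normal(scale=1.0, size=n)
)
df.loc[[2, 11, 15, 25], "force"] = np.nan

# Pearson correlation of each candidate predictor with "force",
# computed on the complete rows only (the dataset_train subset).
dataset_train = df.dropna(subset=["force"])
corr_with_force = dataset_train.corr()["force"].drop("force")
print(corr_with_force.round(3))
```

Pearson correlation only measures linear association, which matches the linearity assumption of the regression filling method.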
Model and predict:
The fitted regression equation is:
force = 12.246 + 0.238 * temp - 0.262 * duration + 4.419 * paste_qty
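To show how the equation turns predictor values into an estimate, the sketch below evaluates it for one set of inputs. The time term is taken to be the duration variable, and the input values are hypothetical, chosen only to keep the arithmetic easy to follow. They are not rows from the data set.

```python
def predict_force(temp, duration, paste_qty):
    """Evaluate the fitted regression equation from the text."""
    return 12.246 + 0.238 * temp - 0.262 * duration + 4.419 * paste_qty


# Hypothetical input values, chosen only to show the arithmetic.
estimate = predict_force(temp=100.0, duration=10.0, paste_qty=2.0)
print(round(estimate, 3))  # 12.246 + 23.8 - 2.62 + 8.838 = 42.264
```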
The model gives estimates for the 4 missing values in "force".
Merge and restore the data set:
Checking the "force" feature for "item" 3, 12, 16, and 26 shows that the original missing values have been filled with the values predicted by the regression equation.