(https://github.com/MLEveryday/100-Days-Of-ML-Code.git)
Note: most of the Python code in this article comes from the GitHub repository above (a few tests were added while studying); the attached comments are study notes.
Import the required libraries –> Import the dataset –> Handle missing data –> Encode categorical data –> Split the dataset into training and test sets –> Feature scaling
#Day1: Data Preprocessing
#2019.1.26-27,2019.2.9
#coding=utf-8
import warnings
warnings.filterwarnings("ignore")
#Step 1:Importing the libraries
import numpy as np
import pandas as pd
#Step 2:Importing dataset
dataset = pd.read_csv('C:/Users/Ymy/Desktop/100-Days-Of-ML-Code/datasets/Data.csv')
# test
#loc['a'] -- indexes rows by row label; the selection range is a closed interval
X = dataset.loc[0:1]
print(X)
print("-----------")
#iloc[number] -- indexes rows by row number (starting from 0); the selection range is a half-open interval [start, stop)
# If [] contains a single number, the row with that number is selected
X = dataset.iloc[0:1]
print(X)
print("-----------")
#iloc[:,a:b] / loc[:,a:b] index column data; usage parallels the row indexing above
# If a is omitted, it defaults to 0
X = dataset.iloc[:,:2]
print(X)
print("-----------")
#.values returns the underlying values as a NumPy array
X = dataset.iloc[:,:-1].values
print(X)
print("-----------")
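Since Data.csv sits at a local path, the loc/iloc difference above can also be checked with a self-contained toy DataFrame (hypothetical data, not from the repository):

```python
import pandas as pd

# Hypothetical toy frame standing in for Data.csv
df = pd.DataFrame({"Country": ["France", "Spain", "Germany"],
                   "Age": [44, 27, 30]})

# loc slices by label and is inclusive on both ends: rows 0 and 1
print(df.loc[0:1])   # 2 rows

# iloc slices by position and excludes the stop index: row 0 only
print(df.iloc[0:1])  # 1 row
```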
# Now define X and Y properly
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values
print("Step 2: Importing dataset")
print("X")
print(X)
print("Y")
print(Y)
#Step 3:Handling the missing data
#scikit-learn models assume the input data is numerical and that
# every value is meaningful; missing data represented as NaN or
# null values cannot be recognized or used in computation
#The Imputer class can fill in missing values, but only for numerical data
'''
Imputer class parameters:
strategy : string, optional (default="mean") - the imputation strategy.
- If "mean", then replace missing values using the mean along the axis.
- If "median", then replace missing values using the median along the axis.
- If "most_frequent", then replace missing using the most frequent value along the axis.
'''
from sklearn.preprocessing import Imputer
#axis=0 performs the operation on each column;
#axis=1 performs the operation on each row
# Fill the null values ("NaN") in the data
imputer = Imputer(missing_values = "NaN",strategy = "mean",axis = 0)
# Process columns 1 and 2 of X
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
# In the output at this point, the NaN values have been filled in
print("---------------------")
print("Step 3: Handling the missing data")
print("X")
print(X)
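As the note at the end of this article mentions, Imputer was removed in scikit-learn 0.22. A minimal sketch of the same mean imputation with its modern replacement, SimpleImputer, using made-up numbers in place of Data.csv:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns with missing entries, standing in for X[:, 1:3]
X_num = np.array([[44.0, 72000.0],
                  [np.nan, 48000.0],
                  [30.0, np.nan]])

# SimpleImputer always works column-wise; there is no axis parameter
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_num = imputer.fit_transform(X_num)
print(X_num)  # NaNs replaced by the column means (37.0 and 60000.0)
```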
#Step 4:Encoding categorical data
print("---------------------")
print("Step 4: Encoding categorical data")
#LabelEncoder: label encoding starts from 0; n distinct values are encoded as 0 to n-1
'''
OneHotEncoder: one-hot encoding.
One-hot encoding discrete features makes the distances between features more reasonable.
A discrete feature with n possible values is represented with n dimensions:
2 values: [1,0]:0 / [0,1]:1
3 values: [1,0,0]:0 / [0,1,0]:1 / [0,0,1]:2
4 values: [1,0,0,0]:0 / [0,1,0,0]:1 / [0,0,1,0]:2 / [0,0,0,1]:3
and so on... (the digits 0, 1, 2, 3 here denote the categories)
'''
# Import the LabelEncoder and OneHotEncoder classes from sklearn's preprocessing package
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_X = LabelEncoder()
# Label-encode the first column of X; there are 3 distinct values: 0, 1, 2
X[:,0] = labelencoder_X.fit_transform(X[:,0])
print("The first column of X after label encoding")
print(X)
#dummy variables: also called indicator variables / discrete feature encoding;
# used to represent categorical variables and the possible effects of non-quantitative factors
#Creating a dummy variable
'''
categorical_features specifies which features to encode,
given either as indices or as bool values.
E.g. OneHotEncoder(categorical_features = [0,2]) is equivalent to [True, False, True]:
encode columns 0 and 2.
Note: if the original data has three columns and columns 0 and 2 are encoded,
the unencoded column (column 1) of each output row appears at the end.
'''
# Encode the first column
onehotencoder = OneHotEncoder(categorical_features = [0])
# The first column of X contains 3 country values, each represented by a 3-dimensional vector
X = onehotencoder.fit_transform(X).toarray()
# Label-encode Y (No: 0; Yes: 1)
# The following two lines can be shortened to: Y = LabelEncoder().fit_transform(Y)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print("X")
print(X)
print("Y")
print(Y)
print("---------------------")
print("test")
# test
a = OneHotEncoder(categorical_features = [0,2])
b = a.fit([[1,2,3],[2,3,4],[3,4,5]])
c = b.transform([[1,3,5]]).toarray()
#d = a.fit_transform([[1,2,3],[2,3,4],[3,4,5]]).toarray()
print(c)
#print(d)
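The categorical_features keyword was likewise removed in scikit-learn 0.22. A minimal sketch of encoding only the first column with the modern ColumnTransformer (toy data standing in for X; assumes a current sklearn):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy version of X: a country column plus two numeric columns
X_toy = np.array([["France", 44.0, 72000.0],
                  ["Spain", 27.0, 48000.0],
                  ["Germany", 30.0, 54000.0]], dtype=object)

# One-hot encode column 0 only; pass the numeric columns through unchanged
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough")
X_enc = ct.fit_transform(X_toy)
print(X_enc)  # 3 one-hot columns followed by the original numeric columns
```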
#Step 5:Splitting the datasets into training sets and Test sets
print("---------------------")
print('Step 5: Splitting the datasets into training sets and Test sets')
from sklearn.model_selection import train_test_split
#X_train,X_test, y_train, y_test =
#cross_validation.train_test_split(train_data,train_target,test_size=0.4, random_state=0)
'''
Parameter explanation:
train_data: the sample set to be split
train_target: the sample labels to be split
test_size: the proportion of test samples; if an integer, the number of test samples
random_state: the seed of the random number generator
cross_validation was the old module providing this function (now sklearn.model_selection)
'''
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2,random_state = 0)
'''
Explanation:
This method splits X and Y into X(Y)_train and X(Y)_test.
X_test takes 20% of X and X_train takes the remaining 80%.
In this example X_test contains 2 rows of X and X_train contains
the remaining 8 rows; Y is split the same way.
'''
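The 80/20 split behavior can be illustrated with a self-contained toy example (hypothetical arrays, fixed random_state):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(10).reshape(10, 1)  # 10 samples
y_demo = np.arange(10)

# test_size=0.2 puts 2 of the 10 samples in the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(len(X_tr), len(X_te))  # 8 2
```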
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)
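The listing ends at Step 5, but the transcript below also shows Step 6 (Feature Scaling) output. A minimal sketch of what that step presumably looked like, using StandardScaler on toy train/test arrays (the variable names mirror the script; the data here is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test split standing in for X_train / X_test above
X_train = np.array([[40.0, 63777.78],
                    [37.0, 67000.0],
                    [27.0, 48000.0]])
X_test = np.array([[30.0, 54000.0]])

# Fit the scaler on the training set only, then apply the same
# transformation to the test set (avoids test-set leakage)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train.mean(axis=0))  # columns now have (near-)zero mean
```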
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>>
RESTART: C:\Users\Ymy\Desktop\100-Days-Of-ML-Code\Code\Day 1_Data_Preprocessing.py
Step 2: Importing dataset
X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
Warning (from warnings module):
File "D:\python\lib\site-packages\sklearn\utils\deprecation.py", line 58
warnings.warn(msg, category=DeprecationWarning)
DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
---------------------
Step 3: Handling the missing data
step2
X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Warning (from warnings module):
File "D:\python\lib\site-packages\sklearn\preprocessing\_encoders.py", line 368
warnings.warn(msg, FutureWarning)
FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
Warning (from warnings module):
File "D:\python\lib\site-packages\sklearn\preprocessing\_encoders.py", line 390
"use the ColumnTransformer instead.", DeprecationWarning)
DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
---------------------
Step 4: Encoding categorical data
X
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
Y
[0 1 0 0 1 1 0 1 0 1]
---------------------
Step 5: Splitting the datasets into training sets and Test sets
X_train
[[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]]
X_test
[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
[0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]]
Y_train
[1 1 1 0 1 0 0 1]
Y_test
[0 0]
---------------------
Step 6: Feature Scaling
X_train
[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]
[ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]
[-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]
[-1. -0.37796447 1.29099445 0.05261351 -1.11141978]
[ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]
[-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]
[ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]
[ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
X_test
[[-1. 2.64575131 -0.77459667 -1.45882927 -0.90166297]
[-1. 2.64575131 -0.77459667 1.98496442 2.13981082]]
>>>
Note:
Because of library version changes, the one-hot encoding API used in the Step 4 (Encoding categorical data) section of the original code has changed, and as of the February 9 test it no longer runs; the output shown in this article is the result from before the library change. In addition, most of the warnings in the running results are deprecation warnings produced by functions whose declarations or usage changed across versions.