This article summarizes common data preprocessing methods in Python, introduced below through sklearn's preprocessing module.
1. Standardization (Mean Removal and Variance Scaling)
After the transformation, each feature has zero mean and unit variance. This is also called z-score normalization (zero-mean normalization). It is computed by subtracting the mean from each feature value and dividing by the standard deviation.
sklearn.preprocessing.scale(X)
Usually the train and test sets are standardized together, or the standardization is fit once on the train set and the same scaler is then used to standardize the test set. You can use a scaler:
scaler = sklearn.preprocessing.StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)
In practice, a common scenario that requires feature standardization is SVM.
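A minimal sketch of the fit-on-train / transform-both pattern described above (the small arrays are made-up example data):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy training data (assumed)
test = np.array([[2.5, 15.0]])                             # toy test data (assumed)

scaler = StandardScaler().fit(train)    # learn mean and std from the training set only
train_scaled = scaler.transform(train)  # each column now has mean 0 and unit variance
test_scaled = scaler.transform(test)    # reuse the same mean/std on the test set
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # approximately [0, 0] and [1, 1]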
2. Min-max normalization
Min-max normalization applies a linear transformation to the original data, mapping it into the [0,1] interval (or some other fixed min-max range).
min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)
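A short sketch with made-up data: fit_transform learns the per-column min and max and maps each column into [0,1]; a different range can be passed via feature_range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]])  # toy data (assumed)
min_max_scaler = MinMaxScaler()                  # default feature_range=(0, 1)
X_scaled = min_max_scaler.fit_transform(X_train)
print(X_scaled)  # each column's min maps to 0 and its max maps to 1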
3. Normalization
Normalization maps values from different ranges of variation into the same fixed range, commonly [0,1]; this is also referred to as normalization.
Here, each sample is transformed to have unit norm.
X = [[ 1, -1, 2],[ 2, 0, 0], [ 0, 1, -1]]
sklearn.preprocessing.normalize(X, norm='l2')
which gives:
array([[ 0.408, -0.408, 0.816], [ 1. , 0. , 0. ], [ 0. , 0.707, -0.707]])
Notice that for each sample, e.g. 0.408^2 + (-0.408)^2 + 0.816^2 ≈ 1. This is the L2 norm: after the transformation, the sum of squares of each sample's features is 1. Similarly, with the L1 norm the sum of the absolute values of each sample's features is 1 after transformation. There is also the max norm, which divides each feature of a sample by the maximum absolute value among that sample's features.
When measuring similarity between samples, if you use a quadratic-form kernel, normalization is needed.
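To illustrate the three norms mentioned above, a small sketch applying each of them to the same example X:

from sklearn.preprocessing import normalize

X = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
print(normalize(X, norm='l2'))   # sum of squares of each row is 1
print(normalize(X, norm='l1'))   # sum of absolute values of each row is 1
print(normalize(X, norm='max'))  # each row divided by its maximum absolute value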
4. Feature binarization (Binarization)
Given a threshold, convert features to 0/1.
binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
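A worked sketch reusing the toy X from the normalization example (assumed here), where values above the threshold become 1 and the rest become 0:

from sklearn.preprocessing import Binarizer

X = [[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]]  # toy data (assumed)
binarizer = Binarizer(threshold=1.1)
print(binarizer.transform(X))  # [[0. 0. 1.], [1. 0. 0.], [0. 0. 0.]]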
5. Label binarization
lb = sklearn.preprocessing.LabelBinarizer()
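A minimal sketch (the label list is a made-up example) showing how multi-class labels become a one-vs-all indicator matrix:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
print(lb.fit_transform([1, 2, 6, 4, 2]))  # each row has a single 1 in the column of its class
print(lb.classes_)                        # [1 2 4 6]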
6. Categorical feature encoding
Sometimes features are categorical, but some algorithms require numeric input, so the categories need to be encoded.
enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray() #array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
In the example above, the first feature has two values (0 and 1) and is encoded with two bits; the second feature has three values and uses three bits; the third has four values and uses four bits.
Another way to encode categorical features:
newdf=pd.get_dummies(df,columns=["gender","title"],dummy_na=True)
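A small sketch with a made-up DataFrame (only the column names gender and title come from the line above); dummy_na=True adds an extra indicator column for missing values:

import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", None], "title": ["Dr", "Ms", "Dr"]})  # toy data (assumed)
newdf = pd.get_dummies(df, columns=["gender", "title"], dummy_na=True)
print(newdf.columns.tolist())
# ['gender_F', 'gender_M', 'gender_nan', 'title_Dr', 'title_Ms', 'title_nan']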
7. Label encoding
le = sklearn.preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6]) #array([0, 0, 1, 2])
# Conversion from non-numerical to numerical
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"]) #array([2, 2, 1])
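The mapping can also be reversed with inverse_transform; a short sketch continuing the string example above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(le.classes_)                      # ['amsterdam' 'paris' 'tokyo']
print(le.inverse_transform([2, 2, 1]))  # ['tokyo' 'tokyo' 'paris']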
8. When features contain outliers
sklearn.preprocessing.robust_scale
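robust_scale (and the corresponding RobustScaler class) centers with the median and scales with the interquartile range, so a few extreme values barely affect the result; a minimal sketch with made-up data containing one outlier:

import numpy as np
from sklearn.preprocessing import robust_scale

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # toy data with one outlier (assumed)
print(robust_scale(X))  # centered on the median and scaled by the IQR, so 100 does not distort the rest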
9. Generating polynomial features
This actually belongs to feature engineering: polynomial features / interaction (cross) features.
poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
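For a sample with two features [a, b], degree-2 polynomial features are [1, a, b, a^2, a*b, b^2]; a short sketch with made-up data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # toy data: [[0, 1], [2, 3], [4, 5]] (assumed)
poly = PolynomialFeatures(2)
print(poly.fit_transform(X))    # each row becomes [1, a, b, a^2, a*b, b^2]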