
Nine commonly used preprocessing methods in Python


This article summarizes nine commonly used data preprocessing methods in Python, introduced through sklearn's preprocessing module.

1. Standardization (Mean Removal and Variance Scaling)

After the transformation, each feature has zero mean and unit variance. This is also called z-score normalization (zero-mean normalization). It is computed by subtracting the mean from each feature value and dividing by the standard deviation.

sklearn.preprocessing.scale(X)

Usually the train and test sets are standardized together, or the standardization is fitted once on the train set and the same standardizer is then applied to the test set. This can be done with a scaler:

scaler = sklearn.preprocessing.StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)

In practical applications, a common scenario that requires feature standardization is SVM.
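As a minimal, self-contained sketch (the toy train/test arrays are hypothetical), the scaler pattern above can be written as:

```python
# Standardize a toy training set, then reuse the same scaler on test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, -1.0,  2.0],
                    [2.0,  0.0,  0.0],
                    [0.0,  1.0, -1.0]])
X_test = np.array([[-1.0, 1.0, 0.0]])

scaler = StandardScaler().fit(X_train)      # learn per-feature mean and std from train
X_train_scaled = scaler.transform(X_train)  # zero mean, unit variance per feature
X_test_scaled = scaler.transform(X_test)    # apply the *train* statistics to test

print(X_train_scaled.mean(axis=0))  # ≈ [0. 0. 0.]
print(X_train_scaled.std(axis=0))   # ≈ [1. 1. 1.]
```

Fitting on the train set only and reusing the scaler avoids leaking test-set statistics into the model.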

2. Min-max scaling

Min-max scaling applies a linear transformation to the original data, mapping it into the [0, 1] interval (or some other fixed minimum-maximum range).

min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)
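A short sketch with hypothetical data shows the effect of the default feature_range=(0, 1):

```python
# Scale each feature of a toy matrix into [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, -1.0,  2.0],
                    [2.0,  0.0,  0.0],
                    [0.0,  1.0, -1.0]])

min_max_scaler = MinMaxScaler()                  # default feature_range=(0, 1)
X_scaled = min_max_scaler.fit_transform(X_train)

print(X_scaled.min(axis=0))  # [0. 0. 0.]
print(X_scaled.max(axis=0))  # [1. 1. 1.]
```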

3. Normalization

Normalization maps values from different ranges of variation onto the same fixed range, commonly [0, 1].

Here, each sample is transformed to have unit norm.

X = [[ 1, -1, 2],[ 2, 0, 0], [ 0, 1, -1]]
sklearn.preprocessing.normalize(X, norm='l2')

which gives:

array([[ 0.408, -0.408,  0.816],
       [ 1.   ,  0.   ,  0.   ],
       [ 0.   ,  0.707, -0.707]])

Notice that for each sample, 0.408² + (-0.408)² + 0.816² ≈ 1: this is the L2 norm, meaning that after the transformation the sum of squares of each sample's features is 1. Similarly, with the L1 norm, the sum of the absolute values of each sample's features is 1 after the transformation. There is also the max norm, which divides each feature of a sample by that sample's maximum absolute feature value.

When measuring similarity between samples, normalization is needed if a quadratic-form kernel is used.
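The norms described above can be checked directly; a small sketch using the same X as in the example:

```python
# Verify that l2 and l1 normalization produce unit-norm rows.
import numpy as np
from sklearn.preprocessing import normalize

X = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
X_l2 = normalize(X, norm='l2')  # each row gets unit Euclidean length
X_l1 = normalize(X, norm='l1')  # each row's absolute values sum to 1

print((X_l2 ** 2).sum(axis=1))   # ≈ [1. 1. 1.]
print(np.abs(X_l1).sum(axis=1))  # ≈ [1. 1. 1.]
```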

4. Feature binarization (Binarization)

Given a threshold, convert the features to 0/1.

binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
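With the threshold of 1.1 from the snippet above and a hypothetical X, only values strictly greater than the threshold become 1:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, -1.0,  2.0],
              [2.0,  0.0,  0.0],
              [0.0,  1.0, -1.0]])

# fit is a no-op here; Binarizer is stateless
binarizer = Binarizer(threshold=1.1).fit(X)
X_bin = binarizer.transform(X)  # x -> 1 if x > 1.1 else 0
print(X_bin)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```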

5. Label binarization (LabelBinarizer)

lb = sklearn.preprocessing.LabelBinarizer()
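A brief sketch of what LabelBinarizer does with toy labels: it produces one 0/1 column per class, one-vs-all style.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])       # classes_ becomes [1, 2, 4, 6]
codes = lb.transform([1, 6])  # one 0/1 column per class
print(codes)
# [[1 0 0 0]
#  [0 0 0 1]]
```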

6. Category feature coding

Sometimes features are categorical, but some algorithms require numeric input; such features then need to be encoded.

enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray() #array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])

In the example above, the first feature takes two values (0 and 1) and is encoded with two bits; the second feature uses three bits, and the third uses four.

Another encoding method:

newdf=pd.get_dummies(df,columns=["gender","title"],dummy_na=True)
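A sketch with a hypothetical DataFrame (the gender/title column names mirror the line above); dummy_na=True adds an extra indicator column for missing values in each encoded column:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", None],
                   "title":  ["Mr", "Ms", "Mr"]})

# 3 indicator columns per original column: one per category plus one for NaN
newdf = pd.get_dummies(df, columns=["gender", "title"], dummy_na=True)
print(newdf.columns.tolist())
print(newdf.shape)  # (3, 6)
```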

7. Label encoding

le = sklearn.preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6]) #array([0, 0, 1, 2]) 
# non-numeric labels can also be converted to numeric codes
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"]) #array([2, 2, 1])
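LabelEncoder also supports mapping codes back to the original labels; a small sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])  # classes_ sorted: amsterdam, paris, tokyo
codes = le.transform(["tokyo", "tokyo", "paris"])
print(codes)                        # [2 2 1]
print(le.inverse_transform(codes))  # ['tokyo' 'tokyo' 'paris']
```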

8. When features contain outliers

sklearn.preprocessing.robust_scale
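robust_scale centers with the median and scales by the interquartile range, so a single outlier barely distorts the other values; a sketch on hypothetical data with one outlier:

```python
import numpy as np
from sklearn.preprocessing import robust_scale

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# median = 3, IQR = 4 - 2 = 2, so each value maps to (x - 3) / 2
X_robust = robust_scale(X)
print(X_robust.ravel())  # [-1., -0.5, 0., 0.5, 48.5]
```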

9. Generating polynomial features

This really belongs to feature engineering: polynomial / interaction (cross) features.


poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
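For degree 2 and two input features a and b, PolynomialFeatures generates [1, a, b, a², ab, b²]; a quick sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])        # a = 2, b = 3
poly = PolynomialFeatures(2)  # degree-2 terms: 1, a, b, a^2, a*b, b^2
X_poly = poly.fit_transform(X)
print(X_poly)                 # [[1. 2. 3. 4. 6. 9.]]
```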
