This article summarizes common data preprocessing methods in Python, introduced below through sklearn's preprocessing module.
1. Standardization (Mean Removal and Variance Scaling)
After the transformation, each feature has zero mean and unit variance. This is also called z-score normalization (zero-mean normalization). It is computed by subtracting the mean from each feature value and dividing by the standard deviation.
sklearn.preprocessing.scale(X)
Usually the train and test sets are standardized together, or the standardization is fit once on the train set and the same scaler is then used to standardize the test set. You can use a scaler:
scaler = sklearn.preprocessing.StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)
In practice, a common scenario that requires feature standardization is SVM.
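A minimal sketch of the fit-on-train / transform-both pattern described above (the small arrays are made-up example data):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy training data (assumed)
test = np.array([[2.5, 15.0]])                             # toy test data (assumed)

scaler = StandardScaler().fit(train)    # learn mean and std from the training set only
train_scaled = scaler.transform(train)  # each column now has mean 0 and unit variance
test_scaled = scaler.transform(test)    # reuse the same mean/std on the test set
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # approximately [0, 0] and [1, 1]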
2. Min-max normalization
Min-max normalization applies a linear transformation to the original data, mapping it into the [0,1] interval (or some other fixed min-max range).
min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)
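A short sketch with made-up data: fit_transform learns the per-column min and max and maps each column into [0,1]; a different range can be passed via feature_range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]])  # toy data (assumed)
min_max_scaler = MinMaxScaler()                  # default feature_range=(0, 1)
X_scaled = min_max_scaler.fit_transform(X_train)
print(X_scaled)  # each column's min maps to 0 and its max maps to 1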
3. Normalization
Normalization maps values from different ranges of variation into the same fixed range, commonly [0,1]; this is also referred to as normalization.
Here, each sample is transformed to have unit norm.
X = [[ 1, -1, 2],[ 2, 0, 0], [ 0, 1, -1]]
sklearn.preprocessing.normalize(X, norm='l2')
which gives:
array([[ 0.408, -0.408, 0.816], [ 1. , 0. , 0. ], [ 0. , 0.707, -0.707]])
Notice that for each sample, e.g. 0.408^2 + (-0.408)^2 + 0.816^2 ≈ 1. This is the L2 norm: after the transformation, the sum of squares of each sample's features is 1. Similarly, with the L1 norm the sum of the absolute values of each sample's features is 1 after transformation. There is also the max norm, which divides each feature of a sample by the maximum absolute value among that sample's features.
When measuring similarity between samples, if you use a quadratic-form kernel, normalization is needed.
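To illustrate the three norms mentioned above, a small sketch applying each of them to the same example X:

from sklearn.preprocessing import normalize

X = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
print(normalize(X, norm='l2'))   # sum of squares of each row is 1
print(normalize(X, norm='l1'))   # sum of absolute values of each row is 1
print(normalize(X, norm='max'))  # each row divided by its maximum absolute value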
4. Feature binarization (Binarization)
Given a threshold, convert features to 0/1.
binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
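A worked sketch reusing the toy X from the normalization example (assumed here), where values above the threshold become 1 and the rest become 0:

from sklearn.preprocessing import Binarizer

X = [[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]]  # toy data (assumed)
binarizer = Binarizer(threshold=1.1)
print(binarizer.transform(X))  # [[0. 0. 1.], [1. 0. 0.], [0. 0. 0.]]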
5. Label binarization
lb = sklearn.preprocessing.LabelBinarizer()
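A minimal sketch (the label list is a made-up example) showing how multi-class labels become a one-vs-all indicator matrix:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
print(lb.fit_transform([1, 2, 6, 4, 2]))  # each row has a single 1 in the column of its class
print(lb.classes_)                        # [1 2 4 6]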
6. Categorical feature encoding
Sometimes features are categorical, but some algorithms require numeric input, so the categories need to be encoded.
enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray() #array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
In the example above, the first feature has two values (0 and 1) and is encoded with two bits; the second feature has three values and uses three bits; the third has four values and uses four bits.
Another way to encode categorical features:
newdf=pd.get_dummies(df,columns=["gender","title"],dummy_na=True)
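A small sketch with a made-up DataFrame (only the column names gender and title come from the line above); dummy_na=True adds an extra indicator column for missing values:

import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", None], "title": ["Dr", "Ms", "Dr"]})  # toy data (assumed)
newdf = pd.get_dummies(df, columns=["gender", "title"], dummy_na=True)
print(newdf.columns.tolist())
# ['gender_F', 'gender_M', 'gender_nan', 'title_Dr', 'title_Ms', 'title_nan']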
7. Label encoding
le = sklearn.preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6]) #array([0, 0, 1, 2])
# Conversion from non-numerical to numerical
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"]) #array([2, 2, 1])
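The mapping can also be reversed with inverse_transform; a short sketch continuing the string example above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(le.classes_)                      # ['amsterdam' 'paris' 'tokyo']
print(le.inverse_transform([2, 2, 1]))  # ['tokyo' 'tokyo' 'paris']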
8. When features contain outliers
sklearn.preprocessing.robust_scale
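robust_scale (and the corresponding RobustScaler class) centers with the median and scales with the interquartile range, so a few extreme values barely affect the result; a minimal sketch with made-up data containing one outlier:

import numpy as np
from sklearn.preprocessing import robust_scale

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # toy data with one outlier (assumed)
print(robust_scale(X))  # centered on the median and scaled by the IQR, so 100 does not distort the rest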
9. Generating polynomial features
This actually belongs to feature engineering: polynomial features / interaction (cross) features.
poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
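For a sample with two features [a, b], degree-2 polynomial features are [1, a, b, a^2, a*b, b^2]; a short sketch with made-up data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # toy data: [[0, 1], [2, 3], [4, 5]] (assumed)
poly = PolynomialFeatures(2)
print(poly.fit_transform(X))    # each row becomes [1, a, b, a^2, a*b, b^2]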