
Four ways to bin numeric values with pandas


Use pandas' between, cut, qcut, and value_counts to discretize numeric variables.

Binning, also called bucketing or discretization, is a common data preprocessing technique that groups intervals of continuous data into "bins" or "buckets". This article covers four ways to bin numeric values with the Python pandas library.

Let's create the following synthetic data for the demonstration:

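import pandas as pd  # version 1.3.5
import numpy as np

def create_df():
    # 1000 random integer scores from 0 to 100 inclusive
    df = pd.DataFrame({'score': np.random.randint(0, 101, 1000)})
    return df

# Keep the returned DataFrame so that df exists for the steps below
df = create_df()
df.head()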

The data contains test scores from 0 to 100 for 1000 students. The task is to convert the numeric scores into the grades "A", "B", and "C", where "A" is the best grade and "C" is the worst.

1. between & loc

The pandas .between method returns a boolean vector that is True wherever the corresponding Series element lies between the boundary values left and right.

It takes three parameters:

left: the left boundary.
right: the right boundary.
inclusive: which boundaries to include. Accepted values (as of pandas 1.3) are {"both", "neither", "left", "right"}.
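As a minimal sketch of how the inclusive parameter changes the result, the snippet below applies all four settings to a tiny Series whose first and last values sit exactly on the boundaries. The sample values and variable names are chosen for illustration only, and the string form of inclusive assumes pandas 1.3 or later.

```python
import pandas as pd

# Illustrative values: 50 and 80 lie exactly on the boundaries.
s = pd.Series([50, 65, 80])

both = s.between(50, 80, inclusive="both")        # [50, 80]
neither = s.between(50, 80, inclusive="neither")  # (50, 80)
left = s.between(50, 80, inclusive="left")        # [50, 80)
right = s.between(50, 80, inclusive="right")      # (50, 80]

# Side-by-side view of the four boolean masks
print(pd.DataFrame({"score": s, "both": both, "neither": neither,
                    "left": left, "right": right}))
```

Only the boundary values 50 and 80 change between the four masks; 65 is inside the interval in every case.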

The students are graded according to the following interval rules:

A: (80, 100]
B: (50, 80]
C: [0, 50]

A square bracket, [ or ], means the boundary value is included; a parenthesis, ( or ), means it is excluded. We need to determine which interval each score falls into and assign the corresponding grade. Note how the inclusive argument differs between the lines below to control which boundaries are included:

df.loc[df['score'].between(0, 50, inclusive='both'), 'grade'] = 'C'     # [0, 50]
df.loc[df['score'].between(50, 80, inclusive='right'), 'grade'] = 'B'   # (50, 80]
df.loc[df['score'].between(80, 100, inclusive='right'), 'grade'] = 'A'  # (80, 100]

Here is the number of students in each grade (the data is random and unseeded, so the exact counts vary from run to run):

df.grade.value_counts()

C 488
B 310
A 202
Name: grade, dtype: int64

This method requires a separate line of code for each bin, so it is only practical when there are few bins.
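One way to reduce the repetition, sketched here under assumptions rather than taken from the original article, is to keep the interval rules in a list and loop over them. The rules list, the seed, and the variable names are hypothetical; the data is regenerated with a seeded generator so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator stands in for the article's random data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"score": rng.integers(0, 101, 1000)})

# Hypothetical rule table: (left, right, inclusive, grade) for each bin.
rules = [
    (0, 50, "both", "C"),     # [0, 50]
    (50, 80, "right", "B"),   # (50, 80]
    (80, 100, "right", "A"),  # (80, 100]
]

for left, right, inclusive, grade in rules:
    mask = df["score"].between(left, right, inclusive=inclusive)
    df.loc[mask, "grade"] = grade

print(df["grade"].value_counts())
```

Each extra bin still needs its own rule, so cut remains the more convenient choice once the number of bins grows.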

2. cut

You can use cut to sort values into discrete intervals. This function is also useful for converting a continuous variable into a categorical one.

The parameters of cut used here are:

x: the array to bin. Must be one-dimensional.
bins: a sequence of scalars that defines the bin edges, which allows bins of non-uniform width.
labels: the labels for the returned bins. Must have the same length as the number of resulting bins, that is, one fewer than the number of bin edges.
include_lowest: (bool) whether the first interval should include its left boundary.
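The effect of include_lowest is easiest to see on a few boundary values. The following is a small sketch with made-up scores: by default the intervals are open on the left, so a score equal to the first bin edge gets no bin at all unless include_lowest is set to True.

```python
import pandas as pd

# Illustrative scores, several of them exactly on a bin edge.
scores = pd.Series([0, 50, 51, 80, 100])
bins = [0, 50, 80, 100]
labels = ["C", "B", "A"]  # three labels for four edges

# Default: intervals are (0, 50], (50, 80], (80, 100], so 0 falls outside.
default = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 50].
with_lowest = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores,
                    "default": default,
                    "include_lowest": with_lowest}))
```

In the default result the score 0 is missing (NaN), while with include_lowest=True it is graded "C".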

bins = [0, 50, 80, 100] 
labels = ['C', 'B', 'A'] 
df['grade'] = pd.cut(x = df['score'], 
                      bins = bins, 
                      labels = labels, 
                      include_lowest = True)

Here we create a bins list that holds the bin edges and a labels list that holds the corresponding bin labels.

Check the number of students in each grade:

df.grade.value_counts()

C 488
B 310
A 202
Name: grade, dtype: int64

The result is the same as in the previous example.

3. qcut

qcut discretizes a variable into equal-sized buckets based on rank or on sample quantiles [3].

In the previous examples we defined the score interval for each grade ourselves, which left an uneven number of students in each grade. In the following example we instead split the students into 3 grades with (approximately) equal numbers of students. The sample contains 1000 students, so each bin should hold about 333 students.

The parameters of qcut used here are:

x: the input array to bin. Must be one-dimensional.
q: the number of quantiles. 10 means deciles, 4 means quartiles, and so on. It can also be an array of quantiles, for example [0, .25, .5, .75, 1.] for quartiles.
labels: the labels for the bins. Must have the same length as the number of resulting bins.
retbins: (bool) whether to also return the bin edges along with the binned result.
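As a small sketch of the two forms of q, the snippet below bins the integers 1 to 100 (an illustrative input, not the article's data) once with q=4 and once with the equivalent explicit quantile list. Both calls use retbins=True, so each returns a tuple of the binned values and the bin edges.

```python
import numpy as np
import pandas as pd

# Illustrative input: the integers 1..100.
s = pd.Series(range(1, 101))

# q as an integer: 4 means quartiles.
by_count, edges = pd.qcut(s, q=4, retbins=True)

# q as an explicit list of quantiles describing the same quartiles.
by_list, edges_from_list = pd.qcut(s, q=[0, .25, .5, .75, 1.], retbins=True)

print(edges)                    # bin edges computed from the sample quantiles
print(by_count.value_counts())  # number of values per bin
```

Each of the four bins holds 25 values, and both calls produce the same edges.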

df['grade'], cut_bin = pd.qcut(df['score'], 
                          q = 3, 
                          labels = ['C', 'B', 'A'], 
                          retbins = True) 
df.head()

Because retbins is set to True, the bin edges are returned as well.

print(cut_bin)
>> [  0.  36.  68. 100.]

The score intervals are as follows:

C: [0, 36]
B: (36, 68]
A: (68, 100]
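The returned edges can be reused. The sketch below, which is an illustration rather than part of the original article, passes the edges from qcut to cut and checks that this reproduces the same grades. The seed and the variable names are assumptions; the same pattern could be used to apply quantile edges learned on one data set to another.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator stands in for the article's random data.
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, 1000))
labels = ["C", "B", "A"]

# Quantile-based grades plus the edges that define them.
grades, edges = pd.qcut(scores, q=3, labels=labels, retbins=True)

# Reapply the same edges with cut; include_lowest keeps the minimum score.
regraded = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)

print(edges)
print((grades.astype(str) == regraded.astype(str)).all())
```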

Use .value_counts() to check how many students are in each grade. Ideally, each bin should hold about 333 students.

df.grade.value_counts()

C 340
A 331
B 329
Name: grade, dtype: int64

4. value_counts

Although pandas .value_counts is usually used to count the unique values in a Series, its bins parameter can also group the values into half-open bins.

df['score'].value_counts(bins = 3, sort = False)

By default, .value_counts sorts the returned Series in descending order of its values (the counts). Setting sort to False returns the Series in ascending order of its index instead, that is, in bin order.

(-0.101, 33.333] 310
(33.333, 66.667] 340
(66.667, 100.0] 350
Name: score, dtype: int64

The index of the returned Series shows the range of each bin, where a square bracket means the boundary value is included and a parenthesis means it is excluded. The values of the returned Series are the number of records in each bin.

Unlike .qcut, the number of records in each bin is not necessarily (approximately) equal. .value_counts does not assign the same number of records to each bin; instead it splits the score range into 3 equal-width parts based on the lowest and highest scores. The minimum score is 0 and the maximum is 100, so each of the 3 parts spans about 33.33 points. That also explains why the bin edges are multiples of 33.33.

We can also define the bin edges ourselves by passing in a list of boundaries.
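The difference between equal-width and equal-count binning shows up clearly on skewed data. The following sketch uses six made-up values, most of them small, and compares value_counts(bins=3) with qcut(q=3); the data and names are illustrative only.

```python
import pandas as pd

# Illustrative skewed data: most values are small, two are large.
s = pd.Series([0, 5, 10, 20, 50, 100])

# Equal-width bins: the range 0..100 is cut into three parts of about 33.33.
equal_width = s.value_counts(bins=3, sort=False)

# Equal-count bins: edges are placed at the sample quantiles instead.
equal_count = pd.qcut(s, q=3).value_counts().sort_index()

print(equal_width)
print(equal_count)
```

The equal-width bins hold 4, 1, and 1 values, while the quantile-based bins hold 2 values each.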

df['score'].value_counts(bins = [0,50,80,100], sort = False)

(-0.001, 50.0] 488
(50.0, 80.0] 310
(80.0, 100.0] 202
Name: score, dtype: int64

This gives the same result as examples 1 and 2.

Summary

This article showed how to bin continuous values with .between, .cut, .qcut, and .value_counts. The source code for this article is available here:

[1] Source: https://colab.research.google.com/drive/1yWTl2OzOnxG0jCdmeIN8nV1MoX3KQQ_1%3Fusp%3Dsharing
