
Python data cleaning: binning statistics with pd.cut()

Editor: Python

Purpose

Split the product amounts in the raw data into ranges and count the number of orders in each range.

Solution:

Bin the amounts with pd.cut()

pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False,duplicates='raise')

  • x: the one-dimensional array-like to bin

  • bins: an integer, a sequence of scalars, or an IntervalIndex; it defines how the values are grouped.

         If it is an integer n, the range of x is divided into n equal-width bins (every bin spans the same width, and the range is extended by 0.1% so that the minimum and maximum of x are included).

         If it is a sequence of scalars, the values in the sequence are the bin edges.

         If it is an IntervalIndex, its intervals must not overlap.

  • right: Boolean, default True; whether each bin includes its right edge.

         With right=True (the default), bins=[1, 2, 3, 4] means (1, 2], (2, 3], (3, 4]; with right=False it means [1, 2), [2, 3), [3, 4).

         When bins is an IntervalIndex, this parameter is ignored.

  • labels: an array or False, optional; the labels for the resulting bins.

         If it is an array, its length must equal the number of bins. For example, bins=[1, 2, 3, 4] gives the 3 intervals (1, 2], (2, 3], (3, 4], so labels must have length 3.

         If False, only the integer indicator of each bin is returned, that is, the position of the bin each value of x falls into.

         When bins is an IntervalIndex, this parameter is ignored.

  • retbins: Boolean, default False; whether to also return the bin edges. This is useful when bins is an integer: set retbins=True to get back the edges that were computed.

  • precision: integer, default 3; the precision at which the bin labels are stored and displayed.

  • include_lowest: Boolean, default False; whether the first interval is closed on the left. By default the lowest edge is not included in the first bin.

  • duplicates: 'raise' (the default) or 'drop'; if the bin edges are not unique, either raise a ValueError or drop the duplicate edges.
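The parameters above can be tried on a tiny series. This is a minimal sketch, not part of the original article: the sample values and variable names are made up for illustration, and only documented pd.cut() arguments are used.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = pd.Series([1, 2, 3, 4])

# bins as an integer: three equal-width bins; retbins=True also returns the edges
binned, edges = pd.cut(values, bins=3, retbins=True)

# bins as a sequence with the default right=True: (1, 2], (2, 3], (3, 4]
# The value 1 sits on the open left edge of the first bin, so it becomes NaN
right_closed = pd.cut(values, bins=[1, 2, 3, 4])

# include_lowest=True closes the first interval on the left, so 1 is kept
with_lowest = pd.cut(values, bins=[1, 2, 3, 4], include_lowest=True)

# right=False flips the closed side: [1, 2), [2, 3), [3, 4), so 4 becomes NaN
left_closed = pd.cut(values, bins=[1, 2, 3, 4], right=False)

# labels=False returns the integer position of each value's bin
codes = pd.cut(values, bins=[0, 2, 4], labels=False)

# duplicates='drop' removes the repeated edge instead of raising ValueError
deduped = pd.cut(values, bins=[0, 2, 2, 4], duplicates='drop')

print(edges)                        # the four edges computed for bins=3
print(right_closed.isna().tolist())
print(left_closed.isna().tolist())
print(codes.tolist())
```

Comparing right_closed with with_lowest shows why a value equal to the lowest edge silently drops out of the result unless include_lowest=True is set.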

Then summarize the counts per bin with pd.pivot_table()

The code:

1. Get the data

import pandas as pd

df = pd.read_excel(' newly build XLSX Worksheet .xlsx', sheet_name='Sheet1')  # Read the worksheet into a DataFrame
# Select the columns to use; .copy() avoids a SettingWithCopyWarning when a column is added later
data = df[[' The order number ',' Time of payment ',' Order amount ',' Name of commodity ',' Product specifications ',' The number ',' The amount of goods ']].copy()
print(data.head())  # Preview the first five rows

2. Bin the amounts

# Define the edges of the amount ranges
je_cut = [0, 10, 50, 100, 150, 200, 500, 100000]
# One label per range: 8 edges give 7 bins, so 7 labels
je_label = ['0-10', '10-50', '50-100', '100-150', '150-200', '200-500', 'Above 500']
data[' Amount range '] = pd.cut(data[' The amount of goods '], bins=je_cut, labels=je_label)
print(data)
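One thing worth checking after this step is whether any amount fell outside the bins. The sketch below is an illustration rather than part of the original code: the amounts are made up, and the open-ended last edge and include_lowest=True are assumed variations on the edges above.

```python
import pandas as pd

# Hypothetical amounts; 0 and 250000 probe the outer edges of the bins
amounts = pd.Series([0, 5, 10, 75, 480, 250000], name='amount')

je_cut = [0, 10, 50, 100, 150, 200, 500, 100000]
je_label = ['0-10', '10-50', '50-100', '100-150', '150-200', '200-500', 'Above 500']

# With these edges, 0 (open left edge) and 250000 (past the last edge) get NaN
as_in_article = pd.cut(amounts, bins=je_cut, labels=je_label)

# Assumed variant: close the first bin on the left and leave the last bin unbounded
open_ended = je_cut[:-1] + [float('inf')]
adjusted = pd.cut(amounts, bins=open_ended, labels=je_label, include_lowest=True)

print(as_in_article.isna().sum())  # number of amounts that matched no bin
print(adjusted.tolist())
```

Rows with NaN in the range column are left out of the grouped counts in the next step, so a quick isna().sum() check helps explain totals that do not add up.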

3. Statistical analysis

# Count the distinct, non-missing order numbers in each amount range
result = pd.pivot_table(data, values=[' The order number '], index=[' Amount range '],
                        aggfunc={' The order number ': 'nunique'},
                        fill_value=0).reset_index(drop=False)  # fill_value=0 replaces missing values with 0
result = result.rename(columns={' The order number ': 'Number of orders'})  # Rename the column
print(result)
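The same count can also be written with groupby, which some readers find easier to follow than a pivot table. This is a sketch under assumptions: the small DataFrame and the column names order_no, amount, amount_range, and order_count are hypothetical stand-ins, not the article's data.

```python
import pandas as pd

# Hypothetical orders: A1 appears twice, and one row has no order number
orders = pd.DataFrame({
    'order_no': ['A1', 'A1', 'A2', 'A3', None],
    'amount': [5, 8, 30, 30, 70],
})

# Bin the amounts into labeled ranges
orders['amount_range'] = pd.cut(
    orders['amount'], bins=[0, 10, 50, 100], labels=['0-10', '10-50', '50-100']
)

# nunique() counts distinct order numbers per range and ignores missing ones;
# observed=False keeps ranges that contain no orders in the output
counts = (
    orders.groupby('amount_range', observed=False)['order_no']
    .nunique()
    .reset_index(name='order_count')
)

print(counts)
```

Here the repeated order A1 is counted once and the row without an order number is not counted, which matches the intent of counting unique, non-missing order numbers.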

That completes the binned statistics for the data.

