您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python data cleaning: pd Cut() box splitting statistics

編輯：Python

Purpose ：

Segment the commodity amount of the original data , Count the number of orders in each section

Solutions ：

Use by box pd.cut()

pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False,duplicates='raise')
x ： One dimensional array
bins ： Integers , Scalar sequence or interval index , It is the basis for grouping ,
         If you fill in an integer n, Will be x The values in are divided into equal width n Share ( That is, the difference between the maximum value and the minimum value in each group is approximately equal );
         If it's a scalar sequence , The values in the sequence represent the dividing values used for grading
       If it is an interval index ,“ bins” The interval indexes of must not overlap
right ： Boolean value , The default is True Indicates that it contains the rightmost value
       When “ right = True”( The default value is ) when , be “ bins”=[1、2、3、4] Express (1,2],(2,3],(3,4]
       When bins Is an interval index , This parameter is ignored .
labels : Array or Boolean , Optional . Specify the label of the sub box
       If it's an array , The length should be consistent with the number of boxes , such as “ bins”=[1、2、3、4] Express (1,2],(2,3],(3,4] altogether 3 Intervals , be labels The length of the tag is the number of tags 3
       If False, Then only the integer indicator of the bin is returned , namely x The data in is in the first few boxes
       When bins Is the interval index , This parameter will be ignored
retbins： Whether to display the boundary value of sub box . The default is False, When bins When taking an integer, you can set retbins=True To display the boundary value , Get the divided interval
precision： Integers , Default 3, Accuracy of storing and displaying sub box labels .
include_lowest： Boolean value , Indicates whether the left side of the interval is open or closed , The default is false, That is, the left side of the interval is not included .
duplicates： If the critical value of container separation is not unique , The cause ValueError Or discard non unique

Visual Statistics , Use pd.pivot_table()

Specific code ：

1、 get data

df = pd.read_excel(' newly build XLSX Worksheet .xlsx',sheet_name= 'Sheet1') # Read data entry Python Program
data = df[[' The order number ',' Time of payment ',' Order amount ',' Name of commodity ',' Product specifications ',' The number ',' The amount of goods ']] # Select the data column you want to use
print(data.head()) # Print data

2、 Sub container handling

# Define the amount range
je_cut =[0,10,50,100,150,200,500,100000]
je_label = ['0-10','10-50','50-100','100-150','150-200','200-500','500 above '] # Label the amount range
data[' Amount range '] = pd.cut(data[' The amount of goods '],bins=je_cut,labels= je_label)
print(data)

3、 Statistical analysis

Data = pd.pivot_table(data,values = [' The order number '],index = [' Amount range '],aggfunc = {' The order number ':lambda x: len(x.dropna().unique())}
,fill_value =0).reset_index(drop = False) # fill_value = 0 Is used to fill in missing values 、 Null value
Data = Data.rename(columns ={' The order number ':' Number of orders '}) # Change column names
print(Data)