您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Save pandas plan (20) -- count the monthly order volume of retail stores

編輯：Python

save pandas plan （20）—— Count the monthly order volume of retail stores

- / Data requirement
- / Demand processing
- / summary

Recently, I found that many friends around me are not happy to use pandas, Switch to other data operation Libraries , As a data worker , Basically open your mouth pandas, Closed mouth pandas 了 , So I wrote this series to make more friends fall in love with pandas.

Series article description ：

Series name （ Serial number of series articles ）—— This series of articles specifically address the needs

platform ：

windows 10
python 3.8
pandas >=1.2.4

/ Data requirement

Recently I was reading a book about using pandas A book for data processing , On 2020 Published in , There is a section on the statistical data processing of online retail goods , Each order and each item is recorded separately , Therefore, when you only care about orders, you will find that there are multiple same order numbers , This article discusses how to count the monthly order quantity . The data is read as follows ：

import pandas as pd
df = pd.read_csv('Online_Retail.csv.zip', parse_dates=['InvoiceDate'])
df_new = df.dropna().copy()
# Lending month 
df_new['YearMonth'] = df_new['InvoiceDate'].map(lambda x: 100 * x.year + x.month)

ps: Data acquisition method ：

github：
https://github.com/lk-itween/FunnyCodeRepository/raw/main/PandasSaved/data/Online_Retail.csv.zip

(406829, 9)

/ Demand processing

Because only the order number is concerned , Repeated order numbers will make the data statistics inaccurate , The order number needs to be de duplicated before statistics .

Mode one ： Used in books unique Statistics will be made later .

df_new.groupby('InvoiceNo')['YearMonth'].unique().value_counts().sort_index()

pandas from 2020 It has been updated many times since its development in , The methods in the previous book may not be implemented , In this case, the following error reports will be generated , The reason is unique() After execution, each row of data is of list type ,value_counts Can't handle .

Change the code as follows to complete the requirements .

df_new.groupby('InvoiceNo')['YearMonth'].unique().explode().value_counts().sort_index()

（ Manual watermark ： original CSDN The fate of the sleepers ,https://blog.csdn.net/weixin_46281427?spm=1011.2124.3001.5343 , official account A11Dot send )

Mode two ： Yes groupby Use of results value_counts De duplication and re statistics .

df_new.groupby('InvoiceNo')['YearMonth'].value_counts().reset_index(name='count')['YearMonth'].value_counts().sort_index()

first value_counts The function of is to YearMonth duplicate removal , The required column name is already used as an index , adopt reset_index Reset index to column data , Right again YearMonth Conduct value_counts Count the monthly order volume .

On the same computer , This method is faster than the method mentioned in the book , Probably unique It takes some time to process , However, this treatment complicates the thinking ,pandas De reprocessing can be used directly drop_duplicates.

Mode three ：drop_duplicates Statistics after weight removal .

df_new[['InvoiceNo', 'YearMonth']].drop_duplicates()['YearMonth'].value_counts().sort_index()

Compare the first two methods , The code is a lot shorter , Processing time is also reduced .

/ summary

This article introduces the examples in the book , Reproduce the code in the book , Combined with existing data processing methods , Step by step optimize the way your code is handled , Explain the similarities and differences of each method , Complete data requirements . The source data can be obtained at the beginning of the article .

Watch the sky , Listen to the wind and rain at dawn .

Made on June 22, 2002