Author: Peter; Editor: Peter
Hello everyone, I'm Peter~
I've written many articles about Pandas. This one puts it to use in a simple, comprehensive case study.
The data used in this case study is simulated by the author. It consists of two datasets, fruit information and order information, which are then merged into one.
import pandas as pd
import numpy as np
import random
from datetime import *
import time

import plotly.express as px
import plotly.graph_objects as go
import plotly as py

# Used to draw subplots
from plotly.subplots import make_subplots
1、 The time field
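A minimal sketch of how the order times could be simulated. The name time_range matches the variable used in step 3, and the 2019 to 2021 range follows the yearly charts later in the article. The number of orders and the random seed are my own assumptions.

```python
import numpy as np
import pandas as pd

# Assumptions (not from the article): 1000 orders and a fixed random seed
n_orders = 1000
rng = np.random.default_rng(42)

start = pd.Timestamp("2019-01-01")
end = pd.Timestamp("2021-12-31")

# Draw a random day offset for every order, then turn the offsets into dates
day_offsets = rng.integers(0, (end - start).days + 1, size=n_orders)
time_range = (start + pd.to_timedelta(day_offsets, unit="D")).sort_values()

print(time_range[:3])
```

Sorting is optional. It only makes the simulated order table easier to read.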
2、 Fruit and users
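A rough sketch of how the fruit and customer columns could be drawn at random. The names fruits, fruit_list and name_list match the variables used in steps 3 and 4. The fruit names themselves are placeholders, and only xiaoming, Michk and Mike are customer names that appear in this article.

```python
import random

# The eight fruit names are placeholders; the article only fixes their count
# (eight prices are listed in the fruit information table)
fruits = ["apple", "pear", "grape", "orange", "peach", "banana", "mango", "plum"]

# xiaoming, Michk and Mike are mentioned in the article; the others are made up
names = ["xiaoming", "Michk", "Mike", "Tom", "Jimmy"]

n_orders = 1000  # assumed; it must equal len(time_range)
random.seed(42)

# One randomly chosen fruit and one randomly chosen customer per order
fruit_list = random.choices(fruits, k=n_orders)
name_list = random.choices(names, k=n_orders)

print(fruit_list[:5], name_list[:5])
```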
3、 Generate order data
order = pd.DataFrame({
    "time": time_range,    # order time
    "fruit": fruit_list,   # fruit name
    "name": name_list,     # customer name
    # purchase quantity
    "kilogram": np.random.choice(list(range(50, 100)), size=len(time_range), replace=True)
})

order
4、 Generate fruit information data
infortmation = pd.DataFrame({
    "fruit": fruits,
    "price": [3.8, 8.9, 12.8, 6.8, 15.8, 4.9, 5.8, 7],
    "region": ["South China", "North China", "Northwest", "Central China",
               "Northwest", "South China", "North China", "Central China"]
})

infortmation
5、 Data merging
Merge the order data and the fruit information directly into one complete DataFrame. This df is the data processed in the rest of the article.
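A small sketch of the merge, using tiny made-up stand-ins for the two tables. Joining on the shared fruit column with a left join is my assumption about how the two tables are combined.

```python
import pandas as pd

# Tiny stand-ins for the order and fruit tables (values are made up)
order = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-05", "2021-01-09", "2021-02-01"]),
    "fruit": ["apple", "pear", "apple"],
    "name": ["xiaoming", "Mike", "Mike"],
    "kilogram": [60, 75, 90],
})
infortmation = pd.DataFrame({
    "fruit": ["apple", "pear"],
    "price": [3.8, 8.9],
    "region": ["South China", "North China"],
})

# Left join on the shared fruit column: every order keeps its row and
# gains the price and region of the fruit it refers to
df = pd.merge(order, infortmation, on="fruit", how="left")
print(df)
```

A left join keeps every order even if a fruit were missing from the information table. With complete data an inner join gives the same result.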
6、 Generate a new field: order amount
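A sketch of the new field, assuming the order amount is simply the purchased quantity times the unit price. The column names kilogram, price and amount are the ones used throughout the article, and the rows are made up.

```python
import pandas as pd

# Made-up rows with the same columns as the merged df
df = pd.DataFrame({
    "fruit": ["apple", "pear"],
    "kilogram": [60, 75],
    "price": [3.8, 8.9],
})

# amount = quantity * unit price, computed for the whole column at once
df["amount"] = df["kilogram"] * df["price"]
print(df)
```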
Here you can learn:
1、 First, extract the year and month:
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
# Extract the year and month at the same time
df["year_month"] = df["time"].dt.strftime('%Y%m')

df
2、 Check the field types:
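A quick way to check the types is the dtypes attribute. The sketch below rebuilds the three derived columns on two made-up dates so that it runs on its own.

```python
import pandas as pd

# Two made-up order times
df = pd.DataFrame({"time": pd.to_datetime(["2019-03-01", "2020-11-15"])})
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
df["year_month"] = df["time"].dt.strftime("%Y%m")

# One dtype per column
print(df.dtypes)
```

The time column is a datetime type, year and month are integers, and year_month holds strings because strftime returns text.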
3、 Aggregate and plot by year and month:
# Total sales volume by month
df1 = df.groupby(["year_month"])["kilogram"].sum().reset_index()

fig = px.bar(df1, x="year_month", y="kilogram", color="kilogram")
fig.update_layout(xaxis_tickangle=45)  # tilt the x-axis tick labels
fig.show()
# Total order amount by month
df2 = df.groupby(["year_month"])["amount"].sum().reset_index()
df2["amount"] = df2["amount"].apply(lambda x: round(x, 2))

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df2["year_month"],
    y=df2["amount"],
    mode='lines+markers',  # draw both lines and markers
    name='lines'))         # trace name
fig.update_layout(xaxis_tickangle=45)  # tilt the x-axis tick labels
fig.show()
# Sales volume and order amount per year and fruit
df4 = df.groupby(["year", "fruit"]).agg({"kilogram": "sum", "amount": "sum"}).reset_index()
df4["year"] = df4["year"].astype(str)
df4["amount"] = df4["amount"].apply(lambda x: round(x, 2))

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=["2019", "2020", "2021"],
    specs=[[{"type": "domain"},   # "type" sets the subplot type; pie charts need "domain"
            {"type": "domain"},
            {"type": "domain"}]]
)

years = df4["year"].unique().tolist()

for i, year in enumerate(years):
    name = df4[df4["year"] == year].fruit
    value = df4[df4["year"] == year].kilogram

    # One pie chart per year, placed in column i + 1
    fig.add_trace(go.Pie(labels=name, values=value), row=1, col=i + 1)

fig.update_traces(
    textposition='inside',            # 'inside', 'outside', 'auto', 'none'
    textinfo='percent+label',
    insidetextorientation='radial',   # horizontal, radial, tangential
    hole=.3,
    hoverinfo="label+percent+name"
)

# fig.update_layout(title_text="Making multi-row, multi-column subplots")
fig.show()
years = df4["year"].unique().tolist()

# One treemap per year, sized by order amount
for year in years:
    df5 = df4[df4["year"] == year]
    fig = go.Figure(go.Treemap(
        labels=df5["fruit"].tolist(),
        parents=df5["year"].tolist(),
        values=df5["amount"].tolist(),
        textinfo="label+value+percent root"
    ))
    fig.show()
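# Assumption: df5 is re-aggregated by month and fruit here,
# because the per-year df5 used for the treemaps has no year_month column
df5 = df.groupby(["year_month", "fruit"])["amount"].sum().reset_index()

fig = px.bar(df5, x="year_month", y="amount", color="fruit")
fig.update_layout(xaxis_tickangle=45)  # tilt the x-axis tick labels
fig.show()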
A line chart shows the changes over time:
df7 = df.groupby(["year","region"])["amount"].mean().reset_index()
df8 = df.groupby(["name"]).agg({"time": "count", "amount": "sum"}).reset_index().rename(columns={"time": "order_number"})
df8.style.background_gradient(cmap="Spectral_r")
Analyze the number of orders and the order amount for each fruit, per user:
df9 = df.groupby(["name", "fruit"]).agg({"time": "count", "amount": "sum"}).reset_index().rename(columns={"time": "number"})
df10 = df9.sort_values(["name", "number", "amount"], ascending=[True, False, False])
df10.style.bar(subset=["number", "amount"], color="#a97fcf")

px.bar(df10,
       x="fruit",
       y="amount",
       # color="number",
       facet_col="name"
       )
The RFM model is an important tool for measuring customer value and a customer's ability to generate profit.
The model captures three indicators for each user: how recently they last transacted (R, Recency), how often they transact overall (F, Frequency), and the total amount they have spent (M, Monetary). Together the three indicators describe a customer's value, and they also divide customers into 8 value classes.
Below, Pandas is used to compute the three indicators one by one. First come F and M: the number of orders and the total amount for each customer.
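A minimal sketch of F and M with a single groupby. The rows are made up, and the output column names F and M are my own choice.

```python
import pandas as pd

# Made-up orders for two customers
df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "time": pd.to_datetime(["2021-11-01", "2021-12-15",
                            "2021-10-03", "2021-11-20", "2021-12-28"]),
    "amount": [228.0, 667.5, 300.0, 150.0, 410.0],
})

# F = number of orders per customer, M = total amount per customer
df_fm = (
    df.groupby("name")
      .agg(F=("time", "count"), M=("amount", "sum"))
      .reset_index()
)
print(df_fm)
```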
How is the R indicator computed?
1、 First, compute the difference between each order's time and the current time
2、 Sort each user's differences R in ascending order. The first row is then that user's most recent purchase. Take the user xiaoming as an example: the most recent purchase was on December 15, which is 25 days before the current time
3、 Drop duplicates by user and keep the first row, which gives each user's R indicator:
4、 Merge the data to obtain all three indicators:
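The four steps above can be sketched as follows. The rows are made up, and the "current time" is fixed to an assumed date so that xiaoming's last purchase on December 15 lies 25 days back, as in the example. The years are also my assumption.

```python
import pandas as pd

# Made-up orders for two customers
df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "time": pd.to_datetime(["2021-11-01", "2021-12-15",
                            "2021-10-03", "2021-11-20", "2021-12-28"]),
    "amount": [228.0, 667.5, 300.0, 150.0, 410.0],
})

# Assumed "current time", fixed so the example is reproducible
now = pd.Timestamp("2022-01-09")

# Step 1: days between each order and the current time
df["R"] = (now - df["time"]).dt.days

# Steps 2 and 3: sort so each user's smallest R comes first, then keep that row
df_r = (
    df.sort_values(["name", "R"], ascending=[True, True])
      .drop_duplicates(subset="name", keep="first")[["name", "R"]]
)

# Step 4: compute F and M per user and merge them with R
df_fm = df.groupby("name").agg(F=("time", "count"), M=("amount", "sum")).reset_index()
rfm = df_r.merge(df_fm, on="name")
print(rfm)
```

Taking the minimum of R per user with a groupby would give the same R values. The sort and drop_duplicates version simply follows the steps as described.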
When the dataset is large enough and there are enough users, the RFM model can be used to divide users into the 8 types.
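One possible way to get 8 types is to mark each indicator as high or low, which gives 2 × 2 × 2 = 8 combinations. The sketch below splits at the mean of each indicator. That threshold, the flag columns and the sample values are all my assumptions, not something this article specifies.

```python
import pandas as pd

# Made-up RFM table for four users
rfm = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "R": [5, 40, 12, 60],
    "F": [20, 3, 4, 15],
    "M": [5000.0, 300.0, 4200.0, 800.0],
})

# 1 marks the "good" side of the mean:
# recent (low R), frequent (high F), high spend (high M)
rfm["R_flag"] = (rfm["R"] < rfm["R"].mean()).astype(int)
rfm["F_flag"] = (rfm["F"] > rfm["F"].mean()).astype(int)
rfm["M_flag"] = (rfm["M"] > rfm["M"].mean()).astype(int)

# Three binary flags combine into one of 8 codes, from "000" to "111"
rfm["segment"] = (
    rfm["R_flag"].astype(str) + rfm["F_flag"].astype(str) + rfm["M_flag"].astype(str)
)
print(rfm[["name", "segment"]])
```

Each code can then be mapped to whatever class names the business uses. Medians or fixed business thresholds are common alternatives to the mean.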
The repurchase cycle is the time interval between two consecutive purchases. Take the user xiaoming as an example: the first 2 repurchase cycles are 4 days and 22 days.
The repurchase cycle of each user is computed as follows:
1、 Sort each user's purchase times in ascending order
2、 Shift the time column by one row:
3、 Merge the two and take the difference:
The null values appear in each user's first record, which has no earlier purchase before it. These null rows are simply dropped afterwards.
Then extract the numeric part of the day difference directly:
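The steps above can be sketched like this. The purchase dates are made up so that xiaoming's first two cycles come out as 4 and 22 days, as in the example. The shift is done per user with groupby, which is my choice of technique. The result is named df16 to match the chart code in the next step, and the helper columns prev_time and diff are hypothetical names.

```python
import pandas as pd

# Made-up purchase times for two users
df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "Mike", "xiaoming"],
    "time": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-05",
                            "2021-02-02", "2021-01-27"]),
})

# Step 1: sort each user's purchases by time
df_sorted = df.sort_values(["name", "time"]).reset_index(drop=True)

# Step 2: previous purchase time of the same user (NaT for the first purchase)
df_sorted["prev_time"] = df_sorted.groupby("name")["time"].shift(1)

# Step 3: interval between consecutive purchases
df_sorted["diff"] = df_sorted["time"] - df_sorted["prev_time"]

# Drop each user's first record, then keep only the number of days
df16 = df_sorted.dropna(subset=["diff"]).copy()
df16["day"] = df16["diff"].dt.days
print(df16[["name", "day"]])
```

Shifting inside the groupby keeps one user's last purchase from being paired with the next user's first purchase.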
5、 Repurchase cycle comparison
px.bar(df16,
       x="day",
       y="name",
       orientation="h",
       color="day",
       color_continuous_scale="spectral"   # purples
       )
In the figure above, the narrower a rectangle, the shorter the interval. The length of a user's whole bar gives that user's total repurchase cycle. Next, look at the sum and the average of each user's repurchase cycles:
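A short sketch of that summary, using a made-up df16 with the same name and day columns. The result name df17 is hypothetical.

```python
import pandas as pd

# Made-up repurchase cycles in days
df16 = pd.DataFrame({
    "name": ["Mike", "xiaoming", "xiaoming"],
    "day": [30, 4, 22],
})

# Total and average repurchase cycle per user
df17 = (
    df16.groupby("name")["day"]
        .agg(["sum", "mean"])
        .reset_index()
)
print(df17)
```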
Conclusion: for the two users Michk and Mike, the total repurchase cycle is relatively long, which marks them as loyal users over the long run. Their average repurchase cycle is also relatively low, which indicates that they repurchase actively within short intervals.
The violin plot below shows the same thing: the repurchase cycle distributions of Michk and Mike are the most concentrated.