您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Pandas+matplotlib, in simple terms, python data analysis

編輯：Python

I am from 17 Year to learn Python And data analysis , Along the way , There is no particularly suitable book , I've seen a lot , But most of them are scattered , At the end of the article, a set of PythonPDF Book tutorial , Dear friends You can study hard ！

Explore charts with visualization

One 、 Data visualization and exploration diagram

Data visualization refers to the presentation of data in the form of graphics or tables . The chart can clearly show the nature of the data , And the relationship between data or attributes , It's easy for people to see the picture and interpret . Users explore the map （Exploratory Graph） You can understand the characteristics of data 、 Look for trends in data 、 Lower the threshold of data understanding .

Two 、 Common chart examples

This chapter mainly adopts Pandas The way to draw , Instead of using Matplotlib modular . Actually Pandas Have the Matplotlib The drawing method is integrated into DataFrame in , So in practice , Users do not need to directly reference Matplotlib You can also complete the work of drawing .

1. Broken line diagram

Broken line diagram （line chart） Is the most basic chart , It can be used to present the relationship between continuous data in different fields . The line chart is drawn using plot.line() Methods , You can set the color 、 Shape and other parameters . On the use , The drawing method of line drawing completely inherits Matplotlib Usage of , So the program must finally call plt.show() Production graph , Pictured 8.4 Shown .

df_iris[['sepal length (cm)']].plot.line() 
plt.show()
ax = df[['sepal length (cm)']].plot.line(color='green',title="Demo",style='--') 
ax.set(xlabel="index", ylabel="length")
plt.show()

2. Scatter map

Scatter map （Scatter Chart） Used to view the relationship between discrete data in different fields . Scatter charts are drawn using df.plot.scatter(), Pictured 8.5 Shown .

df = df_iris
df.plot.scatter(x='sepal length (cm)', y='sepal width (cm)')
from matplotlib import cm 
cmap = cm.get_cmap('Spectral')
df.plot.scatter(x='sepal length (cm)',
          y='sepal width (cm)', 
          s=df[['petal length (cm)']]*20, 
          c=df['target'],
          cmap=cmap,
          title='different circle size by petal length (cm)')

3. Histogram 、 bar chart

Histogram （Histogram Chart） Usually used in the same field , Present the distribution of continuous data , Another kind of graph similar to histogram is long bar graph （Bar Chart）, Used to view the same field , Pictured 8.6 Shown .

df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)','petal width (cm)']].plot.hist()
2 df.target.value_counts().plot.bar()

4. Pie chart 、 Box chart

Pie chart （Pie Chart） It can be used to view the proportion of each category in the same field , Box diagram （Box Chart） It is used to view the same field or compare the distribution differences of data in different fields , Pictured 8.7 Shown .

df.target.value_counts().plot.pie(legend=True)
df.boxplot(column=['target'],ﬁgsize=(10,5))

Data exploration and actual combat sharing

This section uses two real data sets to show several methods of data exploration .

One 、2013 American community survey

Community surveys in the United States （American Community Survey） in , About every year 350 Million families were asked detailed questions about who they are and how they live . The survey covered many topics , Including ancestors 、 education 、 Work 、 traffic 、 Internet use and residence .

Data sources ：https://www.kaggle.com/census/2013-american-community-survey.

Data name ：2013 American Community Survey.

First observe the appearance and characteristics of the data , And the meaning of each field 、 Type and scope .

#  Reading data
df = pd.read_csv("./ss13husa.csv")
#  Type and quantity of fields
df.shape
# (756065,231)
#  Field value range
df.describe()

The first two ss13pusa.csv String together , This data contains a total of 30 Ten thousand data ,3 Fields ：SCHL ( Education ,School Level)、 PINCP ( income ,Income) and ESR ( Working state ,Work Status).

pusa = pd.read_csv("ss13pusa.csv") pusb = pd.read_csv("ss13pusb.csv")
#  Connect two data in series
col = ['SCHL','PINCP','ESR']
df['ac_survey'] = pd.concat([pusa[col],pusb[col],axis=0)

Group the data according to educational background , Observe the number and proportion of different degrees , Then calculate their average income .

group = df['ac_survey'].groupby(by=['SCHL']) print(' Education distribution :' + group.size())
group = ac_survey.groupby(by=['SCHL']) print(' Average income :' +group.mean())

Two 、 Boston housing data set

Boston housing data set （Boston House Price Dataset） Contains information about houses in the Boston area , package 506 Data samples and 13 Characteristic dimensions .

Data sources ：https://archive.ics.uci.edu/ml/machine-learning-databases/housing/.

Data name ：Boston House Price Dataset.

First observe the appearance and characteristics of the data , And the meaning of each field 、 Type and scope .

You can use histogram to draw house prices （MEDV） The distribution of , Pictured 8.8 Shown .

df = pd.read_csv("./housing.data")
#  Type and quantity of fields
df.shape
# (506, 14)
# Field value range df.describe()
import matplotlib.pyplot as plt 
df[['MEDV']].plot.hist() 
plt.show()

notes ： The Chinese and English in the figure correspond to the name specified by the author in the code or data , In practice, readers can replace them with the words they need .

The next thing you need to know is which dimensions are related to “ housing price ” The relationship is obvious . Let's look at... In the form of a scatter diagram , Pictured 8.9 Shown .

# draw scatter chart 
df.plot.scatter(x='MEDV', y='RM') .
plt.show()

Last , Calculate the correlation coefficient and use the cluster heat map （Heatmap） For visual presentation , Pictured 8.10 Shown .

# compute pearson correlation 
corr = df.corr()
# draw  heatmap 
import seaborn as sns 
corr = df.corr() 
sns.heatmap(corr) 
plt.show()

The color is red , Indicates a positive relationship ; The color is blue , Indicates a negative relationship ; The color is white , It doesn't matter .RM The correlation with house prices tends to be red , It is a positive relationship ;LSTAT、PTRATIO The correlation with house prices tends to dark blue , Is a negative relationship ;CRIM、RAD、AGE The correlation with house prices tends to be white , It doesn't matter .

more Python See the following practice for technical knowledge PDF electronic text Download this