程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Pandas+matplotlib, in simple terms, python data analysis

編輯:Python

I am from 17 Year to learn Python And data analysis , Along the way , There is no particularly suitable book , I've seen a lot , But most of them are scattered , At the end of the article, a set of PythonPDF Book tutorial , Dear friends You can study hard !

Explore charts with visualization

One 、 Data visualization and exploration diagram

      Data visualization refers to the presentation of data in the form of graphics or tables . The chart can clearly show the nature of the data , And the relationship between data or attributes , It's easy for people to see the picture and interpret . Users explore the map (Exploratory Graph) You can understand the characteristics of data 、 Look for trends in data 、 Lower the threshold of data understanding .

Two 、 Common chart examples

      This chapter mainly adopts Pandas The way to draw , Instead of using Matplotlib modular . Actually Pandas Have the Matplotlib The drawing method is integrated into DataFrame in , So in practice , Users do not need to directly reference Matplotlib You can also complete the work of drawing .

1. Broken line diagram

      Broken line diagram (line chart) Is the most basic chart , It can be used to present the relationship between continuous data in different fields . The line chart is drawn using plot.line() Methods , You can set the color 、 Shape and other parameters . On the use , The drawing method of line drawing completely inherits Matplotlib Usage of , So the program must finally call plt.show() Production graph , Pictured 8.4 Shown .

df_iris[['sepal length (cm)']].plot.line() 
plt.show()
ax = df[['sepal length (cm)']].plot.line(color='green',title="Demo",style='--') 
ax.set(xlabel="index", ylabel="length")
plt.show()

2. Scatter map

      Scatter map (Scatter Chart) Used to view the relationship between discrete data in different fields . Scatter charts are drawn using df.plot.scatter(), Pictured 8.5 Shown .

df = df_iris
df.plot.scatter(x='sepal length (cm)', y='sepal width (cm)')
from matplotlib import cm 
cmap = cm.get_cmap('Spectral')
df.plot.scatter(x='sepal length (cm)',
          y='sepal width (cm)', 
          s=df[['petal length (cm)']]*20, 
          c=df['target'],
          cmap=cmap,
          title='different circle size by petal length (cm)')

3. Histogram 、 bar chart

      Histogram (Histogram Chart) Usually used in the same field , Present the distribution of continuous data , Another kind of graph similar to histogram is long bar graph (Bar Chart), Used to view the same field , Pictured 8.6 Shown .

df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)','petal width (cm)']].plot.hist()
2 df.target.value_counts().plot.bar()

4. Pie chart 、 Box chart

      Pie chart (Pie Chart) It can be used to view the proportion of each category in the same field , Box diagram (Box Chart) It is used to view the same field or compare the distribution differences of data in different fields , Pictured 8.7 Shown .

df.target.value_counts().plot.pie(legend=True)
df.boxplot(column=['target'],figsize=(10,5))

Data exploration and actual combat sharing

          

      This section uses two real data sets to show several methods of data exploration .

One 、2013 American community survey

      Community surveys in the United States (American Community Survey) in , About every year 350 Million families were asked detailed questions about who they are and how they live . The survey covered many topics , Including ancestors 、 education 、 Work 、 traffic 、 Internet use and residence .

       Data sources :https://www.kaggle.com/census/2013-american-community-survey.

        Data name :2013 American Community Survey.

        First observe the appearance and characteristics of the data , And the meaning of each field 、 Type and scope .

#  Reading data
df = pd.read_csv("./ss13husa.csv")
#  Type and quantity of fields
df.shape
# (756065,231)
#  Field value range
df.describe()

      The first two ss13pusa.csv String together , This data contains a total of 30 Ten thousand data ,3 Fields :SCHL ( Education ,School Level)、 PINCP ( income ,Income) and ESR ( Working state ,Work Status).

pusa = pd.read_csv("ss13pusa.csv") pusb = pd.read_csv("ss13pusb.csv")
#  Connect two data in series
col = ['SCHL','PINCP','ESR']
df['ac_survey'] = pd.concat([pusa[col],pusb[col],axis=0)

      Group the data according to educational background , Observe the number and proportion of different degrees , Then calculate their average income .

group = df['ac_survey'].groupby(by=['SCHL']) print(' Education distribution :' + group.size())
group = ac_survey.groupby(by=['SCHL']) print(' Average income :' +group.mean())

Two 、 Boston housing data set

      Boston housing data set (Boston House Price Dataset) Contains information about houses in the Boston area , package 506 Data samples and 13 Characteristic dimensions .

      Data sources :https://archive.ics.uci.edu/ml/machine-learning-databases/housing/.

        Data name :Boston House Price Dataset.

      First observe the appearance and characteristics of the data , And the meaning of each field 、 Type and scope .

      You can use histogram to draw house prices (MEDV) The distribution of , Pictured 8.8 Shown .

df = pd.read_csv("./housing.data")
#  Type and quantity of fields
df.shape
# (506, 14)
# Field value range df.describe()
import matplotlib.pyplot as plt 
df[['MEDV']].plot.hist() 
plt.show()

notes : The Chinese and English in the figure correspond to the name specified by the author in the code or data , In practice, readers can replace them with the words they need .

      The next thing you need to know is which dimensions are related to “ housing price ” The relationship is obvious . Let's look at... In the form of a scatter diagram , Pictured 8.9 Shown .

# draw scatter chart 
df.plot.scatter(x='MEDV', y='RM') .
plt.show()

      Last , Calculate the correlation coefficient and use the cluster heat map (Heatmap) For visual presentation , Pictured 8.10 Shown .

# compute pearson correlation 
corr = df.corr()
# draw  heatmap 
import seaborn as sns 
corr = df.corr() 
sns.heatmap(corr) 
plt.show()

      The color is red , Indicates a positive relationship ; The color is blue , Indicates a negative relationship ; The color is white , It doesn't matter .RM The correlation with house prices tends to be red , It is a positive relationship ;LSTAT、PTRATIO The correlation with house prices tends to dark blue , Is a negative relationship ;CRIM、RAD、AGE The correlation with house prices tends to be white , It doesn't matter .

more Python See the following practice for technical knowledge PDF electronic text Download this

Python actual combat PDF Download this


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved