
Descriptive statistics of data using Python (machine learning)


Descriptive statistics of data with Python

Dataset: diabetes.csv
Reference book: 《Machine Learning Mastery With Python Understand Your Data, Create Accurate Models and work Projects End-to-End》
Code link: https://github.com/aoyinke/ML_learner

Additional Knowledge

When two variables are related, covariance is used to evaluate how strongly they vary together.
When multiple variables are independent, the variance of each variable is enough to describe how much it varies.
When multiple variables are related, covariance is needed as well, to describe the part of the variation they share.
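
One way to see why covariance matters for related variables is the identity Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). The sketch below checks it numerically on synthetic data; the arrays `x` and `y_related` and the helper `var_of_sum` are made up for illustration and are not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.normal(size=1000)
# Hypothetical second variable, deliberately built to depend on x
y_related = 0.8 * x + rng.normal(scale=0.5, size=1000)


def var_of_sum(a, b):
    """Return Var(a + b), Var(a) + Var(b) and Cov(a, b), all with ddof=0."""
    cov_ab = np.cov(a, b, bias=True)[0, 1]  # bias=True matches np.var's default
    return np.var(a + b), np.var(a) + np.var(b), cov_ab


total, parts, cov_xy = var_of_sum(x, y_related)
print(total, parts + 2 * cov_xy)  # the two numbers agree
print(cov_xy)                     # positive, because y_related follows x
```

If the two variables were independent, the covariance term would be close to zero and the individual variances alone would describe the spread of the sum.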

Overview

Some common indicators, such as the dimensions of the data and its first few rows
Pearson correlation coefficient and skewness, for looking at several variables and a single variable respectively
Histograms, density plots, and box plots, with code demonstrations and explanations
Multivariate visualization

Some common indicators

from pandas import read_csv
path = "diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path,names=names,skiprows=1)
# Look at the first 5 rows of the data
print(data.head())
# Observe the dimensions of the data
print(data.shape)
"""
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
(768, 9)  -> 768 rows, 9 columns
"""
# Look at the data type of each column
print(data.dtypes)
"""
preg int64
plas int64
pres int64
skin int64
test int64
mass float64
pedi float64
age int64
class int64
"""

Descriptive statistics with pandas

from pandas import set_option
set_option('display.width', 100)
set_option('display.precision', 3)  # the bare 'precision' key is not accepted by recent pandas versions
description = data.describe()
print(description)

count: the number of non-null values for the attribute
mean, max, min: the average, maximum, and minimum of all values of the attribute
std: the standard deviation of the observations
Note that different data types get different statistics. For example, for object-type data, describe() returns count, unique, top, and freq.
See the official documentation: pandas.DataFrame.describe
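
A small sketch of that difference between numeric and object-type data. The frame `toy` and its `label` column are hypothetical and are not taken from diabetes.csv.

```python
import pandas as pd

# Hypothetical toy frame, used only to show which statistics describe() returns
toy = pd.DataFrame({
    "age": [50, 31, 32, 21],
    "label": ["pos", "neg", "pos", "neg"],
})

numeric_stats = toy["age"].describe()   # numeric column
label_stats = toy["label"].describe()   # text column

print(list(numeric_stats.index))
print(list(label_stats.index))
```

The numeric column reports count, mean, std, min, the quartiles and max, while the text column reports count, unique, top, and freq.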

Class Distribution (classification problems only)

class_counts = data.groupby('class').size()
print(class_counts)
"""
class
0 500
1 268
"""

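Raw counts are often easier to judge as proportions. The sketch below reuses the two class counts printed above (500 and 268) and turns them into shares. The variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the groupby output above
class_counts = pd.Series({0: 500, 1: 268}, name="count")

class_share = class_counts / class_counts.sum()            # fraction of rows per class
imbalance_ratio = class_counts.max() / class_counts.min()  # majority / minority

print(class_share.round(3))
print(round(imbalance_ratio, 2))
```

Roughly 65% of the rows belong to class 0, so the classes are imbalanced but not extremely so.
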
Correlation Between Attributes

For machine learning algorithms such as linear regression and logistic regression, highly correlated attributes can lead to worse performance.
Pearson's Correlation Coefficient (the Pearson product-moment correlation coefficient) is often used to measure the correlation between attributes. It assumes that the attributes involved are normally distributed.
The Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations.
The coefficient lies between -1 and 1. 0 means no linear correlation, positive values indicate positive correlation, and negative values indicate negative correlation.
As a simple example, we would expect the Pearson correlation coefficient between age and height in a sample of high-school adolescents to be clearly greater than 0 but less than 1 (because 1 would mean an unrealistically perfect correlation).

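from pandas import set_option, read_csv
path = "diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names, skiprows=1)  # skip the header row, as in the first snippet
set_option('display.width', 100)
set_option('display.precision', 3)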
correlations = data.corr(method='pearson')
print(correlations)
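
As a rough check of the definition (covariance divided by the product of the two standard deviations), the coefficient can be computed by hand and compared with pandas. The age and height values and the helper `pearson_r` are invented for illustration, loosely following the adolescent example.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (years) and heights (cm); not real measurements
toy = pd.DataFrame({
    "age":    [14, 15, 15, 16, 17, 18],
    "height": [158, 163, 160, 168, 171, 170],
})


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return cov_xy / (x.std() * y.std())                # np.std defaults to ddof=0


r_manual = pearson_r(toy["age"], toy["height"])
r_pandas = toy["age"].corr(toy["height"], method="pearson")
print(r_manual, r_pandas)  # both values agree
```

The result is positive but below 1, which matches the intuition in the example.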

Skew of Univariate Distributions

Skewness is defined as

$$S_k = \frac{E\left[(X - \mu)^3\right]}{\sigma^3} = \frac{\mu_3}{\sigma^3}$$

where $S_k$ is the skewness, $E$ the expectation, $\mu$ the mean, $\mu_3$ the third central moment, and $\sigma$ the standard deviation. In general, when the data are right-skewed, $S_k > 0$, and the larger $S_k$ is, the stronger the right skew.
When the data are left-skewed, $S_k < 0$, and the smaller $S_k$ is, the stronger the left skew. When the data are symmetrically distributed, $S_k = 0$.

So we should pay particular attention to variables whose skew is large in absolute value.

skew = data.skew()
print(skew)
"""
preg 0.901674
plas 0.173754
pres -1.843608
skin 0.109372
test 2.272251
mass -0.428982
pedi 1.919911
age 1.129597
class 0.635017
"""

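The skewness definition Sk = μ3 / σ³ can be sketched directly with NumPy. pandas' `skew()` applies a small-sample correction factor, sqrt(n(n-1)) / (n-2), on top of this moment formula, so the two numbers differ slightly on small samples. The sample values and the helper `moment_skew` are hypothetical.

```python
import numpy as np
import pandas as pd


def moment_skew(values):
    """Third central moment divided by the cubed population standard deviation."""
    x = np.asarray(values, dtype=float)
    mu = x.mean()
    sigma = x.std()                # population standard deviation (ddof=0)
    mu3 = np.mean((x - mu) ** 3)   # third central moment
    return mu3 / sigma ** 3


sample = pd.Series([1, 1, 2, 2, 3, 4, 6, 20])  # hypothetical data with a long right tail
n = len(sample)

sk = moment_skew(sample)
sk_adjusted = sk * np.sqrt(n * (n - 1)) / (n - 2)  # sample-size correction

print(sk, sk_adjusted, sample.skew())  # the last two values agree
```

The long right tail gives a positive value, in line with the right-skew case described above.
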
Univariate Plots (visualizing one variable at a time)

Histograms

# Univariate Histograms
import matplotlib.pyplot as plt
from pandas import read_csv
path = "diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path , names=names,skiprows=1)
data.hist()
plt.show()

age, pedi and test appear to follow an exponential distribution.
mass, pres and plas follow, or approximately follow, a Gaussian (normal) distribution.
Many machine learning algorithms assume that the input is normally distributed, but we can see that this is not the case here (further processing such as standardization is needed).
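
A minimal sketch of standardization (z-scores), using the plas and mass values from the first five rows printed earlier. Note that standardization only shifts and rescales each column to mean 0 and standard deviation 1. It does not change the shape of the distribution, so the skew stays the same.

```python
import pandas as pd

# plas and mass values taken from the data.head() output shown earlier
toy = pd.DataFrame({
    "plas": [148, 85, 183, 89, 137],
    "mass": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# z-score: subtract the column mean, divide by the column standard deviation
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(3))
print(standardized.mean().round(3))       # approximately 0 for every column
print(standardized.std(ddof=0).round(3))  # 1 for every column
```

Using `ddof=0` here is a choice made for the sketch. Dividing by the sample standard deviation (`ddof=1`) is equally common.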

Density Plots

Density plots are another way to quickly understand the distribution of each attribute.

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

Box and Whisker Plots

Median (Q2 / 50th percentile): the middle value of the dataset;
First quartile (Q1 / 25th percentile): the middle value between the smallest number (not the "minimum") and the median of the dataset;
Third quartile (Q3 / 75th percentile): the middle value between the median and the largest number (not the "maximum") of the dataset;
Interquartile range (IQR): the distance from the 25th to the 75th percentile;
Whiskers (shown in blue)
Outliers (shown as green circles)
"Maximum": Q3 + 1.5 * IQR
"Minimum": Q1 - 1.5 * IQR
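
The quantities listed above can be computed with NumPy. The data values are hypothetical. `np.percentile` uses linear interpolation by default, so its quartiles can differ slightly from the "middle value of each half" description.

```python
import numpy as np

# Hypothetical data with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # the "minimum" of the box plot
upper_fence = q3 + 1.5 * iqr   # the "maximum" of the box plot

# Points outside the fences are the ones drawn as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q2, q3, iqr)           # 3.25 5.5 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers)                  # [100]
```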

Summary:

Box plots are meant for continuous variables. When reading one, focus on the central level, the variability, and the outliers.
When the box is squashed flat, or there are many outliers, try a logarithmic transformation.
When there is only one continuous variable, a box plot is not a good fit. A histogram is the more common choice.
The most effective use of a box plot is comparison: combine it with one or more qualitative variables and draw grouped box plots.
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()
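
A rough illustration of the logarithmic transformation suggested in the summary, on synthetic right-skewed data rather than the diabetes dataset. `np.log1p` (log of 1 + x) is used here as an assumption, because it also accepts zeros, which occur in columns such as test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
# Synthetic, strongly right-skewed values (log-normal); purely illustrative
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

logged = np.log1p(raw)  # log(1 + x), defined for x = 0 as well

skew_before = raw.skew()
skew_after = logged.skew()
print(skew_before, skew_after)  # the skew drops after the transformation
```

A box plot of `logged` would show a less flattened box and fewer points beyond the whiskers than a box plot of `raw`.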

Multivariate Plots

Correlation Matrix Plot (Pearson correlation between variables)

import matplotlib.pyplot as plt
import numpy as np
from pandas import read_csv
path = "diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names, skiprows=1)  # skip the header row, as in the first snippet
correlations = data.corr(method='pearson') # The Pearson correlation coefficient is obtained
# plot correlation matrix
fig = plt.figure() # Create a figure (the canvas)
ax = fig.add_subplot(1,1,1) # Add a single subplot (1 row, 1 column, index 1)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax) # Add the color bar (the vertical bar on the right) to the figure
ticks = np.arange(0,9,1)
# ticks = [0 1 2 3 4 5 6 7 8]: a NumPy array from 0 to 8 with step 1
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names) # Label the ticks with attribute names; the default labels are numeric indices
ax.set_yticklabels(names)
plt.show()

Scatter Plot Matrix

import matplotlib.pyplot as plt
from pandas import read_csv
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed from pandas
path = "diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names, skiprows=1)  # skip the header row, as in the first snippet
scatter_matrix(data)
plt.show()

Summary:

The diagonal shows a histogram of each attribute.
Scatter plots are useful for spotting structured relationships between variables, for example whether a straight line can summarize the relationship between two variables. Attributes with a structured relationship are likely to be correlated and are candidates for removal from the dataset.
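
One possible way to list such candidates numerically, instead of reading them off the plot, is to scan the correlation matrix for pairs above a cutoff. The helper `correlated_pairs`, the 0.9 threshold and the synthetic columns are all assumptions made for this sketch.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr(method="pearson")
    return [
        (left, right, corr.loc[left, right])
        for left, right in combinations(corr.columns, 2)
        if abs(corr.loc[left, right]) > threshold
    ]


rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
a = rng.normal(size=200)
# Synthetic frame: b is almost a straight-line function of a, c is unrelated
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(toy)
print(pairs)  # only the (a, b) pair is reported
```

Whether one attribute of such a pair should actually be dropped depends on the model, so the threshold is best treated as a starting point.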