Today we will talk about Pandas An advanced tutorial for , Include read and write files 、 Select subsets and graphical representation, etc .
A key step in data processing is to read files for analysis , Then write the analysis result to the file again .
Pandas Supports multiple file formats for reading and writing :
In [108]: pd.read_ read_clipboard() read_excel() read_fwf() read_hdf() read_json read_parquet read_sas read_sql_query read_stata read_csv read_feather() read_gbq() read_html read_msgpack read_pickle read_sql read_sql_table read_table
Next, we'll take Pandas Provided by the official website Titanic.csv For example to explain Pandas Use .
Titanic.csv Provides 800 Information about multiple passengers on Titan little , It's a 891 rows x 12 columns Matrix .
We use Pandas To read this csv:
In [5]: titanic=pd.read_csv("titanic.csv")
read_csv Methods will csv File conversion to pandas Of DataFrame
.
By default, we use DF Variable , Before the show 5 After the row and 5 Row data :
In [3]: titanic Out[3]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500 NaN S .. ... ... ... ... ... ... ... ... ... ... ... 886 887 0 2 Montvila, Rev. Juozas male ... 0 211536 13.0000 NaN S 887 888 1 1 Graham, Miss. Margaret Edith female ... 0 112053 30.0000 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female ... 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male ... 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick male ... 0 370376 7.7500 NaN Q [891 rows x 12 columns]
have access to head(n) and tail(n) To specify a specific number of rows :
In [4]: titanic.head(8) Out[4]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male ... 0 330877 8.4583 NaN Q 6 7 0 1 McCarthy, Mr. Timothy J male ... 0 17463 51.8625 E46 S 7 8 0 3 Palsson, Master. Gosta Leonard male ... 1 349909 21.0750 NaN S [8 rows x 12 columns]
Use dtypes You can see the data type of each column :
In [5]: titanic.dtypes Out[5]: PassengerId int64 Survived int64 Pclass int64 Name object Sex object Age float64 SibSp int64 Parch int64 Ticket object Fare float64 Cabin object Embarked object dtype: object
Use to_excel Can be DF Convert to excel file , Use read_excel Can read again excel file :
In [11]: titanic.to_excel('titanic.xlsx', sheet_name='passengers', index=False) In [12]: titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')
Use info() You can come to DF Make a preliminary statistic :
In [14]: titanic.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.6+ KB
DF Of head perhaps tail Method can only display all column data , The following method can select specific column data .
In [15]: ages = titanic["Age"] In [16]: ages.head() Out[16]: 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 Name: Age, dtype: float64
Each column is a Series:
In [6]: type(titanic["Age"]) Out[6]: pandas.core.series.Series In [7]: titanic["Age"].shape Out[7]: (891,)
You can choose more than one :
In [8]: age_sex = titanic[["Age", "Sex"]] In [9]: age_sex.head() Out[9]: Age Sex 0 22.0 male 1 38.0 female 2 26.0 female 3 35.0 female 4 35.0 male
If you choose multiple columns , The result returned is a DF type :
In [10]: type(titanic[["Age", "Sex"]]) Out[10]: pandas.core.frame.DataFrame In [11]: titanic[["Age", "Sex"]].shape Out[11]: (891, 2)
Above we talked about how to select column data , Now let's see how to select row data :
Choose a client older than 35 Year old :
In [12]: above_35 = titanic[titanic["Age"] > 35] In [13]: above_35.head() Out[13]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C 6 7 0 1 McCarthy, Mr. Timothy J male ... 0 17463 51.8625 E46 S 11 12 1 1 Bonnell, Miss. Elizabeth female ... 0 113783 26.5500 C103 S 13 14 0 3 Andersson, Mr. Anders Johan male ... 5 347082 31.2750 NaN S 15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female ... 0 248706 16.0000 NaN S [5 rows x 12 columns]
Use isin choice Pclass stay 2 and 3 All of our customers :
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])] In [17]: class_23.head() Out[17]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
above isin be equal to :
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
Screening Age It's not empty :
In [20]: age_no_na = titanic[titanic["Age"].notna()] In [21]: age_no_na.head() Out[21]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500 NaN S [5 rows x 12 columns]
We can select both rows and columns .
Use loc and iloc You can select rows and columns , The difference between them is loc It's choosing by name ,iloc It's using numbers to choose .
choice age>35 Name of passenger :
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"] In [24]: adult_names.head() Out[24]: 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 6 McCarthy, Mr. Timothy J 11 Bonnell, Miss. Elizabeth 13 Andersson, Mr. Anders Johan 15 Hewlett, Mrs. (Mary D Kingcome) Name: Name, dtype: object
loc The first value in represents the row selection , The second value represents the column selection .
Use iloc Make a selection :
In [25]: titanic.iloc[9:25, 2:5] Out[25]: Pclass Name Sex 9 2 Nasser, Mrs. Nicholas (Adele Achem) female 10 3 Sandstrom, Miss. Marguerite Rut female 11 1 Bonnell, Miss. Elizabeth female 12 3 Saundercock, Mr. William Henry male 13 3 Andersson, Mr. Anders Johan male .. ... ... ... 20 2 Fynney, Mr. Joseph J male 21 2 Beesley, Mr. Lawrence male 22 3 McGowan, Miss. Anna "Annie" female 23 1 Sloper, Mr. William Thompson male 24 3 Palsson, Miss. Torborg Danira female [16 rows x 3 columns]
How to DF Convert it into a variety of graphic display ?
To use it on the command line matplotlib Make a picture , So you need to start ipython Of QT Environmental Science :
ipython qtconsole --pylab=inline
Use it directly plot Let's show the passenger information we read above :
import matplotlib.pyplot as plt import pandas as pd titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers') titanic.plot()
The abscissa is DF Medium index, The column coordinates are the names of the columns . Note that the above column only shows numerical type .
We just show age Information :
titanic['Age'].plot()
The default is the histogram , We can change the form of graphics , Like the dot graph :
titanic.plot.scatter(x="PassengerId",y="Age", alpha=0.5)
Select... In the data PassengerId As x Axis ,age As y Axis :
Except for the scatter plot , Many other images are also supported :
[method_name for method_name in dir(titanic.plot) if not method_name.startswith("_")] Out[11]: ['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter']
Then look at one box chart :
titanic['Age'].plot.box()
You can see , Most of the passengers' ages are concentrated in 20-40 Between the ages of .
You can also draw and display the selected columns separately :
titanic.plot.area(figsize=(12, 4), subplots=True)
Specify a specific column :
titanic[['Age','Pclass']].plot.area(figsize=(12, 4), subplots=True)
You can also draw pictures first , Then fill in :
fig, axs = plt.subplots(figsize=(12, 4));
First draw an empty picture , And then fill it in :
titanic['Age'].plot.area(ax=axs); axs.set_ylabel("Age"); fig
occasionally , We need to transform the existing columns , To get a new column , For example, we want to add a Age2 Column , Its value is Age Column +10, You can do this :
titanic["Age2"]=titanic["Age"]+10; titanic[["Age","Age2"]].head() Out[34]: Age Age2 0 22.0 32.0 1 38.0 48.0 2 26.0 36.0 3 35.0 45.0 4 35.0 45.0
You can also rename Columns :
titanic_renamed = titanic.rename( ...: columns={"Age": "Age2", ...: "Pclass": "Pclas2"})
Convert column names to lowercase :
titanic_renamed = titanic_renamed.rename(columns=str.lower)
Let's count the average age of the passengers :
titanic["Age"].mean() Out[35]: 29.69911764705882
Choose the median :
titanic[["Age", "Fare"]].median() Out[36]: Age 28.0000 Fare 14.4542 dtype: float64
For more information :
titanic[["Age", "Fare"]].describe() Out[37]: Age Fare count 714.000000 891.000000 mean 29.699118 32.204208 std 14.526497 49.693429 min 0.420000 0.000000 25% 20.125000 7.910400 50% 28.000000 14.454200 75% 38.000000 31.000000 max 80.000000 512.329200
Use agg Specify a specific aggregation method :
titanic.agg({'Age': ['min', 'max', 'median', 'skew'],'Fare': ['min', 'max', 'median', 'mean']}) Out[38]: Age Fare max 80.000000 512.329200 mean NaN 32.204208 median 28.000000 14.454200 min 0.420000 0.000000 skew 0.389108 NaN
have access to groupby:
titanic[["Sex", "Age"]].groupby("Sex").mean() Out[39]: Age Sex female 27.915709 male 30.726645
groupby All columns :
titanic.groupby("Sex").mean() Out[40]: PassengerId Survived Pclass Age SibSp Parch Sex female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702
groupby You can then select specific columns :
titanic.groupby("Sex")["Age"].mean() Out[41]: Sex female 27.915709 male 30.726645 Name: Age, dtype: float64
It can be classified count:
titanic["Pclass"].value_counts() Out[42]: 3 491 1 216 2 184 Name: Pclass, dtype: int64
It's equivalent to :
titanic.groupby("Pclass")["Pclass"].count()
You can sort by a column :
titanic.sort_values(by="Age").head() Out[43]: PassengerId Survived Pclass Name Sex \ 803 804 1 3 Thomas, Master. Assad Alexander male 755 756 1 2 Hamalainen, Master. Viljo male 644 645 1 3 Baclini, Miss. Eugenie female 469 470 1 3 Baclini, Miss. Helene Barbara female 78 79 1 2 Caldwell, Master. Alden Gates male
Sort by multiple columns :
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head() Out[44]: PassengerId Survived Pclass Name Sex Age \ 851 852 0 3 Svensson, Mr. Johan male 74.0 116 117 0 3 Connors, Mr. Patrick male 70.5 280 281 0 3 Duane, Mr. Frank male 65.0 483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0
Select specific row and column data , In the following example, we will select some data whose gender is female :
female=titanic[titanic['Sex']=='female'] female_subset=female[["Age","Pclass","PassengerId","Survived"]].sort_values(["Pclass"]).groupby(["Pclass"]).head(2) female_subset Out[58]: Age Pclass PassengerId Survived 1 38.0 1 2 1 356 22.0 1 357 1 726 30.0 2 727 1 443 28.0 2 444 1 855 18.0 3 856 1 654 18.0 3 655 0
Use pivot You can do the axis conversion :
female_subset.pivot(columns="Pclass", values="Age") Out[62]: Pclass 1 2 3 1 38.0 NaN NaN 356 22.0 NaN NaN 443 NaN 28.0 NaN 654 NaN NaN 18.0 726 NaN 30.0 NaN 855 NaN NaN 18.0 female_subset.pivot(columns="Pclass", values="Age").plot()
This article has been included in http://www.flydean.com/02-python-pandas-advanced/ The most popular interpretation , The deepest dry goods , The most concise tutorial , There are so many tricks you don't know about waiting for you to discover ! Welcome to my official account. :「 Program those things 」, Know technology , Know you better !