The present python It is widely used in data processing and quantitative analysis data mining , This article will focus on python in pandas Under bag DataFrame The use of this data type is described in detail , And its equivalent in data processing SQL The contrast in , Don't talk much , On and off !!!
https://blog.csdn.net/weixin_39818658/article/details/79944925
keyword :pandas、DataFrame、SQL
1、 About filtering data table columns
Yes DataFrame( That is, a two-dimensional data table )data:
Use
data=DataFrame(np.arange(9).reshape(3,3),columns=[‘a’,‘b’,‘c’])
Statement to generate the DataFrame
data :
a b c
0 0 1 2
1 3 4 5
2 6 7 8
If we only take data Of a Column Use a slicing operation directly ( notes : The selection slice operation on the column is a two-layer bracket ):
data[[‘a’]]
--------->
0 0
1 3
2 6
Corresponding SQL Statement for :
select a from data
If we want to take a and b The two fields are also slicing operations :
data[[‘a’,‘b’]]
------>
a b
0 0 1
1 3 4
2 6 7
Corresponding SQL Statement for :
select a,b from data
First, let's create two here DataFrame:
Basic information data sheet of the company
companydata=DataFrame({'ID':['aa','bb','cc'],
'name':[' company A',' company B',' company C'],
'establish_year':['2001','2002','2003'],
'city':['ShangHai','BeiJing','ShenZhen']})
----->
ID city establish_year name
0 aa ShangHai 2001 company A
1 bb BeiJing 2002 company B
2 cc ShenZhen 2003 company C
List of products issued by the company :
productdata=DataFrame({'ID':['aa','aa','bb','bb','cc'],
'productname':['stock_A','bond_A','stock_B','bond_B','stock_C']
})
--------->
ID productname
0 aa stock_A
1 aa bond_A
2 bb stock_B
3 bb bond_B
4 cc stock_C
2、 Index of data 、 selection 、 Filter
(1) Single condition screening
First select ’ company A’ All the information about , The operation is as follows :
companydata[companydata[‘name’]==‘ company A’]
---------------------------------------------------------------------------->
ID city establish_year name
0 aa ShangHai 2001 company A
amount to SQL:
select * from companydata where name=’ company A’
(2) Multi criteria screening ( notes : The multi criteria filter criteria should be written in brackets )
selection ’ company B’ also ’city’ by ’BeiJing’, The operation is as follows :
companydata[(companydata[‘name’]‘ company B’)&(companydata[‘city’]‘BeiJing’)]
-------->
ID city establish_year name
1 bb BeiJing 2002 company B
amount to SQL:
select * from companydata where name=’ company A’ and city=’BeiJing’
selection ’ID’ by ’aa’ perhaps ’cc’ All the information about , The operation is as follows :
companydata[(companydata[‘ID’]‘aa’)|(companydata[‘ID’]‘cc’)]
------>
ID city establish_year name
0 aa ShangHai 2001 company A
2 cc ShenZhen 2003 company C
amount to SQL:
select * from companydata where ID=’aa’ or ID=’cc’
perhaps
companydata[companydata[‘ID’].isin([‘aa’,‘cc’])]
------->
ID city establish_year name
0 aa ShangHai 2001 company A
2 cc ShenZhen 2003 company C
amount to SQL:
select * from companydata where ID in (‘aa’,’cc’)
(3) Sometimes we don't need all the fields after filtering, but only one or several fields ,
for example :
View company A also ’ID’ by ’aa’ Of ’ Year of establishment ’ and ’ city ’
We need to apply it to DataFrame An index method of ix, The operation is as follows :
companydata.ix[(companydata[‘name’]‘ company A’)&(companydata[‘ID’]‘aa’),[‘city’,‘establish_year’]]
---------->
city establish_year
0 ShangHai 2001
amount to SQL:
select city,establish_year from companydata where ID=’aa’ and name=’ company A’
3、 Label according to different data , That is to say SQL Medium case when … then … operation
for example : When the city where the company is located ’city’ ,’ShangHai’ When is ’SH’,’BeiJing’ When is ’BJ’,’ShenZhen’ When is ’SZ’, The operation is as follows :
fuc=lambda x:'SH' if x=='ShangHai' else 'BJ' if x=='BeiJing' else 'SZ' if x=='ShenZhen' else np.NaN
companydata['citylabel']=companydata['city'].apply(fuc)
print(companydata)
( I can only think of such a way of writing about this , If there is a clever way , Please grant me your advice )
------->
ID city establish_year name citylabel
0 aa ShangHai 2001 company A SH
1 bb BeiJing 2002 company B BJ
2 cc ShenZhen 2003 company C SZ
amount to SQL:
select
ID,
city,
establish_year,
name,
case ‘city’ when ‘'ShangHai' ’ then ‘SH’
when 'BeiJing' then ‘BJ’
when ‘'ShenZhen' ’ then ‘SZ’ else null
end as citylabel
from companydata
4、 Aggregation of data
example :
surface productdata Yes ID Grouping , And calculate each ID There are several financial products
productdata[productname].groupby(productdata[‘ID’]).count()
------>
ID
aa 2
bb 2
cc 1
Name: ID, dtype: int64
amount to SQL:
select count(productname) from productdata group by ID
productdata[productname].groupby([productdata[‘ID’],productdata[‘productname’]]).count()
----------->
ID productname
aa bond_A 1
stock_A 1
bb bond_B 1
stock_B 1
cc stock_C 1
Name: ID, dtype: int64
amount to SQL:
select count(productname) from productdata group by ID,‘productname’
5、 Multi table association operation
In general, data processing is not a single table operation , Generally, multiple tables are associated .
example :
Make statistics on what financial products each company has , The operation is as follows :
pd.merge(companydata,productdata,left_on=‘ID’,right_on=‘ID’,how=‘left’)[[‘ID’,‘name’,‘productname’]]
------>
ID name productname
0 aa company A stock_A
1 aa company A bond_A
2 bb company B stock_B
3 bb company B bond_B
4 cc company C stock_C
amount to SQL:
Select A.ID,A.name,B.productname from companydata A left join productdata on A.ID=B.ID
Here is a brief description of the method merge Parameters of
(1) The first two parameters that must be passed in are two DataFrame , I won't elaborate here
(2)left_on Is the left table as the associated foreign key
(3)right_on Is the right table as the associated foreign key
(4)how Is the way of association , There were left\right\inner\outer, Corresponding to SQL Medium left\right\inner\full Four kinds of join The way