Grouping operations are sometimes called "split-apply-combine" operations. The "split" step is what the obj.groupby() method implements.
.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)
The method acts on an axis and accepts a grouping key (by) parameter used to group the caller. The grouping key can be a Series or a list whose length matches the axis being grouped; it can also be a mapping function, a dict, or even a column name (a string). These parameter types are just shortcuts: ultimately each of them is used to produce an array of values for splitting the object.
lang:python
>>> df = DataFrame({'key1':['a','a','b','b','a'],
...                 'key2':['one','two','one','two','one'],
...                 'data1':np.random.randn(5),
...                 'data2':np.random.randn(5)})
>>> df
data1 data2 key1 key2
0 0.922269 0.110285 a one
1 -0.181773 1.022435 a two
2 0.635899 0.279316 b one
3 0.527926 0.482807 b two
4 -1.586040 -1.312042 a one
[5 rows x 4 columns]
>>> grouped = df.groupby(df['key1'])
>>> grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000005BC25F8>
Here df['key1'] is used as the grouping key, splitting the rows into groups a and b. The actual grouping key does not need to be related to the object being grouped at all; it only has to be of the same length. Using one of the object's own columns is just a common convenience. In the example above, a list such as [1,1,2,2,1] used as the grouping key would give the same result as df['key1'].
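To make that claim concrete, here is a minimal sketch (with made-up data) showing that an external list of the same length produces the same grouping as the column itself:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# An external list of the same length as the axis works as a grouping key.
by_list = df['data1'].groupby([1, 1, 2, 2, 1]).mean()
# Grouping by the column puts the same rows together, only the group
# labels differ (1/2 instead of 'a'/'b').
by_col = df['data1'].groupby(df['key1']).mean()
```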
The DataFrameGroupBy object returned by groupby does not actually contain any data; it only records the intermediate information derived from the grouping key df['key1']. When you apply a function or another aggregation operation to the grouped data, pandas then performs a fast block-wise operation on df according to the groupby object and returns the result.
In other words: the groupby call itself does no computation, which is why it is fast. When working with this grouped object, though, it is convenient to think of it as holding the actual data. For example, we can apply many methods to it directly, or index and slice it:
lang:python
>>> grouped.mean()
data1 data2
key1
a -0.281848 -0.059774
b 0.581912 0.381061
[2 rows x 2 columns]
The example above does not show the key2 column, because its values are not numeric and the mean() method ignores it automatically. When you only want to look at one (or a few) of the columns, you can index, either before or after the groupby call (this is a kind of syntactic sugar):
lang:python
>>> df['data1'].groupby(df['key1']).mean()
key1
a -0.281848
b 0.581912
dtype: float64
>>> df.groupby(df['key2'])['data2'].mean()
key2
one -0.307481
two 0.752621
Name: data2, dtype: float64
If multiple arrays are used as the grouping key, the result has a hierarchical index:
lang:python
>>> df.groupby([df['key1'],df['key2']]).mean()
data1 data2
key1 key2
a one -0.331885 -0.600879
two -0.181773 1.022435
b one 0.635899 0.279316
two 0.527926 0.482807
[4 rows x 2 columns]
Finally, the .size() method of a GroupBy object (whether DataFrameGroupBy or SeriesGroupBy) shows the size of each group:
lang:python
>>> grouped.size()
key1
a 3
b 2
dtype: int64
<br /> ### Iterate over groups A GroupBy object supports iteration with a for loop, yielding 2-tuples of the group name and the data within that group. Here is a case with multiple grouping keys:
lang:python
>>> for i,j in df.groupby([df['key1'],df['key2']]):
...     print(i)
...     print('-----------')
...     print(j)
('a', 'one')
-----------
data1 data2 key1 key2
0 0.922269 0.110285 a one
4 -1.586040 -1.312042 a one
[2 rows x 4 columns]
('a', 'two')
-----------
data1 data2 key1 key2
1 -0.181773 1.022435 a two
[1 rows x 4 columns]
('b', 'one')
-----------
data1 data2 key1 key2
2 0.635899 0.279316 b one
[1 rows x 4 columns]
('b', 'two')
-----------
data1 data2 key1 key2
3 0.527926 0.482807 b two
[1 rows x 4 columns]
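Because iteration yields (name, group) tuples, one handy trick (sketched here with toy data) is to materialize the pieces into a dict of DataFrames in a single line:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'data1': [1, 2, 3]})

# dict(list(...)) consumes the (name, group) tuples produced by iteration,
# giving a mapping from group name to that group's sub-DataFrame.
pieces = dict(list(df.groupby('key1')))
```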
<br /> ### Use string column names as grouping keys As mentioned earlier, a **column name as a string** can be used as the grouping key, although none of the examples above do so. The method is convenient but has a pitfall: when you use it, the caller must be the DataFrame object itself, not an indexed view of the DataFrame. That is, `df.groupby('key1')['data1']` is OK, but `df['data1'].groupby('key1')` raises an error. Keep the distinction in mind when using it. <br /> ### Use a dict or Series as the grouping key These two parameter types provide a mapping from row (or column) names to group names. (Recall the description of a Series as a fixed-length, ordered dict.)
lang:python
>>> df.groupby({0:'a',1:'a',2:'b',3:'b',4:'a'}).mean()
data1 data2
a -0.281848 -0.059774
b 0.581912 0.381061
[2 rows x 2 columns]
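The dict example also works with a Series in its place, consistent with the fixed-length-ordered-dict view; a minimal sketch with explicit data:

```python
import pandas as pd

df = pd.DataFrame({'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [5.0, 4.0, 3.0, 2.0, 1.0]})

# The Series index (0..4) is matched against df's row index,
# and the Series values become the group names.
mapping = pd.Series(['a', 'a', 'b', 'b', 'a'])
result = df.groupby(mapping).mean()
```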
<br /> ### Grouping by function A function behaves somewhat like a dict here; or rather, all of these exotic grouping keys resemble a dict, in that the axis to be grouped is transformed, through some mapping, into an equal-length sequence of group names.
Where a dict looks up the row or column names as keys to get the group names, a function receives the row or column names as its argument. This is the kind of function you need to provide:
lang:python
>>> df.groupby(lambda x:'even' if x%2==0 else 'odd').mean()
data1 data2
even -0.009290 -0.307481
odd 0.173076 0.752621
[2 rows x 2 columns]
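As another sketch of function keys: with a string index, a built-in such as len can serve directly as the grouping key, since each index label is passed to the function in turn:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['ant', 'bee', 'cat', 'bird'])

# len() is called on each index label; labels of equal length share a group.
result = s.groupby(len).sum()
```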
<br /> ### Group by index level When grouping by a level of a hierarchical index, the parameter is no longer `by=None` but `level=None`, whose value can be the number or the name of the index level:
lang:python
>>> index = pd.MultiIndex.from_arrays([['even','odd','even','odd','even'],
...                                    [0,1,2,3,4]],names=['a','b'])
>>> df.index = index
>>> df.groupby(level='a').mean()
data1 data2
a
even -0.009290 -0.307481
odd 0.173076 0.752621
[2 rows x 2 columns]
>>> df.groupby(level=0).mean()
data1 data2
a
even -0.009290 -0.307481
odd 0.173076 0.752621
[2 rows x 2 columns]
<br /> # Data aggregation (Aggregation) --- Data aggregation refers to any data transformation that produces scalar values from arrays. You can also simply think of it as statistical computation, such as mean(), sum(), max(), and so on.
Aggregation itself is not directly tied to grouping; it can be performed on any column (or row), or on all of them. But when the operation is applied to grouped data, the results tend to become more meaningful.
The aggregation operations that can be applied to a GroupBy object include:
GroupBy.aggregate(), or its alias GroupBy.agg(). A custom function passed to it should take an array as its argument, i.e. the second element of the tuples yielded when iterating a GroupBy object. For example:
lang:python
>>> df.groupby('key1')['data1','data2'].agg(lambda arr:arr.max()-arr.min())
data1 data2
key1
a 2.508309 2.334477
b 0.107973 0.203492
[2 rows x 2 columns]
In practice, though, custom functions are quite slow; they cannot match the optimized built-in methods of GroupBy objects, which include: <br />
<table > <tr> <td>Method</td> <td>Description</td> </tr> <tr> <td>count</td> <td>Number of non-NA values in the group</td> </tr> <tr> <td>sum</td> <td>Sum of non-NA values</td> </tr> <tr> <td>mean</td> <td>Mean of non-NA values</td> </tr> <tr> <td>median</td> <td>Arithmetic median of non-NA values</td> </tr> <tr> <td>std, var</td> <td>Unbiased (n-1 denominator) standard deviation and variance</td> </tr> <tr> <td>min, max</td> <td>Minimum and maximum of non-NA values</td> </tr> <tr> <td>prod</td> <td>Product of non-NA values</td> </tr> <tr> <td>first, last</td> <td>First and last non-NA values</td> </tr> </table> <br /> ### Column-oriented multi-function application In the examples so far we only called one aggregation method at a time. For applying multiple functions, there are two cases to discuss:
The first is applying multiple functions to the same column to get multiple results; simply pass agg() a list of functions:
lang:python
>>> df.groupby('key1')['data1','data2'].agg(['min','max'])
data1 data2
min max min max
key1
a -1.586040 0.922269 -1.312042 1.022435
b 0.527926 0.635899 0.279316 0.482807
[2 rows x 4 columns]
One trick here: for the statistical methods in the table above, the method name can be passed to agg() as a string. Also, if you do not like how the result columns are named, or if you used anonymous lambda functions, you can replace a function argument with a (name, function) tuple; the columns in the result will then be named after name instead of after the function.
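A minimal sketch of the (name, function) tuple form, with made-up data, showing that the result columns take the given names:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, 3.0, 5.0]})

# Each (name, function) tuple names one result column.
result = df.groupby('key')['val'].agg([('smallest', 'min'), ('largest', 'max')])
```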
The second is applying different functions to different columns; this requires passing agg() a dict mapping column names to functions:
lang:python
>>> df.groupby('key1').agg({'data1':'min','data2':'max'})
data1 data2
key1
a -1.586040 1.022435
b 0.527926 0.482807
[2 rows x 2 columns]
Aggregation is just one kind of grouping operation. More kinds can be carried out through the .transform() and apply() methods. <br /> ### transform In the aggregation operations above, the result is indexed by the group names. If we want the original index instead, a merge would be needed to map the results back. transform(func, *args, **kwargs) simplifies that process: it applies func to every group and then places the results at the original array's index (if the result is a scalar, it is broadcast):
lang:python
>>> df
data1 data2 key1 key2
a b
even 0 0.922269 0.110285 a one
odd 1 -0.181773 1.022435 a two
even 2 0.635899 0.279316 b one
odd 3 0.527926 0.482807 b two
even 4 -1.586040 -1.312042 a one
[5 rows x 4 columns]
>>> df.groupby('key1').transform('mean')
data1 data2
a b
even 0 -0.281848 -0.059774
odd 1 -0.281848 -0.059774
even 2 0.581912 0.381061
odd 3 0.581912 0.381061
even 4 -0.281848 -0.059774
[5 rows x 2 columns]
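A classic use of transform (sketched here with toy numbers) is demeaning within groups: subtracting each group's mean while keeping the original index:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1.0, 3.0, 2.0, 6.0]})

# transform('mean') broadcasts each group mean back to the original index,
# so the subtraction aligns row by row.
demeaned = df['data1'] - df.groupby('key1')['data1'].transform('mean')
```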
<br /> ### apply `apply(func, *args, **kwargs)` splits the object being processed into pieces, invokes the passed function on each piece, and finally tries to glue the results back together with `pd.concat()`. The return value of func can be a pandas object or a scalar, and the size of any returned array object is unrestricted.
lang:python
>>> df
data1 data2 key1 key2
0 0.721150 -0.359337 a one
1 -1.727197 1.539508 a two
2 -0.339751 0.171379 b one
3 -0.291888 -1.000769 b two
4 -0.127029 0.506162 a one
[5 rows x 4 columns]
>>> def foo(df,n=12):
...     return pd.DataFrame(np.arange(n).reshape(3,4))
>>> df.groupby('key1').apply(foo)
0 1 2 3
key1
a 0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
b 0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
[6 rows x 4 columns]
This is a meaningless example, since the func passed to apply does nothing with df and simply returns a (3,4) array. But precisely such a meaningless example illustrates the generality of apply: you can return any result at all. Much of the time, the only thing limiting apply is the user's imagination. <br /> ### Pivot tables and crosstabs DataFrame objects have a .pivot_table(data, values=None, rows=None, cols=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)
method for building pivot tables; pd.pivot_table() also exists as a top-level function.
An example:
lang:python
>>> df
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
>>> table = pivot_table(df, values='D', rows=['A', 'B'],
... cols=['C'], aggfunc=np.sum)
>>> table
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
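Note that the rows=/cols= parameters shown above belong to older pandas releases; in current versions they are called index= and columns=. A runnable sketch with a smaller made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'two', 'one', 'two'],
                   'D': [1, 2, 3, 4]})

# index=/columns= replace the older rows=/cols= parameters.
table = pd.pivot_table(df, values='D', index='A', columns='B', aggfunc='sum')
```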
<br /> A crosstab (cross-tabulation) is a special pivot table used to compute group frequencies.
crosstab(rows, cols, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
lang:python
>>> a
array([foo, foo, foo, foo, bar, bar,
bar, bar, foo, foo, foo], dtype=object)
>>> b
array([one, one, one, two, one, one,
one, two, two, two, one], dtype=object)
>>> c
array([dull, dull, shiny, dull, dull, shiny,
shiny, dull, shiny, shiny, shiny], dtype=object)
>>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b one two
c dull shiny dull shiny
a
bar 1 2 1 0
foo 2 2 1 2
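One more sketch, assuming a current pandas version: margins=True adds row and column subtotals under the label 'All':

```python
import pandas as pd

a = pd.Series(['foo', 'foo', 'bar', 'bar', 'foo'])
b = pd.Series(['one', 'two', 'one', 'one', 'two'])

# margins=True appends an 'All' row and column holding the subtotals.
ct = pd.crosstab(a, b, margins=True)
```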