Grouping operations are sometimes called "split-apply-combine" operations. The "split" step is what the obj.groupby() method implements.
.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)
The method acts on an axis and accepts a grouping key (by) parameter used to group the caller. The grouping key can be a Series or a list whose length matches the axis being grouped; it can also be a mapping function, a dict, or even a column name (a string). These parameter types are just shortcuts: ultimately each of them is used to produce an array of values for splitting the object.
lang:python
>>> df = DataFrame({'key1':['a','a','b','b','a'],
...                 'key2':['one','two','one','two','one'],
...                 'data1':np.random.randn(5),
...                 'data2':np.random.randn(5)})
>>> df
data1 data2 key1 key2
0 0.922269 0.110285 a one
1 -0.181773 1.022435 a two
2 0.635899 0.279316 b one
3 0.527926 0.482807 b two
4 -1.586040 -1.312042 a one
[5 rows x 4 columns]
>>> grouped = df.groupby(df['key1'])
>>> grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000005BC25F8>
Here df['key1'] is used as the grouping key, splitting the rows into groups a and b. The actual grouping key does not need to be related to the object being grouped at all; it only has to be of the same length. Using one of the object's own columns is just a common convenience. In the example above, a list such as [1,1,2,2,1] used as the grouping key would give the same result as df['key1'].
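To make that claim concrete, here is a minimal sketch (with made-up data) showing that an external list of the same length produces the same grouping as the column itself:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# An external list of the same length as the axis works as a grouping key.
by_list = df['data1'].groupby([1, 1, 2, 2, 1]).mean()
# Grouping by the column puts the same rows together, only the group
# labels differ (1/2 instead of 'a'/'b').
by_col = df['data1'].groupby(df['key1']).mean()
```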
The DataFrameGroupBy object returned by groupby does not actually contain any data; it only records the intermediate information derived from the grouping key df['key1']. When you apply a function or another aggregation operation to the grouped data, pandas then performs a fast block-wise operation on df according to the groupby object and returns the result.
In other words: the groupby call itself does no computation, which is why it is fast. When working with this grouped object, though, it is convenient to think of it as holding the actual data. For example, we can apply many methods to it directly, or index and slice it:
lang:python
>>> grouped.mean()
data1 data2
key1
a -0.281848 -0.059774
b 0.581912 0.381061
[2 rows x 2 columns]
The example above does not show the key2 column, because its values are not numeric and the mean() method ignores it automatically. When you only want to look at one (or a few) of the columns, you can index, either before or after the groupby call (this is a kind of syntactic sugar):
lang:python
>>> df['data1'].groupby(df['key1']).mean()
key1
a -0.281848
b 0.581912
dtype: float64
>>> df.groupby(df['key2'])['data2'].mean()
key2
one -0.307481
two 0.752621
Name: data2, dtype: float64
If multiple arrays are used as the grouping key, the result has a hierarchical index:
lang:python
>>> df.groupby([df['key1'],df['key2']]).mean()
data1 data2
key1 key2
a one -0.331885 -0.600879
two -0.181773 1.022435
b one 0.635899 0.279316
two 0.527926 0.482807
[4 rows x 2 columns]
Finally, the .size() method of a GroupBy object (whether DataFrameGroupBy or SeriesGroupBy) shows the size of each group:
lang:python
>>> grouped.size()
key1
a 3
b 2
dtype: int64
<br /> ### Iterate over groups A GroupBy object supports iteration with a for loop, yielding 2-tuples of the group name and the data within that group. Here is a case with multiple grouping keys:
lang:python
>>> for i,j in df.groupby([df['key1'],df['key2']]):
...     print(i)
...     print('-----------')
...     print(j)
('a', 'one')
-----------
data1 data2 key1 key2
0 0.922269 0.110285 a one
4 -1.586040 -1.312042 a one
[2 rows x 4 columns]
('a', 'two')
-----------
data1 data2 key1 key2
1 -0.181773 1.022435 a two
[1 rows x 4 columns]
('b', 'one')
-----------
data1 data2 key1 key2
2 0.635899 0.279316 b one
[1 rows x 4 columns]
('b', 'two')
-----------
data1 data2 key1 key2
3 0.527926 0.482807 b two
[1 rows x 4 columns]
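Because iteration yields (name, group) tuples, one handy trick (sketched here with toy data) is to materialize the pieces into a dict of DataFrames in a single line:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'data1': [1, 2, 3]})

# dict(list(...)) consumes the (name, group) tuples produced by iteration,
# giving a mapping from group name to that group's sub-DataFrame.
pieces = dict(list(df.groupby('key1')))
```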
<br /> ### Use string column names as grouping keys As mentioned earlier, a **column name as a string** can be used as the grouping key, although none of the examples above do so. The method is convenient but has a pitfall: when you use it, the caller must be the DataFrame object itself, not an indexed view of the DataFrame. That is, `df.groupby('key1')['data1']` is OK, but `df['data1'].groupby('key1')` raises an error. Keep the distinction in mind when using it. <br /> ### Use a dict or Series as the grouping key These two parameter types provide a mapping from row (or column) names to group names. (Recall the description of a Series as a fixed-length, ordered dict.)
lang:python
>>> df.groupby({0:'a',1:'a',2:'b',3:'b',4:'a'}).mean()
data1 data2
a -0.281848 -0.059774
b 0.581912 0.381061
[2 rows x 2 columns]
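The dict example also works with a Series in its place, consistent with the fixed-length-ordered-dict view; a minimal sketch with explicit data:

```python
import pandas as pd

df = pd.DataFrame({'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [5.0, 4.0, 3.0, 2.0, 1.0]})

# The Series index (0..4) is matched against df's row index,
# and the Series values become the group names.
mapping = pd.Series(['a', 'a', 'b', 'b', 'a'])
result = df.groupby(mapping).mean()
```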
<br /> ### Grouping by function A function behaves somewhat like a dict here; or rather, all of these exotic grouping keys resemble a dict, in that the axis to be grouped is transformed, through some mapping, into an equal-length sequence of group names.
Where a dict looks up the row or column names as keys to get the group names, a function receives the row or column names as its argument. This is the kind of function you need to provide:
lang:python
>>> df.groupby(lambda x:'even' if x%2==0 else 'odd').mean()
data1 data2
even -0.009290 -0.307481
odd 0.173076 0.752621
[2 rows x 2 columns]
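As another sketch of function keys: with a string index, a built-in such as len can serve directly as the grouping key, since each index label is passed to the function in turn:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['ant', 'bee', 'cat', 'bird'])

# len() is called on each index label; labels of equal length share a group.
result = s.groupby(len).sum()
```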
<br /> ### Group by index level When grouping by a level of a hierarchical index, the parameter is no longer `by=None` but `level=None`, whose value can be the number or the name of the index level:
lang:python
>>> index = pd.MultiIndex.from_arrays([['even','odd','even','odd','even'],
...                                    [0,1,2,3,4]],names=['a','b'])
>>> df.index = index
>>> df.groupby(level='a').mean()
data1 data2
a
even -0.009290 -0.307481
odd 0.173076 0.752621
[2 rows x 2 columns]
>>> df.groupby(level=0).mean()
data1 data2
a
even -0.009290 -0.307481
odd 0.173076 0.752621
[2 rows x 2 columns]
<br /> # Data aggregation (Aggregation) --- Data aggregation refers to any data transformation that produces scalar values from arrays. You can also simply think of it as statistical computation, such as mean(), sum(), max(), and so on.
Aggregation itself is not directly tied to grouping; it can be performed on any column (or row), or on all of them. But when the operation is applied to grouped data, the results tend to become more meaningful.
The aggregation operations that can be applied to a GroupBy object include:
GroupBy.aggregate(), or its alias GroupBy.agg(). A custom function passed to it should take an array as its argument, i.e. the second element of the tuples yielded when iterating a GroupBy object. For example:
lang:python
>>> df.groupby('key1')['data1','data2'].agg(lambda arr:arr.max()-arr.min())
data1 data2
key1
a 2.508309 2.334477
b 0.107973 0.203492
[2 rows x 2 columns]
In practice, though, custom functions are quite slow; they cannot match the optimized built-in methods of GroupBy objects, which include: <br />
<table > <tr> <td>Method</td> <td>Description</td> </tr> <tr> <td>count</td> <td>Number of non-NA values in the group</td> </tr> <tr> <td>sum</td> <td>Sum of non-NA values</td> </tr> <tr> <td>mean</td> <td>Mean of non-NA values</td> </tr> <tr> <td>median</td> <td>Arithmetic median of non-NA values</td> </tr> <tr> <td>std, var</td> <td>Unbiased (n-1 denominator) standard deviation and variance</td> </tr> <tr> <td>min, max</td> <td>Minimum and maximum of non-NA values</td> </tr> <tr> <td>prod</td> <td>Product of non-NA values</td> </tr> <tr> <td>first, last</td> <td>First and last non-NA values</td> </tr> </table> <br /> ### Column-oriented multi-function application In the examples so far we only called one aggregation method at a time. For applying multiple functions, there are two cases to discuss:
The first is applying multiple functions to the same column to get multiple results; simply pass agg() a list of functions:
lang:python
>>> df.groupby('key1')['data1','data2'].agg(['min','max'])
data1 data2
min max min max
key1
a -1.586040 0.922269 -1.312042 1.022435
b 0.527926 0.635899 0.279316 0.482807
[2 rows x 4 columns]
One trick here: for the statistical methods in the table above, the method name can be passed to agg() as a string. Also, if you do not like how the result columns are named, or if you used anonymous lambda functions, you can replace a function argument with a (name, function) tuple; the columns in the result will then be named after name instead of after the function.
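A minimal sketch of the (name, function) tuple form, with made-up data, showing that the result columns take the given names:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, 3.0, 5.0]})

# Each (name, function) tuple names one result column.
result = df.groupby('key')['val'].agg([('smallest', 'min'), ('largest', 'max')])
```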
The second is applying different functions to different columns; this requires passing agg() a dict mapping column names to functions:
lang:python
>>> df.groupby('key1').agg({'data1':'min','data2':'max'})
data1 data2
key1
a -1.586040 1.022435
b 0.527926 0.482807
[2 rows x 2 columns]
Aggregation is just one kind of grouping operation. More kinds can be carried out through the .transform() and apply() methods. <br /> ### transform In the aggregation operations above, the result is indexed by the group names. If we want the original index instead, a merge would be needed to map the results back. transform(func, *args, **kwargs) simplifies that process: it applies func to every group and then places the results at the original array's index (if the result is a scalar, it is broadcast):
lang:python
>>> df
data1 data2 key1 key2
a b
even 0 0.922269 0.110285 a one
odd 1 -0.181773 1.022435 a two
even 2 0.635899 0.279316 b one
odd 3 0.527926 0.482807 b two
even 4 -1.586040 -1.312042 a one
[5 rows x 4 columns]
>>> df.groupby('key1').transform('mean')
data1 data2
a b
even 0 -0.281848 -0.059774
odd 1 -0.281848 -0.059774
even 2 0.581912 0.381061
odd 3 0.581912 0.381061
even 4 -0.281848 -0.059774
[5 rows x 2 columns]
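A classic use of transform (sketched here with toy numbers) is demeaning within groups: subtracting each group's mean while keeping the original index:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1.0, 3.0, 2.0, 6.0]})

# transform('mean') broadcasts each group mean back to the original index,
# so the subtraction aligns row by row.
demeaned = df['data1'] - df.groupby('key1')['data1'].transform('mean')
```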
<br /> ### apply `apply(func, *args, **kwargs)` splits the object being processed into pieces, invokes the passed function on each piece, and finally tries to glue the results back together with `pd.concat()`. The return value of func can be a pandas object or a scalar, and the size of any returned array object is unrestricted.
lang:python
>>> df
data1 data2 key1 key2
0 0.721150 -0.359337 a one
1 -1.727197 1.539508 a two
2 -0.339751 0.171379 b one
3 -0.291888 -1.000769 b two
4 -0.127029 0.506162 a one
[5 rows x 4 columns]
>>> def foo(df,n=12):
...     return pd.DataFrame(np.arange(n).reshape(3,4))
>>> df.groupby('key1').apply(foo)
0 1 2 3
key1
a 0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
b 0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
[6 rows x 4 columns]
This is a meaningless example, since the func passed to apply does nothing with df and simply returns a (3,4) array. But precisely such a meaningless example illustrates the generality of apply: you can return any result at all. Much of the time, the only thing limiting apply is the user's imagination. <br /> ### Pivot tables and crosstabs DataFrame objects have a .pivot_table(data, values=None, rows=None, cols=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)
method for building pivot tables; pd.pivot_table() also exists as a top-level function.
An example:
lang:python
>>> df
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
>>> table = pivot_table(df, values='D', rows=['A', 'B'],
... cols=['C'], aggfunc=np.sum)
>>> table
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
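Note that the rows=/cols= parameters shown above belong to older pandas releases; in current versions they are called index= and columns=. A runnable sketch with a smaller made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'two', 'one', 'two'],
                   'D': [1, 2, 3, 4]})

# index=/columns= replace the older rows=/cols= parameters.
table = pd.pivot_table(df, values='D', index='A', columns='B', aggfunc='sum')
```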
<br /> A crosstab (cross-tabulation) is a special pivot table used to compute group frequencies.
crosstab(rows, cols, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
lang:python
>>> a
array([foo, foo, foo, foo, bar, bar,
bar, bar, foo, foo, foo], dtype=object)
>>> b
array([one, one, one, two, one, one,
one, two, two, two, one], dtype=object)
>>> c
array([dull, dull, shiny, dull, dull, shiny,
shiny, dull, shiny, shiny, shiny], dtype=object)
>>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b one two
c dull shiny dull shiny
a
bar 1 2 1 0
foo 2 2 1 2
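One more sketch, assuming a current pandas version: margins=True adds row and column subtotals under the label 'All':

```python
import pandas as pd

a = pd.Series(['foo', 'foo', 'bar', 'bar', 'foo'])
b = pd.Series(['one', 'two', 'one', 'one', 'two'])

# margins=True appends an 'All' row and column holding the subtotals.
ct = pd.crosstab(a, b, margins=True)
```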