今天,我們將探討如何在 Python 的 Pandas 庫中創建 GroupBy 對象以及該對象的工作原理。我們將詳細了解分組過程的每個步驟,可以將哪些方法應用於 GroupBy 對象上,以及我們可以從中提取哪些有用信息
不要再觀望了,一起學起來吧
首先文末要知道,任何 groupby 過程都涉及以下 3 個步驟的某種組合:
讓我先來大致浏覽下今天用到的測試數據集
import pandas as pd
import numpy as np
pd.set_option('max_columns', None)
df = pd.read_csv('complete.csv')
df = df[['awardYear', 'category', 'prizeAmount', 'prizeAmountAdjusted', 'name', 'gender', 'birth_continent']]
df.head()
Output:
awardYear category prizeAmount prizeAmountAdjusted name gender birth_continent
0 2001 Economic Sciences 10000000 12295082 A. Michael Spence male North America
1 1975 Physics 630000 3404179 Aage N. Bohr male Europe
2 2004 Chemistry 10000000 11762861 Aaron Ciechanover male Asia
3 1982 Chemistry 1150000 3102518 Aaron Klug male Europe
4 1979 Physics 800000 2988048 Abdus Salam male Asia
在這個階段,我們調用 pandas DataFrame.groupby() 函數。 我們使用它根據預定義的標准將數據分組,沿行(默認情況下,axis=0)或列(axis=1)。 換句話說,此函數將標簽映射到組的名稱。
例如,在我們的案例中,我們可以按獎項類別對諾貝爾獎的數據進行分組:
grouped = df.groupby('category')
也可以使用多個列來執行數據分組,傳遞一個列列表即可。讓我們首先按獎項類別對我們的數據進行分組,然後在每個創建的組中,我們將根據獲獎年份應用額外的分組:
grouped_category_year = df.groupby(['category', 'awardYear'])
現在,如果我們嘗試打印剛剛創建的兩個 GroupBy 對象之一,我們實際上將看不到任何組:
print(grouped)
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026083789DF0>
我們要注意的是,創建 GroupBy 對象成功與否,只檢查我們是否通過了正確的映射;在我們顯式地對該對象使用某些方法或提取其某些屬性之前,都不會真正執行拆分-應用-組合鏈的任何操作
為了簡要檢查生成的 GroupBy 對象並檢查組的拆分方式,我們可以從中提取組或索引屬性。它們都返回一個字典,其中鍵是創建的組,值是原始 DataFrame 中每個組的實例的軸標簽列表(對於組屬性)或索引(對於索引屬性):
grouped.indices
Output:
{'Chemistry': array([ 2, 3, 7, 9, 10, 11, 13, 14, 15, 17, 19, 39, 62,
64, 66, 71, 75, 80, 81, 86, 92, 104, 107, 112, 129, 135,
153, 169, 175, 178, 181, 188, 197, 199, 203, 210, 215, 223, 227,
239, 247, 249, 258, 264, 265, 268, 272, 274, 280, 282, 284, 289,
296, 298, 310, 311, 317, 318, 337, 341, 343, 348, 352, 357, 362,
365, 366, 372, 374, 384, 394, 395, 396, 415, 416, 419, 434, 440,
442, 444, 446, 448, 450, 455, 456, 459, 461, 463, 465, 469, 475,
504, 505, 508, 518, 522, 523, 524, 539, 549, 558, 559, 563, 567,
571, 572, 585, 591, 596, 599, 627, 630, 632, 641, 643, 644, 648,
659, 661, 666, 667, 668, 671, 673, 679, 681, 686, 713, 715, 717,
719, 720, 722, 723, 725, 726, 729, 732, 738, 742, 744, 746, 751,
756, 759, 763, 766, 773, 776, 798, 810, 813, 814, 817, 827, 828,
829, 832, 839, 848, 853, 855, 862, 866, 880, 885, 886, 888, 889,
892, 894, 897, 902, 904, 914, 915, 920, 921, 922, 940, 941, 943,
946, 947], dtype=int64),
'Economic Sciences': array([ 0, 5, 45, 46, 58, 90, 96, 139, 140, 145, 152, 156, 157,
180, 187, 193, 207, 219, 231, 232, 246, 250, 269, 279, 283, 295,
305, 324, 346, 369, 418, 422, 425, 426, 430, 432, 438, 458, 467,
476, 485, 510, 525, 527, 537, 538, 546, 580, 594, 595, 605, 611,
636, 637, 657, 669, 670, 678, 700, 708, 716, 724, 734, 737, 739,
745, 747, 749, 750, 753, 758, 767, 800, 805, 854, 856, 860, 864,
871, 882, 896, 912, 916, 924], dtype=int64),
'Literature': array([ 21, 31, 40, 49, 52, 98, 100, 101, 102, 111, 115, 142, 149,
159, 170, 177, 201, 202, 220, 221, 233, 235, 237, 253, 257, 259,
275, 277, 278, 286, 312, 315, 316, 321, 326, 333, 345, 347, 350,
355, 359, 364, 370, 373, 385, 397, 400, 403, 406, 411, 435, 439,
441, 454, 468, 479, 480, 482, 483, 492, 501, 506, 511, 516, 556,
569, 581, 602, 604, 606, 613, 614, 618, 631, 633, 635, 640, 652,
653, 655, 656, 665, 675, 683, 699, 761, 765, 771, 774, 777, 779,
780, 784, 786, 788, 796, 799, 803, 836, 840, 842, 850, 861, 867,
868, 878, 881, 883, 910, 917, 919, 927, 928, 929, 930, 936],
dtype=int64),
'Peace': array([ 6, 12, 16, 25, 26, 27, 34, 36, 44, 47, 48, 54, 61,
65, 72, 78, 79, 82, 95, 99, 116, 119, 120, 126, 137, 146,
151, 166, 167, 171, 200, 204, 205, 206, 209, 213, 225, 236, 240,
244, 255, 260, 266, 267, 270, 287, 303, 320, 329, 356, 360, 361,
377, 386, 387, 388, 389, 390, 391, 392, 393, 433, 447, 449, 471,
477, 481, 489, 491, 500, 512, 514, 517, 528, 529, 530, 533, 534,
540, 542, 544, 545, 547, 553, 555, 560, 562, 574, 578, 590, 593,
603, 607, 608, 609, 612, 615, 616, 617, 619, 620, 628, 634, 639,
642, 664, 677, 688, 697, 703, 705, 710, 727, 736, 787, 793, 795,
806, 823, 846, 847, 852, 865, 875, 876, 877, 895, 926, 934, 935,
937, 944, 948, 949], dtype=int64),
'Physics': array([ 1, 4, 8, 20, 23, 24, 30, 32, 38, 51, 59, 60, 67,
68, 69, 70, 74, 84, 89, 97, 103, 105, 108, 109, 114, 117,
118, 122, 125, 127, 128, 130, 133, 141, 143, 144, 155, 162, 163,
164, 165, 168, 173, 174, 176, 179, 183, 195, 212, 214, 216, 222,
224, 228, 230, 234, 238, 241, 243, 251, 256, 263, 271, 276, 291,
292, 297, 301, 306, 307, 308, 323, 327, 328, 330, 335, 336, 338,
349, 351, 353, 354, 363, 367, 375, 376, 378, 381, 382, 398, 399,
402, 404, 405, 408, 410, 412, 413, 420, 421, 424, 428, 429, 436,
445, 451, 453, 457, 460, 462, 470, 472, 487, 495, 498, 499, 509,
513, 515, 521, 526, 532, 535, 536, 541, 548, 550, 552, 557, 561,
564, 565, 566, 573, 576, 577, 579, 583, 586, 588, 592, 601, 610,
621, 622, 623, 629, 647, 650, 651, 654, 658, 674, 676, 682, 684,
690, 691, 693, 694, 695, 696, 698, 702, 707, 711, 714, 721, 730,
731, 735, 743, 752, 755, 770, 772, 775, 781, 785, 790, 792, 797,
801, 802, 808, 822, 833, 834, 835, 844, 851, 870, 872, 879, 884,
887, 890, 893, 900, 901, 903, 905, 907, 908, 909, 913, 925, 931,
932, 933, 938, 942, 945], dtype=int64),
'Physiology or Medicine': array([ 18, 22, 28, 29, 33, 35, 37, 41, 42, 43, 50, 53, 55,
56, 57, 63, 73, 76, 77, 83, 85, 87, 88, 91, 93, 94,
106, 110, 113, 121, 123, 124, 131, 132, 134, 136, 138, 147, 148,
150, 154, 158, 160, 161, 172, 182, 184, 185, 186, 189, 190, 191,
192, 194, 196, 198, 208, 211, 217, 218, 226, 229, 242, 245, 248,
252, 254, 261, 262, 273, 281, 285, 288, 290, 293, 294, 299, 300,
302, 304, 309, 313, 314, 319, 322, 325, 331, 332, 334, 339, 340,
342, 344, 358, 368, 371, 379, 380, 383, 401, 407, 409, 414, 417,
423, 427, 431, 437, 443, 452, 464, 466, 473, 474, 478, 484, 486,
488, 490, 493, 494, 496, 497, 502, 503, 507, 519, 520, 531, 543,
551, 554, 568, 570, 575, 582, 584, 587, 589, 597, 598, 600, 624,
625, 626, 638, 645, 646, 649, 660, 662, 663, 672, 680, 685, 687,
689, 692, 701, 704, 706, 709, 712, 718, 728, 733, 740, 741, 748,
754, 757, 760, 762, 764, 768, 769, 778, 782, 783, 789, 791, 794,
804, 807, 809, 811, 812, 815, 816, 818, 819, 820, 821, 824, 825,
826, 830, 831, 837, 838, 841, 843, 845, 849, 857, 858, 859, 863,
869, 873, 874, 891, 898, 899, 906, 911, 918, 923, 939], dtype=int64)}
要查找 GroupBy 對象中的組數,我們可以從中提取 ngroups 屬性或調用 Python 標准庫的 len 函數:
print(grouped.ngroups)
print(len(grouped))
Output:
6
6
如果我們需要可視化每個組的所有或部分條目,那麼可以遍歷 GroupBy 對象:
for name, entries in grouped:
print(f'First 2 entries for the "{name}" category:')
print(30*'-')
print(entries.head(2), '\n\n')
Output:
First 2 entries for the "Chemistry" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted name \
2 2004 Chemistry 10000000 11762861 Aaron Ciechanover
3 1982 Chemistry 1150000 3102518 Aaron Klug
gender birth_continent
2 male Asia
3 male Europe
First 2 entries for the "Economic Sciences" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
0 2001 Economic Sciences 10000000 12295082
5 2019 Economic Sciences 9000000 9000000
name gender birth_continent
0 A. Michael Spence male North America
5 Abhijit Banerjee male Asia
First 2 entries for the "Literature" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
21 1957 Literature 208629 2697789
31 1970 Literature 400000 3177966
name gender birth_continent
21 Albert Camus male Africa
31 Alexandr Solzhenitsyn male Europe
First 2 entries for the "Peace" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
6 2019 Peace 9000000 9000000
12 1980 Peace 880000 2889667
name gender birth_continent
6 Abiy Ahmed Ali male Africa
12 Adolfo Pérez Esquivel male South America
First 2 entries for the "Physics" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted name gender \
1 1975 Physics 630000 3404179 Aage N. Bohr male
4 1979 Physics 800000 2988048 Abdus Salam male
birth_continent
1 Europe
4 Asia
First 2 entries for the "Physiology or Medicine" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
18 1963 Physiology or Medicine 265000 2839286
22 1974 Physiology or Medicine 550000 3263449
name gender birth_continent
18 Alan Hodgkin male Europe
22 Albert Claude male Europe
相反,如果我們想以 DataFrame 的形式選擇單個組,我們應該在 GroupBy 對象上使用 get_group()
方法:
grouped.get_group('Economic Sciences')
Output:
awardYear category prizeAmount prizeAmountAdjusted name gender birth_continent
0 2001 Economic Sciences 10000000 12295082 A. Michael Spence male North America
5 2019 Economic Sciences 9000000 9000000 Abhijit Banerjee male Asia
45 2012 Economic Sciences 8000000 8361204 Alvin E. Roth male North America
46 1998 Economic Sciences 7600000 9713701 Amartya Sen male Asia
58 2015 Economic Sciences 8000000 8384572 Angus Deaton male Europe
… … … … … … … …
882 2002 Economic Sciences 10000000 12034660 Vernon L. Smith male North America
896 1973 Economic Sciences 510000 3331882 Wassily Leontief male Europe
912 2018 Economic Sciences 9000000 9000000 William D. Nordhaus male North America
916 1990 Economic Sciences 4000000 6329114 William F. Sharpe male North America
924 1996 Economic Sciences 7400000 9490424 William Vickrey male North America
在拆分原始數據並檢查結果組之後,我們可以對每個組執行以下操作之一或其組合:
要聚合 GroupBy 對象的數據(即按組計算匯總統計量),我們可以在對象上使用 agg()
方法:
# Showing only 1 decimal for all float numbers
pd.options.display.float_format = '{:.1f}'.format
grouped.agg(np.mean)
Output:
awardYear prizeAmount prizeAmountAdjusted
category
Chemistry 1972.3 3629279.4 6257868.1
Economic Sciences 1996.1 6105845.2 7837779.2
Literature 1960.9 2493811.2 5598256.3
Peace 1964.5 3124879.2 6163906.9
Physics 1971.1 3407938.6 6086978.2
Physiology or Medicine 1970.4 3072972.9 5738300.7
上面的代碼生成一個 DataFrame,其中組名作為其新索引,每個數字列的平均值作為分組
我們可以直接在 GroupBy 對象上應用其他相應的 Pandas 方法,而不僅僅是使用 agg()
方法。最常用的方法是 mean()
、median()
、mode()
、sum()
、size()
、count()
、min()
、max()
、std()
、var()
(計算每個的方差 group)、describe()
(按組輸出描述性統計信息)和 nunique()
(給出每個組中唯一值的數量)
grouped.sum()
Output:
awardYear prizeAmount prizeAmountAdjusted
category
Chemistry 362912 667787418 1151447726
Economic Sciences 167674 512891000 658373449
Literature 227468 289282102 649397731
Peace 263248 418733807 825963521
Physics 419837 725890928 1296526352
Physiology or Medicine 431508 672981066 1256687857
通常情況下我們只對某些特定列或列的統計信息感興趣,因此我們需要指定它們。 在上面的例子中,我們絕對不想總結所有年份,相應的我們可能希望按獎品類別對獎品價值求和。為此我們可以選擇 GroupBy 對象的 PrizeAmountAdjusted 列,就像我們選擇 DataFrame 的列,然後對其應用 sum() 函數:
grouped['prizeAmountAdjusted'].sum()
Output:
category
Chemistry 1151447726
Economic Sciences 658373449
Literature 649397731
Peace 825963521
Physics 1296526352
Physiology or Medicine 1256687857
Name: prizeAmountAdjusted, dtype: int64
對於上面的代碼片段,我們可以在選擇必要的列之前使用對 GroupBy 對象應用函數的等效語法:grouped.sum()['prizeAmountAdjusted']
。但是前面的語法更可取,因為它的性能更好,尤其是在大型數據集上,效果更為明顯
如果我們需要聚合兩列或更多列的數據,我們使用雙方括號:
grouped[['prizeAmount', 'prizeAmountAdjusted']].sum()
Output:
prizeAmount prizeAmountAdjusted
category
Chemistry 667787418 1151447726
Economic Sciences 512891000 658373449
Literature 289282102 649397731
Peace 418733807 825963521
Physics 725890928 1296526352
Physiology or Medicine 672981066 1256687857
可以一次將多個函數應用於 GroupBy 對象的一列或多列。為此我們再次需要 agg()
方法和感興趣的函數列表:
grouped[['prizeAmount', 'prizeAmountAdjusted']].agg([np.sum, np.mean, np.std])
Output:
prizeAmount prizeAmountAdjusted
sum mean std sum mean std
category
Chemistry 667787418 3629279.4 4070588.4 1151447726 6257868.1 3276027.2
Economic Sciences 512891000 6105845.2 3787630.1 658373449 7837779.2 3313153.2
Literature 289282102 2493811.2 3653734.0 649397731 5598256.3 3029512.1
Peace 418733807 3124879.2 3934390.9 825963521 6163906.9 3189886.1
Physics 725890928 3407938.6 4013073.0 1296526352 6086978.2 3294268.5
Physiology or Medicine 672981066 3072972.9 3898539.3 1256687857 5738300.7 3241781.0
此外,我們可以考慮通過傳遞字典將不同的聚合函數應用於 GroupBy 對象的不同列:
grouped.agg({'prizeAmount': [np.sum, np.size], 'prizeAmountAdjusted': np.mean})
Output:
prizeAmount prizeAmountAdjusted
sum size mean
category
Chemistry 667787418 184 6257868.1
Economic Sciences 512891000 84 7837779.2
Literature 289282102 116 5598256.3
Peace 418733807 134 6163906.9
Physics 725890928 213 6086978.2
Physiology or Medicine 672981066 219 5738300.7
與聚合方法不同,轉換方法返回一個新的 DataFrame,其形狀和索引與原始 DataFrame 相同,但具有轉換後的各個值。這裡需要注意的是,transformation 一定不能修改原始 DataFrame 中的任何值,也就是這些操作不能原地執行
轉換 GroupBy 對象數據的最常見的 Pandas 方法是 transform()
。例如它可以幫助計算每個組的 z-score:
grouped[['prizeAmount', 'prizeAmountAdjusted']].transform(lambda x: (x - x.mean()) / x.std())
Output:
prizeAmount prizeAmountAdjusted
0 1.0 1.3
1 -0.7 -0.8
2 1.6 1.7
3 -0.6 -1.0
4 -0.6 -0.9
… … …
945 -0.7 -0.8
946 -0.8 -1.1
947 -0.9 0.3
948 -0.5 -1.0
949 -0.7 -1.0
使用轉換方法,我們還可以用組均值、中位數、眾數或任何其他值替換缺失數據:
grouped['gender'].transform(lambda x: x.fillna(x.mode()[0]))
Output:
0 male
1 male
2 male
3 male
4 male
...
945 male
946 male
947 female
948 male
949 male
Name: gender, Length: 950, dtype: object
我們當然還可以使用其他一些 Pandas 方法來轉換 GroupBy 對象的數據:bfill()
、ffill()
、diff()
、pct_change()
、rank()
、shift()
、quantile()
等
過濾方法根據預定義的條件從每個組中丟棄組或特定行,並返回原始數據的子集。例如我們可能希望只保留所有組中某個列的值,其中該列的組均值大於預定義值。在我們的 DataFrame 的情況下,讓我們過濾掉所有組均值小於 7,000,000 的prizeAmountAdjusted 列,並在輸出中僅保留該列:
grouped['prizeAmountAdjusted'].filter(lambda x: x.mean() > 7000000)
Output:
0 12295082
5 9000000
45 8361204
46 9713701
58 8384572
...
882 12034660
896 3331882
912 9000000
916 6329114
924 9490424
Name: prizeAmountAdjusted, Length: 84, dtype: int64
另一個例子是過濾掉具有超過一定數量元素的組:
grouped['prizeAmountAdjusted'].filter(lambda x: len(x) < 100)
Output:
0 12295082
5 9000000
45 8361204
46 9713701
58 8384572
...
882 12034660
896 3331882
912 9000000
916 6329114
924 9490424
Name: prizeAmountAdjusted, Length: 84, dtype: int64
在上述兩個操作中,我們使用了 filter()
方法,將 lambda
函數作為參數傳遞。這樣的函數,應用於整個組,根據該組與預定義統計條件的比較結果返回 True
或 False
。換句話說,filter()
方法中的函數決定了哪些組保留在新的 DataFrame 中
除了過濾掉整個組之外,還可以從每個組中丟棄某些行。這裡有一些有用的方法是 first()
、last()
和 nth()
。將其中一個應用於 GroupBy 對象會相應地返回每個組的第一個/最後一個/第 n 個條目:
grouped.last()
Output:
awardYear prizeAmount prizeAmountAdjusted name gender birth_continent
category
Chemistry 1911 140695 7327865 Marie Curie female Europe
Economic Sciences 1996 7400000 9490424 William Vickrey male North America
Literature 1968 350000 3052326 Yasunari Kawabata male Asia
Peace 1963 265000 2839286 International Committee of the Red Cross male Asia
Physics 1972 480000 3345725 John Bardeen male North America
Physiology or Medicine 2016 8000000 8301051 Yoshinori Ohsumi male Asia
對於 nth()
方法,我們必須傳遞表示要為每個組返回的條目索引的整數:
grouped.nth(1)
Output:
awardYear prizeAmount prizeAmountAdjusted name gender birth_continent
category
Chemistry 1982 1150000 3102518 Aaron Klug male Europe
Economic Sciences 2019 9000000 9000000 Abhijit Banerjee male Asia
Literature 1970 400000 3177966 Alexandr Solzhenitsyn male Europe
Peace 1980 880000 2889667 Adolfo Pérez Esquivel male South America
Physics 1979 800000 2988048 Abdus Salam male Asia
Physiology or Medicine 1974 550000 3263449 Albert Claude male Europe
上面的代碼收集了所有組的第二個條目
另外兩個過濾每個組中的行的方法是 head()
和 tail()
,分別返回每個組的第一/最後 n 行(默認為 5):
grouped.head(3)
Output:
awardYear category prizeAmount prizeAmountAdjusted name gender birth_continent
0 2001 Economic Sciences 10000000 12295082 A. Michael Spence male North America
1 1975 Physics 630000 3404179 Aage N. Bohr male Europe
2 2004 Chemistry 10000000 11762861 Aaron Ciechanover male Asia
3 1982 Chemistry 1150000 3102518 Aaron Klug male Europe
4 1979 Physics 800000 2988048 Abdus Salam male Asia
5 2019 Economic Sciences 9000000 9000000 Abhijit Banerjee male Asia
6 2019 Peace 9000000 9000000 Abiy Ahmed Ali male Africa
7 2009 Chemistry 10000000 10958504 Ada E. Yonath female Asia
8 2011 Physics 10000000 10545557 Adam G. Riess male North America
12 1980 Peace 880000 2889667 Adolfo Pérez Esquivel male South America
16 2007 Peace 10000000 11301989 Al Gore male North America
18 1963 Physiology or Medicine 265000 2839286 Alan Hodgkin male Europe
21 1957 Literature 208629 2697789 Albert Camus male Africa
22 1974 Physiology or Medicine 550000 3263449 Albert Claude male Europe
28 1937 Physiology or Medicine 158463 4716161 Albert Szent-Györgyi male Europe
31 1970 Literature 400000 3177966 Alexandr Solzhenitsyn male Europe
40 2013 Literature 8000000 8365867 Alice Munro female North America
45 2012 Economic Sciences 8000000 8361204 Alvin E. Roth male North America
split-apply-combine 鏈的最後一個階段——合並結果——由Ppandas 在後台執行。它包括獲取在 GroupBy 對象上執行的所有操作的輸出並將它們重新組合在一起,生成新的數據結構,例如 Series 或 DataFrame。將此數據結構分配給一個變量,我們可以用它來解決其他任務
今天我們介紹了使用 pandas groupby 函數和使用結果對象的許多知識
好了,這就是今天分享的全部內容,喜歡就點個贊吧~
公眾號後台回復“groupby”獲取今天的測試數據集