Chapter 10: Data Aggregation and Group Operations

pandas's groupby facility lets you slice, dice, and summarize datasets in a natural way.

In this chapter:

  • Compute group summary statistics, such as count, mean, standard deviation, or a user-defined function.
  • Apply within-group transformations or other manipulations, such as normalization, linear regression, ranking, or subset selection.
  • Compute pivot tables and cross-tabulations.
  • Perform quantile analysis and other statistical group analyses.

10.1 GroupBy Mechanics

Group operations follow the split-apply-combine pattern:

  1. In the first stage, the data in a pandas object is split into groups based on one or more keys that you provide. The split is performed on a particular axis of the object.
    For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
  2. In the second stage, a function is applied to each group, producing a new value.
  3. In the third stage, the results of all those function applications are combined into a result object. The form of the result usually depends on what is being done to the data.

(figure: the split-apply-combine pipeline — the data is split by key, a function is applied to each group, and the results are combined)
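
To make the three stages concrete, here is a hand-rolled sketch of the same pipeline on a toy Series (illustration only; groupby performs all of this in one step):

import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'a', 'b'])

# split: one chunk of data per distinct key
pieces = {name: chunk for name, chunk in s.groupby(level=0)}
# apply: run a function on each chunk
applied = {name: chunk.sum() for name, chunk in pieces.items()}
# combine: assemble the per-group results into a new object
print(pd.Series(applied))
# a    4
# b    6
# dtype: int64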

Group keys can take many forms, and they need not all be of the same type; their ultimate purpose is to produce an array of values used to split the object:

  • A list or array of values the same length as the axis being grouped.
  • A value indicating a column name in the DataFrame.
  • A dict or Series giving a correspondence between the values on the grouped axis and the group names.
  • A function to be invoked on the axis index or the individual labels in the index.
import numpy as np
import pandas as pd


np.random.seed(888)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

print(df)
# key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 1 a two 0.188876 0.217776
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238
# 4 a one -0.652499 -1.099947


# Group by key1 and compute the mean of the data1 column.
# Access data1 and call groupby with key1:
grouped = df['data1'].groupby(df['key1'])


# Calling groupby produces a GroupBy object; it holds all the information needed to apply operations to each group
print(grouped)
# <pandas.core.groupby.groupby.SeriesGroupBy object at 0x00000238B9BC0908>


# Call the GroupBy mean method to compute group means
se = grouped.mean()
print(se)
# key1
# a -0.213275
# b 0.397150
# Name: data1, dtype: float64


print(type(se))
# <class 'pandas.core.series.Series'>


# Call the grouped object's sum method
print(grouped.sum())
# key1
# a -0.639824
# b 0.794300
# Name: data1, dtype: float64


# Call the grouped object's count method
print(grouped.count())
# key1
# a 3
# b 2
# Name: data1, dtype: int64


# Passing a list of arrays groups the data by two keys at once
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)
# key1 key2
# a one -0.414350
# two 0.188876
# b one 0.826747
# two -0.032447
# Name: data1, dtype: float64


print(means.unstack())
# key2 one two
# key1
# a -0.414350 0.188876
# b 0.826747 -0.032447


# A column name (string, number, or other Python object) can also be used as the group key
# Because key2 is not numeric, its mean cannot be computed, so it is silently dropped
print(df.groupby('key1').mean())
# data1 data2
# key1
# a -0.213275 -0.32917
# b 0.397150 0.34376


# By contrast, a non-numeric column cannot be averaged but can be counted, so key2 is kept here
print(df.groupby('key1').count())
# key2 data1 data2
# key1
# a 3 3 3
# b 2 2 2


print(df.groupby(['key1', 'key2']).mean())
# data1 data2
# key1 key2
# a one -0.414350 -0.602643
# two 0.188876 0.217776
# b one 0.826747 0.587282
# two -0.032447 0.100238

# The size method returns a Series containing the group sizes:
print(df.groupby(['key1', 'key2']).size())
# key1 key2
# a one 2
# two 1
# b one 1
# two 1
# dtype: int64

In short:

  • In grouped = df['data1'].groupby(df['key1']), df['key1'] is called the group key.

  • ==Call groupby on the column whose values you want to compute over.==
    Alternatively, call groupby on the DataFrame itself, and every column participates in the computation.

  • In df.groupby('key1').mean():

    1. ==df.groupby('key1') determines the row index of the result.==
    2. ==mean() determines the cell values.==

    Put simply:

    1. groupby fixes the group key (what to group by).
    2. mean() fixes the aggregation (which function condenses each group).
  • By default, all numeric (continuous) columns are aggregated.
    A non-numeric column may be dropped because the computation is undefined for it
    (e.g., such a column has no mean, so it is discarded outright when mean() is called).
    These are informally known as nuisance columns.

  • The GroupBy object's size method returns a Series of group sizes.

  • Missing values in the group key are excluded from the result, as the sketch below shows.
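
A minimal sketch of that last point on a toy frame: rows whose group key is missing never appear in any group.

import pandas as pd

df2 = pd.DataFrame({'key1': ['a', 'a', None, 'b'],
                    'data1': [1.0, 2.0, 3.0, 4.0]})
print(df2.groupby('key1')['data1'].sum())
# key1
# a    3.0
# b    4.0
# Name: data1, dtype: float64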

Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

import numpy as np
import pandas as pd


np.random.seed(888)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

print(df)
# key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 1 a two 0.188876 0.217776
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238
# 4 a one -0.652499 -1.099947


# Iteration yields the group name `name` and the data chunk `group`
for name, group in df.groupby('key1'):
    print(name, '\n')
    print(group)
# a

# key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 1 a two 0.188876 0.217776
# 4 a one -0.652499 -1.099947

# b

# key1 key2 data1 data2
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238



# Grouping by multiple keys:
# iteration yields the key tuple (k1, k2) and the data chunk `group`
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2), '\n')
    print(group)
    print('--------------')

# ('a', 'one')

# key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 4 a one -0.652499 -1.099947
# --------------
# ('a', 'two')

# key1 key2 data1 data2
# 1 a two 0.188876 0.217776
# --------------
# ('b', 'one')

# key1 key2 data1 data2
# 2 b one 0.826747 0.587282
# --------------
# ('b', 'two')

# key1 key2 data1 data2
# 3 b two -0.032447 0.100238
# --------------



print(list(df.groupby('key1')))
# [('a', key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 1 a two 0.188876 0.217776
# 4 a one -0.652499 -1.099947), ('b', key1 key2 data1 data2
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238)]

# Make a dict of the pieces for later use
pieces = dict(list(df.groupby('key1')))
print(pieces['b'])
# key1 key2 data1 data2
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238



print(df.dtypes)
# key1 object
# key2 object
# data1 float64
# data2 float64
# dtype: object

# Use axis=1 to group the columns by dtype
grouped = df.groupby(df.dtypes, axis=1)

for dtype, group in grouped:
    print(dtype, '\n')
    print(group)
    print('----------------')

# float64

# data1 data2
# 0 -0.176201 -0.105339
# 1 0.188876 0.217776
# 2 0.826747 0.587282
# 3 -0.032447 0.100238
# 4 -0.652499 -1.099947
# ----------------
# object

# key1 key2
# 0 a one
# 1 a two
# 2 b one
# 3 b two
# 4 a one
# ----------------

Selecting a column or subset of columns

From the above, the code to group a single column is:

df['data1'].groupby(df['key1'])

df[['data2']].groupby(df['key1'])

Both have syntactic sugar:

df.groupby('key1')['data1']

df.groupby('key1')[['data2']]

Example:

import numpy as np
import pandas as pd


np.random.seed(888)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

print(df)
# key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 1 a two 0.188876 0.217776
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238
# 4 a one -0.652499 -1.099947

s = df['data2']
d = df[['data2']]

print(type(s))
# <class 'pandas.core.series.Series'>

print(type(d))
# <class 'pandas.core.frame.DataFrame'>


print(s)
# 0 -0.105339
# 1 0.217776
# 2 0.587282
# 3 0.100238
# 4 -1.099947
# Name: data2, dtype: float64


print(d)
# data2
# 0 -0.105339
# 1 0.217776
# 2 0.587282
# 3 0.100238
# 4 -1.099947


x = df.groupby(['key1', 'key2'])['data2'].mean()
y = df.groupby(['key1', 'key2'])[['data2']].mean()


print(x)
# key1 key2
# a one -0.602643
# two 0.217776
# b one 0.587282
# two 0.100238
# Name: data2, dtype: float64


print(y)
# data2
# key1 key2
# a one -0.602643
# two 0.217776
# b one 0.587282
# two 0.100238

print(type(x))
# <class 'pandas.core.series.Series'>

print(type(y))
# <class 'pandas.core.frame.DataFrame'>

Grouping with dicts and Series

import numpy as np
import pandas as pd


people = pd.DataFrame(np.arange(25).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

people.iloc[2, [1, 2]] = np.nan
print(people)
# a b c d e
# Joe 0 1.0 2.0 3 4
# Steve 5 6.0 7.0 8 9
# Wes 10 NaN NaN 13 14
# Jim 15 16.0 17.0 18 19
# Travis 20 21.0 22.0 23 24


mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Pass the dict to groupby to construct the group key
by_column = people.groupby(mapping, axis=1)

print(by_column.sum())
# blue red
# Joe 5.0 5.0
# Steve 15.0 20.0
# Wes 13.0 24.0
# Jim 35.0 50.0
# Travis 45.0 65.0


# A Series works the same way as a group key
map_series = pd.Series(mapping)
print(map_series)
# a red
# b red
# c blue
# d blue
# e red
# f orange
# dtype: object


print(people.groupby(map_series, axis=1).count())
# blue red
# Joe 2 3
# Steve 2 3
# Wes 1 2
# Jim 2 3
# Travis 2 3

Grouping with functions

Any function passed as a group key will be called once per index value, and its return values are used as the group names.

import numpy as np
import pandas as pd


people = pd.DataFrame(np.arange(25).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

people.iloc[2, [1, 2]] = np.nan
print(people)
# a b c d e
# Joe 0 1.0 2.0 3 4
# Steve 5 6.0 7.0 8 9
# Wes 10 NaN NaN 13 14
# Jim 15 16.0 17.0 18 19
# Travis 20 21.0 22.0 23 24


print(people.groupby(len).sum())
# a b c d e
# 3 25 17.0 19.0 34 37
# 5 5 6.0 7.0 8 9
# 6 20 21.0 22.0 23 24


# Mixing a function with arrays, lists, dicts, or Series
key_list = ['one', 'one', 'one', 'two', 'two']
print(people.groupby([len, key_list]).min())
# a b c d e
# 3 one 0 1.0 2.0 3 4
# two 15 16.0 17.0 18 19
# 5 one 5 6.0 7.0 8 9
# 6 two 20 21.0 22.0 23 24

In fact:
when functions are mixed with arrays, lists, dicts, or Series, everything is converted to an array internally. A quick check follows.
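
A quick check of that claim (the people frame is rebuilt here so the snippet stands alone): grouping by the len function gives the same result as grouping by the array of per-label lengths computed up front.

import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(25).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# The function key is materialized into an array of per-label results:
lengths = np.array([len(name) for name in people.index])
print(people.groupby(len).sum().equals(people.groupby(lengths).sum()))
# True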

Grouping by index levels

A convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index.

import numpy as np
import pandas as pd


columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.arange(10000, 10020).reshape(4, 5), columns=columns)

print(hier_df)
# cty US JP
# tenor 1 3 5 1 3
# 0 10000 10001 10002 10003 10004
# 1 10005 10006 10007 10008 10009
# 2 10010 10011 10012 10013 10014
# 3 10015 10016 10017 10018 10019


# Group by an index level
print(hier_df.groupby(level='cty', axis=1).sum())
# cty JP US
# 0 20007 30003
# 1 20017 30018
# 2 20027 30033
# 3 20037 30048

10.2 Data Aggregation

Optimized aggregation functions:

(table: the optimized GroupBy methods — count, sum, mean, median, std, var, min, max, prod, first, last)

Besides the functions above, you can use aggregation methods defined on Series or DataFrame, and even your own custom aggregation functions.

import numpy as np
import pandas as pd


np.random.seed(888)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

print(df)
# key1 key2 data1 data2
# 0 a one -0.176201 -0.105339
# 1 a two 0.188876 0.217776
# 2 b one 0.826747 0.587282
# 3 b two -0.032447 0.100238
# 4 a one -0.652499 -1.099947

grouped = df.groupby('key1')


# Call quantile, a Series/DataFrame method that computes sample quantiles
print(grouped['data1'].quantile(0.9))
# key1
# a 0.115861
# b 0.740828
# Name: data1, dtype: float64


def peak_to_peak(arr):
    return arr.max() - arr.min()


# To use your own aggregation function, pass it to the aggregate or agg method
print(grouped.agg(peak_to_peak))
# data1 data2
# key1
# a 0.841376 1.317723
# b 0.859194 0.487044


# Even describe, which is not strictly an aggregation, works
print(grouped.describe()['data1'])
# count mean std ... 50% 75% max
# key1 ...
# a 3.0 -0.213275 0.421911 ... -0.176201 0.006338 0.188876
# b 2.0 0.397150 0.607542 ... 0.397150 0.611949 0.826747
# [2 rows x 8 columns]

Note:
custom aggregation functions are much slower than the optimized ones, because of the substantial overhead of constructing the intermediate group chunks (function calls, data rearrangement, and so on).

Column-wise and multiple function application
(applying different aggregation functions to different columns, or several functions at once)

  • Applying several functions at once:
    just pass a list of functions to agg.
  • Using different aggregation functions for different columns:
    • Method one:
      select individual columns, e.g. result['tip_pct'],
      call the desired function on each selection,
      then assemble the pieces with concat (see the sketch right after this list).
    • Method two:
      pass agg a dict mapping column names to functions (demonstrated in the long example below).
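
A minimal sketch of method one on a toy frame (not the tips data used below): aggregate each selected column separately, then glue the results together with concat.

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'x': [1.0, 2.0, 3.0],
                   'y': [10.0, 20.0, 30.0]})
grouped = df.groupby('key')

# A different function per column, assembled side by side:
result = pd.concat([grouped['x'].mean(), grouped['y'].max()], axis=1)
print(result)
#        x     y
# key
# a    1.5  20.0
# b    3.0  30.0
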
import numpy as np
import pandas as pd


tips = pd.read_csv('examples/tips.csv')

tips['tip_pct'] = tips['tip'] / tips['total_bill']
print(tips[:6])
# total_bill tip smoker day time size tip_pct
# 0 16.99 1.01 No Sun Dinner 2 0.059447
# 1 10.34 1.66 No Sun Dinner 3 0.160542
# 2 21.01 3.50 No Sun Dinner 3 0.166587
# 3 23.68 3.31 No Sun Dinner 2 0.139780
# 4 24.59 3.61 No Sun Dinner 4 0.146808
# 5 25.29 4.71 No Sun Dinner 4 0.186240

grouped = tips.groupby(['day', 'smoker'])

grouped_pct = grouped['tip_pct']

print(grouped_pct.agg('mean'))
# day smoker
# Fri No 0.151650
# Yes 0.174783
# Sat No 0.158048
# Yes 0.147906
# Sun No 0.160113
# Yes 0.187250
# Thur No 0.160298
# Yes 0.163863
# Name: tip_pct, dtype: float64


def peak_to_peak(arr):
    return arr.max() - arr.min()

# The agg argument becomes a list of functions;
# each element can be a function or the string name of one
print(grouped_pct.agg(['mean', 'std', peak_to_peak]))
# mean std peak_to_peak
# day smoker
# Fri No 0.151650 0.028123 0.067349
# Yes 0.174783 0.051293 0.159925
# Sat No 0.158048 0.039767 0.235193
# Yes 0.147906 0.061375 0.290095
# Sun No 0.160113 0.042347 0.193226
# Yes 0.187250 0.154134 0.644685
# Thur No 0.160298 0.038774 0.193350
# Yes 0.163863 0.039389 0.151240


# Pass a list of (name, function) tuples to name the computed columns
print(grouped_pct.agg([('name1', 'mean'), ('name2', np.std)]))
# name1 name2
# day smoker
# Fri No 0.151650 0.028123
# Yes 0.174783 0.051293
# Sat No 0.158048 0.039767
# Yes 0.147906 0.061375
# Sun No 0.160113 0.042347
# Yes 0.187250 0.154134
# Thur No 0.160298 0.038774
# Yes 0.163863 0.039389


grouped = tips.groupby(['day', 'smoker'])
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)


print(result)
# tip_pct total_bill
# count mean max count mean max
# day smoker
# Fri No 4 0.151650 0.187735 4 18.420000 22.75
# Yes 15 0.174783 0.263480 15 16.813333 40.17
# Sat No 45 0.158048 0.291990 45 19.661778 48.33
# Yes 42 0.147906 0.325733 42 21.276667 50.81
# Sun No 57 0.160113 0.252672 57 20.506667 48.17
# Yes 19 0.187250 0.710345 19 24.120000 45.35
# Thur No 45 0.160298 0.266312 45 17.113111 41.19
# Yes 17 0.163863 0.241255 17 19.190588 43.11


grouped = tips.groupby(['day', 'smoker'])

# Use a dict to apply different aggregation functions to different columns:
# np.max for the tip column, 'sum' for the size column
print(grouped.agg({'tip' : np.max, 'size' : 'sum'}))
# tip size
# day smoker
# Fri No 3.50 9
# Yes 4.73 31
# Sat No 9.00 115
# Yes 10.00 104
# Sun No 6.00 167
# Yes 6.50 49
# Thur No 6.70 112
# Yes 5.00 40

# Apply a list of functions to tip_pct (producing nested columns) and 'sum' to size
print(grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],'size' : 'sum'}))
# tip_pct size
# min max mean std sum
# day smoker
# Fri No 0.120385 0.187735 0.151650 0.028123 9
# Yes 0.103555 0.263480 0.174783 0.051293 31
# Sat No 0.056797 0.291990 0.158048 0.039767 115
# Yes 0.035638 0.325733 0.147906 0.061375 104
# Sun No 0.059447 0.252672 0.160113 0.042347 167
# Yes 0.065660 0.710345 0.187250 0.154134 49
# Thur No 0.072961 0.266312 0.160298 0.038774 112
# Yes 0.090014 0.241255 0.163863 0.039389 40




print(tips.groupby(['day', 'smoker']).mean())
# total_bill tip size tip_pct
# day smoker
# Fri No 18.420000 2.812500 2.250000 0.151650
# Yes 16.813333 2.714000 2.066667 0.174783
# Sat No 19.661778 3.102889 2.555556 0.158048
# Yes 21.276667 2.875476 2.476190 0.147906
# Sun No 20.506667 3.167895 2.929825 0.160113
# Yes 24.120000 3.516842 2.578947 0.187250
# Thur No 17.113111 2.673778 2.488889 0.160298
# Yes 19.190588 3.030000 2.352941 0.163863


# Return the aggregated data without row indexes
print(tips.groupby(['day', 'smoker'], as_index=False).mean())
# day smoker total_bill tip size tip_pct
# 0 Fri No 18.420000 2.812500 2.250000 0.151650
# 1 Fri Yes 16.813333 2.714000 2.066667 0.174783
# 2 Sat No 19.661778 3.102889 2.555556 0.158048
# 3 Sat Yes 21.276667 2.875476 2.476190 0.147906
# 4 Sun No 20.506667 3.167895 2.929825 0.160113
# 5 Sun Yes 24.120000 3.516842 2.578947 0.187250
# 6 Thur No 17.113111 2.673778 2.488889 0.160298
# 7 Thur Yes 19.190588 3.030000 2.352941 0.163863

Returning aggregated data "without row indexes" means the group labels become ordinary columns and the result gets a plain RangeIndex.
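
A minimal sketch of the equivalence on a toy frame: as_index=False matches aggregating with the default index and then calling reset_index().

import pandas as pd

df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1.0, 2.0, 3.0]})
left = df.groupby('k', as_index=False).mean()
right = df.groupby('k').mean().reset_index()
print(left.equals(right))
# True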


10.3 apply: General split-apply-combine

The most general-purpose GroupBy method is apply.

Recall the split-apply-combine figure from the start of the chapter.

apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.

import numpy as np
import pandas as pd


tips = pd.read_csv('examples/tips.csv')

tips['tip_pct'] = tips['tip'] / tips['total_bill']
print(tips[:6])
# total_bill tip smoker day time size tip_pct
# 0 16.99 1.01 No Sun Dinner 2 0.059447
# 1 10.34 1.66 No Sun Dinner 3 0.160542
# 2 21.01 3.50 No Sun Dinner 3 0.166587
# 3 23.68 3.31 No Sun Dinner 2 0.139780
# 4 24.59 3.61 No Sun Dinner 4 0.146808
# 5 25.29 4.71 No Sun Dinner 4 0.186240


def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

print(top(tips, n=6))
# total_bill tip smoker day time size tip_pct
# 109 14.31 4.00 Yes Sat Dinner 2 0.279525
# 183 23.17 6.50 Yes Sun Dinner 4 0.280535
# 232 11.61 3.39 No Sat Dinner 2 0.291990
# 67 3.07 1.00 Yes Sat Dinner 1 0.325733
# 178 9.60 4.00 Yes Sun Dinner 2 0.416667
# 172 7.25 5.15 Yes Sun Dinner 2 0.710345


# Group by smoker and call apply with this function:
# each group is sorted and its top n rows are returned
print(tips.groupby('smoker').apply(top))
# total_bill tip smoker day time size tip_pct
# smoker
# No 88 24.71 5.85 No Thur Lunch 2 0.236746
# 185 20.69 5.00 No Sun Dinner 5 0.241663
# 51 10.29 2.60 No Sun Dinner 2 0.252672
# 149 7.51 2.00 No Thur Lunch 2 0.266312
# 232 11.61 3.39 No Sat Dinner 2 0.291990
# Yes 109 14.31 4.00 Yes Sat Dinner 2 0.279525
# 183 23.17 6.50 Yes Sun Dinner 4 0.280535
# 67 3.07 1.00 Yes Sat Dinner 1 0.325733
# 178 9.60 4.00 Yes Sun Dinner 2 0.416667
# 172 7.25 5.15 Yes Sun Dinner 2 0.710345


# What happened above:
# the top function was called on each piece of the DataFrame,
# then the results were glued together with pandas.concat,
# labeled with the group names. The result therefore has a hierarchical
# index whose inner level contains index values from the original DataFrame.


# Pass additional arguments through to top
print(tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill'))
# total_bill tip smoker day time size tip_pct
# smoker day
# No Fri 94 22.75 3.25 No Fri Dinner 2 0.142857
# Sat 212 48.33 9.00 No Sat Dinner 4 0.186220
# Sun 156 48.17 5.00 No Sun Dinner 6 0.103799
# Thur 142 41.19 5.00 No Thur Lunch 5 0.121389
# Yes Fri 95 40.17 4.73 Yes Fri Dinner 4 0.117750
# Sat 170 50.81 10.00 Yes Sat Dinner 3 0.196812
# Sun 182 45.35 3.50 Yes Sun Dinner 3 0.077178
# Thur 197 43.11 5.00 Yes Thur Lunch 4 0.115982
# [8 rows x 7 columns]



# Calling describe on the GroupBy:
result = tips.groupby('smoker')['tip_pct'].describe()
print(result)
# count mean std ... 50% 75% max
# smoker ...
# No 151.0 0.159328 0.039910 ... 0.155625 0.185014 0.291990
# Yes 93.0 0.163196 0.085119 ... 0.153846 0.195059 0.710345
# [2 rows x 8 columns]


print(result.unstack('smoker'))
# smoker
# count No 151.000000
# Yes 93.000000
# mean No 0.159328
# Yes 0.163196
# std No 0.039910
# Yes 0.085119
# min No 0.056797
# Yes 0.035638
# 25% No 0.136906
# Yes 0.106771
# 50% No 0.155625
# Yes 0.153846
# 75% No 0.185014
# Yes 0.195059
# max No 0.291990
# Yes 0.710345
# dtype: float64

Why did we say earlier that describe is not strictly an aggregation function?

Inside GroupBy, it is really just a shortcut for the following two lines:

f = lambda x: x.describe()
grouped.apply(f)

Suppressing the group keys

As the preceding examples show, the group keys combine with the original object's index to form a hierarchical index in the result.

Pass group_keys=False to groupby to disable this behavior.

print(tips.groupby('smoker').apply(top))
# total_bill tip smoker day time size tip_pct
# smoker
# No 88 24.71 5.85 No Thur Lunch 2 0.236746
# 185 20.69 5.00 No Sun Dinner 5 0.241663
# 51 10.29 2.60 No Sun Dinner 2 0.252672
# 149 7.51 2.00 No Thur Lunch 2 0.266312
# 232 11.61 3.39 No Sat Dinner 2 0.291990
# Yes 109 14.31 4.00 Yes Sat Dinner 2 0.279525
# 183 23.17 6.50 Yes Sun Dinner 4 0.280535
# 67 3.07 1.00 Yes Sat Dinner 1 0.325733
# 178 9.60 4.00 Yes Sun Dinner 2 0.416667
# 172 7.25 5.15 Yes Sun Dinner 2 0.710345

print(tips.groupby('smoker', group_keys=False).apply(top))
# total_bill tip smoker day time size tip_pct
# 88 24.71 5.85 No Thur Lunch 2 0.236746
# 185 20.69 5.00 No Sun Dinner 5 0.241663
# 51 10.29 2.60 No Sun Dinner 2 0.252672
# 149 7.51 2.00 No Thur Lunch 2 0.266312
# 232 11.61 3.39 No Sat Dinner 2 0.291990
# 109 14.31 4.00 Yes Sat Dinner 2 0.279525
# 183 23.17 6.50 Yes Sun Dinner 4 0.280535
# 67 3.07 1.00 Yes Sat Dinner 1 0.325733
# 178 9.60 4.00 Yes Sun Dinner 2 0.416667
# 172 7.25 5.15 Yes Sun Dinner 2 0.710345

Quantile and bucket analysis

Combining functions that slice data into chunks by bins or sample quantiles (such as cut and qcut) with groupby makes bucket and quantile analysis easy:
the Categorical object returned by cut can be passed directly to groupby.

import numpy as np
import pandas as pd

np.random.seed(888)
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})

print(frame)

# Cut into 4 equal-length buckets
quartiles = pd.cut(frame.data1, 4)
print(quartiles[:10])
# 0 (-1.446, 0.101]
# 1 (0.101, 1.649]
# 2 (0.101, 1.649]
# 3 (-1.446, 0.101]
# 4 (-1.446, 0.101]
# 5 (-1.446, 0.101]
# 6 (0.101, 1.649]
# 7 (0.101, 1.649]
# 8 (-1.446, 0.101]
# 9 (-1.446, 0.101]
# Name: data1, dtype: category
# Categories (4, interval[float64]): [(-2.999, -1.446] < (-1.446, 0.101] < (0.101, 1.649] <
# (1.649, 3.196]]


def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

# Pass the Categorical returned by cut directly to groupby
grouped = frame.data2.groupby(quartiles)

# Aggregate statistics for each category:
print(grouped.apply(get_stats).unstack())
# count max mean min
# data1
# (-2.999, -1.446] 72.0 2.654189 -0.056724 -2.421468
# (-1.446, 0.101] 452.0 3.033133 0.016806 -3.636365
# (0.101, 1.649] 433.0 3.557303 0.028364 -3.362788
# (1.649, 3.196] 43.0 2.086797 0.125453 -1.510303

We call these four categories buckets.
The code above produced aggregate statistics for each of the four buckets; that is bucket analysis.

For quantile analysis, simply use qcut, which produces equal-size buckets based on sample quantiles.

import numpy as np
import pandas as pd

np.random.seed(888)
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})

print(frame)

# Cut into 4 equal-length buckets
quartiles = pd.cut(frame.data1, 4)
print(quartiles[:10])
# 0 (-1.446, 0.101]
# 1 (0.101, 1.649]
# 2 (0.101, 1.649]
# 3 (-1.446, 0.101]
# 4 (-1.446, 0.101]
# 5 (-1.446, 0.101]
# 6 (0.101, 1.649]
# 7 (0.101, 1.649]
# 8 (-1.446, 0.101]
# 9 (-1.446, 0.101]
# Name: data1, dtype: category
# Categories (4, interval[float64]): [(-2.999, -1.446] < (-1.446, 0.101] < (0.101, 1.649] <
# (1.649, 3.196]]


def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

# Pass the Categorical returned by cut directly to groupby
grouped = frame.data2.groupby(quartiles)


print(grouped.apply(get_stats))
# data1
# (-2.999, -1.446] count 72.000000
# max 2.654189
# mean -0.056724
# min -2.421468
# (-1.446, 0.101] count 452.000000
# max 3.033133
# mean 0.016806
# min -3.636365
# (0.101, 1.649] count 433.000000
# max 3.557303
# mean 0.028364
# min -3.362788
# (1.649, 3.196] count 43.000000
# max 2.086797
# mean 0.125453
# min -1.510303
# Name: data2, dtype: float64


# Aggregate statistics for each category:
print(grouped.apply(get_stats).unstack())
# count max mean min
# data1
# (-2.999, -1.446] 72.0 2.654189 -0.056724 -2.421468
# (-1.446, 0.101] 452.0 3.033133 0.016806 -3.636365
# (0.101, 1.649] 433.0 3.557303 0.028364 -3.362788
# (1.649, 3.196] 43.0 2.086797 0.125453 -1.510303



grouping = pd.qcut(frame.data1, 10)
grouped = frame.data2.groupby(grouping)
print(grouped.apply(get_stats).unstack())
# count max mean min
# data1
# (-2.9939999999999998, -1.276] 100.0 2.654189 0.035025 -2.421468
# (-1.276, -0.832] 100.0 3.033133 0.005532 -2.630387
# (-0.832, -0.494] 100.0 2.414314 0.018984 -2.816920
# (-0.494, -0.221] 100.0 2.520152 -0.014124 -2.066315
# (-0.221, 0.0408] 100.0 2.183119 0.003323 -3.636365
# (0.0408, 0.25] 100.0 2.689091 -0.045195 -3.226533
# (0.25, 0.501] 100.0 2.263420 0.088561 -2.077910
# (0.501, 0.829] 100.0 2.376516 0.107365 -2.232905
# (0.829, 1.265] 100.0 3.557303 -0.033168 -3.362788
# (1.265, 3.196] 100.0 2.323520 0.045576 -1.779255



# Pass labels=False to get just the quantile numbers:
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
print(grouped.apply(get_stats).unstack())
# count max mean min
# data1
# 0 100.0 2.654189 0.035025 -2.421468
# 1 100.0 3.033133 0.005532 -2.630387
# 2 100.0 2.414314 0.018984 -2.816920
# 3 100.0 2.520152 -0.014124 -2.066315
# 4 100.0 2.183119 0.003323 -3.636365
# 5 100.0 2.689091 -0.045195 -3.226533
# 6 100.0 2.263420 0.088561 -2.077910
# 7 100.0 2.376516 0.107365 -2.232905
# 8 100.0 3.557303 -0.033168 -3.362788
# 9 100.0 2.323520 0.045576 -1.779255

Example: Filling missing values with group-specific values

import numpy as np
import pandas as pd

np.random.seed(888)
s = pd.Series(np.random.randn(6))
s[::2] = np.nan

print(s)
# 0 NaN
# 1 0.188876
# 2 NaN
# 3 -0.032447
# 4 NaN
# 5 -0.105339
# dtype: float64

# Fill NA with the mean
print(s.fillna(s.mean()))
# 0 0.017030
# 1 0.188876
# 2 0.017030
# 3 -0.032447
# 4 0.017030
# 5 -0.105339
# dtype: float64


states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
print(data)
# Ohio 0.217776
# New York 0.587282
# Vermont NaN
# Florida -1.099947
# Oregon -0.255305
# Nevada NaN
# California 0.162664
# Idaho NaN
# dtype: float64


# Use group_key as the group key
group_key = ['East'] * 4 + ['West'] * 4
# e.g., -0.046321 == (-0.255305 + 0.162664) / 2
print(data.groupby(group_key).mean())
# East -0.098296
# West -0.046321
# dtype: float64


# Fill NA values with the overall mean:
print(data.fillna(data.mean()))
# Ohio 0.217776
# New York 0.587282
# Vermont -0.077506
# Florida -1.099947
# Oregon -0.255305
# Nevada -0.077506
# California 0.162664
# Idaho -0.077506
# dtype: float64


# Fill NA values with the group means:
fill_mean = lambda g: g.fillna(g.mean())
print(data.groupby(group_key).apply(fill_mean))
# Ohio 0.217776
# New York 0.587282
# Vermont -0.098296
# Florida -1.099947
# Oregon -0.255305
# Nevada -0.046321
# California 0.162664
# Idaho -0.046321
# dtype: float64


# Predefine the fill value for each group in a dict,
# looking it up via the group's name attribute
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
print(data.groupby(group_key).apply(fill_func))
# Ohio 0.217776
# New York 0.587282
# Vermont 0.500000
# Florida -1.099947
# Oregon -0.255305
# Nevada -1.000000
# California 0.162664
# Idaho -1.000000
# dtype: float64

Example: Random sampling and permutation

import numpy as np
import pandas as pd


np.random.seed(888)
suits = ['H', 'S', 'C', 'D']
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)


# Each card maps to a point value
card_val = (list(range(1, 11)) + [10] * 3) * 4
deck = pd.Series(card_val, index=cards)

print(deck[:13])
# AH 1
# 2H 2
# 3H 3
# 4H 4
# 5H 5
# 6H 6
# 7H 7
# 8H 8
# 9H 9
# 10H 10
# JH 10
# KH 10
# QH 10
# dtype: int64


# Use the Series sample method:
def draw(deck, n=5):
    return deck.sample(n)


print(draw(deck))
# KH 10
# QS 10
# 6H 6
# 2C 2
# 3C 3
# dtype: int64


# The last character of each card name is its suit
get_suit = lambda card: card[-1]


# Group by suit and draw 2 random cards from each
print(deck.groupby(get_suit).apply(draw, n=2))
# C QC 10
# 2C 2
# D QD 10
# KD 10
# H 3H 3
# 4H 4
# S JS 10
# 3S 3
# dtype: int64


# Drop the group keys from the index
print(deck.groupby(get_suit, group_keys=False).apply(draw, n=2))
# 7C 7
# AC 1
# 4D 4
# QD 10
# 7H 7
# JH 10
# 10S 10
# QS 10
# dtype: int64

Example: Group weighted average and correlation

groupby enables operations between columns of a DataFrame or between two Series, such as a group weighted average.

  • spx_corr = lambda x: x.corrwith(x['SPX']):

    computes the correlation of each column of x with the SPX column.

  • rets = close_px.pct_change().dropna():
    computes the percent change of close_px and drops the NAs.
    (pct_change() compares each element with its predecessor and computes the percent change. By default, pct_change() works down the columns; pass axis=1 to apply it across rows.)

    Percent change: (current value - previous value) / previous value. A tiny demonstration follows.
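
For example, on toy numbers:

import pandas as pd

s = pd.Series([100.0, 110.0, 99.0])
# (110 - 100) / 100 = 0.1; (99 - 110) / 110 = -0.1
print(s.pct_change())
# 0    NaN
# 1    0.1
# 2   -0.1
# dtype: float64
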

import numpy as np
import pandas as pd


np.random.seed(888)
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
                                'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})

print(df)
# category data weights
# 0 a -0.176201 0.132953
# 1 a 0.188876 0.533449
# 2 a 0.826747 0.899478
# 3 a -0.032447 0.248365
# 4 b -0.652499 0.030172
# 5 b -0.105339 0.072447
# 6 b 0.217776 0.874164
# 7 b 0.587282 0.558430


grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

# Group by category, then compute the weighted average of data
print(grouped.apply(get_wavg))
# category
# a 0.448072
# b 0.319831
# dtype: float64

Computing correlations:

import numpy as np
import pandas as pd


close_px = pd.read_csv('examples/stock_px_2.csv', parse_dates=True,
                       index_col=0)

print(close_px.info())
# <class 'pandas.core.frame.DataFrame'>
# DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
# Data columns (total 4 columns):
# AAPL 2214 non-null float64
# MSFT 2214 non-null float64
# XOM 2214 non-null float64
# SPX 2214 non-null float64
# dtypes: float64(4)
# memory usage: 86.5 KB

print(close_px[-4:])
# AAPL MSFT XOM SPX
# 2011-10-11 400.29 27.00 76.27 1195.54
# 2011-10-12 402.19 26.96 77.16 1207.25
# 2011-10-13 408.43 27.18 76.37 1203.66
# 2011-10-14 422.00 27.27 78.11 1224.58


# Correlation of each column of x with the SPX column
spx_corr = lambda x: x.corrwith(x['SPX'])

# Percent change of close_px, with NAs dropped
rets = close_px.pct_change().dropna()

get_year = lambda x: x.year

by_year = rets.groupby(get_year)
print(by_year.apply(spx_corr))
# AAPL MSFT XOM SPX
# 2003 0.541124 0.745174 0.661265 1.0
# 2004 0.374283 0.588531 0.557742 1.0
# 2005 0.467540 0.562374 0.631010 1.0
# 2006 0.428267 0.406126 0.518514 1.0
# 2007 0.508118 0.658770 0.786264 1.0
# 2008 0.681434 0.804626 0.828303 1.0
# 2009 0.707103 0.654902 0.797921 1.0
# 2010 0.710105 0.730118 0.839057 1.0
# 2011 0.691931 0.800996 0.859975 1.0


# Group by year and compute the correlation between AAPL and MSFT
print(by_year.apply(lambda g: g['AAPL'].corr(g['MSFT'])))
# 2003 0.480868
# 2004 0.259024
# 2005 0.300093
# 2006 0.161735
# 2007 0.417738
# 2008 0.611901
# 2009 0.432738
# 2010 0.571946
# 2011 0.581987
# dtype: float64

Example: Group-wise linear regression

Looking at the code above, we learn:
==as long as a function returns a pandas object or a scalar value, it can be used inside a groupby operation==

So all sorts of functions can be plugged into groupby.

import numpy as np
import pandas as pd
import statsmodels.api as sm


close_px = pd.read_csv('examples/stock_px_2.csv', parse_dates=True,index_col=0)

print(close_px[-4:])
# AAPL MSFT XOM SPX
# 2011-10-11 400.29 27.00 76.27 1195.54
# 2011-10-12 402.19 26.96 77.16 1207.25
# 2011-10-13 408.43 27.18 76.37 1203.66
# 2011-10-14 422.00 27.27 78.11 1224.58

rets = close_px.pct_change().dropna()
by_year = rets.groupby(lambda x: x.year)


# Run an ordinary least squares (OLS) regression on each chunk of data
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params


print(by_year.apply(regress, 'AAPL', ['SPX']))
# SPX intercept
# 2003 1.195406 0.000710
# 2004 1.363463 0.004201
# 2005 1.766415 0.003246
# 2006 1.645496 0.000080
# 2007 1.198761 0.003438
# 2008 0.968016 -0.001110
# 2009 0.879103 0.002954
# 2010 1.052608 0.001261
# 2011 0.806605 0.001514

10.4 Pivot Tables and Cross-Tabulation

A pivot table aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.

Ways to build pivot tables

  • The groupby facility described in this chapter, combined with reshape operations (which can exploit hierarchical indexing).

  • The DataFrame pivot_table method.

  • The top-level pandas.pivot_table function.
    (pivot_table's default aggregation type is the mean.)

    (In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.)

pivot_table arguments:

  • values: the column(s) to aggregate
  • index: column(s) to group by, placed in the resulting row index
  • columns: column(s) to group by, placed in the resulting column index
  • aggfunc: the aggregation function, determining how each group is condensed
  • fill_value: replacement value for missing results
  • margins: add partial totals
  • margins_name: name for the totals row/column
  • dropna: whether to drop rows/columns whose entries are all NaN
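
A minimal sketch of the groupby equivalence on a toy frame: pivot_table with the default mean aggregation matches groupby followed by unstack.

import pandas as pd

df = pd.DataFrame({'r': ['x', 'x', 'y'], 'c': ['u', 'v', 'u'],
                   'v': [1.0, 2.0, 3.0]})
pt = df.pivot_table('v', index='r', columns='c', aggfunc='mean')
gb = df.groupby(['r', 'c'])['v'].mean().unstack()
print(pt.equals(gb))
# True
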
import numpy as np
import pandas as pd



tips = pd.read_csv('examples/tips.csv')

tips['tip_pct'] = tips['tip'] / tips['total_bill']
print(tips[:6])
# total_bill tip smoker day time size tip_pct
# 0 16.99 1.01 No Sun Dinner 2 0.059447
# 1 10.34 1.66 No Sun Dinner 3 0.160542
# 2 21.01 3.50 No Sun Dinner 3 0.166587
# 3 23.68 3.31 No Sun Dinner 2 0.139780
# 4 24.59 3.61 No Sun Dinner 4 0.146808
# 5 25.29 4.71 No Sun Dinner 4 0.186240


# Group means by day and smoker:
# day and smoker go in the rows; each cell holds a group mean
print(tips.pivot_table(index=['day', 'smoker']))
# size tip total_bill
# day smoker
# Fri No 2.250000 2.812500 18.420000
# Yes 2.066667 2.714000 16.813333
# Sat No 2.555556 3.102889 19.661778
# Yes 2.476190 2.875476 21.276667
# Sun No 2.929825 3.167895 20.506667
# Yes 2.578947 3.516842 24.120000
# Thur No 2.488889 2.673778 17.113111
# Yes 2.352941 3.030000 19.190588


# Aggregate tip_pct and size, grouping by time and day;
# smoker goes in the columns, time and day in the rows:
print(tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],columns='smoker'))
# size tip_pct
# smoker No Yes No Yes
# time day
# Dinner Fri 2.000000 2.222222 0.139622 0.165347
# Sat 2.555556 2.476190 0.158048 0.147906
# Sun 2.929825 2.578947 0.160113 0.187250
# Thur 2.000000 NaN 0.159744 NaN
# Lunch Fri 3.000000 1.833333 0.187735 0.188937
# Thur 2.500000 2.352941 0.160311 0.163863


# margins=True adds rows and columns labeled All
print(tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],columns='smoker', margins=True))
# size tip_pct
# smoker No Yes All No Yes All
# time day
# Dinner Fri 2.000000 2.222222 2.166667 0.139622 0.165347 0.158916
# Sat 2.555556 2.476190 2.517241 0.158048 0.147906 0.153152
# Sun 2.929825 2.578947 2.842105 0.160113 0.187250 0.166897
# Thur 2.000000 NaN 2.000000 0.159744 NaN 0.159744
# Lunch Fri 3.000000 1.833333 2.000000 0.187735 0.188937 0.188765
# Thur 2.500000 2.352941 2.459016 0.160311 0.163863 0.161301
# All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803


# Passing aggfunc=len gives a cross-tabulation of group sizes
print(tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',aggfunc=len, margins=True))
# day Fri Sat Sun Thur All
# time smoker
# Dinner No 3.0 45.0 57.0 1.0 106.0
# Yes 9.0 42.0 19.0 NaN 70.0
# Lunch No 1.0 NaN NaN 44.0 45.0
# Yes 6.0 NaN NaN 17.0 23.0
# All 19.0 87.0 76.0 62.0 244.0


# fill_value replaces the NAs
print(tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],columns='day', aggfunc='mean', fill_value=0))
# day Fri Sat Sun Thur
# time size smoker
# Dinner 1 No 0.000000 0.137931 0.000000 0.000000
# Yes 0.000000 0.325733 0.000000 0.000000
# 2 No 0.139622 0.162705 0.168859 0.159744
# Yes 0.171297 0.148668 0.207893 0.000000
# 3 No 0.000000 0.154661 0.152663 0.000000
# Yes 0.000000 0.144995 0.152660 0.000000
# 4 No 0.000000 0.150096 0.148143 0.000000
# Yes 0.117750 0.124515 0.193370 0.000000
# 5 No 0.000000 0.000000 0.206928 0.000000
# Yes 0.000000 0.106572 0.065660 0.000000
# 6 No 0.000000 0.000000 0.103799 0.000000
# Lunch 1 No 0.000000 0.000000 0.000000 0.181728
# Yes 0.223776 0.000000 0.000000 0.000000
# 2 No 0.000000 0.000000 0.000000 0.166005
# Yes 0.181969 0.000000 0.000000 0.158843
# 3 No 0.187735 0.000000 0.000000 0.084246
# Yes 0.000000 0.000000 0.000000 0.204952
# 4 No 0.000000 0.000000 0.000000 0.138919
# Yes 0.000000 0.000000 0.000000 0.155410
# 5 No 0.000000 0.000000 0.000000 0.121389
# 6 No 0.000000 0.000000 0.000000 0.173706

Cross-tabulations: crosstab

A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies.

import numpy as np
import pandas as pd
from io import StringIO


data = """\
Sample Nationality Handedness
1 USA Right-handed
2 Japan Left-handed
3 USA Right-handed
4 Japan Right-handed
5 Japan Left-handed
6 Japan Right-handed
7 USA Right-handed
8 USA Left-handed
9 Japan Right-handed
10 USA Right-handed"""


data = pd.read_table(StringIO(data), sep=r'\s+')
print(data)
# Sample Nationality Handedness
# 0 1 USA Right-handed
# 1 2 Japan Left-handed
# 2 3 USA Right-handed
# 3 4 Japan Right-handed
# 4 5 Japan Left-handed
# 5 6 Japan Right-handed
# 6 7 USA Right-handed
# 7 8 USA Left-handed
# 8 9 Japan Right-handed
# 9 10 USA Right-handed


# Frequencies of Handedness grouped by Nationality
print(pd.crosstab(data.Nationality, data.Handedness, margins=True))
# Handedness Left-handed Right-handed All
# Nationality
# Japan 2 3 5
# USA 1 4 5
# All 3 7 10
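
crosstab is a convenience here; reusing the data frame above, roughly the same frequency table (minus the All margins) comes from groupby:

print(data.groupby(['Nationality', 'Handedness']).size().unstack(fill_value=0))
# Handedness   Left-handed  Right-handed
# Nationality
# Japan                  2             3
# USA                    1             4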