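Chapter 12: Advanced pandas

The preceding chapters focused on different kinds of data wrangling workflows and on features of NumPy, pandas, and other libraries. This chapter takes a closer look at some of the more advanced features of pandas.

12.1 Categorical Data

Using categorical data can improve performance and reduce memory use.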

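Background and Motivation

A column in a Series or DataFrame often contains many repeated values. How does a relational database such as MySQL handle repeated values? With foreign keys.
pandas can take a similar approach:
store the distinct values once in a dimension table, and store the observations themselves as integer keys that reference that table.

This integer-based representation is called a categorical or dictionary-encoded representation.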

import numpy as np
import pandas as pd


# A Series containing repeated values
values = pd.Series(['apple', 'orange', 'apple','apple'] * 2)
print(values)
# 0     apple
# 1    orange
# 2     apple
# 3     apple
# 4     apple
# 5    orange
# 6     apple
# 7     apple
# dtype: object

print(pd.unique(values))
# ['apple' 'orange']

print(values.value_counts())  # pd.value_counts(values) is deprecated in recent pandas versions
# apple     6
# orange    2
# dtype: int64



# Take an approach similar to MySQL foreign keys:
# use two Series, one holding integer keys and one holding the distinct values
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])

print(values)
# 0    0
# 1    1
# 2    0
# 3    0
# 4    0
# 5    1
# 6    0
# 7    0
# dtype: int64

print(dim)
# 0     apple
# 1    orange
# dtype: object


print(dim.take([0,1]))
# 0     apple
# 1    orange
# dtype: object


print(dim.take(values))
# 0     apple
# 1    orange
# 0     apple
# 0     apple
# 0     apple
# 1    orange
# 0     apple
# 0     apple
# dtype: object
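
The two-Series construction above can also be produced automatically. The following is a minimal sketch, assuming pd.factorize is an acceptable way to obtain the integer codes and the distinct values; the variable names are illustrative only.

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns integer codes plus the distinct values,
# in order of first appearance
codes, uniques = pd.factorize(values)
print(codes)
print(uniques)

# take() maps the integer codes back to the original strings
restored = pd.Series(uniques.take(codes))
print(restored.tolist() == values.tolist())
```

This is the same round trip as dim.take(values) above, with the dimension table built by pandas instead of by hand.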

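The Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based categorical representation.

Categorical objects:

The main constructor is pd.Categorical().
A second constructor is pd.Categorical.from_codes():
it takes the category codes (the same values as the codes attribute of a Categorical) and the categories.
When creating a Categorical, the from_codes constructor accepts ordered=True to declare that the categories have a meaningful order.
For an existing Categorical, the as_ordered() method returns an ordered version.
For a Series or a DataFrame column, astype('category') converts the data to the category dtype.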

import numpy as np
import pandas as pd


# pd.Categorical: create a pandas.Categorical directly from a Python sequence
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
print(my_categories)
# [foo, bar, baz, foo, bar]
# Categories (3, object): [bar, baz, foo]


# A Categorical has categories and codes attributes:
print(my_categories.categories)
# Index(['bar', 'baz', 'foo'], dtype='object')
print(my_categories.codes)
# [2 0 1 2 0]


categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

# pd.Categorical.from_codes: the second constructor, which takes the codes and the categories
my_cats_2 = pd.Categorical.from_codes(codes, categories)
print(my_cats_2)
# [foo, bar, baz, foo, foo, bar]
# Categories (3, object): [foo, bar, baz]


# Pass ordered=True to from_codes to declare an ordering on the categories
# [foo < bar < baz] in the output indicates that 'foo' precedes 'bar', and so on.
ordered_cat = pd.Categorical.from_codes(codes, categories,ordered=True)
print(ordered_cat)
# [foo, bar, baz, foo, foo, bar]
# Categories (3, object): [foo < bar < baz]


# An existing unordered Categorical can be made ordered with as_ordered()
print(my_cats_2.as_ordered())
# [foo, bar, baz, foo, foo, bar]
# Categories (3, object): [foo < bar < baz]
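
Declaring an order matters because it enables sorting and comparisons by category rank. The sketch below is illustrative only and uses made-up size labels rather than data from these notes.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True))

# Sorting follows the declared category order, not alphabetical order
print(sizes.sort_values().tolist())

# Comparisons against a category work because the categorical is ordered
print((sizes > 'small').tolist())
print(sizes.min(), sizes.max())
```

With an unordered categorical, the comparison with > would raise a TypeError instead.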

import numpy as np
import pandas as pd


np.random.seed(888)
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
print(df)
#    basket_id   fruit  count    weight
# 0          0   apple     13  0.531811
# 1          1  orange      9  2.133796
# 2          2   apple      6  3.597910
# 3          3   apple     10  0.993460
# 4          4   apple     12  0.120687
# 5          5  orange      4  0.289789
# 6          6   apple      3  3.496658
# 7          7   apple      3  2.233721


# Convert the fruit column from strings to categorical
fruit_cat = df['fruit'].astype('category')
print(fruit_cat)
# 0     apple
# 1    orange
# 2     apple
# 3     apple
# 4     apple
# 5    orange
# 6     apple
# 7     apple
# Name: fruit, dtype: category
# Categories (2, object): [apple, orange]


print(type(fruit_cat))
# <class 'pandas.core.series.Series'>


# The values of fruit_cat are a pandas.Categorical instance
c = fruit_cat.values
print(type(c))
# <class 'pandas.core.arrays.categorical.Categorical'>



# Convert the fruit column to categorical and assign the result back to the column
df['fruit'] = df['fruit'].astype('category')
print(df.fruit)
# 0     apple
# 1    orange
# 2     apple
# 3     apple
# 4     apple
# 5    orange
# 6     apple
# 7     apple
# Name: fruit, dtype: category
# Categories (2, object): [apple, orange]

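Computations with Categoricals

The pandas.qcut binning function returns a pandas.Categorical.
The labels argument of pandas.qcut assigns names to the bins.
A Categorical can be wrapped in a Series and passed to groupby to compute group-wise summary statistics.
A Categorical can also be passed to groupby directly.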

import numpy as np
import pandas as pd


np.random.seed(12345)
draws = np.random.randn(1000)
print(draws[:5])
# [-0.20470766  0.47894334 -0.51943872 -0.5557303   1.96578057]

bins = pd.qcut(draws, 4)
print(bins)
# [(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
# Length: 1000
# Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] <
#                                     (0.63, 3.928]]


# qcut returns a pandas.Categorical
print(type(bins))
# <class 'pandas.core.arrays.categorical.Categorical'>


# Use the labels argument to name the bins
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(bins)
# [Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
# Length: 1000
# Categories (4, object): [Q1 < Q2 < Q3 < Q4]

print(bins.codes[:10])
# [1 2 1 1 3 3 2 2 3 3]


# Wrap the Categorical in a Series (the Series has dtype category)
bins = pd.Series(bins, name='quartile')
print(bins.head())
# 0    Q2
# 1    Q3
# 2    Q2
# 3    Q2
# 4    Q4
# Name: quartile, dtype: category


# Pass the category-dtype Series to groupby to compute summary statistics per bin
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
print(results)
#   quartile  count       min       max
# 0       Q1    250 -2.949343 -0.685484
# 1       Q2    250 -0.683066 -0.010115
# 2       Q3    250 -0.010032  0.628894
# 3       Q4    250  0.634238  3.927528

Passing the Categorical directly to groupby:

import numpy as np
import pandas as pd


np.random.seed(12345)
draws = np.random.randn(1000)
bins = pd.qcut(draws, 4)


# Use the labels argument to name the bins
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(bins)
# [Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
# Length: 1000
# Categories (4, object): [Q1 < Q2 < Q3 < Q4]


# Pass the Categorical itself (not a Series) to groupby to compute summary statistics per bin.
# The grouping key has no name here, so reset_index() labels the key column 'index'.
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
print(results)
#   index  count       min       max
# 0    Q1    250 -2.949343 -0.685484
# 1    Q2    250 -0.683066 -0.010115
# 2    Q3    250 -0.010032  0.628894
# 3    Q4    250  0.634238  3.927528
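
One detail worth knowing when grouping by a categorical is how categories that never occur in the data are handled. The sketch below uses a small made-up example and the observed argument of groupby; it is an illustration, not part of the original notes.

```python
import pandas as pd

values = pd.Series([1, 2, 3])
# 'c' is a declared category that never occurs in the data
keys = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])

# observed=False keeps every declared category, including empty ones
with_empty = values.groupby(keys, observed=False).sum()
# observed=True keeps only the categories that actually appear
only_seen = values.groupby(keys, observed=True).sum()

print(with_empty)
print(only_seen)
```

Passing observed explicitly also avoids depending on the default, which has changed across pandas versions.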

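Better Performance with Categoricals

If you do a lot of analytics on a particular dataset, converting it to categorical can yield substantial performance gains.

A categorical column is typically faster to work with, and much smaller in memory, than the same column stored as string labels.
GroupBy operations on categoricals can be significantly faster
(because the underlying algorithms work on the integer codes array rather than on an array of strings).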

import numpy as np
import pandas as pd


np.random.seed(888)
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
# Convert the labels to categorical
categories = labels.astype('category')

# Memory used by the string labels
print(labels.memory_usage())
# 80000080

# Memory used by the categorical version
print(categories.memory_usage())
# 10000272
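
The comparison can be repeated at a smaller scale. The sketch below is an assumption-laden variant of the example above: it uses a much smaller N so it runs quickly, and passes deep=True so that the string objects themselves are counted. Exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

N = 100_000  # much smaller than the example above, so it runs quickly
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# deep=True also counts the memory of the string data itself
label_bytes = labels.memory_usage(deep=True)
category_bytes = categories.memory_usage(deep=True)
print(label_bytes, category_bytes)

# The categorical stores one small integer code per row
print(categories.cat.codes.dtype)
```

With only four distinct values, each row needs a single int8 code, which is where most of the saving comes from.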

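Categorical Methods

A Series containing categorical data has several special methods, similar to the Series.str string methods.

The cat attribute of a categorical Series is the entry point to these methods
(through cat, the Series exposes the methods of the underlying Categorical).
pd.Categorical() creates a Categorical object.
Series.astype('category') converts a Series to the category dtype.
Series.cat.set_categories replaces the set of categories, which can add new categories or drop existing ones.
Series.cat.remove_unused_categories drops categories that are not used
(that is, categories that do not appear anywhere in the data).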

import numpy as np
import pandas as pd


s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
print(cat_s)
# 0    a
# 1    b
# 2    c
# 3    d
# 4    a
# 5    b
# 6    c
# 7    d
# dtype: category
# Categories (4, object): [a, b, c, d]


# The cat attribute of a categorical Series gives access to the categorical methods
print(cat_s.cat.codes)
# 0    0
# 1    1
# 2    2
# 3    3
# 4    0
# 5    1
# 6    2
# 7    3
# dtype: int8

print(cat_s.cat.categories)
# Index(['a', 'b', 'c', 'd'], dtype='object')


# The categories so far are ['a', 'b', 'c', 'd']; the full set also includes 'e'
actual_categories = ['a', 'b', 'c', 'd', 'e']
# set_categories replaces the categories, adding new ones or dropping old ones
cat_s2 = cat_s.cat.set_categories(actual_categories)
print(cat_s2)
# 0    a
# 1    b
# 2    c
# 3    d
# 4    a
# 5    b
# 6    c
# 7    d
# dtype: category
# Categories (5, object): [a, b, c, d, e]


print(cat_s.value_counts())
# d    2
# c    2
# b    2
# a    2
# dtype: int64

# cat_s2 has one extra category, e, with a count of 0
print(cat_s2.value_counts())
# d    2
# c    2
# b    2
# a    2
# e    0
# dtype: int64


print(cat_s.isin(['a', 'b']))
# 0     True
# 1     True
# 2    False
# 3    False
# 4     True
# 5     True
# 6    False
# 7    False
# dtype: bool


# Filtering only removes the rows for c and d; both remain in the categories
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
print(cat_s3)
# 0    a
# 1    b
# 4    a
# 5    b
# dtype: category
# Categories (4, object): [a, b, c, d]


# Drop unused categories (those that no longer appear in the data)
# After dropping c and d, only a and b remain
print(cat_s3.cat.remove_unused_categories())
# 0    a
# 1    b
# 4    a
# 5    b
# dtype: category
# Categories (2, object): [a, b]

[Table: methods available through the cat accessor]
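
A few more methods of the cat accessor can be tried on a tiny Series. This is a small sketch with made-up data; it covers only add_categories, rename_categories and reorder_categories, not the full set of methods.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')

# add_categories appends new categories without changing the data
added = s.cat.add_categories(['c'])

# rename_categories relabels the categories, and therefore the values
renamed = s.cat.rename_categories({'a': 'alpha', 'b': 'beta'})

# reorder_categories changes the category order; the codes change to match
reordered = s.cat.reorder_categories(['b', 'a'], ordered=True)

print(added.cat.categories.tolist())
print(renamed.tolist())
print(reordered.cat.codes.tolist())
```

Each call returns a new Series and leaves s unchanged.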

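Creating Dummy Variables for Modeling

When you use statistics or machine learning tools, you often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with one column per distinct category; each column contains 1 in the rows that belong to that category and 0 elsewhere.

The pandas.get_dummies function converts categorical data into a DataFrame of dummy variables: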

import numpy as np
import pandas as pd


cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
print(cat_s)
# 0    a
# 1    b
# 2    c
# 3    d
# 4    a
# 5    b
# 6    c
# 7    d
# dtype: category
# Categories (4, object): [a, b, c, d]

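# pandas.get_dummies converts categorical data into a DataFrame of dummy variables:
# a becomes (1, 0, 0, 0), b becomes (0, 1, 0, 0), and so on
# (recent pandas versions print True/False here unless dtype=int is passed)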
print(pd.get_dummies(cat_s))
#    a  b  c  d
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
# 4  1  0  0  0
# 5  0  1  0  0
# 6  0  0  1  0
# 7  0  0  0  1
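
In practice the dummy columns usually need to be attached to the rest of the data. The sketch below assumes a hypothetical frame with a key column and one numeric column x; the prefix and dtype arguments are standard get_dummies parameters.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'c', 'a'], 'x': range(4)})

# prefix names the new columns; dtype=int requests 0/1 rather than booleans
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Attach the indicator columns to the remaining data
with_dummies = df[['x']].join(dummies)
print(with_dummies)
```

Every row has exactly one indicator set, which is what makes this a one-hot encoding.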

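12.2 Advanced GroupBy Use

Group Transforms and "Unwrapped" GroupBys

The apply method of a GroupBy object applies an arbitrary function to each group.

The transform method
is similar to apply, but places more constraints on the kind of function that can be used:

It can produce a scalar value to be broadcast to the shape of the group
It can produce an object of the same shape as the input group
It must not mutate its input

In short, the biggest difference between transform and an ordinary GroupBy aggregation is this:
the result of transform has the same shape as the original data.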

import pandas as pd
import numpy as np


df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
print(df)
#    key  value
# 0    a    0.0
# 1    b    1.0
# 2    c    2.0
# 3    a    3.0
# 4    b    4.0
# 5    c    5.0
# 6    a    6.0
# 7    b    7.0
# 8    c    8.0
# 9    a    9.0
# 10   b   10.0
# 11   c   11.0


# Group by key
g = df.groupby('key').value
print(g.mean())
# key
# a    4.5
# b    5.5
# c    6.5
# Name: value, dtype: float64




# Produce a Series with the same shape as df['value'], with each value replaced by the mean of its group
# (the result of transform has the same shape as the original data)
print(g.transform(lambda x: x.mean()))
# 0     4.5
# 1     5.5
# 2     6.5
# 3     4.5
# 4     5.5
# 5     6.5
# 6     4.5
# 7     5.5
# 8     6.5
# 9     4.5
# 10    5.5
# 11    6.5
# Name: value, dtype: float64


print(g.transform('mean'))
# 0     4.5
# 1     5.5
# 2     6.5
# 3     4.5
# 4     5.5
# 5     6.5
# 6     4.5
# 7     5.5
# 8     6.5
# 9     4.5
# 10    5.5
# 11    6.5
# Name: value, dtype: float64


print(g.transform(lambda x: x * 2))
# 0      0.0
# 1      2.0
# 2      4.0
# 3      6.0
# 4      8.0
# 5     10.0
# 6     12.0
# 7     14.0
# 8     16.0
# 9     18.0
# 10    20.0
# 11    22.0
# Name: value, dtype: float64


print(g.transform(lambda x: x.rank(ascending=False)))
# 0     4.0
# 1     4.0
# 2     4.0
# 3     3.0
# 4     3.0
# 5     3.0
# 6     2.0
# 7     2.0
# 8     2.0
# 9     1.0
# 10    1.0
# 11    1.0
# Name: value, dtype: float64


def normalize(x):
    return (x - x.mean()) / x.std()


# For a function like this, transform and apply give equivalent results:
print(g.transform(normalize))
# 0    -1.161895
# 1    -1.161895
# 2    -1.161895
# 3    -0.387298
# 4    -0.387298
# 5    -0.387298
# 6     0.387298
# 7     0.387298
# 8     0.387298
# 9     1.161895
# 10    1.161895
# 11    1.161895
# Name: value, dtype: float64


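# The same computation with GroupBy.apply
# (recent pandas versions may add the group key to the index of the result
# unless group_keys=False is passed to groupby)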
print(g.apply(normalize))
# 0    -1.161895
# 1    -1.161895
# 2    -1.161895
# 3    -0.387298
# 4    -0.387298
# 5    -0.387298
# 6     0.387298
# 7     0.387298
# 8     0.387298
# 9     1.161895
# 10    1.161895
# 11    1.161895
# Name: value, dtype: float64


# Built-in aggregations such as 'mean' or 'sum' are often much faster than apply or transform with a custom function
print(g.transform('mean'))
# 0     4.5
# 1     5.5
# 2     6.5
# 3     4.5
# 4     5.5
# 5     6.5
# 6     4.5
# 7     5.5
# 8     6.5
# 9     4.5
# 10    5.5
# 11    6.5
# Name: value, dtype: float64


normalized = (df['value'] - g.transform('mean')) / g.transform('std')
print(normalized)
# 0    -1.161895
# 1    -1.161895
# 2    -1.161895
# 3    -0.387298
# 4    -0.387298
# 5    -0.387298
# 6     0.387298
# 7     0.387298
# 8     0.387298
# 9     1.161895
# 10    1.161895
# 11    1.161895
# Name: value, dtype: float64
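
The same "unwrapped" pattern, arithmetic between a column and the output of a built-in transform, works for other group statistics. A minimal sketch with made-up data, computing each row's share of its group total:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 3.0, 2.0, 6.0]})

# transform('sum') broadcasts each group's total back to the rows of that group
group_total = df.groupby('key')['value'].transform('sum')

# Ordinary arithmetic between aligned Series then gives the per-group share
df['share'] = df['value'] / group_total
print(df)
```

Because transform preserves the original index, the result can be assigned straight back as a new column.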

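Grouped Time Resampling

For time series data, the resample method is semantically a group operation based on a time intervalization.

To combine resampling with another group key, use a pandas.Grouper object (called pandas.TimeGrouper in old pandas versions, a name that has since been removed).
When the Grouper is given only a frequency, the time must be the index of the Series or DataFrame.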

import pandas as pd
import numpy as np


N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times,'value': np.arange(N)})

print(df)
#                   time  value
# 0  2017-05-20 00:00:00      0
# 1  2017-05-20 00:01:00      1
# 2  2017-05-20 00:02:00      2
# 3  2017-05-20 00:03:00      3
# 4  2017-05-20 00:04:00      4
# 5  2017-05-20 00:05:00      5
# 6  2017-05-20 00:06:00      6
# 7  2017-05-20 00:07:00      7
# 8  2017-05-20 00:08:00      8
# 9  2017-05-20 00:09:00      9
# 10 2017-05-20 00:10:00     10
# 11 2017-05-20 00:11:00     11
# 12 2017-05-20 00:12:00     12
# 13 2017-05-20 00:13:00     13
# 14 2017-05-20 00:14:00     14


# Set time as the index, then resample
print(df.set_index('time').resample('5min').count())
#                      value
# time                      
# 2017-05-20 00:00:00      5
# 2017-05-20 00:05:00      5
# 2017-05-20 00:10:00      5



df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})
print(df2[:7])
#                  time key  value
# 0 2017-05-20 00:00:00   a    0.0
# 1 2017-05-20 00:00:00   b    1.0
# 2 2017-05-20 00:00:00   c    2.0
# 3 2017-05-20 00:01:00   a    3.0
# 4 2017-05-20 00:01:00   b    4.0
# 5 2017-05-20 00:01:00   c    5.0
# 6 2017-05-20 00:02:00   a    6.0


# To resample in the same way for each value of key,
# use a pandas.Grouper object (pd.TimeGrouper in old pandas versions, since removed):
time_key = pd.Grouper(freq='5min')

resampled = (df2.set_index('time')
             .groupby(['key', time_key])
             .sum())
print(resampled)
#                          value
# key time                      
# a   2017-05-20 00:00:00   30.0
#     2017-05-20 00:05:00  105.0
#     2017-05-20 00:10:00  180.0
# b   2017-05-20 00:00:00   35.0
#     2017-05-20 00:05:00  110.0
#     2017-05-20 00:10:00  185.0
# c   2017-05-20 00:00:00   40.0
#     2017-05-20 00:05:00  115.0
#     2017-05-20 00:10:00  190.0


print(resampled.reset_index())
#   key                time  value
# 0   a 2017-05-20 00:00:00   30.0
# 1   a 2017-05-20 00:05:00  105.0
# 2   a 2017-05-20 00:10:00  180.0
# 3   b 2017-05-20 00:00:00   35.0
# 4   b 2017-05-20 00:05:00  110.0
# 5   b 2017-05-20 00:10:00  185.0
# 6   c 2017-05-20 00:00:00   40.0
# 7   c 2017-05-20 00:05:00  115.0
# 8   c 2017-05-20 00:10:00  190.0
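
The index requirement can be sidestepped by telling the Grouper which column holds the timestamps. The sketch below rebuilds the same df2 and uses the key argument of pd.Grouper; treat it as one possible alternative rather than the approach of the original notes.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

# key='time' tells Grouper which column holds the timestamps,
# so the frame does not need a DatetimeIndex
resampled = (df2.groupby(['key', pd.Grouper(key='time', freq='5min')])['value']
             .sum())
print(resampled)
```

The sums match the set_index version above, for example 30.0 for key a in the first five-minute bin.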

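12.3 Techniques for Method Chaining

When applying a sequence of transformations to a dataset, you may find yourself creating numerous temporary variables that are never used in the analysis itself.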

import numpy as np
import pandas as pd


# load_data() stands in for whatever code loads your DataFrame
df = load_data()
df2 = df[df['col2'] < 0]
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
result = df2.groupby('key').col1_demeaned.std()

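Using DataFrame.assign:

DataFrame.assign is a functional alternative to column assignments of the form df[k] = v. Rather than modifying the object in place, it returns a new DataFrame with the indicated modifications.

The following two approaches are equivalent: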

import numpy as np
import pandas as pd


# The usual, non-functional way
df2 = df.copy()
df2['k'] = v

# Using assign,
# which returns a modified copy of the DataFrame
df2 = df.assign(k=v)

assign makes method chaining convenient:

result = (df2.assign(col1_demeaned=df2.col1 - df2.col1.mean())
          .groupby('key')
          .col1_demeaned.std())

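The earlier example can therefore be rewritten as a chain, with one caveat.

In that example, the two statements

df = load_data()
df2 = df[df['col2'] < 0]

cannot simply be rewritten as

df2 = load_data()[load_data()['col2'] < 0]

because that would load the data twice.

To support chaining, assign and many other pandas functions accept function-like arguments, also known as callables.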

import numpy as np
import pandas as pd


# The two snippets below are equivalent

# Select the rows where col2 < 0
df = load_data()
df2 = df[df['col2'] < 0]


# Pass a callable instead
df = (load_data()[lambda x: x['col2'] < 0])

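Here the result of load_data is not assigned to a variable, so the function passed into [ ] is bound to the object at that stage of the chain.

The whole earlier example can now be written as a single chain: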

import pandas as pd
import numpy as np

# The two snippets below are equivalent

df = load_data()
df2 = df[df['col2'] < 0]
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
result = df2.groupby('key').col1_demeaned.std()



result = (load_data()
          [lambda x: x.col2 < 0]
          .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
          .groupby('key')
          .col1_demeaned.std())
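
The equivalence can be checked on a small frame. In the sketch below, load_data is a hypothetical stand-in that returns made-up data, and the step-by-step version adds .copy() so that the column assignment does not target a slice of df.

```python
import pandas as pd


def load_data():
    # Hypothetical stand-in for real loading code
    return pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a', 'b'],
                         'col1': [1.0, 2.0, 3.0, 5.0, 4.0, 9.0],
                         'col2': [-1, -2, -3, -4, 1, -5]})


# Step-by-step version; .copy() avoids assigning into a slice of df
df = load_data()
df2 = df[df['col2'] < 0].copy()
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
stepwise = df2.groupby('key').col1_demeaned.std()

# Chained version; each lambda receives the DataFrame produced by the previous step
chained = (load_data()
           [lambda x: x.col2 < 0]
           .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
           .groupby('key')
           .col1_demeaned.std())

print(stepwise.equals(chained))
```

The lambdas matter here: inside the chain there is no variable name for the filtered frame, so x.col1.mean() is the only way to refer to the mean of the filtered rows.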

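The pipe Method

When you want to use your own functions, or functions from third-party libraries, inside a method chain, use the pipe method.
In short, f(df) becomes df.pipe(f).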

import pandas as pd
import numpy as np


a = f(df, arg1=v1)
b = g(a, v2, arg3=v3)
c = h(b, arg4=v4)


# f(df) and df.pipe(f) are equivalent, but pipe makes chained invocation easier.
result = (df.pipe(f, arg1=v1)
          .pipe(g, v2, arg3=v3)
          .pipe(h, arg4=v4))

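pipe has another benefit:
it lets you generalize a repeated sequence of operations into a reusable function.

Suppose we want to wrap the following operation in a function:

g = df.groupby(['key1', 'key2'])
df['col1'] = df['col1'] - g['col1'].transform('mean')

The code is as follows: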

import pandas as pd
import numpy as np


# Generalize the operation into a function that can be used with pipe
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result


# Call the function as part of a method chain
result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))
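
To see group_demean work end to end, it can be run on a small frame. The data below is made up for illustration; the function body is the one defined above.

```python
import pandas as pd


def group_demean(df, by, cols):
    # Subtract each group's mean from the listed columns, leaving the input untouched
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result


df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'key2': ['x', 'x', 'y', 'y'],
                   'col1': [1.0, 3.0, 10.0, 20.0]})

# pipe passes df as the first argument, followed by the extra arguments
result = df.pipe(group_demean, ['key1', 'key2'], ['col1'])
print(result)
```

Because the function copies its input, df itself still holds the original col1 values after the call.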

Python for Data Analysis, part 9: Chapter 12, Advanced pandas
