第5章 pandas入门

pandas与NumPy最大的不同:

  • pandas是专⻔为处理表格和混杂数据设计的。
  • NumPy更适合处理统⼀的数值数组数据。

常用的导入约定:

1
2
import pandas as pd
from pandas import Series,DataFrame

5.1 pandas的数据结构介绍

Series:
类似于⼀维数组的对象,它由⼀组数据(各种NumPy数据类型)以及⼀组与之相关的数据标签(即索引)组成。仅由⼀组数据即可产⽣最简单的Series:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import pandas as pd
from pandas import Series,DataFrame

obj = Series([5,6,0,1.2,-9])
# 索引在左边,值在右边
print(obj)
# 0 5.0
# 1 6.0
# 2 0.0
# 3 1.2
# 4 -9.0
# dtype: float64

# values属性获取数组表示形式
print(obj.values)
# [ 5. 6. 0. 1.2 -9. ]

# index属性获取数组的索引对象
print(obj.index)
# RangeIndex(start=0, stop=5, step=1)

修改index值:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import pandas as pd
from pandas import Series,DataFrame

# 修改index
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
# d 4
# b 7
# a -5
# c 3
# dtype: int64

print(obj2.index)
# Index(['d', 'b', 'a', 'c'], dtype='object')


print(obj2['a']) # -5
print(obj2.get('a')) # -5
print(obj2.get('XXX','hyl')) # hyl

print(obj2[['a','d','c']])
# a -5
# d 4
# c 3
# dtype: int64

['a','d','c']被称为索引列表

当get使用索引列表时:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
print(obj2.get(['a','d','c']))
# a -5
# d 4
# c 3
# dtype: int64

# 找不到时返回NaN
# 注意:未来版本找不到时会返回KeyError
print(obj2.get(['a','d','XX']))
# a -5.0
# d 4.0
# XX NaN
# dtype: float64

print(obj2.get(['YY','ZZ','XX']))
# None

# default参数不起效
print(obj2.get(['a','d','XXX'],'没有'))
# a -5.0
# d 4.0
# XXX NaN
# dtype: float64

Series能使用使⽤NumPy函数或类似NumPy的运算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
from pandas import Series,DataFrame
import numpy as np

# 修改index
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
# d 4
# b 7
# a -5
# c 3
# dtype: int64

print(obj2[obj2>0])
# d 4
# b 7
# c 3
# dtype: int64

print(obj2*2)
# d 8
# b 14
# a -10
# c 6
# dtype: int64

print(np.exp(obj2))
# d 54.598150
# b 1096.633158
# a 0.006738
# c 20.085537
# dtype: float64

Series其实很像py里的字典.

如果只传⼊⼀个字典,则结果Series中的索引就是原字典的键
(有序排列)。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
from pandas import Series,DataFrame

# 修改index
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
# d 4
# b 7
# a -5
# c 3
# dtype: int64

# 检测索引是否在obj2里
print('a' in obj2) # True

# 传入字典创建Series对象
aDict = {'hyl':300,'dsz':500,'czj':400}
obj3 = Series(aDict)

print(obj3)
# hyl 300
# dsz 500
# czj 400
# dtype: int64


sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
print(obj4)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64

可以传⼊排好序的字典的键以改变顺序:

1
2
3
4
5
6
7
8
9
10
11
from pandas import Series,DataFrame

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
print(obj4)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64

在这个例⼦中,sdata中跟states索引相匹配的那3个值会被找出来并放到相应的位置上,但由于California所对应的sdata值找不到,所以其结果就为NaN.因为Utah不在states中,它被从结果中除去。

使用isnull和notnull判断是否为空:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import pandas as pd
from pandas import Series,DataFrame

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
print(obj4)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64

print(pd.isnull(obj4))
# California True
# Ohio False
# Oregon False
# Texas False
# dtype: bool

print(pd.notnull(obj4))
# California False
# Ohio True
# Oregon True
# Texas True
# dtype: bool

Series也有类似方法obj4.isnull(),obj4.notnull()

Series最重要的⼀个功能是,它会根据运算的索引标签⾃动对⻬数据:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import pandas as pd
from pandas import Series,DataFrame

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)

print(obj3)
# Ohio 35000
# Texas 71000
# Oregon 16000
# Utah 5000
# dtype: int64

print(obj4)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64

print(obj3 + obj4)
# California NaN
# Ohio 70000.0
# Oregon 32000.0
# Texas 142000.0
# Utah NaN
# dtype: float64

我们发现,只要obj3和obj4同时都拥有某个索引的情况下才会自动对齐计算,否则返回NaN.
==可以认为是类似数据库inner join的操作==。

Series对象本身及其索引都有⼀个name属性

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd
from pandas import Series,DataFrame

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)

# Series的name属性
obj4.name = 'population'
# Series.index的name属性
obj4.index.name = 'state'

print(obj4)
# state
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# Name: population, dtype: float64

Series的索引可以通过赋值的⽅式就地修改:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import pandas as pd
from pandas import Series,DataFrame

obj = Series([5,6,0,1.2,-9])
print(obj)
# 0 5.0
# 1 6.0
# 2 0.0
# 3 1.2
# 4 -9.0
# dtype: float64

obj.index = ['a','b','c','d','e']
print(obj)
# a 5.0
# b 6.0
# c 0.0
# d 1.2
# e -9.0
# dtype: float64

DataFrame
DataFrame是⼀个表格型的数据结构,它含有⼀组有序的列,每列可以是不同的值类型(数值、字符串、布尔值等)。

DataFrame既有⾏索引也有列索引,==DataFrame可以被看做由Series组成的字典==(共⽤同⼀个索引)。
所以说,DataFrame是以⼆维结构保存数据的.

创建DataFrame:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
from pandas import Series,DataFrame

# 传入等长列表
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = DataFrame(data)

# 就像Series一样,所有的行都会自动添加索引
print(frame)
# state year pop
# 0 Ohio 2000 1.5
# 1 Ohio 2001 1.7
# 2 Ohio 2002 3.6
# 3 Nevada 2001 2.4
# 4 Nevada 2002 2.9
# 5 Nevada 2003 3.2

# head⽅法会选取前五⾏:
print(frame.head())
# state year pop
# 0 Ohio 2000 1.5
# 1 Ohio 2001 1.7
# 2 Ohio 2002 3.6
# 3 Nevada 2001 2.4
# 4 Nevada 2002 2.9

就像Series指定index,就会按照index的顺序排列.
如果指定了列序列,则DataFrame的列就会按照指定顺序进⾏排列:

1
2
3
4
5
6
7
8
9
10
11
from pandas import Series,DataFrame

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
print(obj4)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from pandas import Series,DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = DataFrame(data, columns=['year', 'state','debt'])
print(frame)
# year state debt
# 0 2000 Ohio NaN
# 1 2001 Ohio NaN
# 2 2002 Ohio NaN
# 3 2001 Nevada NaN
# 4 2002 Nevada NaN
# 5 2003 Nevada NaN

上面代码,砍掉了字段pop,所以frame就没有了这个字段.同时添加了新的debt字段,这个传入的列在数据中找不到,就会在结果中产生缺失值NaN

上面使用columns指定了列名,还可以使用index指定行名:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four','five', 'six'])
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 NaN
# two 2001 Ohio 1.7 NaN
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 NaN
# five 2002 Nevada 2.9 NaN
# six 2003 Nevada 3.2 NaN

print(frame2.columns)
# Index(['year', 'state', 'pop', 'debt'], dtype='object')

刚刚说过:DataFrame可以被看做由Series组成的字典.
也就是说,==DataFrame类似于一个字典,这个字典的键列名,字典的值就是Series对象==.
所以我们可以根据列名获取DataFrame的Series对象:

  • frame[‘year’]方式
  • frame.year点号方式
  • frame.get(‘year’)方式
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
from pandas import Series,DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = DataFrame(data, columns=['year', 'state','debt'])

print(frame.year)
# 0 2000
# 1 2001
# 2 2002
# 3 2001
# 4 2002
# 5 2003
# Name: year, dtype: int64

print(frame.get('state'))
# 0 Ohio
# 1 Ohio
# 2 Ohio
# 3 Nevada
# 4 Nevada
# 5 Nevada
# Name: state, dtype: object

print(frame['year'])
# 0 2000
# 1 2001
# 2 2002
# 3 2001
# 4 2002
# 5 2003
# Name: year, dtype: int64

# 同理能使用frame.state[:2],frame.get('state')[:2]
print(frame['year'][:2])
# 0 2000
# 1 2001
# Name: year, dtype: int64

上面的方法是获取列,那么如何获取行呢?
使用loc属性:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four','five', 'six'])
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 NaN
# two 2001 Ohio 1.7 NaN
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 NaN
# five 2002 Nevada 2.9 NaN
# six 2003 Nevada 3.2 NaN

print(frame2.loc['three'])
# year 2002
# state Ohio
# pop 3.6
# debt NaN
# Name: three, dtype: object

注意,返回的Series拥有原DataFrame相同的索引,且其name属性也已经被相应地设置好了。

也就是说,如果想用索引,不能像py的列表一样直接使用frame['year'][2],这里的2是DataFrame的行索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four','five', 'six'])
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 NaN
# two 2001 Ohio 1.7 NaN
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 NaN
# five 2002 Nevada 2.9 NaN
# six 2003 Nevada 3.2 NaN

# 先列后行
print(frame2['year']['one']) # 2000
print(frame2.get('year')['five']) # 2002
print(frame2.year.loc['six']) # 2003
print(frame2.year.six) # 2003
print(frame2.get('year').get('five'))# 2002

# 先行后列
print(frame2.loc['six'].year) # 2003
print(frame2.loc['six']['year']) # 2003
print(frame2.loc['six'].get('year')) # 2003

# 先行后列必须使用.loc['six'],其他的不行,下面都会出错
# print(frame2.six.year)
# print(frame2['one']['year'])
# print(frame2.get('five').get('year'))

总结获取行和列的方法:

  • 获取列:
    1. frame.year
    2. frame.get(‘year’)
    3. frame[‘year’]
  • 获取行:
    1. frame.loc[‘six’]
    2. frame[‘six’]
    3. frame.get(‘six’)
    4. frame.six

可以先获取行再获取列,也可以先获取列再获取行.
==如果先获取行再获取列,那么获取行时必须使用frame.loc['six']==

DateFrame的值都是可读可写:
我们可以修改列,修改行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four','five', 'six'])
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 NaN
# two 2001 Ohio 1.7 NaN
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 NaN
# five 2002 Nevada 2.9 NaN
# six 2003 Nevada 3.2 NaN


# 修改列
frame2['debt'] = 16.5
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 16.5
# two 2001 Ohio 1.7 16.5
# three 2002 Ohio 3.6 16.5
# four 2001 Nevada 2.4 16.5
# five 2002 Nevada 2.9 16.5
# six 2003 Nevada 3.2 16.5


frame2['debt'] = np.arange(6.)
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 0.0
# two 2001 Ohio 1.7 1.0
# three 2002 Ohio 3.6 2.0
# four 2001 Nevada 2.4 3.0
# five 2002 Nevada 2.9 4.0
# six 2003 Nevada 3.2 5.0


# 修改行
aDict = {'year':9999,'state':'9999','pop':9999,'debt':9999}
row = pd.Series(aDict)
frame2.loc['one'] = row
print(frame2)
# year state pop debt
# one 9999 9999 9999.0 9999.0
# two 2001 Ohio 1.7 1.0
# three 2002 Ohio 3.6 2.0
# four 2001 Nevada 2.4 3.0
# five 2002 Nevada 2.9 4.0
# six 2003 Nevada 3.2 5.0


# 使用index精确匹配某行.
# 如果长度和DataFrame的⻓度不一致,那么所有的空行都将被填上缺失值
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)
# year state pop debt
# one 9999 9999 9999.0 NaN
# two 2001 Ohio 1.7 -1.2
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 -1.5
# five 2002 Nevada 2.9 -1.7
# six 2003 Nevada 3.2 NaN


# 为不存在的列赋值会创建出一个新列
frame2['eastern'] = (frame2.state == 'Ohio')
print(frame2)
# year state pop debt eastern
# one 9999 9999 9999.0 NaN False
# two 2001 Ohio 1.7 -1.2 True
# three 2002 Ohio 3.6 NaN True
# four 2001 Nevada 2.4 -1.5 False
# five 2002 Nevada 2.9 -1.7 False
# six 2003 Nevada 3.2 NaN False


# 使用del方法删除列
del frame2['eastern']
print(frame2)
# year state pop debt
# one 9999 9999 9999.0 NaN
# two 2001 Ohio 1.7 -1.2
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 -1.5
# five 2002 Nevada 2.9 -1.7
# six 2003 Nevada 3.2 NaN


# 修改行索引
frame2.index = ['(1)', '(2)', '(3)', '(4)','(5)', '(6)']
print(frame2)
# year state pop debt
# (1) 9999 9999 9999.0 NaN
# (2) 2001 Ohio 1.7 -1.2
# (3) 2002 Ohio 3.6 NaN
# (4) 2001 Nevada 2.4 -1.5
# (5) 2002 Nevada 2.9 -1.7
# (6) 2003 Nevada 3.2 NaN


# 修改列索引
frame2.columns = ['(year)', '(state)', '(pop)', '(debt)']
print(frame2)
# (year) (state) (pop) (debt)
# (1) 9999 9999 9999.0 NaN
# (2) 2001 Ohio 1.7 -1.2
# (3) 2002 Ohio 3.6 NaN
# (4) 2001 Nevada 2.4 -1.5
# (5) 2002 Nevada 2.9 -1.7
# (6) 2003 Nevada 3.2 NaN

注意:

  • eastern是新建的列,所以不能再⽤frame2.eastern创建新的列。
  • 通过索引⽅式返回的列是相应数据的视图,并不是副本。因此,对返回的Series所做的任何就地修改全都会反映到源DataFrame上。通过Series的copy⽅法即可指定复制列。

如果嵌套字典传给DataFrame,pandas就会被解释为:
外层字典的键作为列,内层键则作为⾏索引
(缺失值就会被填充为NaN):

1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd
import numpy as np


pop = {'col1': {'row1': 2.4, 'row2': 2},
'col2': {'row1': 1.7, 'row2': 3.6,'row3': 1.5}}

frame3 = pd.DataFrame(pop)
print(frame3)
# col1 col2
# row1 2.4 1.7
# row2 2.0 3.6
# row3 NaN 1.5

转置DateFrame:

1
2
3
4
print(frame3.T)
# row1 row2 row3
# col1 2.4 2.0 NaN
# col2 1.7 3.6 1.5

DataFrame构造函数所能接受的各种数据

6874701

就像Series的indx属性具有那么属性一样.
DataFrame本身和index,columns属性也有name属性

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
print(frame3)
# Nevada Ohio
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6

frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
# state Nevada Ohio
# year
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6

跟Series⼀样,values属性也会以⼆维ndarray的形式返回DataFrame中的数据:
如果DataFrame各列的数据类型不同,则值数组的dtype就会选⽤能兼容所有列的数据类型:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four','five', 'six'])
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 NaN
# two 2001 Ohio 1.7 NaN
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 NaN
# five 2002 Nevada 2.9 NaN
# six 2003 Nevada 3.2 NaN

# 返回二维ndarray
print(frame2.values)
# [[2000 'Ohio' 1.5 nan]
# [2001 'Ohio' 1.7 nan]
# [2002 'Ohio' 3.6 nan]
# [2001 'Nevada' 2.4 nan]
# [2002 'Nevada' 2.9 nan]
# [2003 'Nevada' 3.2 nan]]

# dtypes会变成能兼容所有列的数据类型:
print(frame2.dtypes)
# dtype: object

索引对象

pandas的索引对象负责管理轴标签和其他元数据(⽐如轴名称等)。
构建Series或DataFrame时,所⽤到的任何数组或其他序列的标签都会被转换成⼀个Index.

1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
a = obj.index

print(a)
# Index(['a', 'b', 'c'], dtype='object')

print(type(a))
# <class 'pandas.core.indexes.base.Index'>

print(a[1:])
# Index(['b', 'c'], dtype='object')

Index对象是不可变的,这使Index对象能在多个数据结构之间安全共享:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import pandas as pd
import numpy as np

# 创建一个Index对象
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj_index = obj.index


# Index对象不可改变,发生TypeError
# a[1] = 'd'


# 将已经有的Index对象传给Series对象
obj2 = pd.Series([1.5, -2.5, 0], index=obj_index)
print(obj2)
# a 1.5
# b -2.5
# c 0.0
# dtype: float64

print(obj2.index is obj_index)
# True

DataFrame的columns其实就是Index对象

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd
import numpy as np

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
print(frame3)
# Nevada Ohio
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6

print(frame3.columns)
# Index(['Nevada', 'Ohio'], dtype='object')

print('Ohio' in frame3.columns)
# True

print(2002 in frame3.index)
# True

print(2003 in frame3.index)
# False

Index可以包含重复的标签:选择重复的标签,会显示所有的结果。

1
2
3
4
5
6
7
import pandas as pd
import numpy as np

# 创建含有重复标签的Index
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
print(dup_labels)
# Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Index的方法和属性

68747537


5.2 基本功能

操作Series和DataFrame中的数据的基本⼿段。

重新索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import pandas as pd
import numpy as np

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
# d 4.5
# b 7.2
# a -5.3
# c 3.6
# dtype: float64

# 重新索引,如果存在原先则保留,不存在则引入NaN
# 注意这里把d索引丢掉了
obj2 = obj.reindex(['a', 'b', 'c', 'e'])
print(obj2)
# a -5.3
# b 7.2
# c 3.6
# e NaN
# dtype: float64

总结一下:

  • 原来就有的索引就保留
  • 原来没有的索引就添加,并且数据都是NaN
  • 多余的全部扔掉
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pandas as pd
import numpy as np

# 快速创建DataFrame的方式
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
print(frame)
# Ohio Texas California
# a 0 1 2
# c 3 4 5
# d 6 7 8

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
# Ohio Texas California
# a 0.0 1.0 2.0
# b NaN NaN NaN
# c 3.0 4.0 5.0
# d 6.0 7.0 8.0

states = ['Texas', 'Utah', 'California']
frame3 = frame.reindex(columns=states)
print(frame3)
# Texas Utah California
# a 1 NaN 2
# c 4 NaN 5
# d 7 NaN 8

使用ffill实现向前填充值:ffill很适用于像时间序列这样的有序数据

ffill:即front fill,说明还有bfill:back fill

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import pandas as pd


obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)
# 0 blue
# 2 purple
# 4 yellow
# dtype: object

# 使⽤ffill可以实现前向值填充:
obj4 = obj3.reindex(range(6), method='ffill')
print(obj4)
# 0 blue
# 1 blue
# 2 purple
# 3 purple
# 4 yellow
# 5 yellow
# dtype: object

# 使⽤bfill可以实现后向值填充:
obj5 = obj3.reindex(range(6), method='bfill')
print(obj5)
# 0 blue
# 1 purple
# 2 purple
# 3 yellow
# 4 yellow
# 5 NaN
# dtype: object

reindex函数的各参数及说明

68747470

丢弃指定轴上的项

drop⽅法返回的是⼀个在指定轴上删除了指定值的新对象:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import pandas as pd
import numpy as np


obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# e 4.0
# dtype: float64

new_obj = obj.drop('c')
print(new_obj)
# a 0.0
# b 1.0
# d 3.0
# e 4.0
# dtype: float64

data1 = obj.drop(['d', 'c'])
print(data1)
# a 0.0
# b 1.0
# e 4.0
# dtype: float64

上面是Series对象删除对象的方法.DataFrame一样可以使用drop:
但是==obj.drop(['d', 'c'])只能用于删除行,删除列必须使用axis参数==(因为默认axis=0)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import pandas as pd
import numpy as np


data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
print(data)
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15


# 删除行(默认axis=0)
data2 = data.drop(['Colorado', 'Ohio'])
print(data2)
# one two three four
# Utah 8 9 10 11
# New York 12 13 14 15


# 删除列传入axis=1或者axis='columns'
data3 = data2.drop('one',axis=1)
print(data3)
# two three four
# Utah 9 10 11
# New York 13 14 15


data4 = data3.drop(['two','four'],axis='columns')
print(data4)
# three
# Utah 10
# New York 14

索引、选取和过滤

Series索引(obj[…])的⼯作⽅式类似于NumPy数组的索引,只不过Series的索引值不只是整数。

简单来说:

  • Series的索引,切片可以使用py里的整数,也可以使用Series的标签来索引,切片
  • 使用标签来索引,切片,注意包含下上限
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# dtype: float64

print(obj['b'])
# 1.0

print(obj[1])
# 1.0

print(obj[2:4])
# c 2.0
# d 3.0
# dtype: float64

print(obj[['b', 'a', 'd']])
# b 1.0
# a 0.0
# d 3.0
# dtype: float64

print(obj[[1, 3]])
# b 1.0
# d 3.0
# dtype: float64

print(obj[obj < 2])
# a 0.0
# b 1.0
# dtype: float64


# 利⽤标签的切⽚包含上限
print(obj['b':'c'])
# b 1.0
# c 2.0
# dtype: float64

# 和py的list一样,索引和切片都是可读可写
obj['b':'c'] = 5
print(obj)
# a 0.0
# b 5.0
# c 5.0
# d 3.0
# dtype: float64

其实我们能把DataFrame当成NumPy⼆维数组:
他们的语法很像

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

print(data)
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15

print(data['two'])
# Ohio 1
# Colorado 5
# Utah 9
# New York 13
# Name: two, dtype: int32

print(data[['three', 'one']])
# three one
# Ohio 2 0
# Colorado 6 4
# Utah 10 8
# New York 14 12

# 使用切片选取行
print(data[:2])
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7

# 使用布尔型选取数据
print(data[data['three'] > 5])
# one two three four
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15


# 使用比较运算扩散到全部选区
print(data < 5)
# one two three four
# Ohio True True True True
# Colorado True False False False
# Utah False False False False
# New York False False False False


# 使用布尔值运算将小于5的元素全部改为0
data[data < 5] = 0
print(data)
# one two three four
# Ohio 0 0 0 0
# Colorado 0 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15

⽤loc和iloc进⾏选取

⾏的标签索引:

  • loc:轴标签
  • iloc:整数索引
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

print(data)
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15


# 选取单行多列
print(data.loc['Colorado', ['two', 'three']])
# two 5
# three 6
# Name: Colorado, dtype: int32


# 选取多行多列
print(data.loc[['Colorado','Ohio'], ['two', 'three']])
# two three
# Colorado 5 6
# Ohio 1 2


# 选取索引值为2的行,索引值为3,0,1的列
print(data.iloc[2, [3, 0, 1]])
# four 11
# one 8
# two 9
# Name: Utah, dtype: int32


# 选取索引值为2的行
print(data.iloc[2])
# one 8
# two 9
# three 10
# four 11
# Name: Utah, dtype: int32

# 选取索引值为1,2的行,索引值为3,0,1的列
print(data.iloc[[1, 2], [3, 0, 1]])
# four one two
# Colorado 7 4 5
# Utah 11 8 9

这两个索引函数也适⽤于⼀个标签或多个标签的切⽚:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

print(data)
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15


# 就像前面说的,当使用标签索引时,包含上下限
print(data.loc[:'Utah', 'two'])
# Ohio 1
# Colorado 5
# Utah 9
# Name: two, dtype: int32


# 选取所有行和第0,1,2列
print(data.iloc[:, :3])
# one two three
# Ohio 0 1 2
# Colorado 4 5 6
# Utah 8 9 10
# New York 12 13 14


# 选取所有行和第0,1,2列,然后再筛选掉three列小于5的行
print(data.iloc[:, :3][data.three > 5])
# one two three
# Colorado 4 5 6
# Utah 8 9 10
# New York 12 13 14


# 链式调用(可用,但不推荐)
print(data.iloc[:, :3][data.three > 5][data.one > 5][data.two > 10])
# one two three
# New York 12 13 14

DataFrame的索引选项:

68747437

整数索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import pandas as pd
import numpy as np

people = ['hyl','dsz','gzr','czj']

ser = pd.Series(people,index=[1,2,3,4])
print(ser)
# 1 hyl
# 2 dsz
# 3 gzr
# 4 czj
# dtype: object

# 产生歧义,是选择标签索引的1还是位置索引的1
# 事实上选择的是标签的1
print(ser[1])
# hyl

# 为了避免歧义,当标签索引是整数时,索引为-1不可用,会发生KeyError
# print(ser[-1])


ser2 = pd.Series(np.arange(1,5))
print(ser2)
# 0 1
# 1 2
# 2 3
# 3 4
# dtype: int32


# 为了避免歧义,当没有标签索引时,索引为-1不可用,会发生KeyError
# print(ser2[-1])


# 当标签明确不为整数时,索引为-1可用
ser3 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
print(ser3[-1])
# 2.0

造成上面的原因是:
为了进⾏统⼀,==如果标签索引含有整数,数据选取总会使⽤标签。==

为了更准确,请使⽤loc(标签)或iloc(整数):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd
import numpy as np

people = ['hyl','dsz','gzr','czj']

ser = pd.Series(people,index=[1,2,3,4])
print(ser)
# 1 hyl
# 2 dsz
# 3 gzr
# 4 czj
# dtype: object

print(ser[:1])
# 1 hyl
# dtype: object

print(ser.loc[:1])
# 1 hyl
# dtype: object

print(ser.iloc[:1])
# 1 hyl
# dtype: object

算术运算和数据对⻬

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import pandas as pd
import numpy as np

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])

print(s1)
# a 7.3
# c -2.5
# d 3.4
# e 1.5
# dtype: float64

print(s2)
# a -2.1
# c 3.6
# e -1.5
# f 4.0
# g 3.1
# dtype: float64

# 存在不同的索引对,则结果的索引就是该索引对的并集
# 有点类似于数据库的全外连接
print(s1+s2)
# a 5.2
# c 1.1
# d NaN
# e 0.0
# f NaN
# g NaN
# dtype: float64
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(df1)
# b c d
# Ohio 0.0 1.0 2.0
# Texas 3.0 4.0 5.0
# Colorado 6.0 7.0 8.0

print(df2)
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0

print(df1 + df2)
# b c d e
# Colorado NaN NaN NaN NaN
# Ohio 3.0 NaN 6.0 NaN
# Oregon NaN NaN NaN NaN
# Texas 9.0 NaN 12.0 NaN
# Utah NaN NaN NaN NaN

pandas的算术运算有点类似于数据库的全外连接:
如果存在不同的索引对,则取索引对的并集.
并且⾃动的数据对⻬操作在不重叠的索引处引⼊了NaN值。缺失值会在算术运算过程中传播。

在算术⽅法中填充值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))

# 将其中一个元素设置为nan
df2.loc[1, 'b'] = np.nan

print(df1)
# a b c d
# 0 0.0 1.0 2.0 3.0
# 1 4.0 5.0 6.0 7.0
# 2 8.0 9.0 10.0 11.0

print(df2)
# a b c d e
# 0 0.0 1.0 2.0 3.0 4.0
# 1 5.0 NaN 7.0 8.0 9.0
# 2 10.0 11.0 12.0 13.0 14.0
# 3 15.0 16.0 17.0 18.0 19.0


print(df1 + df2)
# a b c d e
# 0 0.0 2.0 4.0 6.0 NaN
# 1 9.0 NaN 13.0 15.0 NaN
# 2 18.0 20.0 22.0 24.0 NaN
# 3 NaN NaN NaN NaN NaN

# 在计算之前,把所有的NaN都改为0
print(df1.add(df2, fill_value=0))
# a b c d e
# 0 0.0 2.0 4.0 6.0 4.0
# 1 9.0 5.0 13.0 15.0 9.0
# 2 18.0 20.0 22.0 24.0 14.0
# 3 15.0 16.0 17.0 18.0 19.0

Series和DataFrame的算术⽅法。

747269702537

r开头的方法会翻转参数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))

print(1/df1)
# a b c d
# 0 inf 1.000000 0.500000 0.333333
# 1 0.250000 0.200000 0.166667 0.142857
# 2 0.125000 0.111111 0.100000 0.090909

print(df1.rdiv(1))
# a b c d
# 0 inf 1.000000 0.500000 0.333333
# 1 0.250000 0.200000 0.166667 0.142857
# 2 0.125000 0.111111 0.100000 0.090909

对Series或DataFrame重新索引时,也可以指定⼀
个填充值fill_value:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))

print(df1)
# a b c d
# 0 0.0 1.0 2.0 3.0
# 1 4.0 5.0 6.0 7.0
# 2 8.0 9.0 10.0 11.0

print(df2)
# a b c d e
# 0 0.0 1.0 2.0 3.0 4.0
# 1 5.0 6.0 7.0 8.0 9.0
# 2 10.0 11.0 12.0 13.0 14.0
# 3 15.0 16.0 17.0 18.0 19.0


x = df1.reindex(columns=df2.columns, fill_value=0)
print(x)
# a b c d e
# 0 0.0 1.0 2.0 3.0 0
# 1 4.0 5.0 6.0 7.0 0
# 2 8.0 9.0 10.0 11.0 0

DataFrame和Series之间的运算

DataFrame和Series之间的关系就是narray二维数组和一维数组的关系,所以,我们可以使用narray二维数组和一维数组的运算类比DataFrame和Series之间的运算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import pandas as pd
import numpy as np

arr = np.arange(12.).reshape((3, 4))
print(arr)
# [[ 0. 1. 2. 3.]
# [ 4. 5. 6. 7.]
# [ 8. 9. 10. 11.]]

print(arr[0])
# [0. 1. 2. 3.]

# 广播,会扩散到所有选区
print(arr - arr[0])
# [[0. 0. 0. 0.]
# [4. 4. 4. 4.]
# [8. 8. 8. 8.]]

DataFrame和Series之间的算术运算:

  • 将Series的索引匹配到DataFrame的列,然后沿着⾏⼀直向下⼴播.
  • 如果某个索引值在DataFrame的列或Series的索引中找不到,则参与运算的两个对象就会被重新索引以形成并集:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

print(frame)
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0

print(series)
# b 0.0
# d 1.0
# e 2.0
# Name: Utah, dtype: float64

print(frame - series)
# b d e
# Utah 0.0 0.0 0.0
# Ohio 3.0 3.0 3.0
# Texas 6.0 6.0 6.0
# Oregon 9.0 9.0 9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])

print(series2)
# b 0
# e 1
# f 2
# dtype: int64

print(frame + series2)
# b d e f
# Utah 0.0 NaN 3.0 NaN
# Ohio 3.0 NaN 6.0 NaN
# Texas 6.0 NaN 9.0 NaN
# Oregon 9.0 NaN 12.0 NaN

如果你希望匹配⾏且在列上⼴播,则必须使⽤算术运算⽅法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

series3 = frame['d']

print(frame)
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0

print(series3)
# Utah 1.0
# Ohio 4.0
# Texas 7.0
# Oregon 10.0
# Name: d, dtype: float64

# axis='index'或者axis=0
print(frame.sub(series3, axis='index'))
# b d e
# Utah -1.0 0.0 1.0
# Ohio -1.0 0.0 1.0
# Texas -1.0 0.0 1.0
# Oregon -1.0 0.0 1.0

函数应⽤和映射

NumPy的ufuncs(元素级数组⽅法)也可⽤于操作pandas对象

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame)
# b d e
# Utah -0.967495 -0.247553 1.483448
# Ohio -0.256602 -0.881130 -0.732070
# Texas 0.379607 0.959606 1.048860
# Oregon 0.225625 0.292465 -0.406547

print(np.abs(frame))
# b d e
# Utah 0.967495 0.247553 1.483448
# Ohio 0.256602 0.881130 0.732070
# Texas 0.379607 0.959606 1.048860
# Oregon 0.225625 0.292465 0.406547

# 还可以使用py里的匿名函数
f = lambda x: x.max() - x.min()

print(frame.apply(f))
# b 2.290285
# d 0.572808
# e 2.553155
# dtype: float64


# 用例还可以使用axis='columns'
print(frame.apply(f,axis='columns'))
# Utah 2.410839
# Ohio 1.082203
# Texas 1.472795
# Oregon 2.333930
# dtype: float64

我们甚至可以使用函数来创建新的Series:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame)
# b d e
# Utah -1.176057 -0.783786 -0.543606
# Ohio 0.777264 1.151499 -1.428496
# Texas 0.036316 0.648344 -2.312486
# Oregon 1.286525 -1.263399 1.165449


def f(x):
return pd.Series([x.min(), x.max()], index=['min', 'max'])

print(frame.apply(f))
# b d e
# min -1.176057 -1.263399 -2.312486
# max 1.286525 1.151499 1.165449

我们还可以使用元素级的python函数:
这时要使用applymap函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame)
# b d e
# Utah -1.322140 0.974776 -0.832036
# Ohio 2.061849 0.307122 0.085941
# Texas -1.420665 -0.193728 -0.401482
# Oregon -0.703217 0.103967 -0.689496

# 元素级的python函数
format = lambda x: '%.2f' % x
data = frame.applymap(format)

print(data)
# b d e
# Utah -1.32 0.97 -0.83
# Ohio 2.06 0.31 0.09
# Texas -1.42 -0.19 -0.40
# Oregon -0.70 0.10 -0.69

applymap函数是DataFrame的元素级应用函数,
map函数就是Series的元素级应用函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame)
# b d e
# Utah -0.817937 -0.802316 -0.293776
# Ohio 0.413619 -0.647371 -0.615123
# Texas -0.469394 0.663940 0.440380
# Oregon 1.413984 0.933500 1.031781


format = lambda x: '%.2f' % x

print(frame['e'].map(format))
# Utah -0.29
# Ohio -0.62
# Texas 0.44
# Oregon 1.03
# Name: e, dtype: object

需要注意的是:
许多最为常⻅的数组统计功能都被实现成DataFrame的⽅法(如sum和mean),因此⽆需使⽤apply⽅法。

排序和排名

  • 按索引进行排序:
    sort_index
  • 按值进行排序:
    sort_values
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import pandas as pd

# 对索引进行排序
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj)
# d 0
# a 1
# b 2
# c 3
# dtype: int64

print(obj.sort_index())
# a 1
# b 2
# c 3
# d 0
# dtype: int64
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import pandas as pd
import numpy as np


frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
print(frame)
# d a b c
# three 0 1 2 3
# one 4 5 6 7

# 对索引进行排序,默认axis=0
print(frame.sort_index())
# d a b c
# one 4 5 6 7
# three 0 1 2 3

print(frame.sort_index(axis=1))
# a b c d
# three 1 2 3 0
# one 5 6 7 4


# 默认升序,可以使用ascending=False实现降序
print(frame.sort_index(axis=1, ascending=False))
# d c b a
# three 0 3 2 1
# one 4 7 6 5

按值进行排序:
在排序时,任何缺失值默认都会被放到Series的末尾

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import pandas as pd

# 按值进行排序
obj = pd.Series([4, 7, -3, 2])
print(obj)
# 0 4
# 1 7
# 2 -3
# 3 2
# dtype: int64

print(obj.sort_values())
# 2 -3
# 3 2
# 0 4
# 1 7
# dtype: int64
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import pandas as pd
import numpy as np

# 按值进行排序
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
print(obj)
# 0 4.0
# 1 NaN
# 2 7.0
# 3 NaN
# 4 -3.0
# 5 2.0
# dtype: float64

print(obj.sort_values())
# 4 -3.0
# 5 2.0
# 0 4.0
# 2 7.0
# 1 NaN
# 3 NaN
# dtype: float64

主要关键字/次要关键字排序:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import pandas as pd
import numpy as np

# 按值进行排序
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame)
# b a
# 0 4 0
# 1 7 1
# 2 -3 0
# 3 2 1

print(frame.sort_values(by='b'))
# b a
# 2 -3 0
# 3 2 1
# 0 4 0
# 1 7 1

print(frame.sort_values(by=['a', 'b']))
# b a
# 2 -3 0
# 0 4 0
# 3 2 1
# 1 7 1

rank⽅法:
rank是默认通过为各组分配⼀个平均排名的⽅式来排序的.
可以通过关键字method得到其他排位方式method=‘first’,‘max’等.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
import pandas as pd
import numpy as np


obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj)
# 0 7
# 1 -5
# 2 7
# 3 4
# 4 2
# 5 0
# 6 4
# dtype: int64

# rank默认的排序方式是取平均值
# rank 表示在这个数在原来的Series中排第几名,有相同的数,取其排名平均(默认)作为值。
# 在obj中,4和4的排名是第4名和第五名,取平均得4.5。7和7的排名分别是第六名和第七名,则其排名取平均得6.5
print(obj.rank())
# 0 6.5
# 1 1.0
# 2 6.5
# 3 4.5
# 4 3.0
# 5 2.0
# 6 4.5
# dtype: float64


# 更改排序方式
# 根据值在原数据中出现的顺序给出排名:
print(obj.rank(method='first'))
# 0 6.0
# 1 1.0
# 2 7.0
# 3 4.0
# 4 3.0
# 5 2.0
# 6 5.0
# dtype: float64


# 使用降序
print(obj.rank(ascending=False, method='max'))
# 0 2.0
# 1 7.0
# 2 2.0
# 3 4.0
# 4 5.0
# 5 6.0
# 6 4.0
# dtype: float64

method方法的选项:

Tables Are
average 默认:在线等分组中,为各个值分配平均排名
min 使用整个分组的最小排名
max 使用整个分组的最大排名
first 按值的原始数据中的出现顺序分配排名
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
import pandas as pd
import numpy as np

frame = pd.DataFrame({'b': [4.3, 7, -3, 2],
'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})

print(frame)
# b a c
# 0 4.3 0 -2.0
# 1 7.0 1 5.0
# 2 -3.0 0 8.0
# 3 2.0 1 -2.5

print(frame.rank(axis='columns'))
# b a c
# 0 3.0 2.0 1.0
# 1 3.0 1.0 2.0
# 2 1.0 2.0 3.0
# 3 3.0 2.0 1.0

print(frame.rank(axis=0))
# b a c
# 0 3.0 1.5 2.0
# 1 4.0 3.5 3.0
# 2 1.0 1.5 4.0
# 3 2.0 3.5 1.0

print(frame.rank(axis=1))
# b a c
# 0 3.0 2.0 1.0
# 1 3.0 1.0 2.0
# 2 1.0 2.0 3.0
# 3 3.0 2.0 1.0

带有重复标签的轴索引

  • 如果一个索引对应多个值,则返回⼀个Series
  • 如果一个索引对应单个值,则返回⼀个标量值
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pandas as pd
import numpy as np

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

print(obj)
# a 0
# a 1
# b 2
# b 3
# c 4
# dtype: int64


# 通过索引的is_unique属性可以判断他的值是否唯一
print(obj.index.is_unique)
# False


# 如果一个索引对应多个值,则返回⼀个Series
# 如果一个索引对应单个值,则返回⼀个标量值
print(obj['a'])
# a 0
# a 1
# dtype: int64

print(obj['c'])
# 4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
print(df)
# 0 1 2
# a 0.827146 -1.291108 -0.068053
# a -0.614791 -2.891564 -1.103444
# b 2.066921 -2.490818 0.991229
# b -1.383593 1.431614 -0.888320

print(df.loc['b'])
# 0 1 2
# b 2.066921 -2.490818 0.991229
# b -1.383593 1.431614 -0.888320

5.3汇总和计算描述统计

pandas函数都是基于没有缺失数据的假设⽽构建的:

  • NaN会自动被忽略(NaN视为0)
  • 如果整个切片都是NaN,那么会返回NaN
  • 通过skipna来禁用该功能
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
import pandas as pd
import numpy as np


df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])

print(df)
# one two
# a 1.40 NaN
# b 7.10 -4.5
# c NaN NaN
# d 0.75 -1.3


# 自动忽略NaN值(NaN视为0)
print(df.sum())
# one 9.25
# two -5.80
# dtype: float64


print(df.sum(axis='columns'))
# a 1.40
# b 2.60
# c 0.00
# d -0.55
# dtype: float64


print(df.mean(axis='columns'))
# a 1.400
# b 1.300
# c NaN
# d -0.275
# dtype: float64


# 使用skipna禁用'将NaN忽略',这样NaN就会被传递
print(df.mean(axis='columns', skipna=False))
# a NaN
# b 1.300
# c NaN
# d -0.275
# dtype: float64


# 最大值的索引,同理还有idxmin
print(df.idxmax())
# one b
# two d
# dtype: object


# 累加
print(df.cumsum())
# one two
# a 1.40 NaN
# b 8.50 -4.5
# c NaN NaN
# d 9.25 -5.8


# 一次性产生多个汇总统计
print(df.describe())
# one two
# count 3.000000 2.000000
# mean 3.083333 -2.900000
# std 3.493685 2.262742
# min 0.750000 -4.500000
# 25% 1.075000 -3.700000
# 50% 1.400000 -2.900000
# 75% 4.250000 -2.100000
# max 7.100000 -1.300000

需要注意的是累加:

  • 就算是skipna=True,结果中一样有NaN
    (不过也没差,反正这个NaN就等于上面的值或左边的值)
  • 如果skipna=False,根据NaN传递原则,那么NaN接下来的数据都会是NaN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import pandas as pd
import numpy as np


df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])

print(df)
# one two
# a 1.40 NaN
# b 7.10 -4.5
# c NaN NaN
# d 0.75 -1.3


print(df.cumsum(axis=0))
# one two
# a 1.40 NaN
# b 8.50 -4.5
# c NaN NaN
# d 9.25 -5.8

print(df.cumsum(axis=1))
# one two
# a 1.40 NaN
# b 7.10 2.60
# c NaN NaN
# d 0.75 -0.55


print(df.cumsum(axis=1,skipna=False))
# one two
# a 1.40 NaN
# b 7.10 2.60
# c NaN NaN
# d 0.75 -0.55

# NaN接下去的数据都是NaN
print(df.cumsum(axis=0,skipna=False))
# one two
# a 1.4 NaN
# b 8.5 NaN
# c NaN NaN
# d NaN NaN

对于⾮数值型数据,describe会产⽣另外⼀种汇总统计:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import pandas as pd
import numpy as np

# 对于⾮数值型数据,describe会产⽣另外⼀种汇总统计:
obj = pd.Series(['a', 'a', 'b', 'c'] * 3)
print(obj)
# 0 a
# 1 a
# 2 b
# 3 c
# 4 a
# 5 a
# 6 b
# 7 c
# 8 a
# 9 a
# 10 b
# 11 c
# dtype: object

print(obj.describe())
# count 12
# unique 3
# top a
# freq 6
# dtype: object

所有与描述统计相关的方法。

69702537

相关系数与协⽅差

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
import pandas as pd
import numpy as np
import pandas_datareader.data as web

all_data = {ticker: web.get_data_yahoo(ticker)
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

# 获取DataFrame对象
price = pd.DataFrame({ticker: data['Adj Close']
for ticker, data in all_data.items()})
# 获取DataFrame对象
volume = pd.DataFrame({ticker: data['Volume']
for ticker, data in all_data.items()})

returns = price.pct_change()

# 计算价格的百分比变化
print(returns.tail())
# AAPL GOOG IBM MSFT
# Date
# 2016-10-17 -0.000680 0.001837 0.002072 -0.003483
# 2016-10-18 -0.000681 0.019616 -0.026168 0.007690
# 2016-10-19 -0.002979 0.007846 0.003583 -0.002255
# 2016-10-20 -0.000512 -0.005652 0.001719 -0.004867
# 2016-10-21 -0.003930 0.003011 -0.012474 0.042096


# Series的corr⽅法⽤于计算两个Series中重叠的、⾮NA的、按索引对⻬的值的相关系数。
print(returns['MSFT'].corr(returns['IBM']))
# 0.49976361144151144

# cov⽤于计算协⽅差
print(returns['MSFT'].cov(returns['IBM']))
# 8.8706554797035462e-05


# 使用MSTF用更简洁的语法选择列
print(returns.MSFT.corr(returns.IBM))
# 0.49976361144151144


# corr⽅法:以DataFrame的形式返回完整的相关系数:
print(returns.corr())
# AAPL GOOG IBM MSFT
# AAPL 1.000000 0.407919 0.386817 0.389695
# GOOG 0.407919 1.000000 0.405099 0.465919
# IBM 0.386817 0.405099 1.000000 0.499764
# MSFT 0.389695 0.465919 0.499764 1.000000


# cov⽅法:以DataFrame的形式返回完整的协⽅差矩阵:
print(returns.cov())
# AAPL GOOG IBM MSFT
# AAPL 0.000277 0.000107 0.000078 0.000095
# GOOG 0.000107 0.000251 0.000078 0.000108
# IBM 0.000078 0.000078 0.000146 0.000089
# MSFT 0.000095 0.000108 0.000089 0.000215


# corrwith⽅法:计算其列或⾏跟另⼀个Series或DataFrame之间的相关系数。
# 传⼊⼀个Series将会返回⼀个相关系数值Series(针对各列进⾏计算):
print(returns.corrwith(returns.IBM))
# AAPL 0.386817
# GOOG 0.405099
# IBM 1.000000
# MSFT 0.499764
# dtype: float64

# 传⼊⼀个DataFrame则会计算按列名配对的相关系数。
# 这⾥,计算百分⽐变化与成交量的相关系数:
print(returns.corrwith(volume))
# AAPL -0.075565
# GOOG -0.007067
# IBM -0.204849
# MSFT -0.092950
# dtype: float64

唯⼀值、值计数以及成员资格

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
import pandas as pd
import numpy as np

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# unique方法:Series中的唯⼀值数组
print(obj.unique())
# ['c' 'a' 'd' 'b']

# value_counts方法:计算⼀个Series中各值出现的频率
# 默认按值频率降序排列
print(obj.value_counts())
# a 3
# c 3
# b 2
# d 1
# dtype: int64

# 使用sort不排序
print(pd.value_counts(obj.values, sort=False))
# b 2
# c 3
# d 1
# a 3
# dtype: int64

# 判断集合的成员资格
mask = obj.isin(['b', 'c'])
print(mask)
# 0 True
# 1 False
# 2 False
# 3 False
# 4 False
# 5 True
# 6 True
# 7 True
# 8 True
# dtype: bool

# 通过传入布尔型Series过滤出需要的数据
print(obj[mask])
# 0 c
# 5 b
# 6 b
# 7 c
# 8 c
# dtype: object

Index.get_indexer⽅法:
功能和isin类似:
获取to_match在unique_vals中的位置

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
import numpy as np


to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

print(pd.Index(unique_vals))
# Index(['c', 'b', 'a'], dtype='object')

print(pd.Index(unique_vals).get_indexer(to_match))
# [0 2 1 1 0 2]

这几个方法的参考信息:

方法 说明
isin 计算一个表示”Series各值是否包含于传入的值序列中”的布尔型数组
match 计算一个数组中的各值到另一个不同值数组的整数索引;对于数据对齐和连接类型的操作十分有用
unique 计算Series的唯一值数组,按发现的顺序返回
value_counts 返回一个Series,其索引为唯一值,其值为频率,按计数值降序排列
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import pandas as pd
import numpy as np


data = pd.DataFrame({'Qu1': ['a','b','e','d','e'],
'Qu2': ['e','b','b','d','e'],
'Qu3': ['c','c','c','d','e']})

print(data)
# Qu1 Qu2 Qu3
# 0 a e c
# 1 b b c
# 2 e b c
# 3 d d d
# 4 e e e


# 左侧为出现的全部元素,矩阵为出现次数
result1 = data.apply(pd.value_counts)
print(result1)
# Qu1 Qu2 Qu3
# a 1.0 NaN NaN
# b 1.0 2.0 NaN
# c NaN NaN 3.0
# d 1.0 1.0 1.0
# e 2.0 2.0 1.0


# 得到每一个元素在每一列中出现的次数,对有的列中没有出现的数值的频率为NA的值设置为0
result2 = data.apply(pd.value_counts).fillna(0)
print(result2)
# Qu1 Qu2 Qu3
# a 1.0 0.0 0.0
# b 1.0 2.0 0.0
# c 0.0 0.0 3.0
# d 1.0 1.0 1.0
# e 2.0 2.0 1.0