Chapter 6: Data Loading, Storage, and File Formats

Input and output typically fall into a few main categories:

  • Reading text files and other more efficient on-disk storage formats
  • Loading data from databases
  • Interacting with network resources via web APIs

6.1 Reading and Writing Data in Text Format

pandas provides a number of functions for reading tabular data as a DataFrame object.
read_csv and read_table are likely the ones you'll use most often.


The options for these functions fall into a few categories:

  • Indexing:
    can treat one or more columns as the returned DataFrame's index, and whether to get column names from the file, the user, or not at all.
  • Type inference and data conversion:
    includes user-defined value conversions and custom lists of missing-value markers.
  • Date parsing:
    includes combining capability, such as combining date and time information spread over multiple columns into a single result column.
  • Iterating:
    support for iterating over chunks of very large files.
  • Unclean data issues:
    skipping rows, a footer, comments, or other minor things (such as numeric data with thousands separated by commas).

Note when using read_csv:
the column data types are not part of the file format itself. pandas infers them for you (type inference), so you don't have to specify whether each column is numeric, integer, boolean, or string.
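As a quick sketch (the column names and values here are made up for illustration), type inference looks like this:

```python
from io import StringIO

import pandas as pd

# A small CSV held in memory; the text itself carries no type information
csv_text = "a,b,c\n1,2.5,x\n3,4.5,y\n"

df = pd.read_csv(StringIO(csv_text))

# pandas inferred integer, float, and string (object) columns
print(df.dtypes)
```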

Contents of ex1.csv (with a header row):

col1,col2,col3,col4,col5
a,b,c,d,hello
e,f,g,h,world
i,j,k,l,foo

Contents of ex2.csv (no header row):

a,b,c,d,hello
e,f,g,h,world
i,j,k,l,foo
import pandas as pd
import numpy as np

# Read the CSV with read_csv
df = pd.read_csv('ex1.csv')
print(df)
# col1 col2 col3 col4 col5
# 0 a b c d hello
# 1 e f g h world
# 2 i j k l foo

# Read the CSV with read_table
# sep=',' is required
df2 = pd.read_table('ex1.csv',sep=',')
print(df2)
# col1 col2 col3 col4 col5
# 0 a b c d hello
# 1 e f g h world
# 2 i j k l foo

# read_table without sep: everything lands in a single column
df3 = pd.read_table('ex1.csv')
print(df3)
# col1,col2,col3,col4,col5
# 0 a,b,c,d,hello
# 1 e,f,g,h,world
# 2 i,j,k,l,foo

Although CSV files use commas by default, some people use spaces or other characters as the delimiter.
To read such files you can use read_table, but you must supply the sep parameter.

import pandas as pd
import numpy as np


# ex2.csv has no header row
print(pd.read_csv('ex2.csv'))
# a b c d hello
# 0 e f g h world
# 1 i j k l foo

# Let pandas assign default column names
print(pd.read_csv('ex2.csv',header=None))
# 0 1 2 3 4
# 0 a b c d hello
# 1 e f g h world
# 2 i j k l foo

# Supply custom column names
names = ('c1 c2 c3 c4 c5').split()
print(pd.read_csv('ex2.csv',names=names))
# c1 c2 c3 c4 c5
# 0 a b c d hello
# 1 e f g h world
# 2 i j k l foo

If you supply fewer column names than there are columns:

  • ==the leftmost columns will not get names; they become the row index==
  • if two or more names are missing, those columns form a MultiIndex (hierarchical index)
import pandas as pd
import numpy as np


# with the default row index 0, 1, 2
names = ('c1 c2 c3 c4 c5').split()
df1 = pd.read_csv('ex2.csv',names=names)
print(df1)
# c1 c2 c3 c4 c5
# 0 a b c d hello
# 1 e f g h world
# 2 i j k l foo


print(df1.index)
# RangeIndex(start=0, stop=3, step=1)

print(df1.columns)
# Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')


# Get the row labeled 2
print(df1.loc[2])
# c1 i
# c2 j
# c3 k
# c4 l
# c5 foo
# Name: 2, dtype: object

print('------------------------')

# One column name short:
# the first column becomes the row index
names = ('c1 c2 c3 c4').split()
df2 = pd.read_csv('ex2.csv',names=names)
print(df2)
# c1 c2 c3 c4
# a b c d hello
# e f g h world
# i j k l foo

print(df2.index)
# Index(['a', 'e', 'i'], dtype='object')

print(df2.columns)
# Index(['c1', 'c2', 'c3', 'c4'], dtype='object')


# Get the row labeled 'a'
print(df2.loc['a'])
# c1 b
# c2 c
# c3 d
# c4 hello
# Name: a, dtype: object

print('-----------------------------')

# Two names short: a MultiIndex is formed
names = ('c1 c2 c3').split()
df3 = pd.read_csv('ex2.csv',names=names)
print(df3)
# c1 c2 c3
# a b c d hello
# e f g h world
# i j k l foo

print(df3.index)
# MultiIndex(levels=[['a', 'e', 'i'], ['b', 'f', 'j']],
# labels=[[0, 1, 2], [0, 1, 2]])

print(df3.columns)
# Index(['c1', 'c2', 'c3'], dtype='object')

# Get the row with MultiIndex key ('a', 'b')
print(df3.loc[('a','b')])
# c1 c
# c2 d
# c3 hello
# Name: (a, b), dtype: object

What is a MultiIndex?

import pandas as pd
import numpy as np

m_index1=pd.Index([("A","x1"),("A","x2"),("B","y1"),("B","y2"),("B","y3")],name=["class1","class2"])
print(m_index1)
# MultiIndex(levels=[['A', 'B'], ['x1', 'x2', 'y1', 'y2', 'y3']],
# labels=[[0, 0, 1, 1, 1], [0, 1, 2, 3, 4]],
# names=['class1', 'class2'])

df1 = pd.DataFrame(np.random.randint(1,10,(5,3)),index=m_index1)
print(df1)
# class1 class2
# A x1 3 1 2
# x2 4 6 8
# B y1 6 4 9
# y2 3 7 3
# y3 3 6 7

The code above corresponds to the table shown in the printed output (original screenshot omitted).

We can also use from_arrays() to create a MultiIndex:

import pandas as pd
import numpy as np

class1=["A","A","B","B"]
class2=["x1","x2","y1","y2"]
m_index2=pd.MultiIndex.from_arrays([class1,class2],names=["class1","class2"])

print(m_index2)
# MultiIndex(levels=[['A', 'B'], ['x1', 'x2', 'y1', 'y2']],
# labels=[[0, 0, 1, 1], [0, 1, 2, 3]],
# names=['class1', 'class2'])

df2 = pd.DataFrame(np.random.randint(1,10,(4,3)),index=m_index2)
print(df2)
# class1 class2
# A x1 7 4 1
# x2 1 1 5
# B y1 3 6 7
# y2 3 9 6


Common MultiIndex attributes:

import pandas as pd
import numpy as np

m_index1=pd.Index([("A","x1"),("A","x2"),("B","y1"),("B","y2"),("B","y3")],name=["class1","class2"])
print(m_index1)
# MultiIndex(levels=[['A', 'B'], ['x1', 'x2', 'y1', 'y2', 'y3']],
# labels=[[0, 0, 1, 1, 1], [0, 1, 2, 3, 4]],
# names=['class1', 'class2'])

df1 = pd.DataFrame(np.random.randint(1,10,(5,3)),index=m_index1)
print(df1)
# class1 class2
# A x1 3 1 2
# x2 4 6 8
# B y1 6 4 9
# y2 3 7 3
# y3 3 6 7


print(df1.index[0])
# ('A', 'x1')

# Call .get_loc() and .get_indexer() to get the integer positions of labels:
print(df1.index.get_loc(("A","x2")))
# 1

print(df1.index.get_indexer([("A","x2"),("B","y1"),"nothing"]))
# [ 1 2 -1]


print(df1.index.levels[0])
# Index(['A', 'B'], dtype='object', name='class1')

print(df1.index.levels[1])
# Index(['x1', 'x2', 'y1', 'y2', 'y3'], dtype='object', name='class2')

print(df1.index.levels[1][1])
# x2

# MultiIndex objects also have a labels attribute holding the label positions (renamed to .codes in newer pandas):
print(df1.index.labels[0])
# FrozenNDArray([0, 0, 1, 1, 1], dtype='int8')

print(df1.index.labels[1])
# FrozenNDArray([0, 1, 2, 3, 4], dtype='int8')

In the CSV examples above, the row labels were all set implicitly.
To explicitly make a column the row labels, use the index_col parameter.

import pandas as pd
import numpy as np


names = ('c1 c2 c3 c4 c5').split()
df = pd.read_csv('ex2.csv', names=names)
print(df)
# c1 c2 c3 c4 c5
# 0 a b c d hello
# 1 e f g h world
# 2 i j k l foo

# Use index_col to make the c5 column the row labels
df = pd.read_csv('ex2.csv', names=names, index_col='c5')
print(df)
# c1 c2 c3 c4
# c5
# hello a b c d
# world e f g h
# foo i j k l

print(df.index)
# Index(['hello', 'world', 'foo'], dtype='object', name='c5')

print(df.loc['hello'])
# c1 a
# c2 b
# c3 c
# c4 d
# Name: hello, dtype: object


# Passing a list of columns produces a MultiIndex
fd2 = pd.read_csv('ex2.csv',names=names, index_col=['c4', 'c5'])
print(fd2)
# c1 c2 c3
# c4 c5
# d hello a b c
# h world e f g
# l foo i j k

print(fd2.index)
# MultiIndex(levels=[['d', 'h', 'l'], ['foo', 'hello', 'world']],
# labels=[[0, 1, 2], [1, 2, 0]],
# names=['c4', 'c5'])

read_table:
Earlier we used read_table with sep=',' to read a CSV. In fact, read_table can read many kinds of files; the sep parameter accepts many forms, even regular expressions:

import pandas as pd
import numpy as np

print(list(open('ex3.txt')))
# [' A B C\n',
# 'aaa -0.264438 -1.026059 -0.619500\n',
# 'bbb 0.927272 0.302904 -0.032399\n',
# 'ccc -0.264273 -0.386314 -0.217601\n',
# 'ddd -0.871858 -0.348382 1.100491\n']

# Use whitespace as the separator
result = pd.read_table('ex3.txt', sep=r'\s+')
print(result)
# A B C
# aaa -0.264438 -1.026059 -0.619500
# bbb 0.927272 0.302904 -0.032399
# ccc -0.264273 -0.386314 -0.217601
# ddd -0.871858 -0.348382 1.100491

# Because there is one fewer column name than data fields, pandas infers the first column is the row labels
print(result.index)
# Index(['aaa', 'bbb', 'ccc', 'ddd'], dtype='object')

In fact, pandas by default converts -1.#IND, 1.#QNAN, 1.#IND, -1.#QNAN, #N/A N/A, #N/A, N/A, NA, #NA, NULL, NaN, -NaN, nan, and -nan to NaN; the na_values parameter lets you specify additional values to treat as missing.
Also worth noting is the keep_default_na parameter, which decides whether to keep the default list of values converted to missing. Setting it to False without supplying na_values means no values at all are converted to NaN on read.
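A quick sketch of keep_default_na on an in-memory CSV (the column names and values are made up for illustration):

```python
from io import StringIO

import pandas as pd

csv_text = "a,b\nNA,1\n,2\n"

# Default behavior: 'NA' and the empty field both become NaN
df_default = pd.read_csv(StringIO(csv_text))
print(df_default["a"].isna().tolist())  # [True, True]

# keep_default_na=False with no na_values: nothing is treated as missing,
# so 'NA' and the empty string survive as plain strings
df_raw = pd.read_csv(StringIO(csv_text), keep_default_na=False)
print(df_raw["a"].tolist())  # ['NA', '']
```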

Common read_csv operations:

import pandas as pd
import numpy as np


print(pd.read_csv('ex2.csv'))
# a b c d hello
# 0 e f g h world
# 1 i j k l foo

# Skip rows 0 and 2
print(pd.read_csv('ex2.csv', skiprows=[0, 2]))
# Columns: [e, f, g, h, world]

print('-------------------------------------')

# The examples/ex5.csv file contains:
# something,a,b,c,d,message
# one,1,2,3,4,NA
# two,5,6,,8,world
# three,9,10,11,12,foo

# NA and empty strings are automatically recognized as pandas NaN
result = pd.read_csv('examples/ex5.csv')
print(result)
# something a b c d message
# 0 one 1 2 3.0 4 NaN
# 1 two 5 6 NaN 8 world
# 2 three 9 10 11.0 12 foo

print(pd.isnull(result))
# something a b c d message
# 0 False False False False False True
# 1 False False False True False False
# 2 False False False False False False

# na_values can take a list or set of strings to treat as missing:
result = pd.read_csv('examples/ex5.csv',na_values=['NULL'])
print(result)
# something a b c d message
# 0 one 1 2 3.0 4 NaN
# 1 two 5 6 NaN 8 world
# 2 three 9 10 11.0 12 foo


# na_values: every element of this list is treated as NaN
# here 'one' is treated as NaN
result = pd.read_csv('examples/ex5.csv',na_values=['one'])
print(result)
# something a b c d message
# 0 NaN 1 2 3.0 4 NaN
# 1 two 5 6 NaN 8 world
# 2 three 9 10 11.0 12 foo

# Different NA sentinels can be specified per column
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
print(pd.read_csv('examples/ex5.csv', na_values=sentinels))
# something a b c d message
# 0 one 1 2 3.0 4 NaN
# 1 NaN 5 6 NaN 8 world
# 2 three 9 10 11.0 12 NaN

Commonly used options for pandas.read_csv and pandas.read_table:

Argument Description
path String indicating filesystem location, URL, or file-like object
sep or delimiter Character sequence or regular expression used to split the fields in each row
header Row number to use as the column names. Defaults to 0 (the first row); set to None if there is no header row
index_col Column number(s) or name(s) to use as the row index; can be a single name/number or a list of them (producing a hierarchical index)
names List of column names for the result, used together with header=None
skiprows Number of rows at the beginning of the file to ignore, or a list of row numbers (starting from 0) to skip
na_values Sequence of values to replace with NA
comment Character(s) used to split comments off the end of lines
parse_dates Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Can also specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns and parse them as a date (e.g. when date and time are split across two columns)
keep_date_col If joining columns to parse a date, keep the joined columns; False by default
converters Dict mapping column numbers or names to functions (e.g. {'foo': f} applies the function f to all values in the 'foo' column)
dayfirst When parsing potentially ambiguous dates, treat them as international format (e.g. 7/6/2012 -> June 7, 2012); False by default
date_parser Function to use for parsing dates
nrows Number of rows to read from the beginning of the file
iterator Return a TextParser object for reading the file piecemeal
chunksize For iteration, the size of the file chunks
skip_footer Number of lines to ignore at the end of the file
verbose Print various parser output information, such as the number of missing values in non-numeric columns
encoding Text encoding (e.g. 'utf-8' for UTF-8 encoded text)
squeeze If the parsed data contains only one column, return a Series
thousands Thousands separator (e.g. ',' or '.')
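To illustrate a few of these options working together, here is a sketch on an in-memory file (the column names, values, and comment marker are assumptions for illustration):

```python
from io import StringIO

import pandas as pd

csv_text = '''# produced by an imaginary export tool
name,amount
alpha,"1,200"
beta,"3,400"
'''

df = pd.read_csv(
    StringIO(csv_text),
    comment='#',                     # strip the leading comment line
    thousands=',',                   # parse "1,200" as the number 1200
    converters={'name': str.upper},  # apply a function to every 'name' value
)
print(df)
```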

Reading Text Files in Pieces

import pandas as pd
import numpy as np

# How many rows to display
pd.options.display.max_rows = 8
result = pd.read_csv('examples/ex6.csv')
print(result)
# one two three four key
# 0 0.467976 -0.038649 -0.295344 -1.824726 L
# 1 -0.358893 1.404453 0.704965 -0.200638 B
# 2 -0.501840 0.659254 -0.421691 -0.057688 G
# 3 0.204886 1.074134 1.388361 -0.982404 R
# ... ... ... ... ... ..
# 9996 -0.479893 -0.650419 0.745152 -0.646038 E
# 9997 0.523331 0.787112 0.486066 1.093156 K
# 9998 -0.362559 0.598894 -1.843201 0.887292 G
# 9999 -0.096376 -1.012999 -0.657431 -0.573315 0
# [10000 rows x 5 columns]


# Read only 3 rows
print(pd.read_csv('examples/ex6.csv', nrows=3))
# one two three four key
# 0 0.467976 -0.038649 -0.295344 -1.824726 L
# 1 -0.358893 1.404453 0.704965 -0.200638 B
# 2 -0.501840 0.659254 -0.421691 -0.057688 G


# Read the file in chunks of a given number of rows; returns a TextParser object
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
print(chunker)
# <pandas.io.parsers.TextFileReader object at 0x00000219B4ACAF60>


# The TextParser object lets you iterate over the file in chunks of chunksize rows
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype=float)
for piece in chunker:
    # accumulate the value counts of the key column
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot[:6])
# E 368.0
# X 364.0
# L 346.0
# O 343.0
# Q 340.0
# M 338.0
# dtype: float64

Writing Data Out to Text Format:
The to_csv method:

  • sep: the delimiter to use
  • na_rep: the string to represent missing values with
  • index: whether to write the row labels
  • header: whether to write the column labels
  • columns: which columns to write, in what order
import pandas as pd
import numpy as np


data = pd.read_csv('examples/ex5.csv')
print(data)
# something a b c d message
# 0 one 1 2 3.0 4 NaN
# 1 two 5 6 NaN 8 world
# 2 three 9 10 11.0 12 foo

data.to_csv('out.csv')


import sys
# sep: use a different delimiter
data.to_csv(sys.stdout, sep='|')
# |something|a|b|c|d|message
# 0|one|1|2|3.0|4|
# 1|two|5|6||8|world
# 2|three|9|10|11.0|12|foo


# na_rep: missing values appear as empty strings in the output;
# na_rep replaces them with another marker, here NULL
data.to_csv(sys.stdout, na_rep='NULL')
# ,something,a,b,c,d,message
# 0,one,1,2,3.0,4,NULL
# 1,two,5,6,NULL,8,world
# 2,three,9,10,11.0,12,foo

# index, header: suppress the row and column labels. Note the parameter is header, not columns
data.to_csv(sys.stdout, index=False, header=False)
# one,1,2,3.0,4,
# two,5,6,,8,world
# three,9,10,11.0,12,foo

# columns: write only these columns, in this order
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
# a,b,c
# 1,2,3.0
# 5,6,
# 9,10,11.0

Series also has a to_csv method:

import sys

import pandas as pd
import numpy as np

dates = pd.date_range('1/1/2000', periods=7)
print(dates)
# DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
# '2000-01-05', '2000-01-06', '2000-01-07'],
# dtype='datetime64[ns]', freq='D')

ts = pd.Series(np.arange(7), index=dates)
ts.to_csv(sys.stdout)
# 2000-01-01,0
# 2000-01-02,1
# 2000-01-03,2
# 2000-01-04,3
# 2000-01-05,4
# 2000-01-06,5
# 2000-01-07,6

Working with Delimited Formats

Most forms of tabular data stored on disk can be loaded with pandas.read_table.

Using Python's built-in csv module:

import csv

f = open('examples/ex7.csv')

# Pass the open file object to csv.reader
reader = csv.reader(f)

# Iterating over the reader yields a list of strings for each row
# (type(line) is list)
for line in reader:
    print(line)
# ['a', 'b', 'c']
# ['1', '2', '3']
# ['1', '2', '3']

With a bit of wrangling we can produce a dict of data columns:

import csv

with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))

# Split the header line from the data lines
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
# {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

You can even define a small csv.Dialect subclass:

import csv

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    # use semicolons as the field delimiter
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f, dialect=my_dialect))

# Split the header line from the data lines
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
# {'a,"b","c"': ('1,"2","3"', '1,"2","3"')}

csv.Dialect options (reference table omitted):


Writing CSV files:

import csv

# my_dialect is the csv.Dialect subclass defined above
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))

JSON Data

Apart from its null value and a few other nuances (such as disallowing trailing commas at the end of lists), JSON is very nearly valid Python code.

All keys in a JSON object must be strings.
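For example, Python's json module coerces non-string dict keys to strings when serializing:

```python
import json

# Python dicts allow the integer key 1; JSON objects do not,
# so json.dumps converts it to the string "1"
encoded = json.dumps({1: 'one'})
print(encoded)  # {"1": "one"}

# Round-tripping therefore does not restore the integer key
decoded = json.loads(encoded)
print(decoded)  # {'1': 'one'}
```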

import json

obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

# json.loads: parse a JSON string
result = json.loads(obj)
print(result)  # output omitted

# json.dumps: convert a Python object to a JSON string
asjson = json.dumps(result)

# json.load: read JSON from a file
with open("../config/record.json", 'r') as load_f:
    load_dict = json.load(load_f)
print(load_dict)  # output omitted

# json.dump: write a Python object to a JSON file
with open("../config/record.json", "w") as f:
    json.dump(load_dict, f)
print("Finished writing the file...")

Converting between JSON data and DataFrame objects:

  • JSON object to DataFrame:
    pd.DataFrame(result['siblings'])
    just pass it to DataFrame()

  • DataFrame to JSON object:

    df.to_json()

  • JSON file to DataFrame:
    pd.read_json('example.json')

  • DataFrame to JSON file:
    df.to_json('hyl.json')

In other words, DataFrame-to-JSON-file and DataFrame-to-JSON-string share the same to_json method.

import json

import pandas as pd


obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

# Parse the JSON string
result = json.loads(obj)
print(result) # 输出略

# Convert to a DataFrame
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
print(siblings)
# name age
# 0 Scott 30
# 1 Katie 38


# The examples/example.json file contains:
# [{"a": 1, "b": 2, "c": 3},
# {"a": 4, "b": 5, "c": 6},
# {"a": 7, "b": 8, "c": 9}]


# Read a JSON file
# pandas.read_json's default options assume each object in the JSON array is a row of the table
# (Chapter 7 covers reading deeply nested JSON)
data = pd.read_json('examples/example.json')
print(data)
# a b c
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9

# DataFrame to JSON string:
print(data.to_json())
# {"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}

# Change the output layout
print(data.to_json(orient='records'))
# [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]

# Write to a JSON file
data.to_json('hyl.json', orient='records')
# hyl.json now contains: [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]

The JSON orient layouts are:

  • split: {index -> [index], columns -> [columns], data -> [values]}
  • records: [{column -> value}, ... , {column -> value}]
  • index: {index -> {column -> value}}
  • columns: {column -> {index -> value}}
  • values: just the array of values
  • table: {'schema': {schema}, 'data': {data}}, similar to records but with a schema
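A small sketch of how a few of these layouts differ (the frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['a', 'b'])

# 'columns' (the default): {column -> {index -> value}}
print(df.to_json())
# {"a":{"r1":1,"r2":3},"b":{"r1":2,"r2":4}}

# 'records': one object per row; the index is dropped
print(df.to_json(orient='records'))
# [{"a":1,"b":2},{"a":3,"b":4}]

# 'split': columns, index, and data kept separately
print(df.to_json(orient='split'))
```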

XML and HTML: Web Scraping
The read_html function uses lxml or Beautiful Soup to automatically parse tables in HTML files into DataFrame objects.

pandas.read_html has a number of options; by default it searches for and attempts to parse all tabular data contained in <table> tags. The result is a list of DataFrame objects:

import pandas as pd
import numpy as np

tables = pd.read_html('examples/fdic_failed_bank_list.html')

print(tables)
# [ Bank Name ... Updated Date
# 0 Allied Bank ... November 17, 2016
# 1 The Woodbury Banking Company ... November 17, 2016
# 2 First CornerStone Bank ... September 6, 2016
# ...
# 544 First Alliance Bank & Trust Co. ... February 18, 2003
# 545 National State Bank of Metropolis ... March 17, 2005
# 546 Bank of Honolulu ... March 17, 2005
# [547 rows x 7 columns]]

# which means this HTML file contains only one table
print(len(tables))
# 1

failures = tables[0]
# Look at the first 5 rows
print(failures.head())
# Bank Name ... Updated Date
# 0 Allied Bank ... November 17, 2016
# 1 The Woodbury Banking Company ... November 17, 2016
# 2 First CornerStone Bank ... September 6, 2016
# 3 Trust Company Bank ... September 6, 2016
# 4 North Milwaukee State Bank ... June 16, 2016
# [5 rows x 7 columns]


close_timestamps = pd.to_datetime(failures['Closing Date'])
# Count the number of bank failures by year:
print(close_timestamps.dt.year.value_counts())
# 2010 157
# 2009 140
# ...
# 2003 3
# 2000 2
# Name: Closing Date, dtype: int64

Parsing XML with lxml.objectify

XML is structurally similar to HTML, but more general.
An XML record looks like this:

<INDICATOR>
<INDICATOR_SEQ>373889</INDICATOR_SEQ>
<PARENT_SEQ></PARENT_SEQ>
<AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
<INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
<DESCRIPTION>Percent of the time that escalators are operational
systemwide. The availability rate is based on physical observations performed
the morning of regular business days only. This is a new indicator the agency
began reporting in 2009.</DESCRIPTION>
<PERIOD_YEAR>2011</PERIOD_YEAR>
<PERIOD_MONTH>12</PERIOD_MONTH>
<CATEGORY>Service Indicators</CATEGORY>
<FREQUENCY>M</FREQUENCY>
<DESIRED_CHANGE>U</DESIRED_CHANGE>
<INDICATOR_UNIT>%</INDICATOR_UNIT>
<DECIMAL_PLACES>1</DECIMAL_PLACES>
<YTD_TARGET>97.00</YTD_TARGET>
<YTD_ACTUAL></YTD_ACTUAL>
<MONTHLY_TARGET>97.00</MONTHLY_TARGET>
<MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>

Using lxml to parse a snippet of HTML markup:

from io import StringIO
from lxml import objectify

tag = '<a href="http://www.google.com">Google</a>'
# Parse the HTML snippet with lxml.objectify
root = objectify.parse(StringIO(tag)).getroot()

print(root)
# Google

print(root.get('href'))
# http://www.google.com

print(root.text)
# Google

Converting XML to a DataFrame:
just pass the list of records to the DataFrame constructor.

from lxml import objectify
import pandas as pd
import numpy as np


path = 'datasets/mta_perf/Performance_MNR.xml'

with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

# root.INDICATOR returns a generator yielding each <INDICATOR> XML element
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        # skip some fields
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

# Convert to a DataFrame
perf = pd.DataFrame(data)
print(perf.head())
# AGENCY_NAME CATEGORY ... YTD_ACTUAL YTD_TARGET
# 0 Metro-North Railroad Service Indicators ... 96.9 95
# 1 Metro-North Railroad Service Indicators ... 96 95
# 2 Metro-North Railroad Service Indicators ... 96.3 95
# 3 Metro-North Railroad Service Indicators ... 96.8 95
# 4 Metro-North Railroad Service Indicators ... 96.6 95

# [5 rows x 12 columns]

6.2 Binary Data Formats

pickle: Python's built-in binary serialization format.

DataFrame has to_json() and to_csv(), and naturally also a to_pickle method:

  • convert a DataFrame to a pickle file:
    frame.to_pickle('frame_pickle')
  • read a pickle file back into a DataFrame:
    data = pd.read_pickle('frame_pickle')
import pandas as pd


frame = pd.read_csv('examples/ex1.csv')
print(frame)
# a b c d message
# 0 1 2 3 4 hello
# 1 5 6 7 8 world
# 2 9 10 11 12 foo

# Convert the DataFrame to a pickle file
frame.to_pickle('frame_pickle')


data = pd.read_pickle('frame_pickle')
print(data)
# a b c d message
# 0 1 2 3 4 hello
# 1 5 6 7 8 world
# 2 9 10 11 12 foo

print(type(data))
# <class 'pandas.core.frame.DataFrame'>

pandas has built-in support for two other binary data formats:

  • HDF5
  • MessagePack (support for the latter was removed in pandas 1.0)

Other storage formats:

  • bcolz: a compressable column-oriented binary format based on the Blosc compression library.
  • Feather: a cross-language column-oriented file format. Feather uses the Apache Arrow columnar memory format.

Using HDF5 Format

HDF5 is a well-regarded file format for storing large quantities of scientific array data.
The HDF in HDF5 stands for hierarchical data format. It is best suited to write-once, read-many datasets.

Converting between DataFrame and HDF5:

  • DataFrame to HDF5 file:
    store = pd.HDFStore('mydata.h5')

This returns store, a file-like object that can then be filled with data:
store.put('obj2', df, format='table')
or
store['obj1_col'] = df['a']

  • HDF5 file to DataFrame:
    x = store.select('obj2', where=['index >= 10 and index <= 15'])

There is also a more convenient API:

  • DataFrame to HDF5 file:
    df.to_hdf('mydata.h5', 'obj3', format='table')
  • HDF5 file to DataFrame:
    df = pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])
import pandas as pd
import numpy as np

# Create a DataFrame
frame = pd.DataFrame({'a': np.random.randn(100)})

# Open (creating if necessary) the HDF5 file mydata.h5
# Returns a file-like store object
store = pd.HDFStore('mydata.h5')

print(type(store))
# <class 'pandas.io.pytables.HDFStore'>

# Fill the store with data
store['obj1'] = frame
store['obj1_col'] = frame['a']

print(store)
# File path: mydata.h5
# /obj1 frame (shape->[100,1])
# /obj1_col series (shape->[100])

# Objects in the HDF5 file can be retrieved with dict-like access
print(store['obj1'])
# a
# 0 -1.093319
# 1 0.863954
# ...
# 98 0.209667
# 99 -0.824430
# [100 rows x 1 columns]

# Store a DataFrame in the HDF5 file using the 'table' format
store.put('obj2', frame, format='table')

# Query the stored data back into a DataFrame
df = store.select('obj2', where=['index >= 10 and index <= 15'])
print(df)
# a
# 10 0.108707
# 11 0.346106
# 12 -2.397788
# 13 0.948306
# 14 0.442709
# 15 -0.902998

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

store.close()

HDFStore supports two storage schemas, 'fixed' and 'table'.
The latter is generally slower, but it supports query operations with a special syntax.

store.put('obj2', frame, format='table') is the explicit version of store['obj2'] = frame; it lets you set other options, such as the storage format.

Using the more convenient API, read_hdf and to_hdf:

import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(100)})

print(frame)
# a
# 0 -0.022453
# 1 0.117988
# 2 -0.102322
# 3 -1.179342
# 4 -0.177205
# .. ...
# 98 -0.905328
# 99 0.589680
# [100 rows x 1 columns]

# Write the DataFrame to an HDF5 file
frame.to_hdf('mydata.h5', 'obj3', format='table')

# Read the data back from the HDF5 file
df = pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

print(df)
# a
# 0 -0.022453
# 1 0.117988
# 2 -0.102322
# 3 -1.179342
# 4 -0.177205

Reading Microsoft Excel Files

The ExcelFile class and the pandas.read_excel function support reading tabular data from Excel files.

  • Excel file to DataFrame:
    read_excel
    accepts either an Excel file path or an ExcelFile object
  • DataFrame to Excel file:
    to_excel
    accepts either an Excel file path or an ExcelWriter object
import pandas as pd


# Create an ExcelFile instance from a file path
excelfile = pd.ExcelFile('examples/ex1.xlsx')

print(excelfile)
# <pandas.io.excel.ExcelFile object at 0x00000249ECB3A080>

# Pass an ExcelFile instance to create a DataFrame
df = pd.read_excel(excelfile, 'Sheet1')
print(type(df))
# <class 'pandas.core.frame.DataFrame'>

print(df)
# a b c d message
# 0 1 2 3 4 hello
# 1 5 6 7 8 world
# 2 9 10 11 12 foo

# You can also pass the Excel file path directly to create a DataFrame
df = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
print(type(df))
# <class 'pandas.core.frame.DataFrame'>

print(df)
# a b c d message
# 0 1 2 3 4 hello
# 1 5 6 7 8 world
# 2 9 10 11 12 foo


# Create an ExcelWriter object
writer = pd.ExcelWriter('ex2.xlsx')
# Write the DataFrame to the Excel file
df.to_excel(writer, 'Sheet1')
# Remember to save the writer
writer.save()

# You can also skip the ExcelWriter and write straight to a file path
df.to_excel('ex2.xlsx')

6.3 Interacting with Web APIs

This is familiar territory, so only briefly:

import pandas as pd
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)

# json(): parse the HTTP response body as JSON
data = resp.json()

print(data[0].get('title'))
# 输出略

issues = pd.DataFrame(data, columns=['number', 'title','labels', 'state'])
print(issues)
# 输出略

6.4 Interacting with Databases

import pandas as pd
import sqlite3

# Create a table
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);
"""
# Connect to the database
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()


# Insert rows
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
sql = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(sql, data)
con.commit()


# Read the data back
# execute returns a cursor
cursor = con.execute('select * from test')
rows = cursor.fetchall()

# Most Python SQL drivers (PyODBC,
# psycopg2, MySQLdb, pymssql, etc.) return a list of tuples:
print(rows)
# [('Atlanta', 'Georgia', 1.25, 6),
# ('Tallahassee', 'Florida', 2.6, 3),
# ('Sacramento', 'California', 1.7, 5)]


# The column names are held in the cursor's description attribute
print(cursor.description)
# (('a', None, None, None, None, None, None),
# ('b', None, None, None, None, None, None),
# ('c', None, None, None, None, None, None),
# ('d', None, None, None, None, None, None))


# Pass the list of tuples to the DataFrame constructor,
# taking the column names from cursor.description
df = pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
print(df)
# a b c d
# 0 Atlanta Georgia 1.25 6
# 1 Tallahassee Florida 2.60 3
# 2 Sacramento California 1.70 5

There is also read_sql, which reads data from a SQLAlchemy connection:

import pandas as pd
import sqlalchemy as sqla


db = sqla.create_engine('sqlite:///mydata.sqlite')
df = pd.read_sql('select * from test', db)

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

print(df)
# a b c d
# 0 Atlanta Georgia 1.25 6
# 1 Tallahassee Florida 2.60 3
# 2 Sacramento California 1.70 5