python-cookbook1_Chapter 1_Data Structures and Algorithms

Chapter 1: Data Structures and Algorithms

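1.1 Unpacking a Sequence into Separate Variables

In fact, tuple unpacking works with any iterable, including strings, file objects, iterators, and generators.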

def gen():
	yield 1
	yield 2

g = gen()

a,b = g
print(a,b)  # 1 2

mystr = '456'
a,b,c = mystr
print(a,b,c)  # 4 5 6

This leads to a handy extension:
quickly trimming the first and last elements

alist  = [1,2,3,4]

_,*alist,_ = alist
print(alist)
print(type(alist))
# [2, 3]
# <class 'list'>

The starred variable alist is always a list, no matter how many elements it captures (including zero).

This is especially common with split():

line = 'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false' 
uname, *fields, homedir, sh = line.split(':') 
print(uname)
print(homedir)
print(sh)

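It can even be used for recursion:

items = [1, 10, 7, 4, 5, 9]

# Named recursive_sum to avoid shadowing the built-in sum()
def recursive_sum(items):
    head, *tail = items
    return head + recursive_sum(tail) if tail else head

print(recursive_sum(items))  # 36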

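Star-unpacking syntax:

It unpacks iterables with an unknown or arbitrary number of elements.

For sequences, it is in principle interchangeable with slicing.

It is often used to quickly grab just the elements you need.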
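To make the comparison with slicing concrete, here is a minimal sketch. The record tuple and its field names are made up for illustration. It shows that the starred name collects the same elements a slice would, except that it always produces a list.

```python
# Hypothetical record: name, email, then any number of phone numbers
record = ('Dave', 'dave@example.com', '773-555-1212', '847-555-1212')

# Star unpacking: the starred name collects whatever is left over
name, email, *phones = record

# Equivalent slicing: same elements, but the slice of a tuple is a tuple
phones_by_slice = record[2:]

print(phones)           # ['773-555-1212', '847-555-1212']
print(phones_by_slice)  # ('773-555-1212', '847-555-1212')
```

Unlike slicing, star unpacking also works on iterators and generators, which cannot be indexed.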

1.3 Keeping the Last N Items

Use collections.deque to keep a limited history.

from collections import deque

def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

# Example use on a file
if __name__ == '__main__':
    with open('somefile.txt') as f:
        for line, prevlines in search(f, 'python', 5):
            for pline in prevlines:
                print(pline, end='')
            print(line, end='')
            print('-'*20)
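The search() function above relies on how a bounded deque behaves. The following small sketch (the variable name recent is just for illustration) shows that once maxlen is reached, appending a new item silently drops the oldest one.

```python
from collections import deque

# A deque that keeps at most 3 items
recent = deque(maxlen=3)

for i in range(1, 6):
    recent.append(i)  # once full, the oldest item is discarded automatically

print(recent)  # deque([3, 4, 5], maxlen=3)
```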

1.4 Finding the Largest or Smallest N Items

The heapq module provides two functions for this: nlargest() and nsmallest().

heapq stands for heap queue.

import heapq

nums = [1,2,3,4,5,6,7]
print(heapq.nlargest(3,nums))
print(heapq.nsmallest(3,nums))
# [7, 6, 5]
# [1, 2, 3]

Just as max() accepts a key argument, so does nlargest():

from heapq import nlargest

adict = [{'name':13 ,'score':10},
		 {'name':456,'score':120},
		 {'name':74 ,'score':103},
		 {'name':58 ,'score':140}]

x = max(adict,key=lambda x: x['name'])
print(x)   # {'name': 456, 'score': 120}

y = nlargest(2,adict,key=lambda x: x['score'])
print(y)   # [{'name': 58, 'score': 140}, {'name': 456, 'score': 120}]

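Alternatively, use heapq.heappop() to find the smallest N items (for the largest N, push negated values):

import heapq

x = [9,5,1,2,3,5,9,4,2,6]

# Rearrange x into a heap in place (heap order, not fully sorted)
heapq.heapify(x)
print(x)  # [1, 2, 5, 2, 3, 9, 9, 4, 5, 6]

# Pop the smallest item; the remaining items are rearranged to keep the heap property
print(heapq.heappop(x))  # 1
print(x)  # [2, 2, 5, 4, 3, 9, 9, 6, 5]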

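Once the list has been heapified, heap[0] is always the smallest element, so calling heapq.heappop(x) N times yields the N smallest items in order.

Because push and pop are O(log N), where N is the size of the heap, they stay fast even when N is large.

Summary:

When the number of items you are looking for is small relative to the collection, use nlargest() and nsmallest().

If you only want the single smallest or largest item (N=1), use min() and max().

If N is close to the size of the collection, it is usually faster to sort first and then slice (sorted(items)[:N] or sorted(items)[-N:]).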
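The three approaches in the summary can be compared side by side. This is only a sketch on a small made-up list (nums), meant to show that they agree on the result; which one is fastest depends on N and the size of the data.

```python
import heapq

nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]

# N == 1: min() / max() are the simplest choice
smallest_one = min(nums)

# Small N: nsmallest() / nlargest()
three_smallest = heapq.nsmallest(3, nums)

# Repeated heappop() on a heapified copy gives the same items
heap = list(nums)
heapq.heapify(heap)
popped = [heapq.heappop(heap) for _ in range(3)]

# N close to len(nums): sort, then slice
sorted_slice = sorted(nums)[:3]

print(smallest_one)    # -4
print(three_smallest)  # [-4, 1, 2]
print(popped)          # [-4, 1, 2]
print(sorted_slice)    # [-4, 1, 2]
```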

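1.5 Implementing a Priority Queue

A queue ordered by priority.
Each pop operation on this queue always returns the item with the highest priority.
If two items have the same priority (foo and grok), pop returns them in the order they were inserted.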

import heapq

class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

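    def push(self, item, priority):
        # Push a tuple onto the heap.
        # -priority: the higher the priority, the smaller -priority is,
        # so the item sorts closer to the front.
        # When -priority ties, self._index is compared next, which
        # guarantees items come back in insertion order.
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1

    def pop(self):
        # Always pop the first (highest-priority) entry and return its item
        return heapq.heappop(self._queue)[-1]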

# Example use
class Item:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return 'Item({!r})'.format(self.name)

q = PriorityQueue()
q.push(Item('foo'), 1)
q.push(Item('bar'), 5)
q.push(Item('spam'), 4)
q.push(Item('grok'), 1)

print("Should be bar:", q.pop())
print("Should be spam:", q.pop())
print("Should be foo:", q.pop())
print("Should be grok:", q.pop())
# Should be bar: Item('bar')
# Should be spam: Item('spam')
# Should be foo: Item('foo')
# Should be grok: Item('grok')

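heapq.heappush() inserts an item into the queue and heapq.heappop() removes its first item; both rearrange the list afterwards so that it stays a valid heap.

The net effect is that the first element always has the highest priority.

Code notes:
Tuples can be compared.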
tuple1 = (2,3)
tuple2 = (3,6)

print(tuple1 > tuple2)  # False

heapq.heappush(heap, item) pushes item onto heap.

So we can push tuples onto the heap as items.
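The index in the (-priority, index, item) tuple does more than preserve insertion order. The sketch below, which reuses a stripped-down Item class as an assumption, shows that without it two entries with equal priority fall through to comparing the Item objects themselves, which do not support ordering.

```python
class Item:
    def __init__(self, name):
        self.name = name

# Same priority, no index: comparison falls through to the Item objects
a = (1, Item('foo'))
b = (1, Item('grok'))
try:
    a < b
    comparable = True
except TypeError:
    # Item defines no ordering, so the tuple comparison fails
    comparable = False

# Same priority, distinct index: the comparison is settled before reaching Item
a = (1, 0, Item('foo'))
b = (1, 1, Item('grok'))
with_index = a < b

print(comparable)  # False
print(with_index)  # True
```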

1.6 Mapping Keys to Multiple Values in a Dictionary

defaultdict automatically initializes the value for each key, so you only need to focus on adding items.

from collections import defaultdict

# Each value in the dict is a list
d1 = defaultdict(list)

d1['a'].append(1)
print(d1)
# defaultdict(<class 'list'>, {'a': [1]})


d2 = defaultdict(set)
d2['a'].add(2)
print(d2)
# defaultdict(<class 'set'>, {'a': {2}})
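For comparison, the same grouping can be done with an ordinary dict and setdefault(). This is a sketch with made-up input pairs; it shows what defaultdict saves you from writing.

```python
from collections import defaultdict

pairs = [('a', 1), ('a', 2), ('b', 4)]  # illustrative input

# Plain dict: setdefault() creates the list on first use of a key
plain = {}
for key, value in pairs:
    plain.setdefault(key, []).append(value)

# defaultdict: the list is created automatically on first access
auto = defaultdict(list)
for key, value in pairs:
    auto[key].append(value)

print(plain)       # {'a': [1, 2], 'b': [4]}
print(dict(auto))  # {'a': [1, 2], 'b': [4]}
```

One difference to keep in mind: merely reading a missing key from a defaultdict (for example auto['zzz']) creates that entry, whereas a plain dict raises KeyError.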

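1.7 Keeping Dictionaries in Order

OrderedDict class: it preserves the order in which items were inserted when you iterate over it.

OrderedDict is particularly useful when you want to build a mapping that you will later serialize or encode into another format.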

from collections import OrderedDict

d = OrderedDict()

d['a'] = 1
d['b'] = 2
d['c'] = 3

print(d)
# OrderedDict([('a', 1), ('b', 2), ('c', 3)])

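OrderedDict internally maintains a doubly linked list that orders the keys by insertion. Each new item is placed at the end of this list. Reassigning an existing key does not change its position.

Note that an OrderedDict is roughly twice the size of a regular dictionary, because of the extra linked list it maintains internally.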
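As a small illustration of the serialization use case, the sketch below encodes an OrderedDict as JSON and checks that reassigning a key keeps its position. The keys and values are made up. (Since Python 3.7 regular dicts also preserve insertion order, so on modern versions a plain dict behaves the same way here.)

```python
import json
from collections import OrderedDict

d = OrderedDict()
d['foo'] = 1
d['bar'] = 2
d['spam'] = 3

# Fields appear in the JSON output in insertion order
encoded = json.dumps(d)
print(encoded)  # {"foo": 1, "bar": 2, "spam": 3}

# Reassigning an existing key does not move it
d['foo'] = 10
print(list(d))  # ['foo', 'bar', 'spam']
```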

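1.8 Calculating with Dictionaries

Performing calculations (such as minimum, maximum, and sorting) on the data in a dictionary.

As mentioned earlier, tuples can be compared, so we can use zip() to pack the values and keys into tuples: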

prices = {
'ACME': 45.23,
'AAPL': 612.78,
'IBM': 205.55,
'HPQ': 37.20,
'FB': 10.75
}

max_num = max(zip(prices.values(),prices.keys()))
print(max_num)
# (612.78, 'AAPL')


min_num = min(prices,key=lambda x: prices[x])
print(min_num)
# FB

sorted_dict = sorted(zip(prices.values(),prices.keys()))
print(sorted_dict)
# [(10.75, 'FB'), (37.2, 'HPQ'), (45.23, 'ACME'), (205.55, 'IBM'), (612.78, 'AAPL')]

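In summary, there are two ways to find the minimum (or maximum) entry of a dictionary:

Using zip(), which returns a (value, key) tuple:

min_num = min(zip(prices.values(),prices.keys()))

Using the key argument of min(), which returns only the key:

min_num = min(prices,key=lambda x:prices[x])

Note:

In min_num = min(prices,key=lambda x:prices[x]), x is each element produced by iterating over the first argument, that is, each key of the dictionary.

Likewise, in min_num = min([(1,2),(1,3),(1,4)],key=lambda x:x[-1]), x is each element of [(1,2),(1,3),(1,4)], that is, (1,2), (1,3), and (1,4).

Finding the key that corresponds to a value: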

prices = {
'ACME': 45.23,
'AAPL': 612.78,
'IBM': 205.55,
'HPQ': 37.20,
'FB': 10.75
}

x = [val for val in zip(prices.values(),prices.keys())]

for score,name in x:
	if score == 205.55:
		print(name)

Whenever you need to compare or rank a dictionary's values together with their keys, think of zip().
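One caveat when using zip() this way: it returns an iterator that can be consumed only once. The sketch below uses a shortened version of the prices dictionary to show what happens when the same zip object is reused.

```python
prices = {'ACME': 45.23, 'AAPL': 612.78, 'FB': 10.75}

prices_and_names = zip(prices.values(), prices.keys())

# The first pass works and consumes the iterator
lowest = min(prices_and_names)

# The second pass sees an empty iterator, so max() raises ValueError
try:
    max(prices_and_names)
    exhausted = False
except ValueError:
    exhausted = True

print(lowest)     # (10.75, 'FB')
print(exhausted)  # True
```

Calling zip() again for each computation, as the examples above do, avoids the problem.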

1.9 Finding Commonalities in Two Dictionaries

Use set operations on the key and item views:

a = {
'x' : 1,
'y' : 2,
'z' : 3
}

b = {
'w' : 10,
'x' : 11,
'y' : 2
}

res1 = a.keys() & b.keys()
res2 = a.keys() | b.keys()
res3 = a.keys() - b.keys()

res4 = a.items() & b.items()

print(res1)  # {'y', 'x'}
print(res2)  # {'y', 'w', 'x', 'z'}
print(res3)  # {'z'}

print(res4)  # {('y', 2)}

Dropping certain keys while building a new dictionary:

a = {
'x' : 1,
'y' : 2,
'z' : 3
}

res = {key:a[key] for key in a.keys() - {'x'}}
print(res) # {'z': 3, 'y': 2}
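The set operators work on keys() and items() but not on values(), since values are not guaranteed to be unique or hashable. A minimal sketch of the difference, reusing the a and b dictionaries from this section:

```python
a = {'x': 1, 'y': 2, 'z': 3}
b = {'w': 10, 'x': 11, 'y': 2}

# The values view does not implement set operators
try:
    a.values() & b.values()
    supported = True
except TypeError:
    supported = False

# Convert to sets first when the values are hashable
common_values = set(a.values()) & set(b.values())

print(supported)      # False
print(common_values)  # {2}
```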

1.10 Removing Duplicates from a Sequence while Maintaining Order

def dedupe(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]

print(list(dedupe(a)))
# [1, 5, 2, 9, 10]

Adding a key function argument, so that unhashable items such as dicts can be handled:

def dedupe(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [ {'x':1, 'y':2}, {'x':1, 'y':3}, {'x':1, 'y':2}, {'x':2, 'y':4}]
print(list(dedupe(a, key=lambda d: (d['x'],d['y']))))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
print(list(dedupe(a, key=lambda d: d['x'])))
# [{'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
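For the simple case where all items are hashable, a shorter alternative is dict.fromkeys(). This is a different technique from the generator above and assumes Python 3.7 or later, where dicts keep insertion order.

```python
a = [1, 5, 2, 1, 9, 1, 5, 10]

# Keys of a dict are unique and keep first-seen order (Python 3.7+)
unique_in_order = list(dict.fromkeys(a))

print(unique_in_order)  # [1, 5, 2, 9, 10]
```

The generator version remains preferable when the input is large or a key function is needed, because it yields results lazily.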

1.11 Naming a Slice

record = '....................100 .......513.25 ..........'
cost = int(record[20:23]) * float(record[31:37])

print(cost)  # 51325.0

Hard-coded indices are hard to read and maintain; use named slices instead:

SHARES = slice(20,23)
PRICE = slice(31,37)
cost = int(record[SHARES]) * float(record[PRICE])
print(cost)  # 51325.0
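A slice object also exposes its bounds and can be fitted to a sequence of a particular length. The following sketch uses an arbitrary slice and string to show the start, stop, and step attributes and the indices() method.

```python
s = slice(5, 50, 2)
print(s.start, s.stop, s.step)  # 5 50 2

text = 'HelloWorld'

# indices(length) clips the slice to the given length
bounds = s.indices(len(text))
print(bounds)  # (5, 10, 2)

# The clipped bounds can be fed to range() without risking an IndexError
picked = [text[i] for i in range(*bounds)]
print(picked)  # ['W', 'r', 'd']
```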

1.12 Determining the Most Frequently Occurring Items in a Sequence

Use collections.Counter:

from collections import Counter

x = [1,2,3,1,2,5,6,4,2,3,5,1,2,5,2,6,2,6,5,2,6]
c = Counter(x)
print(c)  # Counter({2: 7, 5: 4, 6: 4, 1: 3, 3: 2, 4: 1})
print(c.most_common())  # [(2, 7), (5, 4), (6, 4), (1, 3), (3, 2), (4, 1)]
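most_common() also accepts a count, and Counter objects can be combined arithmetically. A short sketch with made-up strings as input:

```python
from collections import Counter

a = Counter('aabbbc')  # a: 2, b: 3, c: 1
b = Counter('abcc')    # a: 1, b: 1, c: 2

# Only the single most common element
top_one = a.most_common(1)

# Adding counters sums the counts per element
combined = a + b

# Subtracting drops elements whose count falls to zero or below
difference = a - b

print(top_one)     # [('b', 3)]
print(combined)    # Counter({'b': 4, 'a': 3, 'c': 3})
print(difference)  # Counter({'b': 2, 'a': 1})
```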

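1.13 Sorting a List of Dictionaries by a Common Key

You have a list of dictionaries and want to sort it by one or more of the dictionary fields.

Use the itemgetter function from the operator module.

operator.itemgetter does not fetch a value itself; it returns a callable, and the value is retrieved only when that callable is applied to an object.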
from operator import itemgetter

get_first = itemgetter(0)

alist1 = [1,2,3,4,5,6]
alist2 = [7,8,9,4,5,6,2]

print(get_first(alist1))
print(get_first(alist2))

Using itemgetter to sort a list of dictionaries by a key:

from operator import itemgetter

rows = [
{'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
{'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
{'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]


rows_by_fname = sorted(rows, key=itemgetter('fname'))
rows_by_uid = sorted(rows, key=itemgetter('uid'))

print(rows_by_fname)
# [{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}, {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}, {'fname': 'David', 'lname': 'Beazley', 'uid': 1002}, {'fname': 'John', 'lname': 'Cleese', 'uid': 1001}]

print(rows_by_uid)
# [{'fname': 'John', 'lname': 'Cleese', 'uid': 1001}, {'fname': 'David', 'lname': 'Beazley', 'uid': 1002}, {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}, {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}]

Or, with a lambda:

rows = [
{'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
{'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
{'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]

rows_by_fname = sorted(rows,key=lambda x:x['fname'])  # sorted() returns a new list; rows itself is unchanged
print(rows_by_fname)

itemgetter() also accepts multiple keys, which allows sorting by a primary and a secondary key:

from operator import itemgetter

rows = [
{'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
{'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
{'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]

rows_by_fname_and_uid = sorted(rows, key=itemgetter('fname','uid'))
print(rows_by_fname_and_uid)
# [{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}, {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}, {'fname': 'David', 'lname': 'Beazley', 'uid': 1002}, {'fname': 'John', 'lname': 'Cleese', 'uid': 1001}]
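Since every fname above is unique, the secondary key never comes into play in that example. The sketch below sorts the same rows by lname and then fname, so the two Jones entries show the tie-breaking at work.

```python
from operator import itemgetter

rows = [
    {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
    {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004},
]

# itemgetter('lname', 'fname') returns a tuple, and tuples compare field by field
rows_by_lfname = sorted(rows, key=itemgetter('lname', 'fname'))

uids = [row['uid'] for row in rows_by_lfname]
print(uids)  # [1002, 1001, 1004, 1003]
```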

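1.14 Sorting Objects without Native Comparison Support

You want to sort objects of the same class, but they do not natively support comparison operations.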

class User:
    def __init__(self, user_id):
        self.user_id = user_id
    def __repr__(self):
        return 'User({})'.format(self.user_id)

users = [User(each_num) for each_num in (5,2,6)]
print(users)  
# [User(5), User(2), User(6)]

print(sorted(users,key=lambda x:x.user_id))
# [User(2), User(5), User(6)]

Or use attrgetter:

from operator import attrgetter

get_userid = attrgetter('user_id')

class User:
    def __init__(self, user_id):
        self.user_id = user_id
    def __repr__(self):
        return 'User({})'.format(self.user_id)

users = [User(each_num) for each_num in (5,2,6)]
print(users)  
# [User(5), User(2), User(6)]

print(sorted(users,key=get_userid))
# [User(2), User(5), User(6)]

Both attrgetter and itemgetter accept multiple arguments.

In other words, either function can be used to sort by a primary and a secondary key.
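Here is a minimal sketch of attrgetter with two attributes. The Person class and its last_name and first_name fields are hypothetical and exist only for this illustration.

```python
from operator import attrgetter

class Person:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def __repr__(self):
        return 'Person({!r}, {!r})'.format(self.first_name, self.last_name)

people = [Person('Brian', 'Jones'), Person('John', 'Cleese'), Person('Big', 'Jones')]

# Primary key: last_name; secondary key: first_name
by_name = sorted(people, key=attrgetter('last_name', 'first_name'))

names = [(p.last_name, p.first_name) for p in by_name]
print(names)  # [('Cleese', 'John'), ('Jones', 'Big'), ('Jones', 'Brian')]
```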

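1.15 Grouping Records Together Based on a Field

You have a sequence of dictionaries or instances and want to iterate over the data in groups based on the value of a particular field, such as date.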

from itertools import groupby


rows = [
{'address': '5412 N CLARK', 'date': '07/01/2012'},
{'address': '5148 N CLARK', 'date': '07/04/2012'},
{'address': '5800 E 58TH', 'date': '07/02/2012'},
{'address': '2122 N CLARK', 'date': '07/03/2012'},
{'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
{'address': '1060 W ADDISON', 'date': '07/02/2012'},
{'address': '4801 N BROADWAY', 'date': '07/01/2012'},
{'address': '1039 W GRANVILLE', 'date': '07/04/2012'},
]

for each in groupby(rows,key=lambda x:x['date']):
	print(each)
# ('07/01/2012', <itertools._grouper object at 0x0000021DAEBFDB38>)
# ('07/04/2012', <itertools._grouper object at 0x0000021DAEBFDBA8>)
# ('07/02/2012', <itertools._grouper object at 0x0000021DAEBFDB70>)
# ('07/03/2012', <itertools._grouper object at 0x0000021DAEBFDB38>)
# ('07/02/2012', <itertools._grouper object at 0x0000021DAEBFDBA8>)
# ('07/01/2012', <itertools._grouper object at 0x0000021DAEBFDB70>)
# ('07/04/2012', <itertools._grouper object at 0x0000021DAEBFDB38>)

from itertools import groupby


rows = [
{'address': '5412 N CLARK', 'date': '07/01/2012'},
{'address': '5148 N CLARK', 'date': '07/04/2012'},
{'address': '5800 E 58TH', 'date': '07/02/2012'},
{'address': '2122 N CLARK', 'date': '07/03/2012'},
{'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
{'address': '1060 W ADDISON', 'date': '07/02/2012'},
{'address': '4801 N BROADWAY', 'date': '07/01/2012'},
{'address': '1039 W GRANVILLE', 'date': '07/04/2012'},
]

for date,val in groupby(rows,key=lambda x:x['date']):
	print(date)
	for item in val:
		print(item)
# 07/01/2012
# {'address': '5412 N CLARK', 'date': '07/01/2012'}
# 07/04/2012
# {'address': '5148 N CLARK', 'date': '07/04/2012'}
# 07/02/2012
# {'address': '5800 E 58TH', 'date': '07/02/2012'}
# 07/03/2012
# {'address': '2122 N CLARK', 'date': '07/03/2012'}
# 07/02/2012
# {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'}
# {'address': '1060 W ADDISON', 'date': '07/02/2012'}
# 07/01/2012
# {'address': '4801 N BROADWAY', 'date': '07/01/2012'}
# 07/04/2012
# {'address': '1039 W GRANVILLE', 'date': '07/04/2012'}
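Note that groupby() only groups consecutive items that share a key, which is why dates such as 07/02/2012 appear more than once in the output above. A minimal sketch that sorts by the grouping field first, using a shortened version of rows:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
]

# Sort by the grouping field so that equal dates become adjacent
rows.sort(key=itemgetter('date'))

grouped = {
    date: [row['address'] for row in items]
    for date, items in groupby(rows, key=itemgetter('date'))
}

print(grouped)
# {'07/01/2012': ['5412 N CLARK', '4801 N BROADWAY'],
#  '07/02/2012': ['5800 E 58TH'],
#  '07/04/2012': ['5148 N CLARK']}
```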

You can also use defaultdict, which needs no sorting:

from collections import defaultdict

rows = [
{'address': '5412 N CLARK', 'date': '07/01/2012'},
{'address': '5148 N CLARK', 'date': '07/04/2012'},
{'address': '5800 E 58TH', 'date': '07/02/2012'},
{'address': '2122 N CLARK', 'date': '07/03/2012'},
{'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
{'address': '1060 W ADDISON', 'date': '07/02/2012'},
{'address': '4801 N BROADWAY', 'date': '07/01/2012'},
{'address': '1039 W GRANVILLE', 'date': '07/04/2012'},
]

d = defaultdict(list)

for row in rows:
    d[row['date']].append(row['address'])

print(d)
# defaultdict(<class 'list'>, {'07/01/2012': ['5412 N CLARK', '4801 N BROADWAY'], '07/04/2012': ['5148 N CLARK', '1039 W GRANVILLE'], '07/02/2012': ['5800 E 58TH', '5645 N RAVENSWOOD', '1060 W ADDISON'], '07/03/2012': ['2122 N CLARK']})

1.16 Filtering Sequence Elements

List comprehension:

mylist = [1, 4, -5, 10, -7, 2, 3, -1]

new_list = [num for num in mylist if num > 0]
print(new_list)  # [1, 4, 10, 2, 3]

filter

mylist = [1, 4, -5, 10, -7, 2, 3, -1]

x = list(filter(lambda x:x>0,mylist))
print(x)  # [1, 4, 10, 2, 3]

One thing to watch with list comprehensions is the position of the if:

if at the end (filters elements out):

mylist = [1, 4, -5, 10, -7, 2, 3, -1]

clip_pos = [n for n in mylist if n < 0]
print(clip_pos)  # [-5, -7, -1]

if at the front (a conditional expression that replaces elements):

mylist = [1, 4, -5, 10, -7, 2, 3, -1]

clip_pos = [n if n < 0 else 0 for n in mylist]
print(clip_pos)  # [0, 0, -5, 0, -7, 0, 0, -1]

A trailing if cannot take an else; a leading if (a conditional expression) can, and in fact requires one:

mylist = [1, 4, -5, 10, -7, 2, 3, -1]
clip_pos = [n for n in mylist if n < 0 else 0]  # SyntaxError: invalid syntax

print(clip_pos)

itertools.compress() takes an iterable and a matching sequence of Boolean selectors as input, and yields the items of the iterable whose selector is True.

from itertools import compress

addresses = [
'5412 N CLARK',
'5148 N CLARK',
'5800 E 58TH',
'2122 N CLARK',
'5645 N RAVENSWOOD',
'1060 W ADDISON',
'4801 N BROADWAY',
'1039 W GRANVILLE',
]
counts = [ 0, 3, 10, 4, 1, 7, 6, 1]


more5 = [n > 5 for n in counts]
print(more5)
# [False, False, True, False, False, True, True, False]

print(list(compress(addresses, more5)))
# ['5800 E 58TH', '1060 W ADDISON', '4801 N BROADWAY']
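When the filtering rule is too involved for a single expression, for example because it needs exception handling, filter() with a helper function is a natural fit. A sketch with a hypothetical is_int() helper and made-up input:

```python
values = ['1', '2', '-3', '-', '4', 'N/A', '5']

def is_int(val):
    """Return True if val can be converted to an int."""
    try:
        int(val)
        return True
    except ValueError:
        return False

# filter() returns an iterator, so wrap it in list() to see the result
int_values = list(filter(is_int, values))

print(int_values)  # ['1', '2', '-3', '4', '5']
```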

1.17 Extracting a Subset of a Dictionary

Build a dictionary that is a subset of another dictionary.

The simplest way is a dictionary comprehension.

prices = {
'ACME': 45.23,
'AAPL': 612.78,
'IBM': 205.55,
'HPQ': 37.20,
'FB': 10.75
}


tech_names = {'AAPL', 'IBM', 'HPQ', 'MSFT'}
p2 = {key: value for key, value in prices.items() if key in tech_names}
print(p2)
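The same pattern works for selecting by value instead of by key. A minimal sketch that keeps only the entries priced above 200 (the threshold is arbitrary):

```python
prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75,
}

# Keep only the entries whose value passes the test
p1 = {key: value for key, value in prices.items() if value > 200}

print(p1)  # {'AAPL': 612.78, 'IBM': 205.55}
```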

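1.18 Mapping Names to Sequence Elements

collections.namedtuple() lets you access the elements of a tuple by name instead of by position, while the instances remain ordinary tuples underneath.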

from collections import namedtuple

Subscriber = namedtuple('Subscriber', ['addr', 'joined'])
sub = Subscriber('jonesy@example.com', '2012-10-19')

print(sub)
# Subscriber(addr='jonesy@example.com', joined='2012-10-19')

print(sub.addr)
# jonesy@example.com

print(sub.joined)
# 2012-10-19

Instances of a namedtuple support all the usual tuple operations, such as indexing and unpacking.


addr, joined = sub
print(addr)    # jonesy@example.com
print(joined)  # 2012-10-19

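Another use of named tuples is as a replacement for dictionaries, which need more memory. If you are building a very large data structure made up of dictionaries, named tuples will be more efficient.

Note:
Because tuples are immutable, you cannot modify the attributes of a named tuple.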

from collections import namedtuple

Subscriber = namedtuple('Subscriber', ['addr', 'joined'])
sub = Subscriber('jonesy@example.com', '2012-10-19')

print(sub)
# Subscriber(addr='jonesy@example.com', joined='2012-10-19')

sub.addr = 'xxx'  # AttributeError: can't set attribute

If you really need to change an attribute's value, use the _replace() method of the named tuple instance, which returns a new instance:

from collections import namedtuple

Subscriber = namedtuple('Subscriber', ['addr', 'joined'])
sub = Subscriber('jonesy@example.com', '2012-10-19')

print(sub)
# Subscriber(addr='jonesy@example.com', joined='2012-10-19')

sub = sub._replace(addr='xxx')
print(sub)
# Subscriber(addr='xxx', joined='2012-10-19')
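Named tuples also help decouple code from positions in a record. The sketch below uses a hypothetical Stock type and compute_cost() function with made-up records to show field access by name rather than by index.

```python
from collections import namedtuple

Stock = namedtuple('Stock', ['name', 'shares', 'price'])

def compute_cost(records):
    """Sum shares * price over an iterable of (name, shares, price) tuples."""
    total = 0.0
    for rec in records:
        s = Stock(*rec)  # wrap the plain tuple so fields can be read by name
        total += s.shares * s.price
    return total

records = [('ACME', 100, 10.0), ('IBM', 50, 20.0)]
total_cost = compute_cost(records)

print(total_cost)  # 2000.0
```

If the layout of the records changes later, only the namedtuple definition needs updating, not every index in the calculation.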

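1.19 Transforming and Reducing Data at the Same Time

You need to run a reduction function (such as sum(), min(), or max()) over a sequence of data, but first have to transform or filter the data.

Pass a generator expression as the argument: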

num = [1,2,3,4,5]

res = sum(x**2 for x in num)
print(res)

import os

files = os.listdir('dirname')
if any(name.endswith('.py') for name in files):
	print('There be python!')
else:
	print('Sorry, no python.')


# Output a tuple as CSV
s = ('ACME', 50, 123.45)
print(','.join(str(x) for x in s))

# Data reduction across fields of a data structure
portfolio = [
{'name':'GOOG', 'shares': 50},
{'name':'YHOO', 'shares': 75},
{'name':'AOL', 'shares': 20},
{'name':'SCOX', 'shares': 65}
]

min_shares = min(s['shares'] for s in portfolio)
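A generator expression passed to min() yields only the transformed value. If the whole record is wanted instead, the key argument is an alternative. The sketch below contrasts the two on the same portfolio data.

```python
portfolio = [
    {'name': 'GOOG', 'shares': 50},
    {'name': 'YHOO', 'shares': 75},
    {'name': 'AOL', 'shares': 20},
    {'name': 'SCOX', 'shares': 65},
]

# Generator expression: the result is the smallest share count itself
min_shares = min(s['shares'] for s in portfolio)

# key argument: the result is the record that has the smallest share count
min_record = min(portfolio, key=lambda s: s['shares'])

# Reductions such as sum() take a generator expression the same way
total_shares = sum(s['shares'] for s in portfolio)

print(min_shares)    # 20
print(min_record)    # {'name': 'AOL', 'shares': 20}
print(total_shares)  # 210
```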

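1.20 Combining Multiple Dictionaries or Mappings

You have multiple dictionaries or mappings that you want to logically combine into a single mapping in order to perform operations such as looking up values or checking whether keys exist.

Use the ChainMap class from the collections module: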

from collections import ChainMap

a = {'x': 1, 'z': 3 }
b = {'y': 2, 'z': 4 }
c = ChainMap(a,b)

print(c['x']) # Outputs 1 (from a)
print(c['y']) # Outputs 2 (from b)

# Important: if a key is duplicated, the value from the first mapping that contains it is returned.
print(c['z']) # Outputs 3 (from a)

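ChainMap simply keeps an internal list of the underlying mappings and redefines the common dictionary operations to scan that list.

That is why print(c['z']) gives 3.

ChainMap references the original dictionaries; it does not create a new dictionary of its own.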

a = {'x': 1, 'z': 3 }
b = {'y': 2, 'z': 4 }

merged = ChainMap(a, b)

print(merged['x'])  # 1
a['x'] = 42

print(merged['x']) # 42
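Lookups scan every mapping, but operations that change the ChainMap always act on the first mapping only. A minimal sketch with the same a and b dictionaries:

```python
from collections import ChainMap

a = {'x': 1, 'z': 3}
b = {'y': 2, 'z': 4}
c = ChainMap(a, b)

c['z'] = 10  # updates a, not b
c['w'] = 40  # new keys are added to a
del c['x']   # deletes from a

# Deleting a key that lives only in b fails, because only a is modified
try:
    del c['y']
    deleted = True
except KeyError:
    deleted = False

print(a)        # {'z': 10, 'w': 40}
print(b)        # {'y': 2, 'z': 4}
print(deleted)  # False
```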