Chapter 06: Data Loading, Storage, and File Formats
pandas features a number of functions for reading tabular data as a DataFrame object. Table 1 summarizes them; of these, read_csv and read_table are likely the ones you'll use the most.
[Table 1: pandas functions for reading tabular data (original image not preserved)]
import pandas as pd
import numpy as np
df = pd.read_csv('examples/ex1.csv')
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
pd.read_table('examples/ex1.csv', sep=',')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# The following shell command is not usable on this machine:
#!cat examples/ex1.csv
For data without a header row, you can let pandas assign default column names with the header argument, or specify the names yourself with the names argument. Suppose you also want the message column to be the index of the returned DataFrame: you can either indicate that you want the column at index 4, or pass "message", using the index_col argument.
pd.read_csv('examples/ex2.csv', header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
parsed = pd.read_csv('examples/csv_mindex.csv',
index_col=['key1', 'key2'])
parsed
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
# Some tables may not have a fixed delimiter, using whitespace or some other pattern to separate fields
list(open('examples/ex3.txt'))
[' A B C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb 0.927272 0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382 1.100491\n']
result = pd.read_table('examples/ex3.txt', sep=r'\s+')
result
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
# skiprows can be used to skip the first, third, and fourth rows of the file
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
result = pd.read_csv("examples/ex5.csv")
print(result)
pd.isnull(result)
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False
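read_csv can also be told which strings to treat as missing on the way in: the na_values option takes a list of sentinel strings, either globally or per column as a dict. A minimal sketch, using inline text as a stand-in for examples/ex5.csv (the data here is typed in, not read from disk):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for examples/ex5.csv
csv_text = """something,a,b,c,d,message
one,1,2,3.0,4,NA
two,5,6,,8,world
three,9,10,11.0,12,foo"""

# A list of strings to treat as NA, in addition to the defaults
result = pd.read_csv(StringIO(csv_text), na_values=['NULL'])

# Different NA sentinels can be specified for each column with a dict
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result2 = pd.read_csv(StringIO(csv_text), na_values=sentinels)
```

With the dict form, 'foo' and 'NA' become NaN only in the message column, and 'two' only in the something column.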
Commonly used options in pandas.read_csv and pandas.read_table (original table images not preserved)
Reading Text Files in Pieces
First, let's make the pandas display settings a bit more compact:
pd.options.display.max_rows = 10
result = pd.read_csv('examples/ex6.csv')
result
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ...
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0
10000 rows × 5 columns
# To read only a small number of rows (avoiding reading the entire file), specify that with nrows:
pd.read_csv('examples/ex6.csv', nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
# To read a file in pieces, specify a chunksize as a number of rows
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0xb66a590>
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
M 338.0
J 337.0
F 335.0
K 334.0
H 330.0
dtype: float64
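Besides plain iteration, the TextFileReader returned when you pass chunksize also has a get_chunk method, which reads a piece of an arbitrary size on demand. A sketch with inline data standing in for examples/ex6.csv (the key values here are made up):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for examples/ex6.csv: 20 one-letter keys
csv_text = "key\n" + "\n".join("ABCDE" * 4)
reader = pd.read_csv(StringIO(csv_text), chunksize=10)

# get_chunk reads the next piece with the requested number of rows
piece = reader.get_chunk(4)
```

This is handy when the pieces you want to process don't all have the same size.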
Writing Data to Text Format
Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
data = pd.read_csv('examples/ex5.csv')
data.to_csv('examples/out2020.csv')
import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
# Missing values appear as empty strings in the output. You might want them denoted by some other sentinel value:
data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
# With no other options specified, both the row and column labels are written. Both of these can be disabled:
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
# You can also write only a subset of the columns, and in an order of your choosing:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
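Series also has a to_csv method. A small sketch, writing to an in-memory buffer rather than a file on disk:

```python
import pandas as pd
from io import StringIO

# A Series with a date index
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(range(7), index=dates)

# Write it out as index,value lines; header=False suppresses the column label
buf = StringIO()
ts.to_csv(buf, header=False)
```

Each line of the buffer then holds one date and one value, e.g. 2000-01-01,0.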
Working with Delimited Formats
import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)
# Iterating through the reader like a file yields a list of
# values for each line, with any quote characters removed:
for line in reader:
    print(line)
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
CSV dialect options
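The heading above refers to the csv module's dialect mechanism: delimiter, quoting, and line-terminator options can be bundled into a Dialect subclass and passed to csv.reader or csv.writer. A minimal sketch writing to an in-memory buffer:

```python
import csv
from io import StringIO

# Bundle formatting options into a Dialect subclass
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

buf = StringIO()
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))
```

The same options can also be given to csv.reader/csv.writer directly as keyword arguments instead of defining a class.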
JSON Data
JSON is a data format that is much more flexible than a tabular text form like CSV.
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},{"name": "Katie", "age": 38,"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
import json
result = json.loads(obj)
result
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
# json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)
# How you convert a JSON object (or list of objects) to a DataFrame or some other
# structure for analysis is up to you. Conveniently, you can pass a list of dicts
# (which were previously JSON objects) to the DataFrame constructor and select a
# subset of the data fields
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings
name age
0 Scott 30
1 Katie 38
# The default options for pandas.read_json assume that each object in the JSON array is a row in the table
data = pd.read_json('examples/example.json')
data
a b c
0 1 2 3
1 4 5 6
2 7 8 9
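Going the other way, pandas objects have a to_json method for exporting to JSON; the orient option controls the layout. A short sketch on the same values as above, constructed inline here rather than read from example.json:

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# Default orientation: one JSON object per column, keyed by index label
col_json = data.to_json()

# orient='records' emits one object per row instead
row_json = data.to_json(orient='records')
```

The 'records' orientation matches the row-per-object layout that read_json expects by default.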
6.2 Binary Data Formats
One of the easiest ways to store data efficiently in binary format is using Python's built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format.
frame = pd.read_csv('examples/ex1.csv')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
frame.to_pickle('examples/frame_pickle')
pd.read_pickle('examples/frame_pickle')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# To use ExcelFile, create an instance by passing a path to an xls or xlsx file:
xlsx = pd.ExcelFile('examples/ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo
frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo
# To write pandas data to Excel format, you must first create an ExcelWriter,
# then write the data to it using the pandas object's to_excel method:
writer = pd.ExcelWriter('examples/ex2020.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()
# You can also pass a file path to to_excel and avoid the ExcelWriter:
frame.to_excel('examples/ex2019.xlsx')
6.3 Interacting with Web APIs
We'll use the requests package:
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
<Response [200]>
data = resp.json()
data[0]['title']
"DOC: Fix the description of the 'day' field accessor in DatetimeArray"
issues = pd.DataFrame(data, columns=['number', 'title',
'labels', 'state'])
issues
number title labels state
0 31490 DOC: Fix the description of the 'day' field ac... [] open
1 31489 ~ operator on Series with BooleanDtype casts t... [] open
2 31488 Unclosed file on EmptyDataError [] open
3 31487 Maybe wrong default axis with operators (add, ... [] open
4 31486 DOC: Parameter doc strings for Groupby.(sum|pr... [] open
... ... ... ... ...
25 31459 ENH: pd.cut should be able to return a Series ... [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... open
26 31458 Fix to_csv and to_excel links on read_csv, rea... [] open
27 31457 Timedelta multiplication crashes for large arrays [{'id': 47223669, 'node_id': 'MDU6TGFiZWw0NzIy... open
28 31456 BUG: Groupby.apply wasn't allowing for functio... [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj... open
29 31455 jobs failling with error raise RuntimeError("C... [{'id': 307649777, 'node_id': 'MDU6TGFiZWwzMDc... open
30 rows × 4 columns
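When a column like labels holds nested records, pandas.json_normalize can flatten them into their own table. A sketch on hypothetical records shaped like the GitHub issues response (the numbers and label names below are made up, not actual API output):

```python
import pandas as pd

# Hypothetical records shaped like the GitHub issues response
issues_data = [
    {'number': 31490, 'title': 'DOC fix', 'labels': [{'name': 'Docs'}], 'state': 'open'},
    {'number': 31488, 'title': 'Unclosed file', 'labels': [], 'state': 'open'},
]

# One row per label, carrying the parent issue's number along via meta
labels = pd.json_normalize(issues_data, record_path='labels', meta=['number'])
```

Issues with an empty labels list simply contribute no rows to the flattened table.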
6.4 数据库交互
在商业场景下,大多数数据可能不是存储在文本或Excel文件中。基于SQL的关系型数据库(如SQLServer、PostgreSQL和MySQL等)使用非常广泛,其它⼀些数据库也很流行。数据库的选择通常取决于性能、数据完整性以及应⽤程序的伸缩性需求。
将数据从SQL加载到DataFrame的过程很简单,此外pandas还有1些能够简化该过程的函数。例如,我将使用SQLite数据库(通过Python内置的sqlite3驱动器)。
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
<sqlite3.Cursor at 0xe49e5a0>
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
cursor.description
(('a', None, None, None, None, None, None),
('b', None, None, None, None, None, None),
('c', None, None, None, None, None, None),
('d', None, None, None, None, None, None))
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test', db)
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5
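pandas can also write in the other direction: DataFrame.to_sql creates the table and inserts the rows in one call. A minimal sketch using an in-memory SQLite database as a stand-in for mydata.sqlite:

```python
import sqlite3
import pandas as pd

# In-memory database; a stand-in for mydata.sqlite
con = sqlite3.connect(':memory:')
frame = pd.DataFrame({'a': ['Atlanta', 'Tallahassee'],
                      'b': ['Georgia', 'Florida'],
                      'c': [1.25, 2.6]})

# to_sql creates the table and inserts the rows in one step
frame.to_sql('test', con, index=False)
roundtrip = pd.read_sql('select * from test', con)
```

This avoids writing the CREATE TABLE and INSERT statements by hand as we did above.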