[Notes] Introducing Python (《Python语言以及应用》) - Data Manipulation

Author: u14e | Published: 2018-04-01 01:02

String: a sequence of Unicode characters
Bytes and bytearray: sequences of 8-bit integers (values 0 to 255)
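
As a small sketch of the difference between the two byte types: bytes is immutable, while bytearray can be changed in place. The sample values in blist are made up for illustration.

```python
# Hypothetical sample values, chosen only for illustration.
blist = [1, 2, 3, 255]

the_bytes = bytes(blist)            # immutable sequence of 8-bit values
the_byte_array = bytearray(blist)   # mutable sequence of 8-bit values

try:
    the_bytes[1] = 127              # item assignment is not allowed on bytes
except TypeError as exc:
    print('bytes is immutable:', exc)

the_byte_array[1] = 127             # allowed on bytearray
print(the_byte_array)               # bytearray(b'\x01\x7f\x03\xff')
```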

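1. Encoding and decoding

Encoding: the process of converting a Unicode string into a sequence of bytes

# str type
snowman = '\u2603'
len(snowman)  # 1: it holds a single Unicode character, regardless of how many bytes are needed to store it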

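# bytes type
ds = snowman.encode('utf-8')  # b'\xe2\x98\x83'
len(ds) # 3: the character takes up 3 bytes in UTF-8

# encode(encoding='utf-8', errors='strict') -> bytes
# The first argument is the encoding; the second says how encoding errors are handled

# strict (the default): raise UnicodeEncodeError on failure

# ignore: drop characters that cannot be encoded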
snowman = 'abc\u2603'
ds = snowman.encode('ascii', 'ignore')
print(ds) # b'abc'
print(len(ds))  # 3

# replace: substitute ? for each character that cannot be encoded
snowman = 'abc\u2603'
ds = snowman.encode('ascii', 'replace')
print(ds) # b'abc?'
print(len(ds))  # 4

# backslashreplace: turn each character that cannot be encoded into a backslash escape such as \u2603
snowman = 'abc\u2603'
ds = snowman.encode('ascii', 'backslashreplace')
print(ds) # b'abc\\u2603'
print(len(ds))  # 9

# xmlcharrefreplace: turn each character that cannot be encoded into an XML character reference
snowman = 'abc\u2603'
ds = snowman.encode('ascii', 'xmlcharrefreplace')
print(ds) # b'abc&#9731;'
print(len(ds))  # 10

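Decoding: the process of converting a sequence of bytes back into a Unicode string. (The codec used to decode must match the one used to encode, e.g. utf-8 for both; otherwise you will not get the expected text.)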

place = 'caf\u00e9'
place_bytes = place.encode('utf-8') # b'caf\xc3\xa9'
place2 = place_bytes.decode('utf-8')  # café
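
To make the warning about mismatched codecs concrete, here is a minimal sketch of what can go wrong. The choice of latin-1 and ascii as the "wrong" codecs is only an illustration; other codecs fail in other ways.

```python
place = 'caf\u00e9'
place_bytes = place.encode('utf-8')        # b'caf\xc3\xa9'

# Decoding with a different single-byte codec raises no error,
# but silently produces the wrong text.
wrong = place_bytes.decode('latin-1')
print(wrong)                               # cafÃ©

# Decoding as ASCII fails outright, because 0xc3 and 0xa9 are not ASCII bytes.
try:
    place_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii cannot decode it:', exc)

# decode() accepts an errors argument too; 'replace' substitutes U+FFFD
# for every byte that cannot be decoded.
lenient = place_bytes.decode('ascii', errors='replace')
print(lenient)
```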

2. Formatting

Old-style formatting: %
New-style formatting: {} and format

n = 42
f = 7.03
s = 'string cheese'

print('{} {} {}'.format(n, f, s)) # default order
print('{2} {0} {1}'.format(f, s, n))  # positional indexes
print('{n} {f} {s}'.format(n=n, f=f, s=s))  # keyword arguments

d = {'n': n, 'f': f, 's': s}
print('{0[n]} {0[f]} {0[s]} {1}'.format(d, 'other'))  # dictionary lookup

print('{0:5d} {1:10f} {2:20s}'.format(n, f, s)) # default: numbers are right-aligned, strings left-aligned
print('{0:>5d} {1:>10f} {2:>20s}'.format(n, f, s))  # right-aligned
print('{0:<5d} {1:<10f} {2:<20s}'.format(n, f, s))  # left-aligned
print('{0:^5d} {1:^10f} {2:^20s}'.format(n, f, s))  # centered
print('{0:!^5d} {1:#^10f} {2:&^20s}'.format(n, f, s)) # fill characters
print('{0:!^5d} {1:#^10.4f} {2:&^20.4s}'.format(n, f, s)) # precision: decimal places for floats, maximum characters for strings
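
For comparison, the old-style % operator mentioned at the top of this section can be sketched like this, reusing the same three values. The particular widths and precisions are arbitrary choices for the example.

```python
n = 42
f = 7.03
s = 'string cheese'

# Basic conversions: %d for integers, %f for floats, %s for strings.
old = '%d %f %s' % (n, f, s)
print(old)      # 42 7.030000 string cheese

# Width, alignment and precision:
#   %10d    right-aligned in 10 columns
#   %-10d   left-aligned in 10 columns
#   %10.4f  10 columns wide, 4 decimal places
#   %.4s    at most 4 characters of the string
padded = '%10d|%-10d|%10.4f|%.4s' % (n, n, f, s)
print(padded)
```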

3. Regular expressions

match checks whether the source starts with the pattern

import re

source = 'Young man'
m = re.match('You', source)
if m:
  print(m.group())  # You

search returns the first match found anywhere in the source

import re

source = 'Young man'
m = re.search('man', source)
if m:
  print(m.group())  # man

m1 = re.match('.*man', source)
if m1:
  print(m1.group())  # Young man
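
A short sketch of the difference between the two functions on the same source string: match only succeeds at the very start, while search scans the whole string.

```python
import re

source = 'Young man'

# 'man' is not at the start of the string, so match finds nothing.
no_match = re.match('man', source)
print(no_match)             # None

# search looks anywhere in the string and reports where the match was found.
found = re.search('man', source)
print(found.span())         # (6, 9)
```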

findall returns all non-overlapping matches

import re

source = 'Young man'
m = re.findall('n.?', source)
print(m)  # ['ng', 'n']; returns [] when nothing matches

split works like the string split method, except that the separator is a pattern rather than literal text

import re

source = 'Young man'
m = re.split('n', source)
print(m)  # ['You', 'g ma', '']

sub replaces matches, like the string replace method, except that it takes a pattern rather than literal text

import re

source = 'Young man'
m = re.sub('n', '?', source)
print(m)  # You?g ma?
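
The examples above use a plain letter as the pattern, so they behave just like the string methods. A sketch with real patterns shows where re.split and re.sub go further; the sample text is invented for the example.

```python
import re

text = 'one  two\tthree 42 four 7'   # hypothetical sample text

# Split on any run of whitespace (spaces or tabs), not on one fixed string.
words = re.split(r'\s+', text)
print(words)        # ['one', 'two', 'three', '42', 'four', '7']

# Replace every run of digits, whatever the digits are.
masked = re.sub(r'\d+', '#', text)
print(masked)
```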

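Special characters:

Pattern	Matches
.	any single character except \n
*	zero or more of the preceding item
+	one or more of the preceding item
?	the preceding item is optional (zero or one)
\d	a single digit
\w	a single letter, digit, or underscore
\s	a whitespace character
\b	a word boundary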

import string
import re

printable = string.printable

re.findall(r'\d', printable) # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

re.findall(r'\s', printable) # [' ', '\t', '\n', '\r', '\x0b', '\x0c']
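
A few more hedged examples of the remaining entries in the table (\w, \b, + and ?). The sample strings are made up for illustration; raw strings (r'...') are used so that Python does not interpret the backslashes itself.

```python
import re

source = 'a dish of fish tonight.'

# \w+ : one or more word characters, i.e. each word without punctuation
all_words = re.findall(r'\w+', source)
print(all_words)        # ['a', 'dish', 'of', 'fish', 'tonight']

# \b : word boundary, so only words that START with f are found
f_words = re.findall(r'\bf\w*', source)
print(f_words)          # ['fish']

# \w*ish\b : words that END in 'ish'
ish_words = re.findall(r'\w*ish\b', source)
print(ish_words)        # ['dish', 'fish']

# ? : the preceding character is optional
spellings = re.findall(r'colou?r', 'color colour')
print(spellings)        # ['color', 'colour']
```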

Specifying match output

m.groups() returns the captured groups as a tuple

import re

source = 'a dish of fish tonight.'

m = re.search(r'(. dish\b).*(\bfish)', source)
print(m.group())  # a dish of fish
print(m.groups()) # ('a dish', 'fish')

(?P<name>expr) matches expr and stores the result in a group called name

import re

source = 'a dish of fish tonight.'

m = re.search(r'(?P<DISH>. dish\b).*(?P<FISH>\bfish)', source)
print(m.group())   # a dish of fish
print(m.group('DISH'))  # a dish
print(m.group('FISH'))  # fish
print(m.groups()) # ('a dish', 'fish')
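
As a small addition to the named-group example, the match object can also return all named groups at once as a dictionary, and report where a group matched. This is only a sketch on the same source string.

```python
import re

source = 'a dish of fish tonight.'

m = re.search(r'(?P<DISH>. dish\b).*(?P<FISH>\bfish)', source)
if m:
    named = m.groupdict()       # every named group as a dict
    print(named)                # {'DISH': 'a dish', 'FISH': 'fish'}

    # Named groups are still numbered, so both styles work.
    print(m.group(1), '|', m.group(2))

    fish_span = m.span('FISH')  # start and end index of the FISH group
    print(fish_span)            # (10, 14)
```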

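4. Reading and writing files

open(filename, mode)
The first letter of mode indicates the operation:

r means read
w means write (creates the file if it does not exist; overwrites its contents if it does)
x means create and write, but only if the file does not already exist
a means append: write at the end of the file if it exists (the file is created if it does not)

The second letter indicates the file type:

t means text (the default)
b means binary

Writing a file with write():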

poem = '''
床前明月光，
疑是地上霜。
举头望明月，
低头思故乡。
'''

with open('a.txt', 'wt', encoding='utf-8') as fout:
  fout.write(poem)


# Write the data in chunks
with open('a.txt', 'wt', encoding='utf-8') as fout:
  size = len(poem)
  offset = 0
  chunk = 100
  while offset < size:
    fout.write(poem[offset:offset+chunk])
    offset += chunk

# Avoid overwriting an existing file
try:
  with open('a.txt', 'xt', encoding='utf-8') as fout:
    fout.write(poem)
except FileExistsError:
  print('The file already exists')
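
The write examples cover the w and x modes; append mode (a) from the mode list can be sketched in the same way. The temporary directory and the file name append_demo.txt are assumptions made so the sketch does not touch any real file.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'append_demo.txt')

# 'w' creates the file (or truncates an existing one).
with open(path, 'wt', encoding='utf-8') as fout:
    fout.write('first line\n')

# 'a' keeps the existing content and writes at the end.
with open(path, 'at', encoding='utf-8') as fout:
    fout.write('second line\n')

with open(path, 'rt', encoding='utf-8') as fin:
    content = fin.read()
print(content)

# 'x' refuses to open a file that already exists.
x_refused = False
try:
    open(path, 'xt', encoding='utf-8')
except FileExistsError:
    x_refused = True
    print('x mode will not overwrite an existing file')
```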

Reading a file with read(), readline(), and readlines():

with open('a.txt', 'rt', encoding='utf-8') as fin:
  poem = fin.read()

# Read one line at a time
with open('a.txt', 'rt', encoding='utf-8') as fin:
  poem = ''
  while True:
    line = fin.readline()
    if not line:
      break
    poem += line

# Use the file object as an iterator
with open('a.txt', 'rt', encoding='utf-8') as fin:
  poem = ''
  for line in fin:
    poem += line

# Read all lines and return them as a list of one-line strings
with open('a.txt', 'rt', encoding='utf-8') as fin:
  lines = fin.readlines()
# lines is ['\n', '床前明月光，\n', '疑是地上霜。\n', '举头望明月，\n', '低头思故乡。\n']
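
read() can also take a maximum number of characters, which allows a large file to be read in pieces instead of all at once. A minimal sketch, assuming a scratch file in a temporary directory and an arbitrary chunk size of 10 characters:

```python
import os
import tempfile

poem = '床前明月光，\n疑是地上霜。\n举头望明月，\n低头思故乡。\n'

# Hypothetical scratch file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'wt', encoding='utf-8') as fout:
    fout.write(poem)

chunks = []
chunk_size = 10   # in text mode this counts characters, not bytes
with open(path, 'rt', encoding='utf-8') as fin:
    while True:
        fragment = fin.read(chunk_size)
        if not fragment:      # an empty string means end of file
            break
        chunks.append(fragment)

# Joining the pieces gives back the original text.
restored = ''.join(chunks)
print(restored == poem)
```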

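tell() returns the current byte offset in the file; seek(n) jumps to byte offset n:

seek(offset, origin)

origin=0 (the default): move to offset bytes from the start
origin=1: move offset bytes from the current position
origin=2: move to offset bytes relative to the end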

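# Assumes 'b' is a 256-byte binary file holding the byte values 0 through 255
with open('b', 'rb') as fin:
  print(fin.tell())   # 0: reading starts at the beginning
  fin.seek(254, 0)    # jump to offset 254 (two bytes before the end)
  print(fin.tell())   # 254
  fin.seek(1, 1)      # move forward one more byte from there
  print(fin.tell())   # 255
  data = fin.read()   # read to the end of the file: b'\xff'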

print(data[0])  # 255
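
The example above relies on an existing 256-byte file. A self-contained sketch, assuming a scratch file named bfile in a temporary directory, creates that file first and also shows seeking relative to the end (origin=2). The os.SEEK_SET, os.SEEK_CUR and os.SEEK_END constants are the named equivalents of 0, 1 and 2.

```python
import os
import tempfile

# Hypothetical scratch file holding the byte values 0 through 255.
path = os.path.join(tempfile.mkdtemp(), 'bfile')
with open(path, 'wb') as fout:
    fout.write(bytes(range(256)))

with open(path, 'rb') as fin:
    start = fin.tell()              # 0 right after opening

    fin.seek(-1, os.SEEK_END)       # one byte before the end (origin=2)
    pos_from_end = fin.tell()       # 255
    last = fin.read()               # b'\xff'

    fin.seek(254, os.SEEK_SET)      # absolute offset 254 (origin=0)
    fin.seek(1, os.SEEK_CUR)        # one byte forward from there (origin=1)
    pos_relative = fin.tell()       # 255

print(start, pos_from_end, pos_relative, last)
```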
