25.3 - Regex Data Extraction and Data Loading

Author: BeautifulSoulpy | Published 2019-10-21 18:21


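No one can carry you across but yourself; however much others care, they cannot do it for you.
Sometimes we have no choice but to be strong, and so, by pretending to be strong, we really do grow stronger.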

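Summary:

Built-in functions are very useful, and they are reasonably efficient too.

When you meet a text file, process it line by line.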

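Extracting log fields with a regular expression

Build a regular expression that captures the fields we need, then adapt the extract function, names, and ops to it.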

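import datetime

names = ('remote', 'datetime', 'method', 'url', 'protocol', 'status', 'length', 'useragent')
# One converter per captured group, in the same order as names; None means "keep the string"
ops = (None, lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
       None, None, None, int, int, None)
# Raw string, so that \d, \[ and \w reach the regex engine unchanged
pattern = r'''([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''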
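The article does not show the reworked extract for this positional version. The following is a minimal sketch of how names, ops, and the match groups could be combined. The function name extract_positional and the variable sample are assumptions made for illustration; the log line is the sample line used later in this article.

```python
import datetime
import re

names = ('remote', 'datetime', 'method', 'url', 'protocol', 'status', 'length', 'useragent')
ops = (None,
       lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
       None, None, None, int, int, None)
pattern = r'''([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
regex = re.compile(pattern)


def extract_positional(line: str):
    """Hypothetical helper: pair each captured group with its name and converter."""
    matcher = regex.match(line)
    if matcher is None:
        return None  # the line does not look like an access-log entry
    # zip walks names, ops and the captured groups in lockstep
    return {name: (op(value) if op else value)
            for name, op, value in zip(names, ops, matcher.groups())}


sample = ('183.69.210.164 - - [07/Apr/2017:09:32:40 +0800] '
          '"GET /index.php?m=login HTTP/1.1" 200 3661 "-" '
          '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
          'Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"')
record = extract_positional(sample)
print(record['remote'], record['status'], record['length'])
```

The weak point of this layout is that names, ops, and the group order in pattern have to be kept in sync by hand, which is what motivates the named-group version.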

Can we use named groups instead?
Rewriting pattern with named groups lets ops be keyed by group name, so names is no longer needed.

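ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int,
}
# Adjacent raw string literals are joined into one pattern; raw strings keep the
# regex backslashes intact, which a backslash line continuation would not allow.
pattern = (r'(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] '
           r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" '
           r'(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"')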

import re
import datetime

line = '''183.69.210.164 - - [07/Apr/2017:09:32:40 +0800] "GET /index.php?m=login HTTP/1.1" 200 3661 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"'''

# Positional-group version, kept for comparison:
#pattern = r'''([\d.]{7,15}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
pattern = r'(?P<remote>[\d.]{7,15}) - - \[(?P<datetime>[^\[\]]+)\] "(?P<method>[^" ]+) (?P<url>[^" ]+) (?P<protocol>[^" ]+)" (?P<status>\d+) (?P<size>\d+) \S+ "(?P<useragent>[^"]*)"'

regex = re.compile(pattern)

# Converters keyed by group name; groups without an entry stay as strings
ops = {
    'datetime': lambda dstr: datetime.datetime.strptime(dstr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'size': int,
}

def extract(line: str):
    """Return a dict of converted fields, or None if the line does not match."""
    matcher = regex.match(line)
    if matcher:
        return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}

print(extract(line))
#----------------------------------------------------------------
{'remote': '183.69.210.164', 'datetime': datetime.datetime(2017, 4, 7, 9, 32, 40, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800))), 'method': 'GET', 'url': '/index.php?m=login', 'protocol': 'HTTP/1.1', 'status': 200, 'size': 3661, 'useragent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}

Loading the data

When you meet a text file, process it line by line; there is no need to read several lines at once.

def load(filename: str):
    with open(filename, encoding='utf-8') as f:
        for line in f:  # iterating over the file object reads one line at a time
            fields = extract(line)
            yield fields

for x in load('test.log'):
    print(x)
#---------------------------------------
{'remote': '123.125.71.36', 'datetime': datetime.datetime(2017, 4, 6, 18, 9, 25, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800))), 'method': 'GET', 'url': '/', 'protocol': 'HTTP/1.1', 'status': 200, 'size': 8642, 'useragent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'}
{'remote': '112.64.118.97', 'datetime': datetime.datetime(2017, 4, 6, 19, 13, 59, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800))), 'method': 'GET', 'url': '/favicon.ico', 'protocol': 'HTTP/1.1', 'status': 200, 'size': 4101, 'useragent': 'Dalvik/2.1.0 (Linux; U; Android 5.1.1; SM-G9250 Build/LMY47X)'}
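Because load is a generator, a caller can aggregate over a log file without ever holding all records in memory. The sketch below illustrates this by counting status codes. It is only an illustration: the two log lines are made up for the example (the article's test.log is not reproduced here), the temporary file name is arbitrary, and the counting step is an assumption about how the records might be used.

```python
import datetime
import os
import re
import tempfile
from collections import Counter

pattern = (r'(?P<remote>[\d.]{7,15}) - - \[(?P<datetime>[^\[\]]+)\] '
           r'"(?P<method>[^" ]+) (?P<url>[^" ]+) (?P<protocol>[^" ]+)" '
           r'(?P<status>\d+) (?P<size>\d+) \S+ "(?P<useragent>[^"]*)"')
regex = re.compile(pattern)
ops = {
    'datetime': lambda dstr: datetime.datetime.strptime(dstr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'size': int,
}


def extract(line: str):
    matcher = regex.match(line)
    if matcher:
        return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}


def load(filename: str):
    with open(filename, encoding='utf-8') as f:
        for line in f:  # one line at a time; the file is never read whole
            yield extract(line)


# Made-up sample lines in the same layout as the article's log
sample_lines = [
    '10.0.0.1 - - [07/Apr/2017:09:32:40 +0800] "GET /index.php HTTP/1.1" 200 3661 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [07/Apr/2017:09:33:00 +0800] "GET /missing HTTP/1.1" 404 120 "-" "curl/7.0"',
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.log')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sample_lines) + '\n')
    # extract returns None for lines that do not match, so skip those
    status_counts = Counter(rec['status'] for rec in load(path) if rec is not None)

print(status_counts)
```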

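Exception handling

Logs inevitably contain some lines that do not match, and these need to be handled.
The code relies on the match method, which may fail to match, so a check has to be added.
Here extract raises an exception, so that the caller receives it and decides how to handle it.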

def extract(logline: str) -> dict:
    """Return a dict of fields; raise ValueError if the line does not match."""
    matcher = regex.match(logline)
    if matcher:
        return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}
    else:
        raise ValueError('No match. {}'.format(logline))  # or write a log record

def load(filename: str):
    with open(filename, encoding='utf-8') as f:
        for line in f:
            try:
                fields = extract(line)
            except ValueError:
                continue  # TODO: drop lines that fail to parse, or log them
            yield fields

for x in load('test.log'):
    print(x)
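To see the "raise in extract, decide in the caller" idea at work on a bad line, here is a self-contained sketch. It makes several assumptions that are not in the article: the helper load_lines takes any iterable of lines instead of a file name (so it can run without test.log), skipped lines are reported through the standard logging module, and the sample lines are made up.

```python
import datetime
import logging
import re

pattern = (r'(?P<remote>[\d.]{7,15}) - - \[(?P<datetime>[^\[\]]+)\] '
           r'"(?P<method>[^" ]+) (?P<url>[^" ]+) (?P<protocol>[^" ]+)" '
           r'(?P<status>\d+) (?P<size>\d+) \S+ "(?P<useragent>[^"]*)"')
regex = re.compile(pattern)
ops = {
    'datetime': lambda dstr: datetime.datetime.strptime(dstr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'size': int,
}


def extract(logline: str) -> dict:
    """Return a dict of fields; raise ValueError if the line does not match."""
    matcher = regex.match(logline)
    if matcher is None:
        raise ValueError('No match: {!r}'.format(logline))
    return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}


def load_lines(lines):
    """Hypothetical variant of load: yield parsed records, log and skip bad lines."""
    for lineno, logline in enumerate(lines, start=1):
        try:
            fields = extract(logline)
        except ValueError as exc:
            logging.warning('line %d skipped: %s', lineno, exc)
            continue
        yield fields


# Made-up input: two well-formed lines around one malformed line
sample_lines = [
    '10.0.0.1 - - [07/Apr/2017:09:32:40 +0800] "GET /index.php HTTP/1.1" 200 3661 "-" "Mozilla/5.0"',
    'this line is not an access-log entry',
    '10.0.0.2 - - [07/Apr/2017:09:33:00 +0800] "GET /missing HTTP/1.1" 404 120 "-" "curl/7.0"',
]
records = list(load_lines(sample_lines))
print(len(records), [r['status'] for r in records])
```

Catching ValueError in the caller also covers a timestamp that strptime cannot parse, since strptime raises ValueError as well.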