25.3 - Regex Data Extraction and Data Loading


Author: BeautifulSoulpy | Published: 2019-10-21 18:21

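    No one can carry you across but yourself; others, however willing, cannot help.
    Sometimes we have no choice but to be strong, and by pretending to be strong we really do grow stronger.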

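    Summary:

    1. Built-in functions are extremely useful, and they are reasonably efficient;
    2. When you meet a text file, process it one line at a time.

    Extracting log fields with a regular expression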

    Build a regular expression that extracts the required fields, then adapt the extract function, names, and ops to match.

    import datetime

    # Field names, in the same order as the capture groups in pattern
    names = ('remote', 'datetime', 'method', 'url', 'protocol', 'status', 'length', 'useragent')
    # One converter per field; None means the raw string is kept
    ops = (None, lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
           None, None, None, int, int, None)
    pattern = r'''([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
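The adapted extract function itself is not shown at this point. One plausible shape, assuming that a None entry in ops means "keep the string as-is" and that names, ops, and the capture groups line up by position, is sketched below. The function body and the shortened sample line are illustrative assumptions.

```python
import datetime
import re

names = ('remote', 'datetime', 'method', 'url', 'protocol', 'status', 'length', 'useragent')
ops = (None, lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
       None, None, None, int, int, None)
pattern = r'''([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
regex = re.compile(pattern)


def extract(line: str):
    """Pair each positional group with its name and converter (assumed design)."""
    matcher = regex.match(line)
    if matcher is None:
        return None
    return {name: (op(value) if op else value)
            for name, op, value in zip(names, ops, matcher.groups())}


# Shortened sample line in the same format as the log line used in this article
sample = ('183.69.210.164 - - [07/Apr/2017:09:32:40 +0800] '
          '"GET /index.php?m=login HTTP/1.1" 200 3661 "-" "Mozilla/5.0"')
record = extract(sample)
print(record['remote'], record['status'], record['length'])
```

The weakness of this layout is that names, ops, and the groups must be kept in the same order by hand, which is what motivates the named-group version.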
    

    Can named groups be used instead?
    Rewriting pattern with named groups lets ops be keyed by group name, so names is no longer needed.

    # Converters keyed by group name; fields without an entry stay strings
    ops = {
        'datetime': lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
        'status': int,
        'length': int
    }
    # Adjacent raw strings are concatenated, so no backslash continuation is needed
    pattern = (r'(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] '
               r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" '
               r'(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"')
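The datetime converter in ops depends on how strptime reads the timestamp. The small check below, which uses the timestamp from the sample log line in this article, suggests that the %z directive yields a timezone-aware value. Note that %b matches month abbreviations of the current locale, so this sketch assumes an English or C locale.

```python
import datetime

timestr = '07/Apr/2017:09:32:40 +0800'
parsed = datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')

# %z turns "+0800" into a fixed-offset tzinfo, so the result is timezone-aware
print(parsed.isoformat())
print(parsed.utcoffset())

# Aware datetimes can be normalised to UTC, e.g. to compare logs across servers
as_utc = parsed.astimezone(datetime.timezone.utc)
print(as_utc.isoformat())
```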
    
    import re
    import datetime

    line = '''183.69.210.164 - - [07/Apr/2017:09:32:40 +0800] "GET /index.php?m=login HTTP/1.1" 200 3661 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"'''

    # Earlier positional version, kept for reference:
    # pattern = r'''([\d.]{7,15}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
    pattern = (r'(?P<remote>[\d.]{7,15}) - - \[(?P<datetime>[^\[\]]+)\] '
               r'"(?P<method>[^" ]+) (?P<url>[^" ]+) (?P<protocol>[^" ]+)" '
               r'(?P<status>\d+) (?P<size>\d+) \S+ "(?P<useragent>[^"]*)"')

    regex = re.compile(pattern)

    # Converters keyed by group name; fields without an entry stay strings
    ops = {
        'datetime': lambda dstr: datetime.datetime.strptime(dstr, '%d/%b/%Y:%H:%M:%S %z'),
        'status': int,
        'size': int
    }

    def extract(line: str):
        """Return a dict of converted fields, or None if the line does not match."""
        matcher = regex.match(line)
        if matcher:
            return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}

    print(extract(line))
    #----------------------------------------------------------------
    {'remote': '183.69.210.164', 'datetime': datetime.datetime(2017, 4, 7, 9, 32, 40, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800))), 'method': 'GET', 'url': '/index.php?m=login', 'protocol': 'HTTP/1.1', 'status': 200, 'size': 3661, 'useragent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    
    
    Loading the data

    Process a text file one line at a time; there is no need to read several lines at once.

    def load(filename: str):
        """Lazily yield one parsed record per line of the log file."""
        with open(filename, encoding='utf-8') as f:
            for line in f:
                fields = extract(line)  # None when the line does not match
                yield fields

    for x in load('test.log'):
        print(x)
    #---------------------------------------
    {'remote': '123.125.71.36', 'datetime': datetime.datetime(2017, 4, 6, 18, 9, 25, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800))), 'method': 'GET', 'url': '/', 'protocol': 'HTTP/1.1', 'status': 200, 'size': 8642, 'useragent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'}
    {'remote': '112.64.118.97', 'datetime': datetime.datetime(2017, 4, 6, 19, 13, 59, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800))), 'method': 'GET', 'url': '/favicon.ico', 'protocol': 'HTTP/1.1', 'status': 200, 'size': 4101, 'useragent': 'Dalvik/2.1.0 (Linux; U; Android 5.1.1; SM-G9250 Build/LMY47X)'}
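Because load is a generator, records can be consumed as they are produced, without holding the whole file in memory. A rough sketch of that idea follows: it counts status codes with collections.Counter. The temporary file, the two shortened sample lines, and the counting step are assumptions added for illustration; only the log format is taken from this article.

```python
import collections
import datetime
import os
import re
import tempfile

regex = re.compile(
    r'(?P<remote>[\d.]{7,15}) - - \[(?P<datetime>[^\[\]]+)\] '
    r'"(?P<method>[^" ]+) (?P<url>[^" ]+) (?P<protocol>[^" ]+)" '
    r'(?P<status>\d+) (?P<size>\d+) \S+ "(?P<useragent>[^"]*)"')
ops = {
    'datetime': lambda s: datetime.datetime.strptime(s, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'size': int,
}


def extract(line: str):
    matcher = regex.match(line)
    if matcher:
        return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}


def load(filename: str):
    with open(filename, encoding='utf-8') as f:
        for line in f:
            yield extract(line)


# Made-up sample data in the article's log format, written to a temporary file
sample = (
    '123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] "GET / HTTP/1.1" 200 8642 "-" "Baiduspider/2.0"\n'
    '112.64.118.97 - - [06/Apr/2017:19:13:59 +0800] "GET /favicon.ico HTTP/1.1" 404 0 "-" "Dalvik/2.1.0"\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False, encoding='utf-8') as tmp:
    tmp.write(sample)

try:
    # Records stream through the Counter one at a time; unmatched lines (None) are skipped
    status_counts = collections.Counter(rec['status'] for rec in load(tmp.name) if rec)
finally:
    os.remove(tmp.name)

print(status_counts)
```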
    
    
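    Exception handling

    A log will inevitably contain some lines that do not match, and they have to be handled.
    The match method used here may fail to match, so a check must be added.
    The approach taken is to raise an exception, so that the caller receives it and decides how to handle it.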

    def extract(logline: str) -> dict:
        """Return a dict of fields; raise an exception if the line does not match."""
        matcher = regex.match(logline)
        if matcher:
            return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}
        else:
            raise Exception('No match. {}'.format(logline))  # or write a log record

    def load(filename: str):
        with open(filename, encoding='utf-8') as f:
            for line in f:
                try:
                    fields = extract(line)
                except Exception:
                    continue  # TODO: discard lines that fail to parse, or log them
                yield fields

    for x in load('test.log'):
        print(x)
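The TODO leaves open whether failed lines are dropped silently or logged. One possible way to log them, sketched here with the standard logging module, is to catch the exception in the loader and record the line number. The ParseError class and the parse_lines helper (which takes any iterable of lines rather than a filename, to keep the sketch self-contained) are hypothetical names, and the sample lines are made up in the article's log format.

```python
import datetime
import logging
import re

logger = logging.getLogger('logparse')

regex = re.compile(
    r'(?P<remote>[\d.]{7,15}) - - \[(?P<datetime>[^\[\]]+)\] '
    r'"(?P<method>[^" ]+) (?P<url>[^" ]+) (?P<protocol>[^" ]+)" '
    r'(?P<status>\d+) (?P<size>\d+) \S+ "(?P<useragent>[^"]*)"')
ops = {
    'datetime': lambda s: datetime.datetime.strptime(s, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'size': int,
}


class ParseError(ValueError):
    """Hypothetical exception type raised when a line does not match."""


def extract(logline: str) -> dict:
    matcher = regex.match(logline)
    if matcher is None:
        raise ParseError('No match. {!r}'.format(logline))
    return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}


def parse_lines(lines):
    """Yield parsed records; log and skip lines that fail to parse."""
    for lineno, line in enumerate(lines, start=1):
        try:
            fields = extract(line)
        except ParseError as exc:
            logger.warning('line %d skipped: %s', lineno, exc)
            continue
        yield fields


lines = [
    '123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] "GET / HTTP/1.1" 200 8642 "-" "Baiduspider/2.0"\n',
    'this line is not an access-log entry\n',
    '112.64.118.97 - - [06/Apr/2017:19:13:59 +0800] "GET /favicon.ico HTTP/1.1" 404 0 "-" "Dalvik/2.1.0"\n',
]
records = list(parse_lines(lines))
print(len(records), [r['status'] for r in records])
```

Using a dedicated exception type means unrelated errors, such as a bug in a converter, are not swallowed along with the unmatched lines.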
    
    
