Python Daily: The Hard-to-Chew Paper

Author: JianlingZhou | Published: 2017-02-10 21:11


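The Story

I have been reading papers these past two days: dense English, packed with jargon, and it gives me a headache. Google Translate helps with comprehension (its translations are actually quite good), but I ran into a small problem when using it, described below.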

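When text is selected in the PDF version of the paper, it looks like this:

Figure: text selected in the paper

Pasted into Notepad, the same text shows two problems:

Many words are split in two, with the halves joined by a '-'.
The text is full of hard line breaks. In other words, the raw characters contain many '\n', and when the text is pasted straight into Google Translate, everything before each '\n' is treated as a separate sentence.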

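That makes the translation quality poor, and the broken lines have to be joined by hand, one at a time. The ideal result looks like this:

Figure: the ideal result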
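
As a quick text-only illustration of that target, here is a minimal Python 3 sketch (the script at the end of this post targets Python 2.7). The sample text and the function name join_copied_text are made up for illustration, and the two-regex approach is just one simple way to express the goal, not the script described later.

```python
import re

# Made-up sample imitating text copied out of a PDF.
copied = (
    "Deep neural networks have shown re-\n"
    "markable results on image cap-\n"
    "tioning tasks.\n"
)


def join_copied_text(text):
    """Undo hard line breaks in text copied from a PDF."""
    # A '-' directly before a line break is treated as a split word.
    text = re.sub(r"-\n", "", text)
    # Every remaining line break becomes a single space.
    return re.sub(r"\s*\n\s*", " ", text).strip()


# The fragments a translator would see as separate sentences:
print(copied.split("\n"))
# The joined text:
print(join_copied_text(copied))
```

One caveat with this simple rule: a genuinely hyphenated compound that happens to wrap at its hyphen loses the hyphen too.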

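The paper probably has several hundred lines, close to a thousand. Pressing Backspace line by line is both tedious and dumb, which no programmer should put up with, so I decided to write a Python script to do it.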

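Getting Started

The idea is simple:

First, find a PDF processing library and extract the text.
Then edit the strings to join the broken sentences back together.
Finally, translate the resulting text by calling the Google API, or by simulating a browser visit to http://translate.google.cn and fetching the result.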

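I first searched GitHub for: python pdf process.

The first result had too many dependencies, and nothing else in the list looked good either. Suspecting my keywords were off, I switched to: python pdf extract. That led me to the slate library. I then found that slate is built entirely on top of pdfminer, so I simply used pdfminer directly.

I checked Douban's PyPI mirror in China, confirmed that pdfminer was available, and installed it with pip. Because of a certificate problem, I ended up using the Tsinghua mirror instead. The install command: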

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfminer

In the pdfminer repo I found a script, pdf2txt.py, whose job is to extract text from PDFs. Delighted, I used it as-is. Step one done.
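
For readers who want to run pdf2txt.py unmodified, the following Python 3 sketch builds the command line for it. The -o (output file) and -t (output type) options come from the script's own usage string, which is quoted in the source code at the end of this post; the helper name build_pdf2txt_command, the file names, and the assumption that pdf2txt.py sits in the current directory are all hypothetical.

```python
import sys


def build_pdf2txt_command(pdf_path, txt_path, script="pdf2txt.py"):
    """Return the argv list for running pdf2txt.py on one PDF.

    `script` is a hypothetical local path to the pdf2txt.py file
    taken from the pdfminer repository.
    """
    # -t selects the output type (text|html|xml|tag), -o the output file.
    return [sys.executable, script, "-t", "text", "-o", txt_path, pdf_path]


cmd = build_pdf2txt_command("paper.pdf", "paper.txt")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.check_call(cmd)
```

Building the command as a list rather than a single string avoids quoting problems with paths that contain spaces.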

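Post-processing the Text

pdf2txt.py wrote the extracted text, still riddled with line breaks, to a .txt file.

After some thought, I settled on the following processing logic:

Keep only the text from the Abstract heading onward.
If a line contains nothing but digits and spaces, drop it (such lines are page numbers or footers).
Strip the trailing '\n' from every line.
For a line that contains only '\n', add one extra '\n' to the new text.
After stripping the newline, if the line ends with '-', delete the hyphen; otherwise append a space so adjacent words stay separated.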
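
The rules above can be sketched on an in-memory list of lines as follows. This is a Python 3 illustration, not the author's script (which appears at the end of the post and targets Python 2.7); the names is_page_number and reflow, the start_marker parameter, and the sample lines are assumptions made for the example.

```python
def is_page_number(line):
    """True for a non-empty line made only of digits and spaces."""
    body = line.rstrip("\n")
    return body != "" and all(c in " 0123456789" for c in body)


def reflow(lines, start_marker="Abstract"):
    """Join hard-wrapped lines, starting at the start_marker line."""
    out = []
    recording = False
    for line in lines:
        body = line.rstrip("\n")
        if body == start_marker:
            recording = True
        if not recording or is_page_number(line):
            continue
        if body == "":
            out.append("\n\n")      # blank line: keep a paragraph break
        elif body.endswith("-"):
            out.append(body[:-1])   # split word: drop the hyphen, no space
        else:
            out.append(body + " ")  # ordinary line: join with a space
    return "".join(out)


# Made-up sample imitating extracted text.
sample = [
    "Title of the Paper\n",
    "Abstract\n",
    "We propose a new cap-\n",
    "tioning model.\n",
    "\n",
    "1\n",
    "It works well.\n",
]
print(repr(reflow(sample)))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, which matters little for one paper but keeps the function cheap on long inputs.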

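Writing and testing as I went, I got step two mostly working. Calling Google Translate at the end gave me some trouble: when I parsed the returned HTML, the text had not been translated, and I could not tell why. But that last step no longer mattered much. Time is precious, so I stopped there.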

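Source Code and Results

The result after running the script:

Figure: the new text

Figure: the command line

Figure: the Google Translate result

Finally, the source code: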

import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from pdfminer.image import ImageWriter



"""
        print ('usage: %s [-d] [-p pagenos] [-m maxpages] [-P password] [-o output]'
               ' [-C] [-n] [-A] [-V] [-M char_margin] [-L line_margin] [-W word_margin]'
               ' [-F boxes_flow] [-Y layout_mode] [-O output_dir] [-R rotation] [-S]'
               ' [-t text|html|xml|tag] [-c codec] [-s scale]'
               ' file ...' % argv[0])
"""
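# Adapted from pdfminer's pdf2txt.py. The usage string above lists the
# command-line options that are replaced here by the fixed defaults below.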
def extract_text(filename, password_param, output_file):

    # debug option
    debug = 0
    # input option
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outfile = None
    outtype = None
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()

    if filename.strip()[-4:] != '.pdf':
        print 'file type is not pdf!'
        return
    elif output_file is not None:
        outfile = output_file
    else:
        outfile = filename.strip()[:-4] + '.txt'
        print 'output file path: %s' % outfile
    if password_param is not None:
        password = password_param

    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug
    #
    rsrcmgr = PDFResourceManager(caching=caching)
    if not outtype:
        outtype = 'text'
        if outfile:
            if outfile.endswith('.htm') or outfile.endswith('.html'):
                outtype = 'html'
            elif outfile.endswith('.xml'):
                outtype = 'xml'
            elif outfile.endswith('.tag'):
                outtype = 'tag'
    if outfile:
        outfp = file(outfile, 'w')
    else:
        outfp = sys.stdout
    if outtype == 'text':
        device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                               imagewriter=imagewriter)
    elif outtype == 'xml':
        device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                              imagewriter=imagewriter,
                              stripcontrol=stripcontrol)
    elif outtype == 'html':
        device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale,
                               layoutmode=layoutmode, laparams=laparams,
                               imagewriter=imagewriter, debug=debug)
    elif outtype == 'tag':
        device = TagExtractor(rsrcmgr, outfp, codec=codec)
    else:
        return
    fname = filename
    fp = file(fname, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    print 'extracting text in pdf ... ...'
    page_cnt = 1
    for page in PDFPage.get_pages(fp, pagenos,
                                maxpages=maxpages, password=password,
                                caching=caching, check_extractable=True):
        page.rotate = (page.rotate+rotation) % 360
        print 'processing page %d ...' % page_cnt
        interpreter.process_page(page)
        page_cnt += 1
    fp.close()
    device.close()
    outfp.close()
    print 'text has been written into %s ' % outfile
    return outfile


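def check_line_valid(line):
    # A line is invalid only if it is non-empty and contains nothing but
    # digits and spaces (e.g. '1 '): such lines are page numbers or footers.
    line = line.rstrip('\n')  # also safe when the last line has no newline
    if line == '':
        return True
    for c in line:
        if c not in ' 0123456789':
            return True
    return False


def process_line(line):
    if line == '\n':
        # Blank line: add a second newline to keep a paragraph break.
        return line + '\n'
    line = line.rstrip('\n')
    if line.endswith('-'):
        line = line[:-1]  # word split across lines: drop the hyphen
    else:
        line += ' '       # ordinary line: join to the next with a space
    return line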

"""
"""
def reformat_output_file(outfile):
    # Join the hard-wrapped lines of the extracted text, starting at the
    # 'Abstract' line, and write the result next to the input file.
    text_reformated = ''
    recording = False
    with open(outfile) as file_handler:
        for line in file_handler:
            if line == 'Abstract\n':
                recording = True
            if recording and check_line_valid(line):
                text_reformated += process_line(line)
    print '%s has been reformated.' % outfile
    file_reformated_name = outfile[:-4] + '.reformated.txt'
    # The with block closes the file, so the text is flushed to disk.
    with open(file_reformated_name, 'w') as file_handler:
        file_handler.write(text_reformated)
    return text_reformated

of = extract_text('H://Hendricks_Deep_Compositional_Captioning_CVPR_2016_paper.pdf', '', None)
tr = reformat_output_file(of)

Notes on the code:

Dependency: pdfminer
Environment: Windows 10, Python 2.7
The reformatted text is written to a .txt file in the same directory as the source PDF.
