Reading PDF text with Python

Author: Py_Explorer | Published: 2018-09-03 15:54
# Python 3 version; requires the pdfminer.six package.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_2_text(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    # With a text buffer as the output, the converter writes str directly,
    # so the codec argument is not needed.
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        # No page numbers given, so every page is processed.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


# Path to the PDF
text = convert_pdf_2_text('http.pdf')
# Save the extracted text as 1.txt
with open('1.txt', 'w', encoding='utf-8') as out:
    out.write(text)
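
The string returned by convert_pdf_2_text holds all pages joined together. As far as I know, pdfminer's TextConverter writes a form feed character ('\f') after each page, so the text can be split back into pages afterwards. The sketch below assumes that behavior; split_pages is a hypothetical helper name, not part of pdfminer, and the sample string only stands in for real extracted text.

```python
def split_pages(text):
    """Split extracted text into per-page strings on form feed characters."""
    pages = text.split('\f')
    # A form feed after the last page leaves an empty final piece;
    # drop it so it is not counted as an extra page.
    if pages and pages[-1].strip() == '':
        pages.pop()
    return [page.strip() for page in pages]


if __name__ == '__main__':
    # Stand-in for the output of convert_pdf_2_text (assumed shape).
    sample = 'first page\n\fsecond page\n\f'
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, repr(page))
```

If the PDF itself contains form feed characters inside a page, this will over-split, so treat the page count as approximate.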

Article link: https://www.haomeiwen.com/subject/wwfiwftx.html