Reading text from electronic invoices with pdfminer.six (based on Python 3.7).
First, install pdfminer.six:
pip install pdfminer.six==20201018
Once the installation succeeds, the following code can be used as a reference for reading the text:
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def readPdf2(path):
    # Shared resource cache (fonts, images) used while interpreting pages
    rsrcmgr = PDFResourceManager()
    # In-memory buffer that collects the extracted text
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        # An empty set of page numbers means "process every page"
        for page in PDFPage.get_pages(fp, set()):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

text = readPdf2(r"C:\test.pdf")
print(text)
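The string returned by readPdf2 usually contains many blank lines, and as far as I know TextConverter writes a form feed character ("\f") at the end of each page. Below is a minimal sketch of how that text could be tidied up before looking for invoice fields. The helper name split_pages and the sample string are made up for illustration; they are not part of pdfminer.six and do not come from a real invoice.

```python
def split_pages(text):
    """Split extracted text into pages, keeping only non-empty, stripped lines.

    Assumes pages are separated by a form feed ("\f"), which is how
    TextConverter appears to mark page ends.
    """
    pages = []
    for raw_page in text.split("\f"):
        lines = [line.strip() for line in raw_page.splitlines() if line.strip()]
        if lines:  # skip the empty chunk left after the final form feed
            pages.append(lines)
    return pages


# Made-up sample that imitates the shape of the extracted text
sample = "Invoice No: 12345\n\n  Total: 100.00  \n\fPage two line\n\n\f"
for number, lines in enumerate(split_pages(sample), start=1):
    print(number, lines)
```

With real output, split_pages(readPdf2(path)) would give one list of lines per page, which is easier to search than the raw string.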