Reading PDF Content and Running OCR on Images with Python

Author: Chting | Published 2023-04-05 15:45


You can use the PyPDF2 or pdfminer library in Python to read a PDF file and extract its text content. Here is an example using PyPDF2:

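import PyPDF2

# Open the PDF in binary mode; the with-block closes the file when done.
with open('example.pdf', 'rb') as pdf_file:
    # PdfReader, .pages and extract_text() are the current PyPDF2 names;
    # PdfFileReader, getNumPages, getPage and extractText were removed in 3.0.
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    text = ''
    for page_obj in pdf_reader.pages:
        # Pages without a text layer may yield an empty result.
        text += page_obj.extract_text() or ''

print(text)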

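The code above opens the PDF file named example.pdf, loops over every page, extracts the text of each page with the page's text-extraction method, and appends it to the variable text.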
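pdfminer is mentioned above as an alternative but not shown, so here is a minimal sketch of it. It assumes the pdfminer.six package, whose pdfminer.high_level.extract_text() returns the text of the whole document, with a form feed character after each page. The helper names extract_pages_pdfminer and split_pages are illustrative only and do not belong to any library.

```python
def split_pages(raw_text):
    """Split extracted text into per-page strings.

    Assumes pages are terminated by a form feed ("\f"), which is how
    pdfminer.six's plain-text output marks page boundaries.
    """
    pages = [chunk.strip() for chunk in raw_text.split("\f")]
    # The text ends with a form feed, which leaves one empty trailing chunk.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_pages_pdfminer(pdf_path):
    """Return a list with the text of each page of pdf_path."""
    # Imported here so split_pages() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(pdf_path))


if __name__ == "__main__":
    # Hypothetical file name, matching the examples in this post.
    for number, page_text in enumerate(extract_pages_pdfminer("example.pdf"), 1):
        print(f"--- page {number} ---")
        print(page_text)
```

Keeping the pages as separate strings makes it easier to spot pages that came back empty, which is usually a sign that they contain only images.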

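If the text in a PDF is stored as images (for example, scanned pages), text extraction finds nothing in them, and you need OCR (optical character recognition) to turn them into text. An OCR library such as pytesseract can do this from Python. Note that pytesseract is only a wrapper: the Tesseract engine itself must be installed separately. Here is an example: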

import io

import PyPDF2
import pytesseract
from PIL import Image

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    text = ''
    for page_obj in pdf_reader.pages:
        # page_obj.images (available in PyPDF2 3.x) walks the page's /XObject
        # resources and returns every image as encoded file bytes, so Pillow
        # can open it whatever its color mode or compression.
        for image_file in page_obj.images:
            image = Image.open(io.BytesIO(image_file.data))
            # Run Tesseract on the image and collect the recognized text.
            text += pytesseract.image_to_string(image)

print(text)

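The code above goes through the images embedded in each page (a PDF stores them as /XObject resources with the subtype /Image), converts each one into a PIL Image object, and then uses pytesseract to convert it to text.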
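The two approaches above can be combined: read the text layer first, and run OCR only on pages that have almost none. The following is a rough sketch, not a definitive implementation. The names needs_ocr and read_pdf, and the 10-character threshold, are assumptions made for illustration. It also assumes a PyPDF2 release that provides page.images (3.x does) and an installed Tesseract engine.

```python
def needs_ocr(page_text, min_chars=10):
    """Heuristic: treat a page as scanned if it has almost no text layer.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return len((page_text or "").strip()) < min_chars


def read_pdf(pdf_path, lang="eng"):
    """Return the text of pdf_path, running OCR only on pages that need it."""
    # Imported here so needs_ocr() can be used without these packages installed.
    import io

    import PyPDF2
    import pytesseract
    from PIL import Image

    parts = []
    with open(pdf_path, "rb") as pdf_file:
        for page in PyPDF2.PdfReader(pdf_file).pages:
            page_text = page.extract_text() or ""
            if needs_ocr(page_text):
                # No usable text layer: OCR every image embedded in the page.
                for image_file in page.images:
                    image = Image.open(io.BytesIO(image_file.data))
                    page_text += pytesseract.image_to_string(image, lang=lang)
            parts.append(page_text)
    # Separate pages with a newline so words do not run together.
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical file name, matching the examples in this post.
    print(read_pdf("example.pdf"))
```

For documents in Simplified Chinese, calling read_pdf('example.pdf', lang='chi_sim') should work, provided the matching Tesseract language data is installed.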