Python PDF File OCR - OCR on PDF files

Author: YichengYe | Published: 2018-07-06 09:08

Installing Tesseract

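brew install tesseract --all-languages
# Newer Homebrew releases no longer accept formula options such as --all-languages.
# There, the extra language data comes from a separate formula:
# brew install tesseract tesseract-lang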

Installing PyOCR

pip3 install pyocr

Installing Wand and PIL

brew install imagemagick@6
export MAGICK_HOME=/usr/local/opt/imagemagick@6

pip3 install wand pillow

Warming up

from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
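Before running the imports above, it can help to confirm that each piece of the toolchain is actually visible to Python. The following is a minimal sketch, not part of the original post: `check_ocr_environment` is a hypothetical helper name, and it only checks that the three packages are importable and that a `tesseract` executable is on the PATH, which PyOCR's command-line Tesseract tool relies on.

```python
import importlib.util
import shutil


def check_ocr_environment():
    """Report which pieces of the OCR toolchain can be found.

    Hypothetical helper for illustration; returns a dict of name -> bool.
    """
    modules = ("wand", "PIL", "pyocr")
    # find_spec returns None when a top-level package is not installed.
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # The tesseract executable must be reachable on PATH.
    report["tesseract"] = shutil.which("tesseract") is not None
    return report


if __name__ == "__main__":
    for name, found in check_ocr_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If `wand` is reported as present but importing it still fails, the MAGICK_HOME export from the install step is a likely thing to re-check.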

Get Going

tool = pyocr.get_available_tools()[0]  # first available OCR tool, usually Tesseract
lang = tool.get_available_languages()[0]  # first installed language; run `tesseract --list-langs` to see which index you need
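Picking the language by list index is fragile, because the order depends on which language packs are installed. One alternative is to choose by language code with a fallback. This is a sketch under assumptions: `pick_language` is a hypothetical helper, and `"eng"` and `"chi_sim"` are the Tesseract codes for English and Simplified Chinese.

```python
def pick_language(available, preferred=("eng",)):
    """Return the first preferred language code that is installed.

    `available` is the list returned by tool.get_available_languages().
    Hypothetical helper for illustration only.
    """
    for code in preferred:
        if code in available:
            return code
    if not available:
        raise RuntimeError("No OCR languages are installed")
    # Fall back to the first entry, matching the index-based snippet above.
    return available[0]


# Intended use with the `tool` object from above:
# lang = pick_language(tool.get_available_languages(), preferred=("chi_sim", "eng"))
```

This way the script behaves the same on machines with different sets of installed languages.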

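req_image = []   # one JPEG blob per PDF page
final_text = []  # one OCR result string per PDF page

# Rasterize the PDF at 300 DPI; a higher resolution generally helps OCR accuracy.
image_pdf = Image(filename="./PDF_FILE_NAME", resolution=300)
image_jpeg = image_pdf.convert('jpeg')

# Each entry in .sequence is one page; turn it into an in-memory JPEG.
for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

# Run OCR on each page image and collect the recognized text.
for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    final_text.append(txt)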
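The loop above leaves the result as a list with one string per page. A possible next step, not shown in the original post, is to merge the pages and write them to a text file. In this sketch `join_pages` and `save_text` are hypothetical helper names, and the blank-line page separator is an arbitrary choice.

```python
from pathlib import Path


def join_pages(pages, separator="\n\n"):
    """Combine per-page OCR strings into one text, skipping blank pages."""
    cleaned = [page.strip() for page in pages]
    return separator.join(text for text in cleaned if text)


def save_text(pages, out_path):
    """Write the combined text as UTF-8 and return the path written."""
    path = Path(out_path)
    path.write_text(join_pages(pages), encoding="utf-8")
    return path


# Intended use with the list built above (the output name is a placeholder):
# save_text(final_text, "ocr_output.txt")
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the recognized text contains non-ASCII characters.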

Article link: https://www.haomeiwen.com/subject/uycvuftx.html