3. How to extract text from PDFs

Author: BigBigGuy | Published: 2019-01-11 21:07

    Using wand, pillow and tesseract

    Note: the PDF must have a white background; otherwise the text will not be recognized.
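One possible workaround for pages that are not on a white background is to invert dark pages before OCR. The following is only a sketch and not part of the original workflow: `mean_brightness`, `normalize_background`, and the threshold of 128 are hypothetical choices made for illustration, and `im` is assumed to be a Pillow image such as the one returned by `Image.open`. It has not been checked against real scans, so recognition quality may still vary.

```python
def mean_brightness(histogram):
    """Average gray level (0-255) computed from a 256-bin grayscale histogram."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return sum(level * count for level, count in enumerate(histogram)) / total


def normalize_background(im, threshold=128):
    """Return a grayscale copy of `im`, inverted if the page looks dark.

    `im` is assumed to be a PIL.Image.Image; only its convert(),
    histogram() and point() methods are used.
    The threshold of 128 is an arbitrary midpoint, not a tuned value.
    """
    gray = im.convert('L')  # 8-bit grayscale
    if mean_brightness(gray.histogram()) < threshold:
        # Mostly dark page: flip it so the background becomes light
        gray = gray.point(lambda value: 255 - value)
    return gray
```

In the OCR loop further down, the image could be passed through `normalize_background` before it is handed to `pytesseract.image_to_string`.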

    This really just converts the PDF to JPEG images and then runs OCR on them. It simply combines what the previous two posts covered. Easy job!

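    import io  # the one extra library compared with the previous posts
    from PIL import Image
    import pytesseract
    from wand.image import Image as wi
    
    # Read the PDF at 300 DPI and convert its pages to JPEG
    pdf = wi(filename='jun.pdf', resolution=300)
    pdfImg = pdf.convert('jpeg')
    
    imgBlobs = []
    
    # Store each page as an in-memory JPEG blob
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))
    
    extracted_text = []
    
    # OCR each page; lang='chi_sim' needs the Simplified Chinese language data
    for imgBlob in imgBlobs:  # renamed so the loop variable no longer shadows the list
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='chi_sim')
        extracted_text.append(text)
    
    # Print the text of the first page
    print(extracted_text[0])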
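The script above prints only the first page. To keep the text of every page, the per-page strings in `extracted_text` could be joined and written to a file. This is a minimal sketch: `join_pages`, `save_text`, and the page-marker format are hypothetical names and choices introduced here, not something from the original post.

```python
def join_pages(page_texts):
    """Combine per-page OCR strings into one string with page markers."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # strip() drops the leading/trailing whitespace around each page's text
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def save_text(page_texts, path):
    """Write the combined text to `path` as UTF-8 (needed for Chinese output)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(join_pages(page_texts))
```

For example, `save_text(extracted_text, 'jun.txt')` would write all pages to a hypothetical `jun.txt` next to the script.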
    
