Extracting Text from Images (Mac + Python 3.8 + pytesseract)

Author: 乐观的星辰 | Published: 2021-07-02 14:43

    1. Install tesseract on the Mac (via brew)
    a. pip3 install tesseract -- do not do this
    pip3 install tesseract
    Collecting tesseract
    Downloading
    https://files.pythonhosted.org/packages/8d/b7/c4fae9af5842f69d9c45bf1195a94aec090628535c102894552a7a7dbe6c/tesseract-0.1.3.tar.gz (45.6MB)
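    Pitfall: this package does not work with Python 3 as shipped. You have to edit the tesseract source and comment out its print statements; ConfigParser has to be installed separately, and the code that creates the ConfigParser class has to be changed as well. Even after all of that, further module dependencies are still missing.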

    The right way
    brew install tesseract
    Pitfall: the libtiff install fails because the bottle URL is unavailable
    Trying a mirror...
    ==> Downloading https://ghcr.io/v2/homebrew/core/bottles/libtiff-4.2.0.big_sur.bottle.tar.gz
    ==> Downloading from https://github.com/-/v2/packages/container/package/homebrew%2Fcore%2Fbottles%2Flibtiff-4.2.0.big_sur.bottle.tar.gz
    Warning: Transient problem: timeout Will retry in 1 seconds. 3 retries left.
    Warning: Transient problem: timeout Will retry in 2 seconds. 2 retries left.
    Warning: Transient problem: timeout Will retry in 4 seconds. 1 retries left.
    curl: (22) The requested URL returned error: 404

    Switch to a manual install of libtiff:
    Source: https://github.com/vadz/libtiff
    % ./configure
    % make
    % su
    # make install

    After that, running brew install tesseract again goes through smoothly.
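Once brew finishes, it can be worth confirming that the tesseract binary is actually reachable before moving on to the Python side. The following is a minimal sketch, assuming the binary is named `tesseract` and is on PATH; the helper name `tesseract_version` is hypothetical and not part of any library.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        # The binary is not on PATH, so the install did not succeed (or PATH is off).
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr, so look at both streams.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", the brew step above probably did not complete.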

    2. Python packages
    pip3 install Pillow
    pip3 install pytesseract
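A quick way to check that both packages landed in the interpreter you plan to use is to look the modules up without importing them. This is a small sketch; `missing_modules` is a hypothetical helper name, and the default module names `PIL` and `pytesseract` are the ones the test script below imports.

```python
import importlib.util


def missing_modules(names=("PIL", "pytesseract")):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Run it with the same python3 that pip3 installed into; a mismatch between the two is a common reason for a module to show up as missing.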

    3. Download the language data used for recognition
    Source: https://github.com/tesseract-ocr/tessdata
    Download: chi_sim.traineddata
    Save to: /usr/local/Cellar/tesseract/4.0.0/share/tessdata (replace 4.0.0 with your installed version)
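To confirm the language file ended up where Tesseract looks for it, one option is to list the `.traineddata` files in the tessdata directory. The sketch below uses only the standard library; `installed_languages` is a hypothetical helper, and the directory in the example is the path from the step above, which will differ with your installed version.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return sorted language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    # "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in directory.glob("*.traineddata"))


if __name__ == "__main__":
    # Assumed location; adjust the version segment to match your install.
    tessdata = "/usr/local/Cellar/tesseract/4.0.0/share/tessdata"
    langs = installed_languages(tessdata)
    print("chi_sim available:", "chi_sim" in langs)
```

Running `tesseract --list-langs` in a terminal gives the same information straight from Tesseract.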

    4. Test script

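    # -*- coding: utf-8 -*-

    """
    @Project :xxxxx
    @Time : 2021/7/1 4:09 PM
    @Auth : 肖彬
    @File :Image_test_data
    @IDE :PyCharm

    """
    from PIL import Image
    import pytesseract


    def image_to_str(image_path):
        """Print the recognized text and the per-word data for one image."""
        image = Image.open(image_path)
        # Recognize Simplified Chinese text; requires chi_sim.traineddata.
        words = pytesseract.image_to_string(image, lang='chi_sim')
        # Per-word boxes and confidences as TSV; no lang is given here,
        # so Tesseract's default language is used.
        aa = pytesseract.image_to_data(image)
        print(words, aa)


    if __name__ == '__main__':
        image_path_001 = '/Users/xiaobin/Downloads/image_test/a.png'
        image_path_002 = '/Users/xiaobin/Downloads/image_test/b.jpeg'
        image_to_str(image_path_002)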
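The test script prints the raw output of `image_to_data`, which by default is a tab-separated table with one row per detected element and columns that include `conf` and `text`. A rough sketch of turning that string into (word, confidence) pairs follows. The helper `words_with_confidence` is hypothetical, and the sample TSV is made up for illustration; it is not output from the author's images.

```python
import csv
import io


def words_with_confidence(tsv_text, min_conf=0.0):
    """Parse image_to_data TSV output into (text, confidence) pairs."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip rows whose confidence is not numeric
        # Rows for pages, blocks, and lines carry no text and conf = -1.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


if __name__ == "__main__":
    # Made-up sample in the column layout that image_to_data produces.
    sample = (
        "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
        "left\ttop\twidth\theight\tconf\ttext\n"
        "1\t1\t0\t0\t0\t0\t0\t0\t100\t50\t-1\t\n"
        "5\t1\t1\t1\t1\t1\t10\t10\t30\t12\t96\thello\n"
    )
    print(words_with_confidence(sample))
```

In the test script, the string stored in `aa` could be passed to this function in place of the sample, and `min_conf` can be raised to drop low-confidence words.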

        Article link: https://www.haomeiwen.com/subject/fageultx.html