Software installation
- Here I use pytesseract.
- It can be installed with pip; Anaconda is recommended as the Python environment.
General recognition steps
- Read the image file
- Convert the image to a numpy array
- Call pytesseract's image_to_string function
- The full code is as follows
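from PIL import Image
import pytesseract
import numpy as np
import sys

# Path of the image to recognize, taken from the first command-line argument
filename = sys.argv[1]
# Load the image and convert it to a numpy array
img1 = np.array(Image.open(filename))
# Run OCR on the array with the default language
text = pytesseract.image_to_string(img1)
print(text)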
Issues
Chinese character support
- Download chi_sim.traineddata and copy it to /usr/local/share/tessdata/
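- Code for recognizing Chinese characters
img1 = np.array(Image.open(filename))
# lang="chi_sim" selects the Simplified Chinese model downloaded above
text = pytesseract.image_to_string(img1, lang="chi_sim")
print(text)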
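Before passing lang="chi_sim", it can help to confirm that the trained data file really is in the directory mentioned above. The following is a minimal sketch using only the standard library. The helper name `missing_traineddata` is my own and not part of pytesseract, and the default directory is simply the path used in this post, which may differ on other systems.

```python
from pathlib import Path

# Assumption: the directory used in this post; other installs may keep tessdata elsewhere.
DEFAULT_TESSDATA_DIR = "/usr/local/share/tessdata/"


def missing_traineddata(lang, tessdata_dir=DEFAULT_TESSDATA_DIR):
    """Return the language codes in `lang` that have no .traineddata file.

    `lang` uses the same syntax as the `lang` argument of image_to_string,
    for example "chi_sim" or "chi_sim+eng".
    Hypothetical helper for illustration, not part of pytesseract.
    """
    folder = Path(tessdata_dir)
    codes = [code for code in lang.split("+") if code]
    return [code for code in codes if not (folder / f"{code}.traineddata").is_file()]


if __name__ == "__main__":
    missing = missing_traineddata("chi_sim")
    if missing:
        print("Missing trained data for:", ", ".join(missing))
    else:
        print("chi_sim.traineddata found")
```

If the file lives somewhere else, pass that directory as the second argument.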
paddle-ocr
- An OCR model trained with Baidu's Paddle framework
- Supports English, Chinese, Korean, French, and other languages
Installation
## https://pypi.org/project/paddleocr/
pip install "paddleocr>=2.0.1" # version 2.0.1+ is recommended
## paddle itself must be installed as well
pip install paddlepaddle
Recognition code
from paddleocr import PaddleOCR
import sys

# PaddleOCR supports Chinese, English, French, German, Korean and Japanese.
# Set the `lang` parameter to `ch`, `en`, `french`, `german`, `korean` or `japan`
# to select the corresponding language model.
ocr = PaddleOCR(use_angle_cls=True, lang='ch')  # only needs to run once to download the model and load it into memory
img_path = sys.argv[1]  # a single image path; sys.argv[1:] would pass a list instead
result = ocr.ocr(img_path, cls=True)
for line in result:
    print(line)
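Each printed line carries the detected box together with the recognized text and a confidence score. As a rough sketch, assuming the 2.x result layout where each entry looks like `[box, (text, confidence)]` (some releases wrap these entries in one extra list per image, and other major versions may return something different), the text can be pulled out as shown here. `extract_text` and `is_text_line` are hypothetical helpers of my own, not part of paddleocr, and the sample data is made up for illustration.

```python
def is_text_line(item):
    """Return True if item looks like [box, (text, confidence)]."""
    if not isinstance(item, (list, tuple)) or len(item) != 2:
        return False
    payload = item[1]
    return (
        isinstance(payload, (list, tuple))
        and len(payload) == 2
        and isinstance(payload[0], str)
    )


def extract_text(result):
    """Collect (text, confidence) pairs from a flat or per-image nested result."""
    pairs = []
    if result is None:
        return pairs
    for item in result:
        if is_text_line(item):
            text, confidence = item[1]
            pairs.append((text, float(confidence)))
        elif isinstance(item, (list, tuple)):
            # Assumed per-image wrapper: descend one level and keep collecting.
            pairs.extend(extract_text(item))
    return pairs


if __name__ == "__main__":
    # Made-up sample in the assumed layout; real output comes from ocr.ocr(...).
    box = [[0, 0], [10, 0], [10, 5], [0, 5]]
    sample = [[box, ("hello", 0.98)], [box, ("world", 0.91)]]
    for text, confidence in extract_text(sample):
        print(text, confidence)
```

Walking the structure recursively keeps the helper tolerant of both the flat and the nested layout, at the cost of assuming that no other part of the result has the same shape as a text entry.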