Python 3.6: recognizing text in images (OCR)
Environment
conda + python3.6 + jupyter notebook
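Install dependencies (pytesseract is only a Python wrapper; the Tesseract OCR engine itself is installed with yum in the next step)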
pip install pillow
pip install tesseract
pip install pytesseract
For CentOS 7 run the following as root:
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract
yum install tesseract-langpack-deu
After installation, configure the environment variables:
vim /etc/profile
export TESSDATA_PREFIX="/usr/share/tesseract/4/tessdata"
export PATH=$PATH:$TESSDATA_PREFIX
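Once the profile has been reloaded, the configuration can be sanity-checked from Python. The following is a minimal sketch using only the standard library; the helper name check_tesseract_setup is hypothetical and introduced here for illustration, and it assumes the tesseract executable is expected on PATH.

```python
import os
import shutil


def check_tesseract_setup(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    # pytesseract shells out to the `tesseract` executable, so it must be on PATH.
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract executable not found on PATH")
    # TESSDATA_PREFIX should point at the directory holding the language data.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX is not a directory: " + prefix)
    return problems
```

Calling check_tesseract_setup() with no arguments inspects the current process environment; an empty list suggests the basic setup is in place.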
Check which language packs are currently installed: tesseract --list-langs
(python36) [root@centos-7 ~]# tesseract --list-langs
List of available languages (3):
chi_sim
eng
osd
(python36) [root@centos-7 ~]#
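The same check can be scripted, for example to confirm that chi_sim is present before running OCR. This is a sketch under the assumption that the output has the shape shown above (one header line followed by one language code per line); the function names parse_langs and installed_langs are hypothetical.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, e.g. "List of available languages (3):".
    return [line for line in lines if not line.startswith("List of available")]


def installed_langs():
    """Run the tesseract CLI and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr in case the list is printed there
        universal_newlines=True,   # Python 3.6 spelling of text=True
        check=True,
    )
    return parse_langs(result.stdout)
```

With the output shown above, parse_langs would yield chi_sim, eng and osd, so a check such as "chi_sim" in installed_langs() can guard the OCR call.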
If a language you need is missing, download its language data file and copy it into /usr/share/tesseract/4/tessdata.
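The copy step can also be done from Python. The sketch below assumes the downloaded files use the usual .traineddata extension (for example chi_sim.traineddata) and that the script has write permission on the target directory; install_traineddata is a hypothetical helper name.

```python
import os
import shutil

TESSDATA_DIR = "/usr/share/tesseract/4/tessdata"  # path used in this article


def install_traineddata(src_dir, dest_dir=TESSDATA_DIR):
    """Copy every *.traineddata file from src_dir into dest_dir."""
    copied = []
    for name in sorted(os.listdir(src_dir)):
        # Ignore anything that is not a language data file.
        if name.endswith(".traineddata"):
            shutil.copy(os.path.join(src_dir, name), os.path.join(dest_dir, name))
            copied.append(name)
    return copied
```

The function returns the names it copied, which makes it easy to see whether the download directory actually contained any language files.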
Run the code:
# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract
# Everything above is imports; the single line below does the actual OCR
text=pytesseract.image_to_string(Image.open('img1.jpg'),lang='chi_sim') # recognize Simplified Chinese text
#text=pytesseract.image_to_string(Image.open('test.png'),lang='eng') # recognize English letters and Arabic numerals
print(text)
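For images that mix Chinese and English, Tesseract accepts several language codes joined with "+", and pytesseract passes the lang string through unchanged. The sketch below builds such a string from the languages that are actually installed; build_lang and ocr_mixed are hypothetical helper names, and the image path is whatever file you want to recognize.

```python
def build_lang(wanted, available):
    """Join the wanted language codes that are installed with '+'."""
    usable = [code for code in wanted if code in available]
    if not usable:
        raise ValueError("none of the requested languages are installed")
    return "+".join(usable)


def ocr_mixed(image_path, available):
    """OCR one image using Chinese and English together where available."""
    # Imported lazily so build_lang works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    lang = build_lang(["chi_sim", "eng"], available)
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

With the three languages listed earlier, build_lang(["chi_sim", "eng"], ["chi_sim", "eng", "osd"]) produces the string chi_sim+eng.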
Process every image in a directory:
from PIL import Image
import pytesseract
import os
path="/home/imgs"
file_list=os.listdir(path)
# The with-block closes data.txt even if OCR fails part-way through
with open("data.txt","w",encoding="utf-8") as fo:
    for file in file_list:
        text=pytesseract.image_to_string(Image.open(os.path.join(path,file)),lang='chi_sim')
        print(text)
        fo.write(text)
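The loop above will stop with an exception if the directory contains anything that is not an image. A slightly more defensive variant might filter by file extension and skip files that fail. This is only a sketch: the extension list, the "===" separator line and the names image_files and ocr_directory are assumptions made for illustration.

```python
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")


def image_files(path):
    """Return sorted full paths of files in `path` with an image extension."""
    return [
        os.path.join(path, name)
        for name in sorted(os.listdir(path))
        if name.lower().endswith(IMAGE_EXTS)
    ]


def ocr_directory(path, out_path="data.txt", lang="chi_sim"):
    """OCR every image in `path` and write the results to `out_path`."""
    # Imported lazily so image_files works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with open(out_path, "w", encoding="utf-8") as fo:
        for img_path in image_files(path):
            try:
                with Image.open(img_path) as img:
                    text = pytesseract.image_to_string(img, lang=lang)
            except (OSError, RuntimeError) as exc:
                # Unreadable image or a tesseract failure: report and move on.
                print("skipping", img_path, exc)
                continue
            # Label each result so the text can be traced back to its image.
            fo.write("=== " + os.path.basename(img_path) + " ===\n" + text + "\n")
```

Calling ocr_directory("/home/imgs") would write the same data.txt as before, with one labeled section per image.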