Tesseract-OCR: Basic Usage and Training

Author: 再好一点点 | Published: 2018-04-26 13:07


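All of the steps below must be run in the same folder, so cd into that folder first.

Merge the sample files

Open jTessBoxEditor, choose Tools -> Merge TIFF, select all of the sample files, and save the merged file as mh.font.exp0.tif.

Generate the BOX file

Open a command line and switch to the directory that contains mh.font.exp0.tif.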

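1. Run: tesseract mh.font.exp0.tif mh.font.exp0 -l chi_sim batch.nochop makebox

This generates a file named mh.font.exp0.box.

-l chi_sim selects Simplified Chinese. For this to work, the downloaded Chinese language data chi_sim.traineddata must be placed in the jTessBoxEditor/tesseract-ocr/tessdata folder.

If -l chi_sim is omitted from the command above, the English character set is used by default.

[Syntax]: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

lang is the language name, fontname is the font name, and num is a sequence number. Tesseract is strict about this naming format, so follow it exactly.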
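The naming rule can be captured in a small shell helper. This is only a sketch: the function names training_base and makebox_cmd are hypothetical and not part of Tesseract, and the command is printed rather than executed so it can be checked before it is run.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tesseract) illustrating the naming rule.

# Build the base name [lang].[fontname].exp[num].
training_base() {
    printf '%s.%s.exp%s\n' "$1" "$2" "$3"
}

# Print the makebox command for a base name. The optional second argument is
# the language used for the initial recognition pass (for example chi_sim).
makebox_cmd() {
    base=$1
    boot_lang=${2:-}
    if [ -n "$boot_lang" ]; then
        printf 'tesseract %s.tif %s -l %s batch.nochop makebox\n' "$base" "$base" "$boot_lang"
    else
        printf 'tesseract %s.tif %s batch.nochop makebox\n' "$base" "$base"
    fi
}

base=$(training_base mh font 0)
makebox_cmd "$base" chi_sim
```

Deriving both the image name and the output name from one base string keeps the .tif, .box and later .tr files consistent, which is the point of the naming rule.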

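2. Once the box file has been exported, open it in the jTessBoxEditor GUI and correct the recognized characters and their boxes.

3. Define the font properties file

Create a text file named font_properties.txt in the target folder (this is the name passed to -F in step 4), containing: font 0 0 0 0 0

[Syntax]: <fontname> <italic> <bold> <fixed> <serif> <fraktur>

fontname is the font name and must match the fontname part of the training file name (font in mh.font.exp0). italic, bold, fixed (fixed-width), serif and fraktur (German blackletter) are flags, where 1 means the font has that property and 0 means it does not. Set them when fonts need to be distinguished precisely.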
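Writing the file by hand is easy to get wrong when several fonts are involved. A minimal sketch of a helper that appends one validated entry follows; add_font_entry is a hypothetical name, and the file name font_properties.txt is taken from the commands in step 4.

```shell
#!/bin/sh
# Hypothetical helper: append one font entry to a font properties file.
# Arguments: FILE FONTNAME ITALIC BOLD FIXED SERIF FRAKTUR (each flag 0 or 1).
add_font_entry() {
    file=$1
    shift
    if [ "$#" -ne 6 ]; then
        echo "usage: add_font_entry FILE FONTNAME ITALIC BOLD FIXED SERIF FRAKTUR" >&2
        return 1
    fi
    name=$1
    shift
    # Reject anything that is not a 0/1 flag before touching the file.
    for flag in "$@"; do
        case $flag in
            0|1) ;;
            *) echo "flag must be 0 or 1: $flag" >&2; return 1 ;;
        esac
    done
    printf '%s %s %s %s %s %s\n' "$name" "$@" >> "$file"
}

# Call once per font; this appends the line "font 0 0 0 0 0".
add_font_entry font_properties.txt font 0 0 0 0 0
```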

4. Run the following commands in order:

tesseract mh.font.exp0.tif mh.font.exp0 nobatch box.train

unicharset_extractor mh.font.exp0.box

shapeclustering -F font_properties.txt -U unicharset mh.font.exp0.tr

mftraining -F font_properties.txt -U unicharset -O unicharset mh.font.exp0.tr

cntraining mh.font.exp0.tr

Add the prefix mh. to the five generated files (six files in total once the generated one is counted).

5. Finally, run combine_tessdata mh. to generate mh.traineddata.
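Steps 4 and 5 can be chained in one script. The sketch below is an illustration under stated assumptions, not the author's script: run and train are hypothetical helper names, and the list of five files to prefix (unicharset, shapetable, inttemp, pffmtable, normproto) is an assumption, since the original screenshot is missing. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of steps 4 and 5 for one training image.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@" || exit 1   # stop at the first failing step
    fi
}

# train LANG BASE FONT_PROPS
train() {
    lang=$1
    base=$2
    props=$3
    run tesseract "$base.tif" "$base" nobatch box.train
    run unicharset_extractor "$base.box"
    run shapeclustering -F "$props" -U unicharset "$base.tr"
    run mftraining -F "$props" -U unicharset -O unicharset "$base.tr"
    run cntraining "$base.tr"
    # Assumption: these are the five generated files that need the prefix.
    for f in unicharset shapetable inttemp pffmtable normproto; do
        run mv "$f" "$lang.$f"
    done
    run combine_tessdata "$lang."
}

train mh mh.font.exp0 font_properties.txt
```

Printing the commands first makes it easy to confirm that every file name follows the [lang].[fontname].exp[num] pattern before any files are overwritten.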

Finally, place the generated mh.traineddata in the share folder, and it is ready to use.
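As a rough sketch of this last step, the helper below copies the trained data into a tessdata directory and prints a recognition command that selects it with -l. install_lang is a hypothetical name, input.tif and output are placeholder file names, and the target directory is an assumption that depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper: copy LANG.traineddata from the current folder into a
# tessdata directory, then print a recognition command that uses it.
install_lang() {
    lang=$1
    tessdata_dir=$2
    if [ ! -f "$lang.traineddata" ]; then
        echo "missing $lang.traineddata" >&2
        return 1
    fi
    cp "$lang.traineddata" "$tessdata_dir/" || return 1
    # input.tif and output are placeholders for a real image and result name.
    echo "tesseract input.tif output -l $lang"
}

# Example (the path is an assumption; use your installation's tessdata folder):
# install_lang mh /path/to/share/tessdata
```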

Several training sets can also be merged; the author did not write that procedure up and defers to other write-ups for it.

-psm N

Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR.

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.

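If the image contains a single line of characters, add "-psm 7" to the tesseract command; if it contains a single character, add "-psm 10". Choosing the layout value that best matches the image improves accuracy.

For example: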

tesseract mh.font.exp0.tif mh.font.exp0 -l chi_sim -psm 6 batch.nochop makebox

tesseract mh.font.exp0.tif mh.font.exp0 -psm 6 nobatch box.train
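Choosing the value can be wrapped in a small lookup. The sketch below maps a layout keyword to one of the -psm values listed above; psm_for and its keywords (column, block, line, word, char) are hypothetical names chosen for illustration, and the command is printed rather than run.

```shell
#!/bin/sh
# Hypothetical helper mapping a layout keyword to a -psm value from the list.
psm_for() {
    case $1 in
        column) echo 4 ;;    # single column of text of variable sizes
        block)  echo 6 ;;    # single uniform block of text
        line)   echo 7 ;;    # single text line
        word)   echo 8 ;;    # single word
        char)   echo 10 ;;   # single character
        *)      echo 3 ;;    # fully automatic page segmentation (the default)
    esac
}

# Print a box.train command for an image that holds one line of text.
echo "tesseract mh.font.exp0.tif mh.font.exp0 -psm $(psm_for line) nobatch box.train"
```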


Article link: https://www.haomeiwen.com/subject/ahbplftx.html
