Main data structures in Tesseract 4.0
- Page analysis result: PAGE_RES (ccstruct/pageres.h).
- The page analysis result contains a list of block analysis results: BLOCK_RES_LIST.
- Block analysis result: BLOCK_RES (ccstruct/pageres.h).
- The block analysis result contains a list of row analysis results: ROW_RES_LIST.
- Row analysis result: ROW_RES (ccstruct/pageres.h).
- The row analysis result contains a list of word analysis results: WERD_RES_LIST.
- WERD_RES (ccstruct/pageres.h) is a collection of publicly accessible members that gathers information about a word result.
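The nesting described above (page, then blocks, then rows, then words) can be pictured with a simplified model. The sketch below is only an illustration: PageRes, BlockRes, RowRes, WordRes, CollectWords, and CountWords are hypothetical names, std::vector stands in for the Tesseract list types, and the single text field is an assumption. The real classes in ccstruct/pageres.h carry far more state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for WERD_RES, ROW_RES, BLOCK_RES and
// PAGE_RES. They only mirror the containment relationship, nothing else.
struct WordRes {
  std::string text;  // assumed field: the recognized string for this word
};

struct RowRes {
  std::vector<WordRes> words;  // plays the role of WERD_RES_LIST
};

struct BlockRes {
  std::vector<RowRes> rows;  // plays the role of ROW_RES_LIST
};

struct PageRes {
  std::vector<BlockRes> blocks;  // plays the role of BLOCK_RES_LIST
};

// Walks page -> block -> row -> word and collects the word strings in order.
std::vector<std::string> CollectWords(const PageRes& page) {
  std::vector<std::string> out;
  for (const BlockRes& block : page.blocks)
    for (const RowRes& row : block.rows)
      for (const WordRes& word : row.words)
        out.push_back(word.text);
  return out;
}

// Counts the words on the page without copying any strings.
std::size_t CountWords(const PageRes& page) {
  std::size_t n = 0;
  for (const BlockRes& block : page.blocks)
    for (const RowRes& row : block.rows)
      n += row.words.size();
  assert(n == CollectWords(page).size());  // both traversals must agree
  return n;
}
```

Calling CollectWords on a page with two blocks returns the words of the first block followed by the words of the second, which is the same top-down order the result lists are nested in.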
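Source code analysis
Tesseract's text recognition pipeline consists of binarization, segmentation, recognition, and error correction. This article summarizes the first two stages: binarization and preprocessing.
Page layout analysis steps
Binarization
- Algorithm: Otsu
- Call stack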
- main[api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages[api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage[api/baseapi.cpp] ->
- TessBaseAPI::Recognize[api/baseapi.cpp] ->
- TessBaseAPI::FindLines[api/baseapi.cpp] ->
- TessBaseAPI::Threshold[api/baseapi.cpp] ->
- ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] ->
- ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]
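Otsu is a global binarization algorithm. If the image contains shadows and the shadows are uneven, it performs poorly. OCRus uses a local binarization algorithm, Wolf-Jolion, which also gives good results on images with shadows.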
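To make the "global" property concrete, here is a minimal sketch of Otsu's method on an 8-bit grayscale buffer. It is not the code in ccmain/thresholder.cpp; OtsuThreshold and Binarize are hypothetical helper names, and the flat pixel vector is an assumed input format. A single threshold is picked for the whole image by maximizing the between-class variance, which is why uneven shadows hurt the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the Otsu threshold t for an 8-bit grayscale image. Pixels with
// value <= t form the dark class, pixels > t the bright class. The chosen t
// maximizes the between-class variance w0 * w1 * (mu0 - mu1)^2.
int OtsuThreshold(const std::vector<std::uint8_t>& pixels) {
  assert(!pixels.empty());

  // Build the 256-bin gray-level histogram.
  std::vector<long long> hist(256, 0);
  for (std::uint8_t p : pixels) ++hist[p];

  const double total = static_cast<double>(pixels.size());
  double sum_all = 0.0;  // sum of all gray levels in the image
  for (int i = 0; i < 256; ++i) sum_all += i * static_cast<double>(hist[i]);

  double w0 = 0.0;    // number of pixels in the dark class so far
  double sum0 = 0.0;  // sum of gray levels in the dark class so far
  double best_var = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    w0 += static_cast<double>(hist[t]);
    sum0 += t * static_cast<double>(hist[t]);
    if (w0 == 0.0) continue;  // dark class still empty
    const double w1 = total - w0;
    if (w1 == 0.0) break;  // bright class empty from here on
    const double mu0 = sum0 / w0;
    const double mu1 = (sum_all - sum0) / w1;
    const double between = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
    if (between > best_var) {
      best_var = between;
      best_t = t;
    }
  }
  return best_t;
}

// Applies one global threshold: 1 marks dark (text) pixels, 0 background.
std::vector<std::uint8_t> Binarize(const std::vector<std::uint8_t>& pixels,
                                   int t) {
  std::vector<std::uint8_t> out(pixels.size());
  for (std::size_t i = 0; i < pixels.size(); ++i)
    out[i] = pixels[i] <= t ? 1 : 0;
  return out;
}
```

A local method differs only in that the threshold is recomputed per pixel from a window around it, so a shadowed region gets its own cut-off instead of sharing one with the rest of the page.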
Segmentation
Remove vertical lines
This step removes vertical and horizontal lines in the image.
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::FindLines [api/baseapi.cpp] ->
- Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
- Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
- Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
- LineFinder::FindAndRemoveLines [textord/linefind.cpp]
Remove images
This step removes images from the picture.
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::FindLines [api/baseapi.cpp] ->
- Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
- Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
- Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
- ImageFind::FindImages [textord/imagefind.cpp]
I have never gotten this function to work. The image may need to satisfy certain conditions.
Filter connected component
This step generates all the connected components and filters out the noise blobs.
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::FindLines [api/baseapi.cpp] ->
- Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
- Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
- Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
- (i) Textord::find_components [textord/tordmain.cpp] ->
  - extract_edges [textord/edgblob.cpp]: extract outlines and assign outlines to blobs
  - assign_blobs_to_blocks2 [textord/edgblob.cpp]: assign normal, noise, and rejected blobs to TO_BLOCK_LIST for further blob filtering
  - Textord::filter_blobs [textord/tordmain.cpp] -> Textord::filter_noise_blobs [textord/tordmain.cpp]: move small blobs to a separate list
- (ii) ColumnFinder::SetupAndFilterNoise [textord/colfind.cpp]
This step produces intermediate results; see http://blog.csdn.net/kaelsass/article/details/46874627
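The idea of this step, finding connected components and moving the small ones to a separate noise list, can be sketched in a few lines. This is not Tesseract's implementation: Blob, FindBlobs, and SplitNoise are hypothetical names, the image is assumed to be a rectangular grid of 0/1 values, and the size rule (bounding box smaller than min_size in both dimensions) is an assumption chosen for illustration rather than the criterion filter_noise_blobs actually applies.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Bounding box and pixel count of one connected component (a "blob").
struct Blob {
  int left, top, right, bottom;  // inclusive pixel coordinates
  int area;                      // number of foreground pixels
  int width() const { return right - left + 1; }
  int height() const { return bottom - top + 1; }
};

// Finds 8-connected components of non-zero pixels with a breadth-first search.
std::vector<Blob> FindBlobs(const std::vector<std::vector<int>>& img) {
  std::vector<Blob> blobs;
  if (img.empty()) return blobs;
  const int h = static_cast<int>(img.size());
  const int w = static_cast<int>(img[0].size());
  for (const std::vector<int>& row : img)
    assert(static_cast<int>(row.size()) == w);  // image must be rectangular

  std::vector<std::vector<bool>> seen(h, std::vector<bool>(w, false));
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      if (img[y][x] == 0 || seen[y][x]) continue;
      Blob b{x, y, x, y, 0};
      std::queue<std::pair<int, int>> q;
      q.push(std::make_pair(x, y));
      seen[y][x] = true;
      while (!q.empty()) {
        const int cx = q.front().first;
        const int cy = q.front().second;
        q.pop();
        ++b.area;
        b.left = std::min(b.left, cx);
        b.right = std::max(b.right, cx);
        b.top = std::min(b.top, cy);
        b.bottom = std::max(b.bottom, cy);
        // Visit the 8 neighbours of the current pixel.
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = cx + dx;
            const int ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            if (img[ny][nx] == 0 || seen[ny][nx]) continue;
            seen[ny][nx] = true;
            q.push(std::make_pair(nx, ny));
          }
        }
      }
      blobs.push_back(b);
    }
  }
  return blobs;
}

// Appends blobs smaller than min_size in both dimensions to `noise` and all
// others to `normal`. The size rule is an illustrative assumption.
void SplitNoise(const std::vector<Blob>& blobs, int min_size,
                std::vector<Blob>* normal, std::vector<Blob>* noise) {
  for (const Blob& b : blobs) {
    if (b.width() < min_size && b.height() < min_size)
      noise->push_back(b);
    else
      normal->push_back(b);
  }
}
```

Keeping the noise blobs in their own list instead of deleting them matches the description above: later stages can still consult them, for example to recover punctuation or diacritics that look like specks.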
Finding candidate tab-stop components
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::FindLines [api/baseapi.cpp] ->
- Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
- Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
- ColumnFinder::FindBlocks [textord/colfind.cpp] ->
- TabFind::FindInitialTabVectors[textord/tabfind.cpp] ->
- TabFind::FindTabBoxes [textord/tabfind.cpp]
This step finds the initial candidate tab-stop connected components (CCs) by a radial search starting at every filtered CC from preprocessing. For results, see http://blog.csdn.net/kaelsass/article/details/46874627
Finding the column layout
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::FindLines [api/baseapi.cpp] ->
- Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
- Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
- ColumnFinder::FindBlocks [textord/colfind.cpp] ->
- ColumnFinder::FindBlocks (beginning at line 369) [textord/colfind.cpp]
This step finds the column layout of the page.
Finding the regions
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::FindLines [api/baseapi.cpp] ->
- Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
- Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
- ColumnFinder::FindBlocks [textord/colfind.cpp]
This step recognizes the different types of blocks.