Tesseract Source Code Analysis (1): Binarization and Page Layout Analysis

Author: RobertY | Published 2017-07-27 13:43 | Read 1022 times

    Main data structures in Tesseract 4.0

    1. Page analysis result: PAGE_RES (ccstruct/pageres.h).
    2. PAGE_RES contains a list of block analysis results: BLOCK_RES_LIST.
    3. Block analysis result: BLOCK_RES (ccstruct/pageres.h).
    4. BLOCK_RES contains a list of row analysis results: ROW_RES_LIST.
    5. Row analysis result: ROW_RES (ccstruct/pageres.h).
    6. ROW_RES contains a list of word analysis results: WERD_RES_LIST.
    7. Word analysis result: WERD_RES (ccstruct/pageres.h), a collection of publicly accessible members that gathers information about a single word result.
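
The nesting listed above (page, then blocks, then rows, then words) can be pictured with a simplified stand-in. The struct names PageRes, BlockRes, RowRes and WordRes and the helpers CountWords and WordAt are hypothetical; the real classes in ccstruct/pageres.h use Tesseract's own list types rather than std::vector and carry many more members.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the classes in ccstruct/pageres.h.
struct WordRes { std::string text; };              // role of WERD_RES
struct RowRes { std::vector<WordRes> words; };     // role of ROW_RES + WERD_RES_LIST
struct BlockRes { std::vector<RowRes> rows; };     // role of BLOCK_RES + ROW_RES_LIST
struct PageRes { std::vector<BlockRes> blocks; };  // role of PAGE_RES + BLOCK_RES_LIST

// Walks page -> block -> row and counts the words found.
std::size_t CountWords(const PageRes& page) {
  std::size_t n = 0;
  for (const BlockRes& block : page.blocks) {
    for (const RowRes& row : block.rows) {
      n += row.words.size();
    }
  }
  return n;
}

// Returns one word by its (block, row, word) position, checking each index.
const WordRes& WordAt(const PageRes& page, std::size_t b, std::size_t r,
                      std::size_t w) {
  assert(b < page.blocks.size());
  assert(r < page.blocks[b].rows.size());
  assert(w < page.blocks[b].rows[r].words.size());
  return page.blocks[b].rows[r].words[w];
}
```

Any traversal of a recognition result follows the same three nested levels, which is why the list types appear in pairs with their element types.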

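    Source code analysis

    Tesseract's text recognition pipeline consists of binarization, segmentation, recognition, and error correction. This article summarizes the first two stages: binarization and the preprocessing (segmentation) steps.

    Page layout analysis steps

    Binarization

    • Algorithm: Otsu
    • Call stack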
      1. main[api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages[api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage[api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize[api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines[api/baseapi.cpp] ->
      6. TessBaseAPI::Threshold[api/baseapi.cpp] ->
      7. ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] ->
      8. ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]

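    Otsu is a global binarization algorithm: it picks a single threshold for the whole image. If the image contains uneven shadows, the result is poor. OCRus instead uses a local binarization algorithm, Wolf-Jolion, which binarizes shadowed images reasonably well.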
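
To make the idea of a global threshold concrete, here is a minimal, self-contained sketch of Otsu thresholding. It is not the code in ccmain/thresholder.cpp; the names OtsuThreshold and Binarize are made up for illustration, the input is assumed to be 8-bit grayscale, and marking dark pixels as 1 is an assumed convention.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the threshold t in [0, 255] that maximizes the between-class
// variance; pixels <= t form one class and pixels > t the other.
int OtsuThreshold(const std::vector<uint8_t>& pixels) {
  assert(!pixels.empty());
  long long hist[256] = {0};
  for (uint8_t p : pixels) ++hist[p];

  const double total = static_cast<double>(pixels.size());
  double sum_all = 0.0;
  for (int i = 0; i < 256; ++i) sum_all += i * static_cast<double>(hist[i]);

  double w0 = 0.0;    // number of pixels in the dark class so far
  double sum0 = 0.0;  // sum of gray values in the dark class so far
  double best_var = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    w0 += static_cast<double>(hist[t]);
    sum0 += t * static_cast<double>(hist[t]);
    if (w0 == 0) continue;            // dark class still empty
    const double w1 = total - w0;
    if (w1 == 0) break;               // bright class empty from here on
    const double mu0 = sum0 / w0;
    const double mu1 = (sum_all - sum0) / w1;
    const double var = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
    if (var > best_var) {
      best_var = var;
      best_t = t;
    }
  }
  return best_t;
}

// Marks dark pixels (value <= t) as 1 and all others as 0.
std::vector<uint8_t> Binarize(const std::vector<uint8_t>& pixels, int t) {
  std::vector<uint8_t> out(pixels.size());
  for (std::size_t i = 0; i < pixels.size(); ++i) out[i] = pixels[i] <= t ? 1 : 0;
  return out;
}
```

Because one threshold is applied everywhere, a page whose shadowed half is darker than the ink on its bright half cannot be separated correctly, which is the weakness that local methods address.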

    Segmentation

    Remove vertical and horizontal lines

    This step removes vertical and horizontal lines from the image.

    • Call stack
      1. main [api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
      6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
      7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
      8. Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
      9. LineFinder::FindAndRemoveLines [textord/linefind.cpp]
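
As a rough illustration of what removing ruling lines means, the sketch below erases every long horizontal or vertical run of ink from a binary image. This is a deliberately naive stand-in, not what LineFinder::FindAndRemoveLines does; the BinaryImage struct, the RemoveLongRuns function and the min_len parameter are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major binary image: 1 = ink, 0 = background.
struct BinaryImage {
  int width;
  int height;
  std::vector<uint8_t> data;
  uint8_t& at(int x, int y) { return data[y * width + x]; }
  uint8_t at(int x, int y) const { return data[y * width + x]; }
};

// Clears every horizontal or vertical run of ink at least min_len pixels long.
// Runs are measured on src, so erasing one line never shortens another.
BinaryImage RemoveLongRuns(const BinaryImage& src, int min_len) {
  assert(min_len > 0);
  assert(src.data.size() == static_cast<std::size_t>(src.width) * src.height);
  BinaryImage out = src;
  for (int y = 0; y < src.height; ++y) {  // horizontal runs
    int run_start = 0;
    for (int x = 0; x <= src.width; ++x) {
      const bool ink = x < src.width && src.at(x, y) != 0;
      if (ink) continue;
      if (x - run_start >= min_len) {
        for (int i = run_start; i < x; ++i) out.at(i, y) = 0;
      }
      run_start = x + 1;
    }
  }
  for (int x = 0; x < src.width; ++x) {  // vertical runs
    int run_start = 0;
    for (int y = 0; y <= src.height; ++y) {
      const bool ink = y < src.height && src.at(x, y) != 0;
      if (ink) continue;
      if (y - run_start >= min_len) {
        for (int i = run_start; i < y; ++i) out.at(x, i) = 0;
      }
      run_start = y + 1;
    }
  }
  return out;
}
```

A run-length test like this would also erase long strokes inside large characters, which hints at why a production line finder needs more careful logic.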

    Remove images

    This step removes image regions from the page.

    • Call stack

      1. main [api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
      6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
      7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
      8. Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
      9. ImageFind::FindImages [textord/imagefind.cpp]

      I have never gotten this function to work. The input image may need to satisfy certain conditions.

    Filter connected component

    This step generates all connected components and filters out the noise blobs.

    • Call stack

      1. main [api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
      6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
      7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
      8. Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
      9. (i) Textord::find_components [textord/tordmain.cpp] ->
      {
          extract_edges [textord/edgblob.cpp] // extract outlines and assign them to blobs
          assign_blobs_to_blocks2 [textord/tordmain.cpp] // sort normal, noise, and rejected blobs into a TO_BLOCK_LIST for the later filtering steps
          Textord::filter_blobs [textord/tordmain.cpp] ->
          Textord::filter_noise_blobs [textord/tordmain.cpp] // move small blobs to a separate list
      }

         (ii) ColumnFinder::SetupAndFilterNoise [textord/colfind.cpp]

      This step produces intermediate results; for examples, see http://blog.csdn.net/kaelsass/article/details/46874627
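
The two ideas in this step, finding connected components and moving small ones aside as noise, can be sketched as follows. This is an illustrative flood fill, not Tesseract's outline-based extract_edges; the Blob struct, FindBlobs, FilterNoise and the min_size rule are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Blob { int left, top, right, bottom, area; };  // inclusive bounding box

// Finds 8-connected components of ink (non-zero) pixels in a row-major image.
std::vector<Blob> FindBlobs(const std::vector<uint8_t>& img, int width,
                            int height) {
  assert(img.size() == static_cast<std::size_t>(width) * height);
  std::vector<uint8_t> seen(img.size(), 0);
  std::vector<Blob> blobs;
  std::vector<std::pair<int, int> > stack;
  for (int y0 = 0; y0 < height; ++y0) {
    for (int x0 = 0; x0 < width; ++x0) {
      if (!img[y0 * width + x0] || seen[y0 * width + x0]) continue;
      Blob b = {x0, y0, x0, y0, 0};
      seen[y0 * width + x0] = 1;
      stack.push_back(std::make_pair(x0, y0));
      while (!stack.empty()) {  // flood fill from the seed pixel
        const int x = stack.back().first;
        const int y = stack.back().second;
        stack.pop_back();
        ++b.area;
        b.left = std::min(b.left, x);
        b.right = std::max(b.right, x);
        b.top = std::min(b.top, y);
        b.bottom = std::max(b.bottom, y);
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
            const int idx = ny * width + nx;
            if (img[idx] && !seen[idx]) {
              seen[idx] = 1;
              stack.push_back(std::make_pair(nx, ny));
            }
          }
        }
      }
      blobs.push_back(b);
    }
  }
  return blobs;
}

// Moves blobs whose bounding box is under min_size in both dimensions to noise.
void FilterNoise(const std::vector<Blob>& blobs, int min_size,
                 std::vector<Blob>* kept, std::vector<Blob>* noise) {
  for (std::size_t i = 0; i < blobs.size(); ++i) {
    const int w = blobs[i].right - blobs[i].left + 1;
    const int h = blobs[i].bottom - blobs[i].top + 1;
    (w < min_size && h < min_size ? noise : kept)->push_back(blobs[i]);
  }
}
```

Keeping the noise blobs in a separate list instead of discarding them mirrors the comment on Textord::filter_noise_blobs in the call stack.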

    Finding candidate tab-stop components

    • Call stack

      1. main [api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
      6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
      7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
      8. ColumnFinder::FindBlocks [textord/colfind.cpp] ->
      9. TabFind::FindInitialTabVectors [textord/tabfind.cpp] ->
      10. TabFind::FindTabBoxes [textord/tabfind.cpp]

      This step finds the initial candidate tab-stop CCs (connected components) by a radial search starting at every filtered CC from preprocessing. For sample results, see http://blog.csdn.net/kaelsass/article/details/46874627
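
A heavily simplified way to think about a left tab-stop candidate is a component with nothing close by on its left. The sketch below applies only that rule; it is not the logic of TabFind::FindTabBoxes, and the Box struct, LeftTabCandidates and the radius parameter are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Box { int left, top, right, bottom; };  // inclusive bounding box

// True when the two boxes share at least one row of pixels.
bool VerticallyOverlaps(const Box& a, const Box& b) {
  return a.top <= b.bottom && b.top <= a.bottom;
}

// Returns the indices of boxes that have no vertically overlapping neighbour
// whose right edge lies within `radius` pixels to the left of their left edge.
std::vector<int> LeftTabCandidates(const std::vector<Box>& boxes, int radius) {
  assert(radius > 0);
  std::vector<int> result;
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    bool blocked = false;
    for (std::size_t j = 0; j < boxes.size() && !blocked; ++j) {
      if (i == j) continue;
      const int gap = boxes[i].left - boxes[j].right;
      blocked = VerticallyOverlaps(boxes[i], boxes[j]) && gap > 0 &&
                gap <= radius;
    }
    if (!blocked) result.push_back(static_cast<int>(i));
  }
  return result;
}
```

On two lines of evenly spaced characters, only the first character of each line survives the test, and those are the components whose left edges could later line up into a tab vector.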

    Finding the column layout

    • Call stack

      1. main [api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
      6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
      7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
      8. ColumnFinder::FindBlocks [textord/colfind.cpp] ->
      9. ColumnFinder::FindBlocks (beginning at line 369) [textord/colfind.cpp]

      This step finds the column layout of the page.

    Finding the regions

    • Call stack

      1. main [api/tesseractmain.cpp] ->
      2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
      3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
      4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
      5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
      6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
      7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
      8. ColumnFinder::FindBlocks [textord/colfind.cpp]

      This step identifies the different types of blocks.
