Training Workflow for a Special-Character Language Pack

Author: RobertY | Published: 2017-08-04 13:52

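    This article describes the training procedure for the old version. For the procedure that applies to the new version, see: http://www.jianshu.com/p/7a2c40dd6560

    Training a special-character language pack for a question bank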

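    A question bank contains many special characters that Tesseract cannot recognize, for example ≤, ≥ and ×. This tutorial trains a new language pack to improve recognition of these characters. The idea is to stand in for each special character with a character Tesseract already knows: either one of similar shape (for example, the ASCII ',' in place of the full-width ','), or a character that never occurs in the question bank.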

    Preparing the training dependencies

    git clone https://github.com/tesseract-ocr/tesseract.git
    git clone https://github.com/tesseract-ocr/langdata.git
    git clone https://github.com/tesseract-ocr/tessdata.git
    cp ./tessdata/eng.traineddata ./tesseract/tessdata
    cp ./tessdata/chi_sim.traineddata ./tesseract/tessdata

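    After downloading the tesseract source, build and install it locally; see the previously published installation guides "Tess4.0 windows编译与使用" (building and using Tess4.0 on Windows) and "tesseract linux 安装与编译" (installing and building tesseract on Linux).

    Adding new fonts

    Open ./training/language-specific.sh and add the Times New Roman entries to CHI_SIM_FONTS. After the change it reads: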

    CHI_SIM_FONTS=( \
        "AR PL UKai CN" \
        "AR PL UMing Patched Light" \
        "AR PL UKai TW" \
        "AR PL UMing TW MBE Light" \
        "AR PL UKai Patched" \
        "Arial Unicode MS" \
        "Arial Unicode MS Bold" \
        "WenQuanYi Zen Hei Medium" \
        "Times New Roman, Bold" \
        "Times New Roman, Bold Italic" \
        "Times New Roman, Italic" \
        "Times New Roman," \
        )
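
    Before training, it can help to confirm that the added fonts are actually installed. The sketch below assumes a font listing has already been captured as text, for instance from `text2image --list_available_fonts --fonts_dir /usr/share/fonts`, with one font name per line and an optional numeric index prefix. That line format is an assumption, and `missing_fonts` is a hypothetical helper, not part of Tesseract.

```python
import re

# The four Times New Roman entries added to CHI_SIM_FONTS above.
REQUIRED_FONTS = [
    "Times New Roman, Bold",
    "Times New Roman, Bold Italic",
    "Times New Roman, Italic",
    "Times New Roman,",
]


def missing_fonts(listing, required=REQUIRED_FONTS):
    """Return the required font names that do not appear in a font listing.

    `listing` holds one font name per line, optionally prefixed by an index
    such as "  3: " (assumed format). A trailing comma is ignored on both
    sides so "Times New Roman," and "Times New Roman" compare equal.
    """
    available = set()
    for line in listing.splitlines():
        name = re.sub(r"^\s*\d+:\s*", "", line).strip()
        if name:
            available.add(name.rstrip(","))
    return [font for font in required if font.rstrip(",") not in available]
```

    An empty result means every required face was found in the listing; any names returned need to be installed under the directory passed as --fonts_dir.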
    

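    Replacing visually similar characters

    Characters are replaced with similar-looking characters that already exist in the Times New Roman font. The replacement code is: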

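    # Map full-width and small-form punctuation to ASCII look-alikes.
    # str.replace returns a new string, so the result is assigned back.
    text_line = text_line.replace('(', '(').replace(')', ')').replace('﹣', '-').replace('.', '.')\
                         .replace(':', ':')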
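
    The same substitution can be expressed as a single translation table, which is easier to extend as more look-alike pairs are found. This is a minimal sketch, assuming the characters being replaced are the full-width (and small-form) variants of the ASCII punctuation; `LOOKALIKES` and `normalize_line` are illustrative names.

```python
# Hypothetical table: full-width / small-form punctuation -> ASCII look-alike.
LOOKALIKES = str.maketrans({
    "(": "(",   # U+FF08 full-width left parenthesis
    ")": ")",   # U+FF09 full-width right parenthesis
    "﹣": "-",   # U+FE63 small hyphen-minus
    ".": ".",   # U+FF0E full-width full stop
    ":": ":",   # U+FF1A full-width colon
})


def normalize_line(text_line):
    """Return text_line with every mapped character swapped for its stand-in."""
    return text_line.translate(LOOKALIKES)


print(normalize_line("f(x):3﹣1"))  # prints f(x):3-1
```

    A table keeps each pair on its own line, so adding a new pair does not lengthen a chain of replace() calls.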
    

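    Replacing other special characters

    Training procedure

    Step 1

    Copy training/tesstrain.sh to training/tesstrain0.sh so the original stays intact, then comment out everything in tesstrain0.sh below line 54. The result is: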

    source "$(dirname $0)/tesstrain_utils.sh"
    
    ARGV=("$@")
    parse_flags
    echo $WORKSPACE_DIR
    
    mkdir -p ${TRAINING_DIR}
    # tlog "\n=== Starting training for language '${LANG_CODE}'"
    #
    # source "$(dirname $0)/language-specific.sh"
    # set_lang_specific_parameters ${LANG_CODE}
    #
    # initialize_fontconfig
    #
    # phase_I_generate_image 8
    # phase_UP_generate_unicharset
    # phase_D_generate_dawg
    # if ((LINEDATA)); then
    #   phase_E_extract_features "lstm.train" 8 "lstmf"
    #   make__lstmdata
    # else
    #   phase_E_extract_features "box.train" 8 "tr"
    #   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
    #   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    #       phase_S_cluster_shapes
    #   fi
    #   phase_M_cluster_microfeatures
    #   phase_B_generate_ambiguities
    #   make__traineddata
    # fi
    #
    # tlog "\nCompleted training for language '${LANG_CODE}'\n"
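
    Commenting out the tail of the script by hand is error-prone, so the edit can also be scripted. The sketch below is one way to do it, assuming the whole script has been read into a string; `comment_out_after` is a hypothetical helper, and the line number to pass depends on the tesstrain.sh revision in use.

```python
def comment_out_after(script_text, last_kept_line):
    """Comment out every line after `last_kept_line` (1-based).

    Lines that are already comments are left alone; blank lines become a
    bare "#", matching the listing above.
    """
    out = []
    for number, line in enumerate(script_text.splitlines(keepends=True), start=1):
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        if number <= last_kept_line or body.lstrip().startswith("#"):
            out.append(line)
        elif body.strip() == "":
            out.append("#" + ending)
        else:
            out.append("# " + body + ending)
    return "".join(out)
```

    Reading training/tesstrain.sh, passing its text through this function, and writing the result to training/tesstrain0.sh reproduces the manual edit without touching the original file.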
    
    

    Run the following command:

    training/tesstrain0.sh --fonts_dir /usr/share/fonts \
    --training_text ../training_data/input_data0.txt \
    --langdata_dir ../langdata --tessdata_dir ./tessdata \
    --lang chi_sim --linedata_only --noextract_font_properties \
    --exposures "0" --fontlist "SIMSUN" \
    --output_dir ~/tesstutorial/chitest
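
    This run prints $WORKSPACE_DIR, a freshly created directory of the form /tmp/tmp.*. If that output has scrolled away, the most recently modified directory matching the pattern can be located as sketched below. It assumes no other program created a newer tmp.* directory in the meantime; `newest_workspace` is a hypothetical helper.

```python
from pathlib import Path


def newest_workspace(tmp_root="/tmp"):
    """Return the most recently modified tmp.* directory, or None if none exist."""
    candidates = [p for p in Path(tmp_root).glob("tmp.*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

    Comparing the returned path with the value echoed by the script is a quick sanity check before writing it into any configuration.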
    

    Step 2

    Modify the shell files as follows:

    1. List the directories under /tmp, find the one named 'tmp.*', and copy its name.
    2. Copy training/tesstrain_utils.sh to training/tesstrain_utils0.sh, and in tesstrain_utils0.sh set the WORKSPACE_DIR variable to that 'tmp.*' directory (the matching directory under /tmp).
    3. In training/tesstrain0.sh, change the line source "$(dirname $0)/tesstrain_utils.sh" to source "$(dirname $0)/tesstrain_utils0.sh".
    4. Change which lines are commented out in training/tesstrain0.sh, so that it reads:
    source "$(dirname $0)/tesstrain_utils0.sh"
    
    ARGV=("$@")
    parse_flags
    echo $WORKSPACE_DIR
    
    mkdir -p ${TRAINING_DIR}
    tlog "\n=== Starting training for language '${LANG_CODE}'"
    
    source "$(dirname $0)/language-specific.sh"
    set_lang_specific_parameters ${LANG_CODE}
    
    initialize_fontconfig
    
    phase_I_generate_image 8
    # phase_UP_generate_unicharset
    # phase_D_generate_dawg
    # if ((LINEDATA)); then
    #   phase_E_extract_features "lstm.train" 8 "lstmf"
    #   make__lstmdata
    # else
    #   phase_E_extract_features "box.train" 8 "tr"
    #   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
    #   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    #       phase_S_cluster_shapes
    #   fi
    #   phase_M_cluster_microfeatures
    #   phase_B_generate_ambiguities
    #   make__traineddata
    # fi
    #
    # tlog "\nCompleted training for language '${LANG_CODE}'\n"
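
    Items 2 and 3 of the list above are plain text rewrites and can be sketched as follows. The code assumes tesstrain_utils.sh assigns WORKSPACE_DIR on a line that starts with `WORKSPACE_DIR=`; check this against the file in use. Both function names are illustrative.

```python
import re


def pin_workspace(utils_text, workspace_dir):
    """Replace the first line-initial WORKSPACE_DIR=... assignment with a fixed path."""
    return re.sub(
        r"(?m)^WORKSPACE_DIR=.*$",
        # A function replacement avoids backslash escapes being interpreted.
        lambda match: 'WORKSPACE_DIR="%s"' % workspace_dir,
        utils_text,
        count=1,
    )


def use_utils_copy(train_text):
    """Point the training script at the edited copy of the utilities file."""
    return train_text.replace("tesstrain_utils.sh", "tesstrain_utils0.sh")
```

    Pinning the workspace means later runs reuse the same directory instead of creating a new one, which is what lets the edited .box files survive between steps.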
    

    Run the following command:

    training/tesstrain0.sh --fonts_dir /usr/share/fonts \
    --training_text ../training_data/input_data0.txt \
    --langdata_dir ../langdata --tessdata_dir ./tessdata \
    --lang chi_sim --linedata_only --noextract_font_properties \
    --exposures "0" --fontlist "SIMSUN" \
    --output_dir ~/tesstutorial/chitest
    

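    Step 3

    In the desktop file manager, open the "/tmp/tmp.J8JYbdYYrv/chi_sim" directory and edit the files with the .box extension, replacing each special character with the substitute character chosen for it. Note: every occurrence of a given character must be replaced with the same substitute.

    For example, this article replaces every '≤' with "小".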
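
    Editing the box files by hand makes it easy to miss an occurrence, so the substitution can also be scripted. The sketch below assumes the standard box layout of one entry per line, `glyph left bottom right top page`, and rewrites only the glyph field. The function names and the mapping are illustrative.

```python
from pathlib import Path


def substitute_box_glyphs(box_text, mapping):
    """Replace the glyph field of each box line according to `mapping`."""
    out = []
    for line in box_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        # Split from the right so the five numeric fields are separated
        # even when the glyph itself is a space.
        parts = body.rsplit(" ", 5)
        if len(parts) == 6 and parts[0] in mapping:
            parts[0] = mapping[parts[0]]
            body = " ".join(parts)
        out.append(body + ending)
    return "".join(out)


def rewrite_box_files(directory, mapping):
    """Apply the substitution in place to every .box file in `directory`."""
    for path in Path(directory).glob("*.box"):
        text = path.read_text(encoding="utf-8")
        path.write_text(substitute_box_glyphs(text, mapping), encoding="utf-8")
```

    Calling `rewrite_box_files("/tmp/tmp.J8JYbdYYrv/chi_sim", {"≤": "小"})` would apply this article's example mapping; using one dictionary for all files keeps the replacement consistent.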

    Step 4

    Change the commented-out lines in training/tesstrain0.sh so that it reads:

    source "$(dirname $0)/tesstrain_utils0.sh"
    
    ARGV=("$@")
    parse_flags
    
    # mkdir -p ${TRAINING_DIR}
    # tlog "\n=== Starting training for language '${LANG_CODE}'"
    #
    # source "$(dirname $0)/language-specific.sh"
    # set_lang_specific_parameters ${LANG_CODE}
    #
    # initialize_fontconfig
    #
    # phase_I_generate_image 8
    echo $TRAINING_DIR
    phase_UP_generate_unicharset
    phase_D_generate_dawg
    if ((LINEDATA)); then
      phase_E_extract_features "lstm.train" 8 "lstmf"
      make__lstmdata
    else
      phase_E_extract_features "box.train" 8 "tr"
      phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
      if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
          phase_S_cluster_shapes
      fi
      phase_M_cluster_microfeatures
      phase_B_generate_ambiguities
      make__traineddata
    fi
    
    tlog "\nCompleted training for language '${LANG_CODE}'\n"
    

    Run the following command:

    training/tesstrain0.sh --fonts_dir /usr/share/fonts \
    --training_text ../training_data/input_data0.txt \
    --langdata_dir ../langdata --tessdata_dir ./tessdata \
    --lang chi_sim --linedata_only --noextract_font_properties \
    --exposures "0" --fontlist "SIMSUN" \
    --output_dir ~/tesstutorial/chitest
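
    Before moving on, it is worth checking that this run produced the files the commands in step five read from the output directory. A minimal sketch, with the file names taken from those commands and `missing_outputs` as a hypothetical helper:

```python
from pathlib import Path

# Files that the step-five commands expect in the --output_dir directory.
EXPECTED = [
    "chi_sim.training_files.txt",
    "chi_sim.lstm-number-dawg",
    "chi_sim.lstm-punc-dawg",
    "chi_sim.lstm-word-dawg",
]


def missing_outputs(output_dir, expected=EXPECTED):
    """Return the expected file names that are absent from `output_dir`."""
    root = Path(output_dir).expanduser()
    return [name for name in expected if not (root / name).is_file()]
```

    `missing_outputs("~/tesstutorial/chitest")` returning an empty list suggests the run completed; any names returned point at the phase that needs to be rerun.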
    

    Step 5

    Run the following commands in order:

    1. mkdir -p ~/tesstutorial/chituned_from_chisim
    2. combine_tessdata -e ../tessdata/chi_sim.traineddata ~/tesstutorial/chituned_from_chisim/chi_sim.lstm
    3. lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm --train_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt --eval_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt --target_error_rate 0.01
    4. lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned.lstm \
      --continue_from ~/tesstutorial/chituned_from_chisim/chituned_checkpoint \
      --stop_training
    5. combine_tessdata -o ./tessdata/chi_sim.traineddata \
      ~/tesstutorial/chituned_from_chisim/chituned.lstm \
      ~/tesstutorial/chitest/chi_sim.lstm-number-dawg \
      ~/tesstutorial/chitest/chi_sim.lstm-punc-dawg \
      ~/tesstutorial/chitest/chi_sim.lstm-word-dawg

    The newly generated traineddata is in the tesseract/tessdata directory.
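
    The five commands can also be assembled programmatically so they are easy to review before anything runs. The sketch below only builds argument lists using the flags and paths shown above; `finetune_commands` and `run_all` are hypothetical helpers, and `run_all` assumes combine_tessdata and lstmtraining are on PATH and that it is started from the tesseract source directory.

```python
import subprocess
from pathlib import Path


def finetune_commands(tutorial_dir="~/tesstutorial"):
    """Return the step-five commands as argument lists, in execution order."""
    base = Path(tutorial_dir).expanduser()
    tuned = base / "chituned_from_chisim"
    data = base / "chitest"
    listfile = str(data / "chi_sim.training_files.txt")
    return [
        ["mkdir", "-p", str(tuned)],
        ["combine_tessdata", "-e", "../tessdata/chi_sim.traineddata",
         str(tuned / "chi_sim.lstm")],
        ["lstmtraining",
         "--model_output", str(tuned / "chituned"),
         "--continue_from", str(tuned / "chi_sim.lstm"),
         "--train_listfile", listfile,
         "--eval_listfile", listfile,
         "--target_error_rate", "0.01"],
        ["lstmtraining",
         "--model_output", str(tuned / "chituned.lstm"),
         "--continue_from", str(tuned / "chituned_checkpoint"),
         "--stop_training"],
        ["combine_tessdata", "-o", "./tessdata/chi_sim.traineddata",
         str(tuned / "chituned.lstm"),
         str(data / "chi_sim.lstm-number-dawg"),
         str(data / "chi_sim.lstm-punc-dawg"),
         str(data / "chi_sim.lstm-word-dawg")],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

    Printing the lists from `finetune_commands()` first gives a dry run; `run_all` is not called automatically because the training step can take a long time.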

      Article link: https://www.haomeiwen.com/subject/wzvvlxtx.html