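A survey of current workflows and methods for table recognition in business documents (bills, receipts).
Detecting and recognizing tables in images: Beihang University and Microsoft propose the new TableBank dataset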
- TableBank
- Image-based table analysis
ICDAR 2019 table recognition papers and competition review (Part 2)
- table2latex
Reference papers
- A Dataset of Annotated Spreadsheets for Layout and Table Recognition
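- Commercial offering: http://ai.onlyou.com/site/tableDocument.htm
- Python + OpenCV: splitting a table image along its ruling lines and recognizing the cells https://blog.csdn.net/HXiao0805/article/details/90729541
- Recognizing tables and text in scanned images that contain tables (C++ project)
- Optical table recognition - recognize tables in scan images using OpenCV (Python project)
- Tencent Cloud table recognition API
- Baidu Cloud table recognition API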
- Table Detection, Information Extraction and Structuring using Deep Learning
- Rethinking Table Recognition using Graph Neural Networks
- TableBank: Table Benchmark for Image-based Table Detection and Recognition
- Deep Learning for Detection and Structure Recognition of Tables in Document Images
- Discussion: Deep Learning for Invoice Information Extraction
- Discussion: How to extract the structure of invoice data using tensorflow API faster crnn object detection
- Paper: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images
- Paper: Visual Detection with Context for Document Layout Analysis
Document information extraction: Table Recognition
- Keywords: Table Recognition | Table Detection
- Table detection
- https://github.com/doc-analysis/TableBank
- https://github.com/weidafeng/TableCell (data not yet released, pending acceptance of the paper)
- https://github.com/open-mmlab/mmdetection
- MMDetection Chinese documentation: a detailed walkthrough
- https://github.com/mawanda-jun/TableTrainNet
- https://github.com/rinkstiekema/PDF-Table-Structure-Recognition-using-deep-learning
- https://github.com/Academic-Hammer/SciTSR
- https://github.com/Sargunan/Table-Detection-using-Deep-learning
- https://www.jianshu.com/p/9511366f951d Walkthrough of the paper "Table Detection using Deep Learning"
- From R-CNN, Fast R-CNN, and Faster R-CNN to Mask R-CNN
- https://github.com/interviewBubble/Tabulo
- https://github.com/DevashishPrasad/CascadeTabNet
- Table detection: Table-detection-Mask_RCNN
- Object detection framework: https://github.com/facebookresearch/Detectron
- Fine-tuning project based on the models released with TableBank https://github.com/holms-ur/fine-tuning
- Mask RCNN https://github.com/matterport/Mask_RCNN/
- ICDAR2017 https://flashgene.com/archives/80180.html
- ICDAR 2019 table recognition papers and competition review (Part 1) https://blog.csdn.net/moxibingdao/article/details/106667444
- PDF-Table-Structure-Recognition-using-deep-learning
- IBM pubtabnet https://github.com/ibm-aur-nlp/PubTabNet
- Table recognition
- Table content parsing
Datasets
- Chinese OCR datasets https://blog.csdn.net/javastart/article/details/104069709
- ICDAR data https://rrc.cvc.uab.es/?ch=2&com=downloads
- 403 table images https://github.com/sgrpanchal31/table-detection-dataset
- TableBank annotated data https://drive.google.com/drive/folders/1lxpK4sa4LTSHPFuQEsjFdx87NAlQ8F5O
- tablenet dataset http://www.icst.pku.edu.cn/cpdp/sjzy/index.htm
Common tools
- pytesseract: Python-tesseract is a Python wrapper for Google's Tesseract-OCR https://pypi.org/project/pytesseract/
- Installing tesseract.exe https://www.jianshu.com/p/3326c7216696
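Bordered tables
- Remove the non-text parts of the image. All text in the data being processed is black, so it is enough to convert the image to HSV space and extract the black pixels. This also removes seals, watermarks, and similar marks.
- Crop the blank margins of the image.
- Detect tables in the image by finding contours, and discard tables nested inside another table.
- Extract the table lines with erosion and dilation. This can be iterated several times, followed by one more erosion.
- Use OpenCV connected-component search to find the cells.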
Borderless tables
- Recognize with OpenCV