第5课：三、Kranken 模块

作者: uptnv | 来源:发表于2020-06-12 15:56 被阅读0次

第5课：三、Kranken 模块
Worte
Kranken2: rsync_from_ncbi.pl: un
学写作的第十一天
树木图四
程序化的新心关系49
接口自动化测试怎么一步步学、按照我的方式、你不可能学不会-第二集
第10周工作总结
02-Node 第三方模块
高阶理论：企业变革如何才能成功？

[TOC]

Release the Kraken

我们要看的下一个库是 Kraken，它是由巴黎PSL大学开发的。我们将使用Kraken的目的是在给定图像中检测文本行来作为的边界框（detect lines of text as bounding boxes）, tesseract 的最大局限在于其内部缺少布局（layout）引擎。 Tesseract 希望输入 clean 的文本图像，并且，如果我们不 crop up 其他artifacts，它就可能无法正确处理，但是Kraken可以帮助我们分割页面。让我们来看看。

首先我们来看看 Kranken 模块

import kraken
help(kraken)
----
Help on package kraken:

NAME
    kraken - entry point for kraken functionality

PACKAGE CONTENTS
    binarization
    ketos
    kraken
    lib (package)
    linegen
    pageseg # 
    repo
    rpred
    serialization
    transcribe

DATA
    absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...
    division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...
    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...

FILE
    /opt/conda/lib/python3.7/site-packages/kraken/__init__.py

这里没有太多讨论，但是有许多看起来很有趣的子模块。我花了一些时间在他们的网站上，我认为处理所有页面细分的 the pageseg module 是我们要使用的模块。让我们看看

from kranken import pageseg
help(pageseg)

因此，看起来我们可以调用一些不同的函数，而 the segment function 看起来特别合适。

 segment(im, text_direction='horizontal-lr', scale=None, maxcolseps=2, black_colseps=False, no_hlines=True, pad=0, mask=None)
        # Segments a page into text lines.
        
        Segments a page into text lines and returns the absolute coordinates of
        each line in reading order.
        ...
        Args:
            im (PIL.Image): A bi-level page of mode '1' or 'L'
            text_direction (str): Principal direction of the text
                                  (horizontal-lr/rl/vertical-lr/rl)
            scale (float): Scale of the image
            maxcolseps (int): Maximum number of whitespace column separators
            black_colseps (bool): Whether column separators are assumed to be
                                  vertical black lines or not # 列分隔符是否为垂直黑线
            no_hlines (bool): Switch for horizontal line removal
            pad (int or tuple): Padding to add to line bounding boxes. If int the
                                same padding is used both left and right. If a
                                2-tuple, uses (padding_left, padding_right).
            mask (PIL.Image): A bi-level mask image of the same size as `im` where
                              0-valued regions are ignored for segmentation
                              purposes. Disables column detection.
        Returns:
            {'script_detection': True, 'text_direction': '$dir', 'boxes':
            [[(script, (x1, y1, x2, y2)),...]]}: A dictionary containing the text
            direction and a list of lists of reading order sorted bounding boxes
            under the key 'boxes' with each list containing the script segmentation
            of a single line. Script is a ISO15924 4 character identifier.

我喜欢这个库在文档方面的表现力-我可以立即看到我们正在使用PIL.Image文件，并且作者甚至指示我们需要传递二值图（e.g. 'l'）或灰度图（e.g. 'L'）。我们还可以看到，返回值是带有两个键的字典对象，“ text_direction”将返回文本方向的字符串，“ boxes”将显示为元组列表，其中每个元组为 a box iln the original image.。

让我们在文本图像上尝试一下。我在一个名为 two_col.png 的文件中有一段简单的文字，该文件来自校园的报纸

from PIL import Image
im = Image.open('readonly/two_col.png')
# 显示图片
display(im)
# 现在让我们将其转换为黑白，并用kranken包把它分成几行
bounding_boxes = pageseg.segment(im.convert('l'))['boxes']
# 然后我们把 box 打印出来：box的左上和右下的坐标
print(bounding_boxes)

[[100, 50, 449, 74], [131, 88, 414, 120], [59, 196, 522, 229], [18, 239, 522, 272], [19, 283, 522, 316], [19, 327, 525, 360], [19, 371, 523, 404], [18, 414, 524, 447], [17, 458, 522, 491], [19, 502, 141, 535], [58, 546, 521, 579], [18, 589, 522, 622], [19, 633, 521, 665], [563, 21, 1066, 54], [564, 64, 1066, 91], [563, 108, 1066, 135], [564, 152, 1065, 179], [563, 196, 1065, 229], [563, 239, 1066, 272], [562, 283, 909, 316], [600, 327, 1066, 360], [562, 371, 1066, 404], [562, 414, 1066, 447], [563, 458, 1065, 485], [563, 502, 1065, 535], [562, 546, 1066, 579], [562, 589, 1064, 622], [562, 633, 1066, 660], [18, 677, 833, 704], [18, 721, 1066, 754], [18, 764, 1065, 797], [17, 808, 1065, 841], [18, 852, 1067, 885], [18, 895, 1065, 928], [17, 939, 1065, 972], [17, 983, 1067, 1016], [18, 1027, 1065, 1060], [18, 1070, 1065, 1103], [18, 1114, 1065, 1147]]

image-20200509233703620

好吧，非常简单的两列文本，然后是列表，这些列表是该文本行的边界框（a list of lists which are the bounding boxes of lines of that text. ）让我们编写一些例程以尝试更清楚地看到效果。我将整理一下自己的行为并编写真实的文档，这是一个好习惯

def show_boxes(img):
    '''修改传递的图像，以显示由kraken运行的图像上的一系列边界框
    :param img: A PIL.Image object
    :return img: The modified PIL.Image object'''
    # 引入 PIL 模块的 draw 对象 以对获取到的坐标进行方框绘制
    from PIL import ImageDraw
    drawing_object = ImageDraw.Draw(img)
    # 使用 pageseg.segment 方法创建一系列方框坐标
    bounding_boxes = pages.segment(img.convert('l'))['boxes']
    # 遍历这个（包含方框坐标元组）的列表
    for box in bounding_boxes:
        drawing_object.rectangle(box, fill = None, outline = 'red')
        
    return img

# 使用display 展示变更过后的 图像
display(show_boxes(Image.open('readonly/two_col.png')))

image-20200510101344089

很不错！有趣的是，kraken不能完全确定如何处理这两列格式的文本。在某些情况下，kraken 在每一单列中标识了一条线，而在其他情况下，kraken把两列文本识别为一行。这有关系吗？好吧，这确实取决于我们的目标。在这种情况下，我想看看我们是否可以对此有所改进。

因此，我们将在此处不说脚本。尽管本周的讲座是关于库的，但最后一门课程的目标是使您充满信心，即使您使用的库不能完全满足您的要求，也可以将您的知识应用于实际的编程任务。看上图，带有两列示例和红色框，您认为我们如何修改此图像以提高kraken的文本行能力？

感谢您分享您的想法，我期待看到课程中每个人都提出的广泛想法。这是我的解决方案-在浏览 pageseg() 函数上的kraken文档时，我看到有一些参数来提高segmentation的效果。其中之一是 black_colseps 参数。如果设置为True，则kraken将假定各列是用黑线分隔。我们这里并没有用黑色竖线分割两列，但是，我认为我们可以将源图像更改为在列之间使用黑色分隔符。

我们在

def show_boxes(img):
    # Lets bring in our ImageDraw object
    from PIL import ImageDraw
    # And grab a drawing object to annotate that image
    drawing_object=ImageDraw.Draw(img)
    # 增加一个文本分割的参数 black_colseps=True
    bounding_boxes=pageseg.segment(img.convert('1'), black_colseps=True)['boxes']
    # Now lets go through the list of bounding boxes
    for box in bounding_boxes:
        # An just draw a nice rectangle
        drawing_object.rectangle(box, fill = None, outline ='red')
    # And to make it easy, lets return the image object
    return img

下一步是考虑我们要检测空白列分隔符（detect a white column separator）的算法。在进行一些实验时，我决定仅在的间隔至少为25个像素宽（大约是一个字符的宽度，六倍行高）时才认为这是分隔符。宽度很容易，让我们做一个变量

char_width = 25
# 高度较难，因为它取决于文本的高度。 我将编写一个例程来计算一行的平均高度
def calculate_line_height(img):
    '''计算 一个image 的 the average height of a line 
    :param img: A PIL.Image object
    :retuen: The average line height in pixels'''
    # 首先得到 bounding box 坐标的列表
    bounding_boxes = pageseg.segment(img.convert('l'))['boxes']
    # each box is a tuple of (top, left, bottom, right)
    # 因此每行的高度是(top-bottom)
    height_acculator = 0
    for box in bounding_boxes:
        height_accmulator = height_accumlator + box[3]-box[1]
        # 这有点棘手，请记住，我们从PIL的左上角开始计数！ 现在让我们只返回平均高度，就可以通过将其变为整数来将其更改为最接近的完整像素
    return int(height_accumulator/len(bounding_boxes))

# 让我们测试下每行的平均高度
line_height = calculate_line_height(Image.open('readonly/two_col.png'))
print(line_height)

好的，所以一条线的平均高度是31。现在，我们要扫描图像-依次查看每个像素：以确定是否存在空白的区域（whitespace）。 How bit of a block should we look for? 那是一门艺术，而不是一门科学。看我们的示例图像，我要说一个合适的块的大小应该是 one char_width wide, and six line_heights tall. 但是我也只是目测的，让我们创建一个新的框，称为间隙框（gap_box），代表该区域

# Lets create a new box called gap box that represents this area
gap_box=(0,0,char_width,line_height*6)
gap_box
---
(0, 0, 25, 186)

我们将希望有一个函数，输入一幅图像中的像素，可以检查该像素是否在其右侧和下方有whitespace。本质上，我们要测试像素是否是 gap_box 的对象的左上角。 If so, we should insert a line to "break up" this box before sending to kraken

回想一下，我们可以使用 img.getpixel() 函数获得一个坐标的像素值。它以整数元组的形式返回此值，每个颜色通道一个。当适用于二值化图像（黑白），将只返回一个值。如果值为0，则为黑色像素；如果为白色，则值为255
我们将假设图像已被二值化。检查边界框的算法非常简单：我们有一个位置作为起点，然后我们要检查该位置右边的所有像素，直到 gap_box[2]

# Lets call this new function gap_check
def gap_check(img, location):
    '''Checks the img in a given (x,y) location to see if it fits the description
    of a gap_box
    :param img: A PIL.Image file
    :param location: A tuple (x,y) which is a pixel location in that image
    :return: True if that fits the definition of a gap_box, otherwise False
    '''
    for x in range(location[0], location[0]+gap_box[2]):
        # the height is similar, so lets iterate a y variable to gap_box[3]
        for y in range(location[1], location[1]+gap_box[3]):
            # 我们想要check这个区域内的像素是不是白色，如果是白色我们什么都不做，
            # 如果是黑色,return False 说明从这个坐标点爱看i是的 block区域内不是 列间隔区域
            # 我们最好确认这在图像内
            if x < img.width and y < img.height:
                if img.getpixel((x,y)) != 255:
                    return False
    # 如果我们遍历了所有的以 loaction 为起点的 gap_box 则这就是一个gap
    return True

好的现在我们有了一个 function 来 check 列文本图像中的列gap，当我们找到这个 gap 之后我们想要做的是在这个 gap_box 的正中间画一条线，让我们写一个function来完成这个工作

def draw_sep(img,location):
    '''Draws a line in img in the middle of the gap discovered at location. Note that
    this doesn't draw the line in location, but draws it at the middle of a gap_box
    starting at location.
    :param img: A PIL.Image file
    :param location: A tuple(x,y) which is a pixel location in the image
    '''
    # First 画 操作的代码段 
    from PIL import ImageDraw
    drawing_object=ImageDraw.Draw(img)
    # next, lets decide what the middle means in terms of coordinates in the image
    x1=location[0]+int(gap_box[2]/2)
    # and our x2 is just the same thing, since this is a one pixel vertical line
    x2=x1
    # our starting y coordinate is just the y coordinate which was passed in, the top of the box
    y1=location[1]
    # but we want our final y coordinate to be the bottom of the box
    y2=y1+gap_box[3]
    drawing_object.rectangle((x1,y1,x2,y2), fill = 'black', outline ='black')
    # 这个函数不需要 return，因为我们改变了这个图

现在，我们把所有的方法合起来，我们迭代整个文本图像的所有像素来确认它是否存在一个 gap，如果存在则在整个 gap_box 中间画一条线，

def process_image(img):
    '''Takes in an image of text and adds black vertical bars to break up columns
    :param img: A PIL.Image file
    :return: A modified PIL.Image file
    '''
    # we'll start with a familiar iteration process
    for x in range(img.width):
        for y in range(img.height):
            # check if there is a gap at this point
            if (gap_check(img, (x,y))):
                # then update image to one which has a separator drawn on it
                draw_sep(img, (x,y))
    # and for good measure we'll return the image we modified
    return img

# Lets read in our test image and convert it through binarization
i=Image.open("readonly/two_col.png").convert("L")
i=process_image(i)
display(i)

#Note: This will take some time to run! Be patient!

image-20200604234503228

一点也不差！图片底部的效果对我来说有点出乎意料，但这是有道理的。您可以想象有几种方法可以尝试控制它。让我们看看通过kraken layout engine运行时新图像的工作方式

display(show_boxes(i))  # i=process_image(i)

看起来非常准确，并且可以解决我们面临的问题。虽然我们创建的方法确实非常慢，但是如果我们想在较大的文本上使用它，这将是一个问题。但我想向您展示如何混合使用自己的逻辑并使用正在使用的库。仅仅因为Kraken不能完美地工作，并不意味着我们不能在其之上构建更具体的用例。
我想暂停一下本讲座，并请您反思一下我们在此处编写的代码。我们以一些非常简单的库用法开始了本课程，但是现在我们正在更深入地研究并在这些库的帮助下自行解决问题。在进入最后一个库之前，您认为您准备如何充分利用python技能？

第5课：三、Kranken 模块
[TOC] Release the Kraken 我们要看的下一个库是 Kraken，它是由巴黎PSL大学开发的。...
Worte
Das Essen widert den Kranken an.
Kranken2: rsync_from_ncbi.pl: un
使用conda安装Kranken2，下载数据库报错信息原因：ftp地址已改为http 解决办法：修改脚本文件'...
学写作的第十一天
第11节总是不知道写什么该怎么办？模块三选题能力摘要: 我们开始进入第三模块的学习，聊聊选题，也...
树木图四
在树木图的旅程当中，我们现在已经来到了第19站，进入第三个模块的学习。第三模块是对之前第一和第二模块的深化。第一模...
程序化的新心关系49
2021.3.5 星期三深圳盐田龙笨笨啾啾过不去才使用模块模块只是工具戛然而止打包封存不触碰第...
接口自动化测试怎么一步步学、按照我的方式、你不可能学不会-第二集
第14节 logging日志模块封装 logging模块介绍 logging模块是Python内置的标准模块，主要...
第10周工作总结
英语期中考试要考到第6模块。本周我们结束了第6模块的学习，同时学习了第7模块第1单元。本周英语早读，同样是在7:...
02-Node 第三方模块
Node 第三方模块什么是第三方模块获取第三方模块第三方模块 nrm第三方模块 nodemon第三方模块 Gulp...
高阶理论：企业变革如何才能成功？
模块三重塑组织第22讲高阶理论：企业变革如何才能成功？ ➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖ ️高阶理论：企业变...