Extracting Text from PDFs (Part 2)

Author: 王碧琪
Copy editor: 方言
Technical editor-in-chief: 张邯

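In the earlier post "Extracting Text from PDFs: Getting Started" (《提取PDF文本信息：入门》), we used pdfminer to extract the text of a PDF document. pdfplumber, the library introduced today, does the same job with shorter code and a more direct workflow. Let's take a look~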

I. Introduction

The PDF document to be processed looks like this:

image

The extract_text function in pdfplumber is all you need to extract text. The official documentation describes it as follows:

.extract_text(x_tolerance=0, y_tolerance=0) Collates all of the page's character objects into a single string. Adds spaces where the difference between the x1 of one character and the x0 of the next is greater than x_tolerance. Adds newline characters where the difference between the doctop of one character and the doctop of the next is greater than y_tolerance.

The extract_words function can also extract text, with some differences. The official description is:

.extract_words(x_tolerance=0, y_tolerance=0) Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where the difference between the x1 of one character and the x0 of the next is less than or equal to x_tolerance and where the doctop of one character and the doctop of the next is less than or equal to y_tolerance.

Both functions return the text content, but the details of what they return differ. A practical example below shows how.
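To make the grouping rule in the quoted description concrete, here is a minimal pure-Python sketch of it. This is an illustration only, not pdfplumber's actual implementation: the function names group_chars_into_words and _merge and the sample character dictionaries are hypothetical, the characters are assumed to arrive in reading order, and whitespace handling is ignored.

```python
def _merge(chars):
    """Combine a run of character dicts into one word dict."""
    return {
        "text": "".join(c["text"] for c in chars),
        "x0": chars[0]["x0"],
        "x1": chars[-1]["x1"],
    }


def group_chars_into_words(chars, x_tolerance=0, y_tolerance=0):
    """Group characters into words using the rule from the quoted description.

    A character stays in the current word when the horizontal gap to the
    previous character is <= x_tolerance and the doctop difference is
    <= y_tolerance. Simplified illustration, not pdfplumber's own code.
    """
    words = []
    current = []
    for ch in chars:
        if current:
            prev = current[-1]
            same_word = (
                ch["x0"] - prev["x1"] <= x_tolerance
                and abs(ch["doctop"] - prev["doctop"]) <= y_tolerance
            )
            if not same_word:
                words.append(_merge(current))
                current = []
        current.append(ch)
    if current:
        words.append(_merge(current))
    return words


# Hypothetical characters: "PDF", a 5-unit gap, "ok", then "x" on a new line
sample_chars = [
    {"text": "P", "x0": 0, "x1": 5, "doctop": 10},
    {"text": "D", "x0": 5, "x1": 10, "doctop": 10},
    {"text": "F", "x0": 10, "x1": 15, "doctop": 10},
    {"text": "o", "x0": 20, "x1": 25, "doctop": 10},
    {"text": "k", "x0": 25, "x1": 30, "doctop": 10},
    {"text": "x", "x0": 0, "x1": 5, "doctop": 30},
]

print([w["text"] for w in group_chars_into_words(sample_chars)])
# ['PDF', 'ok', 'x']
print([w["text"] for w in group_chars_into_words(sample_chars, x_tolerance=5)])
# ['PDFok', 'x']
```

Raising x_tolerance merges pieces that sit further apart, which is why the same page can yield a different number of words under different settings.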

II. Worked Example

(1) Import the library, open the PDF document to be processed, and get the pages object

import pdfplumber

# Open the PDF and get its list of page objects
pdf = pdfplumber.open(r"E:\01.pdf")
pages = pdf.pages

You can also use a with statement. It works the same way and closes the file automatically when the block ends:

import pdfplumber

with pdfplumber.open(r"E:\01.pdf") as pdf:
    pages = pdf.pages
    # Work with pages inside this block, while the file is still open

(2) Process each page of the PDF
pages is an iterable, so we handle it page by page:

for p in pages:
    print(p)
    print(p.page_number)
    print(p.width)
    print(p.height)
    print(p.objects)  # objects found on the page, e.g. lines, chars, rects

Part of the output is shown below:

image

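p is the object pdfplumber produces for each page of the document. It has several attributes: page_number returns the page number, width the page width, height the page height, and objects all the objects detected on the page, including lines, chars, and rects. The extract_text() function pulls the text out of these objects.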

for p in pages:
    text=p.extract_text()
    print(text)
    print(type(text))

The result:

image

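As you can see, the text of the PDF comes back with the same line breaks as the page layout (these are not the actual paragraph breaks), and the returned object is a string.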

You can also use the extract_words() function.

for p in pages:
    word=p.extract_words()
    print(word)
    print(len(word))
    print(type(word))

The result:

image

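The output shows that word is a list containing 8 dictionaries. Here each dictionary corresponds to one line of text (most likely because the sample lines contain no spaces, so each line counts as a single "word"). Each dictionary records x0, x1, top, and bottom to describe where the object sits on the page, and the value under the text key is the text itself. To get all the text, loop over the list and pull out each text value: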

for p in pages:
    word=p.extract_words()
    for unitword in word:
        print(unitword['text'])

The result:

image

Each item printed here is also a string, and the output matches what extract_text() produced above. The last step is simply to export the result.
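The export step is not shown in the original, so here is one minimal way it could look, writing the text of every page to a UTF-8 text file. The function name save_page_texts, the page separator line, and the output file name are assumptions made for illustration. The function takes plain strings so it can be tried without a PDF, and it treats a missing text (None) as an empty page.

```python
from pathlib import Path


def save_page_texts(page_texts, out_path):
    """Write one block of text per page to out_path.

    page_texts is any iterable of strings (or None for a page with no text).
    Returns the number of pages written.
    """
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # The separator line is a hypothetical choice, not a pdfplumber feature
        parts.append(f"--- page {number} ---\n{text or ''}\n")
    Path(out_path).write_text("".join(parts), encoding="utf-8")
    return len(parts)


# Intended use with pdfplumber (not run here; path taken from the article):
# import pdfplumber
# with pdfplumber.open(r"E:\01.pdf") as pdf:
#     save_page_texts((p.extract_text() for p in pdf.pages), "01.txt")
```

Passing a generator keeps the pages flowing one at a time, and the call sits inside the with block so the PDF is still open while the text is being extracted.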

III. Complete Program

import pdfplumber

with pdfplumber.open(r"E:\01.pdf") as pdf:
    pages = pdf.pages
    for p in pages:
        # Basic information about the page
        print(p)
        print(p.page_number)
        print(p.width)
        print(p.height)
        print(p.objects)  # lines, chars, rects

        # Text of the whole page as one string
        text = p.extract_text()
        print(text)
        print(type(text))

        # Words together with their position information
        word = p.extract_words()
        print(word)
        print(len(word))
        print(type(word))

        # Text of each word
        for unitword in word:
            print(unitword['text'])

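If you just need to grab all the text quickly, .extract_text() is the obvious choice. .extract_words() also returns position information, which can serve as a cleaning condition or a filter range when batch-processing PDF files. Each has its strengths, so give them a try yourself~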
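As a sketch of using the position information as a filter, the function below keeps only the words that fall inside a vertical band of the page, for example to drop headers and footers. The function name words_in_band, the band limits, and the sample dictionaries are hypothetical. The dictionaries only mimic the keys described above (x0, x1, top, bottom, text), with top and bottom measured from the top of the page.

```python
def words_in_band(words, top_min, top_max):
    """Keep the word dicts that lie entirely between top_min and top_max."""
    return [
        w for w in words
        if w["top"] >= top_min and w["bottom"] <= top_max
    ]


# Hypothetical output in the shape returned by extract_words()
sample_words = [
    {"text": "Header", "x0": 50, "x1": 120, "top": 20, "bottom": 32},
    {"text": "Body line 1", "x0": 50, "x1": 300, "top": 100, "bottom": 112},
    {"text": "Body line 2", "x0": 50, "x1": 280, "top": 120, "bottom": 132},
    {"text": "Page 1", "x0": 280, "x1": 320, "top": 800, "bottom": 812},
]

body = words_in_band(sample_words, top_min=50, top_max=780)
print([w["text"] for w in body])
# ['Body line 1', 'Body line 2']
```

With a real document you would pass p.extract_words() in place of sample_words and pick the limits from p.height.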