Let Me Show You How to Work with PDF Documents in Python!

Author: 编程新视野 | Published 2019-01-29 13:29


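Python is a high-level interpreted language with a relatively simple syntax, so it is approachable even for people with no programming experience. Its rich library ecosystem lets you get more done with less effort.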

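When processing text, we often need to extract information from Word and PDF documents. PDF is one of the most important and widely used digital formats for presenting and exchanging documents. A PDF can contain useful information, links and buttons, form fields, audio, video, and business logic. Python libraries integrate well with unstructured data sources like these and provide tools for processing them. Once you have extracted the useful information from a PDF with Python, you can feed that data into any machine learning or natural language processing model.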

Common Python libraries

Here are some Python libraries that can be used to work with PDF files:

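PDFMiner: a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data.

PyPDF2: a pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs as well as merge entire files together.

Tabula-py: a simple Python wrapper for tabula-java that reads tables in PDFs. You can read tables from a PDF and convert them into pandas DataFrames. tabula-py also lets you convert a PDF file into a CSV, TSV, or JSON file.

Slate: a wrapper implementation of PDFMiner.

PDFQuery: a lightweight wrapper around pdfminer, lxml, and pyquery. It is designed to reliably extract data from sets of PDFs with as little code as possible.

xpdf: a Python wrapper for xpdf (currently just the "pdftotext" utility).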

Extracting text from a PDF

To extract simple text from a PDF with PyPDF2, use code like the following:

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232);">import PyPDF2

# PDF file object, opened in binary mode
# You can find the PDF file together with the complete code below
pdfFileObj = open('example.pdf', 'rb')

# PDF reader object
# (PyPDF2 3.0+ renames this API: PdfReader, len(reader.pages),
#  reader.pages[0], and extract_text())
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Number of pages in the PDF
print(pdfReader.numPages)

# A page object for the first page (pages are zero-indexed)
pageObj = pdfReader.getPage(0)

# Extract the text from the page.
# This prints the text; you can also save it into a string
print(pageObj.extractText())

# Close the file when done
pdfFileObj.close()

</pre>
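
The example above reads only the first page. As a minimal sketch, assuming the same PyPDF2 reader API used in that example (`numPages`, `getPage`, `extractText`), the loop below collects the text of every page. The helper names `extract_all_text` and `extract_from_file` are hypothetical and chosen only for illustration.

```python
def extract_all_text(reader):
    """Return the text of every page, joined by newlines.

    `reader` is assumed to expose the reader API used in the example
    above: a `numPages` attribute and a `getPage(i)` method whose
    result has an `extractText()` method.
    """
    pages = []
    for i in range(reader.numPages):
        pages.append(reader.getPage(i).extractText())
    return "\n".join(pages)


def extract_from_file(path):
    """Open `path` and extract the text of all its pages (hypothetical helper)."""
    import PyPDF2  # imported lazily so extract_all_text has no hard dependency

    # The with-block closes the file even if extraction raises
    with open(path, "rb") as pdf_file:
        return extract_all_text(PyPDF2.PdfFileReader(pdf_file))


# Usage (requires a PyPDF2 release that still provides PdfFileReader):
# text = extract_from_file("example.pdf")
```

Keeping the loop separate from the file handling makes the page-walking logic easy to try with any object that has the same three members.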

Reading table data from a PDF

To work with table data in a PDF, we can use Tabula-py. Example code:

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232);">import tabula

# Read a PDF file that contains table data
# You can find the PDF file together with the complete code below
# read_pdf loads the PDF table into a pandas DataFrame
# (newer tabula-py releases return a list of DataFrames; use df[0] there)
df = tabula.read_pdf("offense.pdf")

# Print the first 5 rows of the table
print(df.head())

</pre>

If your PDF file contains multiple tables, set the following option:

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232);">df = tabula.read_pdf("crime.pdf", multiple_tables=True)

</pre>
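
Depending on the tabula-py version and the `multiple_tables` setting, `read_pdf` may hand back either a single DataFrame or a list of DataFrames. The following is a small sketch of one way to treat both cases uniformly; `as_table_list` is a hypothetical helper, not part of tabula-py.

```python
def as_table_list(result):
    """Normalize a read_pdf result to a list of tables.

    A list (or tuple) is returned as a list, a single table is
    wrapped in a one-element list, and None becomes an empty list.
    """
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


# Usage (requires tabula-py and a Java runtime):
# import tabula
# tables = as_table_list(tabula.read_pdf("crime.pdf", multiple_tables=True))
# for table in tables:
#     print(table.head())
```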

You can also extract information from a specific region of any particular PDF page:

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232);"># area is (top, left, bottom, right)
tabula.read_pdf("crime.pdf", area=(126, 149, 212, 462), pages=1)

</pre>
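
The `area` argument is given as (top, left, bottom, right), measured in PDF points from the top-left corner of the page, with 72 points to an inch. As a sketch, the hypothetical helper `area_from_inches` below converts a region measured in inches into that tuple; the name and the rounding to two decimals are illustrative choices, not anything tabula-py requires.

```python
POINTS_PER_INCH = 72  # PDF unit: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Convert a region in inches to a (top, left, bottom, right) tuple in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return tuple(
        round(value * POINTS_PER_INCH, 2)
        for value in (top, left, bottom, right)
    )


# Usage (requires tabula-py and a Java runtime):
# import tabula
# region = area_from_inches(1.75, 2.0, 3.0, 6.5)
# tabula.read_pdf("crime.pdf", area=region, pages=1)
```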

To get the output in JSON format:

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232);">tabula.read_pdf("crime.pdf", output_format="json")

</pre>
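
With JSON output, the result is plain Python data rather than a DataFrame. The sketch below assumes the layout that tabula-java is understood to emit: a list of tables whose `"data"` field holds rows of cells, each cell carrying a `"text"` key. Treat that layout as an assumption and check it against your own output. `rows_from_json` is a hypothetical helper.

```python
def rows_from_json(tables):
    """Flatten tabula-style JSON tables into lists of cell-text rows.

    Assumes each table is a dict with a "data" list of rows, and each
    row is a list of cell dicts with a "text" key. Missing keys fall
    back to empty values instead of raising.
    """
    result = []
    for table in tables:
        rows = [
            [cell.get("text", "") for cell in row]
            for row in table.get("data", [])
        ]
        result.append(rows)
    return result


# Usage (requires tabula-py and a Java runtime):
# import tabula
# tables = tabula.read_pdf("crime.pdf", output_format="json")
# for rows in rows_from_json(tables):
#     print(rows[:5])
```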

Exporting a PDF to Excel

Use the following code to convert PDF data to Excel or CSV:

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232);"># convert_into writes csv, tsv, or json (not xlsx); Excel can open the CSV
tabula.convert_into("crime.pdf", "crime_testing.csv", output_format="csv")

</pre>
