Extracting images from a PDF with python3, without third-party libraries

Author: SystemLight | Published 2020-11-21 13:16


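def extract_jpg_from_pdf(path):
    """Scan the raw bytes of a PDF and save every embedded JPEG as pic_N.jpg."""
    # Read the whole file into memory; the context manager closes the handle
    with open(path, "rb") as pdf_file:
        pdf = pdf_file.read()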

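    # JPEG data begins with the SOI marker FF D8 and ends with the EOI marker FF D9
    start_mark = b"\xff\xd8"
    start_fix = 0  # offset added to the start index (the marker is kept)
    end_mark = b"\xff\xd9"
    end_fix = 2  # include the two bytes of the end marker in the slice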

    i = 0
    n_jpg = 0

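    while True:
        # Find the next "stream" keyword, starting from the current position
        is_stream = pdf.find(b"stream", i)
        if is_stream < 0:
            break

        # A JPEG stream has its start marker within 20 bytes of the keyword;
        # otherwise this stream is not a JPEG, so skip ahead and keep searching
        is_start = pdf.find(start_mark, is_stream, is_stream + 20)
        if is_start < 0:
            i = is_stream + 20
            continue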

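        # Locate the end of the stream, then look for the JPEG end marker
        # in the 20 bytes just before "endstream"
        is_end = pdf.find(b"endstream", is_start)
        if is_end < 0:
            raise ValueError("Didn't find end of stream!")
        is_end = pdf.find(end_mark, is_end - 20)
        if is_end < 0:
            raise ValueError("Didn't find end of JPG!")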

        is_start += start_fix
        is_end += end_fix

        print("JPG %d from %d to %d" % (n_jpg, is_start, is_end))
        jpg = pdf[is_start:is_end]

        print("Extracting image pic_%d.jpg" % n_jpg)
        with open("pic_%d.jpg" % n_jpg, "wb") as jpg_file:
            jpg_file.write(jpg)

        n_jpg += 1
        i = is_end


if __name__ == '__main__':
    extract_jpg_from_pdf("./data/a.pdf")
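
To try the marker-scanning idea without a real PDF on disk, here is a minimal sketch that applies the same search to an in-memory byte string and yields the images instead of writing files. The name iter_jpgs and the synthetic fake_pdf data are illustrative assumptions, not part of the original script, and unlike the original this variant skips a malformed stream rather than raising. Like the original, it only finds streams that hold raw JPEG data, so images stored in other encodings are missed.

```python
def iter_jpgs(data: bytes):
    """Yield the bytes of each JPEG found right after a `stream` keyword."""
    pos = 0
    while True:
        stream_at = data.find(b"stream", pos)
        if stream_at < 0:
            return
        # The JPEG start marker must appear within 20 bytes of "stream"
        start = data.find(b"\xff\xd8", stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + 20
            continue
        end_stream = data.find(b"endstream", start)
        if end_stream < 0:
            return
        # Look for the JPEG end marker in the last 20 bytes before "endstream"
        end = data.find(b"\xff\xd9", max(end_stream - 20, start), end_stream)
        if end < 0:
            pos = end_stream
            continue
        yield data[start:end + 2]
        pos = end + 2


# Synthetic data (hypothetical): one stream holds a fake JPEG, one holds text
fake_jpg = b"\xff\xd8" + b"image-bytes" + b"\xff\xd9"
fake_pdf = (b"1 0 obj\nstream\n" + fake_jpg + b"\nendstream\nendobj\n"
            b"2 0 obj\nstream\nplain text\nendstream\nendobj\n")
images = list(iter_jpgs(fake_pdf))
print(len(images), images[0] == fake_jpg)
```

Returning bytes rather than writing files keeps the scanning logic separate from the output step, so the caller can decide where and how to save each image.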

If third-party libraries are acceptable, PyMuPDF (pymupdf) makes the extraction straightforward.
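
For comparison, a rough sketch of the PyMuPDF route might look like the following. It assumes PyMuPDF is installed (pip install pymupdf) and a reasonably recent release in which the page method is named get_images; the function name extract_images_with_pymupdf and the prefix parameter are made up for illustration. Treat it as a starting point rather than the definitive way to use the library.

```python
def extract_images_with_pymupdf(path, prefix="pic"):
    """Save every embedded image of the PDF at `path`; return the file names."""
    import fitz  # PyMuPDF; imported lazily so this file loads without it

    saved = []
    seen = set()  # an image shared by several pages is saved only once
    with fitz.open(path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple item is the image's xref number
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)  # dict with "image" and "ext"
                name = "%s_%d.%s" % (prefix, len(saved), info["ext"])
                with open(name, "wb") as out:
                    out.write(info["image"])
                saved.append(name)
    return saved
```

Calling extract_images_with_pymupdf("./data/a.pdf") would write files such as pic_0.png or pic_1.jpeg, with the extension taken from what the library reports for each image, so this approach is not limited to JPEG streams.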