Paddle Model Deployment

Author: Mr_Michael | Published: 2023-05-23 10:21

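I. Introduction

PaddlePaddle (飞桨) is a deep learning platform built on Baidu's deep learning research and business applications. It combines a core training and inference framework, a base model library, end-to-end development kits, and a rich set of tool components.

1. Model libraries

PaddleHub aims to give developers a rich collection of high-quality, ready-to-use pretrained models. PaddlePaddle's industrial-grade model libraries include PaddleClas, PaddleDetection, PaddleSeg, PaddleOCR, PaddleGAN, PaddleVideo, and PaddleNLP.

Model library downloads

1) PaddleDetection

PaddleDetection is an end-to-end object detection development kit built on PaddlePaddle. Alongside a rich set of model components and benchmarks, it focuses on end-to-end industrial deployment, helping developers connect data preparation, model selection, model training, and model deployment into a single workflow.

It not only reproduces common object detection models but also applies a series of in-depth optimizations to them, including image augmentation, backbone optimization, DropBlock, IoU Loss, and IoU-Aware. It also has built-in model compression, with one-step scripts for pruning, distillation, and quantization that substantially improve model accuracy and speed and reduce model size.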

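Advantages:

Rich model selection: PaddleDetection provides 100+ pretrained models and 10+ algorithms covering object detection, instance segmentation, face detection, and more. It keeps releasing enhanced models for server, mobile, and embedded targets, includes winning solutions from several dataset competitions, and offers detection solutions suited to cloud and edge deployment.
High flexibility: PaddleDetection uses a modular design to decouple its components. Both the network structure and the data processing are customizable, and detection models can be assembled easily from configuration files.
Easy deployment: the core operators used by PaddleDetection models are implemented in C++ or CUDA, and a cross-platform inference engine connects training to deployment seamlessly. One-step compression and deployment are built in.
- For low-compute devices, it offers SSDLite and its quantized model. With model enhancements, SSDLite reaches an inference latency of 41 ms on the Snapdragon 845.
- For scenarios that need to balance speed and accuracy, it offers a pruned and distilled YOLOv3 model that reaches an accuracy of about 25 on COCO at an inference latency of about 100 ms.
- For high-compute devices, it offers the Cascade Faster RCNN model, which reaches an accuracy of up to 30.2 on COCO.
High performance: efficient training and inference built on the high-performance kernels of the PaddlePaddle framework.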

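2) PaddleClas

PaddleClas is PaddlePaddle's toolkit for image recognition and image classification, aimed at both industry and academia. It supports a range of state-of-the-art classification and recognition algorithms and ships the industrial-grade backbones PP-HGNet, PP-LCNetv2, and PP-LCNet, as well as the SSLD semi-supervised knowledge distillation scheme.

Model	Use case
PULC ultra-lightweight image classification solution	Classification with a fixed set of image classes
PP-ShituV2 lightweight image recognition system	Scenarios where data classes change frequently or the number of classes is large
PP-LCNet lightweight backbone	Tailored to Intel CPUs and the MKLDNN acceleration library
PP-LCNetV2 lightweight backbone	Targets Intel CPUs, adapted to OpenVINO
PP-HGNet high-accuracy backbone	Higher accuracy at the same inference time on GPUs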

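3) PaddleOCR

PaddleOCR aims to be a rich, leading, and practical OCR toolkit. The latest open-source ultra-lightweight PP-OCRv3 model is only 16.2M in size. It recognizes both Chinese and English, handles text in multiple orientations such as slanted and vertical text, and supports inference on GPU and CPU.

Model	Model name	Recommended scenario
Chinese and English ultra-lightweight PP-OCRv3 model (16.2M)	ch_PP-OCRv3_xx	Mobile & server
English ultra-lightweight PP-OCRv3 model (13.4M)	en_PP-OCRv3_xx	Mobile & server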

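2. Inference deployment

1) Server deployment

Solution	Hardware	API languages	Model support	Use case
Paddle Inference	Server (CPU, GPU)	C++, Python, C, Go, etc.	All PaddlePaddle models	Suited to direct integration. The Python API gives quick support where performance requirements are modest, the high-performance C++ interface can be compiled together with online systems, and the basic C API allows extension to production environments in other languages.
Paddle Serving	Server (CPU, GPU)	C++, Python, Go, etc.		Suited to scenarios where inference runs as a remote service: the client sends a request and the server returns the inference result. Multi-machine deployment is supported.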

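Paddle Inference features

Memory/GPU memory reuse to raise service throughput

During inference initialization, dependency analysis is run on the output Tensors of the OPs in the model, and Tensors that do not depend on each other share the same memory/GPU memory space. This increases the amount of computation that can run in parallel and raises service throughput.

Fine-grained horizontal and vertical OP fusion to reduce computation

During inference initialization, multiple OPs in the model are fused into a single OP according to existing fusion patterns. This reduces both the amount of computation and the number of kernel launches, which improves inference performance. Paddle Inference currently supports several dozen fusion patterns.

Built-in high-performance CPU/GPU kernels

High-performance kernels developed together with Intel and Nvidia are built in to keep model inference fast.

Subgraph integration of TensorRT to speed up GPU inference

Paddle Inference integrates TensorRT at the subgraph level. For GPU inference, TensorRT can optimize some subgraphs, including horizontal and vertical OP fusion, removal of redundant OPs, and automatic selection of the best kernel for each OP, which speeds up inference.

Subgraph integration of the Paddle Lite lightweight inference engine

Paddle Lite is PaddlePaddle's lightweight, low-overhead inference engine. Besides mobile applications, it can also run on servers. Paddle Inference integrates Paddle Lite at the subgraph level, so users can enable Paddle Lite with small changes to their existing server inference code and get faster inference.

Support for loading models quantized and compressed with PaddleSlim

PaddleSlim is PaddlePaddle's model compression tool. Paddle Inference works with PaddleSlim and can load and deploy quantized, pruned, and distilled models, which reduces model storage, lowers memory use during computation, and speeds up inference. For quantized models, Paddle Inference is deeply optimized on X86 CPUs: single-thread performance of common classification models improves by nearly 3x, and single-thread performance of the ERNIE model improves by 2.68x.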
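
To make the memory-reuse feature above more concrete, here is a minimal sketch of lifetime-based buffer sharing. It is not Paddle Inference's actual algorithm: the function name `plan_buffer_reuse`, the graph encoding, and the assumption that all tensors have the same size are illustrative only. The idea is that once a tensor has been consumed for the last time, its buffer can be handed to a later OP's output.

```python
def plan_buffer_reuse(ops):
    """Assign a buffer id to each OP output, reusing buffers of dead tensors.

    ops: list of (output_name, [input_names]) in execution order.
    Assumption for this sketch: every tensor has the same size.
    Returns (assignment dict, number of buffers allocated).
    """
    # Record the index of the last OP that reads each tensor.
    last_use = {}
    for idx, (_, inputs) in enumerate(ops):
        for name in inputs:
            last_use[name] = idx

    assignment = {}
    free_buffers = []
    next_id = 0
    for idx, (out, inputs) in enumerate(ops):
        # Allocate the output first so it never aliases one of its own inputs.
        if free_buffers:
            assignment[out] = free_buffers.pop()
        else:
            assignment[out] = next_id
            next_id += 1
        # Release inputs that are not read by any later OP.
        for name in dict.fromkeys(inputs):
            if name in assignment and last_use.get(name) == idx:
                free_buffers.append(assignment[name])
    return assignment, next_id


# A four-OP chain: each tensor is read only by the next OP.
chain = [("a", ["x"]), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
assignment, buffer_count = plan_buffer_reuse(chain)
print(assignment, buffer_count)
```

In the chain, four intermediate tensors fit into two buffers; a tensor that is read again later (for example by a residual connection) keeps its buffer alive for longer and raises the count.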

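Paddle Inference deployment workflow

1. Export the model files.
- For dynamic graph models, use the paddle.jit.save API to export standardized model files for deployment.
- For static graph models, use paddle.static.save_inference_model to save the model.
2. Configure the inference options. Config is the configuration manager API provided by PaddlePaddle. When deploying with Paddle Inference, Config is used to set the inference engine parameters in detail, including (but not limited to) the target device (CPU/GPU), the model path, whether computation graph analysis and optimization is enabled, and whether MKLDNN/TensorRT acceleration is used. The specific settings depend on your requirements.
3. Create the Predictor. Predictor is the inference engine API provided by PaddlePaddle. A Predictor, that is, an instance of the inference engine, is created from the Config. Model loading, analysis, and optimization happen during creation.
4. Prepare the input data. First get the name of each model input and a pointer to the corresponding data block (on CPU or GPU), then copy each data block into the Tensor with the matching name. PaddlePaddle uses Tensor as the input/output data structure, which avoids extra copies and improves inference performance.
5. Call Predictor.Run() to run inference.
6. Fetch the outputs. As with the inputs, use each output name to copy the output data (matrices or vectors) from the Tensor to CPU or GPU memory for further processing.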
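
The workflow above can be sketched with the `paddle.inference` Python API roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `build_predictor` and `run_inference`, the file names, and the 256 MB initial GPU memory pool are assumptions, and `paddle` is imported lazily so the helpers can be defined on a machine without PaddlePaddle installed.

```python
def build_predictor(model_file, params_file, use_gpu=False):
    """Steps 2-3: configure the engine and create a Predictor."""
    # Lazy import: only needed when a real predictor is built.
    from paddle.inference import Config, create_predictor

    config = Config(model_file, params_file)
    if use_gpu:
        config.enable_use_gpu(256, 0)  # initial memory pool in MB, GPU device id
    else:
        config.disable_gpu()
        config.enable_mkldnn()  # MKLDNN acceleration on CPU
    return create_predictor(config)


def run_inference(predictor, feeds):
    """Steps 4-6: copy inputs in, run, and copy outputs back.

    feeds: dict mapping input name -> numpy array.
    Returns a dict mapping output name -> array.
    """
    for name in predictor.get_input_names():
        predictor.get_input_handle(name).copy_from_cpu(feeds[name])
    predictor.run()
    return {
        name: predictor.get_output_handle(name).copy_to_cpu()
        for name in predictor.get_output_names()
    }
```

With an exported model, usage would look like `predictor = build_predictor("model.pdmodel", "model.pdiparams")` followed by `run_inference(predictor, {"x": data})`, where `"x"` stands for whatever input name the model actually declares.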

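Paddle Serving features

Integrates the high-performance server-side inference engine Paddle Inference and the on-device engine Paddle Lite. Models from other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated with the x2paddle tool.
Offers two frameworks: high-performance C++ Serving and easy-to-use Python Pipeline.
- C++ Serving builds high-throughput, low-latency inference services on the high-performance bRPC network framework.
- Python Pipeline builds an easy-to-use, high-throughput inference service framework on gRPC/gRPC-Gateway and the Python language.
Supports multiple protocols, including HTTP, gRPC, and bRPC, and provides SDKs for C++, Python, and Java.
Designs and implements a high-performance asynchronous pipeline inference framework based on a directed acyclic graph (DAG), with multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, multi-GPU multi-stream inference, and request caching.
Runs on a range of hardware: x86 (Intel) CPU, ARM CPU, Nvidia GPU, Kunlun XPU, Huawei Ascend 310/910, Hygon DCU, Nvidia Jetson, and more.
Integrates the Intel MKLDNN and Nvidia TensorRT acceleration libraries, as well as low-precision quantized inference.
Provides a secure model deployment solution, including encrypted model deployment, authentication, and an HTTPS security gateway, which is already used in real projects.
Supports cloud deployment, with Paddle Serving deployment on Baidu AI Cloud (百度智能云) Kubernetes clusters.
Supports distributed deployment of large-scale sparse-parameter index models, with multiple tables, multiple shards, multiple replicas, and a local high-frequency cache. It can be deployed on a single machine or in the cloud.
Supports service monitoring, with Prometheus-based performance statistics and port access.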

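2) On-device deployment

Solution	Hardware	API languages	Use case
Paddle Lite	A wide range of mobile and embedded hardware	C++, Python, Java, etc.	High-performance, lightweight deployment on on-device hardware such as mobile and embedded chips.

Paddle Lite is PaddlePaddle's self-developed, new-generation on-device inference framework. It supports inference deployment of PaddlePaddle/TensorFlow/Caffe/ONNX models and already runs on ARM CPU, Mali GPU, Adreno GPU, Huawei NPU, and other hardware. Support for X86 CPU, Nvidia GPU, and others is being added gradually.

Paddle Lite inference workflow

Get a model:
- A model trained with PaddlePaddle can be deployed directly.
- A model trained with Caffe, TensorFlow, or ONNX can also be used, but it must first be converted to the Paddle format with the X2Paddle tool.
(Optional) Compress the model: mainly to reduce model size, using pruning, quantization, and other techniques from PaddleSlim so the model can run on the device.
Convert the model to Paddle Lite's nb format with the Model Optimize Tool, then start deployment.
On the device, call the APIs provided by Paddle Lite (C++, Java, Python, and others) to carry out all inference computation.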

Models supported by Paddle Lite

Category	Subcategory	Model	Supported platforms
CV	Classification	MobileNetV1	ARM, X86, NPU, RKNPU, APU
CV	Classification	MobileNetV2	ARM, X86, NPU
CV	Classification	ResNet18	ARM, NPU
CV	Classification	ResNet50	ARM, X86, NPU, XPU
CV	Classification	MnasNet	ARM, NPU
CV	Classification	EfficientNet*	ARM
CV	Classification	SqueezeNet	ARM, NPU
CV	Classification	ShufflenetV2*	ARM
CV	Classification	ShuffleNet	ARM
CV	Classification	InceptionV4	ARM, X86, NPU
CV	Classification	VGG16	ARM
CV	Classification	VGG19	XPU
CV	Classification	GoogleNet	ARM, X86, XPU
CV	Detection	MobileNet-SSD	ARM, NPU*
CV	Detection	YOLOv3-MobileNetV3	ARM, NPU*
CV	Detection	Faster RCNN	ARM
CV	Detection	Mask RCNN*	ARM
CV	Segmentation	Deeplabv3	ARM
CV	Segmentation	UNet	ARM
CV	Face	FaceDetection	ARM
CV	Face	FaceBoxes*	ARM
CV	Face	BlazeFace*	ARM
CV	Face	MTCNN	ARM
CV	OCR	OCR-Attention	ARM
CV	GAN	CycleGAN*	NPU
NLP	Machine translation	Transformer*	ARM, NPU*
NLP	Semantic representation	BERT	XPU
NLP	Semantic representation	ERNIE	XPU

II. Installing PaddlePaddle

GPU architectures supported by PaddlePaddle

Quick install

1. Install with pip

# Choose the package that matches your CUDA version

# CUDA 11.2
python -m pip install paddlepaddle-gpu==2.4.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# CPU
python -m pip install paddlepaddle==2.4.2 -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Install with Docker

docker hub paddlepaddle/paddle

# Pull an image with PaddlePaddle preinstalled
$ sudo docker pull registry.baidubce.com/paddlepaddle/paddle:2.4.2-gpu-cuda11.7-cudnn8.4-trt8.4
# $ sudo docker pull registry.baidubce.com/paddlepaddle/paddle:2.4.2-gpu-cuda11.2-cudnn8.2-trt8.0

# Start and enter the Docker container
$ nvidia-docker run --name paddle -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.4.2-gpu-cuda11.7-cudnn8.4-trt8.4 /bin/bash

# Test paddle
$ paddle --version
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
PaddlePaddle 2.4.2.post117, compiled with
    with_avx: ON
    with_gpu: ON
    with_mkl: ON
    with_mkldnn: ON
    with_python: ON

# Install paddle2onnx inside the container
$ pip install paddle2onnx -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
$ paddle2onnx --version
[INFO]  paddle2onnx-1.0.6 with python>=3.6, paddlepaddle>=2.0.0

# Install paddleslim inside the container
$ pip install paddleslim  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
$ pip list |grep paddleslim
paddleslim          2.4.1

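III. The FastDeploy model deployment SDK

As AI develops, new algorithms and models keep appearing, and so do new AI hardware chips. Putting an algorithm into production also means dealing with different scenarios (server deployment, deployment as a service, embedded deployment, mobile deployment, and so on), different operating systems (Linux, Windows, Android, iOS, and so on), and different programming languages (Python, C++, and so on).

To address these deployment difficulties, Baidu released FastDeploy, a new-generation inference deployment tool aimed at industrial practice. It is an all-scenario, easy-to-use, flexible, and highly efficient AI inference deployment tool that supports cloud, mobile, and edge deployment.

Features

Easy to use
- Unified deployment API across languages
- Many popular models built in
- A variety of end-to-end deployment demos
All scenarios
- Deployment through multiple inference engines
  - The native inference library Paddle Inference, the lightweight inference engine Paddle Lite, and the front-end inference engine Paddle js
  - TensorRT, OpenVINO, ONNX Runtime
  - RKNN Toolkit, Poros, and others
- Multi-framework support with conversion between model formats
  - FastDeploy has the X2Paddle and Paddle2ONNX model conversion tools built in.
- Multi-hardware adaptation for fast cross-platform deployment
  - Hardware adaptation has been completed with vendors including Intel, NVIDIA, Rockchip, VeriSilicon, Graphcore, Kunlunxin, Phytium, Sophgo, and Ascend.
Highly efficient
- Joint software and hardware automatic compression and optimization to reduce deployment resource usage
  - The PaddleSlim model quantization and compression tool is built in.
- End-to-end pre- and post-processing optimization to reduce deployment resource usage
  - On CPU, preprocessing operations are fused to reduce memory allocation, copying, and computation during data preprocessing.
  - On GPU, CV-CUDA preprocessing operators are used for optimization.
  - On mobile, the high-performance image preprocessing library FlyCV is used, which significantly speeds up image preprocessing.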

Inference backends and capabilities

1. Build and install

https://github.com/PaddlePaddle/FastDeploy/tree/develop/docs/cn/build_and_install

1) Installing the prebuilt Python package

fastdeploy whl packages

# Requirements
CUDA >= 11.2
cuDNN >= 8.0
python >= 3.6

# GPU version
pip install numpy opencv-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html
# CPU version
pip install numpy opencv-python fastdeploy-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html

2) Building the Python SDK from source

# Requirements
gcc/g++ >= 5.4 (8.2 recommended)
cmake >= 3.18.0
python >= 3.6
cuda >= 11.2
cudnn >= 8.2

git clone https://github.com/PaddlePaddle/FastDeploy.git
cd FastDeploy/python
export ENABLE_ORT_BACKEND=ON
export ENABLE_PADDLE_BACKEND=ON
export ENABLE_OPENVINO_BACKEND=ON
export ENABLE_VISION=ON
export ENABLE_TEXT=ON
export ENABLE_TRT_BACKEND=ON
export WITH_GPU=ON
export TRT_DIRECTORY=/Paddle/TensorRT-8.4.1.5
export CUDA_DIRECTORY=/usr/local/cuda
# OPENCV_DIRECTORY is optional; if it is not set, the build downloads FastDeploy's prebuilt OpenCV library
export OPENCV_DIRECTORY=/usr/lib/x86_64-linux-gnu/cmake/opencv4

python setup.py build
python setup.py bdist_wheel

3) Installing the prebuilt C++ library

# Built with g++ 8.2, CUDA 11.2, cuDNN 8.2
$ wget https://bj.bcebos.com/fastdeploy/release/cpp/fastdeploy-linux-x64-gpu-1.0.6.tgz
$ tar -zxvf fastdeploy-linux-x64-gpu-1.0.6.tgz

$ tree -L 3 fastdeploy-linux-x64-gpu-1.0.6
├── FastDeploy.cmake
├── FastDeployConfig.cmake
├── FastDeployCSharp.cmake
├── fastdeploy_init.sh
├── include # header files
│   ├── fastdeploy
│   │   ├── benchmark
│   │   ├── core
│   │   ├── encryption
│   │   │   ├── include
│   │   │   │   ├── decrypt.h
│   │   │   │   ├── encrypt.h
│   │   │   │   └── model_code.h
│   │   │   ├── src
│   │   │   └── util
│   │   ├── encryption.h
│   │   ├── fastdeploy_model.h
│   │   ├── function
│   │   │   ├── cast.h
│   │   │   ├── concat.h
│   │   │   ├── cuda_cast.h
│   │   │   ├── cumprod.h
│   │   │   ├── eigen.h
│   │   │   ├── ....
│   │   │   └── transpose.h
│   │   ├── pipeline
│   │   ├── pipeline.h
│   │   ├── pybind
│   │   ├── runtime
│   │   │   ├── backends
│   │   │   │   ├── backend.h
│   │   │   │   ├── common
│   │   │   │   ├── lite
│   │   │   │   ├── openvino
│   │   │   │   ├── ort
│   │   │   │   ├── paddle
│   │   │   │   ├── poros
│   │   │   │   ├── rknpu2
│   │   │   │   ├── sophgo
│   │   │   │   └── tensorrt
│   │   │   ├── enum_variables.h
│   │   │   ├── runtime.h
│   │   │   └── runtime_option.h
│   │   ├── runtime.h
│   │   ├── text
│   │   ├── text.h
│   │   ├── utils
│   │   ├── vision
│   │   │   ├── classification
│   │   │   ├── common
│   │   │   ├── detection
│   │   │   ├── facealign
│   │   │   ├── facedet
│   │   │   ├── faceid
│   │   │   ├── generation
│   │   │   ├── headpose
│   │   │   ├── keypointdet
│   │   │   ├── matting
│   │   │   ├── ocr
│   │   │   ├── segmentation
│   │   │   ├── sr
│   │   │   ├── tracking
│   │   │   ├── utils
│   │   │   └── visualize
│   │   └── vision.h
│   ├── fastdeploy_capi
│   │   ├── core
│   │   ├── internal
│   │   ├── runtime
│   │   ├── vision
│   │   └── vision.h
│   └── onnx
│       ├── backend
│       ├── bin
│       ├── checker.h
│       ├── common
│       ├── defs
│       ├── examples
│       ├── frontend
│       ├── ......
│       ├── test
│       ├── tools
│       └── version_converter
├── lib # FastDeploy shared libraries (prebuilt)
│   ├── libfastdeploy.so -> libfastdeploy.so.1.0.6
│   ├── libfastdeploy.so.1.0.6
│   └── libonnxifi.so
├── lib64
│   ├── cmake
│   │   └── ONNX
│   ├── libonnx.a
│   ├── libonnxifi_dummy.so
│   ├── libonnxifi_loader.a
│   └── libonnx_proto.a
├── LICENSE
├── openmp.cmake
├── summary.cmake
├── third_libs  # third-party dependencies (prebuilt)
│   └── install
│       ├── fast_tokenizer
│       │   ├── commit.log
│       │   ├── FastTokenizer.cmake
│       │   ├── include
│       │   ├── lib
│       │   └── third_party
│       ├── onnxruntime     # inference backend
│       │   ├── include
│       │   └── lib
│       ├── opencv
│       │   ├── bin
│       │   ├── include
│       │   ├── lib64
│       │   └── share
│       ├── openvino        # inference backend
│       │   └── runtime
│       │       ├── 3rdparty
│       │       ├── cmake
│       │       ├── include
│       │       └── lib
│       ├── paddle_inference    # server-side inference for Paddle models
│       │   ├── CMakeCache.txt
│       │   ├── paddle
│       │   │   ├── extension.h
│       │   │   ├── include
│       │   │   │   ├── crypto
│       │   │   │   ├── paddle_api.h
│       │   │   │   ├── paddle_inference_api.h
│       │   │   │   ├── ......
│       │   │   └── lib
│       │   │       └── libpaddle_inference.so
│       │   ├── third_party # third-party libraries required by paddle_inference (prebuilt)
│       │   │   ├── externalError
│       │   │   │   └── data
│       │   │   ├── install
│       │   │   │   ├── cryptopp
│       │   │   │   ├── gflags
│       │   │   │   ├── glog
│       │   │   │   ├── mkldnn
│       │   │   │   ├── mklml
│       │   │   │   ├── protobuf
│       │   │   │   ├── utf8proc
│       │   │   │   └── xxhash
│       │   │   └── threadpool
│       │   │       └── ThreadPool.h
│       │   └── version.txt
│       └── tensorrt        # inference backend
│           └── lib
├── ThirdPartyNotices.txt
├── utils
│   └── gflags.cmake
├── utils.cmake
└── VERSION_NUMBER

4) Building the C++ SDK from source

# Prerequisites
- gcc/g++ >= 5.4 (8.2 recommended)
- cmake >= 3.18.0
- cuda >= 11.2
- cudnn >= 8.2

sudo apt-get install libopencv-dev

git clone https://github.com/PaddlePaddle/FastDeploy.git
cd FastDeploy
mkdir build && cd build
cmake .. -DENABLE_ORT_BACKEND=ON \
         -DENABLE_PADDLE_BACKEND=ON \
         -DENABLE_OPENVINO_BACKEND=ON \
         -DENABLE_TRT_BACKEND=ON \
         -DWITH_GPU=ON \
         -DTRT_DIRECTORY=/Paddle/TensorRT-8.4.1.5 \
         -DCUDA_DIRECTORY=/usr/local/cuda \
         -DCMAKE_INSTALL_PREFIX=${PWD}/compiled_fastdeploy_sdk \
         -DENABLE_VISION=ON \
         -DOPENCV_DIRECTORY=/usr/lib/x86_64-linux-gnu/cmake/opencv4 \
         -DENABLE_TEXT=ON
make -j12
make install

# Run cpack to generate a .deb package (optional)
cpack -G DEB
# Install .deb package
dpkg -i xxx.deb

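5) Build options

Option	Description
ENABLE_ORT_BACKEND	Default OFF. Whether to build with the ONNX Runtime backend (recommended on CPU/GPU)
ENABLE_PADDLE_BACKEND	Default OFF. Whether to build with the Paddle Inference backend (recommended on CPU/GPU)
ENABLE_LITE_BACKEND	Default OFF. Whether to build with the Paddle Lite backend (must be ON when building the Android library)
ENABLE_RKNPU2_BACKEND	Default OFF. Whether to build with the RKNPU2 backend (recommended on RK3588/RK3568/RK3566)
ENABLE_SOPHGO_BACKEND	Default OFF. Whether to build with the SOPHGO backend; must be ON when deploying on a SOPHGO TPU
WITH_ASCEND	Default OFF. Must be ON when deploying on a Huawei Ascend NPU
WITH_KUNLUNXIN	Default OFF. Must be ON when deploying on a Kunlunxin XPU
WITH_TIMVX	Default OFF. Must be ON when deploying on RV1126/RV1109/A311D
ENABLE_TRT_BACKEND	Default OFF. Whether to build with the TensorRT backend (recommended on GPU)
ENABLE_OPENVINO_BACKEND	Default OFF. Whether to build with the OpenVINO backend (recommended on CPU)
ENABLE_VISION	Default OFF. Whether to build the deployment module for vision models
ENABLE_TEXT	Default OFF. Whether to build the deployment module for text (NLP) models
WITH_GPU	Default OFF. Must be ON when deploying on GPU
RKNN2_TARGET_SOC	Only used when ENABLE_RKNPU2_BACKEND is on. No default; accepted values are RK3588/RK356X. It must be set, otherwise the build fails
CUDA_DIRECTORY	Default /usr/local/cuda. Path to CUDA (>=11.2) when deploying on GPU
TRT_DIRECTORY	Path to TensorRT (>=8.4); must be set when the TensorRT backend is enabled
ORT_DIRECTORY	Path to a local ONNX Runtime library when the ONNX Runtime backend is enabled; if not set, the build downloads ONNX Runtime automatically
OPENCV_DIRECTORY	Path to a local OpenCV library when ENABLE_VISION=ON; if not set, the build downloads OpenCV automatically
OPENVINO_DIRECTORY	Path to a local OpenVINO library when the OpenVINO backend is enabled; if not set, the build downloads OpenVINO automatically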

2. Using the Python SDK

1) Common Python APIs

Python API documentation

# fastdeploy.RuntimeOption API: backend configuration
    set_model_path
    
    use_ascend
    use_cpu
        set_cpu_thread_num
    use_gpu
    
    use_paddle_infer_backend
    use_openvino_backend
    use_ort_backend
    use_trt_backend                 # Use TensorRT inference
        enable_trt_fp16
        enable_paddle_to_trt    # Use Paddle-TensorRT inference
        set_trt_cache_file      # Loading a saved cache speeds up model loading and initialization.

# fastdeploy.Runtime API: multi-backend inference engine API
    get_input_info
    get_output_info
    num_inputs
    num_outputs
    infer
    
# Vision Processor (image preprocessing library)
    fastdeploy.vision.common.processors.ResizeByShort
    fastdeploy.vision.common.processors.Resize
    fastdeploy.vision.common.processors.CenterCrop
    fastdeploy.vision.common.processors.Cast
    fastdeploy.vision.common.processors.HWC2CHW
    fastdeploy.vision.common.processors.Normalize
    fastdeploy.vision.common.processors.NormalizeAndPermute
    fastdeploy.vision.common.processors.Pad
    fastdeploy.vision.common.processors.PadToSize
    fastdeploy.vision.common.processors.StridePad

# fastdeploy vision Model API: vision task model API
    fastdeploy.vision.detection.PPYOLOE
    fastdeploy.vision.detection.YOLOv5
        preprocessor
        predict
        batch_predict
        postprocessor

2) Using inference backends from Python

Runtime Python usage examples

Deploy Paddle model with Paddle Inference(CPU/GPU), TensorRT(GPU), OpenVINO(CPU), ONNX Runtime(CPU/GPU)

Deploy ONNX model with TensorRT(GPU), OpenVINO(CPU), ONNX Runtime(CPU/GPU)

Example 1:

# Deploy Paddle model with Paddle Inference(CPU/GPU)
import fastdeploy as fd
import numpy as np

# Download and decompress the model
model_url = "https://bj.bcebos.com/fastdeploy/models/mobilenetv2.tgz"
fd.download_and_decompress(model_url)

option = fd.RuntimeOption()
option.set_model_path("mobilenetv2/inference.pdmodel",
                      "mobilenetv2/inference.pdiparams")
# **** CPU configuration ****
option.use_cpu()
option.use_paddle_infer_backend()
option.set_cpu_thread_num(12)
# **** GPU configuration ****
# To run on GPU, uncomment the line below
# option.use_gpu(0)
# **** IPU configuration ****
# To run on IPU, uncomment the line below
# option.use_ipu()

# Initialize the runtime
runtime = fd.Runtime(option)
# Get the model's input name
input_name = runtime.get_input_info(0).name

# Run inference on random data
results = runtime.infer({
    input_name: np.random.rand(1, 3, 224, 224).astype("float32")
})
print(results[0].shape)

Example 2:

#   Deploy ONNX model with TensorRT(GPU)
import fastdeploy as fd
from fastdeploy import ModelFormat
import numpy as np

# Download the model (a single .onnx file, nothing to decompress)
model_url = "https://bj.bcebos.com/fastdeploy/models/mobilenetv2.onnx"
fd.download(model_url, path=".")

option = fd.RuntimeOption()
option.set_model_path("mobilenetv2.onnx", model_format=ModelFormat.ONNX)
# **** GPU configuration ****
option.use_gpu(0)
option.use_trt_backend()

# Initialize the runtime
runtime = fd.Runtime(option)
# Get the model's input name
input_name = runtime.get_input_info(0).name

# Run inference on random data
results = runtime.infer({
    input_name: np.random.rand(1, 3, 224, 224).astype("float32")
})
print(results[0].shape)

3) Inference with built-in vision models

FastDeploy/examples/vision

Example 1: yolov5

import fastdeploy as fd
import cv2

# Configure the runtime and load the model
option = fd.RuntimeOption()
option.use_gpu(0)  # the TensorRT backend runs on the GPU
option.use_trt_backend()
option.set_trt_input_shape("images", [1, 3, 640, 640])
model = fd.vision.detection.YOLOv5(
    "yolov5-model.pdmodel",
    "yolov5-model.pdiparams",
    runtime_option=option,
    model_format=fd.ModelFormat.PADDLE)

# Run detection on an image
image = "test.jpg"  # placeholder path; replace with your own image
im = cv2.imread(image)
result = model.predict(im)
print(result)

# Visualize the detection result
vis_im = fd.vision.vis_detection(im, result)
cv2.imwrite("visualized_result.jpg", vis_im)

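3. Using the C++ SDK

1) C++ API

C++ API documentation

2) Using inference backends from C++

Runtime C++ usage examples

Example 1: Deploy Paddle model with Paddle Inference(CPU/GPU)

#include "fastdeploy/runtime.h"
#include <cassert>
#include <iostream>
#include <string>
#include <vector>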

namespace fd = fastdeploy;

int main(int argc, char* argv[]) {
  // Download from https://bj.bcebos.com/paddle2onnx/model_zoo/pplcnet.tar.gz
  std::string model_file = "pplcnet/inference.pdmodel";
  std::string params_file = "pplcnet/inference.pdiparams";

  // configure runtime
  // How to configure by RuntimeOption, refer its api doc for more information
  // https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1RuntimeOption.html
  fd::RuntimeOption runtime_option;
  runtime_option.SetModelPath(model_file, params_file);
  runtime_option.UseCpu();
 
  // If need to configure Paddle Inference backend for more option, we can configure runtime_option.paddle_infer_option
  // refer https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1PaddleBackendOption.html
  runtime_option.paddle_infer_option.enable_mkldnn = true;

  fd::Runtime runtime;
  assert(runtime.Init(runtime_option));

  // Get model's inputs information
  // API doc refer https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1Runtime.html
  std::vector<fd::TensorInfo> inputs_info = runtime.GetInputInfos();

  // Create dummy data fill with 0.5
  std::vector<float> dummy_data(1 * 3 * 224 * 224, 0.5);

  // Create inputs/outputs tensors
  std::vector<fd::FDTensor> inputs(inputs_info.size());
  std::vector<fd::FDTensor> outputs;

  // Initialize input tensors
  // API doc refer https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1FDTensor.html
  inputs[0].SetData({1, 3, 224, 224}, fd::FDDataType::FP32, dummy_data.data());
  inputs[0].name = inputs_info[0].name;

  // Inference
  assert(runtime.Infer(inputs, &outputs));
 
  // Print debug information of outputs 
  outputs[0].PrintInfo();

  // Get data pointer and print it's elements
  const float* data_ptr = reinterpret_cast<const float*>(outputs[0].GetData());
  for (size_t i = 0; i < 10 && i < outputs[0].Numel(); ++i) {
    std::cout << data_ptr[i] << " ";
  }
  std::cout << std::endl;
  return 0;
}

Example 2: Deploy ONNX model with TensorRT(GPU)

#include "fastdeploy/runtime.h"
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

namespace fd = fastdeploy;

int main(int argc, char* argv[]) {
  // Download from https://bj.bcebos.com/paddle2onnx/model_zoo/pplcnet.onnx
  std::string model_file = "pplcnet.onnx";

  // configure runtime
  // How to configure by RuntimeOption, refer its api doc for more information
  // https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1RuntimeOption.html
  fd::RuntimeOption runtime_option;
  runtime_option.SetModelPath(model_file, "", fd::ModelFormat::ONNX);
  runtime_option.UseTrtBackend();
  
  // Use NVIDIA GPU to inference
  // If need to configure TensorRT backend for more option, we can configure runtime_option.trt_option
  // refer https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1TrtBackendOption.html
  runtime_option.UseGpu(0);
  // Use float16 inference to improve performance
  runtime_option.trt_option.enable_fp16 = true;
  // Cache trt engine to reduce time cost in model initialize
  runtime_option.trt_option.serialize_file = "./model.trt";

  fd::Runtime runtime;
  assert(runtime.Init(runtime_option));

  // Get model's inputs information
  // API doc refer https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1Runtime.html
  std::vector<fd::TensorInfo> inputs_info = runtime.GetInputInfos();

  // Create dummy data fill with 0.5
  std::vector<float> dummy_data(1 * 3 * 224 * 224, 0.5);

  // Create inputs/outputs tensors
  std::vector<fd::FDTensor> inputs(inputs_info.size());
  std::vector<fd::FDTensor> outputs;

  // Initialize input tensors
  // API doc refer https://baidu-paddle.github.io/fastdeploy-api/cpp/html/structfastdeploy_1_1FDTensor.html
  inputs[0].SetData({1, 3, 224, 224}, fd::FDDataType::FP32, dummy_data.data());
  inputs[0].name = inputs_info[0].name;

  // Inference
  assert(runtime.Infer(inputs, &outputs));
 
  // Print debug information of outputs 
  outputs[0].PrintInfo();

  // Get data pointer and print it's elements
  const float* data_ptr = reinterpret_cast<const float*>(outputs[0].GetData());
  for (size_t i = 0; i < 10 && i < outputs[0].Numel(); ++i) {
    std::cout << data_ptr[i] << " ";
  }
  std::cout << std::endl;
  return 0;
}

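3) Inference with built-in vision models

FastDeploy/examples/vision

Example 1: yolov5 with an ONNX model

#include <cstdlib>
#include <iostream>
#include <string>

#include "fastdeploy/vision.h"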

// onnxruntime cpu
void CpuInfer(const std::string& model_file, const std::string& image_file) {
  auto model = fastdeploy::vision::detection::YOLOv5(model_file);
  if (!model.Initialized()) {
    std::cerr << "Failed to initialize." << std::endl;
    return;
  }

  auto im = cv::imread(image_file);

  fastdeploy::vision::DetectionResult res;
  if (!model.Predict(&im, &res)) {
    std::cerr << "Failed to predict." << std::endl;
    return;
  }
  std::cout << res.Str() << std::endl;

  auto vis_im = fastdeploy::vision::VisDetection(im, res);
  cv::imwrite("vis_result.jpg", vis_im);
  std::cout << "Visualized result saved in ./vis_result.jpg" << std::endl;
}

// onnxruntime gpu
void GpuInfer(const std::string& model_file, const std::string& image_file) {
  auto option = fastdeploy::RuntimeOption();
  option.UseGpu();
  auto model = fastdeploy::vision::detection::YOLOv5(model_file, "", option);
  if (!model.Initialized()) {
    std::cerr << "Failed to initialize." << std::endl;
    return;
  }

  auto im = cv::imread(image_file);

  fastdeploy::vision::DetectionResult res;
  if (!model.Predict(&im, &res)) {
    std::cerr << "Failed to predict." << std::endl;
    return;
  }
  std::cout << res.Str() << std::endl;

  auto vis_im = fastdeploy::vision::VisDetection(im, res);
  cv::imwrite("vis_result.jpg", vis_im);
  std::cout << "Visualized result saved in ./vis_result.jpg" << std::endl;
}

void TrtInfer(const std::string& model_file, const std::string& image_file) {
  auto option = fastdeploy::RuntimeOption();
  option.UseGpu();
  option.UseTrtBackend();
  option.SetTrtInputShape("images", {1, 3, 640, 640});
  auto model = fastdeploy::vision::detection::YOLOv5(model_file, "", option);
  if (!model.Initialized()) {
    std::cerr << "Failed to initialize." << std::endl;
    return;
  }

  auto im = cv::imread(image_file);

  fastdeploy::vision::DetectionResult res;
  if (!model.Predict(&im, &res)) {
    std::cerr << "Failed to predict." << std::endl;
    return;
  }
  std::cout << res.Str() << std::endl;

  auto vis_im = fastdeploy::vision::VisDetection(im, res);
  cv::imwrite("vis_result.jpg", vis_im);
  std::cout << "Visualized result saved in ./vis_result.jpg" << std::endl;
}

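int main(int argc, char* argv[]) {
  // Expected arguments: <model_file> <image_file> <mode>
  // mode: 0 = ONNX Runtime on CPU, 1 = ONNX Runtime on GPU, 2 = TensorRT
  if (argc < 4) {
    std::cerr << "Usage: " << argv[0]
              << " <model_file> <image_file> <0|1|2>" << std::endl;
    return -1;
  }
  const int mode = std::atoi(argv[3]);
  if (mode == 0) {
    CpuInfer(argv[1], argv[2]);
  } else if (mode == 1) {
    GpuInfer(argv[1], argv[2]);
  } else if (mode == 2) {
    TrtInfer(argv[1], argv[2]);
  }
  return 0;
}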

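4. FastDeploy toolkit

One-step install

# Install fastdeploy-tools with pip. The toolkit currently supports one-step automatic model compression and model conversion.
pip install fastdeploy-tools==0.0.1

The FastDeploy Python package already includes this tool, so it does not need to be installed separately.

1) Model compression tool PaddleSlim

https://github.com/PaddlePaddle/PaddleSlim

PaddleSlim is a library dedicated to deep learning model compression. It provides compression strategies such as low-bit quantization, knowledge distillation, sparsification, and neural architecture search to help developers shrink models quickly.

python -m pip install paddlepaddle-gpu==2.4.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

pip install paddleslim

Version compatibility: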

PaddleSlim	PaddlePaddle	PaddleLite
2.0.0	2.0	2.8
2.1.0	2.1.0	2.8
2.1.1	2.1.1	>=2.8
2.3.0	2.3.0	>=2.11
2.4.0	2.4.0	>=2.11
develop	develop	>=2.11

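Automatic compression

Compared with traditional manual compression, the "automatic" part shows up in four ways: training code is decoupled, hyperparameters for post-training quantization are searched, strategies are combined automatically, and the process is hardware-aware (hardware latency estimation).

# Import dependencies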
import paddle
from PIL import Image
from paddle.vision.datasets import DatasetFolder
from paddle.vision.transforms import transforms
from paddleslim.auto_compression import AutoCompression
paddle.enable_static()
# Define the Dataset
class ImageNetDataset(DatasetFolder):
    def __init__(self, path, image_size=224):
        super(ImageNetDataset, self).__init__(path)
        normalize = transforms.Normalize(
            mean=[123.675, 116.28, 103.53], std=[58.395, 57.120, 57.375])
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(image_size), transforms.Transpose(),
            normalize
        ])

    def __getitem__(self, idx):
        img_path, _ = self.samples[idx]
        return self.transform(Image.open(img_path).convert('RGB'))

    def __len__(self):
        return len(self.samples)

# Define the DataLoader
train_dataset = ImageNetDataset("./ILSVRC2012_data_demo/ILSVRC2012/train/")
image = paddle.static.data(
    name='inputs', shape=[None] + [3, 224, 224], dtype='float32')
train_loader = paddle.io.DataLoader(train_dataset, feed_list=[image], batch_size=32, return_list=False)
# Start automatic compression
ac = AutoCompression(
    model_dir="./MobileNetV1_infer",
    model_filename="inference.pdmodel",
    params_filename="inference.pdiparams",
    save_dir="MobileNetV1_quant",
    config={"QuantPost": {}, "HyperParameterOptimization": {'ptq_algo': ['avg'], 'max_quant_count': 3}},
    ### config={"QuantAware": {}, "Distillation": {}}, ### On Windows, use this config line instead
    train_dataloader=train_loader,
    eval_dataloader=train_loader)
ac.compress()

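The quantization process has two parts:

Weight quantization: quantizes the weights in the network. The granularity can be layer-wise or channel-wise, meaning that one quantization scale is chosen per layer or per channel. In PaddleSlim, all weight quantization uses the abs_max or channel_wise_abs_max method.
Activation quantization: quantizes the activation OPs in the network, which carry no weights. In general only layer-wise granularity can be used. PaddleSlim uses the moving_average_abs_max sampling strategy by default.

Quantization methods:

Static post-training quantization (静态离线量化)

Computes the quantization scale factors offline from sample data, using methods such as KL divergence or MSE.
- Load the pretrained FP32 model and configure a DataLoader for calibration.
- Read a small batch of samples, run the model's forward inference, and save or update the quantization scale and related information for each OP to be quantized.
- Convert the FP32 model to an INT8 model and save it.

Quantization-aware training (在线量化训练)

Before training, the network's computation graph is processed: quantize-dequantize nodes are inserted in front of the operators to be quantized, and training then produces a model with simulated quantization.
- Build the model and dataset.
- Train the floating-point model.
- Load the pretrained model and fine-tune it with quantization-aware training.
- Export the quantized inference model.

Dynamic post-training quantization (动态离线量化)

Quantizes the weights of specific OPs in the model from FP32 to types such as INT8. Models quantized this way can be run in two ways:
- The first is dequantized inference (supported by Paddle Lite): the INT8/16 weights are first dequantized to FP32, and inference then runs with FP32 floating-point arithmetic.
- The second is quantized inference: the quantization information of the quantized OP's input is computed dynamically during inference, and INT8 integer arithmetic is performed on the quantized input and weights.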
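
As a small illustration of the difference between the two weight-quantization granularities, the sketch below quantizes a toy weight matrix to symmetric INT8 with one scale per layer and with one scale per output channel. It is a generic illustration of abs_max scaling, not PaddleSlim's implementation; the function names and the toy weights are assumptions made for this example.

```python
QMAX = 127  # largest magnitude of a symmetric INT8 value


def abs_max_scale(values):
    """Scale chosen as the largest absolute value (the abs_max idea)."""
    return max(abs(v) for v in values)


def quantize(values, scale):
    """Map floats to integers in [-QMAX, QMAX] using the given scale."""
    if scale == 0:
        return [0 for _ in values]
    return [max(-QMAX, min(QMAX, round(v / scale * QMAX))) for v in values]


def dequantize(q_values, scale):
    """Map quantized integers back to floats."""
    return [q * scale / QMAX for q in q_values]


def max_error(values, scale):
    """Largest absolute round-trip error for one scale."""
    restored = dequantize(quantize(values, scale), scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Two output channels with very different value ranges.
weights = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]

# Layer-wise: a single scale shared by every channel.
layer_scale = abs_max_scale([w for row in weights for w in row])
# Channel-wise: one scale per output channel.
channel_scales = [abs_max_scale(row) for row in weights]

# Round-trip error of the small-range channel under each granularity.
layer_err_small = max_error(weights[1], layer_scale)
channel_err_small = max_error(weights[1], channel_scales[1])
print(layer_scale, channel_scales, layer_err_small, channel_err_small)
```

The channel with small values loses most of its resolution when it has to share the layer-wide scale, which is why channel-wise scales usually preserve weights better at the cost of storing one scale per channel.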
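
The calibration step of static post-training quantization can be sketched as a search over candidate scales that minimizes the mean squared error between the sampled values and their quantize-dequantize round trip. This is a minimal sketch of the MSE idea only, under the assumption of symmetric INT8 and a simple linear grid of candidates; the function names are hypothetical and PaddleSlim's actual search may differ.

```python
QMAX = 127


def fake_quant(values, scale):
    """Quantize to INT8 with clipping, then dequantize back to float."""
    return [
        max(-QMAX, min(QMAX, round(v / scale * QMAX))) * scale / QMAX
        for v in values
    ]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_scale_mse(samples, num_candidates=20):
    """Pick the clipping scale with the lowest round-trip MSE.

    Candidates are evenly spaced fractions of the largest absolute sample.
    Returns (best_scale, best_error).
    """
    abs_max = max(abs(v) for v in samples)
    if abs_max == 0:
        return 0.0, 0.0
    best_scale, best_err = abs_max, float("inf")
    for i in range(1, num_candidates + 1):
        scale = abs_max * i / num_candidates
        err = mse(samples, fake_quant(samples, scale))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


# Calibration samples collected from a few forward passes (toy data).
samples = [0.01 * i for i in range(-50, 51)] + [10.0]
scale, err = search_scale_mse(samples)
print(scale, err)
```

A smaller scale gives finer resolution for typical values but clips outliers, so the search trades clipping error against rounding error; the plain abs_max scale is always one of the candidates.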

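2) Model conversion tool X2Paddle

https://github.com/PaddlePaddle/X2Paddle

X2Paddle migrates models from other deep learning frameworks to PaddlePaddle. It currently supports framework conversion of inference models and migration of PyTorch training code.

X2Paddle currently supports 130+ PyTorch OPs, 90+ ONNX OPs, 90+ TensorFlow OPs, and 30+ Caffe OPs; see the support list for details.

# Requirements
python >= 3.5
paddlepaddle >= 2.2.2
tensorflow == 1.14 (to convert TensorFlow models)
onnx >= 1.6.0 (to convert ONNX models)
torch >= 1.5.0 (to convert PyTorch models)
paddlelite >= 2.9.0 (for one-step conversion to a Paddle-Lite format; the latest version is recommended)

# Install
pip install x2paddle

Converting a PyTorch model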

from x2paddle.convert import pytorch2paddle
pytorch2paddle(module=torch_module,
               save_dir="./pd_model",
               jit_type="trace",
               input_examples=[torch_input])
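# module (torch.nn.Module): the PyTorch Module.
# save_dir (str): path where the converted model is saved.
# jit_type (str): conversion mode. Defaults to "trace".
# input_examples (list[torch.tensor]): example inputs for the torch.nn.Module; the list length must match the number of inputs. Defaults to None.

Converting an ONNX model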

x2paddle --framework=onnx \
        --model=onnx_model.onnx \
        --save_dir=pd_model

3) Model conversion tool Paddle2ONNX

https://github.com/PaddlePaddle/Paddle2ONNX

Paddle2ONNX converts models from the PaddlePaddle format to the ONNX format.

# Install
pip install paddle2onnx

# Command-line usage
paddle2onnx --model_dir saved_inference_model \
            --model_filename model.pdmodel \
            --params_filename model.pdiparams \
            --save_file model.onnx \
            --enable_dev_version True \
            --opset_version 11 \
            --enable_onnx_checker True

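Options

Option	Description
--model_dir	Path of the directory containing the Paddle model
--model_filename	[Optional] Name of the file under `--model_dir` that stores the network structure
--params_filename	[Optional] Name of the file under `--model_dir` that stores the model parameters
--save_file	Path where the converted model is saved
--opset_version	[Optional] ONNX OpSet version to convert to; versions 7~16 are currently supported, default 9
--enable_dev_version	[Optional] Whether to use the new version of Paddle2ONNX (recommended), default True
--enable_onnx_checker	[Optional] Whether to check the correctness of the exported ONNX model; enabling it is recommended, default False
--enable_auto_update_opset	[Optional] Whether to upgrade the opset version automatically: when a lower opset cannot be converted, a higher one is selected, default True
--deploy_backend	[Optional] Inference engine for deploying quantized models: onnxruntime, tensorrt, or others. With others, all quantization information is stored in max_range.txt. Default onnxruntime
--save_calibration_file	[Optional] Path of the cache file that TensorRT 8.X reads when deploying a quantized model, default calibration.cache
--version	[Optional] Show the paddle2onnx version
--external_filename	[Optional] When the exported ONNX model is larger than 2G, the storage path for external data must be set; the recommended value is external_data
--export_fp16_model	[Optional] Whether to convert the exported ONNX model to FP16 and accelerate inference with ONNXRuntime-GPU, default False
--custom_ops	[Optional] Export a Paddle OP as an ONNX Custom OP, for example --custom_ops '{"paddle_op":"onnx_op"}', default {}

Exporting a trained Paddle model to ONNX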

import paddle

# export to ONNX
save_path = 'onnx.save/lenet' # path to save to
# Use the paddle.static.InputSpec API to specify the input shape; a dynamic dimension can be given as None
x_spec = paddle.static.InputSpec([None, 1, 28, 28], 'float32', 'x') 
# Call paddle.onnx.export to generate the ONNX model at the given path.
paddle.onnx.export(model, save_path, input_spec=[x_spec], opset_version=11)

Validating the ONNX model

import numpy as np
import onnx
import onnxruntime

# Load the ONNX model
onnx_model = onnx.load("model.onnx")
# check_model returns None and raises an exception if the model is invalid
onnx.checker.check_model(onnx_model)
print("ONNX model check passed")

# Generate a random input to verify that Paddle and ONNX give consistent inference results
x = np.random.random((1, 3, 224, 224)).astype('float32')

# Predict with ONNX Runtime
ort_sess = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
ort_inputs = {ort_sess.get_inputs()[0].name: x}
ort_outs = ort_sess.run(None, ort_inputs)
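
The snippet above produces the ONNX Runtime output but does not show the comparison itself. One way to compare it with the Paddle output is an element-wise tolerance check, sketched below in plain Python. The helper name `outputs_match` and the tolerances are assumptions for illustration; with NumPy arrays you would pass `array.tolist()` for both arguments.

```python
import math


def flatten(values):
    """Yield every scalar of a nested list/tuple as a float."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield float(v)


def outputs_match(reference, candidate, rtol=1e-4, atol=1e-5):
    """Return True if two (nested) outputs agree within the tolerances."""
    ref = list(flatten(reference))
    cand = list(flatten(candidate))
    if len(ref) != len(cand):
        return False
    return all(
        math.isclose(r, c, rel_tol=rtol, abs_tol=atol)
        for r, c in zip(ref, cand)
    )


# Toy values standing in for the Paddle and ONNX Runtime outputs.
paddle_out = [[0.10, 0.70, 0.20]]
onnx_out = [[0.100001, 0.699999, 0.20]]
print(outputs_match(paddle_out, onnx_out))
```

Small differences are expected because the two runtimes may use different kernels, so an exact equality check is usually too strict.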

四、精品模型部署

1.PP-YOLOE+

PP-YOLOE是基于PP-YOLOv2的卓越的单阶段Anchor-free模型，超越了多种流行的YOLO模型。PP-YOLOE+是PP-YOLOE的升级版本，从大规模的obj365目标检测预训练模型入手，在大幅提升收敛速度的同时，提升了模型在COCO数据集上的速度。同时，PP-YOLOE+大幅提升了包括数据预处理在内的端到端的预测速度。

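Range of sizes: PP-YOLOE comes in four sizes, s/m/l/x, selected through a width multiplier and a depth multiplier, to suit hardware with different compute budgets. At every size, its accuracy-speed trade-off is reported to beat YOLO models of comparable computational cost.
Strong performance: PP-YOLOE-l reaches 51.4 mAP on COCO test-dev at 149 FPS with TensorRT FP16. Compared with YOLOX it is 1.3 points more accurate and 25% faster; compared with YOLOv5, 0.7 points more accurate and 26.8% faster. Training is 33% faster than PP-YOLOv2, which lowers training cost.
Deployment friendly: the architecture avoids special operators such as deformable convolution and matrix NMS, so it adapts easily to more hardware. It is already fully supported on cloud GPUs such as the NVIDIA V100 and T4, on edge GPUs such as the Jetson series, and on FPGA development boards.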
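
To make the width and depth multipliers concrete: the width multiplier scales the number of channels in each stage and the depth multiplier scales the number of blocks per stage. The sketch below illustrates the idea only. The base channel and layer lists and the multiplier values are assumptions chosen for the example, not figures taken from this article, and scale_network is a hypothetical helper.

```python
def scale_network(base_channels, base_layers, width_mult, depth_mult):
    """Scale per-stage channel counts and block counts, keeping each >= 1."""
    channels = [max(round(c * width_mult), 1) for c in base_channels]
    layers = [max(round(n * depth_mult), 1) for n in base_layers]
    return channels, layers


# Illustrative base configuration (assumed for this example).
BASE_CHANNELS = [64, 128, 256, 512, 1024]
BASE_LAYERS = [3, 6, 6, 3]

# Assumed multipliers for a small variant.
channels, layers = scale_network(BASE_CHANNELS, BASE_LAYERS,
                                 width_mult=0.5, depth_mult=0.33)
print(channels, layers)  # [32, 64, 128, 256, 512] [1, 2, 2, 1]
```

Halving the width roughly quarters the cost of each convolution, while the depth multiplier removes whole blocks, which is why the two together span a wide range of compute budgets.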

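1) Environment and model preparation

git clone https://github.com/PaddlePaddle/PaddleDetection.git
cd PaddleDetection
# Install PaddleDetection
pip install -r requirements.txt
python setup.py install

# Download the pretrained model
mkdir -p models/PP-YOLOE+_s && cd models/PP-YOLOE+_s
wget https://paddledet.bj.bcebos.com/models/ppyoloe_plus_crn_s_80e_coco.pdparams
# Fetch the raw config file (a github.com/.../blob/... URL would download an HTML page instead);
# the same file also ships with the cloned repository under configs/ppyoloe/
wget https://raw.githubusercontent.com/PaddlePaddle/PaddleDetection/release/2.6/configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco.yml
cd -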

2) Inference with the original Paddle model

# Run inference on a single image
$ CUDA_VISIBLE_DEVICES=0 python tools/infer.py \
        -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco.yml \
        -o weights=models/PP-YOLOE+_s/ppyoloe_plus_crn_s_80e_coco.pdparams \
        --infer_img=demo/000000014439_640x640.jpg

Detection bbox results save in output/000000014439_640x640.jpg

3) Converting the Paddle model to ONNX

Because NMS-related operators are not supported in the TensorRT conversion, it is recommended to set exclude_post_process to True and implement post-processing yourself (the YOLOv8 post-processing is a useful reference).

# Export the model without NMS; with NMS included, the TensorRT conversion fails because the TopK operator is missing
$ python tools/export_model.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco.yml \
        -o weights=models/PP-YOLOE+_s/ppyoloe_plus_crn_s_80e_coco.pdparams \
        exclude_post_process=True trt=True

# The exported files go to output_inference by default
$ ls output_inference/ppyoloe_plus_crn_s_80e_coco/
infer_cfg.yml  model.pdiparams  model.pdiparams.info  model.pdmodel

# Convert to ONNX format
$ paddle2onnx --model_dir output_inference/ppyoloe_plus_crn_s_80e_coco \
        --model_filename model.pdmodel \
        --params_filename model.pdiparams \
        --opset_version 11 \
        --enable_onnx_checker True \
        --save_file ppyoloe_plus_crn_s_80e_coco.onnx

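Special export_model.py parameters

The file PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py shows which parameters the detection head accepts. The names listed in __shared__ (including trt, exclude_nms and exclude_post_process) are read from the global configuration, which is why they can be set with -o on the command line.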

class PPYOLOEHead(nn.Layer):
    __shared__ = [
        'num_classes', 'eval_size', 'trt', 'exclude_nms',
        'exclude_post_process', 'use_shared_conv', 'for_distill'
    ]
    def __init__(self,
                 in_channels=[1024, 512, 256],
                 num_classes=80,
                 act='swish',
                 fpn_strides=(32, 16, 8),
                 grid_cell_scale=5.0,
                 grid_cell_offset=0.5,
                 reg_max=16,
                 reg_range=None,
                 static_assigner_epoch=4,
                 use_varifocal_loss=True,
                 static_assigner='ATSSAssigner',
                 assigner='TaskAlignedAssigner',
                 nms='MultiClassNMS',
                 eval_size=None,
                 loss_weight={
                     'class': 1.0,
                     'iou': 2.5,
                     'dfl': 0.5,
                 },
                 trt=False,
                 attn_conv='convbn',
                 exclude_nms=False,
                 exclude_post_process=False,
                 use_shared_conv=True,
                 for_distill=False):

The MultiClassNMS operator

This operator is implemented in Paddle2ONNX.

The infer_cfg.yml configuration file

mode: paddle
draw_threshold: 0.5
metric: COCO
use_dynamic_shape: false
arch: YOLO
min_subgraph_size: 3
Preprocess:       # preprocessing parameters
- interp: 2
  keep_ratio: false
  target_size:
  - 640
  - 640
  type: Resize
- mean:
  - 0.0
  - 0.0
  - 0.0
  norm_type: none
  std:
  - 1.0
  - 1.0
  - 1.0
  type: NormalizeImage
- type: Permute
label_list:       # label list
- person
- bicycle
- car
- ......
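
The Preprocess entries translate into three steps: resize to 640x640 without keeping the aspect ratio, normalize, and permute from HWC to CHW. The sketch below reproduces them with NumPy only, under stated assumptions: the input is already an RGB uint8 array; nearest-neighbour indexing stands in for the configured interpolation (interp: 2) to avoid an OpenCV dependency; and pixel values are divided by 255, which is assumed to be the default behaviour of NormalizeImage when no is_scale key is given. preprocess is a hypothetical helper, not the PaddleDetection implementation.

```python
import numpy as np


def preprocess(image_hwc, target_size=(640, 640)):
    """Resize -> scale to [0, 1] -> HWC to CHW -> add batch dimension.

    image_hwc: uint8 RGB array of shape (H, W, 3).
    Returns (image, scale_factor) shaped (1, 3, 640, 640) and (1, 2).
    """
    src_h, src_w = image_hwc.shape[:2]
    dst_h, dst_w = target_size

    # keep_ratio: false -> stretch both axes independently to the target size.
    # Nearest-neighbour lookup is a stand-in for the configured interpolation.
    rows = np.minimum(np.arange(dst_h) * src_h // dst_h, src_h - 1)
    cols = np.minimum(np.arange(dst_w) * src_w // dst_w, src_w - 1)
    resized = image_hwc[rows[:, None], cols[None, :]]

    # Assumption: values are scaled by 1/255; with mean 0, std 1 and
    # norm_type none, no further normalization is applied.
    scaled = resized.astype(np.float32) / 255.0

    # Permute: HWC -> CHW, then add the batch axis.
    chw = np.ascontiguousarray(scaled.transpose(2, 0, 1))[None]

    # scale_factor is ordered (scale_y, scale_x).
    scale_factor = np.array([[dst_h / src_h, dst_w / src_w]], dtype=np.float32)
    return chw, scale_factor


dummy = np.random.randint(0, 256, size=(480, 320, 3), dtype=np.uint8)
image, scale_factor = preprocess(dummy)
print(image.shape, image.dtype, scale_factor)
```

For results that match the Paddle pipeline closely, the resize step should be replaced by the same interpolation method that the exported config specifies.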

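ONNX model inputs and outputs

inputs:

image: float32 [1, 3, 640, 640], the preprocessed image data
scale_factor: float32 [1, 2], the factors by which the image was scaled during preprocessing, in (scale_y, scale_x) order as the reference code shows (this input is not present when exclude_post_process is True)

outputs (reference code)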

def post_process(self, head_outs, scale_factor):
    pred_scores, pred_dist, anchor_points, stride_tensor = head_outs
    pred_bboxes = batch_distance2bbox(anchor_points, pred_dist)
    pred_bboxes *= stride_tensor
    if self.exclude_post_process:
        return paddle.concat(
            [pred_bboxes, pred_scores.transpose([0, 2, 1])],
            axis=-1), None, None
    else:
        # scale bbox to origin
        scale_y, scale_x = paddle.split(scale_factor, 2, axis=-1)
        scale_factor = paddle.concat(
            [scale_x, scale_y, scale_x, scale_y],
            axis=-1).reshape([-1, 1, 4])
        pred_bboxes /= scale_factor
        if self.exclude_nms:
            # `exclude_nms=True` just use in benchmark
            return pred_bboxes, pred_scores, None
        else:
            bbox_pred, bbox_num, nms_keep_idx = self.nms(pred_bboxes, pred_scores)
            return bbox_pred, bbox_num, nms_keep_idx

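When exclude_post_process == True:
- Output: [1, 8400, 84]. The 8400 anchor points come from the 80x80, 40x40 and 20x20 feature maps of a 640x640 input (strides 8, 16, 32); 84 = 4 box coordinates + 80 class scores.
When exclude_post_process == False and exclude_nms == True:
- Output 1, pred_bboxes: [1, 8400, 4]
- Output 2, pred_scores: [1, 80, 8400]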
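
With exclude_post_process set to True, the score filtering, rescaling and NMS have to be done outside the model. The following is a minimal NumPy sketch of such a post-processing step, not the MultiClassNMS implementation: it keeps only the best class per anchor point, which is a simplification, and the score and IoU thresholds are assumed values. postprocess and iou_one_to_many are hypothetical helpers.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)


def postprocess(output, scale_factor, score_thr=0.5, iou_thr=0.6):
    """Decode a [1, N, 4 + C] tensor into rows [class_id, score, x1, y1, x2, y2]."""
    preds = output[0]
    boxes, scores = preds[:, :4], preds[:, 4:]
    class_ids = scores.argmax(axis=1)   # best class per anchor point
    best = scores.max(axis=1)

    keep = best >= score_thr
    boxes, best, class_ids = boxes[keep], best[keep], class_ids[keep]

    # Map boxes back to the original image; scale_factor is (scale_y, scale_x).
    scale_y, scale_x = scale_factor
    boxes = boxes / np.array([scale_x, scale_y, scale_x, scale_y], dtype=np.float32)

    results = []
    for cls in np.unique(class_ids):
        # Greedy NMS within one class, highest score first.
        idx = np.where(class_ids == cls)[0]
        idx = idx[np.argsort(-best[idx])]
        while idx.size:
            top, rest = idx[0], idx[1:]
            results.append([float(cls), float(best[top]), *boxes[top].tolist()])
            if rest.size == 0:
                break
            idx = rest[iou_one_to_many(boxes[top], boxes[rest]) < iou_thr]
    return np.array(results, dtype=np.float32).reshape(-1, 6)


# Synthetic check: two overlapping boxes of class 0 and one box of class 2.
demo = np.zeros((1, 3, 84), dtype=np.float32)
demo[0, 0, :4] = [100, 100, 200, 200]
demo[0, 0, 4] = 0.9
demo[0, 1, :4] = [105, 105, 205, 205]
demo[0, 1, 4] = 0.8
demo[0, 2, :4] = [300, 300, 400, 400]
demo[0, 2, 6] = 0.7
dets = postprocess(demo, scale_factor=(1.0, 2.0))
print(dets)
```

In the synthetic check the second box overlaps the first with an IoU of about 0.82 and is suppressed, so two detections remain, with x coordinates halved by the scale factor of 2.0.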

issues
- Running deploy/python/infer.py fails with ValueError: (InvalidArgument) Pass preln_embedding_eltwise_layernorm_fuse_pass has not been registered.
  - Check that the CUDA version of the installed paddlepaddle-gpu package matches the version required by TensorRT.

4) Converting ONNX to TensorRT

trtexec="/usr/src/tensorrt/bin/trtexec"
# Newer TensorRT releases deprecate --workspace in favor of --memPoolSize=workspace:4096
$trtexec --onnx=./ppyoloe_plus_crn_s_80e_coco.onnx --saveEngine=./ppyoloe_s_bs1.engine --fp16 --buildOnly --workspace=4096
