TensorRT vs. TVM Performance Comparison (ResNet50)

Author: crazyhank | Published: 2019-05-26 13:52

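Anyone working in deep learning knows that once a model is trained, it has to be deployed to a production environment through an inference framework. On GPU platforms the usual choice is TensorRT, because it makes full use of the GPU and applies many optimizations (operator fusion, quantization, and so on), which gives it a considerable performance advantage. TensorRT is closed source, however, so users cannot see what Nvidia actually does inside it. Its open-source counterpart is TVM, the project started by Tianqi Chen, which is still being iterated on. TVM also targets the problem of deploying trained models, and its supported hardware platforms include x86, ARM, and GPUs.
Today I ran the same ResNet50 network through both TensorRT and TVM and compared their performance. The results are as follows: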

Running with TensorRT

I ran the trtexec command directly, which needs no real input data. The command and its output are:

hank@hank-desktop:~/Study/TVM/example$ /usr/src/tensorrt/bin/trtexec --output=prob --deploy=/home/hank/Models/Caffe/resnet50/ResNet-50-deploy.prototxt  --fp16 --batch=1
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --output=prob --deploy=/home/hank/Models/Caffe/resnet50/ResNet-50-deploy.prototxt --fp16 --batch=1
[I] output: prob
[I] deploy: /home/hank/Models/Caffe/resnet50/ResNet-50-deploy.prototxt
[I] fp16
[I] batch: 1
[I] Input "data": 3x224x224
[I] Output "prob": 1000x1x1
[W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[I] Average over 10 runs is 10.7174 ms (host walltime is 10.9884 ms, 99% percentile time is 10.7385).
[I] Average over 10 runs is 10.718 ms (host walltime is 10.9872 ms, 99% percentile time is 10.743).
[I] Average over 10 runs is 10.1589 ms (host walltime is 10.4303 ms, 99% percentile time is 10.7429).
[I] Average over 10 runs is 9.97009 ms (host walltime is 10.2436 ms, 99% percentile time is 10.048).
[I] Average over 10 runs is 9.813 ms (host walltime is 10.0879 ms, 99% percentile time is 9.83677).
[I] Average over 10 runs is 9.82028 ms (host walltime is 10.0958 ms, 99% percentile time is 9.83814).
[I] Average over 10 runs is 9.81811 ms (host walltime is 10.1138 ms, 99% percentile time is 9.8297).
[I] Average over 10 runs is 9.82876 ms (host walltime is 10.1285 ms, 99% percentile time is 9.84746).
[I] Average over 10 runs is 9.81528 ms (host walltime is 10.1037 ms, 99% percentile time is 9.83718).
[I] Average over 10 runs is 9.82067 ms (host walltime is 10.0932 ms, 99% percentile time is 9.84448).
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --output=prob --deploy=/home/hank/Models/Caffe/resnet50/ResNet-50-deploy.prototxt --fp16 --batch=1
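
The per-batch averages in the log above can be condensed into a single steady-state figure. What follows is a minimal sketch and not part of the original test: the helper names `batch_averages` and `steady_state_ms`, and the idea of skipping the first few batches as warm-up, are my own assumptions. The regular expression relies only on the `Average over N runs is X ms` wording visible in the log.

```python
import re

# Matches lines such as "[I] Average over 10 runs is 9.813 ms (...)".
AVG_LINE = re.compile(r"Average over \d+ runs is ([0-9.]+) ms")

def batch_averages(log_text):
    """Return the per-batch average latencies (ms) found in a trtexec log."""
    return [float(value) for value in AVG_LINE.findall(log_text)]

def steady_state_ms(log_text, skip=4):
    """Mean of the batch averages after dropping the first `skip` batches."""
    values = batch_averages(log_text)[skip:]
    if not values:
        raise ValueError("no batch averages left after skipping warm-up")
    return sum(values) / len(values)

if __name__ == "__main__":
    # Three lines copied from the log above, used as a small demonstration.
    sample = (
        "[I] Average over 10 runs is 10.7174 ms (host walltime is 10.9884 ms)\n"
        "[I] Average over 10 runs is 9.813 ms (host walltime is 10.0879 ms)\n"
        "[I] Average over 10 runs is 9.82028 ms (host walltime is 10.0958 ms)\n"
    )
    print(batch_averages(sample))
    print("steady state: %.3f ms" % steady_state_ms(sample, skip=1))
```

Applied to the full log with the first four batches skipped, this comes to roughly 9.82 ms per inference.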

Running with TVM

I first converted a ResNet50 model trained with the Caffe2 framework, then ran inference on the result:
(1) Converting the Caffe2 model

# coding=utf-8
# Convert a Caffe2 ResNet50 model into a deployable TVM module (CUDA target).
import numpy as np
import tvm
from caffe2.python import caffe2_pb2
from tvm import relay

INIT_NET    = './models/caffe2/resnet50/init_net.pb'
PREDICT_NET = './models/caffe2/resnet50/predict_net.pb'

# Load the Caffe2 init (weights) and predict (graph) definitions.
init_def = caffe2_pb2.NetDef()
predict_def = caffe2_pb2.NetDef()

with open(INIT_NET, 'rb') as f:
    init_def.ParseFromString(f.read())

with open(PREDICT_NET, 'rb') as f:
    predict_def.ParseFromString(f.read())

# Input name, shape (NCHW, batch size 1), and dtype of the network input.
shape_dict = {'gpu_0/data': (1, 3, 224, 224)}
dtype_dict = {'gpu_0/data': np.dtype('float32')}

# Import the model into Relay, then compile it for CUDA with opt_level=3.
func, params = relay.frontend.from_caffe2(init_def, predict_def, shape_dict, dtype_dict)

target = tvm.target.cuda()
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(func, target, params=params)

# Save the three artifacts needed for deployment.
path_lib = "./deploy_lib.tar"
lib.export_library(path_lib)
with open("./deploy_graph.json", "w") as fo:
    fo.write(graph)
with open("./deploy_param.params", "wb") as fo:
    fo.write(relay.save_param_dict(params))

print("TVM model saved!")

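After conversion, the three files that TVM needs for inference are generated in the current directory:
*.tar : the compiled runtime library
*.json: the graph description file
*.params: the model parameter file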
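
Before loading the model, it can help to confirm that all three artifacts were actually written. The following is a minimal standard-library sketch; the function name `missing_artifacts` is hypothetical, and the file names are the ones written by the conversion script above.

```python
from pathlib import Path

# File names written by the conversion script in this article.
ARTIFACTS = ("deploy_lib.tar", "deploy_graph.json", "deploy_param.params")

def missing_artifacts(directory=".", names=ARTIFACTS):
    """Return the artifact names that are absent or empty in `directory`."""
    base = Path(directory)
    missing = []
    for name in names:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing

if __name__ == "__main__":
    absent = missing_artifacts(".")
    if absent:
        print("Missing TVM artifacts:", ", ".join(absent))
    else:
        print("All TVM artifacts present")
```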
(2) Inference with the TVM API

# coding=utf-8
# Load the compiled TVM module and classify a few images on the GPU.
import ast
import datetime

import numpy as np
import tvm
from PIL import Image
from tvm.contrib import graph_runtime

json_file = "./deploy_graph.json"
lib_file = "./deploy_lib.tar"
params_file = "./deploy_param.params"

# The synset file holds a Python dict literal mapping class id to class name.
synset_path = './imagenet1000_clsid_to_human.txt'
with open(synset_path) as f:
    synset = ast.literal_eval(f.read())

def transform_image(image):
    # Subtract the per-channel mean, divide by the per-channel std,
    # then convert HWC to NCHW with a leading batch dimension.
    image = np.array(image) - np.array([123., 117., 104.])
    image /= np.array([58.395, 57.12, 57.375])
    image = image.transpose((2, 0, 1))
    image = image[np.newaxis, :].astype('float32')
    return image

img_paths = ['./cat.png', './dog.jpg', './horses.jpg', './eagle.jpg', './person.jpg']
data = []
for path in img_paths:
    # convert('RGB') drops any alpha channel so the 3-channel normalization applies.
    img = Image.open(path).convert('RGB').resize((224, 224))
    data.append(transform_image(img))

input_name = 'gpu_0/data'

with open(json_file) as f:
    loaded_json = f.read()
loaded_lib = tvm.module.load(lib_file)
with open(params_file, "rb") as f:
    loaded_params = bytearray(f.read())

ctx = tvm.gpu(0)
m = graph_runtime.create(loaded_json, loaded_lib, ctx)
m.load_params(loaded_params)
print("TVM model file loaded")

# do inference
for image in data:
    m.set_input(input_name, tvm.nd.array(image))
    # NOTE: only the m.run() call is timed. GPU work is typically launched
    # asynchronously, so this interval may not cover the full execution time.
    start = datetime.datetime.now()
    m.run()
    end = datetime.datetime.now()
    output = m.get_output(0).asnumpy().reshape((1000))
    top1 = np.argmax(output)
    print('>>cost %.2f ms, top1 id: %d, class name: %s' %
        ((end - start).total_seconds() * 1000.0, top1, synset[top1]))

Output:

hank@hank-desktop:~/Study/TVM/example$ python3 ./run_tvm_model.py
TVM model file loaded
>>cost 4.35 ms, top1 id: 283, class name: Persian cat
>>cost 0.56 ms, top1 id: 249, class name: malamute, malemute, Alaskan malamute
>>cost 0.56 ms, top1 id: 349, class name: bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis
>>cost 0.63 ms, top1 id: 81, class name: ptarmigan
>>cost 0.55 ms, top1 id: 355, class name: llama

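I ran five different images back to back. Apart from the first image, which takes noticeably longer (4.35 ms), the remaining four each finish in roughly 0.55 to 0.63 ms, and the predicted class differs from image to image. Note that these figures time only the m.run() call. Because GPU work is typically launched asynchronously, they may understate the real inference latency, so they are not directly comparable with the trtexec averages of about 9.8 ms. The trtexec run also used FP16 mode on hardware that TensorRT itself reports as lacking native FP16 support.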
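
Because the first run is much slower than the rest, a steadier latency figure comes from discarding warm-up runs and averaging many timed runs, stopping the clock only after the device has finished its work. The sketch below is an illustration under stated assumptions, not the measurement used in this article: `benchmark`, `run`, and `sync` are hypothetical names, and on a GPU `sync` would have to be whatever call blocks until all queued work is complete.

```python
import time

def benchmark(run, sync=lambda: None, warmup=3, repeat=20):
    """Return (mean_ms, min_ms, max_ms) for `run`, excluding warm-up calls.

    `run` launches one inference and `sync` blocks until it has finished.
    Both are caller-supplied placeholders (hypothetical hooks).
    """
    for _ in range(warmup):
        run()
        sync()
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        run()
        sync()  # stop the clock only after the work has completed
        samples.append((time.perf_counter() - start) * 1000.0)
    return sum(samples) / len(samples), min(samples), max(samples)

if __name__ == "__main__":
    # Stand-in CPU workload so the sketch runs without a GPU.
    mean_ms, min_ms, max_ms = benchmark(lambda: sum(range(10000)))
    print("mean %.3f ms, min %.3f ms, max %.3f ms" % (mean_ms, min_ms, max_ms))
```

With the inference script above, `run` would be `m.run`, and `sync` could plausibly be `ctx.sync`, assuming the TVM version in use provides it.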