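I. Introduction
TVMC is a tool shipped with the TVM Python package that runs auto-tuning, compilation, performance profiling, and model execution from the command line. Following the guide on the official TVM website, this article walks through how to use TVMC.
II. TVMC Usage Examples
1. Installation
After updating TVM to the latest code from GitHub: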
cd tvm/python
python gen_requirements.py
python setup.py build
python setup.py install
Problem encountered:
Processing dependencies for tvm==0.9.dev915+g5c29e55be
Searching for synr==0.6.0
Reading https://pypi.python.org/simple/synr/
Couldn't find index page for 'synr' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or working download links found for synr==0.6.0
error: Could not find suitable distribution for Requirement.parse('synr==0.6.0')
Solution:
pip install synr
python setup.py build
python setup.py install
After installation, running tvmc --help prints:
...
usage: tvmc [-v] [--version] [-h] {run,tune,compile} ...
TVM compiler driver
optional arguments:
-v, --verbose increase verbosity
--version print the version and exit
-h, --help show this help message and exit.
commands:
{run,tune,compile}
run run a compiled module
tune auto-tune a model
compile compile a model.
TVMC - TVM driver command-line interface
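2. Usage example
TVMC is an application built on the TVM Python package; once TVM is installed, it can be run with the tvmc command. Next, we use tvmc to run an image-classification model:
(1) Download the model
We use the ResNet-50 model in ONNX format: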
pip install onnx onnxoptimizer
wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx
(2) Compile the model
tvmc compile --target "llvm" \
--output resnet50-v2-7-tvm.tar \
resnet50-v2-7.onnx
After it finishes, the output is:
target={1: llvm -keys=cpu -link-params=0}
target={1: llvm -keys=cpu -link-params=0}, target_host=None
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
This generates the archive resnet50-v2-7-tvm.tar in the current directory. After extracting it, you can see that it contains three files:
-rw-rw-r-- 1 user user 88K Mar 29 16:42 mod.json
-rw-rw-r-- 1 user user 98M Mar 29 16:42 mod.params
-rwxrwxr-x 1 user user 582K Mar 29 16:42 mod.so
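- mod.so: the compiled model as a shared (dynamic) library that the TVM runtime can load;
- mod.json: the computation graph produced from the TVM Relay module, describing the model's nodes and the types and attributes of each node's inputs and outputs;
- mod.params: the trained model parameters (weights).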
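To check what a package contains without extracting it by hand, the archive can be inspected with Python's standard tarfile module. The sketch below assumes the classic layout described above (mod.so, mod.json, mod.params); the function name inspect_tvmc_package is hypothetical and not part of TVMC.

```python
import posixpath
import tarfile

# The three files a classic TVMC package is expected to contain (see the list above)
EXPECTED_FILES = {"mod.so", "mod.json", "mod.params"}


def inspect_tvmc_package(package_path):
    """Return {file name: size in bytes} for the regular files in a TVMC .tar package."""
    with tarfile.open(package_path) as tar:
        # Use the base name so that entries stored as "./mod.so" are matched too
        sizes = {posixpath.basename(m.name): m.size for m in tar.getmembers() if m.isfile()}
    missing = EXPECTED_FILES - set(sizes)
    if missing:
        raise ValueError(f"package is missing: {sorted(missing)}")
    return sizes
```

For example, inspect_tvmc_package("resnet50-v2-7-tvm.tar") should report sizes consistent with the directory listing above.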
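(3) Tuning
By default TVMC tunes with the XGBoost tuner, and you must specify an output file for the tuning records. Tuning is essentially a search over a parameter space: each operator is tried with different parameter configurations, and the configuration that runs fastest is kept. Because many candidates have to be compiled and measured, tuning is usually time-consuming. In this example, --number and --repeat set how many times each candidate configuration is run when it is measured (each of the --repeat measurements averages --number runs); the number of candidate configurations tried per task is capped separately by --trials: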
tvmc tune --target "llvm" \
--output resnet50-v2-7-autotuner_records.json \
--number 10 \
--repeat 10 \
resnet50-v2-7.onnx
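The idea of searching a parameter space can be illustrated with a toy example. The sketch below is not AutoTVM: its knobs are loosely modeled on the tile_oc/tile_ow/unroll_kw entries that appear in the tuning records, the latency model is made up, and the search is exhaustive (like the gridsearch tuner) rather than guided by an XGBoost cost model. It does mirror how a candidate is measured: each of `repeat` measurements is the mean of `number` runs, and the configuration with the lowest cost wins.

```python
import itertools
import random
import statistics

# Hypothetical search space for one conv2d-like operator
SPACE = {
    "tile_oc": [1, 2, 4, 8],
    "tile_ow": [1, 8, 16, 64],
    "unroll_kw": [False, True],
}


def fake_kernel_time(cfg, rng):
    """Made-up latency model (seconds) with +/-2% noise; stands in for running real code."""
    base = 0.010 / cfg["tile_oc"] ** 0.3 + 0.002 * abs(cfg["tile_ow"] - 16) / 64
    if cfg["unroll_kw"]:
        base *= 0.95
    return base * rng.uniform(0.98, 1.02)


def measure(cfg, number, repeat, rng):
    """Return `repeat` costs, each the mean of `number` runs."""
    return [statistics.mean(fake_kernel_time(cfg, rng) for _ in range(number))
            for _ in range(repeat)]


def grid_search(space, number=10, repeat=10, seed=0):
    """Try every configuration and keep the one with the lowest mean cost."""
    rng = random.Random(seed)
    keys = list(space)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        cost = statistics.mean(measure(cfg, number, repeat, rng))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost
```

Even this tiny space has 32 candidates at 100 runs each, which hints at why real tuning over large configuration spaces takes so long.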
When the process finishes, you get the tuning record file resnet50-v2-7-autotuner_records.json:
{"input": ["llvm -keys=cpu -link-params=0", "conv2d_NCHWc.x86", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 42, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 1]], ["tile_oc", "sp", [-1, 1]], ["tile_ow", "sp", [-1, 64]], ["unroll_kw", "ot", true]]}, "result": [[0.009887683900000001], 0, 6.844898462295532, 1648601601.8722398], "version": 0.2, "tvm_version": "0.9.dev0"}
{"input": ["llvm -keys=cpu -link-params=0", "conv2d_NCHWc.x86", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 172, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 1]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 1]], ["unroll_kw", "ot", false]]}, "result": [[0.008670090600000001], 0, 1.1055541038513184, 1648601602.1772938], "version": 0.2, "tvm_version": "0.9.dev0"}
...
Each record contains the input ("input"), the configuration ("config"), and the measurement result ("result").
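Each line of the records file is a standalone JSON object, so the records can be analyzed with the standard json module. In the "result" field, the first element is the list of measured costs in seconds and the second is an error code (0 means the measurement succeeded); the last two are the total measurement time and a timestamp. The sketch below picks the fastest successful configuration per workload. best_configs is a hypothetical helper; AutoTVM itself provides autotvm.record.pick_best for a similar purpose.

```python
import json


def best_configs(record_lines):
    """Map (target, task name, task args) -> (config index, mean cost) of the fastest valid record."""
    best = {}
    for line in record_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        target, task_name, args = rec["input"][0], rec["input"][1], rec["input"][2]
        costs, error_no = rec["result"][0], rec["result"][1]
        if error_no != 0:  # non-zero error code: the measurement failed, skip it
            continue
        mean_cost = sum(costs) / len(costs)
        # args is a nested list (unhashable), so serialize it to use it in the key
        key = (target, task_name, json.dumps(args))
        if key not in best or mean_cost < best[key][1]:
            best[key] = (rec["config"]["index"], mean_cost)
    return best
```

For the two records shown above, which describe the same conv2d workload, this keeps config index 172 (about 8.67 ms) over index 42 (about 9.89 ms). Usage: `with open("resnet50-v2-7-autotuner_records.json") as f: print(best_configs(f))`.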
(4) Compile the tuned model
With the tuning records, the model can be recompiled using the tuned configurations:
tvmc compile --target "llvm" \
--output resnet50-v2-7-tvm_autotuned.tar \
--tuning-records resnet50-v2-7-autotuner_records.json \
resnet50-v2-7.onnx
(5) Comparing results
- Input preprocessing
Create the following file and save it as tvmc_pre_process.py:
from tvm.contrib.download import download_testdata
from PIL import Image  # requires: pip install pillow
import numpy as np
img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")
# ResNet-50 expects a 224x224 input image
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")
# The ONNX model expects NCHW input; transpose the image from HWC to CHW (the batch axis is added below)
img_data = np.transpose(img_data, (2, 0, 1))
# Normalize each channel with the ImageNet mean and standard deviation
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_stddev = np.array([0.229, 0.224, 0.225])
norm_img_data = np.zeros(img_data.shape).astype("float32")
for i in range(img_data.shape[0]):
    norm_img_data[i, :, :] = (img_data[i, :, :] / 255 - imagenet_mean[i]) / imagenet_stddev[i]
# Add the batch dimension
img_data = np.expand_dims(norm_img_data, axis=0)
# Save in .npz format, which TVMC supports for inputs
np.savez("imagenet_cat", data=img_data)
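The per-channel loop in the preprocessing script can also be written with NumPy broadcasting by reshaping the mean and standard-deviation vectors to shape (3, 1, 1). This is an optional alternative, not what the TVM tutorial uses; it assumes an RGB image (three channels) that has already been resized to 224x224, and the helper names are hypothetical.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype="float32")
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype="float32")


def normalize_chw(img_chw):
    """Vectorized (x / 255 - mean) / std over a CHW float array."""
    # [:, None, None] turns shape (3,) into (3, 1, 1), which broadcasts over H and W
    return ((img_chw / 255.0 - IMAGENET_MEAN[:, None, None])
            / IMAGENET_STD[:, None, None]).astype("float32")


def preprocess(rgb_hwc):
    """HWC RGB image array -> (1, 3, H, W) float32 batch, as in tvmc_pre_process.py."""
    chw = np.transpose(np.asarray(rgb_hwc).astype("float32"), (2, 0, 1))
    return np.expand_dims(normalize_chw(chw), axis=0)
```

Broadcasting avoids the Python-level loop and the pre-allocated zeros array; for a single 224x224 image the speed difference is negligible, so this is mainly a readability choice.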
- Output postprocessing
Create the following file and save it as tvmc_post_process.py:
import os.path
import numpy as np
from scipy.special import softmax
from tvm.contrib.download import download_testdata
# Download the class labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")
with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]
output_file = "predictions.npz"
# Read the inference output
if os.path.exists(output_file):
    with np.load(output_file) as data:
        scores = softmax(data["output_0"])  # apply softmax to the output
        scores = np.squeeze(scores)  # drop the size-1 dimensions from scores
        ranks = np.argsort(scores)[::-1]  # indices of scores, from highest to lowest
        for rank in ranks[0:5]:  # print the top-5 classes
            print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
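The postprocessing script depends on scipy only for softmax. If that dependency is unwanted, a numerically stable softmax and a top-k selection can be written with NumPy alone, as in the sketch below (the helper names are hypothetical). Because the ResNet-50 output here has shape (1, 1000), applying softmax over the last axis gives the same result as scipy.special.softmax with its default axis=None.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax: subtracting the max leaves the result unchanged but prevents overflow in exp."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)


def top_k(scores, k=5):
    """Indices of the k largest scores, highest first (same as argsort(...)[::-1][:k] up to ties)."""
    scores = np.ravel(scores)
    idx = np.argpartition(scores, -k)[-k:]      # the k largest, in arbitrary order
    return idx[np.argsort(scores[idx])[::-1]]   # sort only those k in descending order
```

np.argpartition only partially sorts the array, which is cheaper than a full argsort when just the top few of 1000 classes are needed.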
- Results for the untuned model
python tvmc_pre_process.py
tvmc run --inputs imagenet_cat.npz \
--output predictions.npz \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm.tar
python tvmc_post_process.py
Output:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
148.2297 142.3968 217.4981 137.0922 15.2046
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367180
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
- Results for the tuned model
python tvmc_pre_process.py
tvmc run --inputs imagenet_cat.npz \
--output predictions.npz \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm_autotuned.tar
python tvmc_post_process.py
Output:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
108.6551 108.0915 119.1327 106.1360 2.0859
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
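The tuned model runs noticeably faster: the mean latency drops from 148.2 ms to 108.7 ms (about 27% lower) and the run-to-run standard deviation shrinks as well, while the predicted classes and probabilities are essentially unchanged.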
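To quantify the difference and to check the outputs numerically rather than by eye, the reported mean latencies can be compared and the two output files checked with np.allclose. Both tvmc run commands above write to predictions.npz, so this sketch assumes the file was renamed after each run; the names predictions_untuned.npz and predictions_tuned.npz, the helper outputs_match, and the tolerances are illustrative assumptions.

```python
import numpy as np

# Mean latencies (ms) reported by `tvmc run --print-time` above
untuned_mean_ms = 148.2297
tuned_mean_ms = 108.6551

speedup = untuned_mean_ms / tuned_mean_ms
reduction = 1 - tuned_mean_ms / untuned_mean_ms
print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.1%}")


def outputs_match(file_a, file_b, rtol=1e-4, atol=1e-5):
    """True if two tvmc run output .npz files hold the same keys with numerically close arrays."""
    with np.load(file_a) as a, np.load(file_b) as b:
        if set(a.files) != set(b.files):
            return False
        return all(np.allclose(a[k], b[k], rtol=rtol, atol=atol) for k in a.files)
```

With the numbers above this prints a speedup of 1.36x and a 26.7% latency reduction; `outputs_match("predictions_untuned.npz", "predictions_tuned.npz")` would then confirm that tuning changed only the speed, not the results.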
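III. Basic Implementation Flow
TVMC's code lives in the python/tvm/driver/tvmc directory. Since TVMC offers many command options, we focus here on how the three subcommands compile, tune, and run are implemented.
1. main
tvmc's command parsers are added through registration. main.py first defines a global list REGISTERED_PARSER and a registration function register_parser():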
REGISTERED_PARSER = []
def register_parser(make_subparser):
    REGISTERED_PARSER.append(make_subparser)
    return make_subparser
When a new command parser is added, for example the parser for the compile subcommand, compiler.py implements it as follows:
@register_parser
def add_compile_parser(subparsers, _):
    parser = subparsers.add_parser("compile", help="compile a model.")
    parser.set_defaults(func=drive_compile)  # set the func attribute
    ...
This appends the function object add_compile_parser to the REGISTERED_PARSER list. The _main() function in main.py then iterates over the list and calls each function:
def _main(argv):
    ...
    for make_subparser in REGISTERED_PARSER:
        make_subparser(subparser, parser)
    ...
    args = parser.parse_args(argv)
    ...
    try:
        return args.func(args)  # call the function stored in the func attribute
    except TVMCImportError as err:
        ...
For the tuned compile command from step (4), args contains:
Namespace(FILE='resnet50-v2-7.onnx', cross_compiler='', cross_compiler_options='', desired_layout=None, disabled_pass=[''], dump_code='', executor='graph',
......,
func=<function drive_compile at 0x7f12ca2f78b0>, input_shapes=None, model_format=None, opt_level=3, output='resnet50-v2-7-tvm_autotuned.tar', output_format='so', pass_config=None, runtime='cpp', target='llvm',
......
tuning_records='resnet50-v2-7-autotuner_records.json', verbose=0, version=False)
Here func=<function drive_compile at 0x7f12ca2f78b0> is the function the subcommand executes.
2. compiler
drive_compile() performs two main operations:
(1) Load the model with frontends.load_model()
frontends.py wraps the following frontends:
ALL_FRONTENDS = [
KerasFrontend,
OnnxFrontend,
TensorflowFrontend,
TFLiteFrontend,
PyTorchFrontend,
PaddleFrontend,
]
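frontends.py first defines an abstract base class whose subclasses must implement three methods. In load(), each frontend reads the model from the file path with its own library and then calls relay.frontend.from_xxx(model, ...) to load and convert it into a TVM model: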
from abc import ABC, abstractmethod
from tvm import relay
class Frontend(ABC):
    @staticmethod
    @abstractmethod
    def name():
        """Name of the frontend."""
    @staticmethod
    @abstractmethod
    def suffixes():
        """File suffixes of the model format."""
    @abstractmethod
    def load(self, path, shape_dict=None, **kwargs):
        """Load the model file and convert it to a TVM module."""
class OnnxFrontend(Frontend):
    @staticmethod
    def name():
        return "onnx"
    @staticmethod
    def suffixes():
        return ["onnx"]
    def load(self, path, shape_dict=None, **kwargs):
        onnx = lazy_import("onnx")
        model = onnx.load(path)
        return relay.frontend.from_onnx(model, shape=shape_dict, **kwargs)
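Because each frontend declares its file suffixes, the frontend can be chosen from the model file's extension when no format is given explicitly (model_format=None in the args shown earlier). The sketch below shows such suffix-based dispatch with trimmed-down stand-in classes that omit load(); the function guess_frontend and its error message are illustrative assumptions, not a copy of frontends.py.

```python
import os
from abc import ABC, abstractmethod


class Frontend(ABC):
    @staticmethod
    @abstractmethod
    def name(): ...

    @staticmethod
    @abstractmethod
    def suffixes(): ...


class OnnxFrontend(Frontend):
    @staticmethod
    def name():
        return "onnx"

    @staticmethod
    def suffixes():
        return ["onnx"]


class TFLiteFrontend(Frontend):
    @staticmethod
    def name():
        return "tflite"

    @staticmethod
    def suffixes():
        return ["tflite"]


ALL_FRONTENDS = [OnnxFrontend, TFLiteFrontend]


def guess_frontend(path):
    """Return an instance of the first frontend whose suffixes match the file extension."""
    suffix = os.path.splitext(path)[1].lower().lstrip(".")
    for frontend in ALL_FRONTENDS:
        if suffix in frontend.suffixes():
            return frontend()
    raise ValueError(f"cannot infer the model format of {path!r}; specify it explicitly")
```

Keeping the suffix list on each class means a new frontend only has to be appended to ALL_FRONTENDS to become discoverable.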
(2) Compile the model with compile_model()
This function does two things: it calls relay.build() to compile the model, and it exports the compiled result:
graph_module = relay.build(mod, target=tvm_target, executor=executor, runtime=runtime, params=params)
...
package_path = tvmc_model.export_package(graph_module, package_path, cross, cross_options, output_format)
# This eventually calls export_classic_format(), which writes the model information held by graph_module into the corresponding files and saves them as a tar package
...
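The classic tar format itself is simple: three files stored side by side in one archive. The sketch below writes such a package with the standard tarfile module from raw bytes. In TVM, those bytes would come from the build result (for instance the library exported with export_library, the JSON from get_graph_json(), and the parameters serialized with save_param_dict), but the function export_classic_package is a hypothetical simplification, not the actual export_classic_format().

```python
import io
import tarfile


def export_classic_package(package_path, lib_bytes, graph_json, params_bytes):
    """Write mod.so / mod.json / mod.params into a single tar file (illustrative only)."""
    members = {
        "mod.so": lib_bytes,
        "mod.json": graph_json.encode("utf-8"),
        "mod.params": params_bytes,
    }
    with tarfile.open(package_path, "w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)  # tarfile needs the size before reading the data
            tar.addfile(info, io.BytesIO(data))
    return package_path
```

Packing everything into one file is what lets tvmc run take a single .tar argument and recover the library, graph, and parameters from it.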
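3. autotuner
drive_tune() determines the hardware parameters from the configuration, checks whether to tune remotely over RPC, and then calls tune_model(). TVMC currently supports two auto-tuning methods, auto-scheduling and AutoTVM, with AutoTVM as the default; their final entry points for launching the tuning tasks are schedule_tasks() and tune_tasks(), respectively. Here we look at the implementation of the default method: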
def tune_tasks(
    tasks: List[autotvm.task.Task],
    log_file: str,
    measure_option: autotvm.measure_option,
    tuner: str,
    trials: int,
    early_stopping: Optional[int] = None,
    tuning_records: Optional[str] = None,
):
    if not tasks:
        logger.warning("there were no tasks found to be tuned")
        return
    if not early_stopping:
        early_stopping = trials
    # Tune each task in turn
    for i, tsk in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        # Create the tuner
        if tuner in ("xgb", "xgb-rank"):
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "xgb_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="knob")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise TVMCException("invalid tuner: %s " % tuner)
        # If earlier tuning records exist, tuning resumes from them (like resuming from a checkpoint)
        if tuning_records and os.path.exists(tuning_records):
            logger.info("loading tuning records from %s", tuning_records)
            start_time = time.time()
            tuner_obj.load_history(autotvm.record.load_from_file(tuning_records))
            logger.info("loaded history in %.2f sec(s)", time.time() - start_time)
        tuner_obj.tune(
            n_trial=min(trials, len(tsk.config_space)),
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(trials, prefix=prefix),
                autotvm.callback.log_to_file(log_file),
            ],
        )
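Two details of tune_tasks() are worth spelling out: the number of trials per task is capped at the size of the task's configuration space, and early_stopping defaults to trials, which effectively disables early stopping. The toy loop below imitates that bookkeeping over a precomputed list of costs. It is a simplified model with a hypothetical name (toy_tune), not AutoTVM's Tuner.tune(), which measures candidates in batches and lets the tuner decide which configuration to try next.

```python
def toy_tune(costs, trials, early_stopping=None):
    """Walk through candidate costs in order, mimicking n_trial and early stopping.

    costs: cost of each configuration in the space, in the order they are tried.
    Returns (best_index, best_cost, trials_used).
    """
    if not early_stopping:  # same default as tune_tasks(): stopping early never triggers
        early_stopping = trials
    n_trial = min(trials, len(costs))  # cannot try more configs than the space holds
    best_idx, best_cost, last_improvement = None, float("inf"), 0
    used = 0
    for i in range(n_trial):
        used += 1
        if costs[i] < best_cost:
            best_idx, best_cost, last_improvement = i, costs[i], i
        elif i - last_improvement >= early_stopping:
            break  # no improvement for `early_stopping` consecutive trials
    return best_idx, best_cost, used
```

For example, with costs [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8, 2.0] and early_stopping=3, the loop stops after 6 trials and never sees the best configuration at index 7, which illustrates the trade-off early stopping makes.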
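4. runner
drive_run() mainly decides whether to run remotely over RPC, calls run_module() to perform inference, and finally prints and saves the results. The main steps of run_module() are excerpted below, with the microTVM- and RPC-specific parts omitted: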
def run_module(tvmc_package: TVMCPackage, device: str,...):
    ...
    # Load the compiled model library (.so)
    session.upload(tvmc_package.lib_path)
    lib = session.load_module(tvmc_package.lib_name)
    ...
    # Select the target device
    if device == "cuda":
        dev = session.cuda()
    elif device == "cl":
        dev = session.cl()
    # ... (other device types omitted)
    else:
        dev = session.cpu()
    ...
    # Create the module object
    module = executor.create(tvmc_package.graph, lib, dev)
    ...
    # Load the trained model parameters
    module.load_params(tvmc_package.params)
    ...
    # Set the model inputs
    shape_dict, dtype_dict = module.get_input_info()
    inputs_dict = make_inputs_dict(shape_dict, dtype_dict, inputs, fill_mode)
    module.set_input(**inputs_dict)
    ...
    # Run inference and time it
    times = module.benchmark(dev, number=number, repeat=repeat, end_to_end=end_to_end)
    ...
    # Fetch the inference outputs
    num_outputs = module.get_num_outputs()
    outputs = {}
    for i in range(num_outputs):
        output_name = "output_{}".format(i)
        outputs[output_name] = module.get_output(i).numpy()
    return TVMCResult(outputs, times)
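run_module() returns the raw outputs together with the benchmark timings, and the output keys follow the pattern output_0, output_1, ..., which is why tvmc_post_process.py reads data["output_0"]. The sketch below mirrors both ideas with NumPy: it computes the five statistics printed under "Execution time summary" from a list of per-repeat timings, and saves outputs under the same key pattern. Whether TVM computes the standard deviation exactly like np.std (population standard deviation) is an assumption, and both helper names are hypothetical.

```python
import numpy as np


def time_summary(times_ms):
    """Mean/median/max/min/std of per-repeat latencies, as in the tvmc run summary table."""
    t = np.asarray(times_ms, dtype=float)
    return {
        "mean (ms)": float(np.mean(t)),
        "median (ms)": float(np.median(t)),
        "max (ms)": float(np.max(t)),
        "min (ms)": float(np.min(t)),
        "std (ms)": float(np.std(t)),  # population standard deviation (ddof=0)
    }


def save_outputs(path, output_arrays):
    """Save outputs under keys output_0, output_1, ... and return the key names."""
    outputs = {"output_{}".format(i): arr for i, arr in enumerate(output_arrays)}
    np.savez(path, **outputs)
    return sorted(outputs)
```

Reporting the median and standard deviation alongside the mean matters here: the untuned run above has a max of 217 ms against a median of 142 ms, and the spread would be hidden by the mean alone.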
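IV. Summary
This article covered how to use TVMC and its basic implementation flow. TVMC offers a rich set of command options and serves as a good example of how to use the TVM Python API; interested readers are encouraged to explore it in more depth.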