Porting TensorFlow Lite to an ARM Board (i.MX6)

Author: lucca_x | Published 2019-03-28 18:14

    In my last post the port to the LC1860C board failed, so I switched to a board with a newer and more complete set of libraries and pressed on.

    Running label_image

     ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite  -i ./grace_hopper.jpg -l ./imagenet_slim_labels.txt
    

    alloc failure

    The first problem I hit was a failed allocation.

    ...
    83: MobilenetV1/MobilenetV1/Conv2d_9_depthwise/weights_quant/FakeQuantWithMinMaxVars, 1152, 3, 0.0212288, 120
    84: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Conv2D_Fold_bias, 512, 2, 0.000260965, 0
    85: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6, 8192, 3, 0.0235285, 0
    86: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/weights_quant/FakeQuantWithMinMaxVars, 16384, 3, 0.0110914, 146
    87: MobilenetV1/Predictions/Reshape_1, 1001, 3, 0.00390625, 0
    88: input, 49152, 3, 0.0078125, 128
    len: 61306
    width, height, channels: -16842752, 1766213120, 246279780
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    Aborted (core dumped)
    

    At first I didn't think much of it; I assumed the board was just too weak and the label_image example needed more memory than it had, which is why it died. I planned to write a simpler example of my own to see whether that would crash too. Later, while reading up on TensorFlow to prepare that example, I noticed that the width, height, channels values in the log looked wrong: why were they so huge, and even negative? So I went through the label_image code carefully and found that at this point Invoke() hadn't even been called yet, i.e. TensorFlow Lite itself hadn't started running; the program was stuck inside read_bmp. Only then did it dawn on me that the image I was feeding it was a JPEG, not a BMP. After switching to a BMP file the alloc problem went away.

      std::vector<uint8_t> in = read_bmp(s->input_bmp_name, &image_width,
                                         &image_height, &image_channels, s);
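
    In hindsight, a quick sanity check on the input file would have surfaced the mistake immediately. Below is a minimal sketch (my own illustration, not code from label_image) that verifies the "BM" magic bytes of a BMP header before trusting the width/height fields and allocating anything:

    #include <fstream>
    #include <iostream>
    #include <string>

    // A valid BMP file starts with the two magic bytes "BM".
    // Feeding a JPEG to a naive BMP reader yields garbage width/height
    // values and, as seen above, an absurd allocation request.
    bool looks_like_bmp(const std::string& path) {
      std::ifstream file(path, std::ios::binary);
      char magic[2] = {0, 0};
      file.read(magic, 2);
      return file.gcount() == 2 && magic[0] == 'B' && magic[1] == 'M';
    }

    int main(int argc, char* argv[]) {
      if (argc < 2) {
        std::cerr << "usage: check_bmp <file>\n";
        return 1;
      }
      if (!looks_like_bmp(argv[1])) {
        std::cerr << argv[1] << " does not look like a BMP file\n";
        return 1;
      }
      std::cout << argv[1] << " looks like a BMP file\n";
      return 0;
    }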
    

    illegal instruction

    The next problem was an illegal instruction:

    ...
    Node  29 Operator Builtin Code  22
      Inputs: 1 5
      Outputs: 4
    Node  30 Operator Builtin Code  25
      Inputs: 4
      Outputs: 87
    Illegal instruction
    

    I had never run into an Illegal instruction error before. At first I assumed it was a log message from TensorFlow, but after searching the code and finding nothing, I learned that the message comes from Linux itself. It usually means the CPU on the board hit an instruction it doesn't recognize, typically because the architecture chosen at compile time doesn't match the ARM core actually on the board. So once again the build environment was to blame.
    I also learned that an illegal instruction produces a core dump, so let's start from the core file.
    Credit where it's due: https://blog.csdn.net/chyxwzn/article/details/8879750?utm_source=tuicool

    ulimit -c unlimited
    

    Re-run label_image to generate a core file, then load it in gdb:

    (gdb) bt
    #0  0x0001f030 in tflite::optimized_ops::ResizeBilinear(tflite::ResizeBilinearParams const&, tflite::RuntimeShape const&, float const*, tflite::RuntimeShape const&, int const*, tflite::RuntimeShape const&, float*) ()
    #1  0x00000000 in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)
    (gdb) p $pc
    $1 = (void (*)()) 0x1f030 <tflite::optimized_ops::ResizeBilinear(tflite::ResizeBilinearParams const&, tflite::RuntimeShape const&, float const*, tflite::RuntimeShape const&, int const*, tflite::RuntimeShape const&, float*)+7720>
    (gdb) p $sp
    $2 = (void *) 0x7e9e43c8
    (gdb) x/5i $pc
    => 0x1f030 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7720>:
        vfma.f32    s14, s13, s15
       0x1f034 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7724>:
        vstmia      r2!, {s14}
       0x1f038 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7728>:
        bgt 0x1f01c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7700>
       0x1f03c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7732>:
        vorr        d30, d16, d16
       0x1f040 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7736>:
        vorr        d31, d17, d17
    

    So the crash is on the vfma.f32 s14, s13, s15 instruction. vfma is a fused multiply-accumulate introduced with VFPv4, and it looks like the board I have doesn't support that floating-point operation.

    First, check the architecture attributes of the compiled binary:

    root@imx6dl-albatross2:~/march_build# readelf -A label_image
    Attribute Section: aeabi
    File Attributes
      Tag_CPU_name: "7-A"
      Tag_CPU_arch: v7
      Tag_CPU_arch_profile: Application
      Tag_ARM_ISA_use: Yes
      Tag_THUMB_ISA_use: Thumb-2
      Tag_FP_arch: VFPv4
      Tag_Advanced_SIMD_arch: NEONv1 with Fused-MAC
      Tag_ABI_PCS_wchar_t: 4
      Tag_ABI_FP_rounding: Needed
      Tag_ABI_FP_denormal: Needed
      Tag_ABI_FP_exceptions: Needed
      Tag_ABI_FP_number_model: IEEE 754
      Tag_ABI_align_needed: 8-byte
      Tag_ABI_align_preserved: 8-byte, except leaf SP
      Tag_ABI_enum_size: int
      Tag_ABI_VFP_args: VFP registers
      Tag_CPU_unaligned_access: v6
    ...
    

    So the binary was built to use the VFPv4 instruction set.
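
    A quick way to cross-check what the toolchain assumes, independent of readelf, is to compile a tiny program with the same cross-compiler and flags and print the ACLE feature macros. This is a minimal sketch of my own (the macros __ARM_ARCH, __ARM_NEON and __ARM_FEATURE_FMA are standard ACLE/GCC predefines, not anything from TensorFlow):

    #include <iostream>

    // Prints the floating-point features the compiler was told to assume.
    // Build it with the exact flags used for label_image; if it reports
    // FMA support but the CPU only implements VFPv3, the flags are wrong.
    int main() {
    #ifdef __ARM_ARCH
      std::cout << "__ARM_ARCH: " << __ARM_ARCH << "\n";
    #endif
    #ifdef __ARM_NEON
      std::cout << "NEON: yes\n";
    #else
      std::cout << "NEON: no\n";
    #endif
    #ifdef __ARM_FEATURE_FMA
      std::cout << "FMA (vfma, VFPv4): yes\n";
    #else
      std::cout << "FMA (vfma, VFPv4): no\n";
    #endif
      return 0;
    }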

    Next, check what the board actually supports:

    root@imx6dl-albatross2:~# gcc -march=native -Q --help=target|grep march
      -march=                               armv7-a
      Known ARM architectures (for use with the -march= option):
    
    root@imx6dl-albatross2:~# cat /proc/cpuinfo
    processor       : 0
    model name      : ARMv7 Processor rev 10 (v7l)
    BogoMIPS        : 3.00
    Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
    CPU implementer : 0x41
    CPU architecture: 7
    CPU variant     : 0x2
    CPU part        : 0xc09
    CPU revision    : 10
    
    processor       : 1
    model name      : ARMv7 Processor rev 10 (v7l)
    BogoMIPS        : 3.00
    Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
    CPU implementer : 0x41
    CPU architecture: 7
    CPU variant     : 0x2
    CPU part        : 0xc09
    CPU revision    : 10
    
    Hardware        : Freescale i.MX6 Quad/DualLite (Device Tree)
    Revision        : 0000
    Serial          : 0000000000000000
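
    The Features line already answers the question, but as a cross-check (and something a program can test at runtime before picking a code path) the same information is exposed through the kernel's hwcap bits. A minimal sketch of my own, assuming glibc 2.16+ for getauxval and the 32-bit ARM HWCAP_* constants from <asm/hwcap.h>:

    #include <sys/auxv.h>   // getauxval, AT_HWCAP
    #include <asm/hwcap.h>  // HWCAP_NEON, HWCAP_VFPv3, HWCAP_VFPv4
    #include <iostream>

    // Reports which VFP/NEON features the running CPU actually has,
    // as seen by the kernel (same source as /proc/cpuinfo's Features).
    int main() {
      unsigned long hwcap = getauxval(AT_HWCAP);
      std::cout << "NEON : " << ((hwcap & HWCAP_NEON)  ? "yes" : "no") << "\n";
      std::cout << "VFPv3: " << ((hwcap & HWCAP_VFPv3) ? "yes" : "no") << "\n";
      std::cout << "VFPv4: " << ((hwcap & HWCAP_VFPv4) ? "yes" : "no") << "\n";
      return 0;
    }

    On this board it should report VFPv4: no, consistent with the crash above.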
    

    The board, however, only supports VFPv3 (see the Features line above), and this mismatch is what makes the instruction unrecognized.
    So the VFP compile flags need to be changed.
    At first I added -mfpu=vfpv3 to CXXFLAGS in tensorflow/contrib/lite/tools/make/Makefile, but the resulting binary was still VFPv4. Looking at the compile log you can see why:

    arm-poky-linux-gnueabi-g++  -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a9 --sysroot=/opt/fsl-imx-x11/4.1.15-1.2.0/sysroots/cortexa9hf-vfp-neon-poky-linux-gnueabi -O3 -DNDEBUG -mfpu=vfpv3 -march=armv4t --std=c++11 -march=armv7-a -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize -fPIC -I. -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/eigen -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/absl -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/gemmlowp -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/neon_2_sse -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/farmhash/src -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/flatbuffers/include -I -I/usr/local/include -c tensorflow/contrib/lite/kernels/slice.cc -o /home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/gen/rpi_armv7l/obj/tensorflow/contrib/lite/kernels/slice.o
    

    The command line still contains -mfpu=neon-vfpv4, and since it appears after my -mfpu=vfpv3, the later flag wins — so it must be set somewhere other than the Makefile. A project-wide search for -mfpu turned up another definition in tensorflow/contrib/lite/tools/make/targets/rpi_makefile.inc, whose name alone makes it clear it gets pulled into the build, so I commented out both -mfpu=neon-vfpv4 \ lines there.

        CXXFLAGS += \
          -march=armv7-a \
          -mfpu=neon-vfpv4 \
          -funsafe-math-optimizations \
          -ftree-vectorize \
          -fPIC
        CCFLAGS += \
          -march=armv7-a \
          -mfpu=neon-vfpv4 \
          -funsafe-math-optimizations \
          -ftree-vectorize \
          -fPIC
    

    After rebuilding, VFPv4 indeed no longer shows up in the compile log, and the new label_image runs fine on the board:

    root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 1 -m ./mobilenet_v1_0.25_128_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
    ...
    Node  30 Operator Builtin Code  25
      Inputs: 4
      Outputs: 87
    invoked
    average time: 380.068 ms
    0.164706: 401 academic gown
    0.145098: 835 suit
    0.0745098: 668 mortarboard
    0.0745098: 458 bow tie
    0.0509804: 653 military uniform
    

    The result doesn't look great, though — other people mostly get military uniform with high probability. It's probably the mobilenet_v1_0.25_128_quant.tflite model I used; let's try mobilenet_v1_1.0_224.tflite instead:

    root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
    ...
    Node  30 Operator Builtin Code  25
      Inputs: 31
      Outputs: 86
    invoked
    average time: 2784.13 ms
    0.860174: 653 military uniform
    0.0481022: 907 Windsor tie
    0.007867: 466 bulletproof vest
    0.00644933: 514 cornet
    0.00608031: 543 drumstick
    

    Now the top result is correct, but the runtime got much longer.
    For comparison, https://blog.csdn.net/computerme/article/details/80345065 reports only about 800 ms, so this board may simply not have enough horsepower.
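
    One knob worth trying before blaming the hardware outright is the interpreter's thread count, since this i.MX6 has more than one Cortex-A9 core. The sketch below is my own illustration of timing Invoke() with a given number of threads, using the TensorFlow Lite C++ API as laid out in the 1.12 contrib tree (header paths and exact behaviour may differ in other versions):

    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <memory>

    #include "tensorflow/contrib/lite/kernels/register.h"
    #include "tensorflow/contrib/lite/model.h"

    int main(int argc, char* argv[]) {
      if (argc < 2) {
        std::cerr << "usage: bench <model.tflite> [threads]\n";
        return 1;
      }
      const int num_threads = (argc > 2) ? std::atoi(argv[2]) : 2;

      // Load the flatbuffer model and build an interpreter for it.
      auto model = tflite::FlatBufferModel::BuildFromFile(argv[1]);
      tflite::ops::builtin::BuiltinOpResolver resolver;
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);

      interpreter->SetNumThreads(num_threads);
      interpreter->AllocateTensors();

      // Time a handful of invocations. The input tensor is left as
      // allocated, which is enough for a rough speed comparison.
      const int runs = 10;
      auto start = std::chrono::steady_clock::now();
      for (int i = 0; i < runs; ++i) interpreter->Invoke();
      auto end = std::chrono::steady_clock::now();

      std::cout << "average time: "
                << std::chrono::duration<double, std::milli>(end - start).count() / runs
                << " ms with " << num_threads << " threads\n";
      return 0;
    }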

    Switching to the quantized mobilenet_v1_1.0_224_quant.tflite:

    root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 4 -m ./mobilenet_v1_1.0_224_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
    ...
    Node  30 Operator Builtin Code  25
      Inputs: 4
      Outputs: 87
    invoked
    average time: 2311.57 ms
    0.780392: 653 military uniform
    0.105882: 907 Windsor tie
    0.0156863: 458 bow tie
    0.0117647: 466 bulletproof vest
    0.00784314: 835 suit
    

    For the author of the post linked above, quantization cut the runtime dramatically; mine barely moved...

    mobilenet_v2_1.0_224_quant.tflite doesn't seem to help much either...

    root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 4 -m ./mobilenet_v2_1.0_224_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
    ...
    Node  64 Operator Builtin Code  22
      Inputs: 7 10
      Outputs: 172
    invoked
    average time: 2073.31 ms
    0.717647: 653 military uniform
    0.560784: 835 suit
    0.533333: 458 bow tie
    0.52549: 907 Windsor tie
    0.517647: 753 racket
    

    I'll leave the performance question for later. At least TensorFlow Lite is up and running, and I can get on with writing my own example.
    A perfect end to the workday!
