Vectorization: Vectorized Programming in Practice [WIP]

Author: rickif | Published: 2022-07-29 00:46

This article is a hands-on look at vectorized programming.

Vectorization is the process of converting an algorithm from a scalar implementation, which does an operation one pair of operands at a time, to a vector process where a single instruction can refer to a vector (a series of adjacent values).

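Unlike an ordinary scalar program, a vectorized program can process a whole vector (several adjacent values) with one instruction. This is SIMD (single instruction, multiple data), a form of data-level parallelism. Vectorization comes in two kinds: auto-vectorization, which the compiler performs on its own, and explicit vectorization, where the developer calls vector instructions directly.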
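To make the second kind concrete, here is a minimal sketch of explicit vectorization using SSE2 intrinsics. The function name add_explicit is hypothetical and not from the article; the code assumes an x86 target and falls back to a plain scalar loop when SSE2 is not available.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#endif

// Hypothetical helper for illustration: dst[i] = x[i] + y[i] for i in [0, n).
void add_explicit(int* dst, const int* x, const int* y, std::size_t n) {
  std::size_t i = 0;
#if defined(__SSE2__)
  // One 128-bit (16-byte) add handles four 32-bit ints per iteration.
  for (; i + 4 <= n; i += 4) {
    __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                     _mm_add_epi32(vx, vy));
  }
#endif
  // Scalar tail; also the whole loop on targets without SSE2.
  for (; i < n; ++i) {
    dst[i] = x[i] + y[i];
  }
}
```

The unaligned load and store variants are used so the sketch makes no assumption about how the arrays are aligned.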

Auto-vectorization

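With auto-vectorization, the compiler tries to use SIMD instructions on its own. The programmer can also supply extra information, such as hints and pragmas, to guide the compiler.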
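As a rough illustration of such hints, the sketch below combines two of them: the __restrict qualifier (a compiler extension accepted by GCC, Clang and MSVC) and GCC's #pragma GCC ivdep. The function name add_hinted is hypothetical. Both hints are promises made by the programmer, so the caller has to guarantee that the three arrays do not overlap.

```cpp
// Hypothetical helper for illustration: dst[i] = x[i] + y[i] for i in [0, n).
// __restrict promises that dst, x and y never alias each other.
void add_hinted(int* __restrict dst, const int* __restrict x,
                const int* __restrict y, int n) {
  // GCC-specific hint: assume there are no loop-carried dependencies.
  // Other compilers may ignore this pragma (possibly with a warning).
#pragma GCC ivdep
  for (int i = 0; i < n; ++i) {
    dst[i] = x[i] + y[i];
  }
}
```

Without such hints the compiler may have to emit a runtime overlap check, or give up on vectorizing the loop, because it cannot prove that writing dst[i] leaves x and y unchanged.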

Guidelines for writing vectorizable code

Use:

straight-line code (a single basic block)
vector data only; that is, arrays and invariant expressions on the right hand side of assignments. Array references can appear on the left hand side of assignments.
only assignment statements.

Avoid:

function calls (other than math library calls)
non-vectorizable operations (either because the loop cannot be vectorized, or because an operation is emulated through a number of instructions)
mixing vectorizable types in the same loop (leads to lower resource utilization)
data-dependent loop exit conditions (leads to loss of vectorization)
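The last rule can be illustrated with a small sketch. Both functions below are hypothetical and answer the same question (does the array contain key?), but only the second follows the guidelines: fixed trip count, straight-line body, assignments only. Whether a given compiler actually vectorizes either loop depends on its version and flags, so this is an illustration of the pattern, not a guarantee.

```cpp
#include <cstddef>

// Data-dependent exit: the trip count depends on the values read,
// which typically prevents or complicates vectorization.
bool contains_early_exit(const int* a, std::size_t n, int key) {
  for (std::size_t i = 0; i < n; ++i) {
    if (a[i] == key) return true;
  }
  return false;
}

// Same result with a fixed trip count: every element is visited and the
// body is a single assignment (an OR-reduction into found).
bool contains_no_exit(const int* a, std::size_t n, int key) {
  int found = 0;
  for (std::size_t i = 0; i < n; ++i) {
    found |= (a[i] == key);
  }
  return found != 0;
}
```

The second version always scans the whole array, so it trades extra work on early matches for a loop shape the vectorizer can handle more easily.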

const int repeat_times = 1000000;
const int sz = 8192;

int a[sz], b[sz], c[sz];
int foo () {
  for (int i=0; i<sz; i++){ // loop vectorized
    a[i] = b[i] + c[i];
  }

  int sum = 0;
  for (int i=0; i<sz; i++){ // loop vectorized
    sum += a[i];
  }
  return sum;
}

int main() {
  int sum = 0;
  for(int i = 0; i < repeat_times; ++i) {
    sum += foo();
  }
  return sum;
}

We pass -fopt-info-vec-optimized so that the compiler prints its vectorization report to the terminal. The -fopt-info-vec- prefix accepts several levels, including all, note, optimized and missed.

> g++ main.cc -fopt-info-vec-optimized -mavx512f  -O3 -o a5.out
main.cc:11:18: optimized: loop vectorized using 64 byte vectors
main.cc:6:18: optimized: loop vectorized using 64 byte vectors
main.cc:11:18: optimized: loop vectorized using 64 byte vectors
main.cc:6:18: optimized: loop vectorized using 64 byte vectors
> time ./a5.out
./a5.out  0.79s user 0.00s system 99% cpu 0.800 total

Building with different compiler options gives the following execution times.

Compiler options	Execution time	Optimization
-O3 -fno-tree-vectorize	10.31s	none
-O3	2.62s	loop vectorized using 16 byte vectors
-msse4.2 -O3	2.53s	loop vectorized using 16 byte vectors
-mavx2 -O3	1.44s	loop vectorized using 32 byte vectors
-mavx512f -O3	0.79s	loop vectorized using 64 byte vectors
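To relate the reported vector widths to the timings, the sketch below computes how many int elements fit in one vector and the speedup relative to the non-vectorized build. The helper names lanes and speedup are hypothetical, and the lane count assumes the array element type is int, as in the example above.

```cpp
#include <cstddef>

// Number of int elements that fit in one vector register of the given width.
constexpr std::size_t lanes(std::size_t vector_bytes) {
  return vector_bytes / sizeof(int);
}

// Speedup of a build relative to a baseline build, from wall-clock times.
constexpr double speedup(double baseline_seconds, double seconds) {
  return baseline_seconds / seconds;
}
```

With a 4-byte int, lanes(16), lanes(32) and lanes(64) are 4, 8 and 16, while the measured speedups over the -fno-tree-vectorize build, such as speedup(10.31, 0.79), come out at roughly 3.9x, 7.2x and 13x. The gain tracks the vector width but falls somewhat short of it at the wider widths.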