This article is a hands-on look at vectorized programming.
Vectorization is the process of converting an algorithm from a scalar implementation, which does an operation one pair of operands at a time, to a vector process where a single instruction can refer to a vector (a series of adjacent values).
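Unlike an ordinary scalar program, a vectorized program can process a whole vector (several adjacent values) with a single instruction, which makes it a form of data parallelism at the instruction level (SIMD). Vectorization comes in two flavors: auto-vectorization, performed automatically by the compiler, and explicit vectorization, where the developer calls vector instructions directly.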
Auto-vectorization
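With auto-vectorization, the compiler tries to use SIMD instructions on its own. The user can also supply extra information (such as hints and pragmas) to guide the compiler.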
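As a rough illustration of such hints, the sketch below uses two GCC-specific mechanisms: the `__restrict__` qualifier and `#pragma GCC ivdep`. The function name `add_arrays` is made up for this example and is not part of the benchmark in this article. Whether the loop is actually vectorized still depends on the compiler version and flags.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only.
// __restrict__ promises the compiler that dst, x and y never overlap.
// "#pragma GCC ivdep" tells GCC to assume there are no loop-carried
// dependences. Both are hints: the compiler may still decline to vectorize.
void add_arrays(int* __restrict__ dst,
                const int* __restrict__ x,
                const int* __restrict__ y,
                std::size_t n) {
#pragma GCC ivdep
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = x[i] + y[i];
    }
}
```

Calling `add_arrays(d, x, y, n)` with three non-overlapping arrays of length `n` fills `d` with the element-wise sums. Passing overlapping arrays would break the promise made by `__restrict__`.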
Vectorization programming guidelines
Use:
- straight-line code (a single basic block)
- vector data only; that is, arrays and invariant expressions on the right hand side of assignments. Array references can appear on the left hand side of assignments.
- only assignment statements.
Avoid:
- function calls (other than math library calls)
- non-vectorizable operations (either because the loop cannot be vectorized, or because an operation is emulated through a number of instructions)
- mixing vectorizable types in the same loop (leads to lower resource utilization)
- data-dependent loop exit conditions (leads to loss of vectorization)
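To make the guidelines a little more concrete, here is a small sketch contrasting a loop that follows them with one that has a data-dependent exit. Both function names are invented for this example. The second loop is typically harder for a compiler to vectorize, although the outcome depends on the compiler and its version.

```cpp
#include <cstddef>

// Follows the guidelines: a single basic block, array reads on the right,
// an array write on the left, and a trip count known before the loop starts.
void scale_add(float* out, const float* in, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * k + 1.0f;
    }
}

// Goes against the last guideline: the loop returns as soon as a negative
// value is seen, so the number of iterations depends on the data.
// Returns the index of the first negative element, or n if there is none.
std::size_t first_negative(const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] < 0.0f) {
            return i;
        }
    }
    return n;
}
```

Compiling a file containing both functions with -fopt-info-vec-all is one way to see how a given GCC version treats each loop.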
const int repeat_times = 1000000;
const int sz = 8192;
int a[sz], b[sz], c[sz];

int foo () {
    for (int i=0; i<sz; i++){ // loop vectorized
        a[i] = b[i] + c[i];
    }

    int sum = 0;
    for (int i=0; i<sz; i++){ // loop vectorized
        sum += a[i];
    }
    return sum;
}

int main() {
    int sum = 0;
    for(int i = 0; i < repeat_times; ++i) {
        sum += foo();
    }
    return sum;
}
We pass -fopt-info-vec-optimized so that the compiler prints its vectorization report to the terminal. The -fopt-info-vec- prefix also accepts other suffixes, such as all, note, and optimized.
> g++ main.cc -fopt-info-vec-optimized -mavx512f -O3 -o a5.out
main.cc:11:18: optimized: loop vectorized using 64 byte vectors
main.cc:6:18: optimized: loop vectorized using 64 byte vectors
main.cc:11:18: optimized: loop vectorized using 64 byte vectors
main.cc:6:18: optimized: loop vectorized using 64 byte vectors
> time ./a5.out
./a5.out 0.79s user 0.00s system 99% cpu 0.800 total
Building with different compiler options gives the following run times.
Compiler options | Run time | Vectorization reported |
---|---|---|
-O3 -fno-tree-vectorize | 10.31s | none |
-O3 | 2.62s | loop vectorized using 16 byte vectors |
-msse4.2 -O3 | 2.53s | loop vectorized using 16 byte vectors |
-mavx2 -O3 | 1.44s | loop vectorized using 32 byte vectors |
-mavx512f -O3 | 0.79s | loop vectorized using 64 byte vectors |
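One way to read the table is to compare each speedup over the non-vectorized build with the number of 32-bit lanes in the vector width the compiler reported. The two helpers below are a sketch of that arithmetic. Their names are made up for this example, and the timings are simply the ones measured above.

```cpp
#include <cstdint>

// Number of 32-bit integer lanes that fit in a vector of the given width.
constexpr int lanes(int vector_bytes) {
    return vector_bytes / static_cast<int>(sizeof(std::int32_t));
}

// Speedup of a run relative to the non-vectorized baseline.
constexpr double speedup(double baseline_seconds, double seconds) {
    return baseline_seconds / seconds;
}
```

With the 10.31s baseline, `speedup(10.31, 2.62)`, `speedup(10.31, 1.44)` and `speedup(10.31, 0.79)` come out at about 3.9, 7.2 and 13.1, against `lanes(16)`, `lanes(32)` and `lanes(64)` of 4, 8 and 16. On this benchmark the gain stays somewhat below the lane count.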
Explicit vectorization
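As a minimal sketch of what calling vector instructions directly can look like, the function below adds two int arrays with SSE2 intrinsics, four 32-bit values at a time. The name `add_explicit` is invented for this example, and the code assumes `int` is 32 bits wide. On targets without SSE2 it falls back to the plain scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: element-wise dst[i] = x[i] + y[i].
// Assumes int is 32 bits wide, so four ints fit in one 16-byte vector.
void add_explicit(int* dst, const int* x, const int* y, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    // Main loop: four ints per iteration, using unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm_add_epi32(vx, vy));
    }
#endif
    // Scalar tail for the leftover elements (and the whole array when
    // SSE2 is not available).
    for (; i < n; ++i) {
        dst[i] = x[i] + y[i];
    }
}
```

The same pattern extends to wider registers (for example the 32-byte and 64-byte vectors reported by the compiler above), at the cost of requiring the matching instruction set at build time.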
References
Vectorization (intel.com)
gcc Multiply.c Driver.c -lm -O3 -fopt-info-vec-all -o a2.out