【CUDA】Study Notes (5): Reduction Optimization


Author: 不会code的程序猿 | Published 2017-05-02 21:02

    Reference: Professional CUDA C Programming
    Code download: http://www.wrox.com/WileyCDA/

    Parallel Reduction

    A serial reduction:

    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += array[i];
    

    A reduction uses a commutative, associative operator, so the result is independent of evaluation order; max, min, and similar problems can therefore be handled the same way.
    The basic parallel approach:
    ➤ 1. Partition the input vector into smaller data blocks.
    ➤ 2. One thread computes the partial sum of one data block.
    ➤ 3. Accumulate the partial sums of all data blocks.
    The most common scheme: each data block holds just two elements, and one thread sums that pair; the element count halves with every iteration until a single value remains (each iteration operates in place on the same array). The pairs can be formed in two ways:
    ➤ neighbored pairs (adjacent elements)
    ➤ interleaved pairs (elements separated by a fixed stride)


    (Figure: Screenshot from 2017-05-02 14:42:59.png)

    Serial implementation in C:

    // Recursive Implementation of Interleaved Pair Approach
    int cpuRecursiveReduce(int *data, int const size)
    {
        // stop condition
        if (size == 1) return data[0];
        // renew the stride
        int const stride = size / 2;
        // in-place reduction
        for (int i = 0; i < stride; i++)
        {
            data[i] += data[i + stride];
        }
        // call recursively
        return cpuRecursiveReduce(data, stride);
    }
    

    1. Neighbored Pair Implementation with Divergence

    Each thread adds two neighboring elements:

    __global__ void reduceNeighbored (int *g_idata, int *g_odata, unsigned int n)
    {
        // set thread ID
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // convert global data pointer to the local pointer of this block
        int *idata = g_idata + blockIdx.x * blockDim.x;
    
        // boundary check
        if (idx >= n) return;
    
        // in-place reduction in global memory
        for (int stride = 1; stride < blockDim.x; stride *= 2)
        {
            if ((tid % (2 * stride)) == 0)
            {
                idata[tid] += idata[tid + stride];
            }
            // synchronize within threadblock
            __syncthreads();
        }
    
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = idata[0];
    }
    
    How one block proceeds (block 0 handles data 0 through data 7).
    The starting address of this block's data:

    int *idata = g_idata + blockIdx.x * blockDim.x;
    for (int stride = 1; stride < blockDim.x; stride *= 2)

    1. stride = 1: threads tid = 0, 2, 4, 6 each add a neighboring pair.
    __syncthreads(); // guarantees threads 0, 2, 4, 6 of the block have all finished this iteration before the next begins

    2. stride = 2: threads tid = 0 and 4 participate.
    3. stride = 4: only tid = 0 participates; the sum of the eight values 3 1 7 0 4 1 6 3 is complete and written out as g_odata[0] = 25.


    (Figure: Screenshot from 2017-05-02 15:17:56.png)

    Suppose there are 64 elements and 8 blocks, each block reducing 8 elements. g_idata holds the raw input; g_odata has 8 entries, where g_odata[i] is the partial sum of block i. A final serial pass over g_odata on the CPU then yields the total of all the data.

    nvcc -arch=sm_35 -rdc=true nestedReduce.cu -o nestedReduce
    ./nestedReduce
    

    Note: compiling with plain nvcc (without -rdc=true) fails because the code uses dynamic parallelism; see:
    http://stackoverflow.com/questions/19287461/compiling-code-containing-dynamic-parallelism-fails

    2. Improving Divergence in Parallel Reduction

    Looking at the neighbored-pair kernel, the line

    if ((tid % (2 * stride)) == 0)

    means a thread only works when its tid satisfies the condition. For example, with 8 elements:
    First iteration: tid = 0, 2, 4, 6 — only half the threads compute.
    Second iteration: tid = 0, 4 — only 25% of the threads compute.
    Third iteration: tid = 0 — a single thread computes.
    Although only some threads do useful work, all of them still have to be scheduled, because scheduling happens per warp, and a warp is 32 consecutive threads. In the second iteration, for instance, among threads 0-4 only 0 and 4 compute, yet threads 1 and 3 must still be scheduled with them.
    The fix is to remap which data each thread handles so the working threads are packed together, reducing divergent branches while each thread still adds neighboring elements.

    (Figure: Screenshot from 2017-05-02 16:49:31.png)
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        // convert tid into local array index -- the key difference
        int index = 2 * stride * tid;
        if (index < blockDim.x)
        {
            idata[index] += idata[index + stride];
        }

        // synchronize within threadblock
        __syncthreads();
    }
    

    With 512 threads per block (16 warps): in the first iteration 8 warps do the reduction and 8 warps do nothing; in the second, 4 warps work and 12 do nothing; in the third, 2 warps work and 14 do nothing; in the fourth, 1 warp works, and so on. Idle warps fail the branch as a whole instead of diverging within a warp.

    3. Reducing with Interleaved Pairs

    (Figure: Screenshot from 2017-05-02 17:30:52.png)
    Each thread adds two elements separated by a fixed stride; only the global memory access pattern differs:
    __global__ void reduceInterleaved (int *g_idata, int *g_odata, unsigned int n)
    {
        // set thread ID
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
        // convert global data pointer to the local pointer of this block
        int *idata = g_idata + blockIdx.x * blockDim.x;
    
        // boundary check
        if(idx >= n) return;
    
        // in-place reduction in global memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
        {
            if (tid < stride)
            {
                idata[tid] += idata[tid + stride];
            }
    
            __syncthreads();
        }
    
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = idata[0];
    }
    

    Unrolling Loops

    Loop unrolling is an optimization technique that lowers branch frequency and reduces loop-maintenance overhead.

    // Version 1
    for (int i = 0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }
    // Version 2: unrolled by a factor of 2
    for (int i = 0; i < 100; i += 2) {
        a[i]   = b[i]   + c[i];
        a[i+1] = b[i+1] + c[i+1];
    }
    

    The reason for the speedup is invisible at the source level; it shows up at the low-level instruction level. The second version halves the number of loop iterations, and its two statements perform independent reads and writes that the CPU can execute concurrently, so it outperforms the first.
    Unrolling matters even more in CUDA. The goal is still to reduce instruction overhead and expose more independent instructions, producing more concurrent operations, higher instruction and memory throughput (bandwidth), and thus more eligible warps to help hide instruction/memory latency.

    4. Reducing with Unrolling

    In reduceInterleaved, one thread block processes one data block. How can we unroll so that a single thread block processes two data blocks? By having each thread block combine elements from two data blocks — a cyclic partitioning (introduced in Chapter 1): each thread works across multiple data blocks, handling a single element in each.
    Step 1: each thread adds the corresponding elements of two adjacent data blocks.

     block0: 1  0  1  1  3  4  0  2
     block1: 1  2  5  7  2  1  6  3
     sum:    2  2  6  8  5  5  6  5   (written back into block0)

    Step 2: the thread block then reduces the combined data block exactly as in reduceInterleaved.

    __global__ void reduceUnrolling2 (int *g_idata, int *g_odata, unsigned int n)
    {
        // set thread ID
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;
    
        // convert global data pointer to the local pointer of this block
        int *idata = g_idata + blockIdx.x * blockDim.x * 2;
    
        // unrolling 2
        if (idx + blockDim.x < n) g_idata[idx] += g_idata[idx + blockDim.x];
    
        __syncthreads();
    
        // in-place reduction in global memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
        {
            if (tid < stride)
            {
                idata[tid] += idata[tid + stride];
            }
    
            // synchronize within threadblock
            __syncthreads();
        }
    
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = idata[0];
    }
    
    
    (Figure: Screenshot from 2017-05-02 19:59:46.png)
        int size = 1 << 24; // total number of elements to reduce
        int blocksize = 512;   // initial block size
        dim3 block (blocksize, 1);
        dim3 grid  ((size + block.x - 1) / block.x, 1);
        // each block now handles two data blocks, so halve the grid
        reduceUnrolling2<<<grid.x / 2, block>>>(d_idata, d_odata, size);
    

    Experiment 1: Improving Divergence in Parallel Reduction

    nvprof --metrics inst_per_warp ./reduceInteger
    Device "GeForce GT 740M (0)"
    reduceNeighbored Instructions per warp  291.625000  291.625000  291.625000
    reduceNeighboredLess Instructions per warp  115.812500  115.812500  115.812500
    nvprof --metrics gld_throughput ./reduceInteger
    reduceNeighbored Global Load Throughput  16.840GB/s  16.840GB/s  16.839GB/s
    reduceNeighboredLess Global Load Throughput  17.905GB/s  17.905GB/s  17.905GB/s
    

    Experiment 2: Reducing with Interleaved Pairs

    ./reduceInteger starting reduction at device 0: GeForce GT 740M     with array size 16777216  grid 32768 block 512
    cpu reduce      elapsed 0.046758 sec cpu_sum: 2139353471
    gpu Neighbored  elapsed 0.031896 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Neighbored2 elapsed 0.029930 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Interleaved elapsed 0.018442 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    

    Experiment 3: different unrolling factors
    reduceUnrolling4: each thread block handles 4 data blocks
    reduceUnrolling8: each thread block handles 8 data blocks
    More independent load/store operations in a single thread yield better performance, because memory latency is hidden more effectively. This can be verified with nvprof's dram_read_throughput metric:

    nvprof --metrics dram_read_throughput ./reduceInteger
    // GT 740M
    reduceUnrolling8 Device Memory Read Throughput  10.086GB/s  
    reduceUnrolling4 Device Memory Read Throughput  7.5325GB/s  
    reduceUnrolling2 Device Memory Read Throughput  5.6396GB/s  
    

    5. Reducing with Unrolled Warps

    __syncthreads is a block-level barrier. In the reduction kernels it guarantees, on every iteration, that all threads of the block have finished their global-memory writes before the next stage begins. But what about the point where 32 or fewer threads (a single warp) remain? Under the SIMT model, the threads of a warp execute in implicit lockstep, so the last six iterations can be unrolled as:

     // unrolling warp
        if (tid < 32)
        {
            volatile int *vmem = idata;
            vmem[tid] += vmem[tid + 32];
            vmem[tid] += vmem[tid + 16];
            vmem[tid] += vmem[tid +  8];
            vmem[tid] += vmem[tid +  4];
            vmem[tid] += vmem[tid +  2];
            vmem[tid] += vmem[tid +  1];
        }
    

    Warp unrolling removes the __syncthreads calls for the final warp. Note the volatile qualifier: it tells the compiler that every assignment must be stored straight back to global memory. Without it, the compiler or a cache could optimize away some of the global/shared-memory reads and writes. With volatile, the compiler assumes the value may be modified by other threads, so every access goes directly to memory rather than to a cache or register.
    At first I did not understand what "unrolling the warp" meant. Within a single warp, instructions follow SIMT: all 32 threads execute the same instruction at the same moment, so once no more than 32 threads are active, no explicit synchronization is needed.
    32 threads reducing 64 values:
    Iteration 1: [0]+[32], [1]+[33], ... [31]+[63]. The 32 threads are in one warp, so these additions are implicitly synchronized; results land in [0-31].
    Iteration 2: [0]+[16], [1]+[17], ... [31]+[47]. Again implicitly synchronized; all of [0-31] are written, but only [0-15] hold meaningful results.
    Iteration 3: [0]+[8], [1]+[9], ... ; meaningful results in [0-7].
    Iteration 4: [0]+[4], [1]+[5], ... ; meaningful results in [0-3].
    Iteration 5: [0]+[2], [1]+[3], ... ; meaningful results in [0-1].
    Iteration 6: [0]+[1]; the final sum is in [0].
    Throughout, because everything happens inside one SIMT warp, no block-level synchronization is required.

    reduceUnrollWarps8<<<grid.x / 8, block>>> (d_idata, d_odata, size);
    // results on a GT 740M
    gpu Unrolling8  elapsed 0.097200 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu UnrollWarp8 elapsed 0.096202 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    nvprof --metrics stall_sync ./reduceInteger   // time stalled on synchronization
    reduceUnrollWarps8 Issue Stall Reasons (Synchronization)       9.97%
    reduceUnrolling8   Issue Stall Reasons (Synchronization)      14.54%
    

    6. Reducing with Complete Unrolling

    If the iteration count is known at compile time, the loop can be unrolled completely. On Fermi and Kepler the maximum number of threads per block is 1024, and the iteration counts of the kernels in this post depend only on blockDim, so full unrolling is feasible.

    __global__ void reduceCompleteUnrollWarps8 (int *g_idata, int *g_odata,
            unsigned int n)
    {
        // set thread ID
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;
    
        // convert global data pointer to the local pointer of this block
        int *idata = g_idata + blockIdx.x * blockDim.x * 8;
    
        // unrolling 8
        if (idx + 7 * blockDim.x < n)
        {
            int a1 = g_idata[idx];
            int a2 = g_idata[idx + blockDim.x];
            int a3 = g_idata[idx + 2 * blockDim.x];
            int a4 = g_idata[idx + 3 * blockDim.x];
            int b1 = g_idata[idx + 4 * blockDim.x];
            int b2 = g_idata[idx + 5 * blockDim.x];
            int b3 = g_idata[idx + 6 * blockDim.x];
            int b4 = g_idata[idx + 7 * blockDim.x];
            g_idata[idx] = a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4;
        }
    
        __syncthreads();
    
        // in-place reduction and complete unroll
        if (blockDim.x >= 1024 && tid < 512) idata[tid] += idata[tid + 512];
    
        __syncthreads();
    
        if (blockDim.x >= 512 && tid < 256) idata[tid] += idata[tid + 256];
    
        __syncthreads();
    
        if (blockDim.x >= 256 && tid < 128) idata[tid] += idata[tid + 128];
    
        __syncthreads();
    
        if (blockDim.x >= 128 && tid < 64) idata[tid] += idata[tid + 64];
    
        __syncthreads();
    
        // unrolling warp
        if (tid < 32)
        {
            volatile int *vsmem = idata;
            vsmem[tid] += vsmem[tid + 32];
            vsmem[tid] += vsmem[tid + 16];
            vsmem[tid] += vsmem[tid +  8];
            vsmem[tid] += vsmem[tid +  4];
            vsmem[tid] += vsmem[tid +  2];
            vsmem[tid] += vsmem[tid +  1];
        }
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = idata[0];
    }
    

    Invocation from main:

    reduceCompleteUnrollWarps8<<<grid.x / 8, block>>>(d_idata, d_odata, size);
    

    7. Reducing with Template Functions

    CUDA supports C++ templates, so the block size can be passed as a template parameter:

    template <unsigned int iBlockSize>
    __global__ void reduceCompleteUnroll(int *g_idata, int *g_odata,
                                         unsigned int n)
    {
        // set thread ID
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;
    
        // convert global data pointer to the local pointer of this block
        int *idata = g_idata + blockIdx.x * blockDim.x * 8;
    
        // unrolling 8
        if (idx + 7 * blockDim.x < n)
        {
            int a1 = g_idata[idx];
            int a2 = g_idata[idx + blockDim.x];
            int a3 = g_idata[idx + 2 * blockDim.x];
            int a4 = g_idata[idx + 3 * blockDim.x];
            int b1 = g_idata[idx + 4 * blockDim.x];
            int b2 = g_idata[idx + 5 * blockDim.x];
            int b3 = g_idata[idx + 6 * blockDim.x];
            int b4 = g_idata[idx + 7 * blockDim.x];
            g_idata[idx] = a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4;
        }
    
        __syncthreads();
    
        // in-place reduction and complete unroll
        if (iBlockSize >= 1024 && tid < 512) idata[tid] += idata[tid + 512];
    
        __syncthreads();
    
        if (iBlockSize >= 512 && tid < 256)  idata[tid] += idata[tid + 256];
    
        __syncthreads();
    
        if (iBlockSize >= 256 && tid < 128)  idata[tid] += idata[tid + 128];
    
        __syncthreads();
    
        if (iBlockSize >= 128 && tid < 64)   idata[tid] += idata[tid + 64];
    
        __syncthreads();
    
        // unrolling warp
        if (tid < 32)
        {
            volatile int *vsmem = idata;
            vsmem[tid] += vsmem[tid + 32];
            vsmem[tid] += vsmem[tid + 16];
            vsmem[tid] += vsmem[tid +  8];
            vsmem[tid] += vsmem[tid +  4];
            vsmem[tid] += vsmem[tid +  2];
            vsmem[tid] += vsmem[tid +  1];
        }
    
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = idata[0];
    }
    

    Note: the key difference is the template <unsigned int iBlockSize> parameter carrying the block size.
    Because iBlockSize is a compile-time constant, checks such as if (iBlockSize >= 1024 && tid < 512) idata[tid] += idata[tid + 512]; are resolved during compilation: with iBlockSize = 256, for example, that condition is always false and the compiler removes the statement entirely. The kernel is then dispatched with a switch on the runtime block size:

    switch (blocksize) {  
        case 1024:  
            reduceCompleteUnroll<1024><<<grid.x/8, block>>>(d_idata, d_odata, size);  
            break;  
        case 512:  
            reduceCompleteUnroll<512><<<grid.x/8, block>>>(d_idata, d_odata, size);  
            break;  
        case 256:  
            reduceCompleteUnroll<256><<<grid.x/8, block>>>(d_idata, d_odata, size);  
            break;  
        case 128:  
            reduceCompleteUnroll<128><<<grid.x/8, block>>>(d_idata, d_odata, size);  
            break;  
        case 64:  
            reduceCompleteUnroll<64><<<grid.x/8, block>>>(d_idata, d_odata, size);  
            break;  
    }      
    

    Summary: Reduction Kernel Performance

    ccit@ccit:~/hym/CodeSamples/chapter03$ ./reduceInteger 
    ./reduceInteger starting reduction at device 0: Tesla K80     with array size 16777216  grid 32768 block 512
    cpu reduce      elapsed 0.085437 sec cpu_sum: 2139353471
    gpu Neighbored  elapsed 0.008413 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Neighbored2 elapsed 0.007775 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Interleaved elapsed 0.005087 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Unrolling2  elapsed 0.002878 sec gpu_sum: 2139353471 <<<grid 16384 block 512>>>
    gpu Unrolling4  elapsed 0.001665 sec gpu_sum: 2139353471 <<<grid 8192 block 512>>>
    gpu Unrolling8  elapsed 0.000970 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu UnrollWarp8 elapsed 0.000873 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu Cmptnroll8  elapsed 0.000865 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu Cmptnroll   elapsed 0.000831 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    
    ==11132== Profiling application: ./reduceInteger
    ==11132== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     73.30%  80.877ms         9  8.9864ms  8.8970ms  9.0921ms  [CUDA memcpy HtoD]
      7.67%  8.4588ms         1  8.4588ms  8.4588ms  8.4588ms  reduceNeighbored(int*, int*, unsigned int)
      7.13%  7.8633ms         1  7.8633ms  7.8633ms  7.8633ms  reduceNeighboredLess(int*, int*, unsigned int)
      4.66%  5.1374ms         1  5.1374ms  5.1374ms  5.1374ms  reduceInterleaved(int*, int*, unsigned int)
      2.64%  2.9117ms         1  2.9117ms  2.9117ms  2.9117ms  reduceUnrolling2(int*, int*, unsigned int)
      1.43%  1.5799ms         1  1.5799ms  1.5799ms  1.5799ms  reduceUnrolling4(int*, int*, unsigned int)
      0.82%  901.81us         1  901.81us  901.81us  901.81us  reduceUnrolling8(int*, int*, unsigned int)
      0.77%  847.57us         1  847.57us  847.57us  847.57us  reduceUnrollWarps8(int*, int*, unsigned int)
      0.76%  836.63us         1  836.63us  836.63us  836.63us  reduceCompleteUnrollWarps8(int*, int*, unsigned int)
      0.74%  821.97us         1  821.97us  821.97us  821.97us  void reduceCompleteUnroll<unsigned int=512>(int*, int*, unsigned int)
      0.10%  107.23us         9  11.914us  5.8560us  21.216us  [CUDA memcpy DtoH]
    
    ==11132== API calls:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     55.86%  341.89ms         2  170.94ms  533.87us  341.36ms  cudaMalloc
     24.73%  151.33ms         1  151.33ms  151.33ms  151.33ms  cudaDeviceReset
     13.32%  81.521ms        18  4.5289ms  37.614us  9.1564ms  cudaMemcpy
      5.01%  30.662ms        18  1.7035ms  123.31us  8.4688ms  cudaDeviceSynchronize
      0.49%  3.0276ms       364  8.3170us     133ns  289.29us  cuDeviceGetAttribute
      0.25%  1.5256ms         4  381.39us  364.76us  393.17us  cuDeviceTotalMem
      0.12%  707.06us         1  707.06us  707.06us  707.06us  cudaGetDeviceProperties
      0.10%  588.83us         2  294.41us  211.99us  376.83us  cudaFree
      0.07%  452.08us         9  50.231us  39.824us  65.419us  cudaLaunch
      0.04%  247.65us         4  61.912us  54.456us  65.017us  cuDeviceGetName
      0.00%  12.626us        27     467ns     152ns  1.3440us  cudaSetupArgument
      0.00%  9.8140us         9  1.0900us     648ns  1.6530us  cudaConfigureCall
      0.00%  9.7070us         1  9.7070us  9.7070us  9.7070us  cudaSetDevice
      0.00%  4.7200us        12     393ns     160ns     750ns  cuDeviceGet
      0.00%  4.1790us         3  1.3930us     288ns  3.2960us  cuDeviceGetCount
    ccit@ccit:~/hym/CodeSamples/chapter03$ nvprof ./reduceInteger
    ==11226== NVPROF is profiling process 11226, command: ./reduceInteger
    ./reduceInteger starting reduction at device 0: Tesla K80     with array size 16777216  grid 32768 block 512
    cpu reduce      elapsed 0.060118 sec cpu_sum: 2139353471
    gpu Neighbored  elapsed 0.008543 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Neighbored2 elapsed 0.007923 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Interleaved elapsed 0.005194 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
    gpu Unrolling2  elapsed 0.002968 sec gpu_sum: 2139353471 <<<grid 16384 block 512>>>
    gpu Unrolling4  elapsed 0.001631 sec gpu_sum: 2139353471 <<<grid 8192 block 512>>>
    gpu Unrolling8  elapsed 0.000953 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu UnrollWarp8 elapsed 0.000905 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu Cmptnroll8  elapsed 0.000911 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    gpu Cmptnroll   elapsed 0.000884 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
    ==11226== Profiling application: ./reduceInteger
    ==11226== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     72.93%  79.366ms         9  8.8184ms  8.7691ms  8.9602ms  [CUDA memcpy HtoD]
      7.77%  8.4588ms         1  8.4588ms  8.4588ms  8.4588ms  reduceNeighbored(int*, int*, unsigned int)
      7.22%  7.8620ms         1  7.8620ms  7.8620ms  7.8620ms  reduceNeighboredLess(int*, int*, unsigned int)
      4.72%  5.1380ms         1  5.1380ms  5.1380ms  5.1380ms  reduceInterleaved(int*, int*, unsigned int)
      2.68%  2.9131ms         1  2.9131ms  2.9131ms  2.9131ms  reduceUnrolling2(int*, int*, unsigned int)
      1.45%  1.5792ms         1  1.5792ms  1.5792ms  1.5792ms  reduceUnrolling4(int*, int*, unsigned int)
      0.83%  901.11us         1  901.11us  901.11us  901.11us  reduceUnrolling8(int*, int*, unsigned int)
      0.78%  849.88us         1  849.88us  849.88us  849.88us  reduceUnrollWarps8(int*, int*, unsigned int)
      0.77%  838.42us         1  838.42us  838.42us  838.42us  reduceCompleteUnrollWarps8(int*, int*, unsigned int)
      0.75%  821.56us         1  821.56us  821.56us  821.56us  void reduceCompleteUnroll<unsigned int=512>(int*, int*, unsigned int)
      0.09%  102.27us         9  11.363us  5.2480us  20.416us  [CUDA memcpy DtoH]
    
    ==11226== API calls:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     49.61%  248.53ms         2  124.26ms  574.69us  247.95ms  cudaMalloc
     27.09%  135.73ms         1  135.73ms  135.73ms  135.73ms  cudaDeviceReset
     15.93%  79.815ms        18  4.4342ms  33.741us  9.0381ms  cudaMemcpy
      6.14%  30.739ms        18  1.7077ms  122.83us  8.4672ms  cudaDeviceSynchronize
      0.56%  2.7905ms       364  7.6660us     128ns  274.24us  cuDeviceGetAttribute
      0.29%  1.4569ms         4  364.24us  326.50us  378.80us  cuDeviceTotalMem
      0.14%  683.42us         1  683.42us  683.42us  683.42us  cudaGetDeviceProperties
      0.10%  499.91us         2  249.96us  185.44us  314.47us  cudaFree
      0.09%  446.98us         9  49.663us  41.375us  67.690us  cudaLaunch
      0.04%  214.37us         4  53.591us  52.862us  54.100us  cuDeviceGetName
      0.00%  12.251us        27     453ns     146ns  1.2490us  cudaSetupArgument
      0.00%  11.131us         1  11.131us  11.131us  11.131us  cudaSetDevice
      0.00%  8.9040us         9     989ns     785ns  1.8110us  cudaConfigureCall
      0.00%  4.0720us         3  1.3570us     257ns  3.4250us  cuDeviceGetCount
      0.00%  2.9850us        12     248ns     125ns     518ns  cuDeviceGet
    

    These results show the host-to-device copies take about 80 ms, while even the full CPU reduction takes only about 60 ms — the data transfer, not the kernels, dominates the end-to-end time.

    Invocations                               Metric Name                        Metric Description         Min         Max         Avg
    Device "Tesla K80 (0)"
        Kernel: reduceInterleaved(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency      96.15%      96.15%      96.15%
              1                            gst_efficiency            Global Memory Store Efficiency      95.52%      95.52%      95.52%
        Kernel: reduceCompleteUnrollWarps8(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency     107.54%     107.54%     107.54%
              1                            gst_efficiency            Global Memory Store Efficiency      99.40%      99.40%      99.40%
        Kernel: reduceNeighbored(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency      25.02%      25.02%      25.02%
              1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
        Kernel: reduceUnrolling8(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency      99.21%      99.21%      99.21%
              1                            gst_efficiency            Global Memory Store Efficiency      97.71%      97.71%      97.71%
        Kernel: reduceUnrolling4(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency      98.68%      98.68%      98.68%
              1                            gst_efficiency            Global Memory Store Efficiency      97.71%      97.71%      97.71%
        Kernel: reduceUnrolling2(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency      98.04%      98.04%      98.04%
              1                            gst_efficiency            Global Memory Store Efficiency      97.71%      97.71%      97.71%
        Kernel: void reduceCompleteUnroll<unsigned int=512>(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency     107.54%     107.54%     107.54%
              1                            gst_efficiency            Global Memory Store Efficiency      99.40%      99.40%      99.40%
        Kernel: reduceUnrollWarps8(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency     107.54%     107.54%     107.54%
              1                            gst_efficiency            Global Memory Store Efficiency      99.40%      99.40%      99.40%
        Kernel: reduceNeighboredLess(int*, int*, unsigned int)
              1                            gld_efficiency             Global Memory Load Efficiency      25.02%      25.02%      25.02%
              1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
    

    Original article: https://www.haomeiwen.com/subject/etbctxtx.html