CUDA01-03: Computation Optimization and Memory Optimization

Author: 杨强AT南京 | Published 2020-01-11 11:51


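Memory optimization matters and means following the rules of memory usage, but those rules only pay off in a suitable scenario. Computation optimization matters just as much. This topic uses a worked example to show the effect of both.
The example is image rotation, a relatively compute-heavy image-processing task that involves trigonometric functions (sine and cosine).
First, a comparison of the timings for the two optimizations: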

Threads	No optimization (ms)	Computation optimization (ms)	Memory optimization (ms)
1	77.2400	25.0030	20.9440
2	40.7310	15.8480	10.7460
3	40.5910	10.8840	11.7090
4	38.1990	14.4070	12.3740
5	43.3370	13.0630	16.8020
6	45.2990	12.6330	12.1720
7	40.1500	11.0890	11.8420
8	42.8300	12.1220	11.4690

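Relationship between CPU cores and memory

CPU core and memory structure

CPU
1. L1 cache
  - 64 KB in total: 32 KB instruction cache (L1I) + 32 KB data cache (L1D);
  - each core has its own L1 cache;
  - load-to-use latency is 4 clock cycles (very fast);
2. L2 cache
  - 256 KB in total, shared by data and instructions;
  - each core has its own L2 cache;
  - load-to-use latency is 11-12 clock cycles (very fast);
3. L3 cache
  - 3 MB and up (3 MB on my Mac);
  - shared by all cores;
  - load-to-use latency is roughly 22 clock cycles (still fast);
DRAM (memory modules: DDR2, DDR3, DDR4)
1. The basic access unit is a row, and each row is 2-8 KB;
2. Accessing a row has a latency of 200-400 clock cycles;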

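Data and memory access flow

A core cannot access DRAM directly; data is staged through the cache levels in the following order:
- $\color{red}{DRAM} \to \color{blue}{L3} \to \color{blue}{L2} \to \color{blue}{L1}$
- The transfer from DRAM to L3, including format conversion, is handled by the memory controller.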

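Memory rules for programming

Memory rule 1: DRAM

Access DRAM in batches;
Keep small, scattered data in local stack variables: the stack is small and reused constantly, so it tends to stay in the per-core caches (SRAM).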
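As a rough illustration of batch access, the sketch below sums the same row-major matrix in two ways. The function names and the `uint32_t` element type are assumptions made for this example; they are not part of the rotation program. Both functions return the same total, but the row-wise walk touches adjacent addresses while the column-wise walk jumps a full row between accesses.

```c
#include <stddef.h>
#include <stdint.h>

// Sum a rows x cols row-major matrix, walking along rows.
// Consecutive accesses touch adjacent addresses, so each cache line
// fetched from DRAM is fully used before the next one is needed.
uint64_t sum_row_major(const uint32_t *m, size_t rows, size_t cols){
    uint64_t total = 0;
    for(size_t r = 0; r < rows; r++){
        for(size_t c = 0; c < cols; c++){
            total += m[r * cols + c];
        }
    }
    return total;
}

// Same sum, walking along columns: consecutive accesses are a whole row
// (cols * 4 bytes) apart, so on a large matrix most accesses land in a
// different cache line.
uint64_t sum_col_major(const uint32_t *m, size_t rows, size_t cols){
    uint64_t total = 0;
    for(size_t c = 0; c < cols; c++){
        for(size_t r = 0; r < rows; r++){
            total += m[r * cols + c];
        }
    }
    return total;
}
```

On matrices much larger than the caches the row-wise version is typically the faster one, although the size of the gap depends on the hardware.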

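Memory rule 2: core caches

Have each thread repeatedly access data within a 32 KB range where possible;
Keep the data a thread touches within 256 KB where possible;
Keep the combined working set of all threads within the L3 size, which is 3 MB on my machine;
If the working set exceeds 3 MB, arrange for as many accesses as possible to fall within a 3 MB range.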

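Implementing the rotation

Algorithm theory

The example uses a relatively compute-heavy operation: image rotation.
- To fit the rotated image inside the original image size, the image is scaled down automatically.
- Mathematically, rotating an image means rotating each point around a center.
  - The center is the center point of the image.
  - The rotation model is the formula from high-school mathematics:
    - $\begin{bmatrix} x^{\prime} \\ y^{\prime} \end{bmatrix} = \begin{bmatrix} {cos \theta}&{sin \theta} \\ {-sin \theta}& {cos \theta} \end{bmatrix} \times \begin{bmatrix} x \\ y \end{bmatrix}$
    - The trigonometric functions in C take angles in radians.
- The image is scaled according to the following formula:
  - scale_factor = $\begin{cases} \dfrac{h}{d} \qquad h \le w \\ \quad \\ \dfrac{w}{d} \qquad h \gt w \end{cases} \qquad d = \sqrt{h^2 + w^2}$
  - Note:
    - The scale factor must use the smaller of the height and the width, so that the rotated coordinates do not go out of bounds.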

Algorithm model

Coordinate computation
1. $x ^ \prime = ( x \cos \theta + y \sin \theta) \times scale\_factor$
2. $y ^ \prime = (- x \sin \theta + y \cos \theta) \times scale\_factor$
Core pseudocode
- new_pixel[y'][x'] = pixel[y][x];
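The two formulas and the pseudocode can be sketched for a single pixel as follows. The function name `rotate_pixel`, its parameters, and the bounds check are assumptions added for illustration. The cosine and sine are passed in already computed, and the return value reports whether the target position lies inside the image, which is a cheap guard against rounding pushing a corner pixel one step past the edge.

```c
// Map source pixel (w, h) to its rotated position (*new_w, *new_h).
// cos_t and sin_t are the cosine and sine of the rotation angle.
// Returns 1 if the target lies inside the image, 0 otherwise.
int rotate_pixel(int w, int h, int width, int height,
                 double cos_t, double sin_t, double scale,
                 int *new_w, int *new_h){
    int ox = width / 2, oy = height / 2;     // image center
    double x = w - ox, y = h - oy;           // coordinates relative to the center
    double new_x = ( x * cos_t + y * sin_t) * scale;
    double new_y = (-x * sin_t + y * cos_t) * scale;
    *new_w = (int)new_x + ox;                // back to image coordinates
    *new_h = (int)new_y + oy;
    return *new_w >= 0 && *new_w < width && *new_h >= 0 && *new_h < height;
}
```

A caller would copy `pixel[h][w]` to `new_pixel[*new_h][*new_w]` only when the function returns 1.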

Original implementation of the image rotation

Header files
- math.h is added for the math functions.

#include <pthread.h>
#include <stdint.h>
#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>

Structure definitions
1. BMP file header structure
2. Pixel structure

#pragma pack(1)

struct img_header{
    // File header
    char                  magic[2];                  // magic bytes
    unsigned int          file_size;                 // file size
    unsigned char         reserve1[4];               // skip 4 bytes
    unsigned int          data_off;                  // offset of the pixel data
    // Info header
    unsigned char         reserve2[4];               // skip 4 bytes
    int                   width;                     // image width
    int                   height;                    // image height
    unsigned char         reserve3[2];               // skip 2 bytes
    unsigned short int    bit_count;                 // bits per pixel: 1, 4, 8, 16, 24 or 32
    unsigned char         reserve4[24];              // skip 24 bytes
};

struct img_pixel{                                    // 32-bit pixel
    unsigned char         red;
    unsigned char         green;
    unsigned char         blue;
    unsigned char         alpha;
};
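Because the header is read with a single `fread` of 54 bytes, the packed structure has to match the on-disk layout byte for byte. A compile-time check along the following lines can catch a layout mistake early. The name `img_header_check` is a hypothetical copy of the header used only for this sketch, and `_Static_assert` assumes a C11 compiler.

```c
#include <stddef.h>

#pragma pack(push, 1)
struct img_header_check{
    char                  magic[2];
    unsigned int          file_size;
    unsigned char         reserve1[4];
    unsigned int          data_off;
    unsigned char         reserve2[4];
    int                   width;
    int                   height;
    unsigned char         reserve3[2];
    unsigned short int    bit_count;
    unsigned char         reserve4[24];
};
#pragma pack(pop)

// The file header (14 bytes) plus the info header (40 bytes) is 54 bytes.
_Static_assert(sizeof(struct img_header_check) == 54,
               "packed header must be 54 bytes");
// Width and height sit at byte offsets 18 and 22 of a BMP file.
_Static_assert(offsetof(struct img_header_check, width) == 18,
               "width must be at offset 18");
_Static_assert(offsetof(struct img_header_check, height) == 22,
               "height must be at offset 22");
```

Without the packing pragma, most compilers would insert padding after `magic` and the size check would fail at compile time.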

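Shared data

struct img_header header = {0};        // image header
struct img_pixel **pixels;             // source image data
struct img_pixel **new_pixels;         // rotated image data

#define         MAX_THREAD  10              // maximum number of threads

int             N_THREAD = 1;               // number of worker threads to start (default 1)
int             th_param[MAX_THREAD];       // parameters passed to the threads
pthread_t       th_handle[MAX_THREAD];      // thread handles
pthread_attr_t  t_attribute;                // thread attributes
int             angle = 45;                 // rotation angle in degrees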

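Image read function

void read_bmp(const char *filename){
    FILE* file = fopen(filename, "rb");
    if(file == NULL){
        printf("Failed to open the file\n");
        exit(1);
    }
    // The code assumes a 54-byte header followed directly by 32-bit pixel data
    size_t n_bytes = fread(&header, 1, 54, file); 
    if(n_bytes != 54){
        printf("Failed to read the BMP header\n");
        exit(1);
    }
    // A negative height marks a top-down BMP; work with the absolute value
    header.height = header.height >= 0? header.height : -header.height;
    pixels = (struct img_pixel **)malloc(header.height * sizeof(struct img_pixel *));
    for (int h = 0; h < header.height; h++){
        pixels[h] = (struct img_pixel *)malloc(4 * header.width); 
        size_t n_obj = fread(pixels[h], 1, 4 * header.width, file);
        if(n_obj == 0){
            // fread's return value alone cannot tell end-of-file from an error;
            // feof() and ferror() are needed to distinguish the two
            printf("Read error or unexpected end of file\n");
            exit(1);   // stop here so that no row pointer is left uninitialized
        }
    }

    fclose(file); // close the file
}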
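The comment in the read loop points out that `fread` alone cannot separate end-of-file from a read error. A minimal sketch of how `feof` and `ferror` settle the question is shown here; the helper name `read_exact` and its return codes are assumptions for this example, not part of the original program.

```c
#include <stdio.h>

// Read exactly n bytes from file into buf.
// Returns 0 on success, 1 if the file ended early, 2 on a read error.
int read_exact(FILE *file, void *buf, size_t n){
    size_t got = fread(buf, 1, n, file);
    if(got == n){
        return 0;
    }
    if(ferror(file)){
        return 2;        // the stream's error indicator is set
    }
    return 1;            // short read without an error: feof(file) is set
}
```

In `read_bmp`, a call such as `read_exact(file, pixels[h], 4 * header.width)` would let the caller print a precise message for each case.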

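Image write function

void write_bmp(const char *filename){
    // Write a copy of the header with the height negated again. This assumes the
    // source BMP was top-down (negative height), which read_bmp() made positive.
    // Using a copy keeps header.height positive for free_data().
    struct img_header out_header = header;
    out_header.height = - header.height;
    FILE* o_file = fopen(filename, "wb");
    if(o_file == NULL){
        printf("Failed to open the output file\n");
        exit(1);
    }
    // Write the header
    size_t o_size = fwrite(&out_header, 1, 54, o_file);

    // Write the pixel data row by row
    for(int h = 0; h < header.height; h++){
        o_size = fwrite(new_pixels[h], sizeof(struct img_pixel), header.width, o_file);
        //     printf("items written: %zu\n", o_size);
    }
    // Close the file
    fclose(o_file);
}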

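Function that creates a blank image of the same size

void create_plain_image(){
    new_pixels = (struct img_pixel **)malloc(header.height * sizeof(struct img_pixel *));
    for (int h = 0; h < header.height; h++){
        // calloc zeroes each row, so pixels the rotation never writes have a defined value
        new_pixels[h] = (struct img_pixel *)calloc(header.width, sizeof(struct img_pixel)); 
    }
}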

Releasing the image data

void free_data(){
    for(int i = 0; i < header.height; i++){
        free(pixels[i]); 
        free(new_pixels[i]);   
    }
    free(pixels);
    free(new_pixels);
}

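Thread task code: the image processing function

void* handle_image(void *param){
    // The rows to process are determined by the thread parameter
    int t_id = *((int*)param);   // thread index
    // Number of rows per thread
    int n_task = header.height / N_THREAD;
    // First row (inclusive) and last row (exclusive) for this thread
    int row_start = t_id * n_task;
    int row_stop  = row_start + n_task; 
    // The last thread also takes the rows left over by the integer division
    if(t_id == N_THREAD - 1) row_stop = header.height;
    
    // Image center (the origin of the standard coordinate system)
    int ox ,oy; 
    // Coordinates before and after the rotation
    double x,y, new_x, new_y;
    // Scale factor
    double scale;
    
    // Compute the center point
    ox = header.width / 2; 
    oy = header.height / 2;    
    
    // Compute the scale factor
    double d = sqrt(header.width * header.width + header.height * header.height);
    scale = header.width < header.height ? header.width / d : header.height / d;
    
    // Convert the angle from degrees to radians
    double arc = 2 * 3.141592 / 360.0 * angle;
    
    // Process the pixels one by one
    for(int h = row_start; h < row_stop; h++){
        for(int w = 0; w < header.width; w++){
            // 1. Convert image coordinates to standard coordinates (origin at the image center)
            x = w - ox;
            y = h - oy;
            // 2. Compute the rotated coordinates
            new_x = ( x * cos(arc)  + y * sin(arc)) * scale;
            new_y = (-x * sin(arc)  + y * cos(arc)) * scale;
            // 3. Convert the standard coordinates back to image coordinates
            int new_h = (int)new_y + oy;
            int new_w = (int)new_x + ox;
            // 4. Copy the source pixel to its rotated position in the new image
            new_pixels[new_h][new_w].red = pixels[h][w].red;
            new_pixels[new_h][new_w].green = pixels[h][w].green;
            new_pixels[new_h][new_w].blue = pixels[h][w].blue;
            new_pixels[new_h][new_w].alpha = pixels[h][w].alpha;
        }
    }
    // End the thread once its task is complete
    pthread_exit(NULL);

}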
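One detail of the row split deserves care: `header.height / N_THREAD` rounds down, so any leftover rows have to be assigned explicitly. The sketch below spreads them evenly over the first threads instead of giving them all to one thread. The helper name `row_range` and its parameters are hypothetical and only illustrate the arithmetic.

```c
// Compute the half-open row range [*start, *stop) for thread t_id out of n_threads.
// The first (height % n_threads) threads get one extra row, so every row is
// covered exactly once even when height is not a multiple of n_threads.
void row_range(int height, int n_threads, int t_id, int *start, int *stop){
    int base  = height / n_threads;     // rows every thread gets
    int extra = height % n_threads;     // leftover rows to distribute
    *start = t_id * base + (t_id < extra ? t_id : extra);
    *stop  = *start + base + (t_id < extra ? 1 : 0);
}
```

With 10 rows and 3 threads this yields the ranges [0, 4), [4, 7) and [7, 10).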

Main flow

struct timeval t;
double         t_start, t_stop;
double         t_elapsed;

read_bmp("gpu.bmp");
printf("Image loaded!\n");
create_plain_image(); // allocate the buffer that will hold the rotated image
printf("Rotation buffer created!\n");
pthread_attr_init(&t_attribute);  // initialize the thread attributes
pthread_attr_setdetachstate(&t_attribute, PTHREAD_CREATE_JOINABLE);   // make the threads joinable
// Start timing
gettimeofday(&t, NULL);
t_start = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
//////////////////// timed code
for (int i = 0; i < N_THREAD; i++){   // one iteration per thread
     th_param[i] = i;   // thread index
     pthread_create(&th_handle[i], &t_attribute,handle_image, &th_param[i]); // create the thread
}
// Join the threads; the task is complete once all of them have finished
for(int i=0; i < N_THREAD; i++){
        pthread_join(th_handle[i], NULL);
}
////////////////////
// Stop timing before any printing, so that I/O is not included in the measurement
gettimeofday(&t, NULL);
t_stop = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
t_elapsed = (t_stop - t_start)/1000.00;   // microseconds to milliseconds
pthread_attr_destroy(&t_attribute);  // release the thread attributes
printf("Image processing finished!\n");
write_bmp("gpu_rotate.bmp");
free_data();
printf("Image processing time with %d thread(s): %6.4f ms\n", N_THREAD, t_elapsed);

Image loaded!
Rotation buffer created!
Image processing finished!
Image processing time with 1 thread(s): 47.0990 ms
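The timing code converts a `struct timeval` to a number by hand at both the start and the end of the measurement. A small helper, sketched here under the hypothetical name `now_ms`, keeps that conversion in one place and returns milliseconds directly.

```c
#include <stddef.h>
#include <sys/time.h>

// Current wall-clock time in milliseconds, based on gettimeofday().
double now_ms(void){
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec * 1000.0 + (double)t.tv_usec / 1000.0;
}
```

The elapsed time is then the difference between one `now_ms()` call taken before the threads are created and one taken after the last `pthread_join`.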



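Implementing the optimization rules

Rules:
1. Avoid calling loop-invariant code inside loops;
2. Avoid unnecessary floating-point and math operations (square roots, trigonometric functions).
Code example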

void* handle_image_o(void *param){
    // The rows to process are determined by the thread parameter
    int t_id = *((int*)param);   // thread index
    // Number of rows per thread
    int n_task = header.height / N_THREAD;
    // First row (inclusive) and last row (exclusive) for this thread
    int row_start = t_id * n_task;
    int row_stop  = row_start + n_task; 
    // The last thread also takes the rows left over by the integer division
    if(t_id == N_THREAD - 1) row_stop = header.height;
    
    // Image center (the origin of the standard coordinate system)
    int ox ,oy; 
    // Coordinates before and after the rotation
    double x,y, new_x, new_y;
    // Scale factor
    double scale;
    
    // Coordinates after the transformation
    int new_h, new_w;
    // Compute the center point
    ox = header.width / 2; 
    oy = header.height / 2;  
    // Compute the scale factor
    double d = sqrt(header.width * header.width + header.height * header.height);
    scale = header.width < header.height ? header.width / d : header.height / d;   // only one of the two divisions is evaluated
    // Convert the angle from degrees to radians
    double arc = 2 * 3.141592 / 360.0 * angle;    // this could be computed once in the main flow
    // The sine and cosine are loop invariants: compute them once
    double f_sin = sin(arc);
    double f_cos = cos(arc);
    // Process the pixels one by one
    for(int h = row_start; h < row_stop; h++){
        // y and its products are constant within a row: hoist them out of the inner loop
        y = h - oy;
        double py_sin = y * f_sin;
        double py_cos = y * f_cos;
        for(int w = 0; w < header.width; w++){
            // printf("(%d,%d)->", w, h);
            // 1. Convert image coordinates to standard coordinates (origin at the image center)
            x = w - ox;
            
            // printf("(%f,%f)->", x, y);
            // 2. Compute the rotated coordinates
            new_x = ( x * f_cos + py_sin) * scale;
            new_y = (-x * f_sin + py_cos) * scale;
            // printf("(%6.0f,%6.0f)->", new_x, new_y);
            // 3. Convert the standard coordinates back to image coordinates
            new_h = (int)new_y + oy;
            new_w = (int)new_x + ox;
            // printf("(%d,%d)\n", new_w, new_h);
            // 4. Copy the source pixel to its rotated position with a single struct assignment
            new_pixels[new_h][new_w] = pixels[h][w];
        }
    }
    // End the thread once its task is complete
    pthread_exit(NULL);
}

Main flow

read_bmp("gpu.bmp");
printf("Image loaded!\n");
create_plain_image(); // allocate the buffer that will hold the rotated image
printf("Rotation buffer created!\n");
pthread_attr_init(&t_attribute);  // initialize the thread attributes
pthread_attr_setdetachstate(&t_attribute, PTHREAD_CREATE_JOINABLE);   // make the threads joinable
// Start timing
gettimeofday(&t, NULL);
t_start = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
//////////////////// timed code
for (int i = 0; i < N_THREAD; i++){   // one iteration per thread
     th_param[i] = i;   // thread index
     pthread_create(&th_handle[i], &t_attribute,handle_image_o, &th_param[i]); // create the thread
}
// Join the threads; the task is complete once all of them have finished
for(int i=0; i < N_THREAD; i++){
        pthread_join(th_handle[i], NULL);
}
////////////////////
// Stop timing before any printing, so that I/O is not included in the measurement
gettimeofday(&t, NULL);
t_stop = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
t_elapsed = (t_stop - t_start)/1000.00;   // microseconds to milliseconds
pthread_attr_destroy(&t_attribute);  // release the thread attributes
printf("Image processing finished!\n");
write_bmp("gpu_rotate.bmp");
free_data();
printf("Image processing time with %d thread(s): %6.4f ms\n", N_THREAD, t_elapsed);

Image loaded!
Rotation buffer created!
Image processing finished!
Image processing time with 1 thread(s): 19.6050 ms


Result:
- The single-threaded run drops from 47.0990 ms to 19.6050 ms, which is roughly 2.4 times faster.

When memory optimization applies

In this example, staging each row in a local buffer brings little benefit.

void* handle_image_m(void *param){
    // The rows to process are determined by the thread parameter
    int t_id = *((int*)param);   // thread index
    // Number of rows per thread
    int n_task = header.height / N_THREAD;
    // First row (inclusive) and last row (exclusive) for this thread
    int row_start = t_id * n_task;
    int row_stop  = row_start + n_task; 
    // The last thread also takes the rows left over by the integer division
    if(t_id == N_THREAD - 1) row_stop = header.height;
    
    // Image center (the origin of the standard coordinate system)
    int ox ,oy; 
    // Coordinates before and after the rotation
    double x,y, new_x, new_y;
    // Scale factor
    double scale;
    
    // Coordinates after the transformation
    int new_h, new_w;
    // Local buffer for one source row (16 KB on the stack; assumes header.width <= 4096)
    struct img_pixel buffer[4 * 1024];
    // Compute the center point
    ox = header.width / 2; 
    oy = header.height / 2;  
    // Compute the scale factor
    double d = sqrt(header.width * header.width + header.height * header.height);
    scale = header.width < header.height ? header.width / d : header.height / d;   // only one of the two divisions is evaluated
    // Convert the angle from degrees to radians
    double arc = 2 * 3.141592 / 360.0 * angle;    // this could be computed once in the main flow
    double f_sin = sin(arc);
    double f_cos = cos(arc);
    // Process the pixels one by one
    for(int h = row_start; h < row_stop; h++){
        y = h - oy;
        double py_sin = y * f_sin;
        double py_cos = y * f_cos;
        // Copy the whole source row into the local buffer in one batch
        memcpy((void*)buffer, (void*)pixels[h], (size_t)(header.width * sizeof(struct img_pixel)));
        for(int w = 0; w < header.width; w++){
            // printf("(%d,%d)->", w, h);
            // 1. Convert image coordinates to standard coordinates (origin at the image center)
            x = w - ox;
            // printf("(%f,%f)->", x, y);
            // 2. Compute the rotated coordinates
            new_x = ( x * f_cos + py_sin) * scale;
            new_y = (-x * f_sin + py_cos) * scale;
            // printf("(%6.0f,%6.0f)->", new_x, new_y);
            // 3. Convert the standard coordinates back to image coordinates
            new_h = (int)new_y + oy;
            new_w = (int)new_x + ox;
            // printf("(%d,%d)\n", new_w, new_h);
            // 4. Copy the buffered source pixel to its rotated position in the new image
            new_pixels[new_h][new_w] = buffer[w];
        }
    }
    // End the thread once its task is complete
    pthread_exit(NULL);
}

Execution flow

read_bmp("gpu.bmp");
printf("Image loaded!\n");
create_plain_image(); // allocate the buffer that will hold the rotated image
printf("Rotation buffer created!\n");
pthread_attr_init(&t_attribute);  // initialize the thread attributes
pthread_attr_setdetachstate(&t_attribute, PTHREAD_CREATE_JOINABLE);   // make the threads joinable
// Start timing
gettimeofday(&t, NULL);
t_start = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
//////////////////// timed code
for (int i = 0; i < N_THREAD; i++){   // one iteration per thread
     th_param[i] = i;   // thread index
     pthread_create(&th_handle[i], &t_attribute,handle_image_m, &th_param[i]); // create the thread
}
// Join the threads; the task is complete once all of them have finished
for(int i=0; i < N_THREAD; i++){
        pthread_join(th_handle[i], NULL);
}
////////////////////
// Stop timing before any printing, so that I/O is not included in the measurement
gettimeofday(&t, NULL);
t_stop = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
t_elapsed = (t_stop - t_start)/1000.00;   // microseconds to milliseconds
pthread_attr_destroy(&t_attribute);  // release the thread attributes
printf("Image processing finished!\n");
write_bmp("gpu_rotate.bmp");
free_data();
printf("Image processing time with %d thread(s): %6.4f ms\n", N_THREAD, t_elapsed);

Image loaded!
Rotation buffer created!
Image processing finished!
Image processing time with 1 thread(s): 17.6410 ms


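Why is the effect of the memory optimization so small here? The frequency of memory reads and writes explains it: every source pixel is read exactly once and every target pixel is written exactly once, so copying a row into a local buffer adds a copy without creating any reuse for the caches to exploit.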

Appendix

Timing comparison of the three versions above

Threads	No optimization (ms)	Computation optimization (ms)	Memory optimization (ms)
1	77.2400	25.0030	20.9440
2	40.7310	15.8480	10.7460
3	40.5910	10.8840	11.7090
4	38.1990	14.4070	12.3740
5	43.3370	13.0630	16.8020
6	45.2990	12.6330	12.1720
7	40.1500	11.0890	11.8420
8	42.8300	12.1220	11.4690

The image after rotation (default 45 degrees)

Figure: the rotated image
