Metal and Graphics Rendering, Part 9: Multithreaded Rendering

Author: 肠粉白粥_Hoben | Published: 2021-10-14 15:59

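0. Preface

When we introduced Metal in the getting-started article, we mentioned that Metal supports multithreaded operation and can share resources between threads. At the time that was only mentioned in passing, with no hands-on practice. Today we put multithreaded rendering in Metal into practice.

This article uses YUV-to-RGB conversion and a bouncing-ball demo to introduce multithreaded rendering in Metal. The bouncing-ball effect is shown below.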

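1. Textures, Pixels, Threads, and Threadgroups

A texture can be viewed as a grid. The grid is divided into groups, and each group contains a number of pixels. The GPU computes in parallel, with each thread processing one pixel, so the threads that cover one group are bundled into a threadgroup, and the threadgroups together carry out the rendering work.

Compute work runs through an MTLComputePipelineState. When dispatching, you pass the number of threadgroups (threadGroupCount) and the number of threads in each threadgroup (threadGroupSize).

MTLComputePipelineState exposes two relevant properties: threadExecutionWidth, the number of threads the GPU executes together (its SIMD width), and maxTotalThreadsPerThreadgroup, the maximum number of threads a single threadgroup may contain. Apple recommends making the number of threads in a threadgroup a multiple of threadExecutionWidth, which gives: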

NSUInteger w = pipelineState.threadExecutionWidth;
NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadGroupSize = MTLSizeMake(w, h, 1);

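threadGroupCount then follows from the threadGroupSize computed above. The number of threadgroups in each row of the grid is (outTexture.width - 1) / threadGroupSize.width + 1, and likewise the number in each column is (outTexture.height - 1) / threadGroupSize.height + 1. The code uses the equivalent form (n + size - 1) / size: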

NSUInteger w = pipelineState.threadExecutionWidth;
NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadGroupSize = MTLSizeMake(w, h, 1);

MTLSize threadGroupCount;
threadGroupCount.width = (outTexture.width + threadGroupSize.width - 1) / threadGroupSize.width;
threadGroupCount.height = (outTexture.height + threadGroupSize.height - 1) / threadGroupSize.height;
threadGroupCount.depth = 1;

[computeEncoder dispatchThreadgroups:threadGroupCount threadsPerThreadgroup:threadGroupSize];
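
To make the rounding concrete, here is a minimal plain-C sketch of the same arithmetic that can be checked off-device. Size3 is a hypothetical stand-in for MTLSize, and the values 32 and 1024 used in the note below are assumptions for illustration; real values come from the pipeline state at run time.

```c
#include <stddef.h>

/* Hypothetical plain-C stand-in for MTLSize, for checking the arithmetic only. */
typedef struct { size_t width, height, depth; } Size3;

/* Threads per threadgroup: executionWidth wide, as tall as the per-group limit allows. */
static Size3 threadgroup_size(size_t executionWidth, size_t maxThreadsPerGroup) {
    Size3 s = { executionWidth, maxThreadsPerGroup / executionWidth, 1 };
    return s;
}

/* Threadgroups needed to cover a texture: integer ceiling of texture / group per axis. */
static Size3 threadgroup_count(size_t texWidth, size_t texHeight, Size3 group) {
    Size3 c = {
        (texWidth + group.width - 1) / group.width,
        (texHeight + group.height - 1) / group.height,
        1
    };
    return c;
}
```

With an assumed execution width of 32 and an assumed limit of 1024 threads, a 1920x1080 texture gets 32x32 threadgroups and a 60x34 grid of them. That grid covers 1920x1088 threads, so the last 8 rows fall outside the texture.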

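2. Converting YUV to RGB with Multithreaded Rendering

In MSL, only functions declared with the kernel qualifier can be dispatched across threads in this way. A kernel function is neither a vertex function nor a fragment function, but it can read and write both textures and buffers. Here we use a kernel function to convert YUV to RGB.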

kernel void
ycbcrToRgb(texture2d<float, access::read> textureY [[ texture(0) ]],
           texture2d<float, access::read> textureUV [[ texture(1) ]],
           texture2d<float, access::write> outTexture [[ texture(2) ]],
           constant CCAlphaVideoMetalConvertMatrix *convertMatrix [[ buffer(0) ]],
           uint2 gid [[ thread_position_in_grid ]]) {
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) {
        return;
    }
    float y = textureY.read(gid).r;
    float2 uv = textureUV.read(gid / 2).rg;
    float3 rgb = convertMatrix->matrix * (float3(y, uv) + convertMatrix->offset);
    outTexture.write(float4(rgb, 1), gid);
}

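In this kernel, the thread_position_in_grid attribute gives the position of the current thread within the whole grid. Because threadGroupCount was derived from outTexture's dimensions, every in-range gid corresponds to exactly one pixel of the output texture.

For example, a gid of (9, 10) corresponds to position (9, 10) in outTexture.

Note that the texture's width and height are not necessarily multiples of the threadgroup size, so the dispatched grid can be larger than the texture. Threads that fall outside the texture must return early.

The Y texture has the same dimensions as the output texture, while the UV texture is half the output's width and half its height, so the two are read at different coordinates (gid and gid / 2).

Finally, calling write stores the result in the output texture.

The corresponding Objective-C code is: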
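
As a rough CPU-side illustration of what each thread computes, the sketch below converts one pixel and shows the halved chroma lookup. The article does not give the contents of convertMatrix, so the BT.601 full-range coefficients and the (0, -0.5, -0.5) offset here are assumptions, and both function names are hypothetical.

```c
#include <stddef.h>

typedef struct { float r, g, b; } Rgb;

/* Convert one pixel, mirroring matrix * (float3(y, uv) + offset).
   ASSUMPTION: BT.601 full-range coefficients with offset (0, -0.5, -0.5). */
static Rgb ycbcr_to_rgb(float y, float cb, float cr) {
    float u = cb - 0.5f;  /* apply the chroma offset */
    float v = cr - 0.5f;
    Rgb out = {
        y + 1.402f * v,
        y - 0.344136f * u - 0.714136f * v,
        y + 1.772f * u
    };
    return out;
}

/* Index of the CbCr pair for luma pixel (x, y) in a 4:2:0 image.
   The chroma plane is half size in both axes, hence the same halving as gid / 2. */
static size_t chroma_index(size_t x, size_t y, size_t chromaWidth) {
    return (y / 2) * chromaWidth + (x / 2);
}
```

Under these assumed coefficients, neutral chroma (cb = cr = 0.5) leaves the luma value unchanged in all three channels, which is a quick sanity check for any matrix you substitute.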

- (void)computePipeline:(id <MTLComputePipelineState>)pipelineState
          inputTextures:(NSArray <HobenMetalTexture *> *)inputTextures
                buffers:(nullable NSArray <id<MTLBuffer>> *)buffers 
             outTexture:(id<MTLTexture>)outTexture
          commandBuffer:(id <MTLCommandBuffer>)commandBuffer {
    id <MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:pipelineState];
    
    for (int i = 0; i < inputTextures.count; i++) {
        [computeEncoder setTexture:inputTextures[i].texture atIndex:i];
    }
    [computeEncoder setTexture:outTexture atIndex:inputTextures.count];
    
    for (NSInteger i = 0; i < buffers.count; i++) {
        id <MTLBuffer> buffer = buffers[i];
        [computeEncoder setBuffer:buffer offset:0 atIndex:i];
    }
    
    NSUInteger w = pipelineState.threadExecutionWidth;
    NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
    MTLSize threadGroupSize = MTLSizeMake(w, h, 1);
    
    MTLSize threadGroupCount;
    threadGroupCount.width = (outTexture.width + threadGroupSize.width - 1) / threadGroupSize.width;
    threadGroupCount.height = (outTexture.height + threadGroupSize.height - 1) / threadGroupSize.height;
    threadGroupCount.depth = 1;
    
    [computeEncoder dispatchThreadgroups:threadGroupCount threadsPerThreadgroup:threadGroupSize];
    [computeEncoder endEncoding];
}

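3. Rendering Bouncing Balls

Besides modifying textures, a kernel function can also modify the contents of a buffer, which can then be fed into a vertex or fragment shader.

For the bouncing balls, the kernel function computes each ball's speed and position; the vertex function takes that result, normalizes the position, and sets the ball's size; the fragment function colors the ball.

We start by defining the per-ball data that the kernel and vertex functions compute with. This struct is shared between the Objective-C and MSL code: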

typedef struct {
    vector_float2 position;
    float speed;
    float size;
} BallData;

The rasterizer data carries the ball's vertex position, its point size, and the color passed to the fragment shader:

typedef struct {
    float4 position [[ position ]];
    float size [[point_size]];
    float4 color;
} BallRasterizerData;

The kernel function takes each ball's position and speed from the previous frame and computes the position and speed for the next frame:

kernel void
bouncingBallCompute(device BallData *data [[ buffer(0) ]],
                    uint gid [[ thread_position_in_grid ]]) {
    constexpr float tickTime = 0.16;  // time step per frame
    constexpr float g = 9.8;  // gravitational acceleration

    auto d = data[gid];

    d.position.y = d.position.y + d.speed * tickTime;

    if (d.position.y >= kHobenBallOutTextureHeight - d.size / 2) {
        // Hit the bottom: bounce back
        d.position.y = kHobenBallOutTextureHeight - d.size / 2;
        d.speed = -d.speed * 0.9;
    }

    // Make balls of different sizes accelerate differently
    float drag = (abs(d.speed) * d.size) / 100;

    d.speed = d.speed + tickTime * (g - drag);
    data[gid] = d;
}
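
The per-thread update can be checked on the CPU with a plain-C sketch of the same step. The Ball struct mirrors BallData, and floorY is a hypothetical parameter standing in for kHobenBallOutTextureHeight, whose value the article does not state.

```c
/* Fields mirror BallData; position is split into x and y. */
typedef struct { float x, y; float speed; float size; } Ball;

/* One simulation step for one ball, following the kernel's logic.
   floorY is a hypothetical stand-in for kHobenBallOutTextureHeight. */
static void ball_step(Ball *b, float floorY) {
    const float tickTime = 0.16f;  /* time step per frame */
    const float g = 9.8f;          /* gravitational acceleration */

    b->y += b->speed * tickTime;

    if (b->y >= floorY - b->size / 2) {
        /* Hit the bottom: clamp to the floor and reverse with 10% energy loss. */
        b->y = floorY - b->size / 2;
        b->speed = -b->speed * 0.9f;
    }

    /* Drag grows with speed and size, so larger balls accelerate more slowly. */
    float absSpeed = b->speed < 0 ? -b->speed : b->speed;
    float drag = (absSpeed * b->size) / 100;

    b->speed += tickTime * (g - drag);
}
```

A ball at rest gains 0.16 * 9.8 = 1.568 units of speed in its first step and does not move until the next one, because position is advanced before speed is updated.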

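The vertex function takes the computed position and the size supplied by the host code, normalizes the position, sets the point size, and also colors each ball according to its size: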

vertex BallRasterizerData
bouncingBallVertex(const device BallData *data [[ buffer(0) ]],
                   uint vid [[ vertex_id ]],
                   uint instance [[ instance_id ]]) {
    BallRasterizerData vout;
    vout.position = float4(0, 0, 0, 1);
    
    auto d = data[instance];
    float2 p = d.position / float2(kHobenBallOutTextureWidth, kHobenBallOutTextureHeight);
    
    // Flip y so it matches the vertex shader's coordinate system (y points up)
    p.y = 1 - p.y;
    vout.position.xy = (p - 0.5) * 2;
    
    vout.size = d.size;
    vout.color = float4(1.0, d.size / 48, 1 - d.size / 48, 1.0);
    return vout;
}
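
The position mapping in the vertex function can be sketched in plain C as follows. Vec2 and pixel_to_clip are hypothetical names, and the 720x1280 size mentioned in the note is an assumed value, since the article does not give the actual texture constants.

```c
typedef struct { float x, y; } Vec2;

/* Map a pixel position (origin top-left, y down) to the [-1, 1] range
   used for vertex output (origin center, y up), as the vertex function does. */
static Vec2 pixel_to_clip(Vec2 p, float width, float height) {
    Vec2 n = { p.x / width, p.y / height };  /* normalize to [0, 1] */
    n.y = 1.0f - n.y;                        /* flip y */
    Vec2 clip = { (n.x - 0.5f) * 2.0f, (n.y - 0.5f) * 2.0f };
    return clip;
}
```

For an assumed 720x1280 texture, the top-left pixel (0, 0) maps to (-1, 1), the center (360, 640) maps to (0, 0), and the bottom-right corner (720, 1280) maps to (1, -1).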

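The fragment function only needs to make each ball round. point_coord is the fragment's normalized position within the point primitive, ranging from (0, 0) to (1, 1).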

fragment float4
bouncingBallFragment(BallRasterizerData vertexIn [[ stage_in ]],
                     float2 pointCoor [[ point_coord ]]) {

    if (distance(pointCoor, float2(0.5, 0.5)) >= 0.5) {
        // Draw the point as a circle
        discard_fragment();
    }
    return vertexIn.color;
}
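
The circle test can be sketched in plain C as a predicate over the point coordinate. fragment_kept is a hypothetical name, and the sketch compares squared distances, which is equivalent to the distance(...) >= 0.5 test in the shader but avoids a square root.

```c
#include <stdbool.h>

/* Returns true when a fragment at point coordinate (px, py), each in [0, 1],
   lies strictly inside the circle of radius 0.5 centered at (0.5, 0.5).
   Fragments for which this is false correspond to discard_fragment(). */
static bool fragment_kept(float px, float py) {
    float dx = px - 0.5f;
    float dy = py - 0.5f;
    return dx * dx + dy * dy < 0.25f;  /* 0.25 = 0.5 squared */
}
```

The center of the point is kept, while its corners (distance of about 0.707 from the center) and the exact edge midpoints (distance 0.5) are discarded.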

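With the MSL for the balls in place, let's look at the rendering code on the application side.

First we initialize the ball data. Here we create 1024 balls, each with a random size and position: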

static const NSInteger HobenMetalBouncingBallCount = 1024;

@interface HobenMetalBouncingBallOutput() {
    BallData _ballDatas[HobenMetalBouncingBallCount];
}

...

- (instancetype)initWithRenderContext:(HobenMetalRenderContext *)renderContext {
    if (self = [super initWithRenderContext:renderContext]) {
        for (int i = 0; i < HobenMetalBouncingBallCount; i++) {
            BallData ballData;
            
            ballData.size = 8 + arc4random() % 28;
            ballData.speed = 0;
            vector_float2 position = {24 + arc4random() % (kHobenBallOutTextureWidth - 48), 24 + arc4random() % 400};
            ballData.position = position;
            _ballDatas[i] = ballData;
        }
        // Create the shared buffer once, after every ball has been initialized.
        _buffer = [_renderContext.device newBufferWithBytes:_ballDatas
                                                     length:sizeof(BallData) * HobenMetalBouncingBallCount
                                                    options:MTLResourceStorageModeShared];
    }
    return self;
}

Each ball is a single element of the buffer and is drawn from a single vertex, so threadGroupSize and threadGroupCount only need one dimension:

if (!_outputTexture) {
    _outputTexture = [HobenMetalTexture defaultTextureByWidth:kHobenBallOutTextureWidth height:kHobenBallOutTextureHeight];
}
    
id <MTLCommandBuffer> commandBuffer = _renderContext.commandBuffer;
    
id <MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
id <MTLComputePipelineState> computePipeline = [_renderContext computePipelineStateWithName:@"bouncingBallCompute"];
[computeEncoder setComputePipelineState:computePipeline];

[computeEncoder setBuffer:_buffer offset:0 atIndex:0];

MTLSize threadGroupSize = MTLSizeMake(computePipeline.threadExecutionWidth, 1, 1);

NSUInteger w = (HobenMetalBouncingBallCount + threadGroupSize.width - 1) / threadGroupSize.width;
MTLSize threadGroupCount = MTLSizeMake(w, 1, 1);

[computeEncoder dispatchThreadgroups:threadGroupCount threadsPerThreadgroup:threadGroupSize];
[computeEncoder endEncoding];
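
The one-dimensional case uses the same ceiling division, and it is worth checking how many threads land past the end of the buffer. The plain-C sketch below uses hypothetical helper names, and the width of 32 in the note is an assumed threadExecutionWidth.

```c
#include <stddef.h>

/* Threadgroups needed to cover itemCount elements (integer ceiling division). */
static size_t groups_for(size_t itemCount, size_t threadsPerGroup) {
    return (itemCount + threadsPerGroup - 1) / threadsPerGroup;
}

/* Threads launched beyond the last element. If this is nonzero, the kernel
   needs a bounds check on its thread index before touching the buffer. */
static size_t idle_threads(size_t itemCount, size_t threadsPerGroup) {
    return groups_for(itemCount, threadsPerGroup) * threadsPerGroup - itemCount;
}
```

With 1024 balls and an assumed width of 32, the dispatch is 32 groups with no idle threads, so the compute kernel never indexes past the buffer. With 1000 balls the dispatch would still be 32 groups, leaving 24 threads that would read and write out of range unless the kernel is given the ball count and returns early for them.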

The draw call now uses MTLPrimitiveTypePoint. The key part is the instanceCount parameter, which says how many points to draw and corresponds to the [[ instance_id ]] attribute in MSL:

MTLRenderPassDescriptor *renderPass = [MTLRenderPassDescriptor renderPassDescriptor];
renderPass.colorAttachments[0].texture = _outputTexture;
renderPass.colorAttachments[0].clearColor = MTLClearColorMake(0, 0, 0, 0);
renderPass.colorAttachments[0].storeAction = MTLStoreActionStore;
renderPass.colorAttachments[0].loadAction = MTLLoadActionClear;

id <MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPass];

[renderEncoder setRenderPipelineState:[_renderContext pipelineStateWithVertexName:@"bouncingBallVertex" fragmentName:@"bouncingBallFragment"]];
[renderEncoder setVertexBuffer:_buffer offset:0 atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypePoint vertexStart:0 vertexCount:1 instanceCount:HobenMetalBouncingBallCount];
[renderEncoder endEncoding];

Finally, wire up the render chain in the main project and render on every frame callback:

- (void)viewDidLoad {
    [super viewDidLoad];
        
    _renderContext = [[HobenMetalRenderContext alloc] init];

    self.view.backgroundColor = [UIColor grayColor];
    
    CGRect frame = self.view.frame;
    
    HobenMetalRenderView *renderView = [[HobenMetalRenderView alloc] initWithRenderContext:_renderContext];
    renderView.backgroundColor = [UIColor blackColor];
    renderView.frame = CGRectMake(0, 200, frame.size.width, frame.size.height - 400);
    renderView.userInteractionEnabled = NO;
    renderView.paused = NO;
    renderView.delegate = self;
    [self.view addSubview:renderView];
    self.renderView = renderView;
    
    HobenMetalBouncingBallOutput *balloutPut = [[HobenMetalBouncingBallOutput alloc] initWithRenderContext:_renderContext];
    [balloutPut addTarget:self.renderView];
    self.ballOutput = balloutPut;
}

- (void)drawInMTKView:(MTKView *)view {
    [self.ballOutput beginRender];
}

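4. Multiple Command Encoders Working Together

As you may have noticed in the ball demo, we used an MTLComputeCommandEncoder for the compute work (the kernel function) and an MTLRenderCommandEncoder for the render work (the vertex and fragment functions). Once both encoders have finished encoding, the MTLCommandBuffer is committed.

In other words, one MTLCommandBuffer can contain several encoders of different types, and the developer decides whether each encoder performs compute or render work.

Once the developer has finished encoding the MTLCommandBuffers on the CPU, they are submitted to the MTLCommandQueue, and from there the work is handed over to the GPU.

The figure below uses a more concrete example to show how MTLCommandEncoder, MTLCommandBuffer, and MTLCommandQueue relate: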

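5. Summary

Multithreaded rendering in Metal is built on threadgroups: a texture can be divided into groups, and each threadgroup handles one group.

A threadgroup contains a number of threads, and each thread handles one pixel. The thread_position_in_grid attribute gives the position of the current thread within the whole grid.

Kernel functions can modify textures and buffer data, and the results can be used in vertex and fragment functions.

Different encoders do different jobs. Kernel functions are encoded with a compute encoder, which goes into a command buffer together with other encoders. A command queue holds multiple command buffers, which are ultimately executed by the GPU.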

