[Deep Learning] A Survey of Video Action Detection and Classification Approaches

Author: bit_teng | Published: 2018-03-13 10:01

Main Categories of Approaches

image

Two-stream

[2014] Large-scale Video Classification with Convolutional Neural Networks

Fusion Method

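The paper experiments with different CNN architectures for capturing the temporal information in a video: single frame, late fusion, early fusion, and slow fusion.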

image

Multi-resolution CNN

image

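Frames at two different resolutions are fed into two separate CNNs, which are merged at the final two fully connected layers.
The two streams are a low-resolution context stream and a high-resolution fovea stream that takes the center region of each frame.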
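
As a rough illustration of how the two inputs could be prepared from one frame, here is a minimal sketch. The function name `multires_inputs`, the plain 2x subsampling, and the half-size center crop are assumptions made for this example, and the frame is assumed to have even height and width; the paper's actual preprocessing may differ.

```python
import numpy as np

def multires_inputs(frame):
    """Split one H x W x C frame into a context input and a fovea input.

    context: the whole frame at half resolution (plain 2x subsampling here).
    fovea:   the central half-size region at the original resolution.
    """
    h, w = frame.shape[:2]
    # Context stream: keep every second pixel in both directions (assumed downsampling).
    context = frame[::2, ::2]
    # Fovea stream: center crop covering half the height and half the width.
    ch, cw = h // 2, w // 2
    top, left = (h - ch) // 2, (w - cw) // 2
    fovea = frame[top:top + ch, left:left + cw]
    return context, fovea

# Toy 8x8 RGB frame, only to show the resulting shapes.
frame = np.arange(8 * 8 * 3).reshape(8, 8, 3)
context, fovea = multires_inputs(frame)
print(context.shape, fovea.shape)
```

Both inputs end up with the same spatial size, so each stream processes a quarter of the original pixels.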

Result

image

Multi-frame models outperform single-frame models, and multi-resolution outperforms single-resolution.

[2014] Two-Stream Convolutional Networks for Action Recognition in Videos

Multi-stream CNN

image

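Proposes a two-stream CNN architecture made of two networks, one for the spatial dimension and one for the temporal dimension.
One is an ordinary single-frame CNN, pre-trained on ImageNet, whose last layer is then fine-tuned on video data.
The other CNN takes dense optical flow fields from multiple frames as its training input and extracts motion information.
Multi-task learning is used to combine two datasets. The outputs of the two softmax layers are fused in one of two ways: averaging, or training an SVM that uses the softmax outputs as features.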
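
The averaging variant of score fusion can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example logits are made up, and the SVM variant is not shown.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_by_averaging(spatial_logits, temporal_logits):
    """Average the class probabilities produced by the two streams."""
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))

# Hypothetical logits for a 3-class problem.
spatial = np.array([2.0, 1.0, 0.1])    # the spatial stream prefers class 0
temporal = np.array([0.5, 2.5, 0.3])   # the temporal stream prefers class 1
fused = fuse_by_averaging(spatial, temporal)
pred = int(np.argmax(fused))
print(fused, pred)
```

In this toy case the temporal stream is more confident than the spatial one, so the fused prediction follows it.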

Optical flow stacking

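Optical flow is computed between each pair of consecutive frames, and the flow fields are simply stacked. Tracking L+1 frames yields L flow fields; splitting each into its X and Y components gives 2L channels.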
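
The channel bookkeeping can be sketched as below. The flow fields are assumed to be precomputed and stored as an array of shape (L, H, W, 2); the function name `stack_flows` and the interleaved channel order (x then y for each flow field) are choices made for this example.

```python
import numpy as np

def stack_flows(flows):
    """Turn L flow fields of shape (H, W, 2) into one (H, W, 2L) input volume.

    Channel 2k holds the x component of flow field k and channel 2k + 1 holds
    its y component.
    """
    L, H, W, _ = flows.shape
    # Bring the frame axis next to the component axis, then merge the two.
    return flows.transpose(1, 2, 0, 3).reshape(H, W, 2 * L)

# L + 1 = 11 frames give L = 10 flow fields; the values here are dummy data.
L, H, W = 10, 4, 5
flows = np.arange(L * H * W * 2, dtype=np.float32).reshape(L, H, W, 2)
stacked = stack_flows(flows)
print(stacked.shape)
```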

Optical flow is computed in a preprocessing step and stored, because computing it on the fly would slow the network down.

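Trajectory stacking: optical flow is used to track each point along its trajectory through the video, and the flow vector is sampled at the point's corresponding position in each frame.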

Mean flow subtraction: subtracting the mean flow removes the relative motion caused by camera movement.
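
A minimal sketch of mean flow subtraction, assuming the flow fields are stored as an (L, H, W, 2) array and that the mean is taken per flow field and per component. The toy data (a uniform camera pan plus a small moving object) is invented for illustration.

```python
import numpy as np

def subtract_mean_flow(flows):
    """Subtract each flow field's mean displacement vector from that field."""
    mean = flows.mean(axis=(1, 2), keepdims=True)  # shape (L, 1, 1, 2)
    return flows - mean

# A camera pan adds the same (3, -1) displacement to every pixel,
# and a small 2x2 object moves by an additional (2, 2).
flows = np.zeros((1, 8, 8, 2))
flows[..., 0] += 3.0
flows[..., 1] -= 1.0
flows[0, 2:4, 2:4] += 2.0

compensated = subtract_mean_flow(flows)
print(compensated[0, 0, 0], compensated[0, 2, 2])
```

After subtraction the background flow is close to zero while the object keeps most of its own motion. This only compensates for a global translation; it does not model more complex camera motion.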

Result

Best accuracy on UCF-101: 87%
The more flow frames are stacked, the higher the accuracy.

[2016] Convolutional Two-Stream Network Fusion for Video Action Recognition

github: https://github.com/feichtenhofer/twostreamfusion
homepage: http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/

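Building on the two-stream network, this work fuses two independent CNNs both spatially and temporally. It combines single-frame image information with frame-to-frame change information: a single frame provides a description of space, while optical flow and similar methods provide a description of time (a difference signal), so the two complement each other. Multi-task learning overcomes the shortage of training data (the last CNN layer is connected to several softmax layers, one per dataset).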

image

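For a long time span t = 1, …, T, all feature maps in that span (x1, …, xT) are combined and passed through one 3D temporal convolution, which outputs the fused feature map.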

image
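On the left, the streams are fused at a single layer only. On the right, the temporal network is kept after fusion and the results are fused once more at the end. The paper's experiments show that the latter is slightly more accurate.
Fusion methods: Sum, Max, Concatenation, Conv, Bilinear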
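
The five fusion methods can be sketched on two feature maps of shape (H, W, D) as follows. This is an illustrative sketch under assumptions: the function names are made up, concatenation simply appends channels, conv fusion is written as a 1x1 convolution over the concatenated channels with hypothetical weights `f` and bias `b`, and bilinear fusion sums the outer products over all spatial locations.

```python
import numpy as np

def sum_fusion(xa, xb):
    """Element-wise sum of the two feature maps: (H, W, D)."""
    return xa + xb

def max_fusion(xa, xb):
    """Element-wise maximum of the two feature maps: (H, W, D)."""
    return np.maximum(xa, xb)

def concat_fusion(xa, xb):
    """Stack the two feature maps along the channel axis: (H, W, 2D)."""
    return np.concatenate([xa, xb], axis=-1)

def conv_fusion(xa, xb, f, b):
    """Concatenate, then apply a 1x1 convolution with weights f (2D, D) and bias b (D,)."""
    return concat_fusion(xa, xb) @ f + b

def bilinear_fusion(xa, xb):
    """Sum of channel outer products over all pixel locations, flattened to D * D."""
    return np.einsum("ijd,ije->de", xa, xb).reshape(-1)

rng = np.random.default_rng(0)
H, W, D = 7, 7, 4
xa = rng.standard_normal((H, W, D))   # spatial-stream features (dummy data)
xb = rng.standard_normal((H, W, D))   # temporal-stream features (dummy data)

# Hypothetical 1x1 filter that adds matching channels; real weights are learned.
f = np.vstack([np.eye(D), np.eye(D)])
b = np.zeros(D)

print(sum_fusion(xa, xb).shape, concat_fusion(xa, xb).shape,
      conv_fusion(xa, xb, f, b).shape, bilinear_fusion(xa, xb).shape)
```

With this particular choice of `f`, conv fusion reduces to sum fusion, which shows that learned 1x1 weights can represent simple channel-wise combinations as a special case.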

[2016] Spatiotemporal Residual Networks for Video Action Recognition

github: https://feichtenhofer.github.io/

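This work still uses two streams, although they are called the motion stream and the appearance stream rather than the temporal stream and the spatial stream. The idea is unchanged. The motion stream still takes stacked multi-frame optical flow grayscale images as input. There are two images per flow field because the computed flow is split into an x-direction flow and a y-direction flow; in practice, L = 10 frames of x-flow and L = 10 frames of y-flow are taken at the same position. The appearance stream matches the original spatial stream and takes RGB images as input. The difference is that the two streams here exchange data with each other, instead of being fused only at the final scores as in TSN.
The network used is a residual network (ResNet).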

image

image

[2017] Hidden Two-Stream Convolutional Networks for Action Recognition

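Takes only raw video frames as input and captures motion information directly to predict the action class, without computing optical flow.
Runs about ten times faster than a conventional two-stream network.
image
Results: 93.1% on UCF101, 66.8% on HMDB51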

3D-Fused Two Stream

[2016] Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

github: https://github.com/yjxiong/temporal-segment-networks
homepage: https://www.researchgate.net/publication/316779776_Temporal_Segment_Networks_for_Action_Recognition_in_Videos

Combines a sparse temporal sampling strategy with video-level supervision.
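
A minimal sketch of the sparse sampling idea: split the video into a few equal segments, draw one snippet from each, and combine the per-snippet scores into a single video-level prediction that the loss is applied to. The function names, the default of 3 segments, and the use of averaging as the consensus are assumptions made for this example.

```python
import random

def sample_snippet_indices(num_frames, num_segments=3, rng=None):
    """Pick one random frame index from each of num_segments equal-length segments."""
    rng = rng or random.Random()
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    n = len(snippet_scores)
    return [sum(col) / n for col in zip(*snippet_scores)]

# A 90-frame video is covered by 3 snippets, one from each third.
indices = sample_snippet_indices(90, num_segments=3, rng=random.Random(0))
print(indices)

# Hypothetical 2-class scores for the three snippets.
video_scores = segmental_consensus([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
print(video_scores)
```

The cost per video is fixed by the number of segments rather than by the video length, which is what makes the sampling sparse.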

Temporal Segment Network

image

The spatial stream CNN operates on single RGB images.
The temporal stream CNN takes stacked optical flow fields as input.
Four input forms for the two-stream CNNs: RGB image, stacked RGB difference, stacked optical flow field, and stacked warped optical flow field.
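
Of the four input forms, stacked RGB difference is the simplest to compute, since it needs no optical flow. The sketch below assumes frames stored as an (L + 1, H, W, 3) uint8 array; the function name and the channel layout are choices made for this example.

```python
import numpy as np

def stacked_rgb_difference(frames):
    """Turn L + 1 RGB frames into an (H, W, 3L) stack of frame-to-frame differences."""
    # Cast before subtracting so that uint8 arithmetic does not wrap around.
    diffs = np.diff(frames.astype(np.float32), axis=0)   # (L, H, W, 3)
    L, H, W, C = diffs.shape
    return diffs.transpose(1, 2, 0, 3).reshape(H, W, L * C)

# Three dummy frames: brightness drops from 10 to 5, then stays constant.
frames = np.zeros((3, 2, 2, 3), dtype=np.uint8)
frames[0] = 10
frames[1] = 5
frames[2] = 5

rgb_diff = stacked_rgb_difference(frames)
print(rgb_diff.shape)
```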

image
Accuracy on UCF101 reaches 93.5%

LSTM

[2015] Beyond Short Snippets: Deep Networks for Video Classification

Network architecture

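Two stream: one color frame is sampled per second, and optical flow images are computed from consecutive frames to capture motion information.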

image

Feature Pooling

Feature pooling networks use a CNN and combine information across frames with different pooling layers.

Fully connected and average pooling learn less efficiently because of the large number of gradients they generate.

image

[Conv Pooling] Pooling over the final convolutional layer across the video’s frames. The spatial information in the output of the convolutional layer is preserved through a max operation over the time domain.
[Late Pooling] First passes convolutional features through two fc layers before applying the max-pooling layer.
[Slow Pooling] Pooling is first applied over 10-frames of convolutional features with stride 5. In the second stage, a single max-pooling layer combines the outputs of all fc layers.
[Local Pooling] Local Pooling only contains a single stage of max-pooling after the convolutional layers.
[Time-Domain Convolution] It contains an extra time-domain convolutional layer before feature pooling across frames.
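
Two of the temporal pooling operations described above can be sketched as follows. Per-frame features are assumed to be stored with time as the first axis; the function names are made up, and the fully connected layers that sit between the two stages of Slow Pooling are left out.

```python
import numpy as np

def conv_pooling(features):
    """Conv Pooling: max over the time axis of per-frame convolutional features."""
    return features.max(axis=0)

def windowed_max_pool(features, window=10, stride=5):
    """First stage of Slow Pooling: max over windows of `window` frames with the given stride."""
    T = features.shape[0]
    starts = range(0, max(T - window, 0) + 1, stride)
    return np.stack([features[s:s + window].max(axis=0) for s in starts])

# 30 frames with 2 feature values each; values grow over time so the maxima are easy to read.
feats = np.arange(60, dtype=np.float32).reshape(30, 2)
pooled = conv_pooling(feats)
windows = windowed_max_pool(feats)
print(pooled, windows.shape)
```

Because the maximum is taken independently at every feature position, the spatial layout of the convolutional output is preserved when the input has shape (T, H, W, C).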

LSTM

image

Deep Video LSTM takes as input the output of the final CNN layer at each consecutive video frame. CNN outputs are processed forward through time and upwards through five layers of stacked LSTMs. A softmax layer predicts the class at each time step.
The parameters of the convolutional networks (pink) and softmax classifier (orange) are shared across time steps.
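
The forward pass described above can be sketched in plain NumPy. This is a toy illustration with random weights and tiny dimensions, not the paper's model: the function names, the gate ordering, the equal hidden size in every layer, and all sizes are assumptions. The same LSTM and softmax parameters are reused at every time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W has shape (len(x) + hidden, 4 * hidden), b has shape (4 * hidden,)."""
    gates = np.concatenate([x, h]) @ W + b
    i, f, o, g = np.split(gates, 4)          # input, forget, output gates and candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def deep_lstm_classify(cnn_features, layers, W_out, b_out):
    """cnn_features: (T, D). layers: list of (W, b). Returns (T, num_classes) probabilities."""
    hidden = layers[0][1].shape[0] // 4
    states = [(np.zeros(hidden), np.zeros(hidden)) for _ in layers]
    outputs = []
    for x in cnn_features:                   # forward through time
        inp = x
        for l, (W, b) in enumerate(layers):  # upwards through the stacked layers
            h_prev, c_prev = states[l]
            h, c = lstm_step(inp, h_prev, c_prev, W, b)
            states[l] = (h, c)
            inp = h
        logits = inp @ W_out + b_out         # same classifier weights at every step
        e = np.exp(logits - logits.max())
        outputs.append(e / e.sum())
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, D, hidden, num_layers, num_classes = 6, 8, 4, 5, 3
layers = []
for l in range(num_layers):
    in_dim = D if l == 0 else hidden
    W = 0.1 * rng.standard_normal((in_dim + hidden, 4 * hidden))
    layers.append((W, np.zeros(4 * hidden)))
W_out = 0.1 * rng.standard_normal((hidden, num_classes))
b_out = np.zeros(num_classes)

cnn_features = rng.standard_normal((T, D))   # stand-in for per-frame CNN outputs
probs = deep_lstm_classify(cnn_features, layers, W_out, b_out)
print(probs.shape)
```

The result contains one class distribution per time step; how these per-step predictions are combined into a single video label is a separate design choice not shown here.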