CVPR2019 Expressive Body Capture

Author: Hoyer | Published 2022-02-11 01:26

    0. Keywords

    3D model of human, body pose, hand pose, facial expression, STAR, SMPL, SMPL-X, SMPLify, SMPLify-X

    1. Links

    This paper comes from the Max Planck Institute for Intelligent Systems (MPI-IS) in Tübingen, Germany, where the well-known computer-vision professor Michael Black is one of the directors.

    Paper: https://arxiv.org/abs/1904.05866

    Project page: https://smpl-x.is.tue.mpg.de/

    Code: https://github.com/vchoutas/smplify-x

    Building on the SMPL [1] model (earlier work from the same MPI-IS lab), the paper proposes a new 3D human model, SMPL-X (it "extends SMPL with fully articulated hands and an expressive face"), which jointly covers the three main parts of the human body: hands, face, and body. In addition, to recover SMPL-X from a single image, the authors follow the SMPLify [2] approach (also prior MPI-IS work): detect 2D features, then optimize the model parameters to fit them, adding numerous tricks along the way, which yields the improved method SMPLify-X. The accuracy of the 3D model is validated on a self-built dataset (a new curated dataset).

    From left to right: RGB image, major joints, skeleton, SMPL (female), SMPL-X (female)

    2. Overview of the Main Content

    ※ Introduction

    To understand human behavior in images, we need not only the 2D body joints and pose but also the full 3D surface of the body, hands and face. Until SMPL-X, no system could do this, owing to the lack of suitable 3D models and of sufficient 3D training data. As the figure above shows, a body-only model such as SMPL is not fine-grained enough, especially for the hands and the facial expression. To address this, the paper proposes the new model SMPL-X together with the corresponding new method SMPLify-X.

    Before this, 2D human pose estimation was widely used to fit body shape, and OpenPose can jointly predict 2D hand/face/body joints, but that is still insufficient for reasoning about surfaces and human interactions in the 3D world. For 3D body estimation, traditional methods mostly handle the body alone (excluding hands and face), while the large literature on modeling 3D hands and faces likewise proceeds in isolation, decoupled from the body.

    More recently (relative to SMPL-X's publication), some methods do jointly handle hands/face/body, e.g. the Frank model [3] (CVPR 2018 Best Paper), but the authors argue it merely stitches together three disparate models, so the results are not realistic enough. SMPL-X, by contrast, is learned from a large corpus of 3D scans and models hands/face/body jointly, which gives it several advantages: compatibility with graphics software, simple parametrization, small size, efficiency, differentiability, etc. Concretely, SMPL-X = SMPL [1] + the FLAME head model [4] + the MANO hand model [5]; the combined model is then fit to 5586 manually curated 3D scans, with results far better than the Frank model. ([4] and [5] are also MPI-IS work.)
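The additive structure SMPL-X = SMPL + FLAME + MANO can be pictured as one template mesh deformed by several linear blend-shape terms before skinning. Below is a minimal numpy sketch of that additive template; the dimensions are toy values and all names are illustrative (the real model has 10,475 vertices and learned bases):

```python
import numpy as np

def smplx_template(T_bar, S, beta, E, psi, P, pose_feat):
    """Toy version of the SMPL-X additive template:
    T(beta, theta, psi) = T_bar + S@beta + E@psi + P@pose_feat,
    i.e. shape, expression, and pose-corrective blend shapes added to the
    mean mesh (the real model then applies linear blend skinning)."""
    V = T_bar.copy()
    V += (S @ beta).reshape(V.shape)       # body/hand shape offsets
    V += (E @ psi).reshape(V.shape)        # facial-expression offsets (from FLAME)
    V += (P @ pose_feat).reshape(V.shape)  # pose-dependent corrective offsets
    return V

# toy sizes: 6 vertices, 3 shape / 2 expression / 4 pose features
rng = np.random.default_rng(0)
T_bar = rng.normal(size=(6, 3))
S = rng.normal(size=(18, 3))
E = rng.normal(size=(18, 2))
P = rng.normal(size=(18, 4))
# with all parameters zero, the template mesh is unchanged
V = smplx_template(T_bar, S, np.zeros(3), E, np.zeros(2), P, np.zeros(4))
```

The point of the shared additive form is that body, face, and hand parameters deform a single consistent mesh, instead of three stitched-together meshes as in the Frank model.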

    Having proposed SMPL-X, the authors then improve the original SMPLify method to recover a detailed 3D model of a single person (including hands and face) from a single image. The improvements include: a VAE that provides the body pose prior; a penalty term for interpenetration; a gender classifier so that the matching female/male/neutral model can be chosen; and replacing Chumpy with PyTorch to speed up the optimization. Some qualitative fitting results are shown below.
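The SMPLify-X objective is, roughly, a weighted sum of a robust 2D-joint reprojection term and several priors (the VPoser latent prior, shape/expression priors, and the interpenetration penalty). A hedged numpy sketch of just the data term with the Geman-McClure robustifier the paper uses, plus the quadratic latent prior; the weights, σ, and example values here are made up:

```python
import numpy as np

def gmof(x, sigma):
    """Geman-McClure robust error: behaves like x**2 for small residuals
    but saturates below sigma**2, limiting the influence of outliers."""
    x2 = x ** 2
    return (sigma ** 2 * x2) / (sigma ** 2 + x2)

def joint_term(proj_joints, det_joints, det_conf, sigma=100.0):
    """E_J: confidence-weighted robust distance between projected model
    joints and detected 2D joints (OpenPose confidences as weights)."""
    residual = gmof(proj_joints - det_joints, sigma)   # (J, 2)
    return float(np.sum(det_conf[:, None] * residual))

def vposer_prior(z):
    """E_pose: quadratic penalty on the VPoser latent code z, preferring
    poses whose latent lies near the VAE prior N(0, I)."""
    return float(np.sum(z ** 2))

# toy example: 3 joints, perfect fit, zero latent -> total energy is zero
proj = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
det = proj.copy()
conf = np.ones(3)
E = joint_term(proj, det, conf) + 0.1 * vposer_prior(np.zeros(8))
```

The robustifier is what keeps a few badly detected 2D joints from dominating the fit: no single joint can contribute more than about σ² to the data term.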

    SMPL-X jointly models the human body, face and hands; SMPLify-X fits the female SMPL-X model to single RGB images; the new method captures a rich variety of natural and expressive 3D human poses, gestures and facial expressions.

    To verify accuracy, the authors also build their own dataset and use it to demonstrate the superiority of the SMPL-X model and the SMPLify-X method. They state confidently: "We believe that this work is a significant step towards expressive capture of bodies, hands and faces together from a single RGB image." (Most paper authors would never dare to sound this confident.)

    ※ Related Work

    Modeling the body

    1) Bodies, Faces and Hands. Historically, most methods model the human body by splitting it into isolated parts. Blanz and Vetter [6] pioneered this direction with their 3D morphable face model. That line of work relies on FACS to build expression-related blend shapes, and a large body of subsequent work builds on this pioneering approach; see the survey [7]. FLAME [4] then took a step forward by modeling the whole head and neck region rather than only the face region. However, no method jointly modeled face and body shape. Meanwhile, the availability of 3D body scanners enabled learning body shape from scans; this line of work factors body shape and pose via either triangle deformations or vertex-based displacements. Unfortunately, these methods still ignore hands and face: the hand is treated as a fist or a flat palm, and the facial expression is fixed to neutral. Similarly, hand modeling has developed in isolation; we omit the details.
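The FACS-based expression blend shapes mentioned above amount to linearly interpolating between a neutral face and a set of extreme expression shapes. A toy numpy sketch, where all shapes are invented for illustration:

```python
import numpy as np

def apply_blendshapes(neutral, expr_shapes, weights):
    """FACS-style expression model: neutral mesh plus a weighted sum of
    per-expression offset shapes (each offset = expression - neutral).
    neutral: (N, 3); expr_shapes: (K, N, 3); weights: (K,)."""
    offsets = expr_shapes - neutral[None]          # (K, N, 3) offsets
    return neutral + np.tensordot(weights, offsets, axes=1)

neutral = np.zeros((5, 3))                         # toy 5-vertex neutral face
smile = np.ones((5, 3))                            # one invented "smile" shape
# half-activated smile: every vertex moves halfway toward the smile shape
face = apply_blendshapes(neutral, smile[None], np.array([0.5]))
```

This linear structure is exactly why such models are so easy to fit and differentiate through, and it is the same mechanism SMPL-X inherits for its expression space.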

    2) Unified Models. A few methods do model the body in a unified way, close in spirit to this paper: the Frank model [3] and SMPL+H [5] (earlier MPI-IS work). Frank model = SMPL (with no pose blend shapes) for the body + an artist-created rig for the hands + the FaceWarehouse model [8] for the face, but its results are not realistic enough; SMPL+H, for its part, lacks a face model. The authors therefore start from SMPL+H, add the FLAME head model [4], and jointly fit the new 3D model on a large amount of data.

    Inferring the body. The authors focus only on methods that recover a full 3D body mesh, listing SMPLify, HMR, NBF, MonoPerfCap and others as concrete examples; none of them recover body shape together with face and hands. Another mainstream family of approaches takes a different technical route, e.g. multi-camera setups, to extract 3D pose, 3D meshes (performance capture), or parametric 3D models; the CMU Panoptic Studio is a typical example, and the Frank model [3] likewise fits its model to 3D keypoints and 3D point clouds obtained this way. Such hardware is clearly bulky and expensive; by contrast, the method proposed here needs only a single RGB image as input, which is simple enough.

    ※ Technical approach

    1) Unified model: SMPL-X — an extension of the original SMPL [1], incorporating FLAME [4] and MANO [5]; all three models/methods come from the same MPI-IS lab;

    2) SMPLify-X: SMPL-X from a single image — an extension of the original SMPLify [2] method, which is also earlier MPI-IS work;

    3) Variational Human Body Pose Prior — a VAE is trained to obtain the body-pose prior VPoser; the training data include the CMU MoCap dataset, Human3.6M, and the PosePrior dataset;

    4) Collision penalizer — to mitigate self-collisions and penetrations of the model, a penalty term over any two colliding triangles is added;

    5) Deep Gender Classifier — takes the body image and joints as input and predicts the person's gender, so that a gender-matched body model can be used; the classifier is a plain ResNet18;

    6) Optimization — Chumpy and OpenDR are replaced with PyTorch and the Limited-memory BFGS optimizer (L-BFGS).
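The switch to L-BFGS matters because the fit involves many correlated pose parameters, for which a quasi-Newton method converges far faster than plain gradient descent. As a stand-in for the real PyTorch pipeline, here is a minimal sketch fitting toy "pose" parameters to 2D targets with SciPy's L-BFGS-B; the linear model and data are invented purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# toy "model": joint coordinates are a linear function of the parameters
rng = np.random.default_rng(1)
A = rng.normal(size=(10, 4))           # maps 4 parameters -> 10 joint coords
theta_true = np.array([0.5, -1.0, 2.0, 0.3])
targets = A @ theta_true               # noiseless "detections"

def energy(theta):
    """Squared reprojection error of the toy model."""
    r = A @ theta - targets
    return float(r @ r)

# L-BFGS-B builds a low-memory Hessian approximation from gradient history
res = minimize(energy, x0=np.zeros(4), method="L-BFGS-B")
```

In the real SMPLify-X code the energy is built from differentiable PyTorch ops, so the gradients L-BFGS needs come from autograd rather than finite differences as in this sketch.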

    ※ Experiments

    1) Evaluation datasets — a self-built dataset, the Expressive Hands and Faces dataset (EHF); it is based on the SMPL+H dataset with new ground truth added;

    2) Qualitative & Quantitative evaluations — the experimental design is fairly simple, comparing against three models: SMPL, SMPL+H, and Frank. The quantitative results are mainly two tables: Table 1 shows that among SMPL, SMPL+H and SMPL-X, SMPL-X is the most accurate, while Table 2 is an ablation showing the gain each trick contributes to SMPLify-X. The qualitative results are mainly three figures: SMPL-X vs. the Frank model; SMPL-X on the LSP dataset; and a comparison of SMPL-X and SMPLify-X to a hands-only approach.
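The quantitative comparison in Table 1 boils down to a mean per-vertex error between the estimated and ground-truth meshes. A minimal sketch of such a vertex-to-vertex (v2v) metric; any rigid alignment step is assumed to have been done already, and the example meshes are invented:

```python
import numpy as np

def v2v_error(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding vertices of two
    meshes with identical topology (both arrays of shape (N, 3))."""
    return float(np.mean(np.linalg.norm(pred_vertices - gt_vertices, axis=1)))

gt = np.zeros((4, 3))
pred = np.zeros((4, 3))
pred[:, 0] = 3.0
pred[:, 1] = 4.0
# every vertex is displaced by a 3-4-0 offset -> per-vertex distance is 5.0
err = v2v_error(pred, gt)
```

A v2v metric like this only makes sense when the meshes share topology, which is exactly why the paper converts SMPL and SMPL+H fits into the SMPL-X topology before comparing.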

    Figure: Qualitative results of SMPL-X for the in-the-wild images of the LSP dataset. Panels: reference RGB / Frank model / SMPL-X (multiple cameras) / SMPL-X (single camera).

    ※ Conclusion

    "In this work we present SMPL-X, a new model that jointly captures the body together with face and hands. We additionally present SMPLify-X, an approach to fit SMPL-X to a single RGB image and 2D OpenPose joint detections. We regularize fitting under ambiguities with a new powerful body pose prior and a fast and accurate method for detecting and penalizing penetrations. We present a wide range of qualitative results using images in-the-wild, showing the expressivity of SMPL-X and effectiveness of SMPLify-X. We introduce a curated dataset with pseudo ground-truth to perform quantitative evaluation, that shows the importance of more expressive models. In future work we will curate a dataset of in-the-wild SMPL-X fits and learn a regressor to regress SMPL-X parameters directly from RGB images. We believe that this work is an important step towards expressive capture of bodies, hands and faces together from an RGB image." (The original conclusion is quoted verbatim, as a model of how to summarize.)

    3. Novelty

    ※ Bringing hands and face into the modeling of body shape is an idea many people could have had, but the authors had the advantage of being at MPI-IS, which already had ready-made body/hands/face shape-modeling work to build on;

    ※ N*tricks. Not every added trick helps, but this paper adds multiple terms to the objective of the fitting task and still achieves consistent gains — master-class trick engineering.

    4. Summary

    Most of this work extends the MPI-IS lab's own prior work. In short, SMPL-X = SMPL [1] + FLAME [4] + MANO [5], and SMPLify-X = SMPLify [2] + N*tricks. The lesson for us: accumulate in, and dig deep into, your own field.

    5. References

    [1] Loper M, Mahmood N, Romero J, et al. SMPL: A skinned multi-person linear model[J]. ACM transactions on graphics (TOG), 2015, 34(6): 1-16.

    [2] Bogo F, Kanazawa A, Lassner C, et al. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image[C]//European conference on computer vision. Springer, Cham, 2016: 561-578.

    [3] Joo H, Simon T, Sheikh Y. Total capture: A 3d deformation model for tracking faces, hands, and bodies[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8320-8329.

    [4] Li T, Bolkart T, Black M J, et al. Learning a model of facial shape and expression from 4D scans[J]. ACM Transactions on Graphics (TOG), 2017, 36(6): 194:1-194:17.

    [5] Romero J, Tzionas D, Black M J. Embodied hands: modeling and capturing hands and bodies together[J]. ACM Transactions on Graphics (TOG), 2017, 36(6): 1-17.

    [6] Blanz V, Vetter T. A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th annual conference on Computer graphics and interactive techniques. 1999: 187-194.

    [7] Zollhöfer M, Thies J, Garrido P, et al. State of the art on monocular 3D face reconstruction, tracking, and applications[C]//Computer Graphics Forum. 2018, 37(2): 523-550.

    [8] Cao C, Weng Y, Zhou S, et al. Facewarehouse: A 3d facial expression database for visual computing[J]. IEEE Transactions on Visualization and Computer Graphics, 2013, 20(3): 413-425.
