0. Keywords
3D model of human, body pose, hand pose, facial expression, STAR, SMPL, SMPL-X, SMPLify, SMPLify-X
1. Links
The paper comes from the Max Planck Institute for Intelligent Systems (MPI-IS) in Tübingen, Germany, where the well-known computer vision researcher Michael Black is a director.
Paper: https://arxiv.org/abs/1904.05866
Project page: https://smpl-x.is.tue.mpg.de/
Code: https://github.com/vchoutas/smplify-x
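Building on the SMPL model [1] (earlier work from the same lab, MPI-IS), the paper proposes a new 3D human model, SMPL-X, which "extends SMPL with fully articulated hands and an expressive face" and therefore covers the three main parts of the human body together: hands, face, and body. In addition, to recover an SMPL-X model from a single image, the authors follow the SMPLify approach [2] (again earlier work from the same lab): 2D features are detected and the model parameters are optimized to fit them. With a number of added tricks, this becomes the improved method SMPLify-X. The authors evaluate the accuracy of the 3D model on a new curated dataset that they built themselves.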
From left to right: RGB image, major joints, skeleton, SMPL (female), SMPL-X (female)
2. Overview of the main content
※ Introduction
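To understand human behavior in images, 2D body joints and pose are not enough; we also need the full 3D surface of the body, hands and face. Until SMPL-X was proposed, however, no system could do this, because both a suitable 3D model and sufficient 3D training data were missing. The figure above also shows that SMPL, which only fits the body, is too coarse, especially for the hands and the facial expression. To address this, the paper proposes a new model, SMPL-X, together with a corresponding fitting method, SMPLify-X.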
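Before this, a large body of work on 2D human pose estimation was used to fit body shape, and OpenPose later made it possible to predict 2D hand, face and body joints at the same time. This is still not enough to predict surfaces and human interactions in the 3D world. Traditional 3D body estimation methods mostly handle the body alone (without hands and face), while the extensive literature on 3D hand and face modeling likewise treats these parts in isolation, with no link to the body.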
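More recently (relative to when SMPL-X was proposed), some methods do model hands, face and body together, for example the Frank model [3] (CVPR 2018 Best Student Paper). The authors argue, however, that Frank simply stitches together three disparate models, so its results are not realistic enough. SMPL-X is instead learned from a large corpus of 3D scans and models hands, face and body jointly, which gives it several advantages: compatibility with graphics software, simple parametrization, small size, efficiency, differentiability, etc. Concretely, SMPL-X = SMPL [1] + FLAME head model [4] + MANO hand model [5]; the combined model is then fitted to 5586 manually curated 3D scans, which the authors report gives much more realistic results than the Frank model. [[4] and [5] are also work from the same lab, MPI-IS.]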
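The composition "SMPL + FLAME + MANO" can be written as a single model function. The block below is a sketch of that formulation in SMPL-style notation, as I understand it from the paper; the symbol names are not part of this post and should be checked against the original.

```latex
\begin{align}
M(\beta, \theta, \psi) &= W\big(T_P(\beta, \theta, \psi),\; J(\beta),\; \theta,\; \mathcal{W}\big) \\
T_P(\beta, \theta, \psi) &= \bar{T} + B_S(\beta; \mathcal{S}) + B_E(\psi; \mathcal{E}) + B_P(\theta; \mathcal{P})
\end{align}
```

Here \beta, \theta and \psi are the shape, pose and expression parameters, \bar{T} is the template mesh, B_S, B_E and B_P are the shape, expression and pose blend shapes, J(\beta) regresses the joint locations, and W is linear blend skinning with weights \mathcal{W}. Every step is differentiable with respect to the parameters, which is what makes optimization-based fitting to an image possible.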
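Having introduced SMPL-X, the authors then improve the original SMPLify method so that it recovers a detailed 3D model of a single person (including hands and face) from a single image. The improvements are: a VAE that provides the pose prior, a penalty term for interpenetration, a gender classifier that selects the female, male or neutral model, and a PyTorch implementation that replaces Chumpy to speed up the fitting. Some qualitative fitting results are shown in the figure below.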
SMPL-X jointly models the human body, face and hands; SMPLify-X fits the female SMPL-X model to single RGB images; the new method captures a rich variety of natural and expressive 3D human poses, gestures and facial expressions
In addition, to verify accuracy, the authors built their own dataset and used it to demonstrate the advantages of the SMPL-X model and the SMPLify-X method. They state with confidence: "We believe that this work is a significant step towards expressive capture of bodies, hands and faces together from a single RGB image." [Most authors would hardly dare to sound this confident.]
※ Related Work
Modeling the body.
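1) Bodies, Faces and Hands. Historically, most methods split the human body into isolated parts and model each one separately. Blanz and Vetter [6] pioneered this direction with their 3D morphable face model. Work in this line builds expression-related blend shapes on top of FACS, and a large number of later methods follow this pioneering approach; see the survey [7]. FLAME [4] then went a step further and models the whole head and neck region rather than only the face region. Still, no method modeled face shape and body shape together. Next, learning 3D models from 3D scans became popular ("The availability of 3D body scanners enabled learning of body shape from scans"). Much of this work factors body shape and pose using either triangle deformations or vertex-based displacements. Unfortunately, these methods still ignore the hands and face: the hand is treated as a fist or an open palm, and the facial expression is fixed to neutral. Hand modeling has likewise developed in isolation; the details are omitted here.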
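2) Unified Models. Some methods close to this paper do model the human body in a unified way, including the Frank model [3] and SMPL+H [5] (earlier work from the same lab). Frank model = SMPL (with no pose blend shapes) for the body + an artist-created rig for the hands + the FaceWarehouse model [8] for the face, but its results are not realistic enough, while SMPL+H lacks a face model. The authors therefore start from SMPL+H, add the FLAME head model [4], and fit the new 3D model jointly on a large amount of data.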
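Inferring the body. The authors consider only methods that recover a full 3D body mesh and list SMPLify, HMR, NBF and MonoPerfCap as examples; none of them includes the face and hands when estimating body shape. A different line of work uses multi-camera setups to capture 3D pose, 3D meshes (performance capture), or parametric 3D models; the CMU Panoptic Studio is a typical example. The Frank model [3] is fitted in a similar way, using 3D keypoints and 3D point clouds. Such hardware is bulky and expensive. By contrast, the proposed method needs only a single RGB image as input.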
※ Technical approach
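1) Unified model: SMPL-X. An extension of the original SMPL [1] that incorporates FLAME [4] and MANO [5]; all three models come from the same lab (MPI-IS).
2) SMPLify-X: SMPL-X from a single image. An extension of the original SMPLify [2], which is also earlier work from the same lab.
3) Variational Human Body Pose Prior. A VAE is trained to obtain VPoser, a prior over human body poses; the training data comes from the CMU MoCap dataset, Human3.6M and the PosePrior dataset.
4) Collision penalizer. To reduce self-collisions and penetrations of the mesh, a penalty term is added for every pair of colliding triangles.
5) Deep Gender Classifier. Takes the full-body image and the detected joints as input and predicts the gender of the person in the image, so that the gender-matched body model can be used; the classifier is a simple ResNet18.
6) Optimization. Chumpy and OpenDR are replaced by PyTorch and the limited-memory BFGS optimizer (L-BFGS).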
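The components listed above all enter one fitting objective: a 2D data term plus weighted prior terms, which an optimizer then minimizes. The following is a minimal toy sketch of how such an objective can be evaluated, not the authors' implementation. The function names, focal length, sigma and weights are hypothetical; the Geman-McClure robustifier and the plain L2 priors are assumptions chosen for illustration; the collision term and the optimizer itself (L-BFGS in the paper) are left out.

```python
import math


def geman_mcclure(residual, sigma):
    """Robust penalty: close to residual**2 when small, saturating below sigma**2."""
    r2 = residual * residual
    s2 = sigma * sigma
    return s2 * r2 / (s2 + r2)


def project(point, focal, center):
    """Pinhole projection of a camera-space 3D point (x, y, z) to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def fitting_objective(joints3d, keypoints2d, confidences, pose_latent, betas,
                      focal=1000.0, center=(0.0, 0.0), sigma=100.0,
                      w_pose=1.0, w_shape=1.0):
    """Weighted sum of a 2D reprojection term and two L2 priors (toy version)."""
    data = 0.0
    for joint, kp, conf in zip(joints3d, keypoints2d, confidences):
        u, v = project(joint, focal, center)
        pixel_error = math.hypot(u - kp[0], v - kp[1])
        # Each joint is weighted by the detector's confidence for that keypoint.
        data += conf * geman_mcclure(pixel_error, sigma)
    pose_prior = sum(z * z for z in pose_latent)   # L2 on a VAE latent pose code
    shape_prior = sum(b * b for b in betas)        # L2 on shape coefficients
    return data + w_pose * pose_prior + w_shape * shape_prior


if __name__ == "__main__":
    joints = [(0.0, 0.0, 2.0), (0.1, -0.2, 2.0)]
    detections = [project(j, 1000.0, (0.0, 0.0)) for j in joints]
    # Perfect detections and zero latent/shape codes give a zero objective.
    print(fitting_objective(joints, detections, [1.0, 0.8], [0.0] * 4, [0.0] * 3))
```

In a real fit the 3D joints would come from the body model as a function of its parameters, and the optimizer would update those parameters to drive this value down. The robust penalty keeps a single badly detected keypoint from dominating the sum.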
※ Experiments
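1) Evaluation datasets. The authors use their own dataset, the Expressive hands and faces dataset (EHF), which is derived from the SMPL+H dataset with newly added pseudo ground truth.
2) Qualitative & Quantitative evaluations. The experimental design is fairly simple: SMPL-X is compared against three models, SMPL, SMPL+H and Frank. The quantitative results are mainly two tables. Table 1 shows that SMPL-X is the most accurate among SMPL, SMPL+H and SMPL-X, and Table 2 is an ablation study showing the gain that each trick brings to SMPLify-X. The qualitative results are mainly three figures: SMPL-X vs. the Frank model; SMPL-X on the LSP dataset; and a comparison of SMPL-X and SMPLify-X with a hands-only approach.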
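This post does not name the accuracy metric behind the tables. Assuming it is a mean vertex-to-vertex (v2v) distance between a fitted mesh and the ground-truth mesh, a simplified version could look like the sketch below. The function names are hypothetical, both meshes are assumed to share vertex count and ordering, and only translation is removed here; a fuller evaluation would typically also align rotation and scale (Procrustes alignment).

```python
import math


def centroid(vertices):
    """Mean position of a list of (x, y, z) vertices."""
    n = len(vertices)
    return tuple(sum(v[i] for v in vertices) / n for i in range(3))


def mean_v2v_error(predicted, ground_truth):
    """Mean Euclidean distance between corresponding vertices after centering both meshes."""
    if len(predicted) != len(ground_truth):
        raise ValueError("meshes must share the same vertex count and ordering")
    cp, cg = centroid(predicted), centroid(ground_truth)
    total = 0.0
    for p, g in zip(predicted, ground_truth):
        # Compare vertex positions relative to each mesh's own centroid.
        total += math.sqrt(sum(((p[i] - cp[i]) - (g[i] - cg[i])) ** 2 for i in range(3)))
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    print(mean_v2v_error(pred, gt))  # 1.0
```

Centering first means a prediction that is merely shifted in space scores zero, so the number reflects differences in shape and pose rather than global position.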
Qualitative results of SMPL-X for the in-the-wild images of the LSP dataset
[reference RGB] [Frank model] [SMPL-X multiple cameras] [SMPL-X single camera]
※ Conclusion
In this work we present SMPL-X, a new model that jointly captures the body together with face and hands. We additionally present SMPLify-X, an approach to fit SMPL-X to a single RGB image and 2D OpenPose joint detections. We regularize fitting under ambiguities with a new powerful body pose prior and a fast and accurate method for detecting and penalizing penetrations. We present a wide range of qualitative results using images in-the-wild, showing the expressivity of SMPL-X and effectiveness of SMPLify-X. We introduce a curated dataset with pseudo ground-truth to perform quantitative evaluation, that shows the importance of more expressive models. In future work we will curate a dataset of in-the-wild SMPL-X fits and learn a regressor to directly regress SMPL-X parameters directly from RGB images. We believe that this work is an important step towards expressive capture of bodies, hands and faces together from an RGB image. [Quoted verbatim from the paper, as a model of how to write a conclusion.]
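3. Novel points
※ Bringing the hands and face into body-shape modeling is an idea many people could have had; the authors' advantage is that their lab, MPI-IS, already had body, hand and face shape models ready to build on.
※ N*tricks. Not every trick helps once it is added, but here the fitting still improves after many extra terms are added to the objective function. The authors are true masters of tricks.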
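4. Summary
Most of this work extends earlier work from the same lab (MPI-IS). In short, SMPL-X = SMPL [1] + FLAME [4] + MANO [5], and SMPLify-X = SMPLify [2] + N*tricks. The lesson is to accumulate results in one's own field and dig deep into it.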
5. References
[1] Loper M, Mahmood N, Romero J, et al. SMPL: A skinned multi-person linear model[J]. ACM transactions on graphics (TOG), 2015, 34(6): 1-16.
[2] Bogo F, Kanazawa A, Lassner C, et al. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image[C]//European conference on computer vision. Springer, Cham, 2016: 561-578.
[3] Joo H, Simon T, Sheikh Y. Total capture: A 3d deformation model for tracking faces, hands, and bodies[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8320-8329.
[4] Li T, Bolkart T, Black M J, et al. Learning a model of facial shape and expression from 4D scans[J]. ACM Trans. Graph., 2017, 36(6): 194:1-194:17.
[5] Romero J, Tzionas D, Black M J. Embodied hands: modeling and capturing hands and bodies together[J]. ACM Transactions on Graphics (TOG), 2017, 36(6): 1-17.
[6] Blanz V, Vetter T. A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th annual conference on Computer graphics and interactive techniques. 1999: 187-194.
[7] Zollhöfer M, Thies J, Garrido P, et al. State of the art on monocular 3D face reconstruction, tracking, and applications[C]//Computer Graphics Forum. 2018, 37(2): 523-550.
[8] Cao C, Weng Y, Zhou S, et al. Facewarehouse: A 3d facial expression database for visual computing[J]. IEEE Transactions on Visualization and Computer Graphics, 2013, 20(3): 413-425.