Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Paper: http://arxiv.org/pdf/1511.06434v2.pdf
ABSTRACT
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.
1 INTRODUCTION
Learning reusable feature representations from large unlabeled datasets has been an area of active research. In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks such as image classification. We propose that one way to build good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques. One can additionally argue that their learning process and the lack of a heuristic cost function (such as pixel-wise independent mean-square error) are attractive to representation learning. GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of multi-layer GANs.
In this paper, we make the following contributions:
• We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN).
• We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.
• We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.
• We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.
2 RELATED WORK
2.1 REPRESENTATION LEARNING FROM UNLABELED DATA
Unsupervised representation learning is a fairly well studied problem in general computer vision research, as well as in the context of images. A classic approach to unsupervised representation learning is to do clustering on the data (for example using K-means), and leverage the clusters for improved classification scores. In the context of images, one can do hierarchical clustering of image patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that encode an image into a compact code, and decode the code to reconstruct the image as accurately as possible. These methods have also been shown to learn good feature representations from image pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning hierarchical representations.
2.2 GENERATING NATURAL IMAGES
Generative image models are well studied and fall into two categories: parametric and nonparametric.
The non-parametric models often do matching from a database of existing images, often matching patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution (Freeman et al., 2002) and in-painting (Hays & Efros, 2007).
Parametric models for generating images have been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images of the real world had not had much success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer from being blurry. Another approach generates images using an iterative forward diffusion process (Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated images suffering from being noisy and incomprehensible. A Laplacian pyramid extension to this approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects looking wobbly because of noise introduced in chaining multiple models. A recurrent network approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have also recently had some success with generating natural images. However, they have not leveraged the generators for supervised tasks.
2.3 VISUALIZING THE INTERNALS OF CNNS
One constant criticism of using neural networks has been that they are black-box methods, with little understanding of what the networks do in the form of a simple human-consumable algorithm. In the context of CNNs, Zeiler & Fergus (2014) showed that by using deconvolutions and filtering the maximal activations, one can find the approximate purpose of each convolution filter in the network. Similarly, using gradient descent on the inputs lets us inspect the ideal image that activates certain subsets of filters (Mordvintsev et al.).
3 APPROACH AND MODEL ARCHITECTURE
Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to iteratively upscale low resolution generated images which can be modeled more reliably. We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models.
Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures.
The first is the all convolutional net (Springenberg et al., 2014), which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in both our generator, allowing it to learn its own spatial upsampling, and our discriminator.
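As a minimal sketch of the idea, the pure-Python snippet below applies a 1-D convolution with stride 2 to a short signal. The helper name `strided_conv1d` and the kernel values are illustrative assumptions, not taken from the paper. In a trained network the kernel would be learned, which is what lets the layer choose its own downsampling instead of applying a fixed max or average.

```python
def strided_conv1d(signal, kernel, stride):
    """Valid (no padding) 1-D cross-correlation with the given stride."""
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        window = signal[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.5, 0.5]  # hypothetical weights; a real layer learns these
downsampled = strided_conv1d(signal, kernel, stride=2)
print(downsampled)  # half as many outputs as inputs
```

With these particular weights the layer happens to reproduce average pooling; other weights give other downsampling behaviour, which is the flexibility the strided formulation provides.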
Second is the trend towards eliminating fully connected layers on top of convolutional features. The strongest example of this is global average pooling which has been utilized in state of the art image classification models (Mordvintsev et al.). We found global average pooling increased model stability but hurt convergence speed. A middle ground of directly connecting the highest convolutional features to the input and output respectively of the generator and discriminator worked well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example model architecture.
Third is Batch Normalization (Ioffe & Szegedy, 2015), which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get deep generators to begin learning, preventing the generator from collapsing all samples to a single point, which is a common failure mode observed in GANs. Directly applying batchnorm to all layers, however, resulted in sample oscillation and model instability. This was avoided by not applying batchnorm to the generator output layer and the discriminator input layer.
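The normalization step can be sketched in a few lines of plain Python. This is a simplified illustration for a single unit over one mini-batch: it omits the learned scale and shift parameters and the running statistics that full Batch Normalization uses, and the `eps` value is an assumed small constant for numerical safety.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Normalize one unit's inputs across a mini-batch to ~zero mean, ~unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance of the batch
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]  # toy pre-activations for one unit
normalized = batch_normalize(batch)
print(normalized)
```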
The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output layer which uses the Tanh function. We observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which used the maxout activation (Goodfellow et al., 2013).
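For reference, the three activations mentioned above can be written as scalar functions. This is only an illustrative sketch; the leak slope of 0.2 is the value the paper reports in its training details, and the function names are chosen here for readability.

```python
import math

def relu(x):
    """Rectified linear unit, used in the generator's hidden layers."""
    return max(0.0, x)

def leaky_relu(x, slope=0.2):
    """Leaky rectifier, used in the discriminator; negative inputs are scaled, not zeroed."""
    return x if x > 0 else slope * x

def tanh_output(x):
    """Bounded activation for the generator output, with values in (-1, 1)."""
    return math.tanh(x)

print(relu(-3.0), leaky_relu(-1.0), tanh_output(0.5))
```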
4 DETAILS OF ADVERSARIAL TRAINING
We trained DCGANs on three datasets, Large-scale Scene Understanding (LSUN) (Yu et al., 2015), Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets are given below.
No pre-processing was applied to training images besides scaling to the range of the tanh activation function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001 to be too high, using 0.0002 instead. Additionally, we found that leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training.
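The settings above can be collected into a small, hedged sketch. It assumes 8-bit input pixels in [0, 255], and it uses the usual Adam defaults for the second-moment decay (0.999) and epsilon (1e-8), which this section does not state. The learning rate of 0.0002, the momentum term of 0.5 and the initialization scale of 0.02 are the values quoted in the text. The function names are illustrative.

```python
import math
import random

def scale_pixel(p):
    """Map an 8-bit pixel value in [0, 255] to the tanh range [-1, 1]."""
    return p / 127.5 - 1.0

def init_weight(rng):
    """Draw one weight from a zero-centered Normal with standard deviation 0.02."""
    return rng.gauss(0.0, 0.02)

def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

w = init_weight(random.Random(0))
new_param, m1, v1 = adam_step(param=0.0, grad=1.0, m=0.0, v=0.0, t=1)
print(scale_pixel(0), scale_pixel(255), new_param)
```

On the first step the bias-corrected update has magnitude close to the learning rate, so the parameter moves by roughly 0.0002 against the gradient.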
*Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.*
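To make the upsampling path concrete, the sketch below tracks the spatial size through four fractionally-strided (transposed) convolutions. The starting extent of 4 and the kernel size 4, stride 2, padding 1 combination are assumptions chosen because they exactly double the size at each layer; the caption itself does not give these values. The formula is the standard output-size rule for a transposed convolution without dilation or output padding.

```python
def transposed_conv_size(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a fractionally-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel

sizes = [4]  # assumed spatial extent after projecting and reshaping Z
for _ in range(4):  # four fractionally-strided convolutions
    sizes.append(transposed_conv_size(sizes[-1]))
print(sizes)
```

Under these assumptions each layer doubles the width and height, so four layers take a 4 × 4 map to a 64 × 64 image.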
4.1 LSUN
As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.
4.1.1 DEDUPLICATION
To further decrease the likelihood of the generator memorizing input examples (Fig. 2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized ReLU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation, which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.
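The hashing step can be sketched as follows, assuming the autoencoder codes are already available as lists of ReLU activations. The threshold of 0 and the toy 3-unit codes are assumptions for illustration (the paper's code layer has 128 units and its threshold is not stated here). Using a set of binary tuples gives the single-pass, linear-time behaviour described above.

```python
def binarize(code, threshold=0.0):
    """Turn ReLU activations into a binary hash: 1 where the unit is active."""
    return tuple(1 if a > threshold else 0 for a in code)

def deduplicate(codes):
    """Return indices of the first example seen for each binary hash (one pass)."""
    seen = set()
    kept = []
    for i, code in enumerate(codes):
        key = binarize(code)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept

codes = [
    [0.9, 0.0, 0.3],
    [0.7, 0.0, 0.5],  # same on/off pattern as example 0, so treated as a near duplicate
    [0.0, 0.4, 0.0],
]
kept = deduplicate(codes)
print(kept)
```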
4.2 FACES
We scraped images containing human faces from random web image queries of people's names. The names were acquired from dbpedia, with the criterion that the people were born in the modern era. This dataset has 3M images from 10K people. We ran an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate.
Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples such as the base boards of some of the beds.
4.3 IMAGENET-1K
We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.
5 EMPIRICAL VALIDATION OF DCGANS CAPABILITIES
5.1 CLASSIFYING CIFAR-10 USING GANS AS A FEATURE EXTRACTOR
One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.
On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm. When using a very large number of feature maps (4800) this technique achieves 80.6% accuracy. An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates & Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator's convolutional features from all layers, maxpooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means based approaches. Notably, the discriminator has far fewer feature maps (512 in the highest layer) compared to K-means based techniques, but does result in a larger total feature vector size due to the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented, exemplar samples from the source dataset. Further improvements could be made by finetuning the discriminator's representations, but we leave this for future work. Additionally, since our DCGAN was never trained on CIFAR-10, this experiment also demonstrates the domain robustness of the learned features.
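The pooling-and-concatenation step can be sketched in plain Python. The grid size of 4 used here is an assumption for illustration, as are the toy feature maps; the sketch also assumes each map is square with a side that is a multiple of the grid size. With a 4 × 4 grid, each feature map contributes 16 values, so a 28672 dimensional vector would correspond to 28672 / 16 = 1792 pooled feature maps in total.

```python
def max_pool_to_grid(fmap, grid=4):
    """Max-pool a square 2-D feature map (list of rows) down to grid x grid, flattened."""
    size = len(fmap)
    step = size // grid  # assumes size is a multiple of grid
    pooled = []
    for gy in range(grid):
        for gx in range(grid):
            block = [fmap[y][x]
                     for y in range(gy * step, (gy + 1) * step)
                     for x in range(gx * step, (gx + 1) * step)]
            pooled.append(max(block))
    return pooled

def extract_features(layers, grid=4):
    """layers: list of layers, each a list of 2-D feature maps. Returns one flat vector."""
    features = []
    for layer in layers:
        for fmap in layer:
            features.extend(max_pool_to_grid(fmap, grid))
    return features

# Toy input: one layer with a single 8x8 map, one layer with a single 4x4 map.
map8 = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
map4 = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
features = extract_features([[map8], [map4]])
total_maps = 28672 // (4 * 4)  # feature maps implied by a 4 x 4 grid
print(len(features), total_maps)
```

The resulting vector is what a linear classifier such as an L2-SVM would be trained on.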
Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pretrained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.
5.2 CLASSIFYING SVHN DIGITS USING GANS AS A FEATURE EXTRACTOR
On the StreetView House Numbers dataset (SVHN) (Netzer et al., 2011), we use the features of the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of 10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000 uniformly class distributed training examples are randomly selected and used to train a regularized linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally, we validate that the CNN architecture used in DCGAN is not the key contributing factor of the model's performance by training a purely supervised CNN with the same architecture on the same data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio, 2012). It achieves a significantly higher 28.87% validation error.
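Selecting a uniformly class distributed subset can be sketched as below. With the 10 digit classes of SVHN, 1000 labels correspond to 100 examples per class. The helper name `sample_balanced`, the fixed seed, and the toy label list are assumptions for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict

def sample_balanced(labels, per_class, seed=0):
    """Pick `per_class` random example indices for every class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], per_class))
    return chosen

# Toy labels: 10 digit classes with 500 examples each (hypothetical sizes).
labels = [i % 10 for i in range(5000)]
subset = sample_balanced(labels, per_class=100)  # 10 classes * 100 = 1000 labels
print(len(subset))
```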
6 INVESTIGATING AND VISUALIZING THE INTERNALS OF THE NETWORKS
We investigate the trained generators and discriminators in a variety of ways. We do not do any kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.
Table 2: SVHN classification with 1000 labels
6.1 WALKING IN THE LATENT SPACE
The first experiment we did was to understand the landscape of the latent space. Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations. The results are shown in Fig.4.
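A walk through the latent space can be sketched as interpolation between two Z vectors, with each intermediate point then fed to the generator. The snippet assumes straight-line (linear) interpolation and uses toy 3-dimensional vectors; the generator call itself is omitted because it depends on the trained model.

```python
def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two latent vectors, endpoints included."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

z_a = [-1.0, 0.0, 0.5]   # toy latent points; real Z vectors are 100-dimensional
z_b = [1.0, 0.5, -0.5]
path = interpolate(z_a, z_b, steps=5)
for z in path:
    print(z)
```

Smooth changes in the images generated along `path` would suggest learned structure, while abrupt jumps would hint at memorization.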
6.2 VISUALIZING THE DISCRIMINATOR FEATURES
Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting. Using guided backpropagation as proposed by Springenberg et al. (2014), we show in Fig. 5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.
6.3 MANIPULATING THE GENERATOR REPRESENTATION
6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS
In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggests that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.
On 150 samples, 52 window bounding boxes were drawn manually. On the second highest convolution layer features, logistic regression was fit to predict whether a feature activation was on a window (or not), by using the criterion that activations inside the drawn bounding boxes are positives and random samples from the same images are negatives. Using this simple model, all feature maps with weights greater than zero (200 in total) were dropped from all spatial locations. Then, random new samples were generated with and without the feature map removal.
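The dropping step can be sketched as follows, assuming the logistic regression has already been fitted and yields one weight per feature map. The weights and the 2 × 2 feature maps are toy values, and "dropped" is interpreted here as setting the whole map to zero, which is an assumption about the mechanism.

```python
def window_filter_indices(weights):
    """Indices of feature maps whose logistic-regression weight is greater than zero."""
    return [i for i, w in enumerate(weights) if w > 0]

def drop_feature_maps(feature_maps, drop_indices):
    """Zero the selected feature maps at every spatial location; leave the rest as is."""
    drop = set(drop_indices)
    return [
        [[0.0 for _ in row] for row in fmap] if i in drop else fmap
        for i, fmap in enumerate(feature_maps)
    ]

weights = [0.8, -0.3, 0.0, 1.2]                      # hypothetical fitted weights
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in weights]   # four toy 2x2 feature maps
to_drop = window_filter_indices(weights)
edited = drop_feature_maps(maps, to_drop)
print(to_drop)
```

The edited activations would then be passed through the remaining generator layers to produce the modified sample.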
The generated images with and without the window dropout are shown in Fig.6, and interestingly, the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.
Figure 4: Top rows: Interpolation between a series of 9 random points in Z shows that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.
6.3.2 VECTOR ARITHMETIC ON FACE SAMPLES
In the context of evaluating learned representations of words, Mikolov et al. (2013) demonstrated that simple arithmetic operations revealed rich linear structure in representation space. One canonical example demonstrated that vector("King") - vector("Man") + vector("Woman") resulted in a vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic. In addition to the object manipulation shown in Fig. 7, we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).
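The averaging-then-arithmetic procedure can be sketched on toy vectors. Each group below stands for the Z vectors of three exemplar samples of one visual concept; the 2-dimensional values are made up for illustration, and the result would be fed to the generator to produce an image.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_arithmetic(group_a, group_b, group_c):
    """Return mean(A) - mean(B) + mean(C) for three groups of Z vectors."""
    a, b, c = mean_vector(group_a), mean_vector(group_b), mean_vector(group_c)
    return [x - y + z for x, y, z in zip(a, b, c)]

# Three exemplars per concept (toy values; real Z vectors are 100-dimensional).
group_a = [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
group_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
group_c = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
y = concept_arithmetic(group_a, group_b, group_c)
print(y)
```

Averaging before the arithmetic is the step the text credits with making the results stable.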
These demonstrations suggest interesting applications can be developed using Z representations learned by our models. It has been previously demonstrated that conditional generative models can learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al., 2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised models. Further exploring and developing the above mentioned vector arithmetic could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.
Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses for the first 6 learned convolutional features from the last convolution layer in the discriminator. Notice a significant minority of features respond to beds - the central object in the LSUN bedrooms dataset. On the left is a random filter baseline. Compared to the learned responses, it shows little to no discrimination and only random structure.
Figure 6: Top row: un-modified samples from the model. Bottom row: the same samples generated after dropping out "window" filters. Some windows are removed, others are transformed into objects with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job disentangling scene representation from object representation. Extended experiments could be done to remove other objects from the image and modify the objects the generator draws.
7 CONCLUSION AND FUTURE WORK
We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling. There are still some forms of model instability remaining - we noticed as models are trained longer they sometimes collapse a subset of filters to a single oscillating mode.
Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors, creating a new vector Y. The center sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.
Further work is needed to tackle this form of instability. We think that extending this framework to other domains such as video (for frame prediction) and audio (pre-trained features for speech synthesis) should be very interesting. Further investigations into the properties of the learnt latent space would be interesting as well.
Figure 8: A "turn" vector was created from four averaged samples of faces looking left vs looking right. By adding interpolations along this axis to random samples we were able to reliably transform their pose.
Article source: http://tongtianta.site/paper/351
Editor: Lornatang
Proofreader: Lornatang