5. Metrics
In this work we focus on two sets of metrics. We first analyze the recently proposed FID in terms of robustness (of the metric itself), and conclude that it has desirable properties and can be used in practice. Nevertheless, this metric, as well as Inception Score, is incapable of detecting overfitting: a memory GAN which simply stores all training samples would score perfectly under both measures. Based on these shortcomings, we propose an approximation to precision and recall for GANs and show how it can be used to quantify the degree of overfitting. We stress that the proposed method should be viewed as complementary to IS or FID, rather than a replacement.
5.1. Fréchet Inception Distance
FID was shown to be robust to noise [10]. Here we quantify the bias and variance of FID, its sensitivity to the encoding network and sensitivity to mode dropping. To this end, we partition the data set into two groups, i.e.

Furthermore, while we present the results which were obtained by a random search, we have also investigated sequential Bayesian optimization, which resulted in comparable results.
Table 2: Bias and variance of FID. If the data distribution matches the model distribution, FID should evaluate to zero. However, we observe some bias and low variance on samples of size 10000.

sample. We observe results similar to Table 2 which is to be expected if the train and test data sets are drawn from the same distribution.
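The bias discussed above can be illustrated with a minimal sketch. It assumes the standard Fréchet distance between two Gaussians fitted to feature vectors, and it uses random 16-dimensional features as a hypothetical stand-in for Inception activations; the function name and sample sizes are illustrative and not taken from the experiments reported here.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (clipping removes round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
dim = 16  # hypothetical feature size; not the Inception feature dimension
for n in (100, 1000, 10000):
    # Both samples come from the same distribution, so the true distance is 0.
    scores = [
        frechet_distance(rng.normal(size=(n, dim)), rng.normal(size=(n, dim)))
        for _ in range(5)
    ]
    print(n, np.mean(scores), np.std(scores))
```

On such synthetic data the estimate stays positive even though both samples share one distribution, and it shrinks as the sample size grows, which is the qualitative behaviour one would expect of a biased but low-variance estimator.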
Detecting mode dropping with FID. To simulate missing modes, we fix a partition of data set

Figure 1: As the sample captures more classes the FID with respect to the reference data set decreases. We observe that FID drastically increases under mode dropping.
Sensitivity to encoding network. Suppose we compute FID using a different network and encoding layer. Would the ranking of models change? To test this we apply VGG trained on ImageNet and consider the layer FC7 of dimension 4096. Figure 2 shows the resulting distribution. We observe high Spearman’s rank correlation, which encourages the use of the default coding layer suggested by the authors. Of course, a natural comparison would be to apply VGG trained on some other data set, which we leave for future work.
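As a reminder of what the rank correlation measures, the following is a small sketch of Spearman's coefficient computed as the Pearson correlation of ranks. The two score lists are hypothetical and are not results from this paper; the helper assumes there are no tied scores.

```python
import numpy as np


def spearman_rho(x, y):
    """Spearman's rank correlation for score lists without ties."""
    # argsort of argsort yields the rank of each entry (0 = smallest).
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Hypothetical FID scores of the same five models under two encoders.
fid_inception = [12.0, 35.5, 48.1, 90.3, 150.2]
fid_vgg = [31.0, 20.4, 70.9, 66.2, 180.7]
print(spearman_rho(fid_inception, fid_vgg))
```

A value close to 1 means that both encoders order the models almost identically, even if the absolute FID values differ.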

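Figure 2: The difference between FID computed with InceptionNet and FID computed with VGG for the CELEBA data set (for the range of interest: FID < 200). We observe high Spearman's rank correlation, which encourages the use of the default coding layer suggested by the authors.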
5.2. Precision, Recall and F1 Score
Precision, recall and

Notice that IS only captures precision: It will not penalize the model for not producing all modes of the data distribution — it will only penalize the model for not producing all classes. On the other hand, FID captures both precision and recall. Indeed, a model which fails to recover different modes of the data distribution will suffer in terms of FID.
We propose a simple and effective data set for evaluating (and comparing) generative models. Our main motivation is that the currently used data sets are either too simple (e.g. simple mixtures of Gaussians, or MNIST) or too complex (e.g. ImageNet). We argue that it is critical to be able to increase the complexity of the task in a relatively smooth and controlled fashion. To this end, we present a set of tasks for which we can approximate the precision and recall of each model. As a result, we can compare different models based on established metrics.
Manifold of convex polygons. The main idea is to construct a data manifold such that the distances from samples to the manifold can be computed efficiently. As a result, the problem of evaluating the quality of the generative model is effectively transformed into a problem of computing the distance to the manifold. This enables an intuitive approach for defining the quality of the model. Namely, if the samples from the model distribution
are (on average) close to the manifold, its precision is high. Similarly, high recall implies that the generator can recover (i.e. generate something close to) any sample from the manifold.

Figure 3: Samples from models with (a) high recall and precision, (b) high precision, but low recall (lacking in diversity), (c) low precision, but high recall (can decently reproduce triangles, but fails to capture convexity), and (d) low precision and low recall.
For general data sets, this reduction is impractical as one has to compute the distance to the manifold which we are trying to learn. However, if we construct a manifold such that this distance is efficiently computable, the precision and recall can be efficiently evaluated.
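One possible way to turn such distances into scores is sketched below. It assumes that the distances are already available and that a sample counts as "close" when its distance falls below a tolerance `delta`; both the thresholding rule and the value of `delta` are assumptions made for illustration.

```python
import numpy as np


def precision_recall_f1(gen_to_manifold, test_to_generator, delta):
    """Threshold-based precision, recall and F1 from precomputed distances."""
    # Precision: share of generated samples lying within delta of the manifold.
    precision = float(np.mean(np.asarray(gen_to_manifold) <= delta))
    # Recall: share of held-out samples the generator reproduces within delta.
    recall = float(np.mean(np.asarray(test_to_generator) <= delta))
    total = precision + recall
    # F1 is the harmonic mean of precision and recall (0 if both are 0).
    f1 = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical distances; delta is an assumed tolerance.
print(precision_recall_f1([0.1, 0.2, 0.9, 0.3], [0.2, 1.5, 1.1, 0.4], delta=0.5))
```

In this toy call three of four generated samples and two of four held-out samples fall within the tolerance, so precision is higher than recall.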
To this end, we propose a set of toy data sets for which such computation can be performed efficiently: The manifold of convex polygons. As the simplest example, let us focus on gray-scale triangles represented as one channel images as in Figure 3. These triangles belong to a low-dimensional manifold

Figure 4: How does the minimum FID behave as a function of the budget? The plot shows the distribution of the minimum FID achievable for a fixed budget along with one standard deviation interval. For each budget, we estimate the mean and variance using 5000 bootstrap resamples out of 100 runs. We observe that, given a relatively low budget (say less than 15 hyperparameter settings), all models achieve a similar minimum FID. Furthermore, for a fixed FID, “bad” models can outperform “good” models given enough computational budget. We argue that the computational budget to search over hyperparameters is an important aspect of the comparison between algorithms.

using gradient descent on z while keeping G fixed [15].
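The inversion step can be sketched on a toy problem. The example assumes a linear generator G(z) = Wz as a hypothetical stand-in for a trained network, so that the gradient with respect to z can be written by hand; a real generator would require automatic differentiation.

```python
import numpy as np


def invert_generator(generator_matrix, target, steps=500):
    """Minimize ||G(z) - target||^2 over z for a fixed linear toy generator G(z) = W z."""
    w = np.asarray(generator_matrix, dtype=float)
    z = np.zeros(w.shape[1])
    # Step size 1 / (2 * sigma_max^2) keeps plain gradient descent stable here.
    lr = 1.0 / (2.0 * np.linalg.norm(w, ord=2) ** 2)
    for _ in range(steps):
        residual = w @ z - target
        grad = 2.0 * (w.T @ residual)  # gradient of the squared error w.r.t. z
        z = z - lr * grad  # only z is updated; W (the generator) stays fixed
    return z, float(np.linalg.norm(w @ z - target))


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))  # hypothetical generator weights
x_on = W @ rng.normal(size=3)  # a point the generator can produce exactly
x_off = rng.normal(size=8)  # a generic point, typically outside the range of W
print(invert_generator(W, x_on)[1], invert_generator(W, x_off)[1])
```

The residual distance after inversion is what a recall estimate would threshold: it is close to zero for points the generator can reproduce and stays large for points it cannot.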
6. Large-scale Experimental Evaluation
We consider two budget-constrained experimental setups: (i) the wide one-shot setup, in which one may select 100 samples of hyperparameters per model and the range for each hyperparameter is wide, and (ii) the narrow two-shot setup, in which one may select 50 samples from narrower ranges that were manually chosen by first performing the wide hyperparameter search on a specific data set. For the exact ranges and hyperparameter search details we refer the reader to Appendix A. In the second set of experiments we evaluate the models based on the "novel" metric: the F1 score on the proposed data set. Finally, we included the Variational Autoencoder [13] in the experiments as a popular alternative.
6.1. Experimental Setup
To ensure a fair comparison, we made the following choices: (i) we use the generator and discriminator architecture from INFO GAN [5], as the resulting function space is rich enough and none of the considered GANs was originally designed for this architecture. Furthermore, it is similar to a proven architecture used in DCGAN [20]. The exception is BEGAN, where an autoencoder is used as the discriminator. We maintain similar expressive power to INFO GAN by using identical convolutional layers in the encoder and approximately matching the total number of parameters.
For all experiments we fix the latent code size to 64 and the prior distribution over the latent space to be uniform on

for ADAM, and hyperparameters of each model. We report the hyperparameter ranges and other details in Appendix A.
6.2. A Large Hyperparameter Search
We perform hyperparameter optimization and, for each run, look for the best FID across the training run (simulating early stopping). To choose the best model, every 5 epochs we compute the FID between the 10k samples generated by the model and the 10k samples from the test set. We have performed this computationally expensive search for each data set. We present the sensitivity of models to the hyperparameters in Figure 5 and the best FID achieved by each model in Table 3.
3 An empirical comparison to RMSProp is provided in Appendix F.
4 Those four data sets are a popular choice for generative modeling. They are of simple to medium complexity, making it possible to run many experiments as well as getting decent results.

Figure 5: A wide range hyperparameter search (100 hyperparameter samples per model). Black stars indicate the performance of suggested hyperparameter settings. We observe that GAN training is extremely sensitive to hyperparameter settings and there is no model which is significantly more stable than others. The importance of hyperparameter search is further highlighted in Figure 15.
Table 3: Best FID obtained in a large-scale hyperparameter search for each data set. The scores were computed in two phases: first, we run a large-scale search on a wide range of hyperparameters, and select the best model. Then, we re-run the training of the selected model 50 times with different initialization seeds, to estimate the stability of the training and report the mean FID and standard deviation, excluding outliers. The asterisk (*) on some combinations of models and data sets indicates the presence of significant outlier runs, usually severe mode collapses or training failures (* indicates up to 20% failures). We observe that the performance of each model heavily depends on the data set and no model strictly dominates the others. We note that VAE is heavily penalized due to the blurriness of the generated images. Note that these results are not “state-of-the-art”: (i) larger architectures could improve all models, (ii) authors often report the best FID which opens the door for random seed optimization.

Critically, we consider the mean FID as the computational budget increases, which is shown in Figure 4. There are three important observations. Firstly, there is no algorithm which clearly dominates others. Secondly, for an interesting range of FIDs, a “bad” model trained on a large budget can outperform a “good” model trained on a small budget. Finally, when the budget is limited, any statistically significant comparison of the models is unattainable.
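The budget analysis can be sketched as a bootstrap over the FID scores of one model. The scores below are synthetic (drawn from a log-normal distribution purely for illustration) and the function name is hypothetical; only the resampling scheme, drawing a fixed number of runs with replacement and recording the minimum, follows the description of Figure 4.

```python
import numpy as np


def min_fid_vs_budget(fid_scores, budgets, n_boot=5000, seed=0):
    """Bootstrap the mean and std of the best (minimum) FID for each budget."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(fid_scores, dtype=float)
    summary = {}
    for k in budgets:
        # Each row simulates one search: k runs drawn with replacement.
        draws = rng.choice(scores, size=(n_boot, k), replace=True)
        best = draws.min(axis=1)
        summary[k] = (float(best.mean()), float(best.std()))
    return summary


# Synthetic FIDs standing in for 100 hyperparameter samples of one model.
fids = np.random.default_rng(1).lognormal(mean=4.0, sigma=0.6, size=100)
for k, (mean, std) in min_fid_vs_budget(fids, budgets=[1, 5, 15, 50, 100]).items():
    print(k, round(mean, 1), round(std, 1))
```

Running the same procedure for two models and overlaying the resulting curves shows at which budget, if any, their intervals stop overlapping.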
6.3. Impact of Limited Computational Budget
In some cases, the computational budget available to a practitioner is too small to perform such a large-scale hyperparameter search. Instead, one can tune the range of hyperparameters on one data set and interpolate the good hyperparameter ranges for other data sets. We now consider this setting in which we allow only 50 samples from a set of narrow ranges, which were selected based on the wide hyperparameter search on the FASHION-MNIST data set. We report the narrow hyperparameter ranges in Appendix A. Figure 15 shows the variance of FID per model, where the hyperparameters were selected from narrow ranges. From the practical point of view, there are significant differences between the models: in some cases the hyperparameter ranges transfer from one data set to the others (e.g. NS GAN), while others are more sensitive to this choice (e.g. WGAN). We note that better scores can be obtained by a wider hyperparameter search. These results support the conclusion that discussing the best score obtained by a model on a data set is not a meaningful way to discern between these models. One should instead discuss the distribution of the obtained scores.
6.4. Robustness to Random Initialization
For a fixed model, hyperparameters, training algorithm, and the order in which the data is presented to the model, one would expect similar model performance. To test this hypothesis we re-train the best models from the limited hyperparameter range considered in the previous section, while changing the initial weights of the generator and discriminator networks (i.e. by varying a random seed). Table 3 and Figure 16 show the results for each data set. Most models are relatively robust to random initialization, except LSGAN, although for all of them the variance is significant and should be taken into account when comparing models.
6.5. Precision, Recall, and F1
We perform a search over the wide range of hyperparameters and compute precision and recall by considering

Figure 6: How does


7. Conclusion & Open Problems
In this paper we have started a discussion on how to neutrally and fairly compare GANs. We focus on two sets of evaluation metrics: (i) the Fréchet Inception Distance, and (ii) precision, recall and F1. We provide empirical evidence that FID is a reasonable metric due to its robustness with respect to mode dropping and encoding network choices.
Comparison based on FID. Our main insight is that to compare models it is meaningless to report the minimum FID achieved. Instead, distributions of the FID for a fixed computational budget should be compared. Indeed, empirical evidence presented herein implies that algorithmic differences in state-of-the-art GANs become less relevant as the computational budget increases. Furthermore, given a limited budget (say a month of compute-time), a “good” algorithm might be outperformed by a “bad” algorithm.
Comparison based on precision, recall and

score both NS GAN and WGAN enjoy both high precision and recall. Other models, such as DRAGAN and WGAN GP, fail to reach high recall values. Finally, we observe that it is possible to achieve high precision and high recall on this task (cf. Appendix E).

Comparison with respect to original GAN. While many algorithms have claimed superiority over the original GAN model [8], we found no empirical evidence which supports such claims, across all data sets. In fact, the NS GAN performs on par with most other models and achieves the best overall FID on MNIST. Furthermore, it outperforms other models in terms of the F1 score on TRIANGLES.
Open problems. It remains to be examined whether FID is stable under a more radical change of the encoding, e.g. using a network trained on a different task. Also, FID cannot detect overfitting to the training data set, and an algorithm that just remembers all the training examples would perform very well. Finally, FID can probably be “fooled” by artifacts that are not detected by the embedding network.
The triangles data set can be made progressively more complex by: (i) introducing multiple convex polygons at once, (ii) providing color or texture inside the polygon, and (iii) gradually increasing the resolution. While the performance of existing models might be improved given a bigger computational budget and larger model capacity, we argue that algorithmic improvements should drive better performance. Having such a series of tasks of increasing complexity should greatly benefit the research community.
As discussed in Section 4, many dimensions have to be taken into account when comparing different models, and this work only explores a subset of the options. We cannot exclude the possibility that some models significantly outperform others under currently unexplored conditions.
Finally, this work strongly suggests that future GAN research should be more experimentally systematic and models should be compared on a neutral ground.
