futurely: commented on 7 Oct 2019
Bug
The results of running RGCN multiple times are not always consistent and are not always the same as the reported results.
The results of running HAN multiple times are consistent but are not the same as the reported results.
To Reproduce
Steps to reproduce the behavior:
https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn-hetero#entity-classification
- python3 entity_classify.py -d aifb --testing --gpu 0
- python3 entity_classify.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0
- python3 entity_classify.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
- python3 entity_classify.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0
https://github.com/dmlc/dgl/tree/master/examples/pytorch/han
- python main.py
- python main.py --hetero
Expected behavior
Reproducible experimental results across different runtime environments.
Environment
- DGL Version (e.g., 1.0): 0.4
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.2.0
- OS (e.g., Linux): Linux n12-066-207 4.9.0-0.bpo.3-amd64 #1 SMP Debian 4.9.30-2+deb9u5~bpo8+1 (2017-09-28) x86_64 GNU/Linux
- How you installed DGL (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.7.3
- CUDA/cuDNN version (if applicable): cuda 9.0, cudnn 7.3.0
- GPU models and configuration (e.g. V100): GeForce GTX 1080Ti
- Any other relevant information: CUDA Driver Version: 410.78
jermainewang: commented on 8 Oct 2019
Hi @futurely , the results do vary across different runs, as the author notes too. Here are the results of ten runs:
I will update the readme to clarify the results.
futurely: commented on 8 Oct 2019
HAN reruns produce the same results in the same environment because the random seed is set. The different results across environments must be caused by something else. RGCN also needs to set the random seed to get fixed results. A reproducible environment can be obtained with Docker.
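For reference, a minimal sketch of the seeding I mean, using the standard Python/NumPy/PyTorch calls; the `set_seed` helper is illustrative and not part of the DGL example scripts:

```python
import random

import numpy as np
import torch


def set_seed(seed=0):
    """Illustrative helper: fix the random seeds before loading data and building the model."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for repeatability in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    try:
        import dgl
        dgl.random.seed(seed)  # only if this call exists in your DGL version
    except (ImportError, AttributeError):
        pass


set_seed(0)  # call once at the top of entity_classify.py / main.py
```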
jermainewang: commented on 8 Oct 2019
With more runs, the averaged outcomes become more and more stable. Deterministic behavior is useful for debugging but not necessary for model performance. A random seed cannot solve everything, especially when the system has concurrency that affects numerical outcomes. That being said, I think reporting the averaged result from multiple runs is fine (and is also widely accepted), and reporting the standard deviation or min/max range is recommended if the variance is large.
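For example, something along these lines, where `run_experiment` is a hypothetical placeholder for one full training run that returns the test accuracy:

```python
import numpy as np

# run_experiment(seed) is hypothetical: train once with the given seed
# and return the test accuracy.
accuracies = [run_experiment(seed) for seed in range(10)]

print(f"test accuracy over {len(accuracies)} runs: "
      f"{np.mean(accuracies):.4f} ± {np.std(accuracies):.4f} "
      f"(min {min(accuracies):.4f}, max {max(accuracies):.4f})")
```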
Edit: @mufeili would you please take a look at the HAN result?
futurely: commented on 8 Oct 2019
Very few GNN studies repeat randomized experiments multiple times to compare both average values and standard deviations.
A good example is Keep It Simple: Graph Autoencoders Without Graph Convolutional Networks, which uses metrics “averaged over 100 runs with different random train/validation/test splits” to show that a linear autoencoder is competitive with multi-layer GCN autoencoders.
yzh119: commented on 8 Oct 2019
@futurely , DGL uses atomic operations in CUDA kernels, so we cannot guarantee deterministic results even if all random seeds are fixed. (PyTorch has similar issues for several operators: https://pytorch.org/docs/stable/notes/randomness.html).
Though I don't think it is a good habit for ML researchers to report the best metric obtained with a fixed random seed rather than the average metric over multiple runs with different random seeds, I understand why they do so. Yes, we will try to remove atomic operations in DGL 0.5 and guarantee deterministic behavior.
In my experience, the non-determinism affects the result very little if the dataset is relatively large. If the performance of a GNN model on small datasets (I'm not pointing at cora/citeseer/pubmed... but they actually are small) differs noticeably just because of the randomness in atomic operations (0.001 + 0.1 + 0.01 or 0.01 + 0.001 + 0.1?), I think researchers had better turn to a larger dataset (one that is not so fragile) or report the average result of multiple runs so that the results are more convincing. If a paper claims its model outperforms a baseline by 0.* with a fixed random seed, who knows whether that is random noise or substantial progress made by the model itself.
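The point about summation order can be seen with plain Python floats; this is a generic illustration of floating-point non-associativity, not DGL code, and it uses the textbook 0.1/0.2/0.3 values rather than the numbers above:

```python
# Floating-point addition is not associative, so the order in which an
# atomic accumulation happens to execute can change the low-order bits.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False
print(a, b)    # 0.6000000000000001 0.6
```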
futurely: commented on 8 Oct 2019
There are too many factors that make model performance hard to reproduce and compare. It is necessary to benchmark the representative algorithms with the same framework, datasets (including preprocessing), runtime environment, and hyperparameters. The hyperparameters for each algorithm should not just use the default values of the original papers or implementations but should be thoroughly (auto-)tuned (see the rough sketch at the end of this comment).
PyG has a benchmark suite covering a few typical tasks on some small datasets.
Google benchmarked classic object detection algorithms with production-level implementations.
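For illustration, the inner loop of such a suite might look roughly like the sketch below; `ALGORITHMS`, the grid values, and the `train_and_eval_*` callables are hypothetical placeholders, not existing DGL or PyG code:

```python
from itertools import product

import numpy as np

# Hypothetical registry: name -> callable(dataset, seed, **hparams) -> test accuracy.
ALGORITHMS = {"rgcn": train_and_eval_rgcn, "han": train_and_eval_han}
GRID = {"lr": [1e-2, 1e-3], "n_hidden": [16, 64]}
SEEDS = range(5)

for name, train_and_eval in ALGORITHMS.items():
    best = None
    for values in product(*GRID.values()):
        hparams = dict(zip(GRID, values))
        # Average over seeds for every hyperparameter setting, then keep the best mean.
        scores = [train_and_eval("aifb", seed=s, **hparams) for s in SEEDS]
        candidate = (np.mean(scores), np.std(scores), hparams)
        if best is None or candidate[0] > best[0]:
            best = candidate
    print(f"{name}: {best[0]:.4f} ± {best[1]:.4f} with {best[2]}")
```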
yzh119: commented on 8 Oct 2019
@futurely I agree with all your points.
What I mean is that for small datasets, researchers should report their models' average performance across multiple runs with different random seeds, or the result does not mean much.
futurely: commented on 8 Oct 2019
I also agree with you.
My point is that a benchmark suite implementing these best practices should be added to DGL or a related repo. The suite could be run frequently to show the latest improvements in model quality and speed. It would help attract more researchers to implement algorithms with DGL and contribute back.