References

NNI source code: https://github.com/microsoft/nni
NNI documentation (Chinese): https://nni.readthedocs.io/zh/stable/Overview.html
Reference for every field in the NNI experiment config file: https://nni.readthedocs.io/en/v2.3/Tutorial/ExperimentConfig.html
Implementing a training service in NNI: https://nni.readthedocs.io/zh/latest/TrainingService/HowToImplementTrainingService.html
Hyperparameter tuning (built-in tuners): https://nni.readthedocs.io/zh/stable/builtin_tuner.html

Architecture

NNI architecture

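Experiment: one tuning task, for example searching for a model's best hyperparameter combination or the best neural network architecture. It consists of trials and an AutoML algorithm.
Search space: the range over which the model is tuned, for example the allowed values of each hyperparameter.
Configuration: an instance drawn from the search space, in which every hyperparameter has a specific value.
Trial: a single, independent attempt that runs with one configuration (for example, a set of hyperparameter values or a specific neural network architecture).
Tuner: an AutoML algorithm that generates a new configuration for the next trial; the new trial then runs with that configuration.
Assessor: analyzes a trial's intermediate results (for example, accuracy evaluated periodically on the evaluation dataset) to decide whether the trial should be stopped early.
Training service: the environment in which trials run. Depending on the experiment configuration, it can be the local machine, a group of remote servers, or another large-scale training platform (such as OpenPAI or Kubernetes).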
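To show how these pieces interact, here is a toy, self-contained simulation. It is not NNI code: a random-search tuner stands in for TPE, a simplified median-stopping rule stands in for NNI's built-in assessors, and a made-up objective stands in for real training. All class and function names are hypothetical; only the `_type`/`_value` layout of the search space follows NNI's format.

```python
"""Toy simulation of experiment / trial / tuner / assessor (not NNI's real code)."""
import random
import statistics

# Search space: allowed values of each hyperparameter (NNI-style "_type"/"_value").
SEARCH_SPACE = {
    "lr": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]},
    "momentum": {"_type": "uniform", "_value": [0.0, 1.0]},
}


class RandomTuner:
    """Tuner (hypothetical): samples a new configuration for each trial."""

    def __init__(self, space, seed=0):
        self.space = space
        self.rng = random.Random(seed)

    def generate_parameters(self):
        config = {}
        for name, spec in self.space.items():
            if spec["_type"] == "choice":
                config[name] = self.rng.choice(spec["_value"])
            elif spec["_type"] == "uniform":
                low, high = spec["_value"]
                config[name] = self.rng.uniform(low, high)
            else:
                raise ValueError(f"unsupported _type: {spec['_type']}")
        return config


class MedianStopAssessor:
    """Assessor (simplified): stop a trial whose metric at a step is below the
    median of earlier trials' metrics at that same step."""

    def __init__(self):
        self.history = {}  # step -> metrics recorded by earlier trials

    def should_stop(self, step, value):
        past = self.history.get(step, [])
        return len(past) >= 2 and value < statistics.median(past)

    def record(self, curve):
        for step, value in enumerate(curve):
            self.history.setdefault(step, []).append(value)


def fake_trial(config, steps=5):
    """Trial: 'trains' with one configuration and yields intermediate metrics."""
    target = 1.0 - abs(config["lr"] - 0.01) * 5 - abs(config["momentum"] - 0.9) * 0.2
    for step in range(1, steps + 1):
        yield target * step / steps


def run_experiment(max_trials=6, seed=0):
    """Experiment: ask the tuner for configurations and run one trial per config."""
    tuner, assessor = RandomTuner(SEARCH_SPACE, seed), MedianStopAssessor()
    results = []  # (configuration, last metric, number of steps actually run)
    for _ in range(max_trials):
        config = tuner.generate_parameters()
        curve = []
        for step, metric in enumerate(fake_trial(config)):
            curve.append(metric)
            if assessor.should_stop(step, metric):
                break  # stopped early by the assessor
        assessor.record(curve)
        results.append((config, curve[-1], len(curve)))
    best = max(results, key=lambda r: r[1])
    return best, results
```

Calling `best, results = run_experiment()` returns the best configuration found plus a per-trial record; trials that the assessor cuts short run fewer than 5 steps. In a real experiment, NNI plays the role of `run_experiment` and the trials run as separate processes on the training service.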

NNI system architecture

NNI commands

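Create an experiment (use --port to serve the web UI on a port other than the default 8080, for example when running several experiments at once)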

nnictl create --config examples/trials/mnist-pytorch/config.yml
nnictl create --config examples/trials/mnist-pytorch/config.yml --port 9090
nnictl create --config examples/trials/mnist-pytorch/config_remote.yml --port 9091

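Stop an experiment

nnictl stop [EXPERIMENT_ID]  # the ID can be omitted when only one experiment is running
nnictl stop --all  # stop all experiments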

Get trial logs (prints the trial log paths)

nnictl log trial

Monitor experiments

nnictl top

List all experiments

nnictl experiment list --all

Restart the web UI of a stopped experiment

nnictl view EXPERIMENT_ID
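When these commands are scripted, it can help to build them as argument lists and print them before running anything. The following is a minimal sketch with hypothetical helper names; it uses only the subcommands and flags shown above and executes nothing unless `run()` is called with `dry_run=False`.

```python
"""Dry-run helpers that assemble the nnictl commands listed above (hypothetical names)."""
import shlex
import subprocess


def create_cmd(config_path, port=None):
    """nnictl create --config <file> [--port <port>]"""
    cmd = ["nnictl", "create", "--config", config_path]
    if port is not None:
        # Web UI port; experiments running at the same time need different ports.
        cmd += ["--port", str(port)]
    return cmd


def stop_cmd(experiment_id=None, stop_all=False):
    """nnictl stop [<id>] or nnictl stop --all"""
    if stop_all:
        return ["nnictl", "stop", "--all"]
    return ["nnictl", "stop"] + ([experiment_id] if experiment_id else [])


def view_cmd(experiment_id):
    """nnictl view <id>: restart the web UI of a stopped experiment."""
    return ["nnictl", "view", experiment_id]


def run(cmd, dry_run=True):
    """Print the command line; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(create_cmd("examples/trials/mnist-pytorch/config.yml", port=9090))
    run(stop_cmd(stop_all=True))
```

Keeping the commands as lists (rather than one shell string) avoids quoting problems when config paths contain spaces.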

Example NNI experiment config files

Example 1: YAML config for running locally

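searchSpaceFile: search_space.json  # search space definition (JSON)
trialCommand: python3 mnist.py  # NOTE: change "python3" to "python" if you are using Windows
trialCodeDirectory: /data/george/code/schinper-nni/nni/examples/trials/mnist-pytorch  # directory that contains mnist.py
trialGpuNumber: 1  # number of GPUs each trial uses
trialConcurrency: 3  # number of trials run at the same time
useAnnotation: False  # the trial code does not use NNI annotations
debug: False  # debug mode off
experimentWorkingDirectory: /root/nni-experiments  # where experiment logs and data are stored
tuner:
  name: TPE  # Tree-structured Parzen Estimator tuner
  classArgs:
    optimize_mode: maximize  # a larger reported metric is better
trainingService:
  platform: local  # run trials on this machine
  useActiveGpu: True  # also submit trials to GPUs already used by other processes
  maxTrialNumberPerGpu: 2  # at most 2 trials may share one GPU
maxTrialNumber: 3  # stop the experiment after 3 trials in total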
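For every trial, the config above runs `python3 mnist.py` inside `trialCodeDirectory`. Below is a minimal sketch of what such a trial script does, not the real example (which trains an actual model on MNIST). The hyperparameter names `lr`, `momentum` and `epochs` are hypothetical and would have to match the keys in search_space.json, and `train_one_epoch` is a placeholder. The script reads the tuner's configuration with `nni.get_next_parameter()`, reports per-epoch metrics with `nni.report_intermediate_result()`, and reports the final metric with `nni.report_final_result()`. When the `nni` package is not installed, small stubs take its place so the sketch can run on its own.

```python
"""Sketch of a trial script in the style of mnist.py (hypothetical parameters)."""
try:
    import nni
except ImportError:  # fallback stubs so the sketch runs without NNI installed
    class _StubNNI:
        def __init__(self):
            self.intermediate, self.final = [], None

        def get_next_parameter(self):
            return {}

        def report_intermediate_result(self, metric):
            self.intermediate.append(metric)

        def report_final_result(self, metric):
            self.final = metric

    nni = _StubNNI()

# Used when a key is not supplied by the tuner (e.g. when run outside an experiment).
DEFAULT_PARAMS = {"lr": 0.01, "momentum": 0.5, "epochs": 3}


def train_one_epoch(params, epoch):
    """Placeholder for real training; returns a made-up accuracy in [0, 1]."""
    return min(1.0, 0.5 + 0.1 * epoch + params["momentum"] * 0.05)


def main():
    params = dict(DEFAULT_PARAMS)
    params.update(nni.get_next_parameter() or {})  # configuration chosen by the tuner
    accuracy = 0.0
    for epoch in range(1, params["epochs"] + 1):
        accuracy = train_one_epoch(params, epoch)
        # Intermediate metric: used by an assessor, if configured, and plotted in the web UI.
        nni.report_intermediate_result(accuracy)
    # Final metric: what the tuner optimizes (optimize_mode: maximize).
    nni.report_final_result(accuracy)
    return accuracy


if __name__ == "__main__":
    main()
```

Because the script only talks to NNI through these three calls, the same file can be run directly with `python3 mnist.py` for debugging and under `nnictl create` for tuning.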