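### 1. Prerequisites
Version installed: v0.12.0
(Newer releases require fairly recent Kubernetes and kubectl versions, and no cluster meeting those requirements is currently available. The kubectl version must be within one minor version of the cluster version; for example, a v1.26 client can talk to v1.25, v1.26, and v1.27 control planes. The Kubernetes version must therefore be 1.20 or later.)
If the kubectl version does not meet this requirement, the following error is reported: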
error: rawResources failed to read Resources: Load from path ../../components/namespace/ failed: '../../components/namespace/' must be a file (got d='/home/leinao/xiaoyu/katib/katib/manifests/v1beta1/components/namespace')
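The one-minor-version rule described above can be checked mechanically. The following is a minimal sketch, assuming version strings of the form `v<major>.<minor>[.<patch>]`; the helper names `minor_of` and `within_skew` are hypothetical and not part of kubectl.

```shell
# Extract the minor number from a version string such as "v1.26.3".
minor_of() {
  local v="${1#v}"      # drop a leading "v" if present
  v="${v#*.}"           # drop the major part, leaving e.g. "26.3"
  echo "${v%%.*}"       # keep only the minor part, e.g. "26"
}

# Succeed (exit status 0) when the client and server minor versions
# differ by at most one.
within_skew() {
  local c s d
  c="$(minor_of "$1")"
  s="$(minor_of "$2")"
  d=$(( c - s ))
  [ "${d#-}" -le 1 ]    # "${d#-}" strips the sign, giving |c - s|
}
```

The two version strings can be read from the `clientVersion.gitVersion` and `serverVersion.gitVersion` fields of `kubectl version -o json`, then passed in as `within_skew <client> <server>`.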
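Requirement: kustomize version >= v3.2.0
The table below lists the kustomize version embedded in each kubectl release:

| Kubectl version | Kustomize version |
|---|---|
| < v1.14 | n/a |
| v1.14-v1.20 | v2.0.3 |
| v1.21 | v4.0.5 |
| v1.22 | v4.2.0 |

Install kustomize: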
curl -Lo ./kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x ./kustomize
sudo mv kustomize /usr/local/bin
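After installing, it may be worth confirming that the binary meets the v3.2.0 minimum. The sketch below compares two version strings; it assumes `sort -V` (version sort, as found in GNU coreutils) is available, and `version_ge` is a hypothetical helper name.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# A leading "v" is ignored on both arguments.
version_ge() {
  local a="${1#v}" b="${2#v}"
  # After a version sort, the smaller version comes first; $1 >= $2
  # exactly when that smallest entry is $2.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}
```

The output format of `kustomize version` differs between releases, so the version number has to be extracted from it by hand before calling, for example, `version_ge 3.2.0 3.2.0`.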
### 2. Installing Katib
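Before installing, if you need to run Training Operator jobs, deploy the relevant operators first. For details, see https://www.kubeflow.org/docs/components/training and https://v1-2-branch.kubeflow.org/docs/components/katib/trial-template/#custom-resource. Then update the katib-controller args and grant the ClusterRole permissions on the corresponding resource types, as follows:
// ServiceAccount changes: list the role bindings that reference Katib service accounts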
kubectl get rolebindings,clusterrolebindings --all-namespaces -o custom-columns='KIND:kind,NAMESPACE:metadata.namespace,NAME:metadata.name,SERVICE_ACCOUNTS:subjects[?(@.kind=="ServiceAccount")].name' | grep katib
// Allow Katib to access all Kubernetes resources created by the CRD controllers
kubectl edit clusterrole katib-controller
- apiGroups:
  - kubeflow.org
  resources:
  - experiments
  - experiments/status
  - experiments/finalizers
  - trials
  - trials/status
  - trials/finalizers
  - suggestions
  - suggestions/status
  - suggestions/finalizers
  - tfjobs
  - pytorchjobs
  - mpijobs
  - xgboostjobs
  verbs:
  - '*'
- apiGroups:
  - batch.volcano.sh
  resources:
  - jobs
  verbs:
  - '*'
// katib-controller Deployment changes
// Each entry has the form --trial-resources=<object-kind>.<object-API-version>.<object-API-group>
spec:
  containers:
  - args:
    - --webhook-port=8443
    - --trial-resources=Job.v1.batch
    - --trial-resources=TFJob.v1.kubeflow.org
    - --trial-resources=PyTorchJob.v1.kubeflow.org
    - --trial-resources=MPIJob.v1.kubeflow.org
    - --trial-resources=XGBoostJob.v1.kubeflow.org
    - --trial-resources=Job.v1alpha1.batch.volcano.sh
    command:
    - ./katib-controller
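Because the API group can itself contain dots (as in `batch.volcano.sh`), it is easy to misread a `--trial-resources` value. The sketch below splits a value into kind, version, and group by treating only the first two dots as separators. `parse_trial_resource` is a hypothetical helper for checking your own entries, not something Katib provides.

```shell
# Split "<kind>.<version>.<group>" into its three parts.
# Only the first two dots are separators; the rest belong to the group.
parse_trial_resource() {
  local value="$1" kind rest version group
  kind="${value%%.*}"      # text before the first dot
  rest="${value#*.}"       # everything after the first dot
  version="${rest%%.*}"    # text before the next dot
  group="${rest#*.}"       # remainder, dots included
  echo "kind=$kind version=$version group=$group"
}
```

For instance, `parse_trial_resource Job.v1alpha1.batch.volcano.sh` reports the kind `Job`, the version `v1alpha1`, and the group `batch.volcano.sh`, which should match an `apiGroups` entry granted in the ClusterRole.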
// Check the katib-controller logs
kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow
// Expected output
{"level":"info","ts":1676883770.6249242,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch.volcano.sh","CRD Version":"v1alpha1","CRD Kind":"Job"}
{"level":"info","ts":1676883770.8276498,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: batch.volcano.sh/v1alpha1, Kind=Job"}
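Rather than scanning the logs by eye, the "watch added" line can be searched for. This is a minimal sketch that assumes the controller logs the same JSON fields shown above for every registered kind; `watch_added` is a hypothetical helper name.

```shell
# Read controller logs on stdin; succeed if a "watch added" entry exists
# for the given API group and kind.
watch_added() {
  local group="$1" kind="$2"
  grep -F '"msg":"Job watch added successfully"' \
    | grep -F "\"CRD Group\":\"$group\"" \
    | grep -F "\"CRD Kind\":\"$kind\"" > /dev/null
}
```

Usage: pipe the output of the `kubectl logs` command above into `watch_added batch.volcano.sh Job` and check the exit status.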
Deploy Katib v0.12.0
git clone git@github.com:kubeflow/katib.git
cd katib
git checkout 152ec07
make deploy
Expected output (for example, from `kubectl -n kubeflow get pods`):
NAME READY STATUS RESTARTS AGE
katib-cert-generator-jq7pd 0/1 Completed 0 27m
katib-controller-68c47fbf8b-857dk 1/1 Running 0 27m
katib-db-manager-68fdc946f8-sv5w2 1/1 Running 0 21m
katib-mysql-6dcb447c6f-mhb8p 1/1 Running 0 27m
katib-ui-64bb96d5bf-pxcvf 1/1 Running 0 27m
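The pod listing above can also be checked by a script. The sketch below reads `kubectl get pods` table output and succeeds only if every pod is `Running` or `Completed`. It assumes the default column layout with STATUS in the third column; `all_pods_ok` is a hypothetical helper name.

```shell
# Read `kubectl get pods` table output on stdin; succeed only if every
# pod row shows STATUS Running or Completed. The header row is skipped.
all_pods_ok() {
  awk '$1 != "NAME" && $3 != "Running" && $3 != "Completed" { bad = 1 }
       END { exit bad + 0 }'
}
```

Usage: `kubectl -n kubeflow get pods | all_pods_ok && echo "Katib is up"`.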
### 3. Running the examples
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/release-0.12/examples/v1beta1/tfjob-example.yaml
experiment.kubeflow.org/tfjob-example created
### 4. Viewing the Katib UI
Port-forward the `katib-ui` service:
kubectl -n kubeflow port-forward svc/katib-ui 8080:80
Then open http://localhost:8080/katib/ in a browser.
Alternatively, the UI can be reached directly through a NodePort: http://127.0.0.1:39090/katib/