Katib v0.12.0 Deployment Pitfalls and Notes

Author: 缘尤会 | Published: 2023-02-21 10:00

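1. Prerequisites

Version installed: v0.12.0

(Newer Katib releases require newer Kubernetes and kubectl versions, and no cluster meeting those requirements was available. kubectl must also be within one minor version of the cluster: for example, a v1.26 client can talk to v1.25, v1.26, and v1.27 control planes. For this install, the Kubernetes version must be 1.20 or later.)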
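The one-minor-version skew rule can be checked before attempting the install. The following is a minimal sketch rather than part of the original notes: `minor_of` and `within_skew` are hypothetical helpers that compare only the minor version numbers and assume both versions share the same major version.

```shell
#!/bin/sh
# Hypothetical helpers (not part of kubectl or Katib).

# Extract the minor version: "v1.26.3" -> "26"
minor_of() {
  printf '%s\n' "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Succeed when the client ($1) and server ($2) minor versions differ by
# at most one. Assumes both versions have the same major version.
within_skew() {
  delta=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "${delta#-}" -le 1 ]
}

if within_skew v1.26.0 v1.25.3; then
  echo "kubectl v1.26.0 is within one minor version of a v1.25.3 control plane"
fi
```

The real client and server versions can be read from `kubectl version` and passed to `within_skew`.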

When the kubectl version is too old, the following error is reported:

error: rawResources failed to read Resources: Load from path ../../components/namespace/ failed: '../../components/namespace/' must be a file (got d='/home/leinao/xiaoyu/katib/katib/manifests/v1beta1/components/namespace')

Requirement: kustomize version >= v3.2.0

The kustomize version bundled with each kubectl release:

Kubectl version    Kustomize version
< v1.14            n/a
v1.14-v1.20        v2.0.3
v1.21              v4.0.5
v1.22              v4.2.0
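The table can be expressed as a small lookup, which makes it easy to see that any kubectl up to v1.20 bundles a kustomize older than the required v3.2.0. This is only a sketch: `embedded_kustomize` is a hypothetical helper that encodes just the rows listed above and answers `unknown` for anything newer.

```shell
#!/bin/sh
# Hypothetical helper: map a kubectl minor version (e.g. 19 for v1.19)
# to the kustomize version listed in the table above.
embedded_kustomize() {
  if [ "$1" -lt 14 ]; then
    echo "n/a"
  elif [ "$1" -le 20 ]; then
    echo "v2.0.3"
  elif [ "$1" -eq 21 ]; then
    echo "v4.0.5"
  elif [ "$1" -eq 22 ]; then
    echo "v4.2.0"
  else
    echo "unknown"   # not covered by the table
  fi
}

echo "kubectl v1.19 bundles kustomize $(embedded_kustomize 19)"
```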

Install kustomize:

curl -Lo ./kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x ./kustomize
sudo mv kustomize /usr/local/bin
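After installing, it is worth confirming that the binary meets the >= v3.2.0 requirement. The sketch below shows one way to compare two version strings; `version_ge` is a hypothetical helper, and it assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is greater than or equal
# to version $2. Relies on `sort -V` for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_ge v3.2.0 v3.2.0; then
  echo "kustomize v3.2.0 satisfies the requirement"
fi
if ! version_ge v2.0.3 v3.2.0; then
  echo "kustomize v2.0.3 is too old"
fi
```

The installed version is printed by `kustomize version`; its output format differs between releases, so extracting the version string is left out here.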

2. Installing Katib

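Before installing: if you need to run Training Operator jobs, deploy the corresponding operators first (see https://www.kubeflow.org/docs/components/training and https://v1-2-branch.kubeflow.org/docs/components/katib/trial-template/#custom-resource). Then update the katib-controller args and grant the katib-controller ClusterRole permissions on the corresponding resource types, as follows: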

# Find the ServiceAccounts bound to Katib roles
kubectl get rolebindings,clusterrolebindings --all-namespaces -o custom-columns='KIND:kind,NAMESPACE:metadata.namespace,NAME:metadata.name,SERVICE_ACCOUNTS:subjects[?(@.kind=="ServiceAccount")].name' | grep katib
# Allow Katib to access all Kubernetes resources created by the CRD controllers
kubectl edit clusterrole katib-controller

- apiGroups:
  - kubeflow.org
  resources:
  - experiments
  - experiments/status
  - experiments/finalizers
  - trials
  - trials/status
  - trials/finalizers
  - suggestions
  - suggestions/status
  - suggestions/finalizers
  - tfjobs
  - pytorchjobs
  - mpijobs
  - xgboostjobs
  verbs:
  - '*'
- apiGroups:
  - batch.volcano.sh
  resources:
  - jobs
  verbs:
  - '*'
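Once the ClusterRole is saved, the new permissions can be verified with `kubectl auth can-i`. The sketch below only prints the commands so that they can be reviewed first. It assumes the controller runs as the ServiceAccount `katib-controller` in the `kubeflow` namespace; adjust `SA` if the binding query above shows something different. `print_can_i_checks` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: ServiceAccount "katib-controller" in namespace "kubeflow".
SA="system:serviceaccount:kubeflow:katib-controller"

# Hypothetical helper: print one `kubectl auth can-i` command for each
# trial resource type granted in the ClusterRole rules above.
print_can_i_checks() {
  for resource in tfjobs.kubeflow.org pytorchjobs.kubeflow.org \
                  mpijobs.kubeflow.org xgboostjobs.kubeflow.org \
                  jobs.batch.volcano.sh; do
    echo "kubectl auth can-i create $resource --as=$SA"
  done
}

print_can_i_checks
```

Piping the output to `sh` runs the checks against the current cluster; each one should answer `yes` if the rules took effect.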


# katib-controller Deployment changes
# Format: --trial-resources=<object-kind>.<object-API-version>.<object-API-group>
spec:
  containers:
  - args:
    - --webhook-port=8443
    - --trial-resources=Job.v1.batch
    - --trial-resources=TFJob.v1.kubeflow.org
    - --trial-resources=PyTorchJob.v1.kubeflow.org
    - --trial-resources=MPIJob.v1.kubeflow.org
    - --trial-resources=XGBoostJob.v1.kubeflow.org
    - --trial-resources=Job.v1alpha1.batch.volcano.sh
    command:
    - ./katib-controller
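The `--trial-resources` value is simply kind, API version, and API group joined with dots. A minimal sketch of composing the flag, where `trial_resource_flag` is a hypothetical helper used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: build a --trial-resources argument from
# $1 = object kind, $2 = API version, $3 = API group.
trial_resource_flag() {
  echo "--trial-resources=$1.$2.$3"
}

trial_resource_flag Job v1alpha1 batch.volcano.sh
trial_resource_flag TFJob v1 kubeflow.org
```

Note that the group itself may contain dots (as in `batch.volcano.sh`), so only the first two dot-separated fields are kind and version.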

# Check the katib-controller logs
kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow
# Expected output
{"level":"info","ts":1676883770.6249242,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch.volcano.sh","CRD Version":"v1alpha1","CRD Kind":"Job"}
{"level":"info","ts":1676883770.8276498,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: batch.volcano.sh/v1alpha1, Kind=Job"}
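With several trial resources configured, the controller log gets noisy. The sketch below filters it down to the resources whose watch was registered, printed in the same `<kind>.<version>.<group>` form used by `--trial-resources`. `watched_kinds` is a hypothetical helper and depends on the log format shown above staying the same.

```shell
#!/bin/sh
# Hypothetical helper: read katib-controller logs on stdin and print each
# successfully watched trial resource as <kind>.<version>.<group>.
watched_kinds() {
  grep '"msg":"Job watch added successfully"' |
    sed -E 's/.*"CRD Group":"([^"]*)","CRD Version":"([^"]*)","CRD Kind":"([^"]*)".*/\3.\2.\1/'
}

# Sample line in the format shown above.
echo '{"level":"info","logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch.volcano.sh","CRD Version":"v1alpha1","CRD Kind":"Job"}' |
  watched_kinds
```

In practice the input would come from the `kubectl logs` command above, piped into `watched_kinds`.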

Deploy Katib v0.12.0

git clone git@github.com:kubeflow/katib.git
cd katib
git checkout 152ec07
make deploy

Expected output:


NAME                                READY   STATUS      RESTARTS   AGE
katib-cert-generator-jq7pd          0/1     Completed   0          27m
katib-controller-68c47fbf8b-857dk   1/1     Running     0          27m
katib-db-manager-68fdc946f8-sv5w2   1/1     Running     0          21m
katib-mysql-6dcb447c6f-mhb8p        1/1     Running     0          27m
katib-ui-64bb96d5bf-pxcvf           1/1     Running     0          27m
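The same check can be scripted. The sketch below reads a pod listing on stdin and fails if any pod is in a state other than `Running` or `Completed`; `all_pods_healthy` is a hypothetical helper, and the listing is assumed to come from `kubectl get pods -n kubeflow`.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods` output on stdin and exit
# non-zero if any pod's STATUS column is not Running or Completed.
all_pods_healthy() {
  awk 'BEGIN { bad = 0 }
       NR > 1 && $3 != "Running" && $3 != "Completed" { bad = 1 }
       END { exit bad }'
}

# Sample input taken from the expected output above.
all_pods_healthy <<'EOF' && echo "all Katib pods are healthy"
NAME                                READY   STATUS      RESTARTS   AGE
katib-cert-generator-jq7pd          0/1     Completed   0          27m
katib-controller-68c47fbf8b-857dk   1/1     Running     0          27m
EOF
```

If other workloads share the namespace, filter the listing (for example with `grep katib`, keeping the header line) before passing it to the helper.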

3. Running the examples

kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/release-0.12/examples/v1beta1/tfjob-example.yaml
experiment.kubeflow.org/tfjob-example created
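`kubectl create -f` needs a URL that serves the manifest itself; for files hosted on GitHub that is the `raw.githubusercontent.com` form rather than the repository web page. A small sketch of building such a URL, where `github_raw_url` is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: build the raw-content URL for a file on GitHub.
# $1 = owner/repo, $2 = branch or tag, $3 = path inside the repository
github_raw_url() {
  echo "https://raw.githubusercontent.com/$1/$2/$3"
}

github_raw_url kubeflow/katib release-0.12 examples/v1beta1/tfjob-example.yaml
```

Progress can then be followed with `kubectl -n kubeflow get experiment tfjob-example`, assuming the example manifest places the experiment in the `kubeflow` namespace.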

4. Viewing the Katib UI

Port-forward the `katib-ui` service, then open the URL below:

kubectl -n kubeflow port-forward svc/katib-ui 8080:80
http://localhost:8080/katib/

Alternatively, access the UI directly through a NodePort: http://127.0.0.1:39090/katib/
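Port 39090 lies outside the default Kubernetes NodePort range of 30000-32767, so the cluster used here was presumably configured with a wider range. The sketch below checks a port against the default range; `in_default_nodeport_range` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the port lies inside the default
# Kubernetes NodePort range (30000-32767).
in_default_nodeport_range() {
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

if in_default_nodeport_range 39090; then
  echo "39090 fits the default NodePort range"
else
  echo "39090 needs a custom --service-node-port-range on the API server"
fi
```

How the NodePort was set up is not described in these notes. One possible way, stated here as an assumption, is to switch the Service type with `kubectl -n kubeflow patch svc katib-ui -p '{"spec":{"type":"NodePort"}}'`.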
