1. Environment
- VirtualBox Linux VMs running CentOS 7
- Docker CE 17.03
- Kubernetes 1.13
- Kubernetes networking: Flannel
- NFS
- tensorflow:1.5.0

The Kubernetes cluster itself is assumed to be already set up, so its installation is not covered here. This environment is only for personal study and practice, so it has plenty of rough edges.
1.1 Hostname-to-IP mapping
Hostname | IP |
---|---|
k8s-master | 172.20.10.2 |
k8s-node01 | 172.20.10.3 |
k8s-node02 | 172.20.10.4 |
2. Deploying TensorFlow
2.1 NFS

Distributed TensorFlow needs a directory that all of the TensorFlow nodes can access in common, so I set up an NFS server on the master node. In the TensorFlow YAML files, each pod's volume is mounted onto this NFS share, so all TensorFlow nodes share one directory; without such a shared volume, the TensorFlow pods deployed through Kubernetes fail to start.

On k8s-master:

```shell
yum install nfs-utils rpcbind -y
mkdir -p /data/nfs
vim /etc/exports
# contents of /etc/exports:
# /data/nfs 172.20.10.0/24(rw,no_root_squash,no_all_squash,sync)
/bin/systemctl start rpcbind.service
/bin/systemctl start nfs.service
```
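As a sanity check of the shared-directory requirement, every pod should see the files written by every other pod under the NFS mount. A minimal sketch of that check (the function and file names here are illustrative stand-ins; inside the pods the directory would be the NFS-backed /notebooks mount):

```python
import os
import tempfile

def touch_marker(shared_dir, node_name):
    """Write a per-node marker file; on a working NFS share every
    node sees the markers written by all of the others."""
    with open(os.path.join(shared_dir, node_name + ".ok"), "w") as f:
        f.write(node_name)
    return sorted(os.listdir(shared_dir))

# A local temp dir stands in for the NFS mount in this sketch.
shared = tempfile.mkdtemp()
touch_marker(shared, "worker0")
print(touch_marker(shared, "worker1"))  # ['worker0.ok', 'worker1.ok']
```

If a node only ever sees its own marker, the volume is not actually shared and the distributed job will misbehave.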
2.2 TensorFlow
The distributed TensorFlow deployment in this design consists of one ps node and two worker nodes. The ps node starts session.run() and drives the training iterations; worker0 reads and preprocesses the data, and worker1 initializes the parameters that the TensorFlow cluster needs. The ps node's configuration goes into tf-ps.yaml, and the worker configuration into tf-worker.yaml.
tf-ps.yaml:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-ps
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tensorflow-ps
        role: ps
    spec:
      containers:
      - name: ps
        image: tensorflow/tensorflow:1.5.0
        ports:
        - containerPort: 2222
        - containerPort: 8888
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 500Mi
        volumeMounts:
        - mountPath: /notebooks
          readOnly: false
          name: nfs
      volumes:
      - name: nfs
        nfs:
          server: 172.20.10.2
          path: "/data/nfs"
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-ps-service
  labels:
    name: tensorflow-ps
    role: service
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8888
    nodePort: 30001
    name: tensorflow
  - port: 2222
    targetPort: 2222
    name: tf-ps
  selector:
    name: tensorflow-ps
```
tf-worker.yaml:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-worker
spec:
  replicas: 2
  template:
    metadata:
      labels:
        name: tensorflow-worker
        role: worker
    spec:
      containers:
      - name: worker
        image: tensorflow/tensorflow:1.5.0
        ports:
        - containerPort: 2222
        resources:
          limits:
            cpu: 2
            memory: 1Gi
          requests:
            cpu: 1
            memory: 500Mi
        volumeMounts:
        - mountPath: /notebooks
          readOnly: false
          name: nfs
      volumes:
      - name: nfs
        nfs:
          server: 172.20.10.2
          path: "/data/nfs"
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-wk-service
  labels:
    name: tensorflow-worker
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: tensorflow-worker
```
Apply both manifests:

```shell
kubectl apply -f tf-ps.yaml
kubectl apply -f tf-worker.yaml
```

After these commands the TensorFlow pods and services are deployed.
3. A simple usage test
Enter the master's IP plus the port in a browser, e.g. 172.20.10.2:30001, to reach the Jupyter interface.
Jupyter asks for a token at this point. View the tensorflow-ps pod's log with kubectl logs, or open that pod's log in the Dashboard, to find the token value.
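Because the ps Service is of type NodePort, port 30001 is actually published on every node in the cluster, not only on the master. A small sketch of how the browser URL is formed from any node IP in the table from section 1.1 plus the nodePort declared in tf-ps.yaml (which forwards to service port 80 and on to container port 8888, where Jupyter listens):

```python
# Node IPs from section 1.1; 30001 is the nodePort from tf-ps.yaml.
node_ips = ["172.20.10.2", "172.20.10.3", "172.20.10.4"]
node_port = 30001

# Any of these URLs reaches the same Jupyter instance.
urls = ["http://%s:%d" % (ip, node_port) for ip in node_ips]
print(urls[0])  # http://172.20.10.2:30001
```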
token.png
Enter the corresponding token to get into the Jupyter interface; from there you can create notebooks and run machine-learning experiments.
notebook.png
To test distributed TensorFlow, the distributed environment has to be brought up first. The IP addresses of the three TensorFlow nodes can be obtained by running kubectl describe svc on the corresponding services. The code that sets up the distributed TensorFlow environment is as follows:
```python
import tensorflow as tf

tf.app.flags.DEFINE_string("ps_hosts", "10.244.3.4:2222", "ps hosts")
tf.app.flags.DEFINE_string("worker_hosts", "10.244.0.8:2222,10.244.4.4:2222", "worker hosts")
tf.app.flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    # Create the cluster spec shared by every node.
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    # Create this node's server and keep it running.
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
    server.join()

if __name__ == "__main__":
    tf.app.run()
```
Then use kubectl exec to get a shell in each of the three TensorFlow pods and run one command per pod: python distributed.py --job_name=ps --task_index=0, python distributed.py --job_name=worker --task_index=0, and python distributed.py --job_name=worker --task_index=1. Once these succeed, the distributed TensorFlow environment is up, as shown in Figure 5-22.
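The task_index must be unique within each job, which is easy to get wrong when typing the three commands by hand. A small sketch that derives one command per task from the cluster layout used above (pure Python, no TensorFlow needed):

```python
# Cluster layout from the text: one ps task, two worker tasks.
cluster = {
    "ps": ["10.244.3.4:2222"],
    "worker": ["10.244.0.8:2222", "10.244.4.4:2222"],
}

# Each task is numbered within its own job, so the two workers
# run with task_index 0 and 1 respectively.
commands = [
    "python distributed.py --job_name=%s --task_index=%d" % (job, idx)
    for job, hosts in cluster.items()
    for idx, _ in enumerate(hosts)
]
for cmd in commands:
    print(cmd)
```

This prints exactly the three commands to run, one per pod.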
In Jupyter, create a new notebook and test with the same kind of code, except that now the ps node starts session.run() and drives the training iterations, worker0 reads and preprocesses the data, and worker1 initializes the parameters the TensorFlow cluster needs.
test.png