Introduction
Swarm is Docker's official cluster management tool. It abstracts a group of Docker hosts into a single virtual host and provides one entry point for managing the Docker resources on all of them. Swarm is similar to Kubernetes, but it is lighter-weight and offers fewer features.
node
Every Docker Engine in a swarm is a node, and there are two types of nodes: managers and workers.
Deployment commands are issued on a manager node; the manager breaks the deployment into tasks and assigns them to one or more worker nodes.
Manager nodes perform orchestration and cluster management, keeping the swarm in its desired state. If a swarm has multiple manager nodes, they automatically negotiate and elect a leader to carry out orchestration.
Worker nodes accept and execute tasks dispatched by manager nodes. By default a manager node is also a worker node, but it can be configured as a manager-only node dedicated to orchestration and cluster management.
Worker nodes periodically report their own state and the state of the tasks they are running to the managers, which lets the managers maintain an up-to-date view of the whole cluster.
service
A service defines the tasks to execute on worker nodes. Swarm's main orchestration job is to keep every service in its desired state.
An example of a service: start an HTTP service in the swarm using the httpd:latest image with a replica count of 3.
The manager node creates the service, works out that 3 httpd containers need to be started, and assigns the container tasks according to the current state of each worker node, e.g. two containers on worker1 and one on worker2.
If worker2 later crashes, the manager detects the failure and immediately starts a new httpd container on worker3.
This keeps the service at its desired state of three replicas.
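The scenario above can be sketched with the service commands covered later in this document (the service name is illustrative):

```shell
# Create a replicated service with 3 httpd containers;
# the swarm scheduler spreads them across available nodes.
docker service create --name web --replicas 3 httpd:latest

# Check which node each replica landed on (NODE column).
docker service ps web
```

If a node running replicas goes down, the scheduler recreates the missing containers on the remaining nodes to restore the desired count of 3.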
Initializing a Swarm
Command reference
[root@node191 docker]# docker swarm --help
Usage: docker swarm COMMAND
Manage Swarm
Options:
Commands:
ca Display and rotate the root CA
init Initialize a swarm
join Join a swarm as a node and/or manager
join-token Manage join tokens
leave Leave the swarm
unlock Unlock swarm
unlock-key Manage the unlock key
update Update the swarm
Run 'docker swarm COMMAND --help' for more information on a command.
[root@node191 docker]# docker node --help
Usage: docker node COMMAND
Manage Swarm nodes
Options:
Commands:
demote Demote one or more nodes from manager in the swarm
inspect Display detailed information on one or more nodes
ls List nodes in the swarm
promote Promote one or more nodes to manager in the swarm
ps List tasks running on one or more nodes, defaults to current node
rm Remove one or more nodes from the swarm
update Update a node
Run 'docker node COMMAND --help' for more information on a command.
Initializing and joining nodes (manager|worker)
- [x] --advertise-addr specifies the address used to communicate with other nodes. Open the corresponding ports in the firewall.
[root@localhost ~]# docker swarm init --advertise-addr 172.16.1.146
Swarm initialized: current node (v2tjxinr9jxfg52evpswn4yb6) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5tvspbnrp9g6oxu6qixwhx98wtzx0t7efwfrh6wbpfbk4id1f7-f01zejqjqfnry2tubl3cractn \
172.16.1.146:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
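If a firewall is running on the hosts, the standard Swarm ports must be open before nodes can communicate. A firewalld sketch (assuming the default zone; adjust to your environment):

```shell
# Cluster management traffic (docker swarm init/join)
firewall-cmd --permanent --add-port=2377/tcp
# Node-to-node gossip communication
firewall-cmd --permanent --add-port=7946/tcp
firewall-cmd --permanent --add-port=7946/udp
# Overlay network (VXLAN) traffic
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload
```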
- [x] If the full worker-join command printed by docker swarm init was not recorded, it can be retrieved with docker swarm join-token worker.
- [x] Similarly, the manager-join command is shown by docker swarm join-token manager.
[root@localhost ~]# docker swarm join-token worker
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5tvspbnrp9g6oxu6qixwhx98wtzx0t7efwfrh6wbpfbk4id1f7-f01zejqjqfnry2tubl3cractn \
172.16.1.146:2377
[root@localhost ~]# docker swarm join-token manager
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5tvspbnrp9g6oxu6qixwhx98wtzx0t7efwfrh6wbpfbk4id1f7-8pm1wzhfqx5e7jvl8fg61an3w \
172.16.1.146:2377
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active Reachable
n494afsdjzs74q5y5vb4xlgd4 node136 Ready Active
v2tjxinr9jxfg52evpswn4yb6 * node146 Ready Active Leader
Removing swarm nodes
- [x] To remove a node from the swarm, first drain the containers off it, then remove it from the cluster.
- [x] Drain a node
- The node's containers are first started on other nodes, and only then are the containers on the drained node stopped, so the service is unaffected.
## Drain node136
[root@node146 ~]# docker node update --availability drain n494afsdjzs74q5y5vb4xlgd4
n494afsdjzs74q5y5vb4xlgd4
- [x] Remove the specified node
docker node rm node136
docker node rm --force node136
- [x] Restore a node
## Bring a drained node back into normal use
docker node update --availability active n494afsdjzs74q5y5vb4xlgd4
- [x] Leave the swarm (run on the node host)
## To force-leave the swarm: docker swarm leave --force
[root@node136 ~]# docker swarm leave
Node left the swarm.
## At this point node136 is Down.
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active Reachable
n494afsdjzs74q5y5vb4xlgd4 node136 Down Active
v2tjxinr9jxfg52evpswn4yb6 * node146 Ready Active Leader
## On a manager node, remove the stale node entry
[root@node146 ~]# docker node rm n494afsdjzs74q5y5vb4xlgd4
n494afsdjzs74q5y5vb4xlgd4
## Rejoin as a manager
[root@node136 ~]# docker swarm join \
> --token SWMTKN-1-5tvspbnrp9g6oxu6qixwhx98wtzx0t7efwfrh6wbpfbk4id1f7-8pm1wzhfqx5e7jvl8fg61an3w \
> 172.16.1.146:2377
This node joined a swarm as a manager.
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active Reachable
n8dgcax0vcqmsjtc0aosx9k2q node136 Ready Active Reachable
v2tjxinr9jxfg52evpswn4yb6 * node146 Ready Active Leader
Demoting a node
Demote a node from manager to worker
docker node demote v2tjxinr9jxfg52evpswn4yb6
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active Reachable
n8dgcax0vcqmsjtc0aosx9k2q node136 Ready Active Leader
v2tjxinr9jxfg52evpswn4yb6 node146 Down Active Unreachable
yvjirlxwpgvjohi3iagtzzkh2 * node146 Ready Active Reachable
[root@node146 ~]# docker node demote v2tjxinr9jxfg52evpswn4yb6
Manager v2tjxinr9jxfg52evpswn4yb6 demoted in the swarm.
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active Reachable
n8dgcax0vcqmsjtc0aosx9k2q node136 Ready Active Leader
v2tjxinr9jxfg52evpswn4yb6 node146 Down Active
yvjirlxwpgvjohi3iagtzzkh2 * node146 Ready Active Reachable
Promoting a node
- [x] Promote a node from worker to manager
- [x] docker node promote c9kynm13tvcf1vfrt0m6y7pbi
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active
n8dgcax0vcqmsjtc0aosx9k2q node136 Ready Active
yvjirlxwpgvjohi3iagtzzkh2 * node146 Ready Active Leader
[root@node146 ~]# docker node promote c9kynm13tvcf1vfrt0m6y7pbi
Node c9kynm13tvcf1vfrt0m6y7pbi promoted to a manager in the swarm.
[root@node146 ~]# docker node promote n8dgcax0vcqmsjtc0aosx9k2q
Node n8dgcax0vcqmsjtc0aosx9k2q promoted to a manager in the swarm.
[root@node146 ~]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
c9kynm13tvcf1vfrt0m6y7pbi node135 Ready Active Reachable
n8dgcax0vcqmsjtc0aosx9k2q node136 Ready Active Reachable
yvjirlxwpgvjohi3iagtzzkh2 * node146 Ready Active Leader
Day-to-day swarm operations
Command reference
[root@node191 docker]# docker service --help
Usage: docker service COMMAND
Manage services
Options:
Commands:
create Create a new service
inspect Display detailed information on one or more services
logs Fetch the logs of a service or task
ls List services
ps List the tasks of one or more services
rm Remove one or more services
rollback Revert changes to a service's configuration
scale Scale one or multiple replicated services
update Update a service
Run 'docker service COMMAND --help' for more information on a command.
Official documentation:
https://docs.docker.com/engine/reference/commandline/service/
Creating a service
- [x] Reference: https://docs.docker.com/engine/reference/commandline/service_create/#options
- [x] A published port is reachable on every swarm node, even on nodes not running a container for the service (the routing mesh).
docker service create --name nginx-service --replicas=3 --publish 8080:80 nginx:latest
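To verify the routing mesh, query the published port on any node, including one that runs no replica (the address below reuses the manager IP from the earlier examples):

```shell
# Any swarm node answers on a published port, even without a local replica.
curl -s -o /dev/null -w "%{http_code}\n" http://172.16.1.146:8080
```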
If the image comes from a private registry, remember to add the --with-registry-auth flag; otherwise other nodes cannot pull the image. For example:
docker login 172.16.1.146 -p ***** -u admin; docker service create --with-registry-auth --name tomcat-logs-test --replicas=2 --publish 10080:8080 172.16.1.146/wondertek/docker-test:1.0.0-2018091910
Viewing service information
docker service ps docker-test
Scaling a service
docker service scale docker-test=3
Defining labels
- [x] Constraints can match node or Docker Engine labels, as follows:
Node attribute | Matches | Example |
---|---|---|
node.id | Node ID | node.id == 2ivku8v2gvtg4 |
node.hostname | Node hostname | node.hostname != node-2 |
node.role | Node role (manager or worker) | node.role == manager |
node.labels | User-defined node labels | node.labels.security == high |
engine.labels | Docker Engine labels | engine.labels.operatingsystem == ubuntu 14.04 |
- [x] engine.labels match Docker Engine labels such as the operating system or drivers. Cluster operators add node.labels with the docker node update command to make better use of nodes.
- [x] Add a label
docker node update --label-add type=manager node146
[root@node146 ~]# docker node inspect node146 --pretty
ID: v2tjxinr9jxfg52evpswn4yb6
Labels:
- type = manager
Hostname: node146
Joined at: 2018-07-16 06:26:49.516457267 +0000 utc
Status:
State: Ready
Availability: Active
Address: 127.0.0.1
Manager Status:
Address: 172.16.1.146:2377
Raft Status: Reachable
Leader: Yes
Platform:
Operating System: linux
Architecture: x86_64
Resources:
CPUs: 8
Memory: 9.765 GiB
Plugins:
Network: bridge, host, macvlan, null, overlay
Volume: local
Engine Version: 1.13.1
- [x] Remove a label
docker node update --label-rm type node146
- [x] Run only on nodes with a given label
docker service rm my_web
docker node update --label-add env=test node135
docker node update --label-add env=prod node136
docker service create \
--constraint node.labels.env==test \
--replicas 3 \
--name my_web2 \
--publish 8080:80 \
httpd
[root@node146 ~]# docker service ps my_web2
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
lzle9hto7mk0 my_web2.1 httpd:latest node135 Running Running 4 seconds ago
j9ujd6mcs2ex my_web2.2 httpd:latest node135 Running Running 5 seconds ago
lqc4apjhonen my_web2.3 httpd:latest node135 Running Running 3 seconds ago
[root@node146 ~]# docker service inspect my_web2 --pretty
ID: m7s5ura6bmjg1nd60lfwn8voa
Name: my_web2
Service Mode: Replicated
Replicas: 3
Placement:
 Contraints: [node.labels.env==test]
UpdateConfig:
Parallelism: 1
On failure: pause
Max failure ratio: 0
ContainerSpec:
Image: httpd:latest@sha256:2edbf09d0dbdf2a3e21e4cb52f3385ad916c01dc2528868bc3499111cc54e937
Resources:
Endpoint Mode: vip
Ports:
PublishedPort 8080
Protocol = tcp
TargetPort = 80
Removing a service
docker service rm docker-test
Custom overlay network
[root@node135 ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
4888eb34115b bridge bridge local
5dda44146214 docker_gwbridge bridge local
4dda8692018b host host local
mumblsrh5oe4 ingress overlay swarm
1fcd0ef0748f none null local
docker network create --driver overlay --subnet 10.22.1.0/24 swarm_net
- [x] Create services on the custom network
docker service create --name my_web --replicas=3 --network swarm_net httpd
docker service create --name util --network swarm_net busybox sleep 10000000
- [x] Connectivity test within the same overlay network
docker exec util.1.muu3o4906mihbp1v8r3ejh80p nslookup tasks.my_web
docker exec util.1.muu3o4906mihbp1v8r3ejh80p ping -c 3 my_web
Upgrading a service
docker service update --image httpd:2.2.32 my_web
- [x] Swarm can flexibly adjust the number of container replicas via --replicas, both at service creation time and while the service is running; the internal scheduler starts and stops containers on different nodes according to the cluster's current resource usage. This is the service's default replicated mode.
- [x] In this mode the replica count per node varies; in general, nodes with more resources run more replicas, and vice versa.
- [x] Besides replicated mode, a service also supports global mode, which forces exactly one replica (and at most one) on every node.
## global mode
docker service create \
--mode global \
--name logspout \
--mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
gliderlabs/logspout
- [x] Scale the service up to six replicas, updating two replicas at a time with a 1.5-minute delay between batches.
docker service update --replicas 6 --update-parallelism 2 --update-delay 1m30s my_web
## Specify a new image
docker service update --image httpd:2.2.32 --replicas 6 --update-parallelism 2 --update-delay 1m30s my_web
- [x] Watch the rolling update in progress
[root@node146 ~]# docker service ps my_web
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
ku14zmzkpo9a my_web.1 httpd:2.2.32 node135 Running Running about a minute ago
qh6pzjb6syt0 \_ my_web.1 httpd:latest node135 Shutdown Shutdown about a minute ago
0muer26mxx1d my_web.2 httpd:latest node136 Running Running 22 hours ago
k8ybfbc6j20y my_web.3 httpd:2.2.32 node146 Running Running about a minute ago
xr0adp42t7tm \_ my_web.3 httpd:latest node146 Shutdown Shutdown about a minute ago
acd06qrmmnrr my_web.4 httpd:2.2.32 node135 Running Running about a minute ago
jae5i5lhlnb2 my_web.5 httpd:2.2.32 node146 Running Running about a minute ago
3zk4i1drb1nk my_web.6 httpd:2.2.32 node136 Running Running about a minute ago
- [x] Remove the old constraint and add a new one, setting node.labels.env==prod
docker service update --constraint-rm node.labels.env==test my_web2
docker service update --constraint-add node.labels.env==prod my_web2
Rollback
- [x] Roll back to the previous configuration; only one step of history is kept.
docker service update --rollback my_web
- [x] Rolling back a second time simply repeats the upgrade that was just rolled back.
Health checks
- [x] For applications that expose an HTTP interface, a common Health Check is to verify the HTTP status code with curl, for example:
curl --fail http://localhost:8080/ || exit 1
- [x] If curl receives any failing HTTP status code, the command returns 1 and the Health Check fails.
docker service create --name my_web3 \
--health-cmd "curl --fail http://localhost:8091 || exit 1" \
httpd
- [x] --health-cmd sets the Health Check command. A few related parameters (these are the docker service create flag names):
- [x] 1. --health-timeout: maximum time the command may run, default 30s.
- [x] 2. --health-interval: interval between check runs, default 30s.
- [x] 3. --health-retries: number of consecutive failures allowed, default 3; after 3 failures the container is marked unhealthy, and swarm destroys and recreates unhealthy replicas.
- [x] Inspect health check information
docker inspect b671e3100133
"Health": {
"Status": "unhealthy",
"FailingStreak": 3,
"Log": [
{
"Start": "2018-07-18T14:40:18.941056152+08:00",
"End": "2018-07-18T14:40:19.027466281+08:00",
"ExitCode": 1,
"Output": "/bin/sh: 1: curl: not found\n"
},
{
"Start": "2018-07-18T14:40:49.027620925+08:00",
"End": "2018-07-18T14:40:49.076160261+08:00",
"ExitCode": 1,
"Output": "/bin/sh: 1: curl: not found\n"
},
{
"Start": "2018-07-18T14:41:19.076291897+08:00",
"End": "2018-07-18T14:41:19.124894642+08:00",
"ExitCode": 1,
"Output": "/bin/sh: 1: curl: not found\n"
}
]
}
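Putting the health-check flags together, a hedged sketch (service name and thresholds are illustrative; note the check only works if curl exists inside the image, which the "curl: not found" log above shows is not the case for the stock httpd image):

```shell
# Health-checked service: curl must be present inside the image,
# otherwise every check fails with "curl: not found".
docker service create --name my_web4 \
  --health-cmd "curl --fail http://localhost:80/ || exit 1" \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  httpd
```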