corosync+pacemaker Deployment and Operations Notes

Author: SkTj | Published 2019-02-14 12:09

    Reposted from: https://www.cnblogs.com/yue-hong/p/7988821.html
    corosync is the cluster messaging/membership engine, pacemaker is the high-availability cluster resource manager, and crmsh is a command-line management tool for pacemaker.

    1. NTP time synchronization and passwordless SSH login
    [root@node-1 ~]# vim /etc/hosts
    192.168.43.128 node-2
    192.168.43.129 node-1
    [root@node-1 ~]# ssh-keygen
    [root@node-1 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@node-2
    [root@node-1 corosync]# scp /etc/hosts node-2:/etc/hosts
    [root@node-1 ~]# ssh node-2
    [root@node-1 ~]# yum install ntp -y
    [root@node-2 ~]# hwclock -s //set the system clock from the hardware (BIOS) clock; more reliable than ntpdate or date -s

    [root@node-2 ~]# ssh-keygen
    [root@node-2 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@node-1
    [root@node-2 ~]# ssh node-1
    [root@node-2 ~]# yum install ntp -y
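
    The ntp packages are installed above but not yet configured; a minimal sketch of getting both clocks close before building the cluster (assumes both nodes can reach the public NTP pool and that the ntpdate package is also installed):
    [root@node-1 ~]# ntpdate pool.ntp.org ##one-off sync; run on both nodes
    [root@node-1 ~]# systemctl enable ntpd
    [root@node-1 ~]# systemctl start ntpd ##keep the clock in sync from now on
    [root@node-1 ~]# hwclock -w ##write the corrected time back to the hardware clock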

    2. Install corosync and pacemaker

    [root@node-1 corosync]# yum install corosync pacemaker -y //available from the stock CentOS repos; alternatively, installing just pcs also works
    [root@node-2 ~]# yum install corosync pacemaker -y
    [root@node-1 ~]# vim /etc/yum.repos.d/crm.repo


    [network_ha-clustering_Stable]
    name=Stable High Availability/Clustering packages (CentOS_CentOS-7)
    type=rpm-md
    baseurl=http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/
    gpgcheck=1
    gpgkey=http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/repodata/repomd.xml.key
    enabled=1


    [root@node-1 ~]# yum install crmsh -y

    [root@node-1 corosync]# cd /etc/corosync
    [root@node-1 corosync]# cp corosync.conf.example corosync.conf
    [root@node-1 corosync]# vim corosync.conf
    bindnetaddr: 192.168.43.0
    service {
    ver: 0
    name: pacemaker #have corosync start pacemaker
    }
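
    For context, a minimal sketch of how the edited totem section might look as a whole (only bindnetaddr comes from these notes; the other values follow the example file's defaults and are assumptions here). Note that on the corosync 2.x shipped with CentOS 7, the plugin-style service block above is generally ignored and pacemaker runs as its own systemd unit:
    totem {
    version: 2
    secauth: on #assumption: use the authkey generated below
    interface {
    ringnumber: 0
    bindnetaddr: 192.168.43.0
    mcastaddr: 239.255.1.1
    mcastport: 5405
    }
    }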


    The corosync nodes need a shared authentication key:
    [root@node-1 corosync]# mv /dev/{random,random.bak}
    [root@node-1 corosync]# ln -s /dev/urandom /dev/random
    [root@node-1 corosync]# corosync-keygen
    Corosync Cluster Engine Authentication key generator.
    Gathering 1024 bits for key from /dev/random.
    Press keys on your keyboard to generate entropy.
    Writing corosync key to /etc/corosync/authkey.
    [root@node-1 corosync]# scp corosync.conf authkey root@node-2:/etc/corosync/
    [root@node-1 corosync]# systemctl start corosync;ssh node-2 systemctl start corosync //start the corosync service on both nodes

    =====================
    MaGe (马哥) HA theory notes:
    The resource management layer (pacemaker) arbitrates which node is active, moves IP addresses, and drives the local resource manager; the messaging layer (heartbeat, corosync) carries heartbeat and membership information; Resource Agents (think of them as service scripts) start, stop and monitor services. Several different services may run across the nodes; the remaining standby nodes form the failover domain, and "primary node" is only a relative notion, as is third-party arbitration. Vote system: the majority wins. When a node fails and its resources move to a standby node, that is failover; when the failed node is repaired and the resources move back to it, that is failback.
    CRM: cluster resource manager ===> pacemaker. Every node runs a crmd daemon (5560/tcp); the command-line front ends crmsh and pcs (the latter introduced by Red Hat around the heartbeat v3 era) edit the XML configuration that crmd reads and acts on when managing resources. In that sense crmsh and pcs are equivalent.
    Resource Agent classes include OCF (Open Cluster Framework).
    primitive: a primary resource, of which only one instance runs in the cluster. clone: a cloned resource, which may run multiple instances across the cluster. Every resource has a priority.
    In score arithmetic, infinity + negative infinity = negative infinity. Hostnames must match the names resolved by DNS (or /etc/hosts).
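
    To make the score arithmetic concrete, a hypothetical pair of crmsh location constraints (WebIP is the resource created later in these notes; the constraint names are made up for illustration):
    crm(live)configure# location webip_prefers_node1 WebIP inf: node-1 ##score +INFINITY: run on node-1 whenever possible
    crm(live)configure# location webip_avoids_node2 WebIP -inf: node-2 ##score -INFINITY: never run on node-2; inf + (-inf) = -inf, so the ban always wins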

    1. Install the pcs management tool
    [root@node-1 ~]# ansible corosync -m service -a "name=pcsd state=started enabled=yes" //requires ansible, with a host group named corosync defined
    [root@node-1 ~]# systemctl status pcsd ;ssh node-2 "systemctl status pcsd"
    [root@node-1 ~]# ansible corosync -m shell -a 'echo "passw0rd" | passwd --stdin hacluster' ##set a password for the dedicated hacluster user, which pcs uses for authentication
    [root@node-1 ~]# pcs cluster auth node-2 node-1 ##the local pcs client sends the request to the pcsd daemons; if authentication against the remote node fails, check firewalld
    Username: hacluster
    Password:
    node-1: Authorized
    node-2: Authorized
    [root@node-2 yum.repos.d]# pcs cluster auth node-1 node-2 //best to authenticate in both directions
    Username: hacluster
    Password:
    node-1: Authorized
    node-2: Authorized

    2. Create the cluster
    [root@node-1 corosync]# pcs cluster setup --name mycluster node-1 node-2 --force
    [root@node-2 corosync]# cat corosync.conf //after the cluster setup command runs, a new corosync.conf is generated on each node
    totem {
    version: 2
    secauth: off
    cluster_name: mycluster
    transport: udpu
    }

    nodelist {
    node {
    ring0_addr: node-1
    nodeid: 1
    }

    node {
    ring0_addr: node-2
    nodeid: 2
    }
    }

    quorum {
    provider: corosync_votequorum
    two_node: 1
    }

    logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    }

    Explanation: totem is the protocol corosync uses for heartbeat and membership messaging between the two nodes; ring0_addr is each node's address on ring 0, the primary (and here only) communication ring.
    [root@node-1 ~]# pcs cluster start
    [root@node-1 ~]# pcs cluster status
    Cluster Status:
    Stack: unknown
    Current DC: NONE
    Last updated: Sat Oct 28 20:17:56 2017
    Last change: Sat Oct 28 20:17:52 2017 by hacluster via crmd on node-1
    2 nodes configured
    0 resources configured
    PCSD Status:
    node-2: Online
    node-1: Online
    [root@node-2 ~]# pcs cluster start ##without --all, the cluster services must be started on each node separately
    Starting Cluster...
    [root@node-2 ~]# corosync-cfgtool -s
    Printing ring status.
    Local node ID 2
    RING ID 0
    id = 192.168.43.128
    status = ring 0 active with no faults
    [root@node-2 ~]# corosync-cmapctl |grep members ##check the current cluster membership
    runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
    runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.43.129)
    runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
    runtime.totem.pg.mrp.srp.members.1.status (str) = joined
    runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
    runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.43.128)
    runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
    runtime.totem.pg.mrp.srp.members.2.status (str) = joined
    [root@node-1 ~]# pcs status ##DC (Designated Coordinator) is the elected coordinator node
    Every node runs a CRM, and one of them is elected DC, the brain of the cluster; the CIB (cluster information base) it holds is the master CIB, and the other nodes' CIBs are replicas (see the CIB inspection sketch after this output).
    Cluster name: mycluster
    WARNING: no stonith devices and stonith-enabled is not false ##no STONITH (fencing) devices are defined; STONITH is what "shoots the other node in the head" so a failed node cannot hold on to resources
    Stack: corosync
    Current DC: node-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
    Last updated: Sat Oct 28 20:28:01 2017
    Last change: Sat Oct 28 20:18:13 2017 by hacluster via crmd on node-1
    2 nodes configured
    0 resources configured
    Online: [ node-1 node-2 ]
    No resources
    Daemon Status:
    corosync: active/disabled
    pacemaker: active/disabled
    pcsd: active/enabled
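
    To inspect the CIB that the DC maintains, it can be dumped from any node (a sketch):
    [root@node-1 ~]# pcs cluster cib | head ##print the raw CIB XML (here just the first lines)
    [root@node-1 ~]# cibadmin -Q -o nodes ##query only the nodes section of the CIB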
    [root@node-2 ~]# pcs status corosync
    Membership information


    Nodeid Votes Name
    2 1 node-2 (local)
    1 1 node-1
    [root@node-1 ~]# crm_verify -L -V ##crm_verify checks the live cluster configuration for errors
    error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
    error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
    error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
    Errors found during check: config not valid
    [root@node-1 ~]# pcs property set stonith-enabled=false
    [root@node-1 ~]# pcs property list ##show cluster properties that have been changed from their defaults; use pcs property --all to list every property
    Cluster Properties:
    cluster-infrastructure: corosync
    cluster-name: mycluster
    dc-version: 1.1.16-12.el7_4.4-94ff4df
    have-watchdog: false
    stonith-enabled: false

    3. Install the crmsh command-line cluster management tool
    [root@node-1 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
    crm(live)# configure
    crm(live)configure# edit ##edit the cluster configuration in a vim-like editor; save and quit when done

    Deploying a web service with crm:
    VIP:
    httpd:
    Install httpd on both nodes. Note: after testing, only stop the httpd service (do not restart it yourself), and do not enable it at boot, because the resource manager will start and stop the service itself.
    Do the following on both node-1 and node-2:
    [root@node-2 ~]# systemctl start httpd
    [root@node-2 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
    [root@node-1 ~]# systemctl start httpd ##httpd must not be enabled at boot; crm manages it itself
    [root@node-1 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
    At this point the web page on both nodes can be reached from a browser.
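
    Before handing httpd over to the cluster, a short sketch of putting it back into a stopped, non-enabled state on both nodes, as explained above:
    [root@node-1 ~]# systemctl stop httpd;ssh node-2 systemctl stop httpd
    [root@node-1 ~]# systemctl disable httpd;ssh node-2 systemctl disable httpd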
    [root@node-2 ~]# crm
    crm(live)# status ##make sure all nodes are online before running the commands below
    crm(live)# ra
    crm(live)ra# list systemd
    httpd
    crm(live)ra# help info
    crm(live)ra# classes
    crm(live)ra# cd
    crm(live)# configure
    crm(live)configure# help primitive

    1. Add the WebIP resource
    crm(live)ra# classes
    crm(live)ra# list ocf ##ocf is one of the resource classes
    crm(live)ra# info ocf:IPaddr ##IPaddr is a resource agent under the heartbeat provider
    crm(live)configure# primitive WebIP ocf:IPaddr params ip=192.168.43.120
    crm(live)configure# show
    node 1: node-1
    node 2: node-2
    primitive WebIP IPaddr
    params ip=192.168.43.120
    property cib-bootstrap-options:
    have-watchdog=false
    dc-version=1.1.13-10.el7-44eb2dd
    cluster-infrastructure=corosync
    cluster-name=mycluster
    stonith-enabled=false
    crm(live)configure# verify
    crm(live)configure# commit
    crm(live)# status
    WebIP (ocf::heartbeat:IPaddr): Stopped
    2. Add the WebServer resource
    crm(live)configure# primitive WebServer systemd:httpd ##systemd is one of the classes listed by the classes command
    crm(live)configure# verify
    WARNING: WebServer: default timeout 20s for start is smaller than the advised 100
    WARNING: WebServer: default timeout 20s for stop is smaller than the advised 100
    crm(live)configure# commit
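
    The timeout warnings are only advisory; a hypothetical alternative definition of the primitive with explicit operation timeouts matching the advised values (not what was committed above) would look like this:
    crm(live)configure# primitive WebServer systemd:httpd \
    op start timeout=100s interval=0 \
    op stop timeout=100s interval=0 \
    op monitor interval=30s timeout=100s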

    3. Bind WebIP and WebServer into a group resource
    crm(live)configure# help group
    crm(live)configure# group WebService WebIP WebServer ##the order matters: the web server runs wherever the IP is
    crm(live)configure# verify
    WARNING: WebServer: default timeout 20s for start is smaller than the advised 100
    WARNING: WebServer: default timeout 20s for stop is smaller than the advised 100
    crm(live)configure# commit
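
    The same placement and startup order could also be expressed without a group, using explicit constraints (a sketch of the equivalent crmsh commands):
    crm(live)configure# colocation web_with_ip inf: WebServer WebIP ##keep WebServer on the node holding WebIP
    crm(live)configure# order ip_before_web Mandatory: WebIP WebServer ##start WebIP before WebServer
    crm(live)configure# verify
    crm(live)configure# commit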

    crm(live)configure# node standby ##put the current node into standby; the resources should fail over to the other node
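
    After testing the failover, a sketch of bringing the standby node back (assuming node-1 was the node put into standby):
    crm(live)# node online node-1
    crm(live)# status ##check which node the WebService group is running on now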

    4. How do we ensure that, after a failed node comes back online, resources do not migrate back from the other node?
    Reference: http://blog.51cto.com/nmshuishui/1399811
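
    A common answer is resource stickiness, which makes resources prefer to stay where they are currently running; a sketch using crmsh (100 is an arbitrary example value):
    crm(live)configure# rsc_defaults resource-stickiness=100
    crm(live)configure# verify
    crm(live)configure# commit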

    +++++++++++++++++++++++++++++++ Troubleshooting notes ++++++++++++++++++++++++++
    1. crm status on node-1 shows OFFLINE: [ node-1 node-2 ], while crm status on node-2 shows Online: [ node-2 ], OFFLINE: [ node-1 ]?
    Fix: the nodes' clocks were not synchronized (NTP).
    (1)[root@node-2 ~]# systemctl status pcsd;ssh node-1 "systemctl status pcsd" ##both normal
    [root@node-2 ~]# systemctl status corosync;ssh node-1 "systemctl status corosync" ##both active
    Both nodes can ping and SSH to each other, and the corosync and pcsd logs show no obvious errors.
    (2) Suspected the pcs authentication was broken, but it was not:
    [root@node-1 ~]# pcs cluster auth node-1 node-2
    node-1: Already authorized
    node-2: Already authorized
    [root@node-2 ~]# pcs cluster auth node-1 node-2
    node-1: Already authorized
    node-2: Already authorized
    (3)[root@node-1 ~]# crm status ##the real cause was that pacemaker had died; [root@node-1 ~]# systemctl status crm_mon
    ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
    (4)[root@node-1 ~]# systemctl status pacemaker ##only after rereading a blog post did I realize NTP had failed to sync again
    Active: failed (Result: exit-code)
    [root@node-1 ~]# vim /etc/ntp.conf
    server 192.168.43.128 burst iburst prefer
    [root@node-2 ~]# vim /etc/ntp.conf
    server 127.127.1.0
    fudge 127.127.1.0 stratum 10
    Restarting NTP still did not help, so the time had to be set manually with date -s "23:52:10".
    [root@node-1 ~]# date ; ssh node-2 "date"
    2017年 12月 01日 星期五 23:57:55 CST
    2017年 12月 01日 星期五 23:57:56 CST
    (5) Finally, after systemctl restart pacemaker on both nodes, crm status at last shows Online: [ node-1 node-2 ].
    Reference: http://blog.51cto.com/nmshuishui/1399811

    2. corosync fails to start, which in turn prevents pacemaker from starting?
    Error: [root@node-2 ~]# crm status
    ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
    [root@node-2 ~]# systemctl status pacemaker
    ● pacemaker.service - Pacemaker High Availability Cluster Manager
    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
    Active: inactive (dead)
    Dec 04 19:57:28 node-2 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
    Dec 04 19:57:28 node-2 systemd[1]: Job pacemaker.service/start failed with result 'dependency'.

    Fix: a node's IP address had been changed and /etc/hosts was not updated. Note: /etc/hosts must be updated on all nodes.
    [root@node-2 ~]# tail /var/log/cluster/corosync.log
    [4577] node-2 corosyncerror [MAIN ] parse error in config: No interfaces defined
    [4577] node-2 corosyncerror [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1414.
    [root@node-2 ~]# vim /etc/hosts

    Adding the new IP address and hostname there fixes it.

    [root@node-2 ~]# systemctl restart corosync
    [root@node-2 ~]# systemctl restart pacemaker

    3. Pacemaker fails to start?
    Error: [root@node-2 ~]# systemctl status pacemaker
    Active: deactivating (stop-sigterm) since Mon 2017-12-04 21:04:44 CST; 54s ago
    Dec 04 21:04:44 node-2 pengine[4880]: warning: Processing failed op stop for WebIP on node-2: not configured (6)
    Dec 04 21:04:44 node-2 pengine[4880]: error: Preventing WebIP from re-starting anywhere: operation stop faile...d' (6
    Fix: the WebIP resource is in a failed state; clean up its failed operation record with cleanup.
    [root@node-2 ~]# crm resource cleanup WebIP
    crm(live)configure# delete WebIP ##delete works on either a group or an individual resource
    crm(live)configure# commit
    [root@node-2 ~]# systemctl status pacemaker

    4. After delete WebIP in crm configure, the status still reports WebIP (ocf::heartbeat:IPaddr): ORPHANED FAILED node-2 (unmanaged)?
    Fix: [root@node-2 ~]# crm resource cleanup WebIP

    5. node-1 thinks node-2 is offline, and node-2 thinks node-1 is offline?
    Symptom: [root@node-2 ~]# crm status
    Online: [ node-2 ]
    OFFLINE: [ node-1 ]
    [root@node-1 ~]# crm status
    Online: [ node-1 ]
    OFFLINE: [ node-2 ]

    Unresolved: in a two-node cluster there is no third party to arbitrate, so each node believes it is the DC.
    [root@node-1 ~]# time=$(date |awk '{print $5}');ssh node-2 date -s "$time" ##make the remote host's clock match the local one (field 5 of date is the time in this locale)
    [root@node-1 ~]# date ;ssh node-2 "date"
    2017年 12月 04日 星期一 21:37:33 CST
    2017年 12月 04日 星期一 21:37:33 CST

    [root@node-2 ~]# systemctl list-unit-files|grep ntp ##make sure the NTP service is enabled at boot
    ntpd.service enabled
    [root@node-2 ~]# hwclock -w ##write the current system time to the hardware (BIOS) clock
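
    When each node claims the other is offline, it also helps to check from both nodes whether corosync itself still sees both members (a sketch):
    [root@node-1 ~]# corosync-quorumtool -s ##should report 2 total votes and "Quorate: Yes" when membership is healthy
    [root@node-1 ~]# corosync-cmapctl | grep members ##as used earlier, lists the members corosync currently knows about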

    6. Pacemaker misbehaves and reports a configuration file format problem?
    [root@node-2 ~]# systemctl status pacemaker -l
    Dec 04 21:52:35 node-2 cib[6776]: error: Completed cib_replace operation for section 'all': Update does not conform to the configured schema
    Fix: in corosync.conf each section is a keyword, then a space, an opening brace, and the options indented by four spaces underneath; this formatting was mangled when the file was copied, so rather than scp'ing it, fix it by hand:
    [root@node-2 corosync]# vim corosync.conf
    quorum {
    provider: corosync_votequorum
    two_node: 1
    }
