PGPool-II+PG流复制实现HA主备切换

作者: 遥想公瑾当年 | 来源:发表于2016-12-21 20:46 被阅读4450次

    基于PG的流复制能实现热备切换,但是是要手动建立触发文件实现,对于一些HA场景来说,需要当主机down了后,备机自动切换,经查询资料知道pgpool-II可以实现这种功能。本文基于PG流复制基础上 ,以pgpool-II实现主备切换。在配置pgpool之前需分别在两台规划机上安装好pg数据库,且配置好了流复制环境,关于流复制配置参考前文:http://www.jianshu.com/p/12bc931ebba3

    pgpool双机集群架构图.png

      基于PGPool的双机集群如上图所示:pg主节点和备节点实现流复制热备,pgpool1,pgpool2作为中间件,将主备pg节点加入集群,实现读写分离,负载均衡和HA故障自动切换。两pgpool节点可以委托一个虚拟ip节点作为应用程序访问的地址,两节点之间通过watchdog进行监控,当pgpool1宕机时,pgpool2会自动接管虚拟ip继续对外提供不间断服务。

    一 主机规划

    主机名 | IP | 角色 | 端口
    :----:|:----:|:----:|:----:|:----:|:----:
    master| 192.168.0.108 |PGMaster|5432
    | 192.168.0.108|pgpool1|9999
    slave|192.168.0.109|PGSlave|5432
    | 192.168.0.109|pgpool2|9999
    vip|192.168.0.150|虚拟ip|9999
    建立好主机规划之后,在master,slave上两台机器设置下host

    [root@localhost ~]# vi .bashrc
    #编辑内容如下:
    192.168.0.108 master
    192.168.0.109 slave
    192.168.0.150 vip
    

    二 配置ssh秘钥

    在master,slave机器上都生成ssh如下:

    [root@localhost ~]# su - postgres
    [postgres@localhost ~]$ ssh-keygen -t rsa
    [postgres@localhost ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    [postgres@localhost ~]$ chmod 600 ~/.ssh/authorized_keys
    

    分别将master的公钥复制到slave,slave的公钥复制到master。

    #master端
    [postgres@localhost ~]$ scp ~/.ssh/authorized_keys postgres@slave:~/.ssh/
    #slave端
    [postgres@localhost ~]$ scp ~/.ssh/authorized_keys postgres@master:~/.ssh/
    

    验证下ssh配置是否成功

    #master端
    [postgres@slave ~]$ ssh postgres@slave
    Last login: Tue Dec 20 21:22:50 2016 from master
    #slave端
    [postgres@slave ~]$ ssh postgres@master
    Last login: Tue Dec 20 21:22:50 2016 from slave
    

    证明ssh信任关系配置成功。

    三 安装pgpool

    中文配置地址可参考http://pgpool.projects.pgfoundry.org/pgpool-II/doc/pgpool-zh_cn.html

    # 下载pgpool
    [root@master opt]#   wget http://www.pgpool.net/mediawiki/images/pgpool-II-3.6.0.tar.gz
    # 解压
    [root@master opt]#   tar -zxvf pgpool-II-3.6.0.tar.gz
    # 文件权限设置为postgres(其实并非一定装在postgres账户,只不过之前ssh设置都在postgres下,为了方便)
    [root@master opt]#   chown -R postgres.postgres /opt/pgpool-II-3.6.0
    [root@master ~]# su - postgres
    [postgres@master opt]$  cd pgpool-II-3.6.0
    [postgres@master pgpool-II-3.6.0]$  ./configure –prefix=/opt/pgpool -with-pgsql=path -with-pgsql=/home/postgres
    [postgres@master pgpool-II-3.6.0]$  make
    [postgres@master pgpool-II-3.6.0]$  make install
    

    安装pgpool相关函数,并非强制,可选安装,为了系统稳定,建议安装
    安装pg_reclass,pg_recovery

    [postgres@master pgpool-II-3.6.0]$  cd src/sql
    [postgres@master sql]$  make
    [postgres@master sql]$  make install
    [postgres@master sql]$  psql -f insert_lock.sql
    

    安装全部结束。

    四 配置pgpool

    4.1 配置pgpool环境变量

    pgpool装在了postgres账户下,在该账户中添加环境变量,master,slave节点都执行。

    [postgres@master ~]$ cd /home/postgres
    [postgres@master ~]$ vim .bashrc
    #编辑内容如下
    PGPOOLHOME=/opt/pgpool
    export PGPOOLHOME
    PATH=$PATH:$HOME/.local/bin:$HOME/bin:$PGHOME/bin:$PGPOOLHOME/bin
    export PATH
    

    4.2 配置pool_hba.conf

    pool_hba.conf是对登录用户进行验证的,要和pg的pg_hba.conf保持一致,要么都是trust,要么都是md5验证方式,这里采用了md5验证方式如下设置:

    [postgres@master ~]$ cd /opt/pgpool/etc
    [postgres@etc~]$ cp pool_hba.conf.sample pool_hba.conf
    [postgres@etc~]$ vim pool_hba.conf
    #编辑内容如下
    # "local" is for Unix domain socket connections only
    local   all         all                            md5
    # IPv4 local connections:
    host    all         all         0.0.0.0/0          md5
    host    all         all         0/0                md5
    

    4.3 配置pcp.conf

    pcp.conf配置用于pgpool自己登陆管理使用的,一些操作pgpool的工具会要求提供密码等,配置如下:

    [postgres@master ~]$ cd /opt/pgpool/etc
    [postgres@etc~]$ cp pcp.conf.sample pcp.conf
    # 使用pg_md5生成配置的用户名密码
    [postgres@etc~]$ pg_md5 nariadmin
    6b07583ba8af8e03043a1163147faf6a
    #pcp.conf是pgpool管理器自己的用户名和密码,用于管理集群。
    [postgres@etc~]$ vim pcp.conf
    #编辑内容如下
    postgres:6b07583ba8af8e03043a1163147faf6a
    #保存退出!
    #在pgpool中添加pg数据库的用户名和密码
    [postgres@etc~]$ pg_md5 -p -m -u postgres pool_passwd
    #数据库登录用户是postgres,这里输入登录密码,不能出错
    #输入密码后,在pgpool/etc目录下会生成一个pool_passwd文件
    

    4.4 配置系统命令权限

    配置 ifconfig, arping 执行权限 ,执行failover_stream.sh需要用到,可以让其他普通用户执行。

    [root@master ~]# chmod u+s /sbin/ifconfig 
    [root@master ~]# chmod u+s /usr/sbin 
    

    4.5 配置pgpool.conf

    查看本机网卡,配置后面的delegate_IP需要

    [postgres@etc~]$ ifconfig
    
    网卡名称.png

    配置master上的pgpool.conf:

    [postgres@master ~]$ cd /opt/pgpool/etc
    [postgres@etc~]$ cp pgpool.conf.sample pgpool.conf
    [postgres@etc~]$ vim pgpool.conf
    

    编辑内容如下:

    # CONNECTIONS
    listen_addresses = '*'
    port = 9999
    pcp_listen_addresses = '*'
    pcp_port = 9898
    
    # - Backend Connection Settings -
    
    backend_hostname0 = 'master'
    backend_port0 = 5432
    backend_weight0 = 1
    backend_data_directory0 = '/home/postgres/data'
    backend_flag0 = 'ALLOW_TO_FAILOVER'
    
    backend_hostname1 = 'slave'
    backend_port1 = 5432
    backend_weight1 = 1
    backend_data_directory1 = '/home/postgres/data'
    backend_flag1 = 'ALLOW_TO_FAILOVER'
    
    # - Authentication -
    enable_pool_hba = on
    pool_passwd = 'pool_passwd'
    
    # FILE LOCATIONS
    pid_file_name = '/opt/pgpool/pgpool.pid'
    
    replication_mode = off
    load_balance_mode = on
    master_slave_mode = on
    master_slave_sub_mode = 'stream'
    
    sr_check_period = 5
    sr_check_user = 'repuser'
    sr_check_password = 'repuser'
    sr_check_database = 'postgres'
    
    #------------------------------------------------------------------------------
    # HEALTH CHECK 健康检查
    #------------------------------------------------------------------------------
    
    health_check_period = 10 # Health check period
                                       # Disabled (0) by default
    health_check_timeout = 20
                                       # Health check timeout
                                       # 0 means no timeout
    health_check_user = 'postgres'
                                       # Health check user
    health_check_password = 'nariadmin' #数据库密码
                                       # Password for health check user
    health_check_database = 'postgres'
    #必须设置,否则primary数据库down了,pgpool不知道,不能及时切换。从库流复制还在连接数据,报连接失败。
    #只有下次使用pgpool登录时,发现连接不上,然后报错,这时候,才知道挂了,pgpool进行切换。
    
    
    #主备切换的命令行配置
    #------------------------------------------------------------------------------
    # FAILOVER AND FAILBACK
    #------------------------------------------------------------------------------
    
    failover_command = '/opt/pgpool/failover_stream.sh %H '
    
    #------------------------------------------------------------------------------
    # WATCHDOG
    #------------------------------------------------------------------------------
    
    # - Enabling -
    use_watchdog = on
    # - Watchdog communication Settings -
    
    wd_hostname = 'master'
                                        # Host name or IP address of this watchdog
                                        # (change requires restart)
    wd_port = 9000
                                        # port number for watchdog service
                                        # (change requires restart)
    # - Virtual IP control Setting -
    
    delegate_IP = 'vip'
                                        # delegate IP address
                                        # If this is empty, virtual IP never bring up.
                                        # (change requires restart)
    if_cmd_path = '/sbin'
                                        # path to the directory where if_up/down_cmd exists
                                        # (change requires restart)
    if_up_cmd = 'ifconfig eth1:0 inet $_IP_$ netmask 255.255.255.0'
                                        # startup delegate IP command
                                        # (change requires restart)
                                        # eth1根据现场机器改掉
    if_down_cmd = 'ifconfig eth1:0 down'
                                        # shutdown delegate IP command
                                        # (change requires restart)
                                        # eth1根据现场机器改掉
    # -- heartbeat mode --
    
    wd_heartbeat_port = 9694
                                        # Port number for receiving heartbeat signal
                                        # (change requires restart)
    wd_heartbeat_keepalive = 2
                                        # Interval time of sending heartbeat signal (sec)
                                        # (change requires restart)
    wd_heartbeat_deadtime = 30
                                        # Deadtime interval for heartbeat signal (sec)
                                        # (change requires restart)
    heartbeat_destination0 = 'slave'
                                        # Host name or IP address of destination 0
                                        # for sending heartbeat signal.
                                        # (change requires restart)
    heartbeat_destination_port0 = 9694
                                        # Port number of destination 0 for sending
                                        # heartbeat signal. Usually this is the
                                        # same as wd_heartbeat_port.
                                        # (change requires restart)
    heartbeat_device0 = 'eth1'
                                        # Name of NIC device (such like 'eth0')
                                        # used for sending/receiving heartbeat
                                        # signal to/from destination 0.
                                        # This works only when this is not empty
                                        # and pgpool has root privilege.
                                        # (change requires restart)
                                        # eth1根据现场机器改掉
    # - Other pgpool Connection Settings -
    
    other_pgpool_hostname0 = 'slave' #对端
                                        # Host name or IP address to connect to for other pgpool 0
                                        # (change requires restart)
    other_pgpool_port0 = 9999
                                        # Port number for othet pgpool 0
                                        # (change requires restart)
    other_wd_port0 = 9000
                                        # Port number for othet watchdog 0
                                        # (change requires restart)
    
    

    配置slave上的pgpool.conf:

    # CONNECTIONS
    listen_addresses = '*'
    port = 9999
    pcp_listen_addresses = '*'
    pcp_port = 9898
    
    # - Backend Connection Settings -
    
    backend_hostname0 = 'master'
    backend_port0 = 5432
    backend_weight0 = 1
    backend_data_directory0 = '/home/postgres/data'
    backend_flag0 = 'ALLOW_TO_FAILOVER'
    
    backend_hostname1 = 'slave'
    backend_port1 = 5432
    backend_weight1 = 1
    backend_data_directory1 = '/home/postgres/data'
    backend_flag1 = 'ALLOW_TO_FAILOVER'
    
    # - Authentication -
    enable_pool_hba = on
    pool_passwd = 'pool_passwd'
    
    # FILE LOCATIONS
    pid_file_name = '/opt/pgpool/pgpool.pid'
    
    replication_mode = off
    load_balance_mode = on
    master_slave_mode = on
    master_slave_sub_mode = 'stream'
    
    sr_check_period = 5
    sr_check_user = 'repuser'
    sr_check_password = 'repuser'
    sr_check_database = 'postgres'
    
    #------------------------------------------------------------------------------
    # HEALTH CHECK 健康检查
    #------------------------------------------------------------------------------
    
    health_check_period = 10 # Health check period
                                       # Disabled (0) by default
    health_check_timeout = 20
                                       # Health check timeout
                                       # 0 means no timeout
    health_check_user = 'postgres'
                                       # Health check user
    health_check_password = 'nariadmin' #数据库密码
                                       # Password for health check user
    health_check_database = 'postgres'
    #必须设置,否则primary数据库down了,pgpool不知道,不能及时切换。从库流复制还在连接数据,报连接失败。
    #只有下次使用pgpool登录时,发现连接不上,然后报错,这时候,才知道挂了,pgpool进行切换。
    
    
    #主备切换的命令行配置
    #------------------------------------------------------------------------------
    # FAILOVER AND FAILBACK
    #------------------------------------------------------------------------------
    
    failover_command = '/opt/pgpool/failover_stream.sh %H '
    
    #------------------------------------------------------------------------------
    # WATCHDOG
    #------------------------------------------------------------------------------
    
    # - Enabling -
    use_watchdog = on
    # - Watchdog communication Settings -
    
    wd_hostname = 'slave'  #本端
                                        # Host name or IP address of this watchdog
                                        # (change requires restart)
    wd_port = 9000
                                        # port number for watchdog service
                                        # (change requires restart)
    # - Virtual IP control Setting -
    
    delegate_IP = 'vip'
                                        # delegate IP address
                                        # If this is empty, virtual IP never bring up.
                                        # (change requires restart)
    if_cmd_path = '/sbin'
                                        # path to the directory where if_up/down_cmd exists
                                        # (change requires restart)
    if_up_cmd = 'ifconfig eth1:0 inet $_IP_$ netmask 255.255.255.0'
                                        # startup delegate IP command
                                        # (change requires restart)
                                        # eth1根据现场机器改掉
    if_down_cmd = 'ifconfig eth1:0 down'
                                        # shutdown delegate IP command
                                        # (change requires restart)
                                        # eth1根据现场机器改掉
    # -- heartbeat mode --
    
    wd_heartbeat_port = 9694
                                        # Port number for receiving heartbeat signal
                                        # (change requires restart)
    wd_heartbeat_keepalive = 2
                                        # Interval time of sending heartbeat signal (sec)
                                        # (change requires restart)
    wd_heartbeat_deadtime = 30
                                        # Deadtime interval for heartbeat signal (sec)
                                        # (change requires restart)
    heartbeat_destination0 = 'master' #对端
                                        # Host name or IP address of destination 0
                                        # for sending heartbeat signal.
                                        # (change requires restart)
    heartbeat_destination_port0 = 9694
                                        # Port number of destination 0 for sending
                                        # heartbeat signal. Usually this is the
                                        # same as wd_heartbeat_port.
                                        # (change requires restart)
    heartbeat_device0 = 'eth1'
                                        # Name of NIC device (such like 'eth0')
                                        # used for sending/receiving heartbeat
                                        # signal to/from destination 0.
                                        # This works only when this is not empty
                                        # and pgpool has root privilege.
                                        # (change requires restart)
                                        # eth1根据现场机器改掉
    # - Other pgpool Connection Settings -
    
    other_pgpool_hostname0 = 'master' #对端
                                        # Host name or IP address to connect to for other pgpool 0
                                        # (change requires restart)
    other_pgpool_port0 = 9999
                                        # Port number for othet pgpool 0
                                        # (change requires restart)
    other_wd_port0 = 9000
                                        # Port number for othet watchdog 0
                                        # (change requires restart)
    
    

    配置文件里,故障处理配置的是failover_command = '/opt/pgpool/failover_stream.sh %H ',因此,需要在/opt/pgpool目录中写个failover_stream.sh脚本:

    [postgres@master ~]$ cd /opt/pgpool
    [postgres@pgpool~]$ touch failover_stream.sh
    [postgres@pgpool~]$ vim failover_stream.sh
    

    注意这里使用了promote 而不是触发文件,触发文件来回切换有问题,编辑内容如下:

    #! /bin/sh 
    # Failover command for streaming replication. 
    # Arguments: $1: new master hostname. 
    
    new_master=$1 
    trigger_command="$PGHOME/bin/pg_ctl promote -D $PGDATA" 
    
    # Prompte standby database. 
    /usr/bin/ssh -T $new_master $trigger_command 
    
    exit 0; 
    

    如果是其他用户创建的,需要赋予postgres可执行权限,例如

    [root@opt ~]$ chown -R postgres.postgres /opt/pgpool
    [root@opt ~]]$ chmod 777  /opt/pgpool/failover_stream.sh
    

    五 PGPool集群管理

    启动之前在master,slave节点创建两个日志文件:

    [root@master ~]# mkdir /var/log/pgpool
    [root@master ~]# chown -R postgres.postgres /var/log/pgpool
    [root@master ~]# mkdir /var/run/pgpool
    [root@master ~]# chown -R postgres.postgres /var/run/pgpool
    

    5.1 启动集群

    分别启动primary,standby的pg库

    #master上操作
    [postgres@master ~]$ pg_ctl start -D $PGDATA
    #slave上操作
    [postgres@slave ~]$ pg_ctl start -D $PGDATA
    

    分别启动pgpool命令:

    #master上操作
    # -D会重新加载pg nodes的状态如down或up
    [postgres@master ~]$ pgpool -n -d -D > /var/log/pgpool/pgpool.log 2>&1 &
    [1] 3557
    
    #slave上操作
    [postgres@slave ~]$ pgpool -n -d -D > /var/log/pgpool/pgpool.log 2>&1 &
    [1] 3557
    

    注意快速终止pgpool命令:

    [postgres@ ~]$ pgpool -m fast stop
    

    启动pgpool后,查看集群节点状态:

    [postgres@master ~]$ psql -h vip -p 9999
    psql (9.6.1)
    #提示输入密码:
    Type "help" for help.
    
    postgres=# show pool_nodes;
     node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
    ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
     0       | master   | 5432 | up     | 0.500000  | primary | 0             | false  | 0
     1       | slave     | 5432 | up     | 0.500000  | standby | 0             |  true  | 0
    (2 rows)
    
    #在slave上节点也是psql -h vip -p 9999,双pgpool使用虚拟ip,做到高可用。
    

    发现当前主备节点都是正常的up状态。

    5.2 Pgpool的HA

    5.2.1 模拟master端pgpool宕机

    在master节点上停止pgpool服务
    [postgres@master ~]$ pgpool -m fast stop
    #稍等片刻后,访问集群
    [postgres@master ~]$ psql -h vip -p 9999
    psql (9.6.1)
    #提示输入密码:
    Type "help" for help.
    
    postgres=# show pool_nodes;
     node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
    ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
     0       | master   | 5432 | up     | 0.500000  | primary | 0             | false  | 0
     1       | slave     | 5432 | up     | 0.500000  | standby | 0             |  true  | 0
    (2 rows)
    #访问成功,在master节点上的pgpool宕机后,由slave节点的pgpool接管vip和集群服务,并未中断应用访问。
    #在master上重新启动pgpool后,定制slave上的pgpool服务,结果一样。
    

    5.2.2模拟master端pg primary宕机

    [postgres@master ~]$ pg_ctl stop
    #master端打印
    2017-07-24 18:52:37.751 PDT [28154] STATEMENT:  SELECT pg_current_xlog_location()
    2017-07-24 18:52:37.760 PDT [2553] LOG:  received fast shutdown request
    2017-07-24 18:52:37.760 PDT [2553] LOG:  aborting any active transactions
    2017-07-24 18:52:37.762 PDT [28156] FATAL:  canceling authentication due to timeout
    2017-07-24 18:52:37.763 PDT [2555] LOG:  shutting down
    2017-07-24 18:52:37.768 PDT [28158] FATAL:  the database system is shutting down
    2017-07-24 18:52:37.775 PDT [28159] FATAL:  the database system is shutting down
    2017-07-24 18:52:39.653 PDT [2553] LOG:  database system is shut down
    
    #slave端打印
    2017-07-24 18:52:41.455 PDT [2614] LOG:  invalid record length at 0/2A000098: wanted 24, got 0
    2017-07-24 18:52:47.333 PDT [2614] LOG:  received promote request
    2017-07-24 18:52:47.333 PDT [2614] LOG:  redo done at 0/2A000028
    2017-07-24 18:52:47.333 PDT [2614] LOG:  last completed transaction was at log time 2017-07-24 18:17:00.946759-07
    2017-07-24 18:52:47.336 PDT [2614] LOG:  selected new timeline ID: 10
    2017-07-24 18:52:47.841 PDT [2614] LOG:  archive recovery complete
    2017-07-24 18:52:47.851 PDT [2613] LOG:  database system is ready to accept connections
    
    #日志清楚看到主机down机了,slave切换了。
    #稍等片刻后,访问集群
    [postgres@master ~]$ psql -h vip -p 9999
    Password: 
    psql (10beta1)
    Type "help" for help.
    
    postgres=# show pool_nodes;
     node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
    ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
     0       | master   | 5432 | down   | 0.500000  | standby | 0          | false             | 0
     1       | slave    | 5432 | up     | 0.500000  | primary | 0          | true              | 0
    (2 rows)
    #slave已经被切换成primary,且master节点状态是down
    

    5.2.3 修复master节点重新加入集群

    master节点down机后,slave节点已经被切换成了primary,修复好master后应重新加入节点,作为primary的standby。
    修复master端并启动操作:

    [postgres@master ~]$ cd $PGDATA
    [postgres@master data]$ mv recovery.done recovery.conf #一定要把.done改成.conf
    [postgres@master data]$ pg_ctl start
    
    

    在pgpool集群中加入节点状态:

    #注意master的node_id是0,所以-n 0
    [postgres@master data]$ pcp_attach_node -d -U postgres -h vip -p 9898 -n 0
    #提示输入密码,输入pcp管理密码。
    #查看当前状态
    postgres=# show pool_nodes;
     node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
    ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
     0       | master   | 5432 | up    | 0.500000  | standby | 0             | false  | 0
     1       | slave     | 5432 | up     | 0.500000  | primary | 0             |  true  | 0
    (2 rows)
    

    5.2.4 主机直接down机

    当前slave节点是primay,我们直接将slave服务器直接关机后,发现实现了主备切换,slave已经down了,而master已经被切换成了primary:

    [postgres@master ~]$ psql -h vip -p 9999
    Password: 
    psql (10beta1)
    Type "help" for help.
    
    postgres=# show pool_nodes;
     node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
    ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
     0       | master   | 5432 | up     | 0.500000  | primary | 0          | true              | 0
     1       | slave    | 5432 | down   | 0.500000  | standby | 0          | false             | 0
    (2 rows)
    
    

    5.3 数据线同步

    在主备切换时,修复节点并重启后,由于primary数据发生变化,或修复的节点数据发生变化再按照流复制模式加入集群,很可能报时间线不同步错误:

    #slave机器重启后,由于master或slave数据不同步产生了
    [postgres@slave data]$ mv recovery.done recovery.conf
    [postgres@slave data]$ pg_ctl start
    waiting for server to start....2017-07-24 19:31:44.563 PDT [2663] LOG:  listening on IPv4 address "0.0.0.0", port 5432
    2017-07-24 19:31:44.563 PDT [2663] LOG:  listening on IPv6 address "::", port 5432
    2017-07-24 19:31:44.565 PDT [2663] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
    2017-07-24 19:31:44.584 PDT [2664] LOG:  database system was shut down at 2017-07-24 19:31:30 PDT
    2017-07-24 19:31:44.618 PDT [2664] LOG:  entering standby mode
    2017-07-24 19:31:44.772 PDT [2664] LOG:  consistent recovery state reached at 0/2D000098
    2017-07-24 19:31:44.772 PDT [2663] LOG:  database system is ready to accept read only connections
    2017-07-24 19:31:44.772 PDT [2664] LOG:  invalid record length at 0/2D000098: wanted 24, got 0
    2017-07-24 19:31:44.798 PDT [2668] LOG:  fetching timeline history file for timeline 11 from primary server
    2017-07-24 19:31:44.826 PDT [2668] FATAL:  could not start WAL streaming: ERROR:  requested starting point 0/2D000000 on timeline 10 is not in this server's history
        DETAIL:  This server's history forked from timeline 10 at 0/2B0001B0.
    2017-07-24 19:31:44.826 PDT [2664] LOG:  new timeline 11 forked off current database system timeline 10 before current recovery point 0/2D000098
     done
    
    

    产生这种情况,需要根据pg_rewind工具同步数据时间线,具体分5步走。

    5.3.1停掉需要做同步的节点pg服务

    [postgres@slave ] pg_ctl stop 
    
    

    5.3.2 同步master节点上时间线

    [postgres@slave data]$ pg_rewind  --target-pgdata=/home/postgres/data --source-server='host=master port=5432 user=postgres dbname=postgres password=nariadmin'
    servers diverged at WAL location 0/2B0001B0 on timeline 10
    rewinding from last common checkpoint at 0/2B000108 on timeline 10
    Done!
    

    5.3.3 修改pg_hba.conf与 recovery.done文件

    #pg_hba.conf与 recovery.done都是同步master上来的,要改成slave自己的
    [postgres@slave ] cd $PGDATA
    [postgres@slave data]$ mv recovery.done recovery.conf
    [postgres@slave data]$ vi pg_hba.conf
    #slave改成master(相当于slave的流复制对端)
    host    replication     repuser         master                   md5
    [postgres@slave data]$ vi recovery.conf
    #slave改成master(相当于slave的流复制对端)
    primary_conninfo = 'host=master port=5432 user=repuser password=repuser'   
    

    5.3.4 重启pg服务

    [postgres@slave data]$ pg_ctl start
    waiting for server to start....2017-07-24 19:47:06.821 PDT [2722] LOG:  listening on IPv4 address "0.0.0.0", port 5432
    2017-07-24 19:47:06.821 PDT [2722] LOG:  listening on IPv6 address "::", port 5432
    2017-07-24 19:47:06.907 PDT [2722] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
    2017-07-24 19:47:06.930 PDT [2723] LOG:  database system was interrupted while in recovery at log time 2017-07-24 19:25:42 PDT
    2017-07-24 19:47:06.930 PDT [2723] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
    2017-07-24 19:47:06.961 PDT [2723] LOG:  entering standby mode
    2017-07-24 19:47:06.966 PDT [2723] LOG:  redo starts at 0/2B0000D0
    2017-07-24 19:47:06.971 PDT [2723] LOG:  consistent recovery state reached at 0/2B01CA30
    2017-07-24 19:47:06.972 PDT [2722] LOG:  database system is ready to accept read only connections
    2017-07-24 19:47:06.972 PDT [2723] LOG:  invalid record length at 0/2B01CA30: wanted 24, got 0
    2017-07-24 19:47:06.982 PDT [2727] LOG:  started streaming WAL from primary at 0/2B000000 on timeline 11
     done
    server started
    

    5.3.5 重新加入集群

    #注意slave的node_id是1,所以-n 1
    [postgres@slave data]$ pcp_attach_node -d -U postgres -h vip -p 9898 -n 1
    Password: #提示输入密码,输入pcp管理密码。
    DEBUG: recv: tos="m", len=8
    DEBUG: recv: tos="r", len=21
    DEBUG: send: tos="C", len=6
    DEBUG: recv: tos="c", len=20
    pcp_attach_node -- Command Successful
    DEBUG: send: tos="X", len=4
    

    5.3.6 查看集群节点状态

    [postgres@slave data]$ psql -h vip -p 9999
    Password: 
    psql (10beta1)
    Type "help" for help.
    
    postgres=# show pool_nodes;
     node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
    ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
     0       | master   | 5432 | up     | 0.500000  | primary | 0          | true              | 0
     1       | slave    | 5432 | up     | 0.500000  | standby | 0          | false             | 0
    (2 rows)
    
    

    全部恢复工作完成。

    相关文章

      网友评论

      • IrvinX:请教个问题: 主节点A宕机重启变为从节点,从节点 B 提升为主节点; 如果当前主节点B挂掉, 原来的主节点A如何自动提升为主节点???:pray:
        IrvinX:@遥想公瑾当年 谢谢! 还有个问题,就是外部服务通过 pgpool访问数据库服务, 若将该请求访问的节点的数据库关闭, 服务再次访问会出现数据库连接失败的问题, 需要过一段时间访问才会恢复, 这个问题出现在哪里, 有什么解决方案吗?
        遥想公瑾当年:@IrvinX 一样的,首先修复启动A,然后加入集群。等于a b换了位置
      • Kaeser:$trigger_command 在failover.sh 中是怎么来的?是传入的参数?传入的参数只有一个%H啊。这个值难道是环境变量里面传入的?但是环境变量里面也没有这个值啊。
        Kaeser:哦。就在上一行啊。。。。我的锅看漏了。继续试试。
      • thirc:老哥,那个虚拟ip (vip) 是随便写一个就好吗?
      • 3f2abaefd0db:如果postgres放在docker中配置还是一样的吗

      本文标题:PGPool-II+PG流复制实现HA主备切换

      本文链接:https://www.haomeiwen.com/subject/xxagmttx.html