Notes from a core dump analysis


Author: luckriver | Published: 2023-06-04 17:38

    Background

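    Ops notified me that a project I own had accumulated many core files, generated by the system after Python processes crashed. How core files are generated is not covered here; interested readers can look it up.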


    Figure: core files

    Analysis

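    Analyzing a core file requires gdb. Because of how symbol files work, the debugging environment must match the environment where the problem occurred; otherwise gdb shows only garbled output.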


    Figure: gdb showing garbled output

    Because the program runs in k8s, the image has to be downloaded and the debugging done inside a container.

    First, copy the core file into the container with a command like the following:

    docker cp /local/path/file <container_id>:/container_path
    

    The container_id can be found with docker ps.
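The copy step can also be scripted. The following is a minimal sketch, not the author's tooling: the helper names `find_container_ids` and `docker_cp_command` are hypothetical, and it assumes the `docker` CLI is on the PATH and that the container was started from a known image.

```python
import subprocess


def find_container_ids(image):
    """Return IDs of running containers created from `image`.

    Hypothetical helper: wraps `docker ps -q --filter ancestor=<image>`.
    """
    out = subprocess.check_output(
        ["docker", "ps", "-q", "--filter", "ancestor=" + image]
    )
    return out.decode().split()


def docker_cp_command(local_path, container_id, container_path):
    """Build the `docker cp` argument list without running it."""
    target = "{}:{}".format(container_id, container_path)
    return ["docker", "cp", local_path, target]


# Example use (commented out because it needs a running Docker daemon):
# cid = find_container_ids("my-image:tag")[0]   # image name is an assumption
# subprocess.check_call(docker_cp_command("core.17445", cid, "/tmp/"))
```

Building the argument list separately from running it makes it easy to print and check the command before anything is copied.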
    Then analyze the core file as follows.

    1. Start gdb
    gdb /home/chunyu/workspace/ENV/bin/python core.17445
    
    2. Inspect the call stack of the crash

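    A plain bt does not map to Python code; Python's debug info has to be loaded first. Running file python in gdb shows whether the Python debug info is currently loaded. gdb usually also prints how to install the debug info; in my case the command was: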

    yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/8d/75b23c27b98a6fc5656327f915409f6f1fba5b.debug
    

    After that, the py-bt command can be used for the analysis.


    Figure: py-bt output

    The figure above shows that the thread hit the exception, and the core file was produced, while reading data from HBase.
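To compare what every thread was doing, py-bt can be applied to all threads at once. The following is a minimal sketch that runs gdb non-interactively. It assumes gdb is installed in the container and that the Python debug info is loaded, since py-bt is only available then. The function names are hypothetical.

```python
import subprocess


def gdb_all_threads_command(python_bin, core_file):
    """Build a gdb batch command that prints the Python stack of every thread."""
    return [
        "gdb", "-batch",                   # run the commands, then exit
        "-ex", "thread apply all py-bt",   # py-bt needs the Python gdb extension
        python_bin, core_file,
    ]


def dump_python_stacks(python_bin, core_file):
    """Run gdb and return its combined output as text."""
    raw = subprocess.check_output(
        gdb_all_threads_command(python_bin, core_file),
        stderr=subprocess.STDOUT,
    )
    return raw.decode(errors="replace")


# Example use (commented out because it needs gdb and the core file):
# print(dump_python_stacks("/home/chunyu/workspace/ENV/bin/python", "core.17445"))
```

Saving the output to a file makes it easier to search for threads that touch the same library, which is how an overlap such as two threads in HBase calls would show up.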

    3. Root cause analysis

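    My initial suspicion was a multithreading problem. The call stacks of all threads show that thread 1 and thread 35 were both accessing HBase: thread 35 was creating a new connection while thread 1 was still using the old one, which directly caused the failure.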

    Thread 1 (Thread 0x7f66b14aa700 (LWP 32386)):
    #18 Frame 0x7f66a002ac80, for file /home/workspace/ENV/lib/python2.7/site-packages/thriftpy/thrift.py, line 150, in read (self=<getRowsWithColumns_result(io=None, success=None) at remote 0x7f66b1537890>, iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>)
        iprot.read_struct(self)
    #22 Frame 0x7f66a005d860, for file /home/workspace/ENV/lib/python2.7/site-packages/thriftpy/thrift.py, line 217, in _recv (self=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>) at remote 0x7f66b1544a10>, _api='getRowsWithColumns', fname=u'getRowsWithColumns', mtype=2, rseqid=0, result=<getRowsWithColumns_result(io=None, success=None) at remote 0x7f66b1537890>)
        result.read(self._iprot)
    #26 Frame 0x7f669c124950, for file /home/chunyu/workspace/ENV/lib/python2.7/site-packages/thriftpy/thrift.py, line 198, in _req (self=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>) at remote 0x7f66b1544a10>, _api='getRowsWithColumns', 
        return self._recv(_api)
    #36 Frame 0x7f66a000bd00, for file /home/workspace/ENV/lib/python2.7/site-packages/happybase/table.py, line 162, in rows (self=<Table(connection=<Connection(compat='0.96', _transport_class=<type at remote 0x7f670fdb2d00>, table_prefix=None, table_prefix_separator='_', _protocol_class=<type at remote 0x7f670f57b4c0>, _initialized=True, host='offline_hbase', client=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>) at remote 0x7f66b1544a10>, timeout=1000, port=29090, transport=<thriftpy.transport.buffered.cybuffered.TCyBufferedTransport at remote 0x7f66b1554870>) at remote 0x7f66b1544f50>, name='') at remote 0x7f66b1537c10>
        self.name, rows, columns, {})
    
    
    Thread 35 (Thread 0x7f66b0ca9700 (LWP 32390)):
    #14 Frame 0x7f672d8f9790, for file /usr/lib64/python2.7/socket.py, line 224, in meth (name='connect', self=<_socketobject at remote 0x7f66b14c5ad0>, args=(('offline_hbase', 29090),))
        return getattr(self._sock,name)(*args)
    #22 Frame 0x7f66b4ad7620, for file /home/workspace/ENV/lib/python2.7/site-packages/thriftpy/transport/socket.py, line 96, in open (self=<TSocket(socket_timeout=<float at remote 0x7f669c0b15e8>, sock=<_socketobject at remote 0x7f66b14c5ad0>, socket_family=2, unix_socket=None, host='offline_hbase', connect_timeout=<float at remote 0x7f669c0b15e8>, port=29090) at remote 0x7f66b14c6e50>, addr=('offline_hbase', 29090))
        self.sock.connect(addr)
    #31 Frame 0x7f66b56ceb00, for file /home/workspace/ENV/lib/python2.7/site-packages/happybase/connection.py, line 178, in open (self=<Connection(compat='0.96', _transport_class=<type at remote 0x7f670fdb2d00>, table_prefix=None, table_prefix_separator='_', _protocol_class=<type at remote 0x7f670f57b4c0>, host='offline_hbase', client=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>) at remote 0x7f66b14c6790>, timeout=1000, port=29090, transport=<thriftpy.transport.buffered.cybuffered.TCyBufferedTransport at remote 0x7f66b15029b0>) at remote 0x7f66b14c6350>)
        self.transport.open()
    #34 Frame 0x2e41150, for file /home/workspace/ENV/lib/python2.7/site-packages/happybase/connection.py, line 148, in __init__ (self=<Connection(compat='0.96', _transport_class=<type at remote 0x7f670fdb2d00>, table_prefix=None, table_prefix_separator='_', _protocol_class=<type at remote 0x7f670f57b4c0>, host='offline_hbase', client=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>) at remote 0x7f66b14c6790>, timeout=1000, port=29090, transport=<thriftpy.transport.buffered.cybuffered.TCyBufferedTransport at remote 0x7f66b15029b0>) at remote 0x7f66b14c6350>, host='offline_hbase', port=29090, timeout=1000, autoconnect=True, table_prefix=None, table_prefix_separator='_', compat='0.96', transport='buffered', protocol='binary')
        self.open()
    

    Thread 35 was re-establishing the connection to HBase while thread 1 was still reading data on the previous connection. Because a happybase connection is not thread-safe, the program crashed.


    Figure: retry code

    The fix under consideration is to use happybase's connection pool.
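As a rough illustration of that direction, here is a minimal sketch built on `happybase.ConnectionPool`. It is not the author's actual fix: the helper names `make_pool` and `fetch_rows`, the pool size, and the table name in the usage comment are assumptions, while the host, port and timeout are taken from the stack traces above.

```python
def make_pool(host="offline_hbase", port=29090, size=4):
    """Create a pool of HBase connections (the size is an assumed value)."""
    # Imported lazily so this module can be loaded without happybase installed.
    import happybase
    # Extra keyword arguments are passed on to happybase.Connection.
    return happybase.ConnectionPool(size=size, host=host, port=port, timeout=1000)


def fetch_rows(pool, table_name, row_keys, columns=None):
    """Read rows using a connection borrowed from the pool.

    Each thread takes its own connection for the duration of the `with`
    block, so no two threads use the same connection at the same time.
    """
    with pool.connection() as connection:
        table = connection.table(table_name)
        return table.rows(row_keys, columns=columns)


# Example use (commented out because it needs a reachable HBase Thrift server):
# pool = make_pool()
# rows = fetch_rows(pool, "my_table", ["row-1", "row-2"])  # hypothetical table
```

The important point is that code no longer keeps a long-lived `Connection` shared between threads, and any retry logic asks the pool for a connection instead of constructing a new one while another thread may still be reading.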


Article link: https://www.haomeiwen.com/subject/ygwbedtx.html