Link Monitoring

Author: 三云_16d2 | Published 2018-12-20 14:11 | Read 0 times

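    Previous approaches
        Full check: logically, compare the result of select * from tcbuyer.order against the result of select * from tc.order.
        Pseudo-incremental check: compare only the previous hour's data.
        Single-stream incremental check: an event-driven comparison. When the buyer database creates an order, MySQL emits a corresponding binlog entry. The single-stream incremental check system uses that binlog entry as its trigger, parses its content, and queries the seller database in real time to see whether a matching record exists.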
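The event-driven check can be sketched roughly as follows. This is a minimal illustration, not the original system: an in-memory `sqlite3` database stands in for the seller-side MySQL database, a plain dict stands in for a parsed binlog entry, and the table name `orders`, the column `order_id`, and the function `check_order_event` are all hypothetical.

```python
import sqlite3


def check_order_event(seller_db: sqlite3.Connection, event: dict) -> bool:
    """Return True if the seller side holds the order named by the event."""
    row = seller_db.execute(
        "SELECT 1 FROM orders WHERE order_id = ?", (event["order_id"],)
    ).fetchone()
    return row is not None


# Stand-in for the seller database (assumption: one row per order).
seller_db = sqlite3.connect(":memory:")
seller_db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
seller_db.execute("INSERT INTO orders VALUES (1001)")

# Stand-ins for parsed binlog entries from the buyer database.
events = [
    {"type": "insert", "order_id": 1001},
    {"type": "insert", "order_id": 1002},
]

# Each event triggers one reverse lookup; unmatched orders are reported.
missing = [e["order_id"] for e in events if not check_order_event(seller_db, e)]
print(missing)
```

In this sketch order 1002 exists only on the buyer side, so it is the one reported as missing.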
    
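    AMG's check graph model: Check Graph
        Suppose four business systems on the transaction path need reconciliation: trade, inventory, funds, and payment. The events involved are, respectively, the order-placed event, the inventory-deduction event, the red-envelope funds usage event, and the payment event. The reconciliation requirements are:
    
        check trade events against inventory events;
    
        check trade events against funds events;
    
        check trade events against payment events;
    
        check funds events against payment events.
    
        If any one of these four checks fails, the business systems are considered to be in an abnormal state, and this must be reported promptly.
    
        This is clearly a graph model. For example, if event A is checked against event B, an edge connects nodes A and B. Taking events as nodes (Node) and the check methods between events as edges (Edge) yields a graph (Graph) model. For the scenario above, the graph is: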
    
    
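            trade <----- check -----> inventory
            trade <----- check -----> funds
            trade <----- check -----> payment
            funds <----- check -----> payment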
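A minimal sketch of the graph described above, with events as nodes and checks as edges. The names `EDGES`, `build_graph`, and `failed_checks` are hypothetical, and the check on each edge is reduced to "both events were observed for the same order, or neither was", which is an assumption made only to keep the example small; a real check would compare the event payloads.

```python
from collections import defaultdict

# Events are nodes; each pair that must be reconciled is an edge.
EDGES = [
    ("trade", "inventory"),
    ("trade", "funds"),
    ("trade", "payment"),
    ("funds", "payment"),
]


def build_graph(edges):
    """Build an undirected adjacency map from the edge list."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def failed_checks(edges, events_by_order, order_id):
    """Return the edges where exactly one of the two events was observed."""
    seen = events_by_order.get(order_id, set())
    return [(a, b) for a, b in edges if (a in seen) != (b in seen)]


graph = build_graph(EDGES)
observed = {1: {"trade", "inventory", "payment"}}  # funds event never arrived
print(sorted(graph["trade"]))
print(failed_checks(EDGES, observed, 1))
```

With the funds event missing, both edges that touch `funds` fail, which is the kind of anomaly the model is meant to surface.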
    

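    To decouple business logic and protect the performance or availability of the main path, the group's systems often depend on each other through synchronous and asynchronous calls, with both strong and weak dependencies. Network jitter, a bug in a business system, or a failure in some subsystem can leave business data inconsistent. Take the two most critical systems, trade and inventory: if a user places an order and inventory is not deducted, overselling is likely; if a user closes an order and inventory is not restored, the item is undersold. Both are data inconsistencies between the trade and inventory systems.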

    from influxdb import InfluxDBClient

    # One point: measurement name, tags, timestamp, and field values.
    json_body = [
        {
            "measurement": "cpu_load_short",
            "tags": {
                "host": "server01",
                "region": "us-west"
            },
            "time": "2009-11-10T23:00:00Z",
            "fields": {
                "value": 0.64
            }
        }
    ]

    # Arguments: host, port, username, password, database.
    client = InfluxDBClient('localhost', 8086, 'root', 'root', 'example')

    # Create the database before writing to it.
    client.create_database('example')

    client.write_points(json_body)

    result = client.query('select value from cpu_load_short;')

    print("Result: {0}".format(result))
    
    
    
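    InfluxDB CLI write in line protocol. Tag values are not quoted, and the timestamp is in epoch nanoseconds (here 2017-09-08 13:00:01, read as UTC):
    insert prism_trace_log,serverApp=camel,serviceName=index.api rt=50 1504875601000000000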
    
    

    // TRACE-type records do not output rpcId by default

    ==========================BaseModel
    traceId
    rpcId
    timestamp
    rpcType
    hostIp
    ==========================RpcModel
    clientApp
    clientIp
    clientSpan
    serverApp
    serverIp
    serverSpan
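    opName // operation name, usually determined by the kind of RPC, e.g. LOCAL, SYNC, CALLBACK, FUTURE; for databases, e.g. QUERY, UPDATE, INSERT, DELETE
    opType // operation type, usually determined by the kind of RPC, e.g. the serialization method or a read/write flag; for databases, R or W for read and write operations
    serviceName // interface name
    methodName // method name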
    error //0
    result // 1,2,3,3,4,5
    ==========================
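As a rough illustration of how a record with these fields might be turned into an InfluxDB point, here is a minimal sketch. The class `RpcRecord`, the helper `to_point`, the millisecond timestamp unit, and the split between tags and fields are all assumptions made for illustration; only the field names and the measurement name `prism_trace` come from these notes.

```python
from dataclasses import dataclass


@dataclass
class RpcRecord:
    # BaseModel fields
    traceId: str
    rpcId: str
    timestamp: int  # epoch milliseconds (assumed unit)
    rpcType: int
    hostIp: str
    # RpcModel fields (a subset)
    clientApp: str
    serverApp: str
    serviceName: str
    methodName: str
    serverSpan: int  # server-side elapsed time
    error: int = 0


def to_point(rec: RpcRecord) -> dict:
    """Build a point dict in the same shape as the json_body example."""
    return {
        "measurement": "prism_trace",
        # Dimensions used for filtering and grouping go into tags (assumption).
        "tags": {
            "rpcType": str(rec.rpcType),
            "clientApp": rec.clientApp,
            "serverApp": rec.serverApp,
            "serviceName": rec.serviceName,
        },
        "time": rec.timestamp,
        # Per-call values that get aggregated go into fields (assumption).
        "fields": {
            "traceId": rec.traceId,
            "rpcId": rec.rpcId,
            "serverSpan": rec.serverSpan,
            "error": rec.error,
        },
    }


rec = RpcRecord("t-1", "0.1", 1504875601000, 1, "10.0.0.1",
                "tlive", "camel", "MemberService", "save", 50)
point = to_point(rec)
print(point["tags"])
```

Storing `rpcType` as a string tag matches the quoted comparisons such as `rpcType='1'` in the InfluxQL queries further down; whether the original schema did the same is not stated.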

    HTTP totals
    select count(*),sum(error),avg(serverSpan) from prism_trace where rpcType=0 and serverApp = ?

    HTTP stats by page
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=0 and serverApp = ? group by serviceName

    RPC totals
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ?
    RPC stats by service
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by serviceName

    RPC service sources
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? and serviceName=? group by clientApp
    RPC service destinations
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? and serviceName=? group by serverApp

    RPC application sources
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by clientApp
    RPC application destinations
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by serverApp

    DB totals
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=3 and serverApp = ?
    DB stats by table
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by serviceName
    DB sources per table
    select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? and serviceName=? group by clientApp
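To see what one of these statistics queries returns, the sketch below runs the "RPC stats by service" query against a few made-up rows. It uses an in-memory `sqlite3` table as a stand-in for the real store, so the column types and the sample data are assumptions; only the query shape comes from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prism_trace ("
    "rpcType INTEGER, serverApp TEXT, serviceName TEXT, "
    "serverSpan INTEGER, error INTEGER)"
)

# Made-up sample records: three RPC calls and one HTTP request.
rows = [
    (1, "camel", "MemberService", 20, 0),
    (1, "camel", "MemberService", 40, 1),
    (1, "camel", "CommentService", 10, 0),
    (0, "camel", "/index.do", 50, 0),
]
conn.executemany("INSERT INTO prism_trace VALUES (?, ?, ?, ?, ?)", rows)

# Same aggregates as the notes: call count, average and max latency, error count.
sql = (
    "SELECT serviceName, count(*), avg(serverSpan), max(serverSpan), sum(error) "
    "FROM prism_trace WHERE rpcType = ? AND serverApp = ? "
    "GROUP BY serviceName ORDER BY serviceName"
)
stats = conn.execute(sql, (1, "camel")).fetchall()
for row in stats:
    print(row)
```

The `?` placeholders are bound as parameters rather than concatenated into the string, which is how the `serverApp = ?` form in the notes is normally used.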

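    Error types:
    /**
     * Unknown
     */
    UNKNOWN,
    /**
     * Success
     */
    OK,
    /**
     * Business error
     */
    BIZ_ERROR,
    /**
     * RPC error
     */
    RPC_ERROR,
    /**
     * Timeout
     */
    TIMEOUT,
    /**
     * Soft error, generally used for cases such as resource not found, cache miss,
     * failure to acquire a lock, or an update skipped because of a version mismatch;
     * how it is determined depends on the middleware
     */
    SOFT_ERROR,
    /**
     * Rate-limiting error
     */
    LIMIT_ERROR,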

    The model fields are as follows:

    OpName: DB operation name, such as QUERY, UPDATE, and (added since TDDL v5) INSERT, DELETE
    OpType: DB operation type, either R or W, for read and write operations
    ServiceDim1: physical database name
    ServiceDim2: tableName, for example JOIN:TABLE_A,TABLE_C,TABLE_B
    ServiceDim3: logical SQL code
    ServerName: (db@dbName), for example andor_mysql_group
    ClientName: clientAppId
    ServerDimKey: TDDL_opName@dbName:tableName

    tlive,,mtop/get.do(),500
    1. tlive,fun,CommentService,save,100
    2. tlive,fun,CommentService,save, 90
    fun,db,"table1",100
    tlive,fun,MemberService,save,200
    fun,db,"table2",100

    /*
     * Numeric codes for RPC types
     */
    // @formatter:off
    public static final int RPC_TYPE_UNKNOWN =                   255;
    public static final int RPC_TYPE_TRACE =                       0;
    public static final int RPC_TYPE_HSF =                         1;
    public static final int RPC_TYPE_HSF_SERVER =                  2;
    public static final int RPC_TYPE_NOTIFY =                      3;
    public static final int RPC_TYPE_TDDL =                        4;
    public static final int RPC_TYPE_TAIR =                        5;
    public static final int RPC_TYPE_SEARCH =                      6;
    public static final int RPC_TYPE_MASTER =                     11;
    public static final int RPC_TYPE_SLAVE =                      12;
    public static final int RPC_TYPE_METAQ =                      13;
    public static final int RPC_TYPE_DRDS =                       14;
    public static final int RPC_TYPE_TFS =                        15;
    public static final int RPC_TYPE_ALIPAY =                     16;
    public static final int RPC_TYPE_HTTP_B =                     20;
    public static final int RPC_TYPE_HTTP =                       25;
    public static final int RPC_TYPE_SENTINEL =                   26;
    public static final int RPC_TYPE_LOCAL =                      30;
    public static final int RPC_TYPE_JINGWEI =                    32;
    public static final int RPC_TYPE_ISEARCH =                    36;
    public static final int RPC_TYPE_LOCAL_NG =                   40;
    public static final int RPC_TYPE_CSB_SERVER =                 52;
    public static final int RPC_TYPE_HTTP_SERVER =               251;
    public static final int RPC_TYPE_METAQ_RCV =                 252;
    public static final int RPC_TYPE_ACCESS =                    253;
    public static final int RPC_TYPE_NOTIFY_RCV =                254;
    // Custom RPC types
    public static final int RPC_TYPE_CUSTOM_TRACE =               90;
    public static final int RPC_TYPE_CUSTOM_RPC_CLIENT =          91;
    public static final int RPC_TYPE_CUSTOM_RPC_SERVER =          92;
    public static final int RPC_TYPE_CUSTOM_MESSAGE_PUB =         93;
    public static final int RPC_TYPE_CUSTOM_MESSAGE_SUB =         96;
    public static final int RPC_TYPE_CUSTOM_DB =                  94;
    public static final int RPC_TYPE_CUSTOM_CACHE =               95;
    public static final int RPC_TYPE_CUSTOM_PROTOCOL_CLIENT =     97;
    public static final int RPC_TYPE_CUSTOM_PROTOCOL_SERVER =     98;
    // @formatter:on
    

    ->A->B->C

    client, server, type
    -,a,0
    a,b,1
    a,b,2
    b,c,1
    b,c,2
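The table above can be generated from the call chain itself. The sketch below is one reading of it, not something the notes state outright: type 0 is taken as the entry record (RPC_TYPE_TRACE), type 1 as the caller-side record of a hop (RPC_TYPE_HSF), and type 2 as the callee-side record of the same hop (RPC_TYPE_HSF_SERVER). The function name `expand_chain` is hypothetical.

```python
def expand_chain(apps):
    """Turn an ordered call chain into (client, server, type) rows.

    The first application gets one entry row with no client ("-");
    every later hop gets a caller-side row and a callee-side row.
    """
    rows = [("-", apps[0], 0)]
    for client, server in zip(apps, apps[1:]):
        rows.append((client, server, 1))  # logged by the caller
        rows.append((client, server, 2))  # logged by the callee
    return rows


chain_rows = expand_chain(["a", "b", "c"])
for row in chain_rows:
    print(",".join(str(x) for x in row))
```

Under that reading, each hop produces two records, so the callee-side rows alone give per-application inbound traffic and the caller-side rows give outbound traffic.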

    LOCAL_IP_ADDRESS= getLocalInetAddress();

    IP_16 = getIP_16(LOCAL_IP_ADDRESS);
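The notes do not show what `getIP_16` does. One plausible reading, offered here only as an assumption, is that it encodes the local IPv4 address as hexadecimal so it can be embedded in an identifier such as a traceId. A minimal sketch of that encoding, with the hypothetical name `ip_to_hex8`:

```python
def ip_to_hex8(ip: str) -> str:
    """Encode a dotted IPv4 address as 8 lowercase hex characters."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    # Two hex digits per octet, zero-padded, in network order.
    return "".join(f"{o:02x}" for o in octets)


print(ip_to_hex8("192.168.1.10"))  # prints c0a8010a
```

A fixed-width encoding like this keeps the identifier length constant regardless of the address.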

    1. Application summary, 2. Service details, 3. Application destinations, 4. Application sources

    1. Summary
      Headline numbers
      select
      count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
      from prism_trace
      where (rpcType='0' or rpcType='2' or rpcType='3') and serverApp='camel' and time>now()-1d and time<=now() group by serverIp,time(1d)

    Table
    select
    count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
    from prism_trace
    where (rpcType='0' or rpcType='2' or rpcType='3') and serverApp='camel' and time>now()-1d and time<=now() group by rpcType,serviceName,time(1d)

    2. Service details

    Main chart
    select
    count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
    from prism_trace
    where serverApp='camel' and serviceName='?' and time>now()-1d and time<=now() group by time(1d)

    Destinations
    select
    count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
    from prism_trace
    where clientApp='camel' and rpcType='1' and clientService='/login.do' and time>now()-1d and time<=now() group by serverApp,serviceName,time(1d)

    Sources
    select
    count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
    from prism_trace
    where serverApp='whale' and rpcType='1' and serviceName='MemberQueryService' and time>now()-1d and time<=now() group by clientApp,clientService,time(1d)

    3. Application destinations

    select
    count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
    from prism_trace
    where clientApp='camel' and rpcType='1' and time>now()-1d and time<=now() group by serverApp,serviceName,time(1d)

    4. Application sources

    select
    count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
    from prism_trace
    where serverApp='whale' and rpcType='1' and time>now()-1d and time<=now() group by clientApp,clientService,time(1d)
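The queries in this section differ only in their tag filters and grouping columns, so they can be produced by one small helper. The sketch below is a hypothetical convenience (`build_stats_query` is not from the original); it only assembles the InfluxQL text and does not talk to a server.

```python
def build_stats_query(filters, group_by, window="1d", measurement="prism_trace"):
    """Build an InfluxQL statistics query string.

    filters:  dict of tag name -> value, combined with "and"
    group_by: tag names to group on, placed before the time bucket
    """
    def quote(value):
        # Keep the sketch simple: refuse values that would need escaping.
        text = str(value)
        if "'" in text or "\\" in text:
            raise ValueError(f"unsupported character in value: {text!r}")
        return "'" + text + "'"

    where = [f"{tag}={quote(val)}" for tag, val in filters.items()]
    where.append(f"time>now()-{window}")
    where.append("time<=now()")
    groups = list(group_by) + [f"time({window})"]
    return (
        "select count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount "
        f"from {measurement} "
        f"where {' and '.join(where)} "
        f"group by {','.join(groups)}"
    )


query = build_stats_query(
    {"serverApp": "whale", "rpcType": "1"},
    ["clientApp", "clientService"],
)
print(query)
```

Called this way it reproduces the application-sources query; swapping the filter to `clientApp` and the grouping to `serverApp, serviceName` gives the application-destinations query.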

          Article link: https://www.haomeiwen.com/subject/rgsykqtx.html