Background

Problem description

The logger of the production Osprey service that ships logs to ES (Elasticsearch) reported the following error:

[image: error message from the logger]

As a result, the log data could not be dumped into ES.

Cause

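The root cause is that the same field was sent with different types at different times, which produces the dump error.
Inspecting the index mapping in Kibana for the production cluster in Ningxia shows: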

{
    "fluentd-2019.12.16": {
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date"
                },
                "cost": {
                    "type": "float"
                },
                "field": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "json": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "logger": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "msg": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "ts": {
                    "type": "date"
                },
                "version": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}

The mapping declares ts as type date, whereas the ts in the error message is a Unix timestamp encoded as a float. The types do not match, which causes the error.
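To make the mismatch concrete, here is a minimal standard-library sketch (it deliberately does not use zap) of the two ways a ts value can be serialized: as a JSON number holding fractional epoch seconds, and as an ISO 8601 string. The function names and the sample entry are illustrative assumptions, not part of Osprey.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// sampleTime returns a fixed instant so the output is reproducible.
func sampleTime() time.Time {
	return time.Date(2019, 12, 16, 8, 30, 0, 500000000, time.UTC)
}

// epochSecondsTS renders t as fractional Unix seconds (a JSON number).
func epochSecondsTS(t time.Time) float64 {
	return float64(t.Unix()) + float64(t.Nanosecond())/1e9
}

// iso8601TS renders t as an ISO 8601 string with millisecond precision.
func iso8601TS(t time.Time) string {
	return t.Format("2006-01-02T15:04:05.000Z0700")
}

// encodeEntry marshals a minimal log entry whose ts field holds the given value.
func encodeEntry(ts interface{}) string {
	b, err := json.Marshal(map[string]interface{}{"msg": "hello", "ts": ts})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	t := sampleTime()
	fmt.Println(encodeEntry(epochSecondsTS(t))) // ts is a JSON number
	fmt.Println(encodeEntry(iso8601TS(t)))      // ts is a JSON string
}
```

The two lines printed by main carry the same instant, but the first has a numeric ts and the second a string ts, which is the kind of per-field type difference described above.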

Solution

Change the timestamp to the same format that Readygo uses:

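import (
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

func GetLogger() (*zap.Logger, error) {
    cfg := zap.NewProductionConfig()

    // Write to standard output; usually this is the only output that needs configuring.
    cfg.OutputPaths = []string{
        "stdout",
    }

    // Level is a simple filter: only entries with level >= cfg.Level are emitted.
    cfg.Level = zap.NewAtomicLevelAt(zap.InfoLevel)

    // Emit logs as JSON.
    cfg.Encoding = "json"

    // Start from the default production encoder settings,
    // then write ts as an ISO8601 string and the level in capitals.
    cfg.EncoderConfig = zap.NewProductionEncoderConfig()
    cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
    cfg.EncoderConfig.EncodeLevel = zapcore.CapitalLevelEncoder
    return cfg.Build()
}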

Result

After osprey v1.3.4 was released, the dump error no longer appears in Kibana for the Osprey staging project.
However, the same error still occurs in the following projects:

jaeger-agent-daemonset
kube-cleanup
policy-brain-searcher
csc-osprey
dori
jarvis-client-go

Query Kibana to see all dump errors. The projects above need to check whether the field types in their logs match the production mapping.
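As a starting point for that check, the following sketch compares one JSON log line against a hand-copied subset of the mapping. It is only an approximation under stated assumptions: the expectedTypes table, the helper names, and the rule that a date must be a string in the ISO 8601 layout used here are all illustrative, and the rule is stricter than what ES itself may accept.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// isoLayout is the ISO 8601 layout assumed for date fields in this sketch.
const isoLayout = "2006-01-02T15:04:05.000Z0700"

// expectedTypes is a hand-copied subset of the mapping (assumption: keep it in sync manually).
var expectedTypes = map[string]string{
	"ts":   "date",
	"cost": "float",
	"msg":  "text",
}

// matches reports whether a decoded JSON value fits the given mapping type.
func matches(value interface{}, esType string) bool {
	switch esType {
	case "date":
		s, ok := value.(string)
		if !ok {
			return false
		}
		_, err := time.Parse(isoLayout, s)
		return err == nil
	case "float":
		_, ok := value.(float64) // encoding/json decodes every JSON number as float64
		return ok
	case "text":
		_, ok := value.(string)
		return ok
	}
	return true // unknown mapping types are not checked
}

// mismatchedFields returns the sorted names of fields whose value does not fit expectedTypes.
func mismatchedFields(line string) ([]string, error) {
	var entry map[string]interface{}
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return nil, err
	}
	var bad []string
	for field, esType := range expectedTypes {
		if value, present := entry[field]; present && !matches(value, esType) {
			bad = append(bad, field)
		}
	}
	sort.Strings(bad)
	return bad, nil
}

func main() {
	lines := []string{
		`{"msg":"ok","ts":"2019-12-16T08:30:00.500Z","cost":0.0739479}`,
		`{"msg":"ok","ts":1576485000.5,"cost":"3 USD"}`,
	}
	for _, line := range lines {
		bad, err := mismatchedFields(line)
		fmt.Println(bad, err)
	}
}
```

In this sketch the first sample line has no mismatches, while the second is flagged for both cost and ts.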

Summary

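The 400 dump error is caused by the same field receiving values of different types over time. The fix is to agree on a single standard for common fields. I recommend that every project adopt the standard Readygo uses, that is, formatting time as ISO8601.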

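A follow-up problem to anticipate: different projects may log a field with the same name. For example, osprey records elapsed time in a cost field of type float, while another project xxx might use cost for an amount of money with a unit, of type string. In that case the second project would be unable to push its logs to ES. My suggestion is to wrap all project-specific fields as JSON inside msg, rather than adding separate fields alongside msg.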
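One possible way to do this wrapping is sketched below with the standard library only. The packIntoMsg helper and the key names text and details are hypothetical choices made for illustration; they are not a layout prescribed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// msgPayload is a hypothetical envelope for the human-readable text plus project-specific fields.
type msgPayload struct {
	Text    string                 `json:"text"`
	Details map[string]interface{} `json:"details"`
}

// packIntoMsg serializes text and details as JSON and stores the result as the msg string,
// so the only top-level field ES sees from this call is msg, and it is always a string.
func packIntoMsg(text string, details map[string]interface{}) (string, error) {
	inner, err := json.Marshal(msgPayload{Text: text, Details: details})
	if err != nil {
		return "", err
	}
	outer, err := json.Marshal(map[string]string{"msg": string(inner)})
	if err != nil {
		return "", err
	}
	return string(outer), nil
}

func main() {
	line, err := packIntoMsg("search finished", map[string]interface{}{"cost": 0.0739479})
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Because cost now lives inside the msg string, two projects can give it different types without colliding in the shared mapping. The trade-off is that cost is no longer a separately typed field that can be queried or aggregated directly.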

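For example, where production currently logs {"msg":"新闻搜索消耗时间", "cost":0.0739479} (the msg text means "time spent on news search"), this would instead be exported as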