Elasticsearch Data Modeling (Part 1)

Author: IT菜鸟学习 | Published: 2019-11-20 18:57

    Reposted from: https://www.jianshu.com/p/098236cf3a44
    https://blog.csdn.net/napoay/article/details/62233031

    Example 1: E-commerce promotion data structure

    {
      "id": 536600477,
      "name": "黑色外穿打底裤女春秋薄款铅笔裤2019新款高腰九分显瘦紧身小脚裤",
      "image": "http://img.alicdn.com/bao/uploaded/i4/1687728515/O1CN015vKRk22Clv2z9jVKM_!!0-item_pic.jpg",
      "item_url":  "http://item.taobao.com/item.htm?id=536600477798",
      "shop_name": "XXX旗舰店",
      "price": 35.00,
      "sales": 12866,
      "contact_info": "XXX旗舰店",
      "short_url": "https://s.click.taobao.com/6dhjX0w",
      "sales_url":  "https://s.click.taobao.com/t?e=m%3D2%26s%3DhqNnFErxaS0cQipKwQzePOeEDrYVVa64K7Vc7tFgwiG3bLqV5UHdqSJ215tW5ra7%2Fl0%2B1yuzCtL9CVjm9%2FaTIMEcIrQjme5phH%2FwEhdaGdpwfW9VvJkbiUOLibAxXu8J4DrzI0Q%2Bh5mWydDa%2BK5%2FZ44CXhN9RDLu87eUjW4Ylwlp3E7b2H5imSCyCj9paIOIxiXvDf8DaRs%3D",
      "sales_pass":  "¥q6vvYNlY15Y¥",
      "coupon_total_num": 50000,
      "coupon_remaining_num":  49981,
      "coupon_quota": "满35减10",
      "coupon_start_date": "2019-09-20",
      "coupon_end_date": "2019-09-25",
      "coupon_url": "https://uland.taobao.com/coupon/edetail?e=EpEKjA4ejsRt3vqbdXnGlgxMgopp14njlHycenxkSuDwJfMHI%2FfVmw2KFrzHTGtgHv69%2F64THFCtOwU1ltpiC5ZrJ2LltVbgH31ZeQAUzbQ%3D&af=1&pid=mm_226490165_153450382_44990650090",
      "coupon_pass": "¥b0NmYNlbC8t¥",
      "coupon_short_url": "https://s.click.taobao.com/XRkjX0w"
    }
    

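    "id" is an integer, so it is mapped as "long".
    "name" is a string that serves as a search condition and must be analyzed. Map it as "text", with the Chinese analyzer "ik_max_word" at index time and "ik_smart" at search time. Both analyzers come from the IK analysis plugin, which must be installed on the cluster.
    Notes:
    1. "type": "text" is analyzed (split into terms); "type": "keyword" is not.
    2. "ik_max_word" segments text at the finest granularity; "ik_smart" segments it at the coarsest granularity.
    At index time, "ik_max_word" is commonly used so that the fine-grained terms widen what the index can match.
    At search time, "ik_smart" is used so that the coarse-grained terms keep the matches precise.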

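    ik_max_word: splits the text at the finest granularity and exhausts the possible combinations. For example, "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌".
    ik_smart: splits the text at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌".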
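
The difference between the two granularities can be illustrated with a toy segmenter. This is only a sketch in plain Python, not IK's actual algorithm: it assumes a tiny dictionary that contains just the words from the example above, and the helper names `fine_grained` and `coarse_grained` are made up for illustration. The fine-grained version emits every dictionary word found in the text; the coarse-grained version uses forward maximum matching.

```python
# Toy dictionary: only the words from the example above (an assumption;
# the real IK dictionary is far larger).
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "人", "民", "共和国", "共和", "和", "国国", "国歌",
}

TEXT = "中华人民共和国国歌"


def fine_grained(text, dictionary):
    """Emit every dictionary word in the text, by start offset, longest first."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each offset."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
                start = end
                break
        else:
            # No dictionary word starts here: emit the single character.
            tokens.append(text[start])
            start += 1
    return tokens


if __name__ == "__main__":
    print(fine_grained(TEXT, DICTIONARY))
    print(coarse_grained(TEXT, DICTIONARY))
```

On a real cluster, the `_analyze` API with `"analyzer": "ik_max_word"` or `"analyzer": "ik_smart"` shows the tokens that the installed plugin actually produces.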
    

    The field mapping is as follows:

       "name":  {
            "type":  "text",
            "analyzer":  "ik_max_word",
            "search_analyzer": "ik_smart"
          },
    

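    "image" is a link that only needs to be displayed; it is never searched or aggregated. It therefore needs no index, which saves memory and disk space, and it can be set to "enabled": false. A mapping entry that contains only "enabled": false is treated as an object field: Elasticsearch keeps its value in "_source" but does not parse or index it. Other fields with the same requirement can be handled the same way.
    "shop_name" is the shop name and can be analyzed in the same way as "name".

    "coupon_pass" is the coupon promotion code. It does not need to be analyzed but must be indexed, so it is mapped as "keyword".
    The corresponding data model: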

    PUT item_index
    {
     "mappings":  {
       "dynamic": false,
       "properties":  {
         "id":  {
           "type":  "long"
         },
         "name":  {
           "type":  "text",
           "analyzer":  "ik_max_word",
           "search_analyzer": "ik_smart"
         },
         "image":  {
           "enabled": false
         },
         "item_url":  {
           "enabled": false
         },
         "shop_name":  {
           "type":  "text",
           "analyzer":  "ik_max_word",
           "search_analyzer": "ik_smart",
           "fields": {
               "keyword": {
                   "type":  "keyword"
                }
            }
         },
         "price":  {
           "type":  "double"
         },
         "sales":  {
           "type":  "integer"
         },
         "contact_info":  {
           "type":  "keyword"
         },
         "short_url":  {
           "enabled": false
         },
         "sales_url":  {
            "enabled": false
         },
         "sales_pass":  {
           "type":  "keyword"
         },
         "coupon_total_num":  {
           "type":  "integer"
         },
         "coupon_remaining_num":  {
           "type":  "integer"
         },
         "coupon_quota":  {
           "type":  "keyword"
         },
         "coupon_start_date":  {
           "type":  "date",
           "format":  "yyyy-MM-dd"
         },
         "coupon_end_date":  {
           "type":  "date",
           "format":  "yyyy-MM-dd"
         },
         "coupon_url":  {
           "enabled": false
         },
         "coupon_pass":  {
           "type":  "keyword"
         },
         "coupon_short_url":  {
           "enabled": false
         }
       }
     }
    }
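
To make the effect of these mapping choices concrete, here is a minimal sketch in plain Python (no Elasticsearch client involved). It assumes a trimmed copy of the mapping and a hypothetical extra field `category` that is not declared in it; `classify_fields` is an illustrative helper, not an Elasticsearch API. With "dynamic": false, a field missing from the mapping is still kept in `_source` but is not indexed, so it cannot be searched.

```python
# Trimmed copy of the item mapping (only three fields, for illustration).
ITEM_MAPPING = {
    "mappings": {
        "dynamic": False,
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart",
            },
            "image": {"enabled": False},
        },
    }
}


def classify_fields(document, mapping):
    """Group top-level document fields by how the mapping treats them."""
    properties = mapping["mappings"]["properties"]
    result = {"indexed": [], "display_only": [], "ignored": []}
    for field in document:
        spec = properties.get(field)
        if spec is None:
            # Not in the mapping: with "dynamic": false it is not indexed.
            result["ignored"].append(field)
        elif spec.get("enabled") is False:
            # Kept in _source only, never parsed or indexed.
            result["display_only"].append(field)
        else:
            result["indexed"].append(field)
    return result


if __name__ == "__main__":
    doc = {
        "id": 536600477,
        "name": "黑色外穿打底裤",
        "image": "http://img.alicdn.com/example.jpg",  # hypothetical URL
        "category": "clothing",  # hypothetical field missing from the mapping
    }
    print(classify_fields(doc, ITEM_MAPPING))
```

A check like this before indexing helps catch field names that silently fall outside the mapping.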
    
    

    Example 2: Server log data structure

    222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] "GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" 200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"
    

    After log formatting, the nginx log line is converted into the following data structure:

    {
        "ip": "222.67.85.228",
        "username": "-",
        "time": "2018-11-14 14:30:34",
        "request_action": "GET",
        "request_url": "/search?keyword=&hasCoupon=0&pageNum=1&pageSize=100",
        "http_version": "1.1",
        "response_status": 200,
        "bytes": 12268,
        "referrer": "-",
        "agent": "Apache-HttpClient/4.5.5 (Java/1.8.0_131)",
        "http_forward": "-"
    }
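
The formatting step is not shown in the article. One possible way to do it is sketched below with Python's standard library, assuming the log line follows the layout of the sample above. The regular expression and the `parse_log_line` helper are illustrative; the size field is named `bytes` to match the index mapping, and the time zone offset is dropped so that the value fits the "yyyy-MM-dd HH:mm:ss" format.

```python
import re
from datetime import datetime

# Pattern for the sample line above (an assumption about the nginx log_format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<username>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request_action>\S+) (?P<request_url>\S+) HTTP/(?P<http_version>[^"]+)" '
    r'(?P<response_status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" "(?P<http_forward>[^"]*)"'
)

SAMPLE_LINE = (
    '222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] '
    '"GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" '
    '200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"'
)


def parse_log_line(line):
    """Turn one access-log line into a dict shaped like the document above."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    doc = match.groupdict()
    # "14/Nov/2018:14:30:34 +0800" -> "2018-11-14 14:30:34" (offset dropped).
    parsed = datetime.strptime(doc["time"], "%d/%b/%Y:%H:%M:%S %z")
    doc["time"] = parsed.strftime("%Y-%m-%d %H:%M:%S")
    doc["response_status"] = int(doc["response_status"])
    doc["bytes"] = int(doc["bytes"])
    return doc


if __name__ == "__main__":
    print(parse_log_line(SAMPLE_LINE))
```

In practice this step is often done by a log shipper rather than by hand-written code; the sketch only shows the field extraction and the date conversion.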
    

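    Logs are usually queried along two dimensions: time and response status. For example, find all requests with response status 500 from 2019-01-01 until now. Almost none of the log fields need to be analyzed because they are mostly just displayed, so string fields are generally mapped as "keyword"; for date fields, take care to declare the format.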

    PUT nginx_log_index
    {
        "mappings": {
            "dynamic": false,
            "properties":  {
                "ip":  {
                    "type": "keyword"
                },
                "username":  {
                    "type": "keyword"
                },
                "time":  {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss"
                },
                "request_action":  {
                    "type": "keyword"
                },
                "request_url":  {
                    "enabled": false
                },
                "http_version":  {
                    "type": "keyword"
                },
                "response_status":  {
                    "type": "integer"
                },
                "bytes":  {
                    "type": "long"
                },
                "referrer":  {
                    "type": "keyword"
                },
                "agent":  {
                    "type": "keyword"
                },
                "http_forward":  {
                    "type": "keyword"
                }
            }
        }
    }
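
The query described earlier (status 500 from 2019-01-01 until now) could be expressed as a `bool` query with two filters. The sketch below only builds the request body as a Python dict; `build_status_query` is a hypothetical helper, and the body would be sent to `nginx_log_index/_search` by whatever client is in use. The date string follows the "yyyy-MM-dd HH:mm:ss" format declared for the "time" field.

```python
import json


def build_status_query(status, since):
    """Build a search body: exact status match plus a lower bound on time."""
    return {
        "query": {
            "bool": {
                # Filters do not affect scoring and can be cached.
                "filter": [
                    {"term": {"response_status": status}},
                    {"range": {"time": {"gte": since}}},
                ]
            }
        },
        # Newest log entries first.
        "sort": [{"time": {"order": "desc"}}],
    }


if __name__ == "__main__":
    body = build_status_query(500, "2019-01-01 00:00:00")
    print(json.dumps(body, indent=2))
```

Both conditions go into `filter` rather than `must` because log lookups need exact matching, not relevance scoring.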
    

    Example 3: Blog data structure

    {
        "id": "89546eff3cd0",
        "url": "https://www.jianshu.com/p/89546eff3cd0",
        "title": "简单剖析代理模式实现原理",
        "author": "梦想实现家_Z",
        "content": "代理模式在java中随处可见,其他编程语言也一样,它的作用就是用来解耦的。代理模式又分为静态代理和动态代理。......省略剩下的内容",
        "time": "2019.04.10 21:08:21",
        "word_num": 1056,
        "read_num": 161,
        "like_num": 1,
        "reward_num": 0
    }
    

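    Blog content is very large. To avoid carrying the whole content in every query result, it is recommended to store the fields separately and fetch only what needs to be displayed. So set "_source" to "enabled": false, and set "store": true on each field that needs to be returned.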

    PUT blog_index
    {
        "mappings": {
            "dynamic": false,
            "_source": {
                "enabled": false
            }, 
            "properties":  {
                "id": {
                    "type":  "keyword",
                    "store":  true
                },
                "url": {
                    "type":  "keyword",
                    "store":  true,
                    "ignore_above":  100,
                    "doc_values":  false,
                    "norms":  false
                },
                "title": {
                    "type":  "text",
                    "store":  true,
                    "analyzer":  "ik_max_word",
                    "search_analyzer": "ik_smart",
                    "fields": {
                        "keyword": {
                            "type":  "keyword"
                        }
                    }
                },
                "author": {
                    "type":  "keyword",
                    "store":  true
                },
                "content": {
                    "type":  "text",
                    "analyzer":  "ik_max_word",
                    "search_analyzer": "ik_smart",
                    "store":  true
                },
                "time": {
                    "type":  "date",
                    "format":  "yyyy.MM.dd HH:mm:ss",
                    "store":  true
                },
                "word_num": {
                    "type":  "integer",
                    "store":  true
                },
                "read_num": {
                    "type":  "integer",
                    "store":  true
                },
                "like_num": {
                    "type":  "integer",
                    "store":  true
                },
                "reward_num": {
                    "type":  "integer",
                    "store":  true
                }
            }
        }
    }
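
With "_source" disabled, a search has to ask for stored fields explicitly through the `stored_fields` parameter, and Elasticsearch returns each stored field as an array under `fields` in the hit. The sketch below builds such a request body and flattens a hit; the helper names and the sample hit are hypothetical and only illustrate the response shape.

```python
def build_blog_search(keyword, fields):
    """Search titles and fetch only the listed stored fields (no content)."""
    return {
        "stored_fields": fields,
        "query": {"match": {"title": keyword}},
    }


def flatten_hit(hit):
    """Unwrap single-element arrays returned for stored fields."""
    flat = {}
    for name, values in hit.get("fields", {}).items():
        flat[name] = values[0] if len(values) == 1 else values
    return flat


if __name__ == "__main__":
    body = build_blog_search("代理模式", ["title", "author", "read_num"])
    print(body)

    # Hypothetical hit, shaped like a stored-fields search response.
    sample_hit = {
        "_index": "blog_index",
        "_id": "89546eff3cd0",
        "fields": {
            "title": ["简单剖析代理模式实现原理"],
            "author": ["梦想实现家_Z"],
            "read_num": [161],
        },
    }
    print(flatten_hit(sample_hit))
```

Leaving "content" out of the `stored_fields` list is what keeps the large blog body out of list-style results.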
    

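    One more note: "_source" is enabled by default. When a field is very large, keeping its content in "_source" only makes the index bigger, and the effect grows with the number of documents: saving a few KB per document adds up to a considerable amount at hundreds of millions of documents. The blog content here is exactly such a case. Keep in mind that disabling "_source" also disables features that depend on it, such as the update and reindex APIs.
    For how to use "_source", see: https://blog.csdn.net/napoay/article/details/62233031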
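
As a rough back-of-the-envelope check of the size argument, the figures below are assumptions chosen for illustration (100 million documents, 4 KB saved per document), not measurements:

```python
def saved_gib(doc_count, kb_saved_per_doc):
    """Total space saved, in GiB, when each document shrinks by the given KB."""
    saved_bytes = doc_count * kb_saved_per_doc * 1024
    return saved_bytes / 1024 ** 3


if __name__ == "__main__":
    # Assumed figures: 100 million documents, 4 KB saved per document.
    print(round(saved_gib(100_000_000, 4), 1))  # about 381.5 GiB
```

The estimate covers primary data only; replica shards would multiply it.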
