Monitoring Lecture Series (12): Common System Monitoring Metrics, Storage

Author: 炼狱腾蛇Eric | Published: 2020-08-17 22:21

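    4. Disk/storage monitoring metrics

    Generally, when we monitor storage devices we are mostly monitoring file systems, that is, the part the operating system can use directly. In real production environments, however, there are other monitoring needs: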

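    • Disks that have not been formatted with a file system (XFS, FAT32, EXT4). These do not show up in ordinary df output. Examples are Oracle ASM disks, block storage, and LUNs mapped from a storage array (FC, iSCSI). We need special means to put them to use and to monitor them.
    • Devices that provide storage, such as HP 3PAR, IBM DS storage, and EMC arrays, or servers that provide storage services, such as NFS servers and Ceph/Swift clusters. For vendor products it is best to ask the vendor's engineers about monitoring metrics: whether a plugin lets monitoring software (zabbix, grafana) collect them directly, or whether there is an API that external programs can use. Even if neither exists, SNMP is usually available as a simple way to monitor the device, but the number of metrics it exposes is limited, so treat it as a fallback solution. For solutions built on open-source software, what customers or managers most want to hear are hard numbers such as random read/write speed and sequential read/write speed, because these are among the key measures of the system.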

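    We will cover this in detail later when we discuss distributed storage and Ceph. Here we only compare the metrics that the built-in templates of a few tools can monitor.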

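    4.1. Viewing disk metrics on the system

    Again, there are two ways:

    • Commands: top, iostat, vmstat, and sar show instantaneous rates, and df-style commands show usage. dd combined with time can test speed by timing a read or a write. There are also third-party tools such as fio, hdparm, and smartctl.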

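    • Files: a disk is generally exposed as a file too, and each disk has its own metrics. On Linux we can find them under /sys/block/<device>, for example /sys/block/sda for a disk named sda. The listing below was taken on a machine whose disk is mmcblk0, which is why the partitions mmcblk0p1 and mmcblk0p2 appear in it. Not every Linux/Unix system works this way: macOS, for example, has no /sys/block directory.

      # ls /sys/block/mmcblk0
      alignment_offset  discard_alignment  inflight   queue      slaves
      bdi               ext_range          mmcblk0p1  range      stat
      capability        force_ro           mmcblk0p2  removable  subsystem
      dev               hidden             mq         ro         trace
      device            holders            power      size       uevent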
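The stat entry in that directory holds the raw I/O counters for the device. As a minimal sketch, assuming the field order described in the kernel's block-layer stat documentation (older kernels print only the first 11 fields), it can be read like this. The function names and the sample line are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Field order assumed from the kernel's Documentation/block/stat.rst.
# Kernels older than 4.18 stop after time_in_queue_ms.
STAT_FIELDS = (
    "read_ios", "read_merges", "read_sectors", "read_ticks_ms",
    "write_ios", "write_merges", "write_sectors", "write_ticks_ms",
    "in_flight", "io_ticks_ms", "time_in_queue_ms",
    "discard_ios", "discard_merges", "discard_sectors", "discard_ticks_ms",
    "flush_ios", "flush_ticks_ms",
)


def parse_block_stat(text):
    """Map the whitespace-separated counters of a stat file to field names."""
    values = [int(v) for v in text.split()]
    # zip() stops at the shorter input, so short (older-kernel) lines work too.
    return dict(zip(STAT_FIELDS, values))


def read_block_stat(device):
    """Read and parse /sys/block/<device>/stat (Linux only)."""
    return parse_block_stat(Path("/sys/block", device, "stat").read_text())


if __name__ == "__main__":
    # Illustrative 11-field sample line, not captured from a real machine.
    sample = "4883 6505 455012 11972 1456 2529 136386 26967 0 11476 16476"
    stat = parse_block_stat(sample)
    print(stat["read_ios"], stat["read_sectors"] * 512, stat["write_merges"])
```

On a Linux host, read_block_stat("sda") would return the same dictionary for a live device. The sector counters are in 512-byte units, so multiplying by 512 gives bytes.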
      

    The metrics visible on the system itself are in fact the most complete, whereas the vmstat command we use so often provides very few:

       Swap
           si: Amount of memory swapped in from disk (/s).
           so: Amount of memory swapped to disk (/s).
    
       IO
           bi: Blocks received from a block device (blocks/s).
           bo: Blocks sent to a block device (blocks/s).
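These four columns are per-second rates computed from cumulative kernel counters. The sketch below assumes, as far as we know, that on Linux the counters are pswpin/pswpout (swap, in pages) and pgpgin/pgpgout (block I/O, in KiB) from /proc/vmstat. That mapping, the function names, and the sample numbers are all assumptions made for illustration.

```python
def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def vmstat_rates(before, after, interval_s):
    """Per-second change of the swap and block I/O counters between two samples."""
    names = ("pswpin", "pswpout", "pgpgin", "pgpgout")
    return {n: (after[n] - before[n]) / interval_s for n in names}


if __name__ == "__main__":
    # Two hypothetical snapshots taken 2 seconds apart.
    t0 = parse_vmstat("pgpgin 1000\npgpgout 5000\npswpin 0\npswpout 10\n")
    t1 = parse_vmstat("pgpgin 1600\npgpgout 5200\npswpin 0\npswpout 30\n")
    for name, rate in vmstat_rates(t0, t1, 2).items():
        print(f"{name}: {rate:.1f}/s")
```

On a real host the two snapshots would come from reading /proc/vmstat twice with a sleep in between. Because the counters are cumulative, a single reading only gives totals since boot.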
    

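    That is, only swap-in/swap-out and block-device reads and writes. To look at disk I/O we usually use iostat (iostat -d prints only the device report, without the CPU section):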

    $ iostat
    Linux 2.6.32-431.11.15.el6.ucloud.x86_64 (ssdk1)     10/14/2016     _x86_64_    (4 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.44    0.00    0.26    0.01    0.01   99.29
    
    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    vda               0.66         0.09         6.75    1404732  105885456
    vdb               1.42        12.47        55.86  195619082  876552296
    

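    This shows the rates for every disk. When iostat is run without an interval, as here, the figures are averages since boot.

    tps: number of transfers (I/O requests) issued to the device per second
    Blk_read/s: amount of data read from the device per second, expressed in blocks (512-byte sectors on this kernel)
    Blk_wrtn/s: amount of data written to the device per second, expressed in blocks
    Blk_read: total number of blocks read
    Blk_wrtn: total number of blocks written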
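To make these definitions concrete, here is a minimal sketch that derives the same three rates from two samples of /proc/diskstats. It assumes the usual layout of that file (major, minor, device name, then the counters) and treats tps as completed reads plus completed writes. Newer iostat versions may count other request types as well. The function names and sample lines are hypothetical.

```python
def parse_diskstats(text):
    """Return {device: (reads, sectors_read, writes, sectors_written)}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip blank or unexpected lines
        # f[0], f[1] = major, minor; f[2] = device name; counters start at f[3]
        stats[f[2]] = (int(f[3]), int(f[5]), int(f[7]), int(f[9]))
    return stats


def iostat_like(before, after, interval_s):
    """Compute tps, Blk_read/s and Blk_wrtn/s per device between two samples."""
    report = {}
    for dev, (r1, rs1, w1, ws1) in after.items():
        if dev not in before:
            continue  # device appeared between samples, no delta available
        r0, rs0, w0, ws0 = before[dev]
        report[dev] = {
            "tps": ((r1 - r0) + (w1 - w0)) / interval_s,
            "blk_read_s": (rs1 - rs0) / interval_s,
            "blk_wrtn_s": (ws1 - ws0) / interval_s,
        }
    return report


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart.
    t0 = parse_diskstats("253 0 vda 100 0 800 50 200 0 1600 70 0 90 120")
    t1 = parse_diskstats("253 0 vda 110 0 960 55 230 0 2000 80 0 100 140")
    print(iostat_like(t0, t1, 10))
```

Passing the totals since boot and the uptime instead of a 10-second delta gives the since-boot averages that a plain iostat call prints.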
    

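    Then there is the df command, which shows disk usage. This is a very important metric: if a disk fills up, then, much as when CPU is exhausted, some running programs may terminate unexpectedly because they can no longer write data.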

    $ df -H
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/root       126G  2.0G  119G   2% /
    devtmpfs        1.9G     0  1.9G   0% /dev
    tmpfs           2.0G     0  2.0G   0% /dev/shm
    tmpfs           2.0G  8.8M  2.0G   1% /run
    tmpfs           5.3M  4.1k  5.3M   1% /run/lock
    tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
    /dev/mmcblk0p1  265M   55M  210M  21% /boot
    tmpfs           400M     0  400M   0% /run/user/1000
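A monitoring script can compute the same Use% value without parsing df output. The sketch below uses os.statvfs and assumes that df rounds the percentage up and divides by used plus available space rather than by total size, which is our understanding of GNU df and matches the rows above. It also reports inode usage, which fills up independently of space. The function names are illustrative.

```python
import math
import os


def use_percent(used, avail):
    """Use% in the style of df: used / (used + avail), rounded up."""
    if used + avail == 0:
        return 0
    return math.ceil(used * 100 / (used + avail))


def filesystem_usage(path):
    """Space and inode usage of the filesystem that contains path (Unix only)."""
    st = os.statvfs(path)
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    avail = st.f_bavail * st.f_frsize  # space available to unprivileged users
    inodes_used = st.f_files - st.f_ffree
    return {
        "space_use_percent": use_percent(used, avail),
        "inode_use_percent": use_percent(inodes_used, st.f_favail),
    }


if __name__ == "__main__":
    # /boot row from the df output: 55M used, 210M available.
    print(use_percent(55, 210))
    print(filesystem_usage("/"))
```

An alert would typically fire when either percentage crosses a threshold. A file system can run out of inodes while still showing free space.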
    

    4.2. Storage monitoring metrics in zabbix

    These are much the same as the metrics we see on the system itself.

    [Image: image-20200724235135473.png]

    4.3. Storage monitoring metrics in grafana

    It adds inode monitoring; everything else is basically the same.

    [Image: image-20200725000108563.png]

    4.4. Storage monitoring metrics in node_exporter

    There appear to be many more metrics here:

    # HELP node_disk_discard_time_seconds_total This is the total number of seconds spent by all discards.
    # TYPE node_disk_discard_time_seconds_total counter
    node_disk_discard_time_seconds_total{device="mmcblk0"} 0
    node_disk_discard_time_seconds_total{device="mmcblk0p1"} 0
    node_disk_discard_time_seconds_total{device="mmcblk0p2"} 0
    # HELP node_disk_discarded_sectors_total The total number of sectors discarded successfully.
    # TYPE node_disk_discarded_sectors_total counter
    node_disk_discarded_sectors_total{device="mmcblk0"} 0
    node_disk_discarded_sectors_total{device="mmcblk0p1"} 0
    node_disk_discarded_sectors_total{device="mmcblk0p2"} 0
    # HELP node_disk_discards_completed_total The total number of discards completed successfully.
    # TYPE node_disk_discards_completed_total counter
    node_disk_discards_completed_total{device="mmcblk0"} 0
    node_disk_discards_completed_total{device="mmcblk0p1"} 0
    node_disk_discards_completed_total{device="mmcblk0p2"} 0
    # HELP node_disk_discards_merged_total The total number of discards merged.
    # TYPE node_disk_discards_merged_total counter
    node_disk_discards_merged_total{device="mmcblk0"} 0
    node_disk_discards_merged_total{device="mmcblk0p1"} 0
    node_disk_discards_merged_total{device="mmcblk0p2"} 0
    # HELP node_disk_io_now The number of I/Os currently in progress.
    # TYPE node_disk_io_now gauge
    node_disk_io_now{device="mmcblk0"} 0
    node_disk_io_now{device="mmcblk0p1"} 0
    node_disk_io_now{device="mmcblk0p2"} 0
    # HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
    # TYPE node_disk_io_time_seconds_total counter
    node_disk_io_time_seconds_total{device="mmcblk0"} 11.476
    node_disk_io_time_seconds_total{device="mmcblk0p1"} 0.44
    node_disk_io_time_seconds_total{device="mmcblk0p2"} 11.064
    # HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
    # TYPE node_disk_io_time_weighted_seconds_total counter
    node_disk_io_time_weighted_seconds_total{device="mmcblk0"} 16.476
    node_disk_io_time_weighted_seconds_total{device="mmcblk0p1"} 0.668
    node_disk_io_time_weighted_seconds_total{device="mmcblk0p2"} 15.792
    # HELP node_disk_read_bytes_total The total number of bytes read successfully.
    # TYPE node_disk_read_bytes_total counter
    node_disk_read_bytes_total{device="mmcblk0"} 2.32966144e+08
    node_disk_read_bytes_total{device="mmcblk0p1"} 1.153536e+07
    node_disk_read_bytes_total{device="mmcblk0p2"} 2.20890112e+08
    # HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
    # TYPE node_disk_read_time_seconds_total counter
    node_disk_read_time_seconds_total{device="mmcblk0"} 11.972
    node_disk_read_time_seconds_total{device="mmcblk0p1"} 0.704
    node_disk_read_time_seconds_total{device="mmcblk0p2"} 11.232000000000001
    # HELP node_disk_reads_completed_total The total number of reads completed successfully.
    # TYPE node_disk_reads_completed_total counter
    node_disk_reads_completed_total{device="mmcblk0"} 4883
    node_disk_reads_completed_total{device="mmcblk0p1"} 416
    node_disk_reads_completed_total{device="mmcblk0p2"} 4447
    # HELP node_disk_reads_merged_total The total number of reads merged.
    # TYPE node_disk_reads_merged_total counter
    node_disk_reads_merged_total{device="mmcblk0"} 6505
    node_disk_reads_merged_total{device="mmcblk0p1"} 3795
    node_disk_reads_merged_total{device="mmcblk0p2"} 2710
    # HELP node_disk_write_time_seconds_total This is the total number of seconds spent by all writes.
    # TYPE node_disk_write_time_seconds_total counter
    node_disk_write_time_seconds_total{device="mmcblk0"} 26.967000000000002
    node_disk_write_time_seconds_total{device="mmcblk0p1"} 0.008
    node_disk_write_time_seconds_total{device="mmcblk0p2"} 26.958000000000002
    # HELP node_disk_writes_completed_total The total number of writes completed successfully.
    # TYPE node_disk_writes_completed_total counter
    node_disk_writes_completed_total{device="mmcblk0"} 1456
    node_disk_writes_completed_total{device="mmcblk0p1"} 3
    node_disk_writes_completed_total{device="mmcblk0p2"} 1453
    # HELP node_disk_writes_merged_total The number of writes merged.
    # TYPE node_disk_writes_merged_total counter
    node_disk_writes_merged_total{device="mmcblk0"} 2529
    node_disk_writes_merged_total{device="mmcblk0p1"} 0
    node_disk_writes_merged_total{device="mmcblk0p2"} 2529
    # HELP node_disk_written_bytes_total The total number of bytes written successfully.
    # TYPE node_disk_written_bytes_total counter
    node_disk_written_bytes_total{device="mmcblk0"} 6.9829632e+07
    node_disk_written_bytes_total{device="mmcblk0p1"} 5120
    node_disk_written_bytes_total{device="mmcblk0p2"} 6.9824512e+07
    node_scrape_collector_duration_seconds{collector="diskstats"} 0.001754445
    node_scrape_collector_success{collector="diskstats"} 1
    
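    • merged: counts I/O requests that the kernel merged with an adjacent request before sending them to the device. It is not a total across all disks.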

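    • discard: counts discard (TRIM/UNMAP) operations, which tell the device that blocks are no longer in use. It is not a packet-loss or error rate, so a high value does not by itself indicate failing media.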
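Since almost all of these series are counters, they become useful only after we combine or differentiate them. The sketch below parses a few of the lines shown above and derives the average time per read and the share of read requests that were merged. The parser is deliberately simple and is an assumption-laden illustration (one sample per line, no timestamps, only the device label is read), not a full exposition-format parser. The function names are hypothetical.

```python
def parse_samples(text):
    """Parse exposition-format lines into {(metric, device): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        series, value = line.rsplit(" ", 1)
        name, _, labels = series.partition("{")
        device = ""
        if 'device="' in labels:
            device = labels.split('device="', 1)[1].split('"', 1)[0]
        samples[(name, device)] = float(value)
    return samples


def avg_read_latency_ms(samples, device):
    """Mean time per completed read since boot, in milliseconds."""
    reads = samples[("node_disk_reads_completed_total", device)]
    seconds = samples[("node_disk_read_time_seconds_total", device)]
    return seconds / reads * 1000 if reads else 0.0


def read_merge_ratio(samples, device):
    """Fraction of submitted read requests that were merged into another one."""
    done = samples[("node_disk_reads_completed_total", device)]
    merged = samples[("node_disk_reads_merged_total", device)]
    total = done + merged
    return merged / total if total else 0.0


if __name__ == "__main__":
    # Values copied from the node_exporter output for mmcblk0.
    text = """
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="mmcblk0"} 11.972
node_disk_reads_completed_total{device="mmcblk0"} 4883
node_disk_reads_merged_total{device="mmcblk0"} 6505
"""
    s = parse_samples(text)
    print(f"{avg_read_latency_ms(s, 'mmcblk0'):.2f} ms per read")
    print(f"{read_merge_ratio(s, 'mmcblk0'):.0%} of reads merged")
```

In a Prometheus setup the same latency is usually obtained over a time window instead of since boot, for example by dividing rate(node_disk_read_time_seconds_total[5m]) by rate(node_disk_reads_completed_total[5m]).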