美文网首页测试转转
性能监控之Telegraf+InfluxDB+Grafana+P

性能监控之Telegraf+InfluxDB+Grafana+P

作者: 沉浮在心 | 来源:发表于2019-10-14 21:32 被阅读0次

    前言

    监控集成选型的 Telegraf 探针,最近需要实现对 Oracle 数据库的做实时监控,查了下 Telegraf 竟然还不支持 Oracle 监控,WTF?于是自己研究了下,通过 Python + SQL 脚本折中解决了,此文去且当作小结。

    实现的效果

    预备知识

    Oracle动态性能视图

    动态性能视图属于数据字典,它们的所有者为SYS,并且多数动态性能视图只能由特权用户和DBA用户查询。当数据库处于不同状态时,可以访问的动态性能视图有所不同。启动例程时,ORACLE会自动建立动态性能视图;停止例程时,ORACLE会自动删除动态性能视图。数据字典信息是从数据文件中获得,而动态性能视图信息是从SGA和控制文件取得。所以,两者所反映的信息还是有很大差异的。数据库管理员利用这些动态性能视图,可以了解数据库运行的一些基本信息,为我们进行数据库维护以及数据库性能优化提供一些数据上的支持。所有动态性能视图都是以V_$开始的,Oracle 为每个动态性能视图提供了相应的同义词(V$开头)

    通过查询V$FIXED_TABLE,可以列出所有可用的动态性能视图和动态性能表。

    SQL> select  * from V$FIXED_TABLE  where name like 'V$%';

    NAME                            OBJECT_ID TYPE  TABLE_NUM

    ------------------------------ ---------- ----- ----------

    V$WAITSTAT                    4294950915 VIEW      65537

    V$BH                          4294951406 VIEW      65537

    V$GC_ELEMENT                  4294951794 VIEW      65537

    V$CR_BLOCK_SERVER              4294951796 VIEW      65537

    V$CURRENT_BLOCK_SERVER        4294952095 VIEW      65537

    V$POLICY_HISTORY              4294953128 VIEW      65537

    V$ENCRYPTED_TABLESPACES        4294952996 VIEW      65537

    V$GC_ELEMENTS_WITH_COLLISIONS  4294951798 VIEW      65537

    V$FILE_CACHE_TRANSFER          4294951800 VIEW      65537

    V$TEMP_CACHE_TRANSFER          4294951802 VIEW      65537

    V$CLASS_CACHE_TRANSFER        4294951804 VIEW      65537

    V$INSTANCE_CACHE_TRANSFER      4294952151 VIEW      65537

    V$LOCK_ELEMENT                4294951408 VIEW      65537

    V$BSP                          4294951594 VIEW      65537

    V$LOCKS_WITH_COLLISIONS        4294951410 VIEW      65537

    V$FILE_PING                    4294951412 VIEW      65537

    V$TEMP_PING                    4294951532 VIEW      65537

    V$CLASS_PING                  4294951414 VIEW      65537

    V$LOCK_ACTIVITY                4294951437 VIEW      65537

    V$ROWCACHE                    4294950916 VIEW      65537

    以下是不同类型的指标视图的快速表格比较:

    该表的第一行是经典的等待事件和统计视图。以下几行是度量标准视图。度量标准视图是在Oracle10g中引入的。

    度量视图计算增量和速率,这极大地简化了解决简单问题的能力,比如 “现在我的数据库的I/O速率是多少?” 这个问题,在10g之前,处理起来出奇的乏味。要回答这个问题,你必须查询v$sysstat,例如:

    Select value from v$sysstat where name='physical reads';

    但是仅查询一次v$sysstat不能解决问题,而是“自数据库启动以来已完成了多少I / O?”的问题。要回答原始问题,必须两次查询v$sysstat并接受两个值之间的增量:

    在时间A取值

    在时间B取值

    Delta = (B-A)

    and/or get Rate = (B-A)/elapsed time

    获得这些差值和速率可能是一项艰巨的工作。然后10gOracle引入了度量标准表,这些度量表可以在一个查询中解决问题。

    等待事件视图为(系统级别)

    V$SYSTEM_EVENT– 自启动以来累积的等待事件

    V$EVENTMETRIC- 等待事件增量持续60秒

    DBA_HIST_SYSTEM_EVENT– 自启动以来累计的上周按快照(小时)的等待事件

    等待事件汇总到称为等待类的组中。对于等待类,有以下视图:

    V$SYSTEM_WAIT_CLASS– 自启动以来累积

    V$WAITCLASSMETRIC– 持续60秒增量

    V$WAITCLASSMETRIC_HISTORY– 最后一小时的60秒增量

    注意:DBA_HIST_WAITCLASSMETRIC_HISTORY用于警报或基准,而不是日常值。

    其他的就不一一展开了,具体可以参考下文:

    http://datavirtualizer.com/wait-event-and-wait-class-metrics-vs-vsystem_event/

    cx_Oracle

    cx_Oracle 是一个 Python 扩展模块,可以访问 Oracle 数据库。它符合 Python 数据库API 2.0 规范。

    基本要求

    要在 Python 和 Oracle 数据库中使用cx_Oracle7,需要满足以下条件:

    Python 2.7或 3.5 及更高版本。

    Oracle 客户端库。

    Oracle 数据库。Oracle的标准客户端 - 服务器版本互操作性允许cx_Oracle连接到较旧和较新的数据库。(推荐)

    快速安装

    在 Linux 上安装 cx_Oracle 的一般方法是使用 Python 的 Pip 包从 PyPI 安装cx_Oracle:

    从 PyPI 安装 cx_Oracle:

    python -m pip install cx_Oracle --upgrade

    将 Oracle 客户端库添加到操作系统库搜索路径,例如 Linux 的 LDLIBRARYPATH

    如果你的数据库位于远程计算机上,请下 适用于你的操作系统体系结构的免费Oracle Instant Client “Basic” 或 “Basic Light” 包

    至于具体的OracleClient安装,可以参考下文:

    https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html#installing-cx-oracle-on-linux

    解决方案

    Python:收集 Oracle 指标数据

    Telegraf:收集 Python 打印的性能指标数据

    InfluxDB:存储时间序列 Oracle 性能指标数据

    Grafana:可视化 Dashboard

    安装

    具体的安装可以参考官方文档:

    Telegraf:https://docs.influxdata.com/telegraf/v1.12/introduction/installation/

    InfluxDB:https://docs.influxdata.com/influxdb/v1.7/introduction/installation/

    Grafana:https://grafana.com/docs/installation/rpm/

    具体设置

    在 InfluxDB 中创建一个 Telegraf 数据库:

    [root@zuozewei ~]# influx

    Connected to http://localhost:8086 version 1.6.2

    InfluxDB shell version: 1.6.2

    > create user "telegraf" with password 'telegraf'

    > create database telegraf

    > show databases

    name: databases

    name

    ----

    _internal

    telegraf

    编写 python+sql 脚本以收集 oracle 指标。脚本的输出内容很重要,必须是 InfluxDB line-protocol。该脚本查询v$ SYSMETRIC和v$eventmetric,获得最后一分钟时,等待类和等待事件指标。

    python代码是:

    import socket,argparse,subprocess,re,cx_Oracle

    fqdn = socket.getfqdn()

    class OraStats():

        def __init__(self, user, passwd, sid):

            self.user = user

            self.passwd = passwd

            self.sid = sid

            self.delengine = "none"

            connstr=self.user+'/'+self.passwd+'@'+self.sid

            self.connection = cx_Oracle.connect(connstr)

            cursor = self.connection.cursor()

            cursor.execute("select distinct(SVRNAME)  from v$dnfs_servers")

            rows = cursor.fetchall()

            for i in range(0, cursor.rowcount):

                self.dengine_ip = rows[i][0]

                proc = subprocess.Popen(["nslookup", self.dengine_ip], stdout=subprocess.PIPE)

                lookupresult = proc.communicate()[0].split('\n')

                for line in lookupresult:

                    if 'name=' in re.sub(r'\s', '', line):

                        self.delengine = re.sub('\..*$', '', re.sub(r'^.*name=', '', re.sub(r'\s', '', re.sub(r'.$', '', line))))

        # 等待类别

        def waitclassstats(self, user, passwd, sid, format):

            cursor = self.connection.cursor()

            cursor.execute("""

            select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS

            from  v$waitclassmetric  m, v$system_wait_class n

            where m.wait_class_id=n.wait_class_id and n.wait_class != 'Idle'

            union

            select  'CPU', round(value/100,3) AAS

            from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2

            union select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu

            from

            ( select value busy

            from v$sysmetric

            where metric_name='Host CPU Utilization (%)'

            and group_id=2 ) prcnt,

            ( select value cpu_count from v$parameter where name='cpu_count' )  parameter,

            ( select  'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas

            """)

            for wait in cursor:

                wait_name = wait[0]

                wait_value = wait[1]

                print ("oracle_wait_class,fqdn={0},delphix={1},db={2},wait_class={3} wait_value={4}".format(fqdn, self.delengine, sid, re.sub(' ', '_', wait_name), wait_value))

        # 系统指标

        def sysmetrics(self, user, passwd, sid, format):

            cursor = self.connection.cursor()

            cursor.execute("""

            select METRIC_NAME,VALUE,METRIC_UNIT from v$sysmetric where group_id=2

            """)

            for metric in cursor:

                metric_name = metric[0]

                metric_value = metric[1]

                print ("oracle_sysmetric,fqdn={0},delphix={1},db={2},metric_name={3} metric_value={4}".format(fqdn,self.delengine,sid,re.sub(' ', '_', metric_name),metric_value))

        # 在闪回恢复区中有关磁盘配额和当前磁盘使用情况

        def fraused(self, user, passwd, sid, format):

            cursor = self.connection.cursor()

            cursor.execute("""

            select round((SPACE_USED-SPACE_RECLAIMABLE)*100/SPACE_LIMIT,1) from  V$RECOVERY_FILE_DEST

            """)

            for frau in cursor:

                fra_used = frau[0]

                print ("oracle_fra_pctused,fqdn={0},delphix={1},db={2} fra_pctused={3}".format(fqdn,self.delengine,sid,fra_used))

        # 磁盘使用状态

        def fsused(self):

        fss = ['/oracle', '/data']

        for fs in fss:

                df = subprocess.Popen(["df","-P",fs], stdout=subprocess.PIPE)

                output = df.communicate()[0]

                total = re.sub('%','',output.split("\n")[1].split()[1])

                used = re.sub('%','',output.split("\n")[1].split()[2])

                pctused = re.sub('%','',output.split("\n")[1].split()[4])

                print("oracle_fs_pctused,fqdn={0},fs_name={1} oraclefs_pctused={2},oraclefs_alloc={3},oraclefs_used={4}".format(fqdn,fs,pctused,total,used))

        # 等待状态

        def waitstats(self, user, passwd, sid, format):

            cursor = self.connection.cursor()

            cursor.execute("""

            select /*+ ordered use_hash(n) */

            n.wait_class wait_class,

            n.name wait_name,

            m.wait_count  cnt,

            nvl(round(10*m.time_waited/nullif(m.wait_count,0),3) ,0) avg_ms

            from v$eventmetric m,

            v$event_name n

            where m.event_id=n.event_id

            and n.wait_class <> 'Idle' and m.wait_count > 0 order by 1""")

            for wait in cursor:

                wait_class = wait[0]

                wait_name = wait[1]

                wait_cnt = wait[2]

                wait_avgms = wait[3]

                print ("oracle_wait_event,fqdn={0},delphix={1},db={2},wait_class={3},wait_event={4} count={5},latency={6}".format(fqdn, self.delengine,sid,re.sub(' ', '_', wait_class), re.sub(' ','_',wait_name),wait_cnt,wait_avgms))

        # 表空间使用状态

        def tbsstats(self, user, passwd, sid, format):

            cursor = self.connection.cursor()

            cursor.execute("""

            select /*+ ordered */ tablespace_name,

                round(used_space),

                round(max_size-used_space) free_space,

                round(max_size),

                round(used_space*100/max_size,2) percent_used

                from (

                    select m.tablespace_name,

                    m.used_space*t.block_size/1024/1024 used_space,

                    (case when t.bigfile='YES' then power(2,32)*t.block_size/1024/1024

                            else tablespace_size*t.block_size/1024/1024 end) max_size

                from dba_tablespace_usage_metrics m, dba_tablespaces t

            where m.tablespace_name=t.tablespace_name)

            """)

            for tbs in cursor:

                tbs_name = tbs[0]

                used_space_mb = tbs[1]

                free_space_mb = tbs[2]

                max_size_mb = tbs[3]

                percent_used = tbs[4]

                print ("oracle_tablespaces,fqdn={0},delphix={1},db={2},tbs_name={3} used_space_mb={4},free_space_mb={5},percent_used={6},max_size_mb={7}".format(fqdn, self.delengine, sid, re.sub(' ', '_', tbs_name), used_space_mb,free_space_mb,percent_used,max_size_mb))

    if __name__ == "__main__":

        parser = argparse.ArgumentParser()

        parser.add_argument('-f', '--format', help="Output format, default influx", choices=['kafka', 'influx'], default='influx')

        subparsers = parser.add_subparsers(dest='stat')

        parser_all = subparsers.add_parser('ALL', help="Get all database stats")

        parser_all.add_argument('-u', '--user', help="Username with sys views grant", required=True)

        parser_all.add_argument('-p', '--passwd', required=True)

        parser_all.add_argument('-s', '--sid', help="tnsnames SID to connect", required=True)

        args = parser.parse_args()

        if args.stat == "ALL":

            stats = OraStats(args.user, args.passwd, args.sid)

            stats.waitclassstats(args.user, args.passwd, args.sid, args.format)

            stats.waitstats(args.user, args.passwd, args.sid, args.format)

            stats.sysmetrics(args.user, args.passwd, args.sid, args.format)

            stats.tbsstats(args.user, args.passwd, args.sid, args.format)

            stats.fraused(args.user, args.passwd, args.sid, args.format)

            stats.fsused()

    输出格式化为 InfluxDB line-protocol

    [root@localhost tools]# ./oracle.sh

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Application wait_value=0

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=CPU wait_value=0.003

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=CPU_OS wait_value=0.778

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Commit wait_value=0

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Concurrency wait_value=0.001

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Configuration wait_value=0

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Network wait_value=0

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Other wait_value=0

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O wait_value=0.001

    oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=User_I/O wait_value=0

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Commit,wait_event=log_file_sync count=2,latency=0.122

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Concurrency,wait_event=os_thread_startup count=2,latency=21.595

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Network,wait_event=SQL*Net_message_to_client count=17,latency=0.001

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Other,wait_event=asynch_descriptor_resize count=4,latency=0.001

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=db_file_parallel_write count=2,latency=0.081

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=control_file_parallel_write count=24,latency=0.268

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=control_file_sequential_read count=71,latency=0.716

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=log_file_parallel_write count=7,latency=0.076

    oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=User_I/O,wait_event=Disk_file_operations_I/O count=16,laten

    定义一个 shell 脚本执行 Python 脚本

    #!/usr/bin/env bash

    python /home/oracle/scripts/oracle_metrics.sh -f "influx" "ALL" "-u" "system" "-p" "xxxx" "-s" "172.16.106.251:1521/orcl"

    在 oracle主机上,配置 telegraf 以60秒的间隔执行 python sh,然后将输出发送到 InfluxDB。编辑/etc/telegraf/telegraf.conf配置文件:

    # Telegraf configuration

    # Telegraf is entirely plugin driven. All metrics are gathered from the

    # declared inputs, and sent to the declared outputs.

    # Plugins must be declared in here to be active.

    # To deactivate a plugin, comment out the name and any variables.

    # Use 'telegraf -config telegraf.conf -test' to see what metrics a config

    # file would generate.

    # Global tags can be specified here in key="value" format.

    [global_tags]

      # dc = "us-east-1" # will tag all metrics with dc=us-east-1

      # rack = "1a"

      host="Dprima"

      collector="telegraf"

    # Configuration for telegraf agent

    [agent]

      ## Default data collection interval for all inputs

      interval = "10s"

      ## Rounds collection interval to 'interval'

      ## ie, if interval="10s" then always collect on :00, :10, :20, etc.

      round_interval = true

      ## Telegraf will cache metric_buffer_limit metrics for each output, and will

      ## flush this buffer on a successful write.

      metric_buffer_limit = 10000

      ## Flush the buffer whenever full, regardless of flush_interval.

      flush_buffer_when_full = true

      ## Collection jitter is used to jitter the collection by a random amount.

      ## Each plugin will sleep for a random time within jitter before collecting.

      ## This can be used to avoid many plugins querying things like sysfs at the

      ## same time, which can have a measurable effect on the system.

      collection_jitter = "0s"

      ## Default flushing interval for all outputs. You shouldn't set this below

      ## interval. Maximum flush_interval will be flush_interval + flush_jitter

      flush_interval = "60s"

      ## Jitter the flush interval by a random amount. This is primarily to avoid

      ## large write spikes for users running a large number of telegraf instances.

      ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s

      flush_jitter = "0s"

      ## Run telegraf in debug mode

      debug = false

      ## Run telegraf in quiet mode

      quiet = false

      ## Override default hostname, if empty use os.Hostname()

      hostname = "Dprima"

    ###############################################################################

    #                                  OUTPUTS                                    #

    ###############################################################################

    # Configuration for influxdb server to send metrics to

    [[outputs.influxdb]]

      urls = ["http://influxgraf:8086"] # required

      database = "telegraf" # required

      precision = "s"

      timeout = "5s"

    [[outputs.influxdb]]

      urls = ["http://localhost:9092"] # required

      database = "kapacitor" # required

      precision = "s"

      retention_policy = "default"

      timeout = "5s"

    #[[outputs.file]]

    #  files=["/home/oracle/scripts/telegraf_debug.txt"]

    ###############################################################################

    #                                  INPUTS                                    #

    ###############################################################################

    # Oracle metrics

    [[inputs.exec]]

      # Shell/commands array

      commands = ["/home/oracle/scripts/oracle_metrics.sh"]

      # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)

      # NOTE json only reads numerical measurements, strings and booleans are ignored.

      data_format = "influx"

      interval = "60s"

    ###############################################################################

    #                              SERVICE INPUTS                                #

    ###############################################################################

    启动 telegraf:

    telegraf -config /etc/telegraf/telegraf.conf

    数据可视化

    查询 InfluxDB 数据库

    [root@localhost log]# influx

    Connected to http://localhost:8086 version 1.7.4

    InfluxDB shell version: 1.7.4

    Enter an InfluxQL query

    > show databases

    name: databases

    name

    ----

    _internal

    telegraf

    > use telegraf

    Using database telegraf

    > show measurements

    name: measurements

    name

    ----

    oracle_fra_pctused

    oracle_sysmetric

    oracle_tablespaces

    oracle_wait_class

    oracle_wait_event

    > select * from oracle_sysmetric limit 5

    name: oracle_sysmetric

    time                db                      delphix fqdn                  host                  metric_name                                  metric_value

    ----                --                      ------- ----                  ----                  -----------                                  ------------

    1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Active_Parallel_Sessions                      0

    1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Active_Serial_Sessions                        1

    1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Average_Active_Sessions                      0.0138029495084

    1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Average_Synchronous_Single-Block_Read_Latency 0.5875

    1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Background_CPU_Usage_Per_Sec                  0.104149308449

    >

    Grafana 效果图如下:

    小结

    通过结合 Python 脚本开发的方式,我们可以扩展部分 Telegraf 不支持的监控项,本文简单提供了一种思路。

    相关资料:

    https://github.com/7DGroup/JMeter-examples/tree/master/Performance%20Monitoring/Telegraf-InfluxDB-Grafana-Python-Oracle

    参考资料:

    [1]:https://cx-oracle.readthedocs.io/en/latest/index.html

    [2]:http://datavirtualizer.com/wait-event-and-wait-class-metrics-vs-vsystem_event/

    [3]:https://docs.influxdata.com/influxdb/v1.7/write_protocols/

    备注:文章来自微信公众号:Zee_7DGroup

    相关文章

      网友评论

        本文标题:性能监控之Telegraf+InfluxDB+Grafana+P

        本文链接:https://www.haomeiwen.com/subject/qepdmctx.html