Troubleshooting Why the Timelineserver Process Crashed

Author: guangdong_18b7 | Published: 2019-06-14 21:12

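        At 00:18 on June 13 the ZooKeeper process crashed. The JVM error file written when it died, hs_err_pid5829.log, shows that the ZooKeeper process was killed by an out-of-memory error.

        In theory ZooKeeper stores very little data and should not run out of memory.

        A closer look at the end of that file shows the state of the server's memory at the time. The machine has 256 GB of physical memory, yet only about 1.7 GB was free when ZooKeeper crashed. That was surprising: when memory on this server was allocated, about 50 GB was held in reserve, so free memory should never have been that low.

        The 谛听 (Diting) monitoring system showed that memory usage on this server climbed steadily until all physical memory was consumed, dropped sharply, then slowly climbed back to the maximum, over and over.

        Our guess was that some process on this server was leaking memory and eating all of the machine's memory. As a result, that process and others (such as the ZooKeeper process and the JN process that had crashed earlier) could not obtain the memory they were configured with and died reporting out-of-memory errors, even though they were not really out of memory themselves.

        The top command showed one process owned by the yarn user using more than 80% of memory, which means close to 190 GB was held by that single process. Looking up the process ID with ps identified it as the timelineserver process. The odd part is that we had given this process 100 GB of heap plus 10 GB of off-heap memory, so in theory it should use about 110 GB at most, yet it was actually using close to 200 GB. Where had the other 80 GB or so gone? (The sawtooth in server memory is explained by a crontab job that periodically checks whether timelineserver is alive and restarts it when it has died.)

        The first suspicion was an off-heap memory leak, but how do you investigate one?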

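        Adding -XX:NativeMemoryTracking=detail to the startup options makes the JVM track its native (off-heap) memory (this option costs 5% to 10% in performance). The command

    jcmd 22215 VM.native_memory summary scale=GB

        then reports the JVM's memory consumption, including the memory used by NMT itself,

        while the Linux command ps -p 22215 -o rss,vsz reports how much memory the process really occupies from the operating system's point of view. (The original post showed both outputs in a screenshot at this point.)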

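        The figure shows that even with native memory tracking enabled, the memory the JVM can account for is only 105 GB. That is consistent with the configured 110 GB, so the process has neither a heap leak nor a direct (DirectByteBuffer) memory leak. Where, then, did the other nearly 80 GB go?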
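        The comparison above amounts to subtracting what NMT reports as committed from the RSS that ps reports. A minimal Java sketch of that arithmetic follows. The class name NmtGap, the sample "Total:" line, and the RSS value are illustrative assumptions rounded from the figures in the text, not captured output, and the regular expression assumes the usual "Total: reserved=..., committed=..." line of the NMT summary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NmtGap {
    // Matches the "committed=<n><unit>" field on the NMT "Total:" line.
    private static final Pattern COMMITTED =
        Pattern.compile("Total:.*committed=(\\d+)(KB|MB|GB)");

    /** Returns the committed total from an NMT summary, converted to KB. */
    static long committedKb(String nmtSummary) {
        Matcher m = COMMITTED.matcher(nmtSummary);
        if (!m.find()) {
            throw new IllegalArgumentException("no Total line found");
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "GB": return value * 1024 * 1024;
            case "MB": return value * 1024;
            default:   return value;
        }
    }

    /** Memory the OS sees (RSS, in KB) that NMT does not account for. */
    static long untrackedKb(long rssKb, String nmtSummary) {
        return rssKb - committedKb(nmtSummary);
    }

    public static void main(String[] args) {
        // Illustrative inputs only, rounded from the numbers in the article.
        String nmt = "Total: reserved=110GB, committed=105GB";
        long rssKb = 190L * 1024 * 1024;
        long gapGb = untrackedKb(rssKb, nmt) / (1024 * 1024);
        System.out.println(gapGb + " GB not tracked by the JVM");
    }
}
```

        A gap that keeps growing between these two numbers is a hint that the memory is being allocated outside the JVM's own bookkeeping.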

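        Studying the JVM memory model shows that memory used through JNI calls falls outside what the JVM tracks. JNI calls into C or C++ code, and whatever memory that code allocates and frees on its own is neither managed by the JVM nor counted in its statistics. Timelineserver stores its data in leveldb, and leveldb is implemented in C++, so the conclusion was that the off-heap leak came from JNI.

        We do not know C++, let alone how to detect memory leaks in C++ code, so how could we investigate? The C++ experts we asked said to debug the C++ program step by step and look for memory that is never freed. The JVM experts said to read the code where leveldb is used and look for anything suspicious.

        That left us baffled, with nowhere obvious to start, because JNI off-heap leaks are hard to track down. In the end we chose to review recently modified code for clues, and that turned up a problem. See the following code: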

      public TimelineEvents getEntityTimelines(String entityType,
          SortedSet<String> entityIds, Long limit, Long windowStart,
          Long windowEnd, Set<String> eventType) throws IOException {
        TimelineEvents events = new TimelineEvents();
        if (entityIds == null || entityIds.isEmpty()) {
          return events;
        }
        // create a lexicographically-ordered map from start time to entities
        Map<byte[], List<EntityIdentifier>> startTimeMap =
            new TreeMap<byte[], List<EntityIdentifier>>(
            new Comparator<byte[]>() {
              @Override
              public int compare(byte[] o1, byte[] o2) {
                return WritableComparator.compareBytes(o1, 0, o1.length, o2, 0,
                    o2.length);
              }
            });
        // a single variable shared by every loop iteration below
        DBIterator iterator = null;
        try {
          // look up start times for the specified entities
          // skip entities with no start time
          for (String entityId : entityIds) {
            byte[] startTime = getStartTime(entityId, entityType);
            if (startTime != null) {
              List<EntityIdentifier> entities = startTimeMap.get(startTime);
              if (entities == null) {
                entities = new ArrayList<EntityIdentifier>();
                startTimeMap.put(startTime, entities);
              }
              entities.add(new EntityIdentifier(entityId, entityType));
            }
          }
          for (Entry<byte[], List<EntityIdentifier>> entry : startTimeMap
              .entrySet()) {
            // look up the events matching the given parameters (limit,
            // start time, end time, event types) for entities whose start times
            // were found and add the entities to the return list
            byte[] revStartTime = entry.getKey();
            for (EntityIdentifier entityIdentifier : entry.getValue()) {
              EventsOfOneEntity entity = new EventsOfOneEntity();
              entity.setEntityId(entityIdentifier.getId());
              entity.setEntityType(entityType);
              events.addEvent(entity);
              KeyBuilder kb = KeyBuilder.newInstance().add(entityType)
                  .add(revStartTime).add(entityIdentifier.getId())
                  .add(EVENTS_COLUMN);
              byte[] prefix = kb.getBytesForLookup();
              if (windowEnd == null) {
                windowEnd = Long.MAX_VALUE;
              }
              byte[] revts = writeReverseOrderedLong(windowEnd);
              kb.add(revts);
              byte[] first = kb.getBytesForLookup();
              byte[] last = null;
              if (windowStart != null) {
                last = KeyBuilder.newInstance().add(prefix)
                    .add(writeReverseOrderedLong(windowStart)).getBytesForLookup();
              }
              if (limit == null) {
                limit = DEFAULT_LIMIT;
              }
              DB db = entitydb.getDBForStartTime(readReverseOrderedLong(
                  revStartTime, 0));
              if (db == null) {
                continue;
              }
              // overwrites the previous iterator, which is never closed
              iterator = db.iterator();
              for (iterator.seek(first); entity.getEvents().size() < limit
                  && iterator.hasNext(); iterator.next()) {
                byte[] key = iterator.peekNext().getKey();
                if (!prefixMatches(prefix, prefix.length, key)
                    || (last != null && WritableComparator.compareBytes(key, 0,
                        key.length, last, 0, last.length) > 0)) {
                  break;
                }
                TimelineEvent event = getEntityEvent(eventType, key, prefix.length,
                    iterator.peekNext().getValue());
                if (event != null) {
                  entity.addEvent(event);
                }
              }
            }
          }
        } finally {
          // closes only the iterator that was assigned last
          IOUtils.cleanup(LOG, iterator);
        }
        return events;
      }

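        Focus on how the iterator variable is handled. It looks as if iterator is closed by the call to IOUtils.cleanup(LOG, iterator) in the finally block, but that call only closes the DBIterator object that iterator refers to last. Inside the nested loops, iterator is repeatedly reassigned to a new DBIterator, and the objects created along the way never have close() called on them. The code was therefore changed as follows: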

      public TimelineEvents getEntityTimelines(String entityType,
          SortedSet<String> entityIds, Long limit, Long windowStart,
          Long windowEnd, Set<String> eventType) throws IOException {
        TimelineEvents events = new TimelineEvents();
        if (entityIds == null || entityIds.isEmpty()) {
          return events;
        }
        // create a lexicographically-ordered map from start time to entities
        Map<byte[], List<EntityIdentifier>> startTimeMap =
            new TreeMap<byte[], List<EntityIdentifier>>(
            new Comparator<byte[]>() {
              @Override
              public int compare(byte[] o1, byte[] o2) {
                return WritableComparator.compareBytes(o1, 0, o1.length, o2, 0,
                    o2.length);
              }
            });
        // look up start times for the specified entities
        // skip entities with no start time
        for (String entityId : entityIds) {
          byte[] startTime = getStartTime(entityId, entityType);
          if (startTime != null) {
            List<EntityIdentifier> entities = startTimeMap.get(startTime);
            if (entities == null) {
              entities = new ArrayList<EntityIdentifier>();
              startTimeMap.put(startTime, entities);
            }
            entities.add(new EntityIdentifier(entityId, entityType));
          }
        }
        for (Entry<byte[], List<EntityIdentifier>> entry : startTimeMap
            .entrySet()) {
          // look up the events matching the given parameters (limit,
          // start time, end time, event types) for entities whose start times
          // were found and add the entities to the return list
          byte[] revStartTime = entry.getKey();
          for (EntityIdentifier entityIdentifier : entry.getValue()) {
            EventsOfOneEntity entity = new EventsOfOneEntity();
            entity.setEntityId(entityIdentifier.getId());
            entity.setEntityType(entityType);
            events.addEvent(entity);
            KeyBuilder kb = KeyBuilder.newInstance().add(entityType)
                .add(revStartTime).add(entityIdentifier.getId())
                .add(EVENTS_COLUMN);
            byte[] prefix = kb.getBytesForLookup();
            if (windowEnd == null) {
              windowEnd = Long.MAX_VALUE;
            }
            byte[] revts = writeReverseOrderedLong(windowEnd);
            kb.add(revts);
            byte[] first = kb.getBytesForLookup();
            byte[] last = null;
            if (windowStart != null) {
              last = KeyBuilder.newInstance().add(prefix)
                  .add(writeReverseOrderedLong(windowStart)).getBytesForLookup();
            }
            if (limit == null) {
              limit = DEFAULT_LIMIT;
            }
            DB db = entitydb.getDBForStartTime(readReverseOrderedLong(
                revStartTime, 0));
            if (db == null) {
              continue;
            }
            // try-with-resources: this iterator is closed when the block
            // exits, once per entity, including on break or exception
            try (DBIterator iterator = db.iterator()) {
              for (iterator.seek(first); entity.getEvents().size() < limit
                  && iterator.hasNext(); iterator.next()) {
                byte[] key = iterator.peekNext().getKey();
                if (!prefixMatches(prefix, prefix.length, key)
                    || (last != null && WritableComparator.compareBytes(key, 0,
                        key.length, last, 0, last.length) > 0)) {
                  break;
                }
                TimelineEvent event = getEntityEvent(eventType, key, prefix.length,
                    iterator.peekNext().getValue());
                if (event != null) {
                  entity.addEvent(event);
                }
              }
            }
          }
        }
        return events;
      }

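        Again, focus on how iterator is handled. The new version uses the try-with-resources statement (available since Java 7), so iterator.close() is called automatically each time the try block exits, that is, once for every entity processed by the inner loop, with no explicit call to close() in the source. That raises the question: why does calling close() here stop the memory leak?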
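        Before looking at close() itself, the difference between the two versions can be reproduced without LevelDB. The sketch below is a stand-alone illustration, not the Timelineserver code: TrackedIterator is a hypothetical stand-in for DBIterator that only counts how many instances are open, and leaky() and fixed() mirror the shape of the two methods above.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

public class IteratorLeakDemo {
    // Number of resources that were opened but not yet closed.
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for DBIterator; not the LevelDB class. */
    static class TrackedIterator implements Closeable {
        TrackedIterator() { OPEN.incrementAndGet(); }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    /** Original shape: one variable, reassigned in a loop, closed once. */
    static int leaky(int rounds) {
        OPEN.set(0);
        TrackedIterator it = null;
        try {
            for (int i = 0; i < rounds; i++) {
                it = new TrackedIterator(); // previous instance dropped unclosed
            }
        } finally {
            if (it != null) it.close();     // closes only the last instance
        }
        return OPEN.get();
    }

    /** Fixed shape: each instance is scoped to a single loop iteration. */
    static int fixed(int rounds) {
        OPEN.set(0);
        for (int i = 0; i < rounds; i++) {
            try (TrackedIterator it = new TrackedIterator()) {
                if (i % 2 == 0) continue;   // close() still runs on early exit
            }
        }
        return OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leaky(5) + " still open");
        System.out.println("fixed: " + fixed(5) + " still open");
    }
}
```

        With five rounds, leaky() leaves four instances open and fixed() leaves none. In the real method the number of unclosed iterators per call depends on how many entities are queried, so the leak grows with request volume.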

        Analyzing the close() method of the DBIterator object:

        public void close() {
            iterator.delete();
        }

        The delete() method it calls is:

        public void delete() {
            assertAllocated();
            IteratorJNI.delete(self);
            self = 0;
        }

        And IteratorJNI.delete() is declared as:

        @JniMethod(flags={CPP_DELETE})
        public static final native void delete(long self);

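        As you can see, the call ultimately goes through JNI to the underlying C++ delete, which releases the memory. In other words, because the release was never invoked explicitly, memory allocated in the underlying C++ code could not be freed, and that is what caused the JNI memory leak.

        After the code change we watched the process for several hours and the memory leak no longer occurred. Problem solved.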
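        The handle pattern in these snippets explains why the Java heap looked healthy while RSS kept growing: the Java wrapper holds little more than a long, and the real allocation lives on the native side. The following sketch simulates that in pure Java. NATIVE_HEAP, nativeNew, nativeDelete, and FakeIterator are all hypothetical names standing in for the C++ allocator and the JNI wrapper, and the sketch assumes the wrapper has no finalizer, which the snippets above suggest but do not prove.

```java
import java.util.HashMap;
import java.util.Map;

public class NativeHandleDemo {
    /** Simulated native heap (handle -> bytes); stands in for C++ allocations. */
    static final Map<Long, Long> NATIVE_HEAP = new HashMap<>();
    static long nextHandle = 1;

    static long nativeNew(long bytes) {
        long handle = nextHandle++;
        NATIVE_HEAP.put(handle, bytes);
        return handle;
    }

    static void nativeDelete(long handle) {
        NATIVE_HEAP.remove(handle);
    }

    static long nativeBytes() {
        long sum = 0;
        for (long bytes : NATIVE_HEAP.values()) sum += bytes;
        return sum;
    }

    /** Hypothetical wrapper mirroring the self/delete() shape shown above. */
    static class FakeIterator implements AutoCloseable {
        private long self = nativeNew(1024);

        @Override public void close() {
            if (self == 0) throw new IllegalStateException("already deleted");
            nativeDelete(self);
            self = 0; // mark the handle as released
        }
    }

    public static void main(String[] args) {
        new FakeIterator();                         // wrapper dropped, never closed
        try (FakeIterator it = new FakeIterator()) {
            // use the iterator here
        }
        System.out.println(nativeBytes() + " bytes still allocated natively");
    }
}
```

        The wrapper that was dropped without close() can be garbage collected, but its 1024 simulated native bytes remain allocated, which is the same shape as the leak described above.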


          Article link: https://www.haomeiwen.com/subject/tdsmfctx.html