[gperftools] 1: CPU profiler

Author: ixiaolong | Published 2022-03-22 14:47


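1 Introduction to gperftools

gperftools is Google's open-source collection of high-performance, memory-related tools. It includes the tcmalloc memory allocator as well as performance analysis tools such as the CPU profiler and the heap profiler. This series covers them one by one.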

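2 Introduction to the CPU profiler

The CPU profiler uses sampling to measure and analyze how much CPU time a program actually consumes over a period of time. Its main advantage is that it is simple and convenient to use.

Profiling is done by sampling. The default rate is 100 samples per second, i.e. one sample every 10 ms, so a program that runs for less than 10 ms may produce no usable samples, and results from very short runs can differ noticeably from one run to the next.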
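The arithmetic behind these numbers can be sketched as follows. The helper names `sampling_period_us` and `expected_samples` are made up for this illustration and are not part of gperftools.

```cpp
#include <cassert>
#include <cstdint>

// Sampling period in microseconds for a given frequency (samples per second).
// The default of 100 samples per second gives 10000 us, i.e. 10 ms.
inline std::int64_t sampling_period_us(std::int64_t frequency_hz) {
    assert(frequency_hz > 0);
    return 1000000 / frequency_hz;
}

// Rough number of samples to expect for a given amount of consumed CPU time.
// Integer division: anything shorter than one period yields 0 samples.
inline std::int64_t expected_samples(std::int64_t cpu_time_us,
                                     std::int64_t frequency_hz) {
    return cpu_time_us / sampling_period_us(frequency_hz);
}
```

For example, `expected_samples(5000, 100)` is 0 (a 5 ms run is shorter than one period), while 300 ms of CPU time at the default rate gives about 30 samples.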

3 About this test

The test system is Ubuntu 20.04.
The test calls the provided API directly: ProfilerStart() before the code under test and ProfilerStop() after it.

4 Installing the environment and testing

Install libunwind

sudo apt install libunwind-dev

Install gperftools

cd ~/Download
git clone https://github.com/gperftools/gperftools.git
cd gperftools
sh autogen.sh
./configure
make all
sudo make install

Write the test code. Profiling is started with ProfilerStart(); its argument is the name of the file to generate:

/* Start profiling and write profile info into fname, discarding any
 * existing profiling data in that file.
 *
 * This is equivalent to calling ProfilerStartWithOptions(fname, NULL).
 */
PERFTOOLS_DLL_DECL int ProfilerStart(const char* fname);

Profiling is stopped with ProfilerStop():

/* Stop profiling. Can be started again with ProfilerStart(), but
 * the currently accumulated profiling data will be cleared.
 */
PERFTOOLS_DLL_DECL void ProfilerStop(void);

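If the program starts new threads, the following function should be called at the beginning of each thread to register it with the profiler. In my tests, however, adding or omitting the call made no difference.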

/* Routine for registering new threads with the profiler.
 */
PERFTOOLS_DLL_DECL void ProfilerRegisterThread(void);

Header file to include:

#include <gperftools/profiler.h>

Test code:
https://github.com/ixiaolonglong/memory_tool/blob/master/gperftools/tests/cpu_profiler/cpu_profiler_test.cpp
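One way to make sure the stop call always pairs with the start call is a small scope guard. The sketch below is not taken from the linked test code. `ScopedProfile` is a hypothetical helper, and the start and stop actions are passed in as callables so that the sketch compiles without gperftools installed.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Runs `start` on construction and `stop` on destruction, so the stop call
// happens on every exit path of the enclosing scope (early return, exception).
class ScopedProfile {
public:
    ScopedProfile(const std::function<void()>& start, std::function<void()> stop)
        : stop_(std::move(stop)) {
        assert(stop_ != nullptr);
        start();
    }
    ~ScopedProfile() { stop_(); }

    ScopedProfile(const ScopedProfile&) = delete;
    ScopedProfile& operator=(const ScopedProfile&) = delete;

private:
    std::function<void()> stop_;
};

// Intended use in a build that links the profiler library (assumption):
//   ScopedProfile guard([] { ProfilerStart("cpu_test.profile"); },
//                       [] { ProfilerStop(); });
//   run_workload();  // hypothetical function under test
```

The guard only fixes the pairing of the two calls. The output file name and the code being measured are still chosen by the caller.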

Compile the test code

g++ cpu_profiler_test.cpp -o cpu_profiler_test -lprofiler

Run cpu_profiler_test to generate the .profile file.

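5 Testing without an installed environment

It is not always reasonable to require every build machine to have gperftools and libunwind installed. Often the program should simply build and run, so the required sources have to be available when the executable is compiled.

In this project I added the gperftools source as a submodule. The libraries are built locally at compile time and can be linked directly.

The project is here; see README.md for the workflow: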
https://github.com/ixiaolonglong/memory_tool/tree/master/gperftools

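Add the following compile options when building the project, so that the compiler does not substitute its built-in versions of malloc and related functions: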

# tcmalloc options
add_compile_options(
    -fno-builtin-malloc
    -fno-builtin-calloc
    -fno-builtin-realloc
    -fno-builtin-free)

A note on static versus dynamic linking. With static libraries, every library that is used transitively has to be listed:

target_link_libraries(cpu_profiler_test PRIVATE
                                profiler
                                fake_stacktrace_scope
                                sysinfo
                                spinlock
                                maybe_threads
                                logging
                                unwind
                                pthread)

With a shared library, linking is much simpler:

target_link_libraries(cpu_profiler_test PRIVATE
                                profiler
                                pthread)

https://www.zhihu.com/question/277160878
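Quoted from the answer at the link above: "This is not a CMake pitfall; it is more likely a problem with your glog library. My guess is that you built glog yourself and it was not dynamically linked against gflags. If you build the shared library with target_link_libraries and set up rpath for library lookup, then when a depends on b and b depends on c, a only needs to declare its dependency on b and not on c."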

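6 Reports

The machine that runs the program does not need gperftools installed, but the machine that analyzes the generated profile does, because the pprof tool that ships with gperftools is needed to parse it.

The graphical reports also require Graphviz:

sudo apt-get install graphviz

Commands for the different report types: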

# Generate a performance report (a directed call graph) and show it in a web browser
pprof cpu_profiler_test cpu_test.profile --web

# Generate the performance report (a directed call graph) as a PDF
pprof cpu_profiler_test cpu_test.profile --pdf > prof.pdf

# Print a text report to the console
pprof cpu_profiler_test cpu_test.profile --text

6.1 Text

Total: 30 samples
       6  20.0%  20.0%       16  53.3% psiginfo
       5  16.7%  36.7%        5  16.7% __nss_database_lookup
       4  13.3%  50.0%        4  13.3% _IO_default_xsputn
       4  13.3%  63.3%       28  93.3% __snprintf
       3  10.0%  73.3%        3  10.0% _IO_enable_locks
       3  10.0%  83.3%       24  80.0% vscanf
       2   6.7%  90.0%        2   6.7% _IO_str_pbackfail
       1   3.3%  93.3%        1   3.3% cuserid
       1   3.3%  96.7%        9  30.0% test_main_thread
       1   3.3% 100.0%       21  70.0% test_other_thread
       0   0.0% 100.0%       21  70.0% RunFunctionInThread
       0   0.0% 100.0%        9  30.0% __libc_start_main
       0   0.0% 100.0%        9  30.0% _start
       0   0.0% 100.0%       21  70.0% clone
       0   0.0% 100.0%        9  30.0% main
       0   0.0% 100.0%       21  70.0% start_thread

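The text output above breaks down CPU time by function. The figures fall into two groups:

Left: CPU time spent in the function itself, excluding the functions it calls (inlined functions are the exception). If the function makes no calls, this equals the right-hand group.
Right: CPU time spent in the function as a whole, including the functions it calls.

The columns of each row are, in order:

Number of samples in the function itself (excluding callees)
Percentage of samples in the function itself (excluding callees)
Running total of that percentage down to the current row
Number of samples in the function including its callees
Percentage of samples in the function including its callees
Function name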
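To show how the six columns relate to each other, here is a small sketch that rebuilds report rows from raw sample counts. It is not pprof's implementation. `FunctionSamples`, `percent` and `format_report` are names invented for this illustration, and the column widths are only an approximation of the layout above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Sample counts for one function, as they appear in a text report row.
struct FunctionSamples {
    std::string name;
    int flat;        // samples in the function itself
    int cumulative;  // samples in the function plus everything it calls
};

// Share of the total in percent, e.g. 6 of 30 samples is 20.0.
double percent(int samples, int total) {
    assert(total > 0);
    return 100.0 * samples / total;
}

// Format rows in the six-column order described above.
// Rows are assumed to be sorted by flat count, largest first.
std::vector<std::string> format_report(const std::vector<FunctionSamples>& rows,
                                       int total) {
    std::vector<std::string> lines;
    int running_flat = 0;  // numerator of the "running total" column
    for (const FunctionSamples& row : rows) {
        running_flat += row.flat;
        char buf[256];
        std::snprintf(buf, sizeof(buf), "%8d %5.1f%% %5.1f%% %8d %5.1f%% %s",
                      row.flat, percent(row.flat, total),
                      percent(running_flat, total),
                      row.cumulative, percent(row.cumulative, total),
                      row.name.c_str());
        lines.push_back(buf);
    }
    return lines;
}
```

Feeding it the first two rows of the report (6 and 16 samples for psiginfo, 5 and 5 for __nss_database_lookup, out of 30) reproduces the 20.0%, 53.3%, 16.7% and 36.7% figures shown there.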

6.2 Graph

profile.png

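Each node represents a function. A node contains:

Class Name
Method Name
local (percentage): CPU time spent in the function itself, excluding the functions it calls (inlined functions are the exception)
of cumulative (percentage): CPU time spent in the function as a whole, including the functions it calls; not printed when it equals local
Directed edges point from caller to callee, and the time on an edge is the CPU time the callee consumed when called from that caller

Meta information (top-left corner of the graph):

Total samples: the total number of samples
Focusing on: the number of samples covered by the --focus option
Dropped nodes: nodes that were omitted
Dropped edges: edges that were omitted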

Focus on certain functions:

pprof --gv --focus=vsnprintf cpu_profiler_test cpu_test.profile

Ignore certain functions:

pprof --gv --ignore=snprintf cpu_profiler_test cpu_test.profile

For more options see: https://gperftools.github.io/gperftools/cpuprofile.html

6.3 Kcachegrind

Install Kcachegrind

sudo apt-get install kcachegrind

Generate the .callgrind file

pprof --callgrind cpu_profiler_test cpu_test.profile > cpu_test.callgrind

Open it for analysis

kcachegrind cpu_test.callgrind

The result is shown in the figure below. Of the report types, this one offers comparatively more functionality:

callgrind.png

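7 Switching profiling on and off

A server program usually does not exit on its own once started, and when it does exit, it is often not a clean exit. gperftools, however, can only collect complete data if the program exits normally, so a way to start and stop profiling at run time is needed.

7.1 Via a service request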

#include <gperftools/profiler.h>

void on_request(Request* req) {
    static bool is_profile_started = false;
    if (req->type == START_PROFILE && !is_profile_started) {
        ProfilerStart("xxx.profile");
        is_profile_started = true;
    } else if (req->type == STOP_PROFILE && is_profile_started) {
        ProfilerStop();
        is_profile_started = false;
    } else {
        // normal request processing here
    }
}

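7.2 Via signals

#include <signal.h>
#include <stdio.h>
#include <gperftools/profiler.h>

// Note: printf() is not async-signal-safe; it is used here only for a quick test.
static void gprof_callback(int signum) {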
    if (signum == SIGUSR1) {
        printf("Catch the signal ProfilerStart\n");
        ProfilerStart("bs.prof");
    }
    else if (signum == SIGUSR2) {
        printf("Catch the signal ProfilerStop\n");
        ProfilerStop();
    }
}

static void setup_signal() {
    struct sigaction profstat;
    profstat.sa_handler = gprof_callback;
    profstat.sa_flags = 0;
    sigemptyset(&profstat.sa_mask);                                        
    sigaddset(&profstat.sa_mask, SIGUSR1);
    sigaddset(&profstat.sa_mask, SIGUSR2);

    if (sigaction(SIGUSR1, &profstat,NULL) < 0)
        fprintf(stderr, "SIGUSR1 Fail !");

    if (sigaction(SIGUSR2, &profstat,NULL) < 0)
        fprintf(stderr, "SIGUSR2 Fail !");
}
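The handler above does its work directly in signal context. A more cautious variant records the request in a flag and lets normal code act on it. The sketch below is an illustration and not part of the original test code. `on_profile_signal`, `install_profile_signals` and `poll_profile_switch` are hypothetical names, and the start and stop actions are passed in as callables so that the sketch compiles without gperftools installed.

```cpp
#include <cassert>
#include <functional>
#include <signal.h>

// Set by the signal handler, consumed by the main loop.
// 0 = nothing pending, 1 = start requested, 2 = stop requested.
static volatile sig_atomic_t g_pending_request = 0;

// The handler only records the request; it does no I/O and no allocation.
extern "C" void on_profile_signal(int signum) {
    if (signum == SIGUSR1) g_pending_request = 1;
    else if (signum == SIGUSR2) g_pending_request = 2;
}

// Install the handler for SIGUSR1 and SIGUSR2. Returns true on success.
bool install_profile_signals() {
    struct sigaction sa = {};
    sa.sa_handler = on_profile_signal;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, nullptr) == 0 &&
           sigaction(SIGUSR2, &sa, nullptr) == 0;
}

// Call regularly from normal (non-signal) context, e.g. once per iteration
// of the server's main loop. In a real program `start` and `stop` would wrap
// ProfilerStart("bs.prof") and ProfilerStop(). Returns true while active.
bool poll_profile_switch(const std::function<void()>& start,
                         const std::function<void()>& stop) {
    static bool active = false;
    const sig_atomic_t request = g_pending_request;
    g_pending_request = 0;
    if (request == 1 && !active) {
        start();
        active = true;
    } else if (request == 2 && active) {
        stop();
        active = false;
    }
    return active;
}
```

With this in place, `kill -USR1 <pid>` requests a start and `kill -USR2 <pid>` requests a stop. The switch takes effect the next time the main loop calls `poll_profile_switch`, and a signal that arrives between the read and the reset of the flag can be lost, which is acceptable for a manual switch but worth knowing.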

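8 How it works

If you only care about how to use the tool, you can stop here and start on your own project. The rest of this article takes a brief look at the CPU profiler source code.

Standing on the shoulders of giants: http://www.tealcode.com/gperftool_source_analysis/

Entry point:

extern "C" PERFTOOLS_DLL_DECL int ProfilerStart(const char* fname) {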
    return CpuProfiler::instance_.Start(fname, NULL);
}

bool CpuProfiler::Start(const char* fname, const ProfilerOptions* options) {
    collector_.Start(fname, collector_options);
    // Setup handler for SIGPROF interrupts
    EnableHandler();
    return true;
}

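When the CPU profiler starts, its core job is to start the data collector (collector_). The collector's Start() function initializes the data structures needed for collection and creates the output file: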

bool ProfileData::Start(const char* fname, const ProfileData::Options& options) {
    // Open output file and initialize various data structures
    int fd = open(fname, O_CREAT | O_WRONLY | O_TRUNC, 0666);
    start_time_ = time(NULL);
    fname_ = strdup(fname);
    // Reset counters
    num_evicted_ = 0;
    count_ = 0;
    evictions_ = 0;
    total_bytes_ = 0;
    hash_ = new Bucket[kBuckets];
    evict_ = new Slot[kBufferLength];
    memset(hash_, 0, sizeof(hash_[0]) * kBuckets);
    // Record special entries
    evict_[num_evicted_++] = 0; // count for header
    evict_[num_evicted_++] = 3; // depth for header
    evict_[num_evicted_++] = 0; // Version number
    CHECK_NE(0, options.frequency());
    int period = 1000000 / options.frequency();
    evict_[num_evicted_++] = period; // Period (microseconds)
    evict_[num_evicted_++] = 0; // Padding
    out_ = fd;
    return true;
}

Next, the CPU profiler enables its handler, which registers prof_handler() as a callback with ProfileHandler:

void CpuProfiler::EnableHandler() {
    prof_handler_token_ = ProfileHandlerRegisterCallback(prof_handler, this);
}

ProfileHandlerToken* ProfileHandlerRegisterCallback(
    ProfileHandlerCallback callback, void* callback_arg) {
    return ProfileHandler::Instance()->RegisterCallback(callback, callback_arg);
}

All of the work happens in ProfileHandler, which is a singleton class. Its constructor looks like this:

ProfileHandler::ProfileHandler() {
    timer_type_ = (getenv("CPUPROFILE_REALTIME") ? ITIMER_REAL : ITIMER_PROF);
    signal_number_ = (timer_type_ == ITIMER_PROF ? SIGPROF : SIGALRM);

    // Get frequency of interrupts (if specified)
    char junk;
    const char* fr = getenv("CPUPROFILE_FREQUENCY");

    if (fr != NULL && (sscanf(fr, "%u%c", &frequency_, &junk) == 1) && (frequency_ > 0)) {
        // Limit to kMaxFrequency
        frequency_ = (frequency_ > kMaxFrequency) ? kMaxFrequency : frequency_;
    } else {
        frequency_ = kDefaultFrequency;
    }

    // Install the signal handler.

    struct sigaction sa;
    sa.sa_sigaction = SignalHandler;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(signal_number_, &sa, NULL);
}

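Based on the CPUPROFILE_REALTIME environment variable, the constructor decides whether SIGALRM (variable set) or SIGPROF (variable not set) triggers the SignalHandler signal handler. It also sets its frequency_ member from the CPUPROFILE_FREQUENCY environment variable. If that variable is not set, the default value of 100 is used; the maximum is 4000.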
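The frequency rule can be tried in isolation. The sketch below copies only the parsing logic from the excerpt into a free function. `parse_frequency` is a name made up for this illustration, and the two constants carry the values stated above.

```cpp
#include <cassert>
#include <cstdio>

// Values stated in the text; the names mirror the constructor excerpt.
constexpr unsigned kDefaultFrequency = 100;
constexpr unsigned kMaxFrequency = 4000;

// Accept the text only if it is exactly one positive unsigned integer with
// nothing after it, clamp it to kMaxFrequency, and otherwise fall back to
// kDefaultFrequency. `text` plays the role of getenv("CPUPROFILE_FREQUENCY").
unsigned parse_frequency(const char* text) {
    unsigned frequency = 0;
    char junk;
    if (text != nullptr &&
        std::sscanf(text, "%u%c", &frequency, &junk) == 1 &&
        frequency > 0) {
        return frequency > kMaxFrequency ? kMaxFrequency : frequency;
    }
    return kDefaultFrequency;
}
```

The extra `%c` conversion is what rejects trailing characters: for input such as "12x", sscanf converts two items instead of one, so the value falls back to the default.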

ProfileHandler's RegisterCallback() function is implemented as follows:

ProfileHandlerToken* ProfileHandler::RegisterCallback(ProfileHandlerCallback callback, void* callback_arg) {
    ProfileHandlerToken* token = new ProfileHandlerToken(callback, callback_arg);
    SpinLockHolder cl(&control_lock_);
    DisableHandler();
    {
        SpinLockHolder sl(&signal_lock_);
        callbacks_.push_back(token);
    }

    // Start the timer if timer is shared and this is a first callback.
    if ((callback_count_ == 0) && (timer_sharing_ == TIMERS_SHARED)) {
        StartTimer();
    }

    ++callback_count_;
    EnableHandler();
    return token;
}

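As its name suggests, this function adds the given callback to callbacks_. When the first callback is added, it calls StartTimer() to start the timer, and then calls EnableHandler() to turn callbacks on. StartTimer() is implemented as follows: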

void ProfileHandler::StartTimer() {
    struct itimerval timer;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000000 / frequency_;
    timer.it_value = timer.it_interval;
    setitimer(timer_type_, &timer, 0);
}

EnableHandler() is implemented as follows:

void ProfileHandler::EnableHandler() {
    struct sigaction sa;
    sa.sa_sigaction = SignalHandler;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    const int signal_number = (timer_type_ == ITIMER_PROF ? SIGPROF : SIGALRM);
    RAW_CHECK(sigaction(signal_number, &sa, NULL) == 0, "sigprof (enable)");
}

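At this point the basic working principle can more or less be guessed. The tool uses setitimer() to start a system timer that fires SIGPROF or SIGALRM frequency_ times per second, which in turn triggers the signal handler registered above. Presumably the handler then takes a backtrace to see where the target program is currently executing. The callback looks like this: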
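The timer mechanism can be observed on its own, without gperftools. The following sketch installs a SIGPROF handler that only counts ticks, starts an ITIMER_PROF timer with the same interval calculation as StartTimer(), and burns CPU for a while. `count_tick` and `count_sigprof_ticks` are names invented for this illustration, and the code assumes a POSIX system such as Linux.

```cpp
#include <cassert>
#include <ctime>
#include <signal.h>
#include <sys/time.h>

// Incremented once per SIGPROF delivery.
static volatile sig_atomic_t g_ticks = 0;

extern "C" void count_tick(int) { g_ticks = g_ticks + 1; }

// Burn roughly `cpu_seconds` of CPU time while an ITIMER_PROF timer fires
// `frequency_hz` times per second of consumed CPU time. Returns the number
// of SIGPROF signals delivered, or -1 if the timer could not be set up.
int count_sigprof_ticks(int frequency_hz, double cpu_seconds) {
    assert(frequency_hz > 1 && frequency_hz <= 4000);
    g_ticks = 0;

    struct sigaction sa = {};
    sa.sa_handler = count_tick;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, nullptr) != 0) return -1;

    struct itimerval timer = {};
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000000 / frequency_hz;
    timer.it_value = timer.it_interval;
    if (setitimer(ITIMER_PROF, &timer, nullptr) != 0) return -1;

    // Busy loop: ITIMER_PROF only counts down while the process uses CPU.
    const std::clock_t start = std::clock();
    volatile unsigned long sink = 0;
    while (static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC < cpu_seconds) {
        sink = sink + 1;
    }

    // Cancel the timer again by setting it to zero.
    struct itimerval off = {};
    setitimer(ITIMER_PROF, &off, nullptr);
    return g_ticks;
}
```

Calling `count_sigprof_ticks(100, 0.3)` should return a count in the neighborhood of 30, though the exact number varies between runs. A real profiler does its stack capture at the point where this sketch only increments a counter.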

void CpuProfiler::prof_handler(int sig, siginfo_t*, void* signal_ucontext, void* cpu_profiler) {
    CpuProfiler* instance = static_cast<CpuProfiler*>(cpu_profiler);
    if (instance->filter_ == NULL || (*instance->filter_)(instance->filter_arg_)) {
        void* stack[ProfileData::kMaxStackDepth];
        // Under frame-pointer-based unwinding at least on x86, the
        // top-most active routine doesn't show up as a normal frame, but
        // as the "pc" value in the signal handler context.
        stack[0] = GetPC(*reinterpret_cast<ucontext_t*>(signal_ucontext));
        // We skip the top three stack trace entries (this function,
        // SignalHandler::SignalHandler and one signal handler frame)
        // since they are artifacts of profiling and should not be
        // measured. Other profiling related frames may be removed by
        // "pprof" at analysis time. Instead of skipping the top frames,
        // we could skip nothing, but that would increase the profile size
        // unnecessarily.
        int depth = GetStackTraceWithContext(stack + 1, arraysize(stack) - 1, 3, signal_ucontext);
        void** used_stack;
        if (depth > 0 && stack[1] == stack[0]) {
            // in case of non-frame-pointer-based unwinding we will get
            // duplicate of PC in stack[1], which we don't want
            used_stack = stack + 1;
        } else {
            used_stack = stack;
            depth++;  // To account for pc value in stack[0];
        }
        instance->collector_.Add(depth, used_stack);
    }
}

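As expected, it captures a backtrace and records it in collector_.

Summary:

The tool uses a system timer to raise a signal periodically and captures the current call stack in the signal handler to determine which function execution is in. By default it samples once every 10 ms. The rate is adjustable, but the maximum frequency is 4000, so the shortest supported sampling interval is 250 microseconds.
The performance data is statistical. The tool does not trace every function call; it samples at regular intervals, records the call position each sample falls in, and uses these statistics to estimate each function's share of execution time. The data is not exact, but as long as the program runs long enough, the statistics describe the problem fairly accurately. This is also why the tool suits server programs well and is less suitable for some client programs such as game clients: rather than long-run statistics, game clients usually care more about the load within specific frames.
When the tool is not profiling, the system timer is cancelled, so no periodic signals are raised and the handler is never triggered. The effect on the running program is therefore very small, with essentially no impact on run-time performance. The only cost to the product is the extra memory taken up by linking the profiler library.

References: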

https://gperftools.github.io/gperftools/cpuprofile.html
http://airekans.github.io/cpp/2014/07/04/gperftools-profile
https://blog.csdn.net/aganlengzi/article/details/62893533
https://blog.csdn.net/10km/article/details/83820080
https://www.zhihu.com/question/277160878
http://www.tealcode.com/gperftool_source_analysis/
