Performance Optimization: sleep, sched_yield, and Busy-Waiting

Author: 耐寒 | Published 2023-03-10 17:54

    Recently I helped a company optimize their quantitative trading system, which contained the following code:

    void xxxx::MonitorThread()
    {
        while (m_running)
        {
            MonitorOrders();   // drain the pending order queue
            Sleep(0);          // give up the rest of this time slice
        }
    }
    
    

    The order-monitoring thread calls Sleep(0). The design is a tight loop: drain the orders in the queue, then call Sleep(0) to give up the CPU so that other runnable threads get a chance to execute.
    This Sleep(0) pattern is used heavily throughout the system. Is it an appropriate design?
    In quantitative trading, as everyone knows, speed beats everything.
    There is actually another system call, sched_yield, that also relinquishes the CPU. Its man page describes it as follows:

    NAME
           sched_yield - yield the processor
    
    SYNOPSIS
           #include <sched.h>
    
           int sched_yield(void);
    
    DESCRIPTION
           sched_yield() causes the calling thread to relinquish the CPU.  The thread is moved to the end of the queue for its static priority and a new thread gets to run.
    
    RETURN VALUE
           On success, sched_yield() returns 0.  On error, -1 is returned, and errno is set appropriately.
    
    ERRORS
           In the Linux implementation, sched_yield() always succeeds.
    
    CONFORMING TO
           POSIX.1-2001, POSIX.1-2008.
    
    NOTES
           If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield().
    
           POSIX systems on which sched_yield() is available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
    
           Strategic  calls to sched_yield() can improve performance by giving other threads or processes a chance to run when (heavily) contended resources (e.g., mutexes) have been released by the caller.  Avoid calling sched_yield() unnecessarily
           or inappropriately (e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.
    
           sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR).  Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application
           design is broken.
    

    If the calling thread is the only thread at the highest priority, it will simply continue to run after calling sched_yield().
    Now let's compare the performance of sched_yield() and sleep(0):

    root@iZ2zefnvk8kwih8l62w90yZ:/data# more test.c
    #include <sched.h>
    #include <unistd.h>
    
    int main(int argc, char **argv) {
        for (int i = 0; i < 100000; i++) {
            //sleep(0);        /* first build: 100k calls to sleep(0) */
            sched_yield();     /* second build: 100k calls to sched_yield() */
        }
        return 0;
    }
    
    
    root@iZ2zefnvk8kwih8l62w90yZ:/data# time ./test  
    
    real    0m6.186s
    user    0m0.092s
    sys 0m0.460s
    root@iZ2zefnvk8kwih8l62w90yZ:/data# time ./test
    
    real    0m0.043s
    user    0m0.012s
    sys 0m0.031s
    

    0.043s versus 6.186s is a striking gap (the slow run was built with sleep(0), the fast one with sched_yield()). Where does it come from?
    With sleep(), the kernel performs a full reschedule: the task is removed from the run queue (a red-black tree under CFS) and placed on a wait queue, then re-enqueued when it wakes, and that round trip is what costs the time.
    What this design actually wants is for the order-processing thread to keep running, ideally without ever giving up the CPU. That is, in effect, "busy-waiting". Let's look at how the multi-threaded I/O code introduced in Redis 6.0 busy-waits:

    void *IOThreadMain(void *myid) {
        /* The ID is the thread number (from 0 to server.iothreads_num-1), and is
         * used by the thread to just manipulate a single sub-array of clients. */
        long id = (unsigned long)myid;
        char thdname[16];
    
        snprintf(thdname, sizeof(thdname), "io_thd_%ld", id);
        redis_set_thread_title(thdname);
        redisSetCpuAffinity(server.server_cpulist);
        makeThreadKillable();
    
        while(1) {
            /* Wait for start */
            for (int j = 0; j < 1000000; j++) {
                if (getIOPendingCount(id) != 0) break;
            }
    
            /* Give the main thread a chance to stop this thread. */
            if (getIOPendingCount(id) == 0) {
                pthread_mutex_lock(&io_threads_mutex[id]);
                pthread_mutex_unlock(&io_threads_mutex[id]);
                continue;
            }
    
            serverAssert(getIOPendingCount(id) != 0);
    
            /* Process: note that the main thread will never touch our list
             * before we drop the pending count to 0. */
            listIter li;
            listNode *ln;
            listRewind(io_threads_list[id],&li);
            while((ln = listNext(&li))) {
                client *c = listNodeValue(ln);
                if (io_threads_op == IO_THREADS_OP_WRITE) {
                    writeToClient(c,0);
                } else if (io_threads_op == IO_THREADS_OP_READ) {
                    readQueryFromClient(c->conn);
                } else {
                    serverPanic("io_threads_op value is unknown");
                }
            }
            listEmpty(io_threads_list[id]);
            setIOPendingCount(id, 0);
        }
    }
    

    Specifically, note this part:

            /* Wait for start */
            for (int j = 0; j < 1000000; j++) {
                if (getIOPendingCount(id) != 0) break;
            }
    

    In other words, the thread keeps the CPU busy until it sees a non-zero pending I/O count, or until the loop has spun one million times.

    How is this handled in a typical spin_lock implementation?

    • Simply checking the spinlock in a tight loop hogs the CPU.
    • Using sleep(0) or sched_yield() triggers a ring 3 -> ring 0 context switch on every call, which makes the latency very high.
    • The best approach is to check the spinlock for a few rounds, then switch out with sleep(0) or sched_yield(), and repeat.
      While checking the spinlock, the pause instruction (the _mm_pause intrinsic) can be used. It tells the CPU that the upcoming instructions are a spinlock check, so it does not need to run them at full speed, e.g. by completely filling the pipeline; this can also save about 4% of power. Execution after this instruction may be delayed for a while, but the length of that delay is not controllable and is entirely up to the CPU, so we must not rely on or assume any particular delay.

    Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.
    The pause instruction can be used in conjunction with a Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and the performance may benefit from a short spin in ring 3. It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count. Since you cannot guarantee the cycle count from one system to the next, you should check the lock in between each pause to avoid introducing unnecessarily long delays on new systems.

    According to Intel's documentation, this instruction was designed specifically for the spinlock scenario. The documented latency looks quite high (it is unclear whether this latency is exactly the delay before the next instruction executes).

    void _mm_pause (void)
    #include <emmintrin.h>
    Instruction: pause
    CPUID Flags: SSE2
    
    Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.

    Architecture | Latency | Throughput (CPI)
    -------------|---------|-----------------
    Skylake      | 140     | 140

    The resulting implementation looks like this:

    ATTEMPT_AGAIN:
      if (!acquire_lock())
      {
        /* Spin on pause max_spin_count times before backing off to sleep */
        for (int j = 0; j < max_spin_count; ++j)
        {
          /* pause intrinsic */
          _mm_pause();
          if (read_volatile_lock())
          {
            if (acquire_lock())
            {
              goto PROTECTED_CODE;
            }
          }
        }
    
        /* Pause loop didn't work, sleep now */
        Sleep(0);
        goto ATTEMPT_AGAIN;
      }
    PROTECTED_CODE:
      get_work();
      release_lock();
      do_work();
    

