Performance Optimization: sleep, sched_yield, and Busy-Waiting

Author: 耐寒 | Published 2023-03-10 17:54

    Recently I helped a company optimize their quantitative trading system, which contained the following code:

    void xxxx::MonitorThread()
    {
        while (m_running)
        {
            MonitorOrders();   // drain the pending order queue
            Sleep(0);          // give up the rest of this time slice
        }
    }
    
    

    The order-monitoring thread calls Sleep(0). The design is a tight loop: drain the orders in the queue, then call Sleep(0) to give up the CPU so that other runnable threads get a chance to execute.
    This Sleep(0) pattern is used heavily throughout the system. Is it an appropriate design?
    In quantitative trading, as everyone knows, speed beats everything.
    There is actually another system call, sched_yield, that also relinquishes the CPU. Its man page describes it as follows:

    NAME
           sched_yield - yield the processor
    
    SYNOPSIS
           #include <sched.h>
    
           int sched_yield(void);
    
    DESCRIPTION
           sched_yield() causes the calling thread to relinquish the CPU.  The thread is moved to the end of the queue for its static priority and a new thread gets to run.
    
    RETURN VALUE
           On success, sched_yield() returns 0.  On error, -1 is returned, and errno is set appropriately.
    
    ERRORS
           In the Linux implementation, sched_yield() always succeeds.
    
    CONFORMING TO
           POSIX.1-2001, POSIX.1-2008.
    
    NOTES
           If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield().
    
           POSIX systems on which sched_yield() is available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
    
           Strategic  calls to sched_yield() can improve performance by giving other threads or processes a chance to run when (heavily) contended resources (e.g., mutexes) have been released by the caller.  Avoid calling sched_yield() unnecessarily
           or inappropriately (e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.
    
           sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR).  Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application
           design is broken.
    

    If the calling thread is the only thread at the highest priority, it will simply continue to run after calling sched_yield().
    Now let's compare the performance of sched_yield() and sleep(0):

    root@iZ2zefnvk8kwih8l62w90yZ:/data# more test.c
    #include <sched.h>
    #include <unistd.h>
    
    int main(int argc, char **argv) {
        for (int i = 0; i < 100000; i++) {
            //sleep(0);        /* first build: 100k calls to sleep(0) */
            sched_yield();     /* second build: 100k calls to sched_yield() */
        }
        return 0;
    }
    
    
    root@iZ2zefnvk8kwih8l62w90yZ:/data# time ./test  
    
    real    0m6.186s
    user    0m0.092s
    sys 0m0.460s
    root@iZ2zefnvk8kwih8l62w90yZ:/data# time ./test
    
    real    0m0.043s
    user    0m0.012s
    sys 0m0.031s
    

    0.043s versus 6.186s is a striking gap (the slow run was built with sleep(0), the fast one with sched_yield()). Where does it come from?
    With sleep(), the kernel performs a full reschedule: the task is removed from the run queue (a red-black tree under CFS) and placed on a wait queue, then re-enqueued when it wakes, and that round trip is what costs the time.
    What this design actually wants is for the order-processing thread to keep running, ideally without ever giving up the CPU. That is, in effect, "busy-waiting". Let's look at how the multi-threaded I/O code introduced in Redis 6.0 busy-waits:

    void *IOThreadMain(void *myid) {
        /* The ID is the thread number (from 0 to server.iothreads_num-1), and is
         * used by the thread to just manipulate a single sub-array of clients. */
        long id = (unsigned long)myid;
        char thdname[16];
    
        snprintf(thdname, sizeof(thdname), "io_thd_%ld", id);
        redis_set_thread_title(thdname);
        redisSetCpuAffinity(server.server_cpulist);
        makeThreadKillable();
    
        while(1) {
            /* Wait for start */
            for (int j = 0; j < 1000000; j++) {
                if (getIOPendingCount(id) != 0) break;
            }
    
            /* Give the main thread a chance to stop this thread. */
            if (getIOPendingCount(id) == 0) {
                pthread_mutex_lock(&io_threads_mutex[id]);
                pthread_mutex_unlock(&io_threads_mutex[id]);
                continue;
            }
    
            serverAssert(getIOPendingCount(id) != 0);
    
            /* Process: note that the main thread will never touch our list
             * before we drop the pending count to 0. */
            listIter li;
            listNode *ln;
            listRewind(io_threads_list[id],&li);
            while((ln = listNext(&li))) {
                client *c = listNodeValue(ln);
                if (io_threads_op == IO_THREADS_OP_WRITE) {
                    writeToClient(c,0);
                } else if (io_threads_op == IO_THREADS_OP_READ) {
                    readQueryFromClient(c->conn);
                } else {
                    serverPanic("io_threads_op value is unknown");
                }
            }
            listEmpty(io_threads_list[id]);
            setIOPendingCount(id, 0);
        }
    }
    

    Specifically, note this part:

            /* Wait for start */
            for (int j = 0; j < 1000000; j++) {
                if (getIOPendingCount(id) != 0) break;
            }
    

    In other words, the thread keeps the CPU busy until it sees a non-zero pending I/O count, or until the loop has spun one million times.

    How is this handled in a typical spin_lock implementation?

    • Simply checking the spinlock in a tight loop hogs the CPU.
    • Using sleep(0) or sched_yield() triggers a ring 3 -> ring 0 context switch on every call, which makes the latency very high.
    • The best approach is to check the spinlock for a few rounds, then switch out with sleep(0) or sched_yield(), and repeat.
      While checking the spinlock, the pause instruction (the _mm_pause intrinsic) can be used. It tells the CPU that the upcoming instructions are a spinlock check, so it does not need to run them at full speed, e.g. by completely filling the pipeline; this can also save about 4% of power. Execution after this instruction may be delayed for a while, but the length of that delay is not controllable and is entirely up to the CPU, so we must not rely on or assume any particular delay.

    Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.
    The pause instruction can be used in conjunction with a Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and the performance may benefit from a short spin in ring 3. It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count. Since you cannot guarantee the cycle count from one system to the next, you should check the lock in between each pause to avoid introducing unnecessarily long delays on new systems.

    According to Intel's documentation, this instruction was designed specifically for the spinlock scenario. The documented latency looks quite high (it is unclear whether this latency is exactly the delay before the next instruction executes).

    void _mm_pause (void)
    #include <emmintrin.h>
    Instruction: pause
    CPUID Flags: SSE2
    
    Provide a hint to the processor that the code sequence is a spin-wait loop. This can help improve the performance and power consumption of spin-wait loops.

    Architecture | Latency | Throughput (CPI)
    -------------|---------|-----------------
    Skylake      | 140     | 140

    The resulting implementation looks like this:

    ATTEMPT_AGAIN:
      if (!acquire_lock())
      {
        /* Spin on pause max_spin_count times before backing off to sleep */
        for (int j = 0; j < max_spin_count; ++j)
        {
          /* pause intrinsic */
          _mm_pause();
          if (read_volatile_lock())
          {
            if (acquire_lock())
            {
              goto PROTECTED_CODE;
            }
          }
        }
    
        /* Pause loop didn't work, sleep now */
        Sleep(0);
        goto ATTEMPT_AGAIN;
      }
    PROTECTED_CODE:
      get_work();
      release_lock();
      do_work();
    

