关于page cache的性能实验数据

作者: 谭英智 | 来源:发表于2022-11-25 19:19 被阅读0次

Page Cache
kafka读写高效的总结
Rocket MQ
The Page Cache and Page Writebac
write()函数
Kafka专题：1.kafka高性能的原因
Page Cache
Page Cache
Page Claim
【CentOS】关于CentOS7.x free内存查看

page cache局部性

行优先VS列优先

代码

int array[n][n];

//行优先
for (int i = 0; i < n; ++i)
  for (int j = 0; j < n; ++j)
    array[i][j] += j;

//列优先
for (int i = 0; i < n; ++i)
  for (int j = 0; j < n; ++j)
    array[j][i] += j;

//乱序访问
for (int i = 0; i < n; ++i)
  for (int j = 0; j < n; ++j)
    array[rand()%n][rand()%n] += j;

行优先

row_tra

列优先

col_tra

乱序访问（链表）

list_tra

性能差异

performance

分析结论

L1 cache的cache line大小是64 bytes = 64 * 8 bits。每次cpu加载一个变量时，会一次性的加载一个cache line大小的内存到L1 cache。

行优先遍历数组，由于每次加载的cache line数据都能命中缓存，因此效率高

而列优先遍历，由于每次读完L1 cache变量后，读下一个变量有可能发现cache miss，需要重新从内存中加载，导致消息低

乱序访问是每次读去都会cache miss

访问数据的顺序

当我们有三个变量，cpu访问的顺序有可能有下面两种

var_cross

var_order

第一种是不断的轮训访问三个变量，但是当再次访问变量时，有可能cache就会被踢走，需要重新加载变量到cache中。

比较高效的方法是，有序的访问三个变量，这样就能一直保持变量在cache中，直到处理完为止

伪共享与对其cache line

多线程共享一个原子变量

代码

void work(std::atomic<int>& a) {
  for(int i = 0; i < 100000; ++i) {
    a++;
  }
}
void directSharing() {
  // Create a single atomic integer
  std::atomic<int> a;
  a = 0;

  // Four threads all sharing one atomic<int>
  std::thread t1([&]() {work(a)});
  std::thread t2([&]() {work(a)});
  std::thread t3([&]() {work(a)});
  std::thread t4([&]() {work(a)});

  // Join the 4 threads
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}

性能

share_one_perf

可以看到随着线程数的增加，性能会出现下降

原子变量在不同core之间的流转过程

初始状态

cache_1

Core1 core2读取原子变量a，并加载到cache

cache_2

Core1 修改a变量，并另core2的变量失效

cache_3

core2重新加载a变量

cache_4

分析与结论

原子变量在不同core之间，由于需要不断的令cache失效，又重新加载修改，造成随着分享的core越多，开销也越大

多线程发生伪共享

代码

void work(std::atomic<int>& a) {
  for(int i = 0; i < 100000; ++i) {
    a++;
  }
}
void falseSharing() {
  // Create a single atomic integer
  std::atomic<int> a;
  a = 0;
  std::atomic<int> b;
  b = 0;
  std::atomic<int> c;
  c = 0;
  std::atomic<int> d;
  d = 0;

  // Four threads each with their own atomic<int>
  std::thread t1([&]() {work(a)});
  std::thread t2([&]() {work(b)});
  std::thread t3([&]() {work(c)});
  std::thread t4([&]() {work(d)});

  // Join the 4 threads
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}

性能

false_share

可以看到效果跟共享一个原子变量没有什么差异

分析与结论

这是因为变量的失效是以cache line为单位的，

变量a b c d，他们都位于同一个cache linezhong，所以只要有一个失效，其他变量也跟着失效

解决

对齐cache line

struct alignas(64) AlignedType {
  AlignedType() { val = 0; }
  std::atomic<int> val;
};
void noSharing() {
  AlignedType a{};
  AlignedType b{};
  AlignedType c{};
  AlignedType d{};
 
  // Launch the four threads now using our aligned data
  std::thread t1([&]() { work(a.val); });
  std::thread t2([&]() { work(b.val); });
  std::thread t3([&]() { work(c.val); });
  std::thread t4([&]() { work(d.val); });
 
  // Join the threads
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}