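Preparation
All source code discussed in this article is based on Linux kernel 5.15.
Overview
When we read or write a file without direct I/O (DIO), the operation goes through the VFS and the page cache before it reaches the disk. As described in 【linux内核源码】io操作之read, the VFS reads from the page cache first; if the page is not cached, it builds a bio request and submits it to the block device. Writes work similarly: the VFS first writes the data into the page cache, and background threads later flush the dirty pages to disk. Because disks are much slower than memory, the page cache greatly speeds up file reads and writes. The side effect is that data not yet flushed to disk can be lost if the machine crashes, so users need to choose a page cache writeback policy that fits their workload.
Page cache kernel parameters
Before reading the kernel code, it helps to review several kernel parameters related to the page cache.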
>sysctl -a |grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
The meaning of the vm-related sysctl parameters is documented at https://www.kernel.org/doc/Documentation/sysctl/vm.txt
These parameters can be read directly with sysctl, or with cat /proc/sys/vm/dirty*. They mean the following:
dirty_background_bytes
Contains the amount of dirty memory at which the background kernel flusher threads will start writeback. Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.
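The absolute amount of dirty memory, in bytes, at which background writeback starts. The default is 0. Only one of dirty_background_bytes and dirty_background_ratio can be set at a time.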
dirty_background_ratio
Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.
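The background writeback threshold, expressed as a percentage of available memory (free pages plus reclaimable pages). When a user calls write and the amount of dirty data exceeds this threshold (or dirty_background_bytes), the kernel flusher threads (the successor of pdflush) are woken to write dirty data back, but the write call itself returns immediately without waiting. The flusher threads keep writing until dirty memory drops below the threshold.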
dirty_bytes
Analogous to dirty_background_bytes: the absolute byte counterpart of dirty_ratio. Only one of dirty_bytes and dirty_ratio can be set at a time.
dirty_ratio
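The hard limit on dirty data, expressed as a percentage of available memory. When a user calls write and the amount of dirty data is above this threshold (or dirty_bytes), the writing process itself pays the cost: in 5.15 it is paused inside balance_dirty_pages while the flusher threads write the data back, and it continues only once dirty memory is back under control. The process blocks here even when the buffered write was issued through AIO. As shown later in this article, throttling actually starts earlier, at the midpoint between the background threshold and this limit.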
dirty_expire_centisecs
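If dirty data has stayed in memory longer than this value, the flusher threads write it back to disk the next time they wake up. Default: 3000 (in units of 1/100 second, i.e. 30 s).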
dirty_writeback_centisecs
The wakeup interval of the flusher threads, which periodically write back dirty data older than dirty_expire_centisecs. Default: 500 (in units of 1/100 second, i.e. 5 s).
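To make the relationship between the ratio and bytes parameters concrete, here is a minimal userspace sketch of how the two thresholds could be derived from them. It is not kernel code: struct dirty_params, limit_pages and dirty_limits are hypothetical names, a 4 KiB page size is assumed, and the clamp of the background threshold follows my reading of the kernel's domain_dirty_limits. Details such as the boost for real-time tasks are left out.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096UL /* assumption: 4 KiB pages */

/* Hypothetical container for the four sysctl values (not a kernel struct). */
struct dirty_params {
    unsigned long background_bytes; /* vm.dirty_background_bytes */
    unsigned int  background_ratio; /* vm.dirty_background_ratio */
    unsigned long bytes;            /* vm.dirty_bytes */
    unsigned int  ratio;            /* vm.dirty_ratio */
};

/* A non-zero byte limit takes precedence over the percentage. */
static unsigned long limit_pages(unsigned long bytes, unsigned int ratio,
                                 unsigned long avail_pages)
{
    assert(ratio <= 100);
    if (bytes)
        return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES; /* round up */
    return (unsigned long)((uint64_t)avail_pages * ratio / 100);
}

/* Compute both thresholds, in pages, for a given amount of available memory. */
static void dirty_limits(const struct dirty_params *p, unsigned long avail_pages,
                         unsigned long *bg_thresh, unsigned long *thresh)
{
    *thresh = limit_pages(p->bytes, p->ratio, avail_pages);
    *bg_thresh = limit_pages(p->background_bytes, p->background_ratio,
                             avail_pages);
    /* Keep the background threshold below the hard limit (assumed clamp). */
    if (*bg_thresh >= *thresh)
        *bg_thresh = *thresh / 2;
}
```

With the defaults listed above (10 and 20) and 1,000,000 available pages, this yields a background threshold of 100,000 pages and a hard limit of 200,000 pages.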
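Writeback policies
From the parameters above, page cache writeback is triggered in three ways:
- Periodic writeback
- Writeback when dirty memory exceeds the background threshold
- Writeback when dirty memory exceeds the per-process limit; this case blocks the writing process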
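Periodic writeback
The interval of periodic writeback is controlled by dirty_writeback_centisecs and defaults to 5 s. Start with wb_init, which initializes a delayed work item whose callback, wb_workfn, performs the writeback.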
static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
gfp_t gfp)
{
int i, err;
memset(wb, 0, sizeof(*wb));
if (wb != &bdi->wb)
bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
INIT_LIST_HEAD(&wb->b_dirty_time);
spin_lock_init(&wb->list_lock);
atomic_set(&wb->writeback_inodes, 0);
wb->bw_time_stamp = jiffies;
wb->balanced_dirty_ratelimit = INIT_BW;
wb->dirty_ratelimit = INIT_BW;
wb->write_bandwidth = INIT_BW;
wb->avg_write_bandwidth = INIT_BW;
spin_lock_init(&wb->work_lock);
INIT_LIST_HEAD(&wb->work_list);
INIT_DELAYED_WORK(&wb->dwork, wb_workfn); // Initialize the delayed work used for periodic writeback; it calls wb_workfn to flush pages
// ...
}
wb_workfn is the writeback control function; it performs the writeback through wb_do_writeback.
void wb_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(to_delayed_work(work),
struct bdi_writeback, dwork);
long pages_written;
set_worker_desc("flush-%s", bdi_dev_name(wb->bdi));
current->flags |= PF_SWAPWRITE;
if (likely(!current_is_workqueue_rescuer() ||
!test_bit(WB_registered, &wb->state))) {
/*
* The normal path. Keep writing back @wb until its
* work_list is empty. Note that this path is also taken
* if @wb is shutting down even when we're running off the
* rescuer as work_list needs to be drained.
*/
do {
// Drain the queued work items, then run the periodic and background checks.
pages_written = wb_do_writeback(wb);
trace_writeback_pages_written(pages_written);
} while (!list_empty(&wb->work_list));
} else {
/*
* bdi_wq can't get enough workers and we're running off
* the emergency worker. Don't hog it. Hopefully, 1024 is
* enough for efficient IO.
*/
pages_written = writeback_inodes_wb(wb, 1024,
WB_REASON_FORKER_THREAD);
trace_writeback_pages_written(pages_written);
}
if (!list_empty(&wb->work_list))
wb_wakeup(wb);
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
wb_wakeup_delayed(wb);
current->flags &= ~PF_SWAPWRITE;
}
wb_do_writeback covers several cases, including periodic writeback and background writeback. Here is the code:
static long wb_do_writeback(struct bdi_writeback *wb)
{
struct wb_writeback_work *work;
long wrote = 0;
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
trace_writeback_exec(wb, work);
wrote += wb_writeback(wb, work);
finish_writeback_work(wb, work);
}
/*
* Check for a flush-everything request
*/
wrote += wb_check_start_all(wb);
/*
* Check for periodic writeback, kupdated() style
*/
wrote += wb_check_old_data_flush(wb); // check for periodic writeback
wrote += wb_check_background_flush(wb); // check for background writeback
clear_bit(WB_writeback_running, &wb->state);
return wrote;
}
The periodic check itself is implemented in wb_check_old_data_flush:
static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
unsigned long expired;
long nr_pages;
/*
* When set to zero, disable periodic writeback
*/
// Periodic writeback is not configured: return immediately
if (!dirty_writeback_interval)
return 0;
expired = wb->last_old_flush +
msecs_to_jiffies(dirty_writeback_interval * 10);
// The interval has not elapsed yet: return immediately
if (time_before(jiffies, expired))
return 0;
wb->last_old_flush = jiffies;
nr_pages = get_nr_dirty_pages();
// If there are dirty pages, build a writeback work item and write them back
if (nr_pages) {
struct wb_writeback_work work = {
.nr_pages = nr_pages,
.sync_mode = WB_SYNC_NONE,
.for_kupdate = 1,
.range_cyclic = 1,
.reason = WB_REASON_PERIODIC,
};
return wb_writeback(wb, &work);
}
return 0;
}
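- Check whether a writeback interval is configured
- Check whether the interval has elapsed
- Check whether there are any dirty pages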
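The three checks above can be modeled in a few lines of userspace C. This is only a sketch under simplifying assumptions: struct periodic_state and periodic_flush_due are hypothetical names, time is a plain millisecond counter instead of jiffies, and counter wraparound (which the kernel handles with time_before) is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the inputs the periodic check looks at. */
struct periodic_state {
    unsigned int  interval_cs;   /* dirty_writeback_centisecs; 0 disables */
    unsigned long last_flush_ms; /* time of the previous periodic check */
    unsigned long nr_dirty;      /* current number of dirty pages */
};

/* Returns true when a periodic writeback should be started at now_ms. */
static bool periodic_flush_due(struct periodic_state *s, unsigned long now_ms)
{
    assert(s != NULL);

    /* Check 1: an interval of zero disables periodic writeback. */
    if (!s->interval_cs)
        return false;

    /* Check 2: centiseconds * 10 gives milliseconds, as in the kernel code. */
    unsigned long expired = s->last_flush_ms + s->interval_cs * 10UL;
    if (now_ms < expired)
        return false;

    /* The timestamp is refreshed even when nothing is dirty. */
    s->last_flush_ms = now_ms;

    /* Check 3: only flush when there is something to write. */
    return s->nr_dirty > 0;
}
```

With the default interval of 500 centiseconds, a call at 4999 ms after the last flush returns false and a call at 5000 ms returns true, provided dirty pages exist.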
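Writeback when dirty memory exceeds the background threshold
After wb_do_writeback checks for periodic writeback, it calls wb_check_background_flush to decide whether background writeback is needed. The implementation is simple: wb_over_bg_thresh reports whether the background threshold is exceeded, and that result decides whether background writeback starts.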
static long wb_check_background_flush(struct bdi_writeback *wb)
{ // If the background threshold is exceeded, build a writeback work item
if (wb_over_bg_thresh(wb)) {
struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
.sync_mode = WB_SYNC_NONE,
.for_background = 1,
.range_cyclic = 1,
.reason = WB_REASON_BACKGROUND,
};
return wb_writeback(wb, &work);
}
return 0;
}
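Writeback when a writing process exceeds the threshold
The earlier article 【linux内核源码】io操作之read showed that a read eventually reaches generic_file_read_iter, which then performs the page cache read. By analogy, a user write should reach generic_file_write_iter and then write into the page cache. This article does not analyze the full sys_write path, only the part that writes into the page cache.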
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
ssize_t ret;
inode_lock(inode);
ret = generic_write_checks(iocb, from);
if (ret > 0)
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
return ret;
}
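As expected, there is indeed a function named generic_file_write_iter. To see the full sys_write call stack, trace write with ftrace as shown in 【linux内核源码】io操作之read. Skipping the rest of the write path, we jump straight to the generic_perform_write call inside __generic_file_write_iter. Here is its implementation: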
```
ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
{
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
do {
struct page *page;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
offset = (pos & (PAGE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_SIZE - offset,
iov_iter_count(i));
again:
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*/
if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}
if (fatal_signal_pending(current)) {
status = -EINTR;
break;
}
status = a_ops->write_begin(file, mapping, pos, bytes, flags,
&page, &fsdata);
if (unlikely(status < 0))
break;
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
copied = copy_page_from_iter_atomic(page, offset, bytes, i);
flush_dcache_page(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
if (unlikely(status != copied)) {
iov_iter_revert(i, copied - max(status, 0L));
if (unlikely(status < 0))
break;
}
cond_resched();
if (unlikely(status == 0)) {
/*
* A short copy made ->write_end() reject the
* thing entirely. Might be memory poisoning
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
if (copied)
bytes = copied;
goto again;
}
pos += status;
written += status;
balance_dirty_pages_ratelimited(mapping);
} while (iov_iter_count(i));
return written ? written : status;
}
```
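The offset and bytes arithmetic at the top of the loop decides how a write is cut into page-sized pieces. The following userspace sketch reproduces just that arithmetic; struct chunk, split_write and SKETCH_PAGE_SIZE are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Hypothetical record of one loop iteration. */
struct chunk {
    unsigned long offset; /* offset into the page cache page */
    unsigned long bytes;  /* bytes written to that page */
};

/*
 * Split a write of len bytes at file position pos into per-page chunks,
 * using the same offset/bytes arithmetic as the loop above.
 * Returns the number of chunks stored in out (at most max).
 */
static size_t split_write(unsigned long long pos, size_t len,
                          struct chunk *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    while (len > 0 && n < max) {
        unsigned long offset = (unsigned long)(pos & (SKETCH_PAGE_SIZE - 1));
        unsigned long bytes = SKETCH_PAGE_SIZE - offset;

        if (bytes > len)
            bytes = len; /* last chunk may be shorter than the page tail */
        out[n].offset = offset;
        out[n].bytes = bytes;
        n++;
        pos += bytes;
        len -= bytes;
    }
    return n;
}
```

For example, a 200-byte write at position 4000 touches two pages: 96 bytes at offset 4000 of the first page and 104 bytes at offset 0 of the next. In the kernel, each such chunk ends with a call to balance_dirty_pages_ratelimited.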
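The function first performs a series of checks and then copies the data from user space into the page cache. Around the copy there is an interesting call, flush_dcache_page. What is the dcache, and what does flushing it do? With those questions in mind, look at its implementation: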
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
static inline void flush_dcache_page(struct page *page)
{
}
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0
#endif
An empty function? That is even more puzzling. A quick search leads to this note in the kernel documentation:
If D-cache aliasing is not an issue, this routine may simply be defined as a nop on that architecture.
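Reading further (see iCache和dCache一致性): the icache and dcache are the CPU's instruction cache and data cache, and the problem comes from how the CPU indexes its cache lines. On architectures where D-cache aliasing cannot occur, this call is a no-op and needs no further handling.
Back in generic_perform_write, note the end of the loop: after each page is written, balance_dirty_pages_ratelimited checks the dirty page state. Let's go straight into that function to see how the check works.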
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
struct bdi_writeback *wb = NULL;
int ratelimit;
int *p;
if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
return;
if (inode_cgwb_enabled(inode))
wb = wb_get_create_current(bdi, GFP_KERNEL);
if (!wb)
wb = &bdi->wb;
ratelimit = current->nr_dirtied_pause;
if (wb->dirty_exceeded) // when set, lower the ratelimit so the dirty state is checked more often
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
preempt_disable();
/*
* This prevents one CPU to accumulate too many dirtied pages without
* calling into balance_dirty_pages(), which can happen when there are
* 1000+ tasks, all of them start dirtying pages at exactly the same
* time, hence all honoured too large initial task->nr_dirtied_pause.
*/
p = this_cpu_ptr(&bdp_ratelimits);
if (unlikely(current->nr_dirtied >= ratelimit)) // the current task reached its ratelimit: balancing will be triggered
*p = 0;
// The task is under its ratelimit, but this CPU reached the per-CPU limit ratelimit_pages (initial value 32 pages): also trigger
else if (unlikely(*p >= ratelimit_pages)) {
*p = 0;
ratelimit = 0;
}
/*
* Pick up the dirtied pages by the exited tasks. This avoids lots of
* short-lived tasks (eg. gcc invocations in a kernel build) escaping
* the dirty throttling and livelock other long-run dirtiers.
*/
// Pick up pages dirtied by exited tasks; if the current task's count plus those pages reaches ratelimit, balancing is triggered
p = this_cpu_ptr(&dirty_throttle_leaks);
if (*p > 0 && current->nr_dirtied < ratelimit) {
unsigned long nr_pages_dirtied;
nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
*p -= nr_pages_dirtied;
current->nr_dirtied += nr_pages_dirtied;
}
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit))
balance_dirty_pages(wb, current->nr_dirtied);
wb_put(wb);
}
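What triggers balance_dirty_pages
The check above leads to a balance_dirty_pages call in three situations:
- The current task's dirtied page count reaches ratelimit
- The current CPU's dirtied page count reaches ratelimit_pages
- The current task's dirtied page count plus the pages left behind by exited tasks reaches ratelimit
When any of these three conditions holds, balance_dirty_pages is called on the process's write path.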
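The three trigger conditions can be exercised in isolation with a small userspace model. This is a sketch, not kernel code: struct dirty_counters and needs_balance are hypothetical names, the per-CPU variables are replaced by plain fields, and the constant 8 assumes 4 KiB pages, where 32 >> (PAGE_SHIFT - 10) evaluates to 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-task and per-CPU counters. */
struct dirty_counters {
    long nr_dirtied;       /* pages dirtied by the current task */
    long nr_dirtied_pause; /* the task's own ratelimit */
    long cpu_dirtied;      /* models the per-CPU bdp_ratelimits counter */
    long cpu_leaked;       /* models dirty_throttle_leaks (exited tasks) */
    bool dirty_exceeded;   /* models wb->dirty_exceeded */
};

static long min_long(long a, long b)
{
    return a < b ? a : b;
}

/* Returns true when the model would call balance_dirty_pages. */
static bool needs_balance(struct dirty_counters *c, long ratelimit_pages)
{
    assert(c != NULL);

    long ratelimit = c->nr_dirtied_pause;
    if (c->dirty_exceeded)
        ratelimit = min_long(ratelimit, 8); /* assumes 4 KiB pages */

    if (c->nr_dirtied >= ratelimit) {
        c->cpu_dirtied = 0;                 /* condition 1: task limit */
    } else if (c->cpu_dirtied >= ratelimit_pages) {
        c->cpu_dirtied = 0;                 /* condition 2: per-CPU limit */
        ratelimit = 0;
    }

    /* Condition 3: charge pages leaked by exited tasks to this task. */
    if (c->cpu_leaked > 0 && c->nr_dirtied < ratelimit) {
        long n = min_long(c->cpu_leaked, ratelimit - c->nr_dirtied);
        c->cpu_leaked -= n;
        c->nr_dirtied += n;
    }

    return c->nr_dirtied >= ratelimit;
}
```

Note how the per-CPU case works: setting ratelimit to 0 makes the final comparison true for any task, so a CPU that accumulated many dirtied pages forces a balance check even when each individual task is still under its own limit.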
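Triggering forced writeback
Normally, periodic writeback and background writeback are enough to keep the dirty page ratio under the threshold, and writers are not blocked. There are exceptions, though. Here is the relevant part of balance_dirty_pages: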
/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
* the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
* If we're over `background_thresh' then the writeback threads are woken to
* perform some writeout.
*/
static void balance_dirty_pages(struct bdi_writeback *wb,
unsigned long pages_dirtied)
{
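// ... the function is long; other details are omitted
// Below this ceiling the task is not blocked; dirty_poll_interval only sets how many pages it may dirty before the next check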
if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
(!mdtc ||
m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
unsigned long intv;
unsigned long m_intv;
free_running:
intv = dirty_poll_interval(dirty, thresh);
m_intv = ULONG_MAX;
current->dirty_paused_when = now;
current->nr_dirtied = 0;
if (mdtc)
m_intv = dirty_poll_interval(m_dirty, m_thresh);
current->nr_dirtied_pause = min(intv, m_intv);
break;
}
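// ... some code omitted
// Above the ceiling, the pause time is derived from the current dirty state, and the task yields the CPU so the flusher threads can work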
dirty_ratelimit = READ_ONCE(wb->dirty_ratelimit);
task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
RATELIMIT_CALC_SHIFT;
max_pause = wb_max_pause(wb, sdtc->wb_dirty);
min_pause = wb_min_pause(wb, max_pause,
task_ratelimit, dirty_ratelimit,
&nr_dirtied_pause);
if (unlikely(task_ratelimit == 0)) {
period = max_pause;
pause = max_pause;
goto pause;
}
}
static unsigned long dirty_freerun_ceiling(unsigned long thresh,
unsigned long bg_thresh)
{
return (thresh + bg_thresh) / 2;
}
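As the function comment says, once the number of dirty pages exceeds (background_thresh + dirty_thresh) / 2, the midpoint between the background writeback threshold and the per-process limit, the current task is forced to wait. In the code, while the dirty page count is at or below dirty_freerun_ceiling(thresh, bg_thresh) the process is not blocked; only nr_dirtied_pause is adjusted so that the next check happens at a suitable point while background writeback proceeds. Once the count exceeds dirty_freerun_ceiling(thresh, bg_thresh), a pause interval is computed and the task gives up the CPU so the flusher threads can work. The process is blocked during that pause, so online services must watch how many dirty pages they generate to avoid long periods of latency jitter or unavailability.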
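A short worked example makes the midpoint rule tangible. The sketch below classifies a dirty page count against the two thresholds; enum dirty_zone and classify are hypothetical names introduced for illustration, while freerun_ceiling repeats the arithmetic of dirty_freerun_ceiling shown above.

```c
#include <assert.h>

/* Hypothetical labels for the three regions discussed in the text. */
enum dirty_zone {
    ZONE_CLEAN_ENOUGH,         /* at or below the background threshold */
    ZONE_BACKGROUND_WRITEBACK, /* flusher threads work, writer not paused */
    ZONE_THROTTLED             /* writer is paused */
};

/* Same arithmetic as dirty_freerun_ceiling. */
static unsigned long freerun_ceiling(unsigned long thresh,
                                     unsigned long bg_thresh)
{
    return (thresh + bg_thresh) / 2;
}

static enum dirty_zone classify(unsigned long dirty, unsigned long thresh,
                                unsigned long bg_thresh)
{
    assert(bg_thresh <= thresh);
    if (dirty > freerun_ceiling(thresh, bg_thresh))
        return ZONE_THROTTLED;
    if (dirty > bg_thresh)
        return ZONE_BACKGROUND_WRITEBACK;
    return ZONE_CLEAN_ENOUGH;
}
```

With the default ratios of 10 and 20, the ceiling sits at 15% of available memory: for thresholds of 100,000 and 200,000 pages, a writer starts being paused once more than 150,000 pages are dirty, well before the 20% limit is reached.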
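Summary
- For high-throughput workloads, set the writeback interval sensibly so that forced writeback does not cause long stalls on the write path.
- Disk throughput is far below what the CPU can generate; attaching multiple disks raises the machine's overall throughput and gives a better CPU-to-disk ratio.
- The same reasoning extends to Redis caching deployments: choose the CPU core count and memory size of the physical machines sensibly to improve overall utilization.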
Reference:
https://www.quora.com/Is-pdflush-poorly-designed-in-Linux
linux_perf_and_tuning_IBM
cachetlb
Why flush_dcache_page() does nothing in linux kernel?
iCache和dCache一致性
Linux 性能优化之 IO 子系统
Optimizing subsystem throughput in Red Hat Enterprise Linux 6