Lec 8 - Lab 5 Copy-on-Write Fork

作者: 西部小笼包 | 来源:发表于2023-11-17 20:03 被阅读0次

技术思想
Redis持久化常见问题
redis 持久化AOF
fork 进程测试 copy-on-write
连载 | 机器学习基石 Lec 5:Growth Functio
Lec 5 E-R module
Liz雅思小作文: Lecture 5
2020-10-26-Git的简单应用
专业词汇总结
讲解：Stat 20、R、Rmd file、RDatabase|

Interrupt硬件部分

中断对应的场景很简单，就是硬件想要得到操作系统的关注。例如网卡收到了一个packet，网卡会生成一个中断；用户通过键盘按下了一个按键，键盘会产生一个中断.
3个差别: 异步,Interrupt handler与当前运行的进程在CPU上没有任何关联; 并行,设备handler 和 CPU 同时运行. 设备编程,每个设备都有一个编程手册.

image.png

从左上角可以看出，我们有53个不同的来自于设备的中断。这些中断到达PLIC之后，PLIC会路由这些中断。图的右下角是CPU的核，PLIC会将中断路由到某一个CPU的核.

PLIC会通知当前有一个待处理的中断
其中一个CPU核会Claim接收中断，这样PLIC就不会把中断发给其他的CPU处理
CPU核处理完中断之后，CPU会通知PLIC
PLIC将不再保存中断的信息

设备驱动概述

管理设备的代码称为驱动.
UART设备的驱动，代码在uart.c文件中.如果我们查看代码的结构，我们可以发现大部分驱动都分为两个部分，bottom/top。

bottom

bottom部分通常是Interrupt handler.nterrupt handler并不运行在任何特定进程的context中，它只是处理中断

// handle a uart interrupt, raised because input has
// arrived, or the uart is ready for more output, or
// both. called from devintr().
void
uartintr(void)
{
  // read and process incoming characters.
  while(1){
    int c = uartgetc();
    if(c == -1)
      break;
    consoleintr(c);
  }

  // send buffered characters.
  acquire(&uart_tx_lock);
  uartstart();
  release(&uart_tx_lock);
}

top

top部分，是用户进程，或者内核的其他部分去调用的接口
比如：

// alternate version of uartputc() that doesn't 
// use interrupts, for use by kernel printf() and
// to echo characters. it spins waiting for the uart's
// output register to be empty.
void
uartputc_sync(int c)
{
  push_off();

  if(panicked){
    for(;;)
      ;
  }

  // wait for Transmit Holding Empty to be set in LSR.
  while((ReadReg(LSR) & LSR_TX_IDLE) == 0)
    ;
  WriteReg(THR, c);

  pop_off();
}

通常情况下，驱动中会有一些队列（或者说buffer），top部分的代码会从队列中读写数据，而Interrupt handler（bottom部分）同时也会向队列中读写数据.
Interrupt handler来说存在一些限制，因为它并没有运行在任何进程的context中，所以进程的page table并不知道该从哪个地址读写数据，也就无法直接从Interrupt handler读写数据.

如何对设备进行编程

操作系统需要知道这些设备位于物理地址空间的具体位置，然后再通过普通的load/store指令对这些地址进行编程。load/store指令实际上的工作就是读写设备的控制寄存器。

例如，对网卡执行store指令时，CPU会修改网卡的某个控制寄存器，进而导致网卡发送一个packet。所以这里的load/store指令不会读写内存，而是会操作设备

Console是如何显示出$ ls

$ 是Shell程序的输出，而“ls”是用户通过键盘输入之后再显示出来的.
实际上就是设备会将字符传输给UART的寄存器，UART之后会在发送完字符之后产生一个中断.线路的另一端会有另一个UART芯片（模拟的），这个UART芯片连接到了虚拟的Console，它会进一步将$ 显示在console上.

对于“ls”,当你在键盘上按下一个按键，UART芯片会将按键字符通过串口线发送到另一端的UART芯片.另一端的UART芯片先将数据bit合并成一个Byte，之后再产生一个中断，并告诉处理器说这里有一个来自于键盘的字符.

SIE（Supervisor Interrupt Enable）寄存器。这个寄存器中有一个bit（E）专门针对例如UART的外部设备的中断.
SSTATUS（Supervisor Status）寄存器。这个寄存器中有一个bit来打开或者关闭中断。每一个CPU核都有独立的SIE和SSTATUS寄存器
SIP（Supervisor Interrupt Pending）寄存器。当发生中断时，处理器可以通过查看这个寄存器知道当前是什么类型的中断。

我们第一个外设是console，这是我们print的输出位置。查看位于console.c的consoleinit函数。

image.png

uartinit函数位于uart.c文件。这个函数实际上就是配置好UART芯片使其可以被使用

image.png
运行完这个函数之后，原则上UART就可以生成中断了。但是因为我们还没有对PLIC编程，所以中断不能被CPU感知。最终，在main函数中，需要调用plicinit函数。下图是plicinit函数。

void
plicinit(void)
{
  // set desired IRQ priorities non-zero (otherwise disabled).
  *(uint32*)(PLIC + UART0_IRQ*4) = 1;
  *(uint32*)(PLIC + VIRTIO0_IRQ*4) = 1;
}

代码的第一行使能响应UART的中断，这里实际上就是设置PLIC会接收哪些中断，进而将中断路由到CPU。
代码的第二行设置PLIC接收来自IO磁盘的中断。
plicinit之后就是plicinithart函数。plicinit是由0号CPU运行，之后，每个CPU的核都需要调用plicinithart函数表明对于哪些外设中断感兴趣。

void
plicinithart(void)
{
  int hart = cpuid();
  
  // set enable bits for this hart's S-mode
  // for the uart and virtio disk.
  *(uint32*)PLIC_SENABLE(hart) = (1 << UART0_IRQ) | (1 << VIRTIO0_IRQ);

  // set this hart's S-mode priority threshold to 0.
  *(uint32*)PLIC_SPRIORITY(hart) = 0;
}

在plicinithart函数中，每个CPU的核都表明自己对来自于UART和VIRTIO的中断感兴趣。

到目前为止，我们有了生成中断的外部设备，我们有了PLIC可以传递中断到单个的CPU。但是CPU自己还没有设置好接收中断，因为我们还没有设置好SSTATUS寄存器.
在实际运行进程之前，会执行intr_on函数来使得CPU能接收中断。intr_on函数只完成一件事情，就是设置SSTATUS寄存器，打开中断标志位。

// enable device interrupts
static inline void
intr_on()
{
  w_sstatus(r_sstatus() | SSTATUS_SIE);
}

UART驱动的top部分

如何从Shell程序输出提示符“$ ”到Console.
在 init.c 的 main函数创建了一个代表Console的设备。这里通过mknod操作创建了console设备。因为这是第一个打开的文件，所以这里的文件描述符0。之后通过dup创建stdout和stderr。这里实际上通过复制文件描述符0，得到了另外两个文件描述符1，2。最终文件描述符0，1，2都用来代表Console。

if(open("console", O_RDWR) < 0){
    mknod("console", CONSOLE, 0);
    open("console", O_RDWR);
  }
  dup(0);  // stdout
  dup(0);  // stderr

尽管Console背后是UART设备，但是从应用程序来看，它就像是一个普通的文件。Shell程序只是向文件描述符2写了数据。

 else if(f->type == FD_DEVICE){
    if(f->major < 0 || f->major >= NDEV || !devsw[f->major].write)
      return -1;
    ret = devsw[f->major].write(1, addr, n);
  }

在filewrite函数中首先会判断文件描述符的类型。mknod生成的文件描述符属于设备（FD_DEVICE），而对于设备类型的文件描述符，我们会为这个特定的设备执行设备相应的write函数。因为我们现在的设备是Console，所以我们知道这里会调用console.c中的consolewrite函数。

void
consoleinit(void)
{
  initlock(&cons.lock, "cons");

  uartinit();

  // connect read and write system calls
  // to consoleread and consolewrite.
  devsw[CONSOLE].read = consoleread;
  devsw[CONSOLE].write = consolewrite;
}

//
// user write()s to the console go here.
//
int
consolewrite(int user_src, uint64 src, int n)
{
  int i;

  for(i = 0; i < n; i++){
    char c;
    if(either_copyin(&c, user_src, src+i, 1) == -1)
      break;
    uartputc(c);
  }

  return i;
}

这里先通过either_copyin将字符拷入，之后调用uartputc函数。uartputc函数将字符写入给UART设备，所以你可以认为consolewrite是一个UART驱动的top部分。uart.c文件中的uartputc函数会实际的打印字符。

void
uartputc(int c)
{
  acquire(&uart_tx_lock);

  if(panicked){
    for(;;)
      ;
  }
  while(uart_tx_w == uart_tx_r + UART_TX_BUF_SIZE){
    // buffer is full.
    // wait for uartstart() to open up space in the buffer.
    sleep(&uart_tx_r, &uart_tx_lock);
  }
  uart_tx_buf[uart_tx_w % UART_TX_BUF_SIZE] = c;
  uart_tx_w += 1;
  uartstart();
  release(&uart_tx_lock);
}

在UART的内部会有一个buffer用来发送数据，buffer的大小是32个字符。同时还有一个为consumer提供的读指针和为producer提供的写指针，来构建一个环形的buffer.。如果读写指针相同，那么buffer是空的，如果写指针加1等于读指针，那么buffer满了.

Shell是producer，所以需要调用uartputc函数.因为提示符“$”是我们送出的第一个字符。所以代码会走到while外面，字符会被送到buffer中，更新写指针，之后再调用uartstart函数。

// if the UART is idle, and a character is waiting
// in the transmit buffer, send it.
// caller must hold uart_tx_lock.
// called from both the top- and bottom-half.
void
uartstart()
{
  while(1){
    if(uart_tx_w == uart_tx_r){
      // transmit buffer is empty.
      return;
    }
    
    if((ReadReg(LSR) & LSR_TX_IDLE) == 0){
      // the UART transmit holding register is full,
      // so we cannot give it another byte.
      // it will interrupt when it's ready for a new byte.
      return;
    }
    
    int c = uart_tx_buf[uart_tx_r % UART_TX_BUF_SIZE];
    uart_tx_r += 1;
    
    // maybe uartputc() is waiting for space in the buffer.
    wakeup(&uart_tx_r);
    
    WriteReg(THR, c);
  }
}

uartstart就是通知设备执行操作。首先是检查当前设备是否空闲，如果空闲的话，我们会从buffer中读出数据，然后将数据写入到THR（Transmission Holding Register）发送寄存器。这里相当于告诉设备，我这里有一个字节需要你来发送。

与此同时，UART设备会将数据送出。在某个时间点，我们会收到中断，因为我们之前设置了要处理UART设备中断。接下来我们看一下，当发生中断时，实际会发生什么。

UART驱动的bottom部分

在trap.c的devintr函数中，首先会通过SCAUSE寄存器判断当前中断是否是来自于外设的中断。如果是的话，再调用plic_claim函数来获取中断。

int
devintr()
{
  uint64 scause = r_scause();

  if((scause & 0x8000000000000000L) &&
     (scause & 0xff) == 9){
    // this is a supervisor external interrupt, via PLIC.

    // irq indicates which device interrupted.
    int irq = plic_claim();

    if(irq == UART0_IRQ){
      uartintr();
    } else if(irq == VIRTIO0_IRQ){
      virtio_disk_intr();
    } else if(irq){
      printf("unexpected interrupt irq=%d\n", irq);
    }

plic_claim函数位于plic.c文件中。在这个函数中，当前CPU核会告知PLIC，自己要处理中断，PLIC_CLAIM会将中断号返回，对于UART来说，返回的中断号是10。

// handle a uart interrupt, raised because input has
// arrived, or the uart is ready for more output, or
// both. called from devintr().
void
uartintr(void)
{
  // read and process incoming characters.
  while(1){
    int c = uartgetc();
    if(c == -1)
      break;
    consoleintr(c);
  }

  // send buffered characters.
  acquire(&uart_tx_lock);
  uartstart();
  release(&uart_tx_lock);
}

int
uartgetc(void)
{
  if(ReadReg(LSR) & 0x01){
    // input data is ready.
    return ReadReg(RHR);
  } else {
    return -1;
  }
}

为我们现在还没有通过键盘输入任何数据，所以UART的接受寄存器现在为空。代码会直接运行到uartstart函数，这个函数会将Shell存储在buffer中的任意字符送出。实际上在提示符“$”之后，Shell还会输出一个空格字符，write系统调用可以在UART发送提示符“$”的同时，并发的将空格字符写入到buffer中。所以UART的发送中断触发时，可以发现在buffer中还有一个空格字符，之后会将这个空格字符送出。
UART连接了两个设备，一个是键盘，另一个是显示设备，也就是Console。

UART读取键盘输入

有时Shell会调用read从键盘中读取字符。在read系统调用的底层，会调用fileread函数。在这个函数中，如果读取的文件类型是设备，会调用相应设备的read函数。

int
consoleread(int user_dst, uint64 dst, int n)
{
  uint target;
  int c;
  char cbuf;

  target = n;
  acquire(&cons.lock);
  while(n > 0){
    // wait until interrupt handler has put some
    // input into cons.buffer.
    while(cons.r == cons.w){
      if(killed(myproc())){
        release(&cons.lock);
        return -1;
      }
      sleep(&cons.r, &cons.lock);
    }

    c = cons.buf[cons.r++ % INPUT_BUF_SIZE];

    if(c == C('D')){  // end-of-file
      if(n < target){
        // Save ^D for next time, to make sure
        // caller gets a 0-byte result.
        cons.r--;
      }
      break;
    }
    // copy the input byte to the user-space buffer.
    cbuf = c;
    if(either_copyout(user_dst, dst, &cbuf, 1) == -1)
      break;

    dst++;
    --n;

    if(c == '\n'){
      // a whole line has arrived, return to
      // the user-level read().
      break;
    }
  }
  release(&cons.lock);

  return target - n;
}

所以在某个时间点，假设用户通过键盘输入了“l”，这会导致“l”被发送到主板上的UART芯片，产生中断之后再被PLIC路由到某个CPU核，之后会触发devintr函数，devintr可以发现这是一个UART中断，然后通过uartgetc函数获取到相应的字符，之后再将字符传递给consoleintr函数。

// read one input character from the UART.
// return -1 if none is waiting.
int
uartgetc(void)
{
  if(ReadReg(LSR) & 0x01){
    // input data is ready.
    return ReadReg(RHR);
  } else {
    return -1;
  }
}

// handle a uart interrupt, raised because input has
// arrived, or the uart is ready for more output, or
// both. called from devintr().
void
uartintr(void)
{
  // read and process incoming characters.
  while(1){
    int c = uartgetc();
    if(c == -1)
      break;
    consoleintr(c);
  }

  // send buffered characters.
  acquire(&uart_tx_lock);
  uartstart();
  release(&uart_tx_lock);
}

//
// the console input interrupt handler.
// uartintr() calls this for input character.
// do erase/kill processing, append to cons.buf,
// wake up consoleread() if a whole line has arrived.
//
void
consoleintr(int c)
{
  acquire(&cons.lock);

  switch(c){
  case C('P'):  // Print process list.
    procdump();
    break;
  case C('U'):  // Kill line.
    while(cons.e != cons.w &&
          cons.buf[(cons.e-1) % INPUT_BUF_SIZE] != '\n'){
      cons.e--;
      consputc(BACKSPACE);
    }
    break;
  case C('H'): // Backspace
  case '\x7f': // Delete key
    if(cons.e != cons.w){
      cons.e--;
      consputc(BACKSPACE);
    }
    break;
  default:
    if(c != 0 && cons.e-cons.r < INPUT_BUF_SIZE){
      c = (c == '\r') ? '\n' : c;

      // echo back to the user.
      consputc(c);

      // store for consumption by consoleread().
      cons.buf[cons.e++ % INPUT_BUF_SIZE] = c;

      if(c == '\n' || c == C('D') || cons.e-cons.r == INPUT_BUF_SIZE){
        // wake up consoleread() if a whole line (or end-of-file)
        // has arrived.
        cons.w = cons.e;
        wakeup(&cons.r);
      }
    }
    break;
  }
  
  release(&cons.lock);
}

字符会通过consputc，输出到console上给用户查看。之后，字符被存放在buffer中。在遇到换行符的时候，唤醒之前sleep的进程，也就是Shell，再从buffer中将数据读出。
所以这里也是通过buffer将consumer和producer之间解耦，这样它们才能按照自己的速度，独立的并行运行。如果某一个运行的过快了，那么buffer要么是满的要么是空的，consumer和producer其中一个会sleep并等待另一个追上来。

Lab: Copy-on-Write Fork for xv6

这个lab还是非常耗时的，因为难在他是第一次有了race condition。如果一开始没有想清楚，多个进程并发的情况，就很容易陷入痛苦的debug中。但想清楚后，核心代码非常简洁。
代码分为2部分。

第一部分，是对每个物理页分配一个引用计数器。然后更新kalloc.c 每个函数，去正确配置他们，因为会涉及多进程并行操作，所以需要上锁。

1700306424947.png

1700306445830.png

第二部分，是我们需要在uvmcopy中（fork调用），针对有写权限的页，进行抹除父子页上的写权限标志。然后添加COW标志。同时增加该页指向的物理地址的引用计数。然后在发生page fault中断，或copyout时，进行写时复制。

1700306480744.png

1700306492500.png

1700306506697.png

optional challenge

Measure how much your COW implementation reduces the number of bytes xv6 copies and the number of physical pages it allocates. Find and exploit opportunities to further reduce those numbers.

我们可以看到上面的做法，保证了线程安全性，但是我们写一个函数去统计节约的页面和字节复制的开销，其实是不够理想的。我们先在kalloc里定义2个新函数。

1700306664943.png

创建一个新的系统调用

1700306700154.png

然后我们在UVMCOPY，可以去节约页面复制开销，但是在copyonwrite的时候，是要把节约的开销给还回去。

1700306847476.png

我们可以看到在simpletest的时候，是节约了不少开销。但是在做three test的时候，我们不但没有节约，反而还比之前付出了更多的页面开销。

1700307109412.png

这里进一步的优化空间就在于，当引用计数降为1时，我们依然保持了kalloc 加 kfree的策略。也就是说我们有2个进程，做fork的时候，假设节约了4页。但是这4页之后都被父进程和子进程写的话。父子进程都会进到copyonwrite里，kalloc+kfree，这样复制了8页，反而消耗更多了。

简单的加一个读取引用计数，判断是否为1，如果为1，就不alloc, 直接设置WRITE权限的做法，会造成测试不通过。这里我调了很久，又是race condition的问题。

问题如下：
父进程读取当前引用计数为2；决定采取kalloc+kfree的方法。然后schedule到子进程
子进程读取当前引用计数为2；决定采取kalloc+kfree的方法。然后schedule到父进程
然后调用完kfree 2次，其中第二次，还是把原先的那页给free了。所以依然会有这个问题。

为了解决上述问题：
我们需要保证当一个进程在读取引用计数并且决定kalloc时，另一个进程在读取引用计数时拿不到锁。当对面搞定了，才会把锁释放。然后这样就不会发生上述情况了。当然里面还有非常多的细节要考虑。比如kalloc 内存分配不够。我的改动大概如下：

新加了一个kalloc_cow，专门为copyonwrite函数调度。里面会处理，如果引用计数是1，就是不复制，直接返回之前的地址，同时赋值WRITE权限。

1700307924691.png
修改copyonwrite函数，根据返回的地址新旧走不同的逻辑。

1700308041682.png

我们来看下新方法的效果：

1700308073259.png

可以看到这次的节约是递增的，就是我们最坏情况也是很原来的分配页面开销一样。不会出现反而要花更多开销的情况。

进一步优化，实现lazy allocation

1700308253485.png

1700308270098.png

1700308418901.png

1700308464160.png

1700308502844.png

我们再来看一下新的效果：

首先确保功能性测试都能通过：

1700308575865.png
1700308844216.png

simpletest 没有大的提升主要是因为sbrk之后，几乎把每个页都写了。把写的逻辑注释掉，我们可以再看下效果：

1700308900038.png

1700308917007.png

技术思想
惰性删除（redis过期删除） copy-on-write：redis rdb的时候，fork子进程bgsave...
Redis持久化常见问题
Fork操作定义：调用fork函数给当前正在运行的进程创建一个子进程。根据copy-on-write（写时复制）...
redis 持久化AOF
RDB 有什么问题耗时耗性能o(n)数据：耗时fork():销毁内存，copy-on-write策略Disk...
fork 进程测试 copy-on-write
父进程 fork 子进程后，子进程通过 copy-on-write 模式获得父进程内存，也就是子进程共用了大部分父...
连载 | 机器学习基石 Lec 5:Growth Functio
Lec5：Training versus Testing tips：符号含义参照Lec1 上一节中我们得到在一些条...
Lec 5 E-R module
Lec 5 E-R model Entity Sets Entity with several attribute...
Liz雅思小作文: Lecture 5
IELTS Liz Lecture 5 Maps 例题： Example_of_Lec_5.png 不常见但是的确...
2020-10-26-Git的简单应用
前提: 注册一个Github帐号成功fork工作室的仓库到个人的仓库(singularity-lab/blog ...
专业词汇总结
2018-8-12 PPA: Power，Time, Area 1.LEC：Logic Equivalence...
讲解：Stat 20、R、Rmd file、RDatabase|
2019/5/10 Lab2.utf8https://ccle.ucla.edu/pluginfile.php/2...

Lec 8 - Lab 5 Copy-on-Write Fork

Interrupt硬件部分

设备驱动概述

bottom

top

如何对设备进行编程

Console是如何显示出$ ls

UART驱动的top部分

UART驱动的bottom部分

UART读取键盘输入

Lab: Copy-on-Write Fork for xv6

optional challenge

进一步优化，实现lazy allocation

相关文章

技术思想

Redis持久化常见问题

redis 持久化AOF

fork 进程测试 copy-on-write

连载 | 机器学习基石 Lec 5:Growth Functio

Lec 5 E-R module

Liz雅思小作文: Lecture 5

2020-10-26-Git的简单应用

专业词汇总结

讲解：Stat 20、R、Rmd file、RDatabase|

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读