Lec 14,15 文件系统

作者: 西部小笼包 | 来源:发表于2024-02-15 17:26 被阅读0次

连载 | 机器学习基石 Lec 7：VC Dimension 及
连载 | 机器学习基石 Lec 9: 线性回归
连载 | 机器学习基石 Lec 5:Growth Functio
第55次打卡（1月6日）
英语笑话 81 （谁是印刷工？ Who's the pr
连载 | 机器学习基石 Lec10：Logistics Regr
英语词缀【45】
Lec 1
连载 | 机器学习基石 Lec 8：Noise & Er
lec1-Introduction

当我们向文件 append 一个单词时，背后发生了啥？

一，找到父亲文件夹的inode

首先我们可能会打开（没有的时候创建一个文件），这个是通过在用户空间调用fd = open(name, O_CREATE | O_RDWR); 实现的。
接下来会去定位到这个路径下的dir，他会从当前路径的inode，往下一路寻找。直到找到这个文件上一级的inode，就是这个文件上一级dir的inode。

struct inode {
  uint dev;           // Device number
  uint inum;          // Inode number
  int ref;            // Reference count
  struct sleeplock lock; // protects everything below here
  int valid;          // inode has been read from disk?

  short type;         // copy of disk inode
  short major;
  short minor;
  short nlink;
  uint size;
  uint addrs[NDIRECT+1];
};

下面讲解下找inode的过程
1705635681264.png
上图中，inode块里，存的是这样的数据结构的内容：

struct dinode {
  short type;           // File type
  short major;          // Major device number (T_DEVICE only)
  short minor;          // Minor device number (T_DEVICE only)
  short nlink;          // Number of links to inode in file system
  uint size;            // Size of file (bytes)
  uint addrs[NDIRECT+1];   // Data block addresses
};

我们需要把inode磁盘上的数据加载进内存里缓存起来（注，这里读的不是磁盘上的文件内容，而是inode的内容，也就是文件里会存dinode这个数据结构里的几个数据），这个就是ilock这个方法做的事情。所以这里会第一次读磁盘（存在bcache里）

a. 首先通过下面代码定位当前inode, 一种路径是绝对路径，一种是相对路径。

 if(*path == '/')
    ip = iget(ROOTDEV, ROOTINO);
  else
    ip = idup(myproc()->cwd);

b. 循环弹出，每一级folder 的 folder名。如a/b/c，就依次遍历a,b,c
c. 当前inode必为dir, 查找弹出的名字，如a
d. 得到新的inode，继续循环，知道到最后一级（不再是dir,是文件了），不再查找文件，直接返回父亲dir 的inode

下面讲解下在dir inode中找下一级inode的过程
1705639931332.png

for(off = 0; off < dp->size; off += sizeof(de)){
    if(readi(dp, 0, (uint64)&de, off, sizeof(de)) != sizeof(de))
      panic("dirlookup read");
    if(de.inum == 0)
      continue;
    if(namecmp(name, de.name) == 0){
      // entry matches path element
      if(poff)
        *poff = off;
      inum = de.inum;
      return iget(dp->dev, inum);
    }
  }

a. 首先我们需要读取当前dir inode里的数据，因为dir存的就是下面有哪些dir或文件。我们要判断我们这个文件名在不在里面，以及在什么位置。dir文件内容是由下面的数据结构组成的。inum 就是子文件或子DIR的inode编号

struct dirent {
  ushort inum;
  char name[DIRSIZ];
};

所以可以看到for 循环每次off += sizeof(de)
b. 那么有了偏移量和inode，就可以计算数据在磁盘上的物理地址. 就是用offset/BSIZE,得到一个序号，这个序号就可以从inode的addrs，找到对应磁盘的地址了。
c. 然后就是读取磁盘的地址，有可能一个dirent ，会跨2个BBLOCK，所以我们要注意这种可能。这也是下面代码m的作用：

for(tot=0; tot<n; tot+=m, off+=m, dst+=m){
    uint addr = bmap(ip, off/BSIZE);
    if(addr == 0)
      break;
    bp = bread(ip->dev, addr);
    m = min(n - tot, BSIZE - off%BSIZE);
    if(either_copyout(user_dst, dst, bp->data + (off % BSIZE), m) == -1) {
      brelse(bp);
      tot = -1;
      break;
    }
    brelse(bp);
  }

综上我们成功找到了要创建/打开的文件的父文件夹目录。

二，创建并返回文件inode

先用同样的方法遍历父文件夹下有没有这个文件的文件名，有的话，直接返回该文件名的inode （实际只需要inum，其余的属性会在ilock时去文件系统读）
如果不存在，就代表要创建一个新文件，那么就调用ialloc

struct inode*
ialloc(uint dev, short type)
{
  int inum;
  struct buf *bp;
  struct dinode *dip;

  for(inum = 1; inum < sb.ninodes; inum++){
    bp = bread(dev, IBLOCK(inum, sb));
    dip = (struct dinode*)bp->data + inum%IPB;
    if(dip->type == 0){  // a free inode
      memset(dip, 0, sizeof(*dip));
      dip->type = type;
      log_write(bp);   // mark it allocated on the disk
      brelse(bp);
      return iget(dev, inum);
    }
    brelse(bp);
  }
  printf("ialloc: no inodes\n");
  return 0;
}

读取inum所属的那个inode文件，找到对应偏移量，判断是不是freenode，是的话，就可以用来创建了。这边有一个读取inode区域的读磁盘操作。

三，关联file 和 fd

在ftable 里加一个新的file 的数据结构

struct file {
  enum { FD_NONE, FD_PIPE, FD_INODE, FD_DEVICE } type;
  int ref; // reference count
  char readable;
  char writable;
  struct pipe *pipe; // FD_PIPE
  struct inode *ip;  // FD_INODE and FD_DEVICE
  uint off;          // FD_INODE
  short major;       // FD_DEVICE
};

// Allocate a file structure.
struct file*
filealloc(void)
{
  struct file *f;

  acquire(&ftable.lock);
  for(f = ftable.file; f < ftable.file + NFILE; f++){
    if(f->ref == 0){
      f->ref = 1;
      release(&ftable.lock);
      return f;
    }
  }
  release(&ftable.lock);
  return 0;
}

在该进程下找到一个空闲的fd去存储这个file

static int
fdalloc(struct file *f)
{
  int fd;
  struct proc *p = myproc();

  for(fd = 0; fd < NOFILE; fd++){
    if(p->ofile[fd] == 0){
      p->ofile[fd] = f;
      return fd;
    }
  }
  return -1;
}

配置file的ip(inode)

f->ip = ip;
f->readable = !(omode & O_WRONLY);
f->writable = (omode & O_WRONLY) || (omode & O_RDWR);

返回文件描述符到用户空间。（此时文件描述符可以找到file, file可以找到inode， inode下可以找到file的metadata, 和data在磁盘上的地址。

四，开始写一个单词

现在我们在用户空间可以向fd 写一些数据，具体操作如write(fd, (void*)addr, 8192); 分别描述向哪个文件描述符，从addr开始把里面8192长度的数据写进fd所属的文件里。
锁定inode, 写数据，解锁inode

if(f->type == FD_INODE){
    // write a few blocks at a time to avoid exceeding
    // the maximum log transaction size, including
    // i-node, indirect block, allocation blocks,
    // and 2 blocks of slop for non-aligned writes.
    // this really belongs lower down, since writei()
    // might be writing a device like the console.
    int max = ((MAXOPBLOCKS-1-1-2) / 2) * BSIZE;
    int i = 0;
    while(i < n){
      int n1 = n - i;
      if(n1 > max)
        n1 = max;

      begin_op();
      ilock(f->ip);
      if ((r = writei(f->ip, 1, addr + i, f->off, n1)) > 0)
        f->off += r;
      iunlock(f->ip);
      end_op();

      if(r != n1){
        // error from writei
        break;
      }
      i += r;
    }
    ret = (i == n ? n : -1);
  }

上面可以隐式看到文件的off 被维护住了，随着写的发生，off值相应增大。

writei 就是根据off 去查在inode的哪个addr里，取到addr，把它从磁盘读进缓存，然后开始向缓存buf写数据.
最后更新inode的size 为新的off, 然后写回磁盘的inode区域

int
writei(struct inode *ip, int user_src, uint64 src, uint off, uint n)
{
  uint tot, m;
  struct buf *bp;

  if(off > ip->size || off + n < off)
    return -1;
  if(off + n > MAXFILE*BSIZE)
    return -1;

  for(tot=0; tot<n; tot+=m, off+=m, src+=m){
    uint addr = bmap(ip, off/BSIZE);
    if(addr == 0)
      break;
    bp = bread(ip->dev, addr);
    m = min(n - tot, BSIZE - off%BSIZE);
    if(either_copyin(bp->data + (off % BSIZE), user_src, src, m) == -1) {
      brelse(bp);
      break;
    }
    log_write(bp);
    brelse(bp);
  }

  if(off > ip->size)
    ip->size = off;

  // write the i-node back to disk even if the size didn't change
  // because the loop above might have called bmap() and added a new
  // block to ip->addrs[].
  iupdate(ip);

  return tot;
}

综上，我们完成了向文件系统写一个单词的操作。当然具体的落盘会由log取真正的保证。下面我们会去讲解LOG的部分。。
我们这里来回顾一下，当我们往文件里写一个单词会发生什么？

首先，要定位路径父亲的inode，这里会用到itable 这个缓存，如果没在缓存里找到，那么会建一个新的条目放进itable里，并且之后通过ilock方法，会根据inum把这个inode相关缺失的属性从文件系统中的inode区域里读出来。
当处理dir的inode时，他的数据存放的其实是这个DIR下所有的其他INUM。我们依次遍历找到名字可以对的上的INUM，然后用这个inum用步骤1去找到对应的inode.
如果名字没有匹配，需要创建时,通过在文件系统的inode区域找到一个空闲的inode，进行写入。
那么文件的inode创建出来后，就会在进程下找到一个空闲的 fd 去关联这个file, file里面最重要的就是这个inode
这个inode 具体文件内容写的地址，会在调用readi 或writei 时调用bmap的时候，用balloc去分配
bmap返回一个addr, 随后会调用bread 去读这个addr( 也就是blockno)
cache层会先找到一个可用的cache，然后把内容从磁盘读进cache里，之后就是操作内存。
最后通过log的记录会进行数据落盘

Log是如何工作的？

这里的思想是，在任何一个写操作前，我们会先记这个写操作到LOG里. 然后通过写一条commit的标志，表示前面的操作要么全执行成功，要么全部不执行（如果没有commit标志，前面的LOG全部丢弃）
那么在任意时间断电，我们就可以去读取LOG FILE，根据里面的COMMIT的标志，来恢复数据了。
在XV6中，以创建文件的sys_open为例（在sysfile.c文件中）每个文件系统操作，都有begin_op和end_op分别表示事物的开始和结束。在end_op中会实现commit操作。

中间的写log操作

void
log_write(struct buf *b)
{
  int i;

  acquire(&log.lock);
  if (log.lh.n >= LOGSIZE || log.lh.n >= log.size - 1)
    panic("too big a transaction");
  if (log.outstanding < 1)
    panic("log_write outside of trans");

  for (i = 0; i < log.lh.n; i++) {
    if (log.lh.block[i] == b->blockno)   // log absorption
      break;
  }
  log.lh.block[i] = b->blockno;
  if (i == log.lh.n) {  // Add new block to log?
    bpin(b);
    log.lh.n++;
  }
  release(&log.lock);
}

这里其实只是更新了log.lh 的总size（也就是有几个BLOCK 需要落盘，然后存放了哪几个）
比如写block 45，我们已经更新了block cache中的block 45。接下来我们需要在内存中记录，在稍后的commit中，要将block 45写入到磁盘的log中。
这里的代码先获取log header的锁，之后再更新log header。首先代码会查看block 45是否已经被log记录了。如果是的话，其实不用做任何事情，因为block 45已经会被写入了。这种忽略的行为称为log absorbtion。如果block 45不在需要写入到磁盘中的block列表中，接下来会对n加1，并将block 45记录在列表的最后。

下面我们看下commit函数，是怎么实现的

static void
commit()
{
  if (log.lh.n > 0) {
    write_log();     // Write modified blocks from cache to log
    write_head();    // Write header to disk -- the real commit
    install_trans(0); // Now install writes to home locations
    log.lh.n = 0;
    write_head();    // Erase the transaction from the log
  }
}

首先是write_log。这基本上就是将所有存在于内存中的log header中的block编号对应的block，从block cache写入到磁盘上的log区域中（注，也就是将变化先从内存拷贝到log中）。
write_head会将内存中的log header写入到磁盘中。

// Copy modified blocks from cache to log.
static void
write_log(void)
{
  int tail;

  for (tail = 0; tail < log.lh.n; tail++) {
    struct buf *to = bread(log.dev, log.start+tail+1); // log block
    struct buf *from = bread(log.dev, log.lh.block[tail]); // cache block
    memmove(to->data, from->data, BSIZE);
    bwrite(to);  // write the log
    brelse(from);
    brelse(to);
  }
}

依次遍历log中记录的block，并写入到log中。它首先读出log block，将cache中的block拷贝到log block，最后再将log block写回到磁盘中。这样可以确保需要写入的block都记录在文件系统的log区域中。

但是在这个位置，我们还没有commit，现在我们只是将block存放在了log中。如果我们在这个位置也就是在write_head之前crash了，那么最终的表现就像是transaction从来没有发生过。

接下来看一下write_head函数，我之前将write_head称为commit point。

// Write in-memory log header to disk.
// This is the true point at which the
// current transaction commits.
static void
write_head(void)
{
  struct buf *buf = bread(log.dev, log.start);
  struct logheader *hb = (struct logheader *) (buf->data);
  int i;
  hb->n = log.lh.n;
  for (i = 0; i < log.lh.n; i++) {
    hb->block[i] = log.lh.block[i];
  }
  bwrite(buf);
  brelse(buf);
}

上面的bwrite，就是真正的commit point. 在这部之前crash, 都会像transaction从来没有发生过一样。

install_trans实际上就是将block数据从log中拷贝到了实际的文件系统block中。当然，可能在这里代码的某个位置会出现问题，但是这应该也没问题，因为在恢复的时候，我们会从最开始重新执行过。

灾难恢复

initlog基本上就是调用recover_from_log函数。

static void
recover_from_log(void)
{
  read_head();
  install_trans(1); // if committed, copy from log to disk
  log.lh.n = 0;
  write_head(); // clear the log
}

recover_from_log先调用read_head函数从磁盘中读取header，之后调用install_trans函数。这个函数之前在commit函数中也调用过，它就是读取log header中的n，然后根据n将所有的log block拷贝到文件系统的block中。recover_from_log在最后也会跟之前一样清除log。

xv6文件系统的三个挑战

1. Cache Eviction 问题

问题描述：当进行中的事务需要更新一个数据块（如block 45），而缓冲区（buffer cache）已满时，系统可能需要撤回（evict）这个刚刚更新的数据块以腾出空间。如果这时直接将更新后的数据块写回磁盘，会违反事务的原子性和前写规则（write ahead rule）。前写规则要求在实际更新文件系统的数据块之前，必须先将所有更改写入日志。
解决方案：通过引用计数来“固定”（pin）相关的数据块在缓冲区中，避免它们被撤回，直到相关的日志条目被写入，从而保证了数据的一致性和事务的原子性。

2. 文件系统操作与日志大小的适配问题

问题描述：在XV6文件系统中，日志的大小是固定的（例如，30个数据块），这意味着所有文件系统操作都必须适应这个日志大小。如果一个操作尝试写入超过日志容量的数据块，将违反前写规则。
解决方案：将大的文件系统操作分割成多个小的操作，每个小操作作为独立的事务处理，并逐个写入日志。虽然这意味着大的写操作不是原子的，但它遵守了不破坏文件系统一致性的原则。

3. 并发文件系统调用的挑战

问题描述：如果有多个并发执行的事务同时占用日志空间，可能会导致日志空间用尽，而没有任何一个事务能够完全完成。这种情况下，如果提交任何一个部分完成的事务，都将违反前写规则，因为这样会提交一个不完整的事务。
解决方案：XV6通过限制同时进行的文件系统操作的数量来解决这个问题。如果当前有太多操作正在执行，在begin_op中, 系统会暂停新的文件系统操作，直到足够多的操作完成并提交（commit）。这种方法有时被称为“组提交”（group commit），因为它允许多个操作像一个大事务一样同时提交，确保它们要么全部成功，要么全部失败，从而保证了事务的原子性和系统的一致性。

// called at the start of each FS system call.
void
begin_op(void)
{
  acquire(&log.lock);
  while(1){
    if(log.committing){
      sleep(&log, &log.lock);
    } else if(log.lh.n + (log.outstanding+1)*MAXOPBLOCKS > LOGSIZE){
      // this op might exhaust log space; wait for commit.
      sleep(&log, &log.lock);
    } else {
      log.outstanding += 1;
      release(&log.lock);
      break;
    }
  }
}

连载 | 机器学习基石 Lec 7：VC Dimension 及
Lec 7：The VC Dimension tips：符号含义主要参照lec 1，部分需要参照其他lec 上一节...
连载 | 机器学习基石 Lec 9: 线性回归
Tips：未进行解释的符号主要参照lec1，部分需要参照其他lec～前面几个Lecture介绍了why can ...
连载 | 机器学习基石 Lec 5:Growth Functio
Lec5：Training versus Testing tips：符号含义参照Lec1 上一节中我们得到在一些条...
第55次打卡（1月6日）
Good morning everyone, they are environmental science lec...
英语笑话 81 （谁是印刷工？ Who's the pr
Who's the printer? The professor was giving the final lec...
连载 | 机器学习基石 Lec10：Logistics Regr
Tips：符号主要参照 Lec1，部分参照其他Lec～上一节介绍了线性回归，线性回归是输出一个 score，可以...
英语词缀【45】
【lect】 “to gather， to choose” “收集；选择” |eclecticism=ec+lec...
Lec 1
6.824 2015 Lecture 1: Introduction and lab overview 6.824...
连载 | 机器学习基石 Lec 8：Noise & Er
Tips：未进行说明的符号主要参照Lec 1 ，部分参照其他Lec. 上一节介绍了ML中非常重要的工具VC Dim...
lec1-Introduction
lec1 Introduction 1 Compiled Language and C++ Concisences...

Lec 14,15 文件系统

当我们向文件 append 一个单词时，背后发生了啥？

一，找到父亲文件夹的inode

二，创建并返回文件inode

三，关联file 和 fd

四，开始写一个单词

Log是如何工作的？

灾难恢复

xv6文件系统的三个挑战

1. Cache Eviction 问题

2. 文件系统操作与日志大小的适配问题

3. 并发文件系统调用的挑战

相关文章

连载 | 机器学习基石 Lec 7：VC Dimension 及

连载 | 机器学习基石 Lec 9: 线性回归

连载 | 机器学习基石 Lec 5:Growth Functio

第55次打卡（1月6日）

英语笑话 81 （谁是印刷工？ Who's the pr

连载 | 机器学习基石 Lec10：Logistics Regr

英语词缀【45】

Lec 1

连载 | 机器学习基石 Lec 8：Noise & Er

lec1-Introduction

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

Lec 14,15 文件系统

当我们向 文件 append 一个单词时，背后发生了啥？

一， 找到父亲文件夹的inode

二， 创建并返回文件inode

三，关联file 和 fd

四，开始写一个单词

Log是如何工作的？

灾难恢复

xv6文件系统的三个挑战

1. Cache Eviction 问题

2. 文件系统操作与日志大小的适配问题

3. 并发文件系统调用的挑战

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

当我们向文件 append 一个单词时，背后发生了啥？

一，找到父亲文件夹的inode

二，创建并返回文件inode