[Repost] Linux Kernel Architecture in Depth: The Virtual File System (Introduction)

Author: Nothing_655f | Published: 2021-01-09 16:53

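In Linux, "everything is a file". Linux supports many file systems, such as ext2/3/4 and XFS. To support all of them cleanly, the kernel adds an abstraction layer, the virtual file system (VFS), which sits between user programs and the concrete file system implementations and lets each implementation plug in behind a common interface. Its basic architecture is as follows: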

[Figure: basic VFS architecture]

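As the figure shows, every VFS operation runs in kernel mode. Access to storage and other external devices is complex and must be tightly controlled; if user programs handled it directly, the system would become very unstable.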

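File system types

Disk-based file systems
The classic way of storing files on non-volatile media. These are the file systems most people know, such as ext2/3/4 and FAT.
Virtual file systems
Generated inside the kernel, these give user-space applications and users a simple way to exchange information with the kernel. The best-known example is the proc file system. It needs no storage on any kind of hardware: its contents are produced by the kernel from in-memory data and are never persisted.
Network file systems
These give access to data stored on another computer. The local application still issues ordinary system calls and the kernel enters the VFS as usual; the file system implementation then forwards the request over the network to the remote machine. Clients such as NFS and CIFS live in the kernel, while others, such as sshfs, are mounted through FUSE.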
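
One way to see these categories from user space is to ask the kernel which file system backs a given path. The sketch below is not part of the original article. It relies on the Linux-specific statfs(2) call and the magic numbers in <linux/magic.h>; the helper names fs_magic and fs_category are made up for illustration, and the mapping covers only a handful of file systems.

```c
#define _GNU_SOURCE
#include <sys/vfs.h>      /* statfs(2), Linux-specific */
#include <linux/magic.h>  /* PROC_SUPER_MAGIC and friends */

/* Return the file system magic number for `path`, or -1 on error. */
long fs_magic(const char *path)
{
    struct statfs st;

    if (statfs(path, &st) != 0)
        return -1;
    return (long)st.f_type;
}

/* Map a few well-known magic numbers to the categories listed above. */
const char *fs_category(long magic)
{
    switch (magic) {
    case EXT4_SUPER_MAGIC:      /* ext2, ext3 and ext4 share this value */
    case MSDOS_SUPER_MAGIC:     /* FAT */
        return "disk-based";
    case PROC_SUPER_MAGIC:
    case SYSFS_MAGIC:
        return "virtual";
    case NFS_SUPER_MAGIC:
        return "network";
    default:
        return "unknown";
    }
}
```

On a typical Linux system, fs_category(fs_magic("/proc")) should come back as "virtual", while the result for "/" depends on how the machine was installed.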

The common file model

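The VFS defines a set of methods and abstractions, plus a uniform view of the objects (files) in a file system. The implementations behind that view differ widely. What the VFS offers is a superset: many of its operations are simply not needed by some file systems. proc, for example, has no use for the writepage operation.
The kernel and user space use different objects to refer to a file. In user space a file is identified by a file descriptor (FD), a plain integer that is valid only within one process; two processes can hold the same FD number and mean entirely different files. In the kernel the FD maps to a struct file, whose key member is a pointer to an address_space, the structure that actually moves data to and from the underlying device. File metadata lives in a separate structure, the inode, which records the link count, access times, version, backing device, owning superblock and so on, but not the file name. Names are stored in struct dentry: a dentry maps a name to an inode and places it in the directory tree, and every dentry belongs to a super_block.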
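
The split between the per-process fd and the kernel's struct file can be observed from user space, because the file offset is stored in struct file. The following is a minimal sketch, not taken from the original article; the function name fd_offset_demo and the temporary file name are invented for the example. It opens the same file twice (two struct file objects) and duplicates one descriptor with dup (two fds, one struct file).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if the offsets behave as the model predicts, 0 if they do not,
 * and -1 if the temporary file could not be set up.
 */
int fd_offset_demo(void)
{
    char path[] = "/tmp/vfs_fd_demo_XXXXXX";
    char c = 0;
    int fd1, fd2 = -1, fd3 = -1, ok = -1;

    fd1 = mkstemp(path);                 /* open #1: first struct file */
    if (fd1 < 0)
        return -1;
    if (write(fd1, "abc", 3) != 3 || lseek(fd1, 0, SEEK_SET) != 0)
        goto out;

    fd2 = open(path, O_RDONLY);          /* open #2: a second struct file */
    fd3 = dup(fd1);                      /* dup: new fd, same struct file as fd1 */
    if (fd2 < 0 || fd3 < 0)
        goto out;

    /* Reading through fd1 advances the offset kept in struct file #1. */
    if (read(fd1, &c, 1) != 1 || c != 'a')
        goto out;

    /* fd3 shares that offset; fd2 has its own, still at 0. */
    ok = (lseek(fd3, 0, SEEK_CUR) == 1) && (lseek(fd2, 0, SEEK_CUR) == 0);

out:
    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return ok;
}
```

All three descriptors end up at the same inode and the same address_space; only the struct file layer, and with it the offset, differs.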
The sections below look at each of these structures and how they relate, and then trace what happens in Linux between opening a file and writing to it.

VFS structures

[Figure: relationships between the VFS structures]

inode

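The inode manages a file's metadata: permissions, access information, link information, the storage device and so on. Its operations mainly cover linking, permission checks and similar tasks. The data structures are shown below.
For background, see inode.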

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
    ...
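    const struct inode_operations   *i_op; // inode operations, specific to the concrete file system
    struct super_block  *i_sb; // superblock this inode belongs to
    struct address_space    *i_mapping; // address space: the part that actually exchanges data with the device
    ...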
    /* Stat data, not accessed from path walking */
    unsigned long       i_ino; // inode number
    /*
     * Filesystems may only read i_nlink directly.  They shall use the
     * following functions for modification:
     *
     *    (set|clear|inc|drop)_nlink
     *    inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink;
    };
    dev_t           i_rdev;
    loff_t          i_size;
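    struct timespec64   i_atime; // time of last access
    struct timespec64   i_mtime; // time of last modification of the contents
    struct timespec64   i_ctime; // time of last change to the inode itself (not the creation time)
    spinlock_t          i_lock; /* i_blocks, i_bytes, maybe i_size */
    unsigned short      i_bytes; // bytes used beyond the full 512-byte blocks counted in i_blocks
    u8                  i_blkbits;       // log2 of the block size in bytes
    u8                  i_write_hint;
    blkcnt_t            i_blocks; // number of 512-byte blocks allocated to the file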

#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t      i_size_seqcount;
#endif

    /* Misc */
    unsigned long       i_state;
    struct rw_semaphore i_rwsem;

    unsigned long       dirtied_when;   /* jiffies of first dirtying */
    unsigned long       dirtied_time_when;

    struct hlist_node   i_hash;
    struct list_head    i_io_list;  /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
    struct bdi_writeback    *i_wb;      /* the associated cgroup wb */

    /* foreign inode detection, see wbc_detach_inode() */
    int         i_wb_frn_winner;
    u16         i_wb_frn_avg_time;
    u16         i_wb_frn_history;
#endif
    struct list_head    i_lru;      /* inode LRU list */
    struct list_head    i_sb_list;
    struct list_head    i_wb_list;  /* backing dev writeback list */
    union {
        struct hlist_head   i_dentry; // one inode may be referenced by several dentries (hard links)
        struct rcu_head i_rcu;
    };
    atomic64_t  i_version;
    atomic_t        i_count;
    atomic_t        i_dio_count;
    atomic_t        i_writecount;
#ifdef CONFIG_IMA
    atomic_t        i_readcount; /* struct files open RO */
#endif
    const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
    struct file_lock_context    *i_flctx;
    struct address_space    i_data;
    struct list_head    i_devices;
    union {
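        struct pipe_inode_info  *i_pipe; // pipe
        struct block_device *i_bdev; // block device
        struct cdev     *i_cdev;  // character device
        char            *i_link; // cached symlink target, for file systems that keep it in memory
        unsigned        i_dir_seq; // sequence counter for directories, used by parallel lookups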
    };
    __u32           i_generation;
#ifdef CONFIG_FSNOTIFY
    __u32           i_fsnotify_mask; /* all events this inode cares about */
    struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
#endif

#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    struct fscrypt_info *i_crypt_info;
#endif
    void            *i_private; /* fs or device private pointer */
} __randomize_layout;

struct inode_operations {
    struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); // look up the name held in the dentry inside the directory inode
    const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); // return the target path of a symbolic link so the VFS can follow it
    int (*permission) (struct inode *, int);
    struct posix_acl * (*get_acl)(struct inode *, int);

    int (*readlink) (struct dentry *, char __user *,int);

    int (*create) (struct inode *,struct dentry *, umode_t, bool);
    int (*link) (struct dentry *,struct inode *,struct dentry *); // create a hard link
    int (*unlink) (struct inode *,struct dentry *); // remove a name (hard link) from a directory
    int (*symlink) (struct inode *,struct dentry *,const char *); // create a symbolic link
    int (*mkdir) (struct inode *,struct dentry *,umode_t); // create a directory, named by the dentry, with the given mode, and its inode
    int (*rmdir) (struct inode *,struct dentry *); // remove a directory
    int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); // create a device node, named pipe or socket
    int (*rename) (struct inode *, struct dentry *,
            struct inode *, struct dentry *, unsigned int); // move old_dentry from old_dir to new_dir under the name given by new_dentry
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
              u64 len);
    int (*update_time)(struct inode *, struct timespec64 *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
               struct file *, unsigned open_flag,
               umode_t create_mode); 
    int (*tmpfile) (struct inode *, struct dentry *, umode_t);
    int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;
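
The i_ino and i_nlink fields, and the link and unlink operations, have a direct user-space view through stat(2). As a small illustration (not from the original article, with the function name hardlink_demo and the temporary paths invented for the example), the sketch below creates a file, adds a hard link, and checks that both names report the same inode number and a link count of 2.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if both names share one inode with nlink == 2 and the count
 * drops back to 1 after unlink(); 0 if not; -1 on setup failure.
 */
int hardlink_demo(void)
{
    char dir[] = "/tmp/vfs_link_demo_XXXXXX";
    char a[64], b[64];
    struct stat sa, sb;
    int fd, ok = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    fd = open(a, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd < 0)
        goto out;
    close(fd);

    if (link(a, b) != 0)                       /* reaches i_op->link */
        goto out;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        goto out;

    /* Two names (dentries), one inode. */
    ok = (sa.st_ino == sb.st_ino) && (sa.st_nlink == 2);

    if (unlink(b) != 0 || stat(a, &sa) != 0)   /* reaches i_op->unlink */
        ok = -1;
    else
        ok = ok && (sa.st_nlink == 1);

out:
    unlink(b);
    unlink(a);
    rmdir(dir);
    return ok;
}
```

The file name never appears in struct stat, which matches the point above that names belong to the dentry rather than the inode.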

dentry

A dentry manages a file name and ties the entry to its parent and to all of its child entries.

dentry state

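A dentry is in one of three states: used, unused, or negative.
used: associated with a valid inode and currently referenced
unused: still associated with a valid inode, but the reference count is 0; it stays in the cache and has not actually been freed
negative: no inode is associated with it, either because the file was deleted or because the name never existed on the storage device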

dentry cache

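Resolving a path to a dentry by reading from disk every time would be expensive, so the kernel keeps dentries in an LRU cache to speed up lookups. To find the dentry for /usr/bin/java, for example, the kernel first finds the dentry for /, then /usr, then /usr/bin, and so on to the end of the path. Every dentry touched along the way is cached for the next lookup. The lookup procedure itself is analysed in detail in the later discussion of lookup.
For more detail, see dentry.
The data structures are as follows: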

struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;       /* protected by d_lock */
    seqcount_t d_seq;       /* per dentry seqlock */
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;    /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;      /* Where the name belongs to - NULL is
                     * negative */
    unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */

    /* Ref lookup also touches following */
    struct lockref d_lockref;   /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;   /* The root of the dentry tree */
    unsigned long d_time;       /* used by d_revalidate */
    void *d_fsdata;         /* fs-specific data */

    union {
        struct list_head d_lru;     /* LRU list */
        wait_queue_head_t *d_wait;  /* in-lookup ones only */
    };
    struct list_head d_child;   /* child of parent list */
    struct list_head d_subdirs; /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;  /* inode alias list */
        struct hlist_bl_node d_in_lookup_hash;  /* only for in-lookup ones */
        struct rcu_head d_rcu;
    } d_u;
} __randomize_layout;

struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int); // check whether a cached dentry is still valid
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *); // compute the hash of the dentry name
    int (*d_compare)(const struct dentry *, // compare two file names
            unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *);
                     // called when the last reference is dropped; decides whether the dentry stays cached (unused) or is discarded
    int (*d_init)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *); // called when the dentry loses its inode; the default simply calls iput()
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(const struct path *, bool);
    struct dentry *(*d_real)(struct dentry *, const struct inode *);
} ____cacheline_aligned;
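
The cached path walk described in the dentry cache section can be sketched with a toy model. This is user-space code written for illustration only, not kernel code: toy_dentry, toy_dcache and the functions below are invented names, the "cache" is a small linear array with no eviction, and a miss merely stands in for reading the directory from disk.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

#define MAX_DENTRIES 32
#define NAME_LEN     32

/* Toy stand-in for struct dentry: a name plus a link to its parent. */
struct toy_dentry {
    int  parent;                /* index of the parent entry, -1 for "/" */
    char name[NAME_LEN];
};

struct toy_dcache {
    struct toy_dentry entries[MAX_DENTRIES];
    int count;
    int hits;                   /* components found in the cache */
    int misses;                 /* components that needed a "disk" lookup */
};

void toy_dcache_init(struct toy_dcache *dc)
{
    memset(dc, 0, sizeof *dc);
    dc->entries[0].parent = -1; /* entry 0 is the root dentry "/" */
    dc->count = 1;
}

/* Find (parent, name) in the cache; on a miss, pretend to read it from disk. */
static int toy_lookup_one(struct toy_dcache *dc, int parent, const char *name)
{
    if (strlen(name) >= NAME_LEN)
        return -1;              /* toy limit on component length */
    for (int i = 1; i < dc->count; i++) {
        if (dc->entries[i].parent == parent &&
            strcmp(dc->entries[i].name, name) == 0) {
            dc->hits++;
            return i;
        }
    }
    if (dc->count == MAX_DENTRIES)
        return -1;              /* full; a real dcache would evict via its LRU */
    dc->misses++;
    dc->entries[dc->count].parent = parent;
    strcpy(dc->entries[dc->count].name, name);
    return dc->count++;
}

/* Walk an absolute path one component at a time, starting at the root. */
int toy_path_walk(struct toy_dcache *dc, const char *path)
{
    char buf[256];
    char *save = NULL;
    int cur = 0;

    snprintf(buf, sizeof buf, "%s", path);
    for (char *tok = strtok_r(buf, "/", &save); tok && cur >= 0;
         tok = strtok_r(NULL, "/", &save))
        cur = toy_lookup_one(dc, cur, tok);
    return cur;
}
```

Walking /usr/bin/java twice costs three misses the first time and three hits the second; a later walk of /usr/bin/gcc reuses the cached usr and bin entries and misses only on the last component.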

super_block

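The superblock holds the parameters of the actual file system behind a mount point: block size, the maximum file size the file system can handle, the file system type, the backing storage device and so on. (Note: in the structure diagram above, the superblock has a files member pointing to all open files, but I could not find a matching field in the structure below. That list used to be checked during umount to make sure every file had been closed. I do not yet know how newer kernels handle this and will add it once I do.)
Superblocks are managed mainly by the mount logic, which will be analysed in detail when we get to mounting. The superblock's main job is to manage inodes.
For more detail, see superblock.
The data structures are as follows: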

struct super_block {
    struct list_head    s_list;     /* Keep this first */
    dev_t           s_dev;      /* search index; _not_ kdev_t */
    unsigned char       s_blocksize_bits; // log2(block size in bytes)
    unsigned long       s_blocksize; // block size in bytes
    loff_t          s_maxbytes; /* Max file size */
    struct file_system_type *s_type; // file system type
    const struct super_operations   *s_op; // superblock operations
    const struct dquot_operations   *dq_op;
    const struct quotactl_ops   *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long       s_flags;
    unsigned long       s_iflags;   /* internal SB_I_* flags */
    unsigned long       s_magic;
    struct dentry       *s_root; // root dentry of this file system; lookups inside the mount start here
    struct rw_semaphore s_umount;
    int         s_count;
    atomic_t        s_active;
#ifdef CONFIG_SECURITY
    void                    *s_security;
#endif
    const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    const struct fscrypt_operations *s_cop;
#endif
    struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
    struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info     *s_mtd;
    struct hlist_node   s_instances;
    unsigned int        s_quota_types;  /* Bitmask of supported quota types */
    struct quota_info   s_dquot;    /* Diskquota specific options */

    struct sb_writers   s_writers;

    /*
     * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
     * s_fsnotify_marks together for cache efficiency. They are frequently
     * accessed and rarely modified.
     */
    void            *s_fs_info; /* Filesystem private info */

    /* Granularity of c/m/atime in ns (cannot be worse than a second) */
    u32         s_time_gran;
#ifdef CONFIG_FSNOTIFY
    __u32           s_fsnotify_mask;
    struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
#endif

    char            s_id[32];   /* Informational name */
    uuid_t          s_uuid;     /* UUID */

    unsigned int        s_max_links;
    fmode_t         s_mode;

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */

    /*
     * Filesystem subtype.  If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;

    const struct dentry_operations *s_d_op; /* default d_op for dentries */

    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;

    struct shrinker s_shrink;   /* per-sb shrinker handle */

    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;

    /* Pending fsnotify inode refs */
    atomic_long_t s_fsnotify_inode_refs;

    /* Being remounted read-only */
    int s_readonly_remount;

    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;

    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;

    /*
     * The list_lru structure is essentially just a pointer to a table
     * of per-node lru lists, each of which has its own spinlock.
     * There is no need to put them into separate cachelines.
     */
    struct list_lru     s_dentry_lru; // dentry cache (LRU)
    struct list_lru     s_inode_lru; // inode cache (LRU)
    struct rcu_head     rcu;
    struct work_struct  destroy_work;

    struct mutex        s_sync_lock;    /* sync serialisation lock */

    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;

    /* s_inode_list_lock protects s_inodes */
    spinlock_t      s_inode_list_lock ____cacheline_aligned_in_smp;
    struct list_head    s_inodes;   /* all inodes */

    spinlock_t      s_inode_wblist_lock;
    struct list_head    s_inodes_wb;    /* writeback inodes */
} __randomize_layout;

struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb); // allocate and initialise an inode for this sb
    void (*destroy_inode)(struct inode *); // release what alloc_inode allocated
    void (*dirty_inode) (struct inode *, int flags); // called when the inode is marked dirty
    int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back to storage
    int (*drop_inode) (struct inode *); // called when the last reference to the inode is dropped
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);  // called when the superblock is released, i.e. on unmount
    int (*sync_fs)(struct super_block *sb, int wait); 
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *); // report file system statistics
    int (*remount_fs) (struct super_block *, int *, char *); // remount
    void (*umount_begin) (struct super_block *); // mainly used by NFS
    // query-related hooks
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *,
                  struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                    struct shrink_control *);
};

address_space

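As described above, the superblock manages inodes; the dentry manages file names, the mapping from names to inodes, and the directory hierarchy; and the inode manages file metadata. But which structure actually moves file data between memory and a storage device such as a disk? How does written data travel from memory back to disk, and how is data read from disk into memory? That is the job of address_space: it keeps the in-memory pages of a file in sync with the backing device. How it works is covered in detail in the discussion of the in-memory cache.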

struct address_space {
    struct inode        *host; // owning inode, which gives access to the file metadata
    struct xarray       i_pages; // cached pages of the file
    gfp_t           gfp_mask; // allocation flags used for the pages
    atomic_t        i_mmap_writable; // number of VM_SHARED mappings
    struct rb_root_cached   i_mmap; // tree of private and shared mappings
    struct rw_semaphore i_mmap_rwsem;
    unsigned long       nrpages; // number of pages currently cached
    unsigned long       nrexceptional;
    pgoff_t         writeback_index; // writeback starts here
    const struct address_space_operations *a_ops; // address-space operations
    unsigned long       flags; // error bits and flags
    errseq_t        wb_err; // most recent writeback error
    spinlock_t      private_lock;
    struct list_head    private_list;
    void            *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;

struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc); // write one page back
    int (*readpage)(struct file *, struct page *); // read one page into memory

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *); // write dirty pages back

    /* Set a page dirty.  Return true if this dirtied it */
    int (*set_page_dirty)(struct page *page); // mark a page dirty

    /*
     * Reads in the requested pages. Unlike ->readpage(), this is
     * PURELY used for read-ahead!.
     */
    int (*readpages)(struct file *filp, struct address_space *mapping,
            struct list_head *pages, unsigned nr_pages);

    int (*write_begin)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned flags,
                struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned copied,
                struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage) (struct page *, unsigned int, unsigned int);
    int (*releasepage) (struct page *, gfp_t);
    void (*freepage)(struct page *);
    ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
    /*
     * migrate the contents of a page to the specified target. If
     * migrate_mode is MIGRATE_ASYNC, it must not block.
     */
    int (*migratepage) (struct address_space *,
            struct page *, struct page *, enum migrate_mode);
    bool (*isolate_page)(struct page *, isolate_mode_t);
    void (*putback_page)(struct page *);
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, unsigned long,
                    unsigned long);
    void (*is_dirty_writeback) (struct page *, bool *, bool *);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                sector_t *span);
    void (*swap_deactivate)(struct file *file);
};
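
Because both read(2) and mmap(2) are served from the pages held by a file's address_space, data stored through a shared mapping is visible to a read on the same file straight away. The sketch below illustrates this on Linux; it is not from the original article, and the function name page_cache_demo and the temporary path are invented for the example.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if bytes stored through a MAP_SHARED mapping are visible to
 * pread() on the same file without an explicit msync(), 0 if not,
 * -1 on setup failure.
 */
int page_cache_demo(void)
{
    char path[] = "/tmp/vfs_mmap_demo_XXXXXX";
    char buf[5] = {0};
    char *map;
    int ok = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) != 0)          /* give the file some backing size */
        goto out;

    map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memcpy(map, "page", 4);                /* dirties a page in the page cache */
    if (pread(fd, buf, 4, 0) == 4)         /* the read path finds the same page */
        ok = (memcmp(buf, "page", 4) == 0);

    munmap(map, 4096);
out:
    close(fd);
    unlink(path);
    return ok;
}
```

Nothing in this sequence forces the data to disk; that happens later through writepage or writepages, or earlier if the program calls fsync or msync.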

file

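As mentioned earlier, a process in user space sees an integer fd, and the corresponding kernel structure is struct file. Every user-space operation on an fd is turned by a system call into an operation on that file.
For more detail, see file.
The data structures are as follows: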

struct task_struct {
       ...
    /* Filesystem information: */
    struct fs_struct        *fs; // root & pwd path

    /* Open file information: */
    struct files_struct     *files; // opened files

    /* Namespaces: */
    struct nsproxy          *nsproxy;
        ...
};
/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
    atomic_t count; // reference count: number of tasks sharing this table
    bool resize_in_progress;
    wait_queue_head_t resize_wait;

    struct fdtable __rcu *fdt; // pointer to the fd table currently in use
    struct fdtable fdtab; // embedded initial fd table
  /*
   * written part on a separate cache line in SMP
   */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    unsigned int next_fd; // hint: where to start searching for the next free fd
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    unsigned long full_fds_bits_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT]; // initial array of open files
};

struct fdtable {
    unsigned int max_fds; // current capacity of this table (it grows on demand, up to the limit set by ulimit -n)
    struct file __rcu **fd;      /* current fd array */
    unsigned long *close_on_exec;
    unsigned long *open_fds;  // bitmap of fds in use
    unsigned long *full_fds_bits;
    struct rcu_head rcu;
};

struct file {
    union {
        struct llist_node   fu_llist;
        struct rcu_head     fu_rcuhead;
    } f_u;
    struct path     f_path;  // path (mount and dentry)
    struct inode        *f_inode;    /* cached value */
    const struct file_operations    *f_op; // file operations
    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t      f_lock;
    enum rw_hint        f_write_hint;
    atomic_long_t   f_count;
    unsigned int        f_flags;
    fmode_t         f_mode;
    struct mutex        f_pos_lock;
    loff_t          f_pos; // current position in the file
    struct fown_struct  f_owner; // owner to notify with signals such as SIGIO (asynchronous I/O)
    const struct cred   *f_cred;
    struct file_ra_state    f_ra;
    u64         f_version;
#ifdef CONFIG_SECURITY
    void            *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void            *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head    f_ep_links;
    struct list_head    f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space    *f_mapping; // address space
    errseq_t        f_wb_err;
} __randomize_layout;

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int); // move the file position
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iterate) (struct file *, struct dir_context *);
    int (*iterate_shared) (struct file *, struct dir_context *);
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *); // map the file into the caller's virtual address space
    unsigned long mmap_supported_flags;
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int); 
    int (*flock) (struct file *, int, struct file_lock *); // apply or remove a lock on the file (flock(2))
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset,
              loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
    unsigned (*mmap_capabilities)(struct file *);
#endif
    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
            loff_t, size_t, unsigned int);
    loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                   struct file *file_out, loff_t pos_out,
                   loff_t len, unsigned int remap_flags);
    int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;
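
The operation tables above all follow the same pattern: the generic layer calls through a table of function pointers, and a file system leaves a pointer NULL when it does not support an operation. The toy model below shows that dispatch in plain user-space C. It is an illustration written for this article rather than kernel code; every name starting with toy_ or mem_ is invented, and returning -EINVAL for a missing method is a choice made for the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

struct toy_file;

/* Toy counterpart of struct file_operations: a table of function pointers. */
struct toy_file_ops {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
    ssize_t (*write)(struct toy_file *f, const char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *f_op;   /* chosen once, when the file is opened */
    char   data[64];
    size_t size;
    size_t pos;
};

static ssize_t mem_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->size - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (ssize_t)len;
}

static ssize_t mem_write(struct toy_file *f, const char *buf, size_t len)
{
    size_t room = sizeof f->data - f->pos;

    if (len > room)
        len = room;
    memcpy(f->data + f->pos, buf, len);
    f->pos += len;
    if (f->pos > f->size)
        f->size = f->pos;
    return (ssize_t)len;
}

/* A read-write "file system" and a read-only one that leaves .write NULL. */
const struct toy_file_ops toy_rw_ops = { .read = mem_read, .write = mem_write };
const struct toy_file_ops toy_ro_ops = { .read = mem_read };

/* Generic layer: dispatch through f_op and reject missing methods. */
ssize_t toy_vfs_write(struct toy_file *f, const char *buf, size_t len)
{
    if (!f->f_op->write)
        return -EINVAL;
    return f->f_op->write(f, buf, len);
}

ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (!f->f_op->read)
        return -EINVAL;
    return f->f_op->read(f, buf, len);
}
```

The caller of toy_vfs_write never needs to know which implementation sits behind f_op, which is the same property that lets one write(2) system call serve ext2, proc and NFS alike.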

VFS in practice

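That covers the basic architecture of the VFS, but a real understanding needs something more concrete. So let us walk through a file operation in pseudocode to see how the kernel reads and writes files, using a write as the example.
Goal: starting from nothing, write hello world to the file /testmount/testdir/testfile1.txt.
The system calls involved are: 1. mkdir 2. creat 3. open 4. write
mkdir roughly corresponds to the following calls:
rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir");
rootInode->i_op->mkdir(rootInode, testDirDentry, 0777);
testDirInode = testDirDentry->d_inode;
creat roughly corresponds to:
testFileDentry = dentry("testfile1.txt");
testDirInode->i_op->create(testDirInode, testFileDentry, 0777, true);
testFileInode = testFileDentry->d_inode;
open roughly corresponds to:
testFile->f_op = testFileInode->i_fop;
testFile->f_op->open(testFileInode, testFile);
write roughly corresponds to:
pos = 0;
testFile->f_op->write(testFile, "hello world", len, &pos);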
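
Seen from user space, the same four steps are ordinary system calls. The sketch below is an illustration added here, not code from the original article: the function name write_hello and its base parameter are invented so the example can run under any writable directory instead of requiring a real /testmount, and a short write is simply treated as an error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create `base`/testdir/testfile1.txt and write `msg` into it.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int write_hello(const char *base, const char *msg)
{
    char dir[PATH_MAX], file[PATH_MAX];
    size_t len = strlen(msg);
    int fd;

    if (snprintf(dir, sizeof dir, "%s/testdir", base) >= (int)sizeof dir ||
        snprintf(file, sizeof file, "%s/testdir/testfile1.txt", base)
            >= (int)sizeof file) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* 1. mkdir: reaches the parent directory's i_op->mkdir */
    if (mkdir(dir, 0777) != 0 && errno != EEXIST)
        return -1;

    /* 2 + 3. creat and open in one call: i_op->create, then f_op->open */
    fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* 4. write: dispatched through the file's f_op */
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

With the disk from the walkthrough mounted, write_hello("/testmount", "hello world") would perform exactly the sequence described above.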
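Detailed flow:

1. Suppose we have a block device, /dev/sda, formatted with the ext2 file system. How a block device gets formatted is left for the chapter on device management.
2. We mount the disk on the /testmount directory. The mount code registers the corresponding superblock with the kernel; the details of mounting are a topic for another post.
3. To write /testmount/testdir/testfile1.txt, the kernel first has to resolve the full path to a dentry, and the missing directory and file have to be created along the way.
3.1 From the full path, find the superblock of the mount point. The most specific match here is the one mounted at /testmount.
3.2 From the superblock, take the root dentry and its inode, and read the directory contents from disk through the inode's address_space. For a directory, the stored content is the list of its entries, from which the children of the root dentry are built. No testdir entry is found, so the lookup fails with a "no such directory" error. The user then creates the directory, and the new entry is written back to the device behind the parent inode. In the same way, testfile1.txt is created under testdir and written back to the device behind the testdir inode.
4. With the inode in place, the file is opened with the open system call. The process allocates a file descriptor (next_fd in files_struct is the starting hint), the kernel creates a file object, sets file->f_op from inode->i_fop and calls file->f_op->open(inode, file), passes the inode's address_space on to the file, and finally associates the user-space fd with the file object.
5. After that, every read and write goes through the fd, which in the kernel means operating on the corresponding file structure. Writing hello world comes down to a call to file->f_op->write.
6. What file->f_op->write actually does is copy the bytes into the memory pages of the address_space, which writes them back to disk at a suitable later time. This is the cache people usually talk about; the fsync system call forces the data out to storage. Many f_op functions carry the __user annotation, which marks a pointer into user-space memory. That data ends up in page-cache pages belonging to the address_space, and this copy between user and kernel memory is the cost that zero-copy sendfile was later introduced to avoid: it copies between two fds using only the pages inside the kernel, with no round trip through the user address space.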
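
The fsync and sendfile calls mentioned in the last step can be combined into a small in-kernel file copy. The sketch below is an added illustration, not code from the original article; the function name copy_with_sendfile is invented, and it assumes a Linux kernel recent enough for sendfile(2) to accept a regular file as the destination.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Copy src to dst inside the kernel with sendfile(2), then fsync(2) dst.
 * Returns the number of bytes copied, or -1 on error.
 */
long copy_with_sendfile(const char *src, const char *dst)
{
    struct stat st;
    off_t done = 0;
    int in, out;
    long ret = -1;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0)
        goto close_in;
    if (fstat(in, &st) != 0)
        goto close_out;

    /* sendfile() may move fewer bytes than requested, so loop. */
    while (done < st.st_size) {
        ssize_t n = sendfile(out, in, NULL, (size_t)(st.st_size - done));
        if (n <= 0)
            goto close_out;
        done += n;
    }

    /* The data now sits in dst's page cache; fsync forces it to storage. */
    if (fsync(out) == 0)
        ret = (long)done;

close_out:
    close(out);
close_in:
    close(in);
    return ret;
}
```

No user-space buffer appears anywhere in the function, which is the point of the zero-copy path: the bytes move between the two files' cached pages without passing through the process.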

Conclusion

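That completes the basic tour of the VFS. Mounting of the super_block and the actual read and write paths of address_space will be added later. address_space will be covered in detail together with the page cache and the buffer cache, because that area is particularly complex and depends on the concrete file system implementation; it will be presented alongside the ext2 file system.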

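Author: 淡泊宁静_3652
Link: https://www.jianshu.com/p/a98cb5519a50
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.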
