[Repost] A Deep Dive into Linux Kernel Architecture: The Virtual File System (Introduction)

Author: Nothing_655f | Posted: 2021-01-09 16:53

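    In Linux, "everything is a file". Linux supports many file systems, such as Ext2/3/4 and XFS. To support all of them cleanly, the kernel adds an abstraction layer, the virtual file system (VFS), which adapts flexibly to each concrete file system implementation. Its basic architecture is shown below: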

    image

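    As the figure shows, all VFS operations execute in kernel mode. Access to storage and external devices is complex and privileged, so it cannot be handed to user programs directly; otherwise the system would become very unstable.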

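    File system types

    1. Disk-based file systems
      The classic way of storing files on non-volatile media. These are the familiar file systems such as Ext2/3/4 and FAT.
    2. Virtual file systems
      Generated inside the kernel, these give user applications a way to communicate with the kernel and the user. The best known is the proc file system. It needs no storage on any hardware device: its contents are produced by the kernel on demand and live only in memory, so entries such as the per-process directories disappear when the process exits.
    3. Network file systems
      These access data on other computers. A system call on the local machine still enters the kernel through the VFS, but the file system's client code forwards the request over the network instead of reading a local disk. The Linux NFS client is implemented in the kernel; some network file systems, such as sshfs, are instead mounted through FUSE.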
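    The on-demand nature of proc can be observed from user space. The sketch below is illustrative only: the helper name generated_on_read is hypothetical, and it assumes a Linux system with /proc mounted. It checks whether a file reports a size of 0 in fstat() and still returns data when read.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if `path` reports size 0 in fstat() yet
 * yields data on read(), 0 if not, and -1 if it cannot be opened
 * (for example when /proc is not mounted).
 */
int generated_on_read(const char *path)
{
    struct stat st;
    char buf[256];

    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, sizeof(buf));   /* content is produced now */
    close(fd);
    return (st.st_size == 0 && n > 0) ? 1 : 0;
}
```

    Calling generated_on_read("/proc/self/stat") is one way to try it; a regular file on disk with content would return 0 instead.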

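    The common file model

    The VFS defines a set of methods and abstractions, together with a uniform view of the objects (files) in a file system. Concrete implementations can differ completely underneath. What the VFS offers is a superset of operations, and many of them are not needed by a given file system; proc, for example, has no use for the writepage operation.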
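    One way to picture this superset is as a table of function pointers in which a file system leaves the entries it does not need empty. The following is a user-space sketch, not kernel code: every toy_* name, mem_ops and info_ops are made up for illustration, and returning -EINVAL for a missing method is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a kernel operation table. */
struct toy_file;
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
    long (*write)(struct toy_file *f, const char *buf, size_t len); /* may be NULL */
};
struct toy_file {
    const struct toy_file_ops *ops;
    char data[64];
    size_t size;
};

/* "Memory" file system: supports both read and write. */
static long mem_read(struct toy_file *f, char *buf, size_t len)
{
    size_t n = len < f->size ? len : f->size;
    memcpy(buf, f->data, n);
    return (long)n;
}
static long mem_write(struct toy_file *f, const char *buf, size_t len)
{
    size_t n = len < sizeof(f->data) ? len : sizeof(f->data);
    memcpy(f->data, buf, n);
    f->size = n;
    return (long)n;
}

/* proc-like file system: content is generated, there is nothing to write. */
static long info_read(struct toy_file *f, char *buf, size_t len)
{
    static const char text[] = "generated\n";
    size_t n = len < sizeof(text) - 1 ? len : sizeof(text) - 1;
    (void)f;
    memcpy(buf, text, n);
    return (long)n;
}

const struct toy_file_ops mem_ops  = { .read = mem_read,  .write = mem_write };
const struct toy_file_ops info_ops = { .read = info_read, .write = NULL };

/* Generic layer: callers never need to know which implementation is behind f. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f && f->ops);
    return f->ops->read ? f->ops->read(f, buf, len) : -EINVAL;
}
long toy_vfs_write(struct toy_file *f, const char *buf, size_t len)
{
    assert(f && f->ops);
    return f->ops->write ? f->ops->write(f, buf, len) : -EINVAL;
}
```

    A caller sets ops to &mem_ops or &info_ops and uses only toy_vfs_read and toy_vfs_write; writing to the proc-like file simply fails in the generic layer.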
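    When handling files, kernel space and user space work with different objects. In user space a file is identified by a file descriptor (FD), an integer that is valid only within one process; two different processes can use the same FD number to refer to different files. In the kernel, an FD corresponds to a struct file, whose key member is a pointer to an address_space, the structure that actually exchanges data with the underlying device. File metadata is managed by a separate structure, the inode, which stores the link count, access times, version, backing device, owning superblock and so on, but not the file name. File names are stored in struct dentry, because a name exists only to find and manage an inode, and that is the dentry's job. The dentry tree itself is reached from the super_block through its root dentry.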
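    The split between the integer FD and the kernel's struct file can be seen with dup(), which creates a second descriptor for the same open file. A minimal sketch follows; the helper name and the /tmp path template are assumptions. A write through one descriptor moves the file offset seen through the other, because both numbers refer to one struct file.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens a temporary file, duplicates the descriptor,
 * writes 5 bytes through the first fd and returns the offset observed
 * through the second one. Returns -1 on error.
 */
long offset_seen_through_dup(void)
{
    char path[] = "/tmp/vfs_dup_XXXXXX";
    long off = -1;

    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    unlink(path);                   /* the file lives until the last fd closes */

    int fd2 = dup(fd1);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }
    assert(fd2 != fd1);             /* two numbers, one open file */

    if (write(fd1, "hello", 5) == 5)
        off = (long)lseek(fd2, 0, SEEK_CUR);

    close(fd1);
    close(fd2);
    return off;
}
```

    Opening the same path twice with open() would instead give two independent offsets, since each open() creates its own struct file.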
    Below we look at each structure and how they relate to one another, and then walk through what happens in Linux from opening a file to writing to it.

    VFS structures

    image

    inode

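    The inode manages a file's metadata, including permissions, access information, link information and storage device information. Its operations mainly deal with links, permissions and the like.
    For more background, see inode. The data structures are shown below: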

    /*
     * Keep mostly read-only and often accessed (especially for
     * the RCU path lookup and 'stat' data) fields at the beginning
     * of the 'struct inode'
     */
    struct inode {
        ...
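        const struct inode_operations   *i_op; // inode operations, specific to the concrete file system
        struct super_block  *i_sb; // owning superblock
        struct address_space    *i_mapping; // address space, the part that actually exchanges data with the device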
            ...
        /* Stat data, not accessed from path walking */
        unsigned long       i_ino; // inode number
        /*
         * Filesystems may only read i_nlink directly.  They shall use the
         * following functions for modification:
         *
         *    (set|clear|inc|drop)_nlink
         *    inode_(inc|dec)_link_count
         */
        union {
            const unsigned int i_nlink;
            unsigned int __i_nlink;
        };
        dev_t           i_rdev;
        loff_t          i_size;
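        struct timespec64   i_atime; // time of last access
        struct timespec64   i_mtime; // time of last content modification
        struct timespec64   i_ctime; // time of last inode (status) change, not creation time
        spinlock_t          i_lock; /* i_blocks, i_bytes, maybe i_size */
        unsigned short      i_bytes; // bytes used beyond the 512-byte units counted in i_blocks
        u8                  i_blkbits;       // log2 of the block size
        u8                  i_write_hint;
        blkcnt_t            i_blocks; // number of 512-byte units allocated to the file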
    
    #ifdef __NEED_I_SIZE_ORDERED
        seqcount_t      i_size_seqcount;
    #endif
    
        /* Misc */
        unsigned long       i_state;
        struct rw_semaphore i_rwsem;
    
        unsigned long       dirtied_when;   /* jiffies of first dirtying */
        unsigned long       dirtied_time_when;
    
        struct hlist_node   i_hash;
        struct list_head    i_io_list;  /* backing dev IO list */
    #ifdef CONFIG_CGROUP_WRITEBACK
        struct bdi_writeback    *i_wb;      /* the associated cgroup wb */
    
        /* foreign inode detection, see wbc_detach_inode() */
        int         i_wb_frn_winner;
        u16         i_wb_frn_avg_time;
        u16         i_wb_frn_history;
    #endif
        struct list_head    i_lru;      /* inode LRU list */
        struct list_head    i_sb_list;
        struct list_head    i_wb_list;  /* backing dev writeback list */
        union {
            struct hlist_head   i_dentry; // an inode can be used by several dentries (hard links)
            struct rcu_head i_rcu;
        };
        atomic64_t  i_version;
        atomic_t        i_count;
        atomic_t        i_dio_count;
        atomic_t        i_writecount;
    #ifdef CONFIG_IMA
        atomic_t        i_readcount; /* struct files open RO */
    #endif
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct file_lock_context    *i_flctx;
        struct address_space    i_data;
        struct list_head    i_devices;
        union {
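            struct pipe_inode_info  *i_pipe; // pipe
            struct block_device *i_bdev; // block device
            struct cdev     *i_cdev;  // character device
            char            *i_link; // symlink target, when the file system caches it in the inode
            unsigned        i_dir_seq; // directory sequence counter, used by parallel lookups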
        };
        __u32           i_generation;
    #ifdef CONFIG_FSNOTIFY
        __u32           i_fsnotify_mask; /* all events this inode cares about */
        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
    #endif
    
    #if IS_ENABLED(CONFIG_FS_ENCRYPTION)
        struct fscrypt_info *i_crypt_info;
    #endif
        void            *i_private; /* fs or device private pointer */
    } __randomize_layout;
    struct inode_operations {
        struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); // look up, in the directory inode, the inode for the name held in the dentry
        const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); // return the target of a symbolic link so the VFS can follow it
        int (*permission) (struct inode *, int);
        struct posix_acl * (*get_acl)(struct inode *, int);
    
        int (*readlink) (struct dentry *, char __user *,int);
    
        int (*create) (struct inode *,struct dentry *, umode_t, bool);
        int (*link) (struct dentry *,struct inode *,struct dentry *); // create a hard link
        int (*unlink) (struct inode *,struct dentry *); // remove a hard link
        int (*symlink) (struct inode *,struct dentry *,const char *); // create a symbolic link
        int (*mkdir) (struct inode *,struct dentry *,umode_t); // create the directory named by the dentry with the given mode, and its inode
        int (*rmdir) (struct inode *,struct dentry *); // remove a directory
        int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); // create a special file (device node, FIFO or socket) from the mode and device number
        int (*rename) (struct inode *, struct dentry *,
                struct inode *, struct dentry *, unsigned int); // VFS to move the file specified by old_dentry from the old_dir directory to the directory new_dir, with the filename specified by new_dentry
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
                  u64 len);
        int (*update_time)(struct inode *, struct timespec64 *, int);
        int (*atomic_open)(struct inode *, struct dentry *,
                   struct file *, unsigned open_flag,
                   umode_t create_mode); 
        int (*tmpfile) (struct inode *, struct dentry *, umode_t);
        int (*set_acl)(struct inode *, struct posix_acl *, int);
    } ____cacheline_aligned;
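
    Several of these inode fields are visible from user space through stat(): st_ino corresponds to i_ino and st_nlink to i_nlink. The sketch below is an illustration under assumptions (the helper name and the /tmp path template are made up, and /tmp must support hard links). It creates a second name for a file with link() and checks that both names report the same inode number and a link count of 2.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates a temp file plus one hard link to it.
 * Returns 1 if both names report the same inode number and a link count
 * of 2, 0 if not, -1 on error.
 */
int hard_link_shares_inode(void)
{
    char path[] = "/tmp/vfs_ino_XXXXXX";
    char alias[sizeof(path) + 8];
    struct stat a, b;
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);
    snprintf(alias, sizeof(alias), "%s.lnk", path);

    if (link(path, alias) == 0) {           /* second name, same inode */
        if (stat(path, &a) == 0 && stat(alias, &b) == 0)
            result = (a.st_ino == b.st_ino &&
                      a.st_nlink == 2 && b.st_nlink == 2);
        unlink(alias);
    }
    unlink(path);
    return result;
}
```

    The file name never appears in struct stat, which matches the text above: names belong to directory entries, not to the inode.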
    
    

    dentry

    A dentry mainly manages a file name and links a directory entry to its parent and all of its child entries.

    dentry state

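    A dentry can be in one of three states: used, unused or negative.
    used: associated with a valid inode and currently referenced.
    unused: associated with a valid inode, but its reference count is 0; it stays in the cache and has not actually been freed yet.
    negative: not associated with any inode, either because the file was deleted or because the name never existed on the storage device.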

    dentry cache

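    Looking up the dentry for a path would be expensive if every step had to go to disk, so the kernel keeps an LRU cache to speed lookups up. To find the directory entry for /usr/bin/java, for example, the kernel first finds the entry for /, then usr, then bin, and so on to the end of the path. The entries touched along the way are cached, which makes the next lookup cheaper. The lookup process itself is analyzed in detail under lookup later on.
    For more details, see dentry.
    The data structures are shown below: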

    struct dentry {
        /* RCU lookup touched fields */
        unsigned int d_flags;       /* protected by d_lock */
        seqcount_t d_seq;       /* per dentry seqlock */
        struct hlist_bl_node d_hash;    /* lookup hash list */
        struct dentry *d_parent;    /* parent directory */
        struct qstr d_name;
        struct inode *d_inode;      /* Where the name belongs to - NULL is
                         * negative */
        unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */
    
        /* Ref lookup also touches following */
        struct lockref d_lockref;   /* per-dentry lock and refcount */
        const struct dentry_operations *d_op;
        struct super_block *d_sb;   /* The root of the dentry tree */
        unsigned long d_time;       /* used by d_revalidate */
        void *d_fsdata;         /* fs-specific data */
    
        union {
            struct list_head d_lru;     /* LRU list */
            wait_queue_head_t *d_wait;  /* in-lookup ones only */
        };
        struct list_head d_child;   /* child of parent list */
        struct list_head d_subdirs; /* our children */
        /*
         * d_alias and d_rcu can share memory
         */
        union {
            struct hlist_node d_alias;  /* inode alias list */
            struct hlist_bl_node d_in_lookup_hash;  /* only for in-lookup ones */
            struct rcu_head d_rcu;
        } d_u;
    } __randomize_layout;
    struct dentry_operations {
        int (*d_revalidate)(struct dentry *, unsigned int); // check whether the dentry is still valid
        int (*d_weak_revalidate)(struct dentry *, unsigned int);
        int (*d_hash)(const struct dentry *, struct qstr *); // compute the hash of the name
        int (*d_compare)(const struct dentry *, // compare file names
                unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *); 
                         // called when the last reference is dropped; a nonzero return discards the dentry instead of keeping it cached as unused
        int (*d_init)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_prune)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *); // called when a dentry loses its inode, to release that inode
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
        int (*d_manage)(const struct path *, bool);
        struct dentry *(*d_real)(struct dentry *, const struct inode *);
    } ____cacheline_aligned;
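
    The component-by-component walk and the negative state described above can be modelled in a few lines. This is a toy user-space model, not the kernel's dcache: the toy_dentry type and the toy_* functions are hypothetical, children are kept in a plain linked list instead of a hash table, and an inode number of 0 stands for a negative entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified dentry. */
struct toy_dentry {
    const char *name;
    int ino;                        /* 0 = negative: name known not to exist */
    struct toy_dentry *parent;
    struct toy_dentry *child;       /* first child */
    struct toy_dentry *sibling;     /* next child of the same parent */
};

void toy_add_child(struct toy_dentry *parent, struct toy_dentry *d)
{
    d->parent = parent;
    d->sibling = parent->child;
    parent->child = d;
}

struct toy_dentry *toy_find_child(struct toy_dentry *dir,
                                  const char *name, size_t len)
{
    for (struct toy_dentry *d = dir->child; d; d = d->sibling)
        if (strlen(d->name) == len && memcmp(d->name, name, len) == 0)
            return d;
    return NULL;
}

/*
 * Resolve an absolute path one component at a time, starting at `root`.
 * Returns the final dentry, a negative dentry (ino == 0) if one is hit on
 * the way, or NULL when a component is not cached at all.
 */
struct toy_dentry *toy_path_walk(struct toy_dentry *root, const char *path)
{
    struct toy_dentry *cur = root;

    assert(path[0] == '/');
    while (*path) {
        while (*path == '/')
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;
        cur = toy_find_child(cur, path, len);
        if (!cur || cur->ino == 0)
            return cur;             /* cache miss or negative dentry */
        path += len;
    }
    return cur;
}
```

    In this model a NULL result is where a real kernel would ask the file system (through the directory inode's lookup method) and then cache the answer, positive or negative.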
    
    

    super_block

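    The superblock holds the parameters of the actual file system behind a mount point, including the block size, the maximum file size the file system can handle, the file system type and the backing storage device. (Note: in the earlier overall structure diagram, the superblock has a files member pointing to all open files, but no such field appears in the data structure below. That list used to be consulted during umount to make sure every file had been closed. I have not yet found how newer kernels handle this and will add it once I do.)
    Superblocks are managed mainly by the mount logic, which will be analyzed in detail together with the mount-related modules. The superblock's main job is to manage inodes.
    For more details, see superblock.
    The data structures are shown below: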

    struct super_block {
        struct list_head    s_list;     /* Keep this first */
        dev_t           s_dev;      /* search index; _not_ kdev_t */
        unsigned char       s_blocksize_bits; // log2(block size in bytes)
        unsigned long       s_blocksize; // block size in bytes
        loff_t          s_maxbytes; /* Max file size */
        struct file_system_type *s_type; // file system type
        const struct super_operations   *s_op; // superblock operations
        const struct dquot_operations   *dq_op;
        const struct quotactl_ops   *s_qcop;
        const struct export_operations *s_export_op;
        unsigned long       s_flags;
        unsigned long       s_iflags;   /* internal SB_I_* flags */
        unsigned long       s_magic;
        struct dentry       *s_root; // root dentry of this file system; path lookup inside it starts here
        struct rw_semaphore s_umount;
        int         s_count;
        atomic_t        s_active;
    #ifdef CONFIG_SECURITY
        void                    *s_security;
    #endif
        const struct xattr_handler **s_xattr;
    #if IS_ENABLED(CONFIG_FS_ENCRYPTION)
        const struct fscrypt_operations *s_cop;
    #endif
        struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
        struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
        struct block_device *s_bdev;
        struct backing_dev_info *s_bdi;
        struct mtd_info     *s_mtd;
        struct hlist_node   s_instances;
        unsigned int        s_quota_types;  /* Bitmask of supported quota types */
        struct quota_info   s_dquot;    /* Diskquota specific options */
    
        struct sb_writers   s_writers;
    
        /*
         * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
         * s_fsnotify_marks together for cache efficiency. They are frequently
         * accessed and rarely modified.
         */
        void            *s_fs_info; /* Filesystem private info */
    
        /* Granularity of c/m/atime in ns (cannot be worse than a second) */
        u32         s_time_gran;
    #ifdef CONFIG_FSNOTIFY
        __u32           s_fsnotify_mask;
        struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
    #endif
    
        char            s_id[32];   /* Informational name */
        uuid_t          s_uuid;     /* UUID */
    
        unsigned int        s_max_links;
        fmode_t         s_mode;
    
        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct mutex s_vfs_rename_mutex;    /* Kludge */
    
        /*
         * Filesystem subtype.  If non-empty the filesystem type field
         * in /proc/mounts will be "type.subtype"
         */
        char *s_subtype;
    
        const struct dentry_operations *s_d_op; /* default d_op for dentries */
    
        /*
         * Saved pool identifier for cleancache (-1 means none)
         */
        int cleancache_poolid;
    
        struct shrinker s_shrink;   /* per-sb shrinker handle */
    
        /* Number of inodes with nlink == 0 but still referenced */
        atomic_long_t s_remove_count;
    
        /* Pending fsnotify inode refs */
        atomic_long_t s_fsnotify_inode_refs;
    
        /* Being remounted read-only */
        int s_readonly_remount;
    
        /* AIO completions deferred from interrupt context */
        struct workqueue_struct *s_dio_done_wq;
        struct hlist_head s_pins;
    
        /*
         * Owning user namespace and default context in which to
         * interpret filesystem uids, gids, quotas, device nodes,
         * xattrs and security labels.
         */
        struct user_namespace *s_user_ns;
    
        /*
         * The list_lru structure is essentially just a pointer to a table
         * of per-node lru lists, each of which has its own spinlock.
         * There is no need to put them into separate cachelines.
         */
        struct list_lru     s_dentry_lru; // dentry cache (LRU of unused dentries)
        struct list_lru     s_inode_lru; // inode cache (LRU of unused inodes)
        struct rcu_head     rcu;
        struct work_struct  destroy_work;
    
        struct mutex        s_sync_lock;    /* sync serialisation lock */
    
        /*
         * Indicates how deep in a filesystem stack this SB is
         */
        int s_stack_depth;
    
        /* s_inode_list_lock protects s_inodes */
        spinlock_t      s_inode_list_lock ____cacheline_aligned_in_smp;
        struct list_head    s_inodes;   /* all inodes */
    
        spinlock_t      s_inode_wblist_lock;
        struct list_head    s_inodes_wb;    /* writeback inodes */
    } __randomize_layout;
    struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode for this sb
        void (*destroy_inode)(struct inode *); // free an inode of this sb
        void (*dirty_inode) (struct inode *, int flags); // mark the inode dirty
        int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back
        int (*drop_inode) (struct inode *); // called when the last reference to the inode is dropped
        void (*evict_inode) (struct inode *);
        void (*put_super) (struct super_block *);  // release the sb on unmount
        int (*sync_fs)(struct super_block *sb, int wait); 
        int (*freeze_super) (struct super_block *);
        int (*freeze_fs) (struct super_block *);
        int (*thaw_super) (struct super_block *);
        int (*unfreeze_fs) (struct super_block *);
        int (*statfs) (struct dentry *, struct kstatfs *); // report file system statistics
        int (*remount_fs) (struct super_block *, int *, char *); // remount
        void (*umount_begin) (struct super_block *); // mainly used by NFS
            // used when displaying mount information
        int (*show_options)(struct seq_file *, struct dentry *);
        int (*show_devname)(struct seq_file *, struct dentry *);
        int (*show_path)(struct seq_file *, struct dentry *);
        int (*show_stats)(struct seq_file *, struct dentry *);
    #ifdef CONFIG_QUOTA
        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
        struct dquot **(*get_dquots)(struct inode *);
    #endif
        int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
        long (*nr_cached_objects)(struct super_block *,
                      struct shrink_control *);
        long (*free_cached_objects)(struct super_block *,
                        struct shrink_control *);
    };
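
    Some superblock parameters, such as the block size, can be queried from user space. On Linux the statfs family of system calls is answered through the statfs method of super_operations; the POSIX statvfs() wrapper used below sits on top of it. The helper name fs_block_size is an assumption made for this sketch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: returns the block size reported for the file system
 * that holds `path`, or 0 on error.
 */
unsigned long fs_block_size(const char *path)
{
    struct statvfs sv;

    assert(path != NULL);
    if (statvfs(path, &sv) != 0)
        return 0;
    return sv.f_bsize;
}
```

    For example, fs_block_size("/") reports the block size of the root file system, while a path that does not exist yields 0.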
    
    

    address_space

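    As described above, the superblock manages inodes, the dentry manages file names, the mapping from names to inodes and the directory hierarchy, and the inode manages file metadata. But which structure actually moves file data to and from disks and other storage devices? How does written data get from memory back to disk, and how is data read from disk into memory? That is the job of address_space, which synchronizes data between memory and the backing device. How it works is covered in detail in the discussion of the memory cache.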

    struct address_space {
        struct inode        *host; // owning inode, used to reach the file's metadata
        struct xarray       i_pages; // cached pages of the file
        gfp_t           gfp_mask; // allocation flags used for the pages
        atomic_t        i_mmap_writable; // count of VM_SHARED mappings
        struct rb_root_cached   i_mmap; // tree of private and shared mmap mappings
        struct rw_semaphore i_mmap_rwsem;
        unsigned long       nrpages; // number of cached pages
        unsigned long       nrexceptional;
        pgoff_t         writeback_index; // writeback starts here
        const struct address_space_operations *a_ops; // address space operations
        unsigned long       flags; // error bits and other flags
        errseq_t        wb_err; // most recent writeback error
        spinlock_t      private_lock;
        struct list_head    private_list;
        void            *private_data;
    } __attribute__((aligned(sizeof(long)))) __randomize_layout;
    struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc); // write one page back
        int (*readpage)(struct file *, struct page *); // read one page into memory
    
        /* Write back some dirty pages from this mapping. */
        int (*writepages)(struct address_space *, struct writeback_control *); // write back dirty pages
    
        /* Set a page dirty.  Return true if this dirtied it */
        int (*set_page_dirty)(struct page *page); // mark a page dirty
    
        /*
         * Reads in the requested pages. Unlike ->readpage(), this is
         * PURELY used for read-ahead!.
         */
        int (*readpages)(struct file *filp, struct address_space *mapping,
                struct list_head *pages, unsigned nr_pages);
    
        int (*write_begin)(struct file *, struct address_space *mapping,
                    loff_t pos, unsigned len, unsigned flags,
                    struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                    loff_t pos, unsigned len, unsigned copied,
                    struct page *page, void *fsdata);
    
        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidatepage) (struct page *, unsigned int, unsigned int);
        int (*releasepage) (struct page *, gfp_t);
        void (*freepage)(struct page *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
        /*
         * migrate the contents of a page to the specified target. If
         * migrate_mode is MIGRATE_ASYNC, it must not block.
         */
        int (*migratepage) (struct address_space *,
                struct page *, struct page *, enum migrate_mode);
        bool (*isolate_page)(struct page *, isolate_mode_t);
        void (*putback_page)(struct page *);
        int (*launder_page) (struct page *);
        int (*is_partially_uptodate) (struct page *, unsigned long,
                        unsigned long);
        void (*is_dirty_writeback) (struct page *, bool *, bool *);
        int (*error_remove_page)(struct address_space *, struct page *);
    
        /* swapfile support */
        int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                    sector_t *span);
        void (*swap_deactivate)(struct file *file);
    };
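
    Because read(), write() and mmap() on a regular file all go through the file's address_space, a change made through a shared mapping is visible to a later read of the same file. The sketch below illustrates this under assumptions: the helper name and the /tmp path template are made up, and the msync() call is included only so the result does not rely on Linux-specific behavior.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes "hello" with write(), changes the first byte
 * through a shared mapping, then reads the file back with pread().
 * Returns 1 if pread() sees "Hello", 0 if not, -1 on error.
 */
int mmap_and_read_agree(void)
{
    char path[] = "/tmp/vfs_map_XXXXXX";
    char buf[5];
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    if (write(fd, "hello", 5) == 5) {
        char *p = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'H';                         /* dirties the cached page */
            if (msync(p, 5, MS_SYNC) == 0 &&
                pread(fd, buf, 5, 0) == 5)
                result = (memcmp(buf, "Hello", 5) == 0);
            munmap(p, 5);
        }
    }
    close(fd);
    return result;
}
```

    With MAP_PRIVATE instead of MAP_SHARED the store would go to a private copy of the page, and pread() would still return "hello".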
    
    

    file

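    As mentioned earlier, a process in user space sees an integer fd, while the corresponding kernel data structure is struct file. Every user-space operation on an fd is turned by a system call into an operation on a file.
    For more details, see file.
    The data structures are shown below: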

    struct task_struct {
           ...
        /* Filesystem information: */
        struct fs_struct        *fs; // root & pwd path
    
        /* Open file information: */
        struct files_struct     *files; // opened files
    
        /* Namespaces: */
        struct nsproxy          *nsproxy;
            ...
    };
    /*
     * Open file table structure
     */
    struct files_struct {
      /*
       * read mostly part
       */
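        atomic_t count; // reference count: number of tasks sharing this table
        bool resize_in_progress; // set while the fd table is being resized
        wait_queue_head_t resize_wait;
    
        struct fdtable __rcu *fdt; // pointer to the fd table currently in use
        struct fdtable fdtab; // embedded initial fd table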
      /*
       * written part on a separate cache line in SMP
       */
        spinlock_t file_lock ____cacheline_aligned_in_smp;
        unsigned int next_fd; // where the search for the next free fd starts
        unsigned long close_on_exec_init[1];
        unsigned long open_fds_init[1];
        unsigned long full_fds_bits_init[1];
        struct file __rcu * fd_array[NR_OPEN_DEFAULT]; // initial array of open files
    };
    struct fdtable {
        unsigned int max_fds; // current capacity of this table; it grows on demand up to the RLIMIT_NOFILE limit (ulimit -n)
        struct file __rcu **fd;      /* current fd array */
        unsigned long *close_on_exec;
        unsigned long *open_fds;  // bitmap of fds in use
        unsigned long *full_fds_bits;
        struct rcu_head rcu;
    };
    struct file {
        union {
            struct llist_node   fu_llist;
            struct rcu_head     fu_rcuhead;
        } f_u;
        struct path     f_path;  // path (mount and dentry)
        struct inode        *f_inode;    /* cached value */
        const struct file_operations    *f_op; // file operations
        /*
         * Protects f_ep_links, f_flags.
         * Must not be taken from IRQ context.
         */
        spinlock_t      f_lock;
        enum rw_hint        f_write_hint;
        atomic_long_t   f_count;
        unsigned int        f_flags;
        fmode_t         f_mode;
        struct mutex        f_pos_lock;
        loff_t          f_pos; // current file position
        struct fown_struct  f_owner; // owner that receives signals such as SIGIO for this file
        const struct cred   *f_cred;
        struct file_ra_state    f_ra;
        u64         f_version;
    #ifdef CONFIG_SECURITY
        void            *f_security;
    #endif
        /* needed for tty driver, and maybe others */
        void            *private_data;
    
    #ifdef CONFIG_EPOLL
        /* Used by fs/eventpoll.c to link all the hooks to this file */
        struct list_head    f_ep_links;
        struct list_head    f_tfile_llink;
    #endif /* #ifdef CONFIG_EPOLL */
        struct address_space    *f_mapping; // address space
        errseq_t        f_wb_err;
    } __randomize_layout;
    struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int); // move the file position
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
        int (*iterate) (struct file *, struct dir_context *);
        int (*iterate_shared) (struct file *, struct dir_context *);
        __poll_t (*poll) (struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *); // map the file into virtual memory
        unsigned long mmap_supported_flags;
        int (*open) (struct inode *, struct file *); // called when the file is opened
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, loff_t, loff_t, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int); 
        int (*flock) (struct file *, int, struct file_lock *); // lock a file
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **, void **);
        long (*fallocate)(struct file *file, int mode, loff_t offset,
                  loff_t len);
        void (*show_fdinfo)(struct seq_file *m, struct file *f);
    #ifndef CONFIG_MMU
        unsigned (*mmap_capabilities)(struct file *);
    #endif
        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
                loff_t, size_t, unsigned int);
        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                       struct file *file_out, loff_t pos_out,
                       loff_t len, unsigned int remap_flags);
        int (*fadvise)(struct file *, loff_t, loff_t, int);
    } __randomize_layout;
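
    The open_fds bitmap and next_fd in files_struct serve one user-visible rule: open() returns the lowest descriptor number that is currently free. A small sketch, assuming a single-threaded caller and that /dev/null can be opened (the helper name is made up):

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens two descriptors, closes the first, opens a
 * third. Returns 1 if the third open reuses the number that was just
 * freed, 0 if not, -1 on error.
 */
int freed_fd_is_reused(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);
    if (a < 0 || b < 0) {
        if (a >= 0)
            close(a);
        if (b >= 0)
            close(b);
        return -1;
    }
    assert(b != a);

    close(a);                               /* frees the lower number */
    int c = open("/dev/null", O_RDONLY);    /* lowest free fd is `a` again */
    int reused = (c == a);

    close(b);
    if (c >= 0)
        close(c);
    return reused;
}
```

    All three descriptors name the same path, yet each open() creates its own struct file; only the slot number in the per-process table is recycled.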
    
    

    The virtual file system in practice

    We now have a basic picture of the VFS architecture, but it is still too vague for a deeper understanding. So let us operate on a file in pseudocode to describe how the Linux kernel reads and writes files, using a write as the example:
    Goal: starting from nothing, write hello world to the file /testmount/testdir/testfile1.txt.
    The basic sequence of system calls is: 1. mkdir 2. creat 3. open 4. write.
    mkdir leads to the following calls:
    rootInode = sb->s_root->d_inode;
    testDirDentry = dentry("testdir");
    testDirInode = rootInode->i_op->mkdir(rootInode, testDirDentry, 0777);
    creat leads to the following calls:
    testFileDentry = dentry("testfile1.txt");
    testFileInode = testDirInode->i_op->create(testDirInode, testFileDentry, 0777, false);
    open leads to the following call:
    testFileInode->i_fop->open(testFileInode, testfile);
    write leads to the following call:
    testfile->f_op->write(testfile, "hello world", len, &pos);  // pos starts at 0
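    From user space, the same four steps look as follows. This is a sketch under assumptions: a temporary directory created with mkdtemp() stands in for the /testmount mount point, the helper name write_hello is made up, and creat plus open are folded into one open() call with O_CREAT.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates <tmp>/testdir/testfile1.txt, writes
 * "hello world", forces it out with fsync() and reads it back into `out`.
 * Returns the number of bytes read back, or -1 on error.
 */
long write_hello(char *out, size_t outlen)
{
    char base[] = "/tmp/vfs_demo_XXXXXX";
    char dir[64], file[96];
    const char msg[] = "hello world";
    long n = -1;

    if (!mkdtemp(base))
        return -1;
    snprintf(dir, sizeof(dir), "%s/testdir", base);
    snprintf(file, sizeof(file), "%s/testfile1.txt", dir);

    if (mkdir(dir, 0777) == 0) {                              /* 1. mkdir */
        int fd = open(file, O_CREAT | O_EXCL | O_RDWR, 0644); /* 2+3. create, open */
        if (fd >= 0) {
            if (write(fd, msg, strlen(msg)) == (ssize_t)strlen(msg) && /* 4. write */
                fsync(fd) == 0)
                n = (long)pread(fd, out, outlen, 0);
            close(fd);
            unlink(file);
        }
        rmdir(dir);
    }
    rmdir(base);
    return n;
}
```

    Each libc call here enters the kernel through a system call, and the VFS then dispatches to the i_op and f_op methods of whichever file system backs /tmp.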
    Detailed flow:

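    1. Suppose we have a block disk device /dev/sda and format it as an Ext2 file system. How a block device is formatted is described in the device management chapter.
    2. We mount the disk at /testmount. The kernel's mount code then registers the corresponding superblock. How mounting works is covered in a later part.
    3. To write /testmount/testdir/testfile1.txt, the kernel first has to resolve the full path to a directory entry, and the missing inodes have to be created where they do not exist.
      3.1 From the full path, find the superblock of the matching mount point. The most specific match here is the sb mounted at /testmount.
      3.2 From the sb, find its root dentry and the inode that belongs to it, then read the directory contents from disk through the inode's address_space. For a directory, the stored content is the list of its child entries, which is used to build the children of the root dentry. No entry named testdir is found, so the lookup fails with a "no such directory" error. The user then creates the directory, and the new information is written back to the device behind the inode. In the same way, testfile1.txt has to be created under /testdir and written back to the device behind the /testdir inode.
    4. With the inode found, we open the file with the open system call. The process allocates a file descriptor with the help of next_fd in files_struct, a file object is created, inode->i_fop->open(inode, file) is called, the inode's address_space is passed on to the file, and finally the user-space fd is associated with that file object.
    5. After the open, all further reads and writes go through that fd, which inside the kernel means operating on the corresponding file structure. To write hello world, for example, the kernel calls file->f_op->write.
      What file->f_op->write really does is copy the bytes into the memory pages held by the address_space; the address_space then writes them back to disk at a suitable time. This is the cache we usually talk about, and the fsync system call can of course be used to force the data out to storage. The f_op functions carry the __user annotation, which says that the data comes from a user-space address. That data ends up in the page memory of the address_space in the kernel cache, which is the kernel copy people usually refer to. This is what led to the well-known zero-copy sendfile, which copies data directly between two fds, operating only on pages inside the kernel, with no detour through the user address space.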
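
    The sendfile path mentioned in step 5 can be tried with two regular files. This is a Linux-specific sketch under assumptions: the helper name and the /tmp path templates are made up, and using a regular file as the destination requires a reasonably recent kernel. The data is moved inside the kernel and never lands in a buffer owned by this program until the final check.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

/*
 * Hypothetical helper: copies one temporary file to another with
 * sendfile(). Returns 1 if the destination ends up with the same bytes,
 * 0 if not, -1 on error.
 */
int sendfile_copy_matches(void)
{
    char src[] = "/tmp/vfs_src_XXXXXX";
    char dst[] = "/tmp/vfs_dst_XXXXXX";
    const char msg[] = "hello world";
    char buf[sizeof(msg)] = {0};
    int result = -1;

    int in = mkstemp(src);
    int out = mkstemp(dst);
    if (in >= 0 && out >= 0 &&
        write(in, msg, strlen(msg)) == (ssize_t)strlen(msg)) {
        off_t off = 0;      /* read from the start; in's own offset is untouched */
        if (sendfile(out, in, &off, strlen(msg)) == (ssize_t)strlen(msg) &&
            pread(out, buf, strlen(msg), 0) == (ssize_t)strlen(msg))
            result = (memcmp(buf, msg, strlen(msg)) == 0);
    }
    if (in >= 0) {
        close(in);
        unlink(src);
    }
    if (out >= 0) {
        close(out);
        unlink(dst);
    }
    return result;
}
```

    A copy loop built from read() and write() would produce the same file contents, but every byte would cross the user/kernel boundary twice.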

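    Conclusion

    That completes the basic VFS flow. Mounting of the super_block and the actual read and write operations of address_space will be filled in later. address_space will be covered in detail together with the page cache and the block (buffer) cache, because that part is particularly complex and depends on the concrete file system implementation; it will be presented together with the Ext2 file system.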

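    Author: 淡泊宁静_3652
    Link: https://www.jianshu.com/p/a98cb5519a50
    Source: Jianshu (简书)
    Copyright belongs to the author. For commercial reposting, contact the author for authorization; for non-commercial reposting, credit the source.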

    Article link: https://www.haomeiwen.com/subject/zjwcaktx.html