Block Storage 2022-10-29


Author: 9_SooHyun | Published: 2022-10-28 18:58

Preface

"Block" means different things in different contexts.
For example, the block of a block device and the block of a file system are not the same concept.
The two relate as follows:

In general, for a magnetic disk, a sector is the smallest unit of information that can be read or written (even to read or write one or two bytes, the whole block must be read or written). Sector sizes are typically 512 bytes. For an SSD, the smallest unit is often called a page, whose size is commonly 4096 bytes. Here, both sector and page are physical units and represent the disk block.

However, in some contexts a block is the logical unit of storage allocation and retrieval used by file systems or database systems; such block sizes today typically range from 4 to 16 kilobytes.

A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size.

The operating system creates an abstraction called the file system, which has its own block size that is a multiple of (and usually larger than) the disk block size. As with the disk, the operating system reads and writes data in units of the file system block size. A single read or write of one file system block is carried out as multiple disk block operations.
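The relationship between the two block sizes reduces to simple arithmetic. The sketch below uses a hypothetical helper, `disk_ops_per_fs_block` (not part of any tool), and assumes the file system block is an exact multiple of the disk block.

```shell
# disk_ops_per_fs_block FS_BLOCK_BYTES DISK_BLOCK_BYTES
# Prints how many disk blocks make up one file system block.
# Hypothetical helper for illustration; fails when the sizes are not an
# integral multiple of each other.
disk_ops_per_fs_block() (
  fs_block=$1
  disk_block=$2
  if [ "$disk_block" -le 0 ] || [ $((fs_block % disk_block)) -ne 0 ]; then
    echo "file system block must be a multiple of the disk block" >&2
    exit 1
  fi
  echo $((fs_block / disk_block))
)

disk_ops_per_fs_block 4096 512   # a 4 KiB file system block on 512-byte sectors
```

With these inputs the helper prints 8: one 4 KiB file system block corresponds to eight 512-byte disk blocks.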

Article outline

  • block device
  • partitioning
  • formatting
  • mounting

block device

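Here, block means a small unit of storage. A block device is therefore a storage device made up of a large number of these units.

Common block devices include HDDs and SSDs:

  • hdd: hard disk drive. The traditional rotating magnetic disk (usually talks to the operating system over SATA)
  • ssd: solid state drive (usually talks to the operating system over SATA or NVMe)

For an HDD, the sector is the disk block.
For an SSD, the page is the disk block.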

lsblk: list all block devices recognized by the kernel

On Linux, lsblk lists the block device files that the kernel has recognized.
The command prints all block devices (except RAM disks) in a tree-like format by default.

[root@VM-165-116-centos ~]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0     11:0    1 16.4M  0 rom  
vda    253:0    0  100G  0 disk 
└─vda1 253:1    0  100G  0 part /
vdb    253:16   0  200G  0 disk 
└─vdb1 253:17   0  200G  0 part /data
[root@VM-165-116-centos ~]# 

lsblk can print the ROTA column to show whether a device is an HDD or an SSD: 1 means HDD, 0 means SSD.

root@:~# lsblk -p -o +rota
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT ROTA
/dev/sdf       8:80   0   3.7T  0 disk               1
/dev/nvme0n1 259:1    0   2.9T  0 disk               0
/dev/sdk       8:48   0   3.7T  0 disk /data5        1
/dev/sdb       8:16   0   3.7T  0 disk               1
└─/dev/sdb1    8:17   0     2T  0 part /data/data    1
/dev/nvme1n1 259:0    0   2.9T  0 disk               0
/dev/sda       8:0    0 447.1G  0 disk               0
├─/dev/sda4    8:4    0 406.6G  0 part /data         0
├─/dev/sda2    8:2    0   512M  0 part /boot/efi     0
├─/dev/sda3    8:3    0    20G  0 part /usr/local    0
└─/dev/sda1    8:1    0    20G  0 part /             0
/dev/sdj       8:144  0   3.7T  0 disk               1

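Linux disk names

A disk the kernel talks to over SATA is named sdX (it may be an HDD or an SSD);
a disk it talks to over NVMe gets an nvme name such as nvme0n1 (almost always an SSD).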

Checking a disk's block_size (sector_size)

Disk sectors come in two kinds: physical sectors and logical sectors.
https://nymrli.top/2020/09/06/%E6%89%87%E5%8C%BA%E3%80%81%E5%9D%97-%E7%B0%87/

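Physical sectors vs logical sectors

The basic unit of disk reads and writes is the sector (here the block of both HDD and SSD is called a sector). Because of how a disk stores data, reading or writing just one or a few bytes is impossible; every read or write operates on whole sectors, usually 512 bytes. When file data is read or written through system interfaces, it may look as if a few bytes can be accessed individually, but that only works because the operating system converts the request.

So what are logical and physical sectors? As demand for capacity grew, disk vendors raised recording density by enlarging the sector, which produced disks with 4096-byte sectors (SSDs, for example). Such a sector is called a physical sector. Large sectors cause compatibility problems, however: some [systems or software that talk to the disk directly] (Tencent CDB, for example, talks to the disk directly rather than through a file system) cannot cope with them. To solve this, the disk internally divides each physical sector into several logical pieces and reports them to the system or software as ordinary sectors (usually 512 bytes). These pieces are called logical sectors. On an actual read or write, the program inside the disk (the firmware) converts between logical and physical sectors (for example, to write 512 bytes it first reads the 4K, modifies 512 bytes within that 4K, then writes the 4K back), so the upper layers never see the physical sector. The logical sector is therefore the smallest unit for which the disk accepts read and write commands, and it is the sector that the operating system and applications can address; in most cases it is 512 bytes. When people say sector, they usually mean the logical sector.

Aligning disk operations to the physical sector size reduces this conversion cost and improves performance.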
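To make the alignment point concrete, the sketch below works out which physical sectors a write touches and whether the firmware would have to fall back to read-modify-write. `physical_span` is a hypothetical helper written for this note, and the 4096-byte default is an assumption taken from the example sizes in the text.

```shell
# physical_span OFFSET LENGTH [PHYS_SECTOR_BYTES]
# Prints "first last aligned": the first and last physical sector touched by
# a write of LENGTH bytes at byte OFFSET, and whether the write covers only
# whole physical sectors (yes) or needs a read-modify-write (no).
# PHYS_SECTOR_BYTES defaults to 4096 (assumed, not queried from a device).
physical_span() (
  offset=$1
  length=$2
  phys=${3:-4096}
  first=$((offset / phys))
  last=$(((offset + length - 1) / phys))
  if [ $((offset % phys)) -eq 0 ] && [ $((length % phys)) -eq 0 ]; then
    aligned=yes
  else
    aligned=no
  fi
  echo "$first $last $aligned"
)

physical_span 512 512     # one logical sector inside physical sector 0
physical_span 4096 8192   # two whole physical sectors
```

The first call prints `0 0 no` (a partial update of one physical sector), the second prints `1 2 yes`.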

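Consider how MySQL (CDB) performs I/O. Linux generally offers two I/O modes: buffered I/O and direct I/O. With buffered I/O, the operating system caches the data in the file system's page cache, which usually reduces the number of disk reads and so improves performance. It has an obvious drawback, though: data cannot move directly between the application's address space and the disk, because the file system layer sits in between. Data therefore has to be copied between the application's address space and the page cache in transit, and these copies cost considerable CPU and memory. For MySQL (CDB), bypassing the kernel buffer and transferring data directly between the application's address space and the disk performs better than going through the kernel buffer, because the MySQL (CDB) engine knows the semantics of its own data and uses a more efficient cache replacement algorithm internally.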

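MySQL (CDB) therefore usually sets innodb_flush_method to O_DIRECT so that the operating system does not cache the I/O in the file system, which improves performance. O_DIRECT comes with one restriction: read and write parameters must be aligned to the disk's logical sector. This is easy to understand: with no conversion layer, the disk has to be read and written in whole sectors. If the condition is not met, the call fails with invalid argument (EINVAL), error code 22.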
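A program can check a request against the logical sector size before issuing it. The sketch below is a hypothetical pre-check, `odirect_request_ok`, that assumes a 512-byte logical sector unless told otherwise. It models only the offset and length rule; real O_DIRECT I/O also constrains the memory address of the user buffer, which a shell function cannot express.

```shell
# odirect_request_ok OFFSET LENGTH [LOGICAL_SECTOR_BYTES]
# Returns 0 when both OFFSET and LENGTH are multiples of the logical sector
# size; otherwise prints a message and returns 22, mirroring EINVAL.
# Hypothetical helper: it performs no I/O, it only checks the arithmetic.
odirect_request_ok() (
  offset=$1
  length=$2
  lbs=${3:-512}
  if [ $((offset % lbs)) -ne 0 ] || [ $((length % lbs)) -ne 0 ]; then
    echo "EINVAL (22): offset=$offset length=$length not aligned to $lbs" >&2
    exit 22
  fi
)

odirect_request_ok 16384 16384 && echo "16 KiB request at offset 16384: ok"
odirect_request_ok 100 16384 2>/dev/null || echo "offset 100: rejected with status $?"
```

The second call is rejected with status 22 because the offset is not a multiple of 512.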

The following shows how to check the physical and logical sector sizes of a disk (works for both HDD and SSD).

#### sdd is a HDD
root@:/# cat /sys/block/sdd/queue/logical_block_size
512
root@:/# cat /sys/block/sdd/queue/physical_block_size
512
root@:/# fdisk -l /dev/sdd

Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

root@:/# 
#### sda is a SSD ####
root@:~# cat /sys/block/sda/queue/logical_block_size
512
root@:~# cat /sys/block/sda/queue/physical_block_size
4096
root@:~# fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 480.1 GB, 480103981056 bytes, 937703088 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
...
root@:~#

As shown, cat reads the files named logical_block_size and physical_block_size, which, as their names suggest, describe the disk's blocks. Their values are what fdisk -l shows as Sector size (logical/physical).
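The two sysfs files can be read together to reproduce the "logical/physical" pair that fdisk prints. This is a minimal sketch: `sector_sizes` is a hypothetical helper, and its optional second argument (a sysfs root) exists only so it can be pointed at a copy of the tree.

```shell
# sector_sizes DEVICE [SYSFS_ROOT]
# Prints "logical/physical" in bytes for a block device such as sda.
# SYSFS_ROOT defaults to /sys (the parameter is an assumption of this sketch).
sector_sizes() (
  queue="${2:-/sys}/block/$1/queue"
  if [ ! -r "$queue/logical_block_size" ] || [ ! -r "$queue/physical_block_size" ]; then
    echo "no such block device: $1" >&2
    exit 1
  fi
  logical=$(cat "$queue/logical_block_size")
  physical=$(cat "$queue/physical_block_size")
  echo "$logical/$physical"
)

# Report every disk the kernel exposes (prints nothing if /sys/block is absent).
for dir in /sys/block/*; do
  [ -d "$dir" ] || continue
  dev=${dir##*/}
  echo "$dev: $(sector_sizes "$dev")"
done
```

For the SSD in the transcript above, `sector_sizes sda` would be expected to print `512/4096`.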

block device vs byte device

Block devices are nonvolatile mass storage devices whose information can be accessed in any order (a block device is made of blocks, and blocks can be accessed randomly). Hard disks, floppy disks, and CD-ROMs are examples of block devices. (HDDs, SSDs, files, ...)

Byte devices (raw devices; Linux lists them as character devices) are sequential-access mass storage devices, for example tape devices.

How Linux tells an SSD from an HDD

Since kernel version 2.6.29, you can check sda with:
cat /sys/block/sda/queue/rotational
You should get 1 for a hard disk and 0 for an SSD.
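The same check can be wrapped in a small function. This is a sketch under stated assumptions: `disk_kind` is a hypothetical name, the optional sysfs root argument is only there to allow pointing it at a copy of the tree, and the flag reports what the device or driver advertises, so virtual disks or disks behind a controller may not match the underlying hardware.

```shell
# disk_kind DEVICE [SYSFS_ROOT]
# Prints "hdd" when the rotational flag is 1 and "ssd" when it is 0.
# Hypothetical helper; SYSFS_ROOT defaults to /sys.
disk_kind() (
  flag_file="${2:-/sys}/block/$1/queue/rotational"
  [ -r "$flag_file" ] || { echo "cannot read $flag_file" >&2; exit 1; }
  case $(cat "$flag_file") in
    1) echo hdd ;;
    0) echo ssd ;;
    *) echo unknown ;;
  esac
)

# Example (on a machine that has the device): disk_kind sda
```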

hdparm

hdparm provides a command line interface to various kernel interfaces supported by the Linux SATA/PATA/SAS "libata" subsystem and the older IDE driver subsystem.

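Because block devices allow random access, controlling access to them relies on a relatively complex mechanism, and the kernel has a dedicated subsystem for it.

hdparm talks to the HDD through the kernel's libata subsystem and IDE subsystem in order to read or set disk parameters.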

root@:/# hdparm /dev/sdk

/dev/sdk:
 multcount     =  0 (off)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 486401/255/63, sectors = 7814037168, start = 0
root@:/# 
root@:/# hdparm -W /dev/sdk  # show the disk's write cache setting

/dev/sdk:
 write-caching =  1 (on)
root@:/# 

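Partitioning

Partitioning divides one disk into multiple partitions.
Data in each partition is isolated from the others, as if one large disk had been split into several small ones.

Partition information is recorded in the disk's partition table.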
Block devices can be divided into one or more logical disks called partitions. This division is recorded in the partition table, usually found in sector 0 of the disk.
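For the classic MBR layout, sector 0 is 512 bytes long and ends with the two signature bytes 0x55 0xAA at offsets 510 and 511 (GPT disks carry a protective MBR with the same signature). The sketch below checks for that signature. `has_mbr_signature` is a hypothetical helper, and the demonstration runs against a temporary file standing in for a disk, since reading a real device such as /dev/sdd normally requires root.

```shell
# has_mbr_signature FILE_OR_DEVICE
# Returns 0 if bytes 510-511 of sector 0 are 55 aa, the MBR boot signature.
has_mbr_signature() (
  sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "55aa" ]
)

# Build a 512-byte stand-in for sector 0: 510 zero bytes plus the signature.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1 count=510 2>/dev/null
printf '\125\252' >> "$img"    # octal escapes for 0x55 0xAA
has_mbr_signature "$img" && echo "partition table signature found"
rm -f "$img"
```

The signature only says that sector 0 looks like an MBR; listing the partitions themselves is still a job for fdisk or parted.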

parted

parted is an interactive partitioning command
e.g.
parted /dev/sdd

fdisk (fixed disk)

fdisk - manipulate disk partition table

All partitioning is driven by device I/O limits (the topology) by default. fdisk is able to optimize the disk layout for a 4K-sector size and use an alignment offset on modern devices for MBR and GPT.

root@:/# fdisk -l /dev/sdd

Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

cat /proc/partitions

The /proc/partitions file contains a table with major and minor number of partitioned devices, their number of blocks and the device name in /dev. Verify with /proc/devices to link the major number to the proper device.

paul@RHELv4u4:~$ cat /proc/partitions 
major minor  #blocks  name

   3     0     524288 hda
   3    64     734003 hdb
   8     0    8388608 sda
   8     1     104391 sda1
   8     2    8281507 sda2
   8    16    1048576 sdb
   8    32    1048576 sdc
   8    48    1048576 sdd
 253     0    7176192 dm-0
 253     1    1048576 dm-1

paul@RHELv4u4:~$ cat /proc/devices
Character devices:
  1 mem
  2 pty
  3 ttyp
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  7 vcs
  9 st
 10 misc
 13 input
 21 sg
 29 fb
 86 ch
128 ptm
136 pts
162 raw
180 usb
189 usb_device
202 cpu/msr
203 cpu/cpuid
206 osst
240 ipmidev
241 ptp
242 pps
243 pmcsas
244 gdth
245 megadev_legacy
246 ql2xapidev
247 aac
248 nvme
249 bsg
250 rtc
251 dax
252 dimmctl
253 ndctl
254 tpm

Block devices:
  1 ramdisk
  7 loop
  8 sd
  9 md
 43 nbd
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
128 sd
129 sd
130 sd
131 sd
132 sd
133 sd
134 sd
135 sd
251 device-mapper
252 mtip32xx
253 virtblk
254 mdp
259 blkext

The major number corresponds to the device type (or driver) and can be found in /proc/devices. In this case 3 corresponds to ide and 8 to sd. The major number determines the device driver to be used with this device.

The minor number is a unique identification of an instance of this device type. The devices.txt file in the kernel tree contains a full list of major and minor numbers.
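The lookup from a major number to its driver can be scripted against /proc/devices. A minimal sketch, assuming the two-section layout shown above; `block_driver_for_major` is a hypothetical helper, and its optional second argument lets it read a saved copy of the file instead of /proc/devices.

```shell
# block_driver_for_major MAJOR [DEVICES_FILE]
# Prints the driver name registered for MAJOR in the "Block devices:" section
# and returns 1 if the major number is not listed there.
block_driver_for_major() (
  awk -v major="$1" '
    /^Block devices:/ { in_block = 1; next }   # start of the block section
    /^[A-Za-z]/       { in_block = 0 }         # any other section header
    in_block && $1 == major { print $2; found = 1; exit }
    END { exit (found ? 0 : 1) }
  ' "${2:-/proc/devices}"
)

# Example: block_driver_for_major 8
# For the /proc/devices listing above this would print "sd".
```

Restricting the match to the block section matters because the same major number can name a different driver in the character section (9 is st there but md among the block devices).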

Formatting (writing a file system to a disk or partition)

file system

A filesystem is the set of methods and the data structures for keeping track of files on a disk or partition.
Only with a file system can files be organized in an orderly way on the storage medium.

mkfs: create a file system

mkfs is used to build a Linux filesystem on a device, usually a hard disk partition. The device argument is either the device name (e.g. /dev/hda1, /dev/sdb2), or a regular file that shall contain the filesystem.

mkfs.ext4 -L 'LinuxCool' -b 2048 -F /dev/sdb
This creates an ext4 file system on /dev/sdb with a block size of 2048 bytes and LinuxCool as the label.

How to tell whether a disk has been formatted?

  • Simply try to mount the disk and check the result: mount and echo $?. This method is very crude, though.
  • parted -l. Shows whether each disk is formatted and, if so, the file system type.
  • Use lsblk -no KNAME,FSTYPE $drive or lsblk -f $drive. Recommended.
    If the device has been formatted, a file system exists on it. In that case lsblk -no KNAME,FSTYPE /dev/sdd prints two columns, such as sdd ext4. Otherwise it prints only the name column, such as sdd.
    Here is a quick helper function for testing whether a disk has been formatted:
is_formatted()
{
  # Usage: is_formatted DRIVE FS_TYPE
  # Returns 0 if DRIVE carries FS_TYPE, 1 if it is unformatted or carries
  # another file system, and 2 if an argument is missing.
  local drive=$1
  local fs_type=$2
  local current_fs

  if [[ -z $drive ]]
  then
    echo "[FATAL] is_formatted() was called without specifying a drive. Quitting."
    return 2
  fi
  if [[ -z $fs_type ]]
  then
    echo "[WARN] is_formatted() was called without specifying fs_type."
    return 2
  fi

  # key logic: -d skips partitions so only the drive itself is examined;
  # FSTYPE is empty when lsblk finds no file system on it
  current_fs=$(lsblk -dno FSTYPE "$drive")

  if [[ -z $current_fs ]]
  then
    echo "[INFO] '$drive' is not formatted."
    return 1
  elif [[ "$current_fs" == "$fs_type" ]]
  then
    echo "[INFO] '$drive' is formatted with correct fs type. Moving on."
    return 0
  else
    echo "[WARN] '$drive' is formatted, but with wrong fs type '$current_fs'."
    return 1
  fi
}

Reference: https://unix.stackexchange.com/questions/299715/method-to-test-if-disks-in-system-are-formatted

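Mounting

mount: mount a disk

Once mounted, the disk device can be used by the operating system.
A mount can be temporary or permanent:

  • Effective until shutdown, lost afterwards
    mkdir -p /data5; mount /dev/sdd /data5
  • Written to /etc/fstab, permanent
    echo "/dev/sdd /data5 ext4 defaults 0 0" >> /etc/fstab; mount -a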
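Appending with `>>` adds a duplicate line every time the command is repeated. The sketch below appends the entry only when it is missing. `add_fstab_entry` is a hypothetical helper; it takes the fstab path as an argument so it can be tried on a scratch file before touching /etc/fstab, and it always writes the same `defaults 0 0` options used in the bullet above.

```shell
# add_fstab_entry FSTAB_FILE DEVICE MOUNTPOINT FSTYPE
# Appends "DEVICE MOUNTPOINT FSTYPE defaults 0 0" unless an uncommented line
# with the same device and mount point is already present.
add_fstab_entry() (
  fstab=$1 device=$2 mountpoint=$3 fstype=$4
  if awk -v d="$device" -v m="$mountpoint" \
       '$1 !~ /^#/ && $1 == d && $2 == m { found = 1 } END { exit (found ? 0 : 1) }' \
       "$fstab" 2>/dev/null
  then
    echo "entry for $device on $mountpoint already present" >&2
  else
    echo "$device $mountpoint $fstype defaults 0 0" >> "$fstab"
  fi
)

# Example (as root): add_fstab_entry /etc/fstab /dev/sdd /data5 ext4 && mount -a
```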

Viewing mounts
mount -l
cat /etc/mtab (mtab refers to mount table)

/etc/fstab vs /etc/mtab
/etc/fstab is a list of filesystems to be mounted at boot time. If you want your Windows or file-storage partitions mounted once your computer boots, you'll need to put appropriate entries into /etc/fstab.

/etc/mtab is a list of currently mounted filesystems. If you have a disk connected but not mounted, it won't show up in the /etc/mtab file. Once you mount it, it will show up there.
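Comparing the two files shows which configured file systems are not mounted right now. A minimal sketch: `unmounted_fstab_targets` is a hypothetical helper that relies on the mount point being the second field in both files, and it skips swap-style entries whose mount point is `none` or `swap`.

```shell
# unmounted_fstab_targets FSTAB_FILE MOUNTS_FILE
# Prints every mount point listed in FSTAB_FILE that does not appear as a
# mount point in MOUNTS_FILE (field 2 in both formats).
unmounted_fstab_targets() (
  awk '
    FILENAME == ARGV[1] { mounted[$2] = 1; next }   # first file: current mounts
    $1 ~ /^#/ || NF < 2 { next }                    # skip comments and blanks
    $2 == "none" || $2 == "swap" { next }           # skip swap entries
    !($2 in mounted) { print $2 }
  ' "$2" "$1"
)

# Example: unmounted_fstab_targets /etc/fstab /etc/mtab
```

An empty result means everything listed in fstab is currently mounted.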

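What happens when two disks are mounted on the same directory? The later mount hides the earlier one.
# What happens when two disks are mounted on the same directory? The later mount hides the earlier one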
[root@TENCENT64 ~]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
...
LABEL=disktest1 /data/test1 ext4 defaults 0 0
LABEL=disktest2 /data/test1 ext4 defaults 0 0
[root@TENCENT64 ~]# mount -a
[root@TENCENT64 ~]# findmnt
TARGET                           SOURCE       FSTYPE      OPTIONS
...
├─/data                          /dev/sda4    ext4        rw,noatime,data=ordered,barrier
│ └─/data/test1                  /dev/sdd     ext4        rw,relatime,data=ordered,barrier
│   └─/data/test1                /dev/sdf     ext4        rw,relatime,data=ordered,barrier
...
[root@TENCENT64 ~]# 
[root@TENCENT64 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/nvme1n1    2.9T   89M  2.8T   1% /data12
/dev/sdf        3.6T   89M  3.4T   1% /data/test1

# Now mount /dev/sdd on /data/test1 again
[root@TENCENT64 ~]# mount /dev/sdd /data/test1
[root@TENCENT64 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/nvme1n1    2.9T   89M  2.8T   1% /data12
/dev/sdd        3.6T   89M  3.4T   1% /data/test1
[root@TENCENT64 ~]# findmnt
TARGET                           SOURCE       FSTYPE      OPTIONS
...
├─/data                          /dev/sda4    ext4        rw,noatime,data=ordered,barrier
│ └─/data/test1                  /dev/sdd     ext4        rw,relatime,data=ordered,barrier
│   └─/data/test1                /dev/sdf     ext4        rw,relatime,data=ordered,barrier
│     └─/data/test1              /dev/sdd     ext4        rw,relatime,data=ordered,barrier
[root@TENCENT64 ~]# 
What happens when the same disk is mounted on two directories in turn?
# The same disk mounted on two directories in turn
[root@TENCENT64 ~]# cat /etc/fstab
...
LABEL=disktest1 /data/test1 ext4 defaults 0 0
LABEL=disktest1 /data/test2 ext4 defaults 0 0
[root@TENCENT64 ~]# findmnt
TARGET                           SOURCE       FSTYPE      OPTIONS
...
├─/data                          /dev/sda4    ext4        rw,noatime,data=ordered,barrier
│ ├─/data/test1                  /dev/sdd     ext4        rw,relatime,data=ordered,barrier
│ └─/data/test2                  /dev/sdd     ext4        rw,relatime,data=ordered,barrier
[root@TENCENT64 ~]# 
[root@TENCENT64 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdd        3.6T   89M  3.4T   1% /data/test1
[root@TENCENT64 ~]# ### shows /dev/sdd mounted on /data/test1 ###
[root@TENCENT64 ~]# umount /data/test1
[root@TENCENT64 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdd        3.6T   89M  3.4T   1% /data/test2
[root@TENCENT64 ~]# ### after umount /data/test1, shows /dev/sdd mounted on /data/test2 ###

So, is the mount the OS reports the result of a depth-first traversal of the findmnt tree?
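One reading that fits the first experiment is that mounts on the same directory stack, with the most recent one on top and visible. The sketch below models that reading only; it is not how df is implemented. `visible_source` is a hypothetical helper, and it assumes the mounts file lists entries in the order they were mounted.

```shell
# visible_source MOUNTS_FILE TARGET
# Prints the source of the last entry mounted on TARGET, that is, the top of
# the stack for that mount point. Returns 1 if nothing is mounted there.
# Assumes MOUNTS_FILE lists mounts in mount order (an assumption of this sketch).
visible_source() (
  awk -v t="$2" '$2 == t { src = $1 } END { if (src == "") exit 1; print src }' "$1"
)

# Replay the first experiment against a scratch mounts file.
mounts=$(mktemp)
printf '%s\n' '/dev/sdd /data/test1 ext4 rw 0 0' \
              '/dev/sdf /data/test1 ext4 rw 0 0' > "$mounts"
visible_source "$mounts" /data/test1    # /dev/sdf, as in the first df output
echo '/dev/sdd /data/test1 ext4 rw 0 0' >> "$mounts"
visible_source "$mounts" /data/test1    # /dev/sdd, as in the second df output
rm -f "$mounts"
```

Under the same model, umount removes the topmost entry for a mount point, which uncovers the one beneath it.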

umount: unmount

The umount command detaches the mentioned file system(s) from the file hierarchy. A file system is specified by giving the directory where it has been mounted.

umount [-f -l] /data/tmp

df (disk free)

df is a file-system-level command. It shows the partitions that are [already mounted] and their file system types.
df - report file system disk space usage

root@:/# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs   63G  4.0K   63G   1% /dev
tmpfs          tmpfs      63G   20K   63G   1% /dev/shm
tmpfs          tmpfs      63G  283M   63G   1% /run
tmpfs          tmpfs      63G     0   63G   0% /sys/fs/cgroup
/dev/sda1      ext4       20G  9.3G  9.3G  50% /
/dev/sda3      ext4       20G  1.6G   17G   9% /usr/local
/dev/sda2      vfat      511M  4.5M  507M   1% /boot/efi
/dev/sda4      xfs       407G   33M  407G   1% /data
/dev/sdb1      ext4      2.0T  3.9G  1.9T   1% /data/data
/dev/sdd       ext4      3.6T   89M  3.4T   1% /data5
root@:/# 

du vs df

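du measures disk space by file. It takes the size of every file the file system records and adds them up; the sizes are obtained through the file system, and the total can span file systems.
df measures disk space by disk. It reads usage information from the superblock, so what df reports is how the disk blocks are being used.
Typically, df -lh is used to check the total and used space of each disk with a mounted file system, and du -sh [directory] to total the space taken by all files under a directory.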

[root@TEN boot]# du -sh /boot
634M    /boot
[root@TEN boot]# du --exclude=efi -sh /boot
625M    /boot
[root@TEN boot]# du -sh /boot/efi 
9.8M    /boot/efi
[root@TEN boot]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        251G     0  251G   0% /dev
tmpfs           251G     0  251G   0% /dev/shm
tmpfs           251G   60M  251G   1% /run
tmpfs           251G     0  251G   0% /sys/fs/cgroup
/dev/nvme0n1p1   20G   12G  6.7G  65% /
/dev/nvme0n1p2  511M  9.8M  502M   2% /boot/efi
/dev/nvme0n1p4  841G  894M  797G   1% /data
/dev/nvme0n1p3   20G  772M   18G   5% /usr/local
tmpfs            51G     0   51G   0% /run/user/0
[root@TEN boot]#

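In this example, the /boot directory uses disk space on /dev/nvme0n1p1, while /boot/efi uses disk space on /dev/nvme0n1p2.
Yet when du totals the disk usage, du -sh /boot = 634M ≈ 625M + 9.8M.
Clearly the usage reported for /boot includes space on both /dev/nvme0n1p1 and /dev/nvme0n1p2.
So du counts by file and can span file systems.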
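To see which file system actually backs a directory, df can be asked about a single path. A minimal sketch: `fs_of` is a hypothetical helper that takes the first column of POSIX-format df output, and it assumes the file system name contains no spaces.

```shell
# fs_of PATH
# Prints the device (or pseudo file system) that holds PATH, taken from the
# first column of "df -P" output.
fs_of() (
  df -P "$1" | awk 'NR == 2 { print $1 }'
)

fs_of /    # the file system that holds the root directory
```

On the machine in the example, `fs_of /boot` and `fs_of /boot/efi` would be expected to name different partitions. When the total should stay within one file system, `du -x` (`--one-file-system`) skips directories that live on other file systems.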

