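Preface
The word "block" means different things in different contexts.
For example, the "block" in "block device" and the "block" at the file system level are not the same concept.
The two are related as follows: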
In general, for a magnetic disk, a sector is the smallest unit of information that can be read or written (even to read or write one or two bytes, the whole block has to be read or written). Sector sizes are typically 512 bytes. For an SSD, the smallest unit is often called a page, whose size is commonly 4096 bytes. Here, both sector and page are physical units, and each represents the disk block.
However, block in some contexts may refer to the logical unit of storage allocation and retrieval used by file systems or database systems, and block sizes today typically range from 4 to 16 kilobytes.
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size.
The operating system creates an abstraction called a file system, which has its own block size that is a multiple of the disk block size. As with the disk, the operating system reads and writes data in units of the file system block size. A single read or write of one file system block is carried out as multiple disk block operations.
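To make the relationship concrete, here is a minimal sketch of the arithmetic, assuming the file system block is a whole multiple of the disk block as described above. The function name disk_ops_per_fs_block is made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many disk-block operations one file-system-block
# operation turns into. Both sizes are given in bytes.
disk_ops_per_fs_block() {
    local fs_block=$1 disk_block=$2
    # The file system block must be a positive whole multiple of the disk block.
    if (( fs_block <= 0 || disk_block <= 0 || fs_block % disk_block != 0 )); then
        echo "fs block size must be a positive multiple of the disk block size" >&2
        return 1
    fi
    echo $(( fs_block / disk_block ))
}
```

For example, disk_ops_per_fs_block 4096 512 prints 8: one 4 KiB file system block covers eight 512-byte sectors.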
Article structure
- block device
- Partitioning
- Formatting
- Mounting
block device
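Here, "block" refers to small, fixed-size units of storage. A block device is therefore a storage device made up of a large number of these units.
Common block devices include HDDs and SSDs:
- hdd: hard disk drive. A traditional spinning magnetic disk (usually talks to the operating system over SATA)
- ssd: solid state drive (usually talks to the operating system over SATA or NVMe)
For an HDD, the sector is the disk block.
For an SSD, the page is the disk block.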
Using lsblk to list all block devices recognized by the kernel
On Linux, lsblk lists the block devices recognized by the kernel.
The command prints all block devices (except RAM disks) in a tree-like format by default.
[root@VM-165-116-centos ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 16.4M 0 rom
vda 253:0 0 100G 0 disk
└─vda1 253:1 0 100G 0 part /
vdb 253:16 0 200G 0 disk
└─vdb1 253:17 0 200G 0 part /data
[root@VM-165-116-centos ~]#
lsblk can show whether a device is an HDD or an SSD through the ROTA column: 1 means HDD (rotational), 0 means SSD.
root@:~# lsblk -p -o +rota
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT ROTA
/dev/sdf 8:80 0 3.7T 0 disk 1
/dev/nvme0n1 259:1 0 2.9T 0 disk 0
/dev/sdk 8:48 0 3.7T 0 disk /data5 1
/dev/sdb 8:16 0 3.7T 0 disk 1
└─/dev/sdb1 8:17 0 2T 0 part /data/data 1
/dev/nvme1n1 259:0 0 2.9T 0 disk 0
/dev/sda 8:0 0 447.1G 0 disk 0
├─/dev/sda4 8:4 0 406.6G 0 part /data 0
├─/dev/sda2 8:2 0 512M 0 part /boot/efi 0
├─/dev/sda3 8:3 0 20G 0 part /usr/local 0
└─/dev/sda1 8:1 0 20G 0 part / 0
/dev/sdj 8:144 0 3.7T 0 disk 1
Linux disk names
Disks that the kernel talks to over SATA are named sdX (they can be HDDs or SSDs);
disks attached over NVMe are named nvmeXnY (almost always SSDs).
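The naming rule can be sketched as a simple pattern match on the device name. This is only an illustration: the function name disk_family_from_name and the labels it prints are made up here, and other name families (such as the vdX virtio disks in the lsblk output earlier) are reported as unknown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the device family from a Linux disk name.
disk_family_from_name() {
    local name=${1##*/}            # drop a leading /dev/ if present
    case $name in
        nvme[0-9]*n[0-9]*) echo nvme ;;     # e.g. nvme0n1, nvme0n1p2
        sd[a-z]*)          echo sd ;;       # e.g. sda, sdb1
        *)                 echo unknown ;;  # anything else, e.g. vda
    esac
}
```

For example, disk_family_from_name /dev/nvme0n1 prints nvme and disk_family_from_name /dev/sdk prints sd.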
Checking a disk's block_size (sector_size)
Disk sectors come in two kinds: physical sectors and logical sectors.
https://nymrli.top/2020/09/06/%E6%89%87%E5%8C%BA%E3%80%81%E5%9D%97-%E7%B0%87/
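Physical sectors vs logical sectors
The basic unit of reading and writing on a disk is the sector (here the block of both HDDs and SSDs is called a sector). Because of how data is laid out on the disk, it is impossible to read or write just one or a few bytes; every read or write is done in whole sectors, typically 512 bytes. When you read or write file data through the interfaces the system provides, it looks as if a few bytes can be accessed on their own, but that only works because the operating system does the conversion.
So what are logical and physical sectors? As demand for capacity grew, disk vendors increased the sector size to raise recording density, which produced disks with 4096-byte sectors (SSDs, for example). These are called physical sectors. Such large sectors cause compatibility problems: some systems or software that talk to the disk directly (Tencent CDB, for example, talks to the disk directly instead of going through the file system) cannot cope with them. To solve this, the disk internally divides each physical sector into several fragments and reports them to the system or software as ordinary sectors (usually 512 bytes). These fragments are called logical sectors. On an actual read or write, the disk's firmware translates between logical and physical sectors (for example, to write 512 bytes it first reads the 4K sector, changes 512 bytes within it, and writes the 4K back), so the upper layers never see the physical sectors. The logical sector is therefore the smallest unit for which the disk accepts read and write commands, and it is the sector that the operating system and applications can address; in most cases it is 512 bytes. When we say "sector" we usually mean the logical sector.
Aligning disk operations to the physical sector size reduces this translation cost and improves performance.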
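The cost of a misaligned write can be estimated with a little arithmetic. The sketch below is an illustration only, assuming a 4096-byte physical sector by default; the function names are made up. It counts how many physical sectors a write touches and whether the firmware would need the read-modify-write cycle described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of physical sectors covered by a write of
# <length> bytes starting at byte <offset>.
physical_sectors_touched() {
    local offset=$1 length=$2 phys=${3:-4096}
    (( offset >= 0 && length > 0 && phys > 0 )) || return 1
    local first=$(( offset / phys ))
    local last=$(( (offset + length - 1) / phys ))
    echo $(( last - first + 1 ))
}

# Hypothetical helper: succeeds (exit 0) when the write only partially covers
# some physical sector, so the old contents have to be read first.
needs_read_modify_write() {
    local offset=$1 length=$2 phys=${3:-4096}
    (( offset % phys != 0 || length % phys != 0 ))
}
```

An aligned 4096-byte write at offset 0 touches one physical sector and needs no read. The same write at offset 512 touches two physical sectors, both only partially, so both have to be read, modified, and written back.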
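Consider how MySQL (CDB) does I/O. Linux generally offers two I/O modes: buffered I/O and direct I/O. With buffered I/O, the operating system caches I/O data in the file system's page cache, which usually reduces the number of disk reads and so improves performance. It has an obvious drawback, though: data cannot be transferred directly between the application's address space and the disk, because the page cache sits in between. The data has to be copied between the application's address space and the page cache, and the CPU and memory overhead of these copies is significant. For MySQL (CDB), bypassing the kernel buffers and transferring data directly between the application's address space and the disk performs better than going through them, because the MySQL (CDB) engine knows the semantics of its data and uses a more efficient cache replacement algorithm internally.
MySQL (CDB) therefore usually sets innodb_flush_method to O_DIRECT so that I/O is not cached by the operating system's page cache, which improves performance. O_DIRECT has one restriction: read and write parameters must be aligned to the disk's logical sector size. This makes sense: with no translation layer in between, the disk has to be read and written in whole sectors. If the condition is not met, the call fails with "Invalid argument" (EINVAL, errno 22).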
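The alignment rule can be modelled as a pre-flight check. This is a sketch under assumptions, not how the kernel implements it: it only looks at the file offset and the transfer length (alignment of the memory buffer is not modelled), it assumes a 512-byte logical sector by default, and the function name is made up. It returns 22 on failure to mirror EINVAL.

```shell
#!/usr/bin/env bash
EINVAL=22   # errno value quoted in the text above

# Hypothetical helper: succeed only if offset and length are multiples of the
# logical sector size, otherwise return 22 like a failed O_DIRECT call.
check_direct_io_alignment() {
    local offset=$1 length=$2 logical=${3:-512}
    (( logical > 0 )) || return "$EINVAL"
    if (( offset % logical != 0 || length % logical != 0 )); then
        echo "offset $offset / length $length not aligned to $logical bytes" >&2
        return "$EINVAL"
    fi
    return 0
}
```

With a 512-byte logical sector, check_direct_io_alignment 4096 16384 succeeds, while check_direct_io_alignment 100 512 fails with status 22.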
The following shows how to check the physical and logical sector sizes of a disk (the same method works for HDDs and SSDs).
#### sdd is a HDD
root@:/# cat /sys/block/sdd/queue/logical_block_size
512
root@:/# cat /sys/block/sdd/queue/physical_block_size
512
root@:/# fdisk -l /dev/sdd
Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@:/#
#### sda is a SSD ####
root@:~# cat /sys/block/sda/queue/logical_block_size
512
root@:~# cat /sys/block/sda/queue/physical_block_size
4096
root@:~# fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
Disk /dev/sda: 480.1 GB, 480103981056 bytes, 937703088 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
...
root@:~#
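As shown above, cat is reading files named logical_block_size and physical_block_size which, as the names suggest, describe the disk's block sizes. Their values are the same as the Sector size (logical/physical) shown by fdisk -l.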
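The two cat commands can be wrapped in a small helper. This is a minimal sketch: the function name block_sizes and the optional second argument (an alternative sysfs root, handy for trying the function against a scratch directory) are assumptions made for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<logical>/<physical>" sector sizes in bytes for
# a disk name such as sda, read from <sysfs root>/<disk>/queue/.
block_sizes() {
    local dev=$1 sysfs=${2:-/sys/block}
    local queue="$sysfs/$dev/queue"
    if [[ ! -r $queue/logical_block_size || ! -r $queue/physical_block_size ]]; then
        echo "no block size information for '$dev' under $sysfs" >&2
        return 1
    fi
    # $(<file) reads the file and strips the trailing newline.
    echo "$(<"$queue/logical_block_size")/$(<"$queue/physical_block_size")"
}
```

On the SSD in the session above, block_sizes sda would print 512/4096, matching the Sector size line from fdisk -l.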
block device vs byte device
Block devices are nonvolatile mass storage devices whose information can be accessed in any order (because a block device is made up of blocks, and blocks can be accessed randomly). Hard disks, floppy disks, and CD-ROMs are examples of block devices. (HDDs, SSDs, files, ...)
Byte devices (raw devices) are sequential-access mass storage devices, for example tape devices.
How to tell an SSD from an HDD on Linux
Since kernel version 2.6.29, you can check sda with:
cat /sys/block/sda/queue/rotational
You should get 1 for a hard disk and 0 for an SSD.
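A small wrapper around that file might look like the sketch below. The function name disk_kind and the optional sysfs root argument are assumptions for illustration. Note that the flag is whatever the device reports to the kernel, so virtual disks may show 1 even when the backing storage is flash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the rotational flag of a disk (e.g. sda) to a label.
disk_kind() {
    local dev=$1 sysfs=${2:-/sys/block}
    local flag_file="$sysfs/$dev/queue/rotational"
    [[ -r $flag_file ]] || { echo "cannot read $flag_file" >&2; return 1; }
    case $(<"$flag_file") in
        1) echo hdd ;;      # rotational media
        0) echo ssd ;;      # non-rotational media
        *) echo unknown ;;
    esac
}
```

Calling disk_kind sda on the machine from the lsblk example earlier would be expected to print ssd, since that device shows ROTA 0.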
hdparm
hdparm provides a command line interface to various kernel interfaces supported by the Linux SATA/PATA/SAS "libata" subsystem and the older IDE driver subsystem.
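Because block devices can be accessed randomly, controlling access to them relies on a fairly complex mechanism, and the kernel has a dedicated subsystem for it.
hdparm talks to the disk through the kernel's libata subsystem and IDE subsystem in order to read or set disk parameters.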
root@:/# hdparm /dev/sdk
/dev/sdk:
multcount = 0 (off)
readonly = 0 (off)
readahead = 256 (on)
geometry = 486401/255/63, sectors = 7814037168, start = 0
root@:/#
root@:/# hdparm -W /dev/sdk # show the disk's write cache setting
/dev/sdk:
write-caching = 1 (on)
root@:/#
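Partitioning
Partitioning divides one disk into multiple partitions.
Data in one partition does not interfere with data in another, as if one large disk had been split into several smaller ones.
Partition information is recorded in the disk's partition table.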
Block devices can be divided into one or more logical disks called partitions. This division is recorded in the partition table, usually found in sector 0 of the disk.
parted
parted is an interactive partitioning command
e.g.
parted /dev/sdd
fdisk (format disk)
fdisk - manipulate disk partition table
All partitioning is driven by device I/O limits (the topology) by default. fdisk is able to optimize the disk layout for a 4K-sector size and use an alignment offset on modern devices for MBR and GPT.
root@:/# fdisk -l /dev/sdd
Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
cat /proc/partitions
The /proc/partitions file contains a table with the major and minor numbers of partitioned devices, their number of blocks, and the device name in /dev. Check /proc/devices to link the major number to the proper device.
paul@RHELv4u4:~$ cat /proc/partitions
major minor #blocks name
3 0 524288 hda
3 64 734003 hdb
8 0 8388608 sda
8 1 104391 sda1
8 2 8281507 sda2
8 16 1048576 sdb
8 32 1048576 sdc
8 48 1048576 sdd
253 0 7176192 dm-0
253 1 1048576 dm-1
paul@RHELv4u4:~$ cat /proc/devices
Character devices:
1 mem
2 pty
3 ttyp
4 /dev/vc/0
4 tty
4 ttyS
5 /dev/tty
5 /dev/console
5 /dev/ptmx
7 vcs
9 st
10 misc
13 input
21 sg
29 fb
86 ch
128 ptm
136 pts
162 raw
180 usb
189 usb_device
202 cpu/msr
203 cpu/cpuid
206 osst
240 ipmidev
241 ptp
242 pps
243 pmcsas
244 gdth
245 megadev_legacy
246 ql2xapidev
247 aac
248 nvme
249 bsg
250 rtc
251 dax
252 dimmctl
253 ndctl
254 tpm
Block devices:
1 ramdisk
7 loop
8 sd
9 md
43 nbd
65 sd
66 sd
67 sd
68 sd
69 sd
70 sd
71 sd
128 sd
129 sd
130 sd
131 sd
132 sd
133 sd
134 sd
135 sd
251 device-mapper
252 mtip32xx
253 virtblk
254 mdp
259 blkext
The major number corresponds to the device type (or driver) and can be found in /proc/devices. In this case 3 corresponds to ide and 8 to sd. The major number determines the device driver to be used with this device.
The minor number is a unique identification of an instance of this device type. The devices.txt file in the kernel tree contains a full list of major and minor numbers.
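The lookup from a major number to its driver can be sketched as follows, assuming the two-section layout of /proc/devices shown above. The function name block_driver_for_major and the optional file argument are made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the block driver registered for a major number,
# reading a file laid out like /proc/devices.
block_driver_for_major() {
    local major=$1 devices_file=${2:-/proc/devices}
    awk -v want="$major" '
        /^Block devices:/       { in_block = 1; next }   # start of block section
        /^Character devices:/   { in_block = 0; next }   # ignore character section
        in_block && $1 == want  { print $2; found = 1; exit }
        END                     { exit !found }          # non-zero if no match
    ' "$devices_file"
}
```

With the listing above, major 8 resolves to sd and major 253 to virtblk. Only the block section is searched, because the same number can name an unrelated driver in the character section (253 is ndctl there).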
Formatting (writing a file system to a disk or partition)
file system
A filesystem is the set of methods and the data structures for keeping track of files on a disk or partition.
Only with a file system can files be organized in an orderly way on the storage medium.
mkfs: creating a file system
mkfs is used to build a Linux filesystem on a device, usually a hard disk partition. The device argument is either the device name (e.g. /dev/hda1, /dev/sdb2), or a regular file that shall contain the filesystem.
mkfs.ext4 -L 'LinuxCool' -b 2048 -F /dev/sdb
This creates an ext4 file system on /dev/sdb with a block size of 2048 bytes and the label LinuxCool.
How to tell whether a disk has already been formatted?
- Just try to mount the disk: run mount and then echo $?. This is a rather crude method, though.
- parted -l shows, for all disks, whether they are formatted and with which file system type.
- Use lsblk -no KNAME,FSTYPE $drive or lsblk -f $drive. Recommended.
If the device has been formatted, there is a file system on it. In that case lsblk -no KNAME,FSTYPE /dev/sdd prints two columns, such as sdd ext4. Otherwise it prints only the name column, such as sdd.
And here is a fast helper function for testing if disk has been formatted:
is_formatted()
{
    # Two args: the drive to test and the expected file system type.
    # Returns 0 if the drive carries fs_type, 1 if it is unformatted or has
    # another type, 2 if the function was called incorrectly.
    local drive=$1
    local fs_type=$2
    local current_fs
    if [[ -z $drive ]]
    then
        echo "[FATAL] is_formatted() was called without specifying a drive. Quitting."
        return 2
    fi
    if [[ -z $fs_type ]]
    then
        echo "[WARN] is_formatted() was called without specifying fs_type."
        return 2
    fi
    # key logic: -d limits the output to the drive itself (no partitions),
    # so FSTYPE is empty when the drive has no file system
    current_fs=$(lsblk -dno FSTYPE "$drive")
    if [[ -z $current_fs ]]
    then
        echo "[INFO] '$drive' is not formatted."
        return 1
    fi
    if [[ "$current_fs" == "$fs_type" ]]
    then
        echo "[INFO] '$drive' is formatted with correct fs type. Moving on."
        return 0
    fi
    echo "[WARN] '$drive' is formatted, but with wrong fs type '$current_fs'."
    return 1
}
Reference: https://unix.stackexchange.com/questions/299715/method-to-test-if-disks-in-system-are-formatted
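Mounting
mount: mounting a disk
Once a disk is mounted, the operating system can use it.
A mount can be temporary or permanent:
- Effective until the machine is shut down, lost afterwards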
mkdir -p /data5; mount /dev/sdd /data5
- Written to /etc/fstab, permanent
echo "/dev/sdd /data5 ext4 defaults 0 0" >> /etc/fstab; mount -a
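Appending with echo adds a second line every time the command is run again. Here is a minimal sketch of a guarded version, assuming the same "defaults 0 0" options as above; the function name add_fstab_entry and the optional file argument are made up for this illustration. It refuses to add a line when the mount point is already listed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "<device> <mountpoint> <fstype> defaults 0 0"
# to an fstab file unless the mount point already has an entry.
add_fstab_entry() {
    local device=$1 mountpoint=$2 fstype=$3 fstab=${4:-/etc/fstab}
    # Field 2 of every non-comment line is the mount point.
    if awk -v mp="$mountpoint" \
           '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' \
           "$fstab" 2>/dev/null; then
        echo "mount point $mountpoint already present in $fstab" >&2
        return 1
    fi
    printf '%s %s %s defaults 0 0\n' "$device" "$mountpoint" "$fstype" >> "$fstab"
}
```

A call such as add_fstab_entry /dev/sdd /data5 ext4 followed by mount -a would replace the one-liner above. Names like /dev/sdd can change between boots, so a LABEL= or UUID= value in the first field is usually more stable.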
Viewing mounts
mount -l
cat /etc/mtab
(mtab refers to mount table)
/etc/fstab vs /etc/mtab
/etc/fstab is a list of filesystems to be mounted at boot time. If you want your Windows or file-storage partitions mounted once your computer boots, you'll need to put appropriate entries into /etc/fstab.
/etc/mtab is a list of currently mounted filesystems. If you have a disk connected but not mounted, it won't show up in the /etc/mtab file. Once you mount it, it will show up there.
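Since /etc/mtab lists what is mounted right now, a script can use it to test a mount point. The sketch below is an illustration only: the function name is_mounted and the optional file argument are made up, and mount points containing spaces (which the table stores in escaped form) are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if <mountpoint> appears as the second field of
# any line in an mtab-style file.
is_mounted() {
    local mountpoint=$1 mtab=${2:-/etc/mtab}
    awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' "$mtab"
}
```

For example, is_mounted /data5 && echo "already mounted" avoids mounting the same directory twice. On many current distributions /etc/mtab is a symlink to /proc/self/mounts, so the kernel's view is what gets read.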
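What happens when two disks are mounted on the same directory? The later mount shadows the earlier one.
# Two disks mounted on the same directory: the later mount shadows the earlier one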
[root@TENCENT64 ~]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
...
LABEL=disktest1 /data/test1 ext4 defaults 0 0
LABEL=disktest2 /data/test1 ext4 defaults 0 0
[root@TENCENT64 ~]# mount -a
[root@TENCENT64 ~]# findmnt
TARGET SOURCE FSTYPE OPTIONS
...
├─/data /dev/sda4 ext4 rw,noatime,data=ordered,barrier
│ └─/data/test1 /dev/sdd ext4 rw,relatime,data=ordered,barrier
│ └─/data/test1 /dev/sdf ext4 rw,relatime,data=ordered,barrier
...
[root@TENCENT64 ~]#
[root@TENCENT64 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/nvme1n1 2.9T 89M 2.8T 1% /data12
/dev/sdf 3.6T 89M 3.4T 1% /data/test1
# Now mount /dev/sdd on /data/test1 again
[root@TENCENT64 ~]# mount /dev/sdd /data/test1
[root@TENCENT64 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/nvme1n1 2.9T 89M 2.8T 1% /data12
/dev/sdd 3.6T 89M 3.4T 1% /data/test1
[root@TENCENT64 ~]# findmnt
TARGET SOURCE FSTYPE OPTIONS
...
├─/data /dev/sda4 ext4 rw,noatime,data=ordered,barrier
│ └─/data/test1 /dev/sdd ext4 rw,relatime,data=ordered,barrier
│ └─/data/test1 /dev/sdf ext4 rw,relatime,data=ordered,barrier
│ └─/data/test1 /dev/sdd ext4 rw,relatime,data=ordered,barrier
[root@TENCENT64 ~]#
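The stacking seen in findmnt can also be checked from a script. The sketch below rests on an assumption that matches the session above but is not verified beyond it: the mount table lists mounts in the order they were made, so the last entry for a mount point is the one currently visible. The function name visible_source and the optional file argument are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the source device of the last (assumed topmost)
# mount on <mountpoint>, reading a file laid out like /proc/self/mounts.
visible_source() {
    local mountpoint=$1 mounts=${2:-/proc/self/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp { src = $1 }                      # later lines overwrite earlier ones
        END      { if (src == "") exit 1; print src }
    ' "$mounts"
}
```

After the three mounts shown above, visible_source /data/test1 would be expected to print /dev/sdd, which agrees with what df -h reports.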
What happens when the same disk is mounted on two directories in turn?
# The same disk mounted on two directories in turn
[root@TENCENT64 ~]# cat /etc/fstab
...
LABEL=disktest1 /data/test1 ext4 defaults 0 0
LABEL=disktest1 /data/test2 ext4 defaults 0 0
[root@TENCENT64 ~]# findmnt
TARGET SOURCE FSTYPE OPTIONS
...
├─/data /dev/sda4 ext4 rw,noatime,data=ordered,barrier
│ ├─/data/test1 /dev/sdd ext4 rw,relatime,data=ordered,barrier
│ └─/data/test2 /dev/sdd ext4 rw,relatime,data=ordered,barrier
[root@TENCENT64 ~]#
[root@TENCENT64 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/sdd 3.6T 89M 3.4T 1% /data/test1
[root@TENCENT64 ~]# ### shows /dev/sdd mounted on /data/test1 ###
[root@TENCENT64 ~]# umount /data/test1
[root@TENCENT64 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/sdd 3.6T 89M 3.4T 1% /data/test2
[root@TENCENT64 ~]# ### after umount /data/test1, shows /dev/sdd mounted on /data/test2 ###
So, is the mount that the OS reports the result of a depth-first traversal of the findmnt directory tree?
umount: unmounting
The umount command detaches the mentioned file system(s) from the file hierarchy. A file system is specified by giving the directory where it has been mounted.
umount [-f -l] /data/tmp
df (disk free)
df is a file-system-level command; it shows the partitions that are currently mounted and their file system types.
df - report file system disk space usage
root@:/# df -Th
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 63G 4.0K 63G 1% /dev
tmpfs tmpfs 63G 20K 63G 1% /dev/shm
tmpfs tmpfs 63G 283M 63G 1% /run
tmpfs tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda1 ext4 20G 9.3G 9.3G 50% /
/dev/sda3 ext4 20G 1.6G 17G 9% /usr/local
/dev/sda2 vfat 511M 4.5M 507M 1% /boot/efi
/dev/sda4 xfs 407G 33M 407G 1% /data
/dev/sdb1 ext4 2.0T 3.9G 1.9T 1% /data/data
/dev/sdd ext4 3.6T 89M 3.4T 1% /data5
root@:/#
du vs df
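du counts disk space per file. It takes the size of every file recorded by the file system and adds them up; the sizes are obtained through the file system, and the count can span file systems.
df counts disk space per disk (that is, per mounted file system). It reads usage information from the superblock, so what df reports is how the disk blocks are used.
Typically, df -lh is used to check the total and used space of every disk with a mounted file system, and du -sh [directory] is used to total the space taken by all files under a directory.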
[root@TEN boot]# du -sh /boot
634M /boot
[root@TEN boot]# du --exclude=efi -sh /boot
625M /boot
[root@TEN boot]# du -sh /boot/efi
9.8M /boot/efi
[root@TEN boot]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 251G 0 251G 0% /dev
tmpfs 251G 0 251G 0% /dev/shm
tmpfs 251G 60M 251G 1% /run
tmpfs 251G 0 251G 0% /sys/fs/cgroup
/dev/nvme0n1p1 20G 12G 6.7G 65% /
/dev/nvme0n1p2 511M 9.8M 502M 2% /boot/efi
/dev/nvme0n1p4 841G 894M 797G 1% /data
/dev/nvme0n1p3 20G 772M 18G 5% /usr/local
tmpfs 51G 0 51G 0% /run/user/0
[root@TEN boot]#
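In this example, the /boot directory uses disk space on /dev/nvme0n1p1, while /boot/efi uses disk space on /dev/nvme0n1p2.
But when du totals the disk usage of files, du -sh /boot = 634M ≈ 625M + 9.8M.
Clearly, the figure for /boot includes space on both /dev/nvme0n1p1 and /dev/nvme0n1p2.
du is therefore file-based and can span file systems.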
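When the total should cover only one file system, du can be told not to cross mount points with -x. A minimal sketch follows; the two wrapper names are made up for this illustration, and sizes are printed in KiB.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total size of a directory in KiB, following mount
# points below it (the behaviour seen in the example above).
du_all_fs() {
    du -sk -- "$1" | awk '{ print $1 }'
}

# Hypothetical helper: same total, but -x skips anything that lives on a
# different file system than the starting directory.
du_one_fs() {
    du -sxk -- "$1" | awk '{ print $1 }'
}
```

On the machine above, du_one_fs /boot would leave out /boot/efi because it sits on /dev/nvme0n1p2, which is the same effect the --exclude=efi run achieved by hand. For a directory with no mount points underneath, both helpers print the same number.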