Managing Linux Filesystems
Rank and File
Imagine a filesystem as a library that stores data efficiently and in a structured way. Without filesystems, persistent data would not be possible. Virtually every Linux system has at least one block-based filesystem (e.g., ext4, XFS, Btrfs). Block-based means that an underlying physical data store is involved, such as a hard drive, solid-state drive (SSD), or SD card. Linux has a number of filesystems from which to choose, and the ext2/3/4 series is likely known by everyone. If you work with a current distribution, you have probably met other filesystems, too (Table 1).
Table 1
Standard Filesystems
Distribution | Filesystem |
---|---|
Debian (from v7.0 wheezy) | ext4 |
Ubuntu (from v9.04) | ext4 |
Fedora (from v22) | XFS |
SLES (from v12) | Btrfs for the root partition, XFS for data partitions |
RHEL 7 | XFS |
Most filesystems are very similar and differ only in detail. The following terms will help you understand them:
- Superblock: Stores metadata about a filesystem, such as the total number of blocks and inodes, block sizes, UUIDs, and timestamps.
- Inode: An index node, which comprises metadata associated with a file, such as permissions, owners, timestamps, and so on. In addition to this descriptive information, an inode can contain direct extents (data) or refer to another inode.
- Extents: Contiguous areas of storage reserved for a file. Older filesystems used direct and indirect blocks to reference blocks of data, whereas modern filesystems use extents [1], which map ranges of logical filesystem blocks to physical blocks more efficiently.
- Journaling: A method of tracking changes that have not yet been committed to the filesystem. A journal comes into its own in exceptional situations, such as during the recovery of filesystems that have crashed (e.g., because of a sudden power failure). Journaling ensures filesystem consistency, because operations recorded in the journal are either performed in full or not at all. With this information, you can get back to a consistent state faster without having to go through a lengthy filesystem check.
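To make the inode concept a little more tangible, the following sketch creates a file and a hard link in a temporary directory and shows that both names point to the same inode. It assumes GNU coreutils stat (with the -c format option); the file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: two directory entries (hard links) share a single inode.
# Assumes GNU coreutils stat; the file names are hypothetical.
workdir=$(mktemp -d)
echo "hello" > "$workdir/original.txt"
ln "$workdir/original.txt" "$workdir/hardlink.txt"

# %i = inode number, %h = number of hard links to the inode
inode_a=$(stat -c '%i' "$workdir/original.txt")
inode_b=$(stat -c '%i' "$workdir/hardlink.txt")
links=$(stat -c '%h' "$workdir/original.txt")

echo "inode of original.txt: $inode_a"
echo "inode of hardlink.txt: $inode_b"
echo "link count:            $links"

rm -rf "$workdir"
```

Both inode numbers are identical, and the link count is 2, because the metadata lives in the inode and not in the directory entry.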
From RAM to Persistent Memory
Random access memory (RAM) has speed advantages over hard drives and SSDs; therefore, the Linux kernel uses a caching mechanism that keeps data in RAM to reduce disk access. This cache is known as the page cache; running the free command reveals its current size (Listing 1). At first glance, only 2.7GB of 7.7GB of RAM is available to the system. However, if you count the memory occupied by the page cache as available, 5.6GB is actually free. The page cache itself occupies 2.7GB (cached column). The buffers column also belongs to the page cache; buffers is where cached filesystem metadata resides.
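If you prefer raw numbers to the free summary, the same figures can be read from /proc/meminfo. The sketch below is one way to do so; meminfo_kb is a hypothetical helper name, and the optional second argument exists only so that the function can be tried against a saved copy of the file.

```shell
#!/bin/sh
# Sketch: read the page cache figures from a meminfo-style file.
# meminfo_kb is a hypothetical helper, not a standard tool.
meminfo_kb() {
    # $1 = field name (e.g., Cached), $2 = file (default: /proc/meminfo)
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    buffers=$(meminfo_kb Buffers)
    cached=$(meminfo_kb Cached)
    echo "buffers: ${buffers:-0} kB"
    echo "cached:  ${cached:-0} kB"
    echo "sum:     $(( ${buffers:-0} + ${cached:-0} )) kB"
fi
```

The exact match on the field name keeps SwapCached from being mistaken for Cached.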
Listing 1
Free Space
$ free -h
             total       used       free     shared    buffers     cached
Mem:          7.7G       4.9G       2.7G       228M       203M       2.7G
-/+ buffers/cache:       2.1G       5.6G
Swap:         1.0G         0B       1.0G
The page cache consists of physical pages in RAM, whose data pages are associated with a block device. The page cache size is always dynamic, because it uses any RAM that is not being used by the operating system. If the system suffers from high memory consumption, the page cache size is reduced, freeing up memory for applications.
The page cache is a write-back cache, which means it buffers both read and write data. A read from the block device propagates the data to the cache, which is then passed to the application. A write access lands directly in the page cache and not immediately on the block device. Data pages modified while in the page cache are called "dirty pages," because the modified data has not yet been written to persistent storage. Gradually, the Linux kernel writes data from RAM to the block device.
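How "gradually" the kernel writes dirty pages is governed by tunables under /proc/sys/vm. The following sketch prints two of them, dirty_writeback_centisecs and dirty_expire_centisecs, in seconds and then shows how much dirty data is currently waiting in RAM. The helper name centisecs_to_secs is an assumption for this example; the values you see depend on your distribution and kernel settings.

```shell
#!/bin/sh
# Sketch: inspect the kernel's writeback timing and the current dirty data.
# centisecs_to_secs is a hypothetical helper for readability.
centisecs_to_secs() {
    # Convert hundredths of a second to seconds with two decimals
    awk -v cs="$1" 'BEGIN { printf "%.2f\n", cs / 100 }'
}

for knob in dirty_writeback_centisecs dirty_expire_centisecs; do
    file="/proc/sys/vm/$knob"
    if [ -r "$file" ]; then
        echo "$knob: $(centisecs_to_secs "$(cat "$file")") s"
    fi
done

# Dirty = modified pages not yet written; Writeback = pages being written now
grep -E '^(Dirty|Writeback):' /proc/meminfo 2>/dev/null || true
```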
In addition to the kernel's periodic writeback, ext4 explicitly synchronizes its data and metadata at an interval of five seconds by default. You can change the sync time if necessary with the commit option to the mount command (see the ext4 documentation at kernel.org [2]). In the worst case, the data still in RAM is lost in a sudden power outage. The longer the commit interval, the greater the risk of data loss.
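As a small illustration, the sketch below extracts the commit interval from an ext4 mount option string and falls back to the five-second default mentioned above when the option is not set. The function name commit_interval and the mountpoint in the closing comment are assumptions for this example.

```shell
#!/bin/sh
# Sketch: find the commit interval in a comma-separated mount option string.
# commit_interval is a hypothetical helper; 5 is the default named in the text.
commit_interval() {
    value=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^commit=//p')
    echo "${value:-5}"
}

commit_interval "rw,relatime,commit=30,data=ordered"   # prints 30
commit_interval "rw,relatime,data=ordered"             # prints 5

# Changing the interval on a mounted filesystem requires root, for example
# (mountpoint is hypothetical):
#   mount -o remount,commit=30 /mnt/ext4fs
```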
The use of RAM as a cache provides huge performance advantages for the user. Don't forget, however, that RAM is volatile and not persistent. This fact forced itself into the awareness of many ext4 users recently when the "data corruption caused by unwritten and delayed extents" bug caused a stir [3]. On ext4, ephemeral files may never even reach the block device [4] under certain circumstances because of "delayed allocation."
Unlike ext3, ext4 delays allocating physical blocks for writes so the filesystem can accumulate data and allocate contiguous blocks later. This method gains the user a speed advantage when reading and writing the data while it is in RAM. Because the periodic commit cannot write blocks that have not yet been allocated, this data depends on the kernel to flush it out, which can translate to minutes in RAM instead of five seconds. Ext4 is not the only filesystem that uses this technique: XFS, ZFS, and Btrfs also use delayed allocation (Table 2).
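If a file must not linger in RAM, you can ask for it to be written out explicitly instead of waiting for the kernel. The following sketch shows two common ways from the shell, assuming GNU coreutils dd (the conv=fsync flag); the temporary file stands in for whatever data you want to protect.

```shell
#!/bin/sh
# Sketch: write a file and force its data to persistent storage right away.
# Assumes GNU coreutils dd; the target file is a throwaway example.
target=$(mktemp)

# conv=fsync makes dd call fsync() on the output file before it exits
printf 'important data\n' | dd of="$target" conv=fsync 2>/dev/null

# sync without arguments asks the kernel to flush all dirty pages
sync

cat "$target"
```

Applications can achieve the same effect by calling fsync() on their own file descriptors; the shell commands here are simply the quickest way to try it out.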
Table 2
Overview of Functional Filesystem Differences
Feature | ext3 | ext4 | XFS | Btrfs |
---|---|---|---|---|
Production-ready | X | X | X | Partially |
Utilities package | e2fsprogs | e2fsprogs | xfsprogs | btrfs-progs |
Filesystem utilities | mke2fs, resize2fs, e2fsck, tune2fs | mke2fs, resize2fs, e2fsck, tune2fs | mkfs.xfs, xfs_growfs, xfs_repair, xfs_admin | mkfs.btrfs, btrfs resize, btrfsck, btrfs filesystem |
Maximum filesystem size | 16TiB | 1EiB | 16EiB | 16EiB |
Maximum file size | 2TiB | 1EiB | 8EiB | 8EiB |
Expand on the fly | X | X | X | X |
Shrink on the fly | – | – | – | X |
Expand offline | X | X | – | – |
Shrink offline | X | X | – | – |
Discard (ATA trim) [5] | X | X | X | X |
Metadata CRC [6] | X | X | X | X |
Data CRC | – | – | – | X |
Snapshots/clones/internal RAID/compression | – | – | – | X |
ext4
As the successor to ext3, ext4 is one of the most popular Linux filesystems. Although ext3 is slowly reaching its limits, with a maximum filesystem size of 16 tebibytes (TiB; slightly more than 16TB), ext4 provides sufficient space for many years with up to 1 exbibyte (EiB) capacity.
To create a new ext4 filesystem, you need an unused block device. You can simply use a spare partition (e.g., /dev/sdb1 if you have created an unused partition on the second disk) or an LVM logical volume. In the following examples, we use a logical volume (/dev/vg00/ext4fs), which means we can also expand and shrink the filesystem.
With root privileges, run mkfs.ext4 to create the new filesystem:
mkfs.ext4 /dev/vg00/ext4fs
A newly created ext4 filesystem requires that the inode tables and the journal contain no data, so the corresponding areas must be reliably overwritten with zeros. This step can take a fair amount of time for larger filesystems, especially on hard drives. To let you use a new filesystem as soon as possible, the ext4 developers implemented what they refer to as "lazy initialization": The zeroing occurs not when you create the filesystem, but in the background after you first mount it. Little wonder, then, that you suddenly notice I/O activity after mounting a new filesystem.
Caution is therefore advised if you want to run performance tests on a newly created filesystem. In such cases, you should not create the filesystem with lazy initialization; instead, you should use the following parameters:
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vg00/ext4fs
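If you want to try these options without a spare block device, mkfs.ext4 can also format a regular file. The sketch below builds a small image this way; it assumes e2fsprogs is installed, and the 64MB size and temporary file are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: create a small ext4 image in a regular file with lazy
# initialization switched off. Size and location are arbitrary.
img=$(mktemp)
truncate -s 64M "$img"

if command -v mkfs.ext4 >/dev/null 2>&1; then
    # -q: quiet; -F: proceed although the target is not a block device
    mkfs.ext4 -q -F -E lazy_itable_init=0,lazy_journal_init=0 "$img"
else
    echo "mkfs.ext4 not found; install e2fsprogs to run this example" >&2
fi

ls -l "$img"
```

Because the inode tables and journal are zeroed during creation, no background initialization will distort later measurements on such a filesystem.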
To mount the filesystem, create an appropriate mountpoint up front and then run the mount command:
mkdir /mnt/ext4fs
mount /dev/vg00/ext4fs /mnt/ext4fs
If you want to mount the new filesystem automatically at boot time, add a corresponding entry in the /etc/fstab file. You can optionally specify the -o parameter for the mount command (e.g., to mount a partition as read-only). For the list of possible options, see the kernel.org ext4 documentation [2]. Once the filesystem is mounted, /proc/mounts shows only a few options (rw, relatime, data=ordered), along with any options that were passed to the mount command or set in /etc/fstab (e.g., errors=remount-ro):
# cat /proc/mounts | grep ext4
/dev/sda1 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/mapper/vg00-ext4fs /mnt/ext4fs ext4 rw,relatime,data=ordered 0 0
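For scripts, filtering on the filesystem type field is more precise than a plain grep, which would also match "ext4" in a device name or mountpoint. A minimal sketch follows; ext4_mounts is a hypothetical helper name, and its optional argument lets you point it at a saved copy of the file.

```shell
#!/bin/sh
# Sketch: list the mountpoint and options of every ext4 filesystem.
# ext4_mounts is a hypothetical helper, not a standard tool.
ext4_mounts() {
    # Fields in /proc/mounts: device, mountpoint, type, options, dump, pass
    awk '$3 == "ext4" { print $2 ": " $4 }' "${1:-/proc/mounts}"
}

if [ -r /proc/mounts ]; then
    ext4_mounts
fi
```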
In addition to these options, however, other default options are active. Since Linux kernel version 3.4, you can view the complete set of active options in the /proc filesystem. Listing 2 shows an example.
Listing 2
/proc Filesystem Info
# cat /proc/fs/ext4/sda1/options
rw
delalloc
barrier
user_xattr
acl
resuid=0
resgid=0
errors=remount-ro
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0
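To check a single setting across all mounted ext4 filesystems, you could loop over the directories below /proc/fs/ext4. The following sketch prints the commit interval for each one; ext4_option_value is a hypothetical helper that only handles options of the form name=value, and its third argument is there so that it can be tried against a copy of the directory tree.

```shell
#!/bin/sh
# Sketch: look up name=value options in /proc/fs/ext4/<device>/options.
# ext4_option_value is a hypothetical helper, not a standard tool.
ext4_option_value() {
    # $1 = device (e.g., sda1), $2 = option name, $3 = base directory
    sed -n "s/^$2=//p" "${3:-/proc/fs/ext4}/$1/options"
}

for dir in /proc/fs/ext4/*/; do
    # Skip if the glob did not match or the file is not readable
    [ -r "${dir}options" ] || continue
    dev=$(basename "$dir")
    echo "$dev: commit=$(ext4_option_value "$dev" commit)"
done
```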
Filesystem Check
After completing the most important setup steps, the advanced administration activities start with a filesystem check. When you run a check, the corresponding ext4 filesystem must not be mounted. You simply run the check using the e2fsck program; as an alternative, you can also use the symbolic link fsck.ext4. If the filesystem was unmounted cleanly, e2fsck skips the full check and exits right away; in that case, you can force a check with the -f parameter.
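The exit status of e2fsck is a bit mask, which is worth decoding if you run checks from a script. The sketch below translates the status values documented in the e2fsck man page into text; describe_fsck_status is a hypothetical helper, and the device in the usage comment is the example volume used earlier.

```shell
#!/bin/sh
# Sketch: translate the e2fsck exit status (a bit mask) into text.
# describe_fsck_status is a hypothetical helper, not part of e2fsprogs.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    msg=""
    for pair in "1:errors corrected" "2:reboot recommended" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage or syntax error" "32:canceled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $((status & bit)) -ne 0 ]; then
            msg="${msg:+$msg; }$text"
        fi
    done
    echo "$msg"
}

# Usage (requires root and an unmounted filesystem):
#   e2fsck -f /dev/vg00/ext4fs; describe_fsck_status $?
describe_fsck_status 0    # prints: no errors
describe_fsck_status 12   # prints: errors left uncorrected; operational error
```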