NVDIMM and the Linux kernel
Steadfast Storage
New Options
NVDIMMs and block-oriented devices such as SSDs and hard drives differ fundamentally in their access semantics [4]. NVDIMMs are connected like memory and behave like memory: The Linux kernel addresses them with RAM-oriented load/store operations. Drivers can also present them as ultra-fast block devices that are addressed with read/write operations, which opens up new possibilities for custom programs and the kernel.
At the moment, XFS and ext4 are the only filesystems that support direct access (DAX) [5]. They can bypass the classic kernel request path for block devices and communicate directly with the NVDIMM modules. This eliminates the detour through conventional RAM, and the CPU no longer needs to handle DMA completion IRQs.
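The difference between the two access styles can be sketched from user space. The following minimal Python example is an illustration only, not a DAX benchmark: it uses an ordinary temporary file, and the helper names (block_style_write, memory_style_demo, and so on) are made up for this sketch. On a filesystem mounted with DAX, the same mmap call would map the NVDIMM pages directly instead of going through the page cache.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def block_style_write(fd, offset, data):
    """Block-style access: one explicit write system call per request."""
    os.pwrite(fd, data, offset)


def block_style_read(fd, offset, length):
    """Block-style access: one explicit read system call per request."""
    return os.pread(fd, length, offset)


def memory_style_demo(path):
    """Write via a system call, then touch the same bytes through a mapping."""
    fd = os.open(path, os.O_RDWR)
    try:
        block_style_write(fd, 0, b"block")
        with mmap.mmap(fd, PAGE) as view:
            first = bytes(view[0:5])   # load: plain memory read, no read() call
            view[0:6] = b"memory"      # store: plain memory write, no write() call
            view.flush()               # ask the kernel to make the stores durable
        second = block_style_read(fd, 0, 6)
    finally:
        os.close(fd)
    return first, second


def run_demo():
    """Create a one-page scratch file (a stand-in for a file on a DAX mount)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.truncate(PAGE)
        path = tmp.name
    try:
        return memory_style_demo(path)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(run_demo())
```

The mapping gives the program byte-granular access with ordinary slice operations, whereas the block-style helpers need a system call for every transfer.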
Not with RAID
The approach has one disadvantage: Because the filesystems address storage directly, the option of duplicating data in the style of software RAID is no longer available. If you need to store your data redundantly, you can't use this feature. The software RAID drivers never see the data [6], because it is ideally created and saved directly on the NVDIMMs, which removes the copy from DRAM to the NVDIMMs.
If you also want to bypass the kernel, then check out the NVM Library (NVML) project [7]. The library for C and C++ abstracts the specifics of NVDIMMs for the programmer and thus supports direct access to the hardware. This is especially suitable for back-end databases that require the speed of in-memory databases as well as a short startup time: The records already reside on the NVDIMMs and do not need to be copied into main memory first. Figure 3 provides an overview of the software architecture of Linux NVDIMM support.
Integrity in the Arena
If the integrity of your data is more important to you than speed, you need to check out the NVDIMM Block Translation Table (BTT) subsystem in the kernel. BTT describes a method for accessing NVDIMMs atomically, in the manner typical of SCSI devices. An operation either completes in full or does not happen at all, without any in-between states.
If a BTT resides on the NVDIMM, the kernel divides the available storage space into areas of up to 512GB, known as arenas. Each arena starts with the 4KB arena info block. Each arena ends with the BTT map and the BTT flog (a portmanteau of "free list" and "log"), as well as a 4KB copy of the arena info block. The actual data is located in between.
The BTT map is a simple table that translates a logical block address (LBA) to the internal blocks of the NVDIMM. The BTT flog is a list that tracks the free blocks on the NVDIMM. Any write access to a BTT-managed NVDIMM first goes to a free block from the flog. Linux only updates the BTT map if the write was successful. If a power failure occurs during the write process, the map still points to the old block, so the old version of the data remains intact and readable.
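The division into arenas is simple arithmetic, which the following Python sketch illustrates. It rests on several assumptions that are not taken from the text: 512GB is interpreted as 512 x 2^30 bytes, the last arena simply receives whatever space is left, and alignment and minimum-size rules are ignored. The function name arena_layout and the dictionary keys are made up for this example. The sizes of the BTT map and flog are deliberately left out, because they depend on the block size in use.

```python
GIB = 1024 ** 3
ARENA_MAX = 512 * GIB     # maximum arena size (assumption: GB read as 2**30 bytes)
INFO_BLOCK = 4 * 1024     # 4KB arena info block


def arena_layout(namespace_bytes):
    """Split a namespace into arenas and locate both info blocks of each.

    Simplified model: the last arena takes the remaining space, and no
    alignment or minimum-size rules are applied.
    """
    arenas = []
    start = 0
    while start < namespace_bytes:
        size = min(ARENA_MAX, namespace_bytes - start)
        arenas.append({
            "start": start,                              # first byte of the arena
            "size": size,
            "info": start,                               # info block at the start
            "info_backup": start + size - INFO_BLOCK,    # 4KB copy at the very end
        })
        start += size
    return arenas


if __name__ == "__main__":
    # Hypothetical 1280GiB namespace: two full arenas plus a smaller one.
    for arena in arena_layout(1280 * GIB):
        print(arena["start"] // GIB, arena["size"] // GIB, arena["info_backup"])
```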
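The write sequence described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the kernel's data structures: The class ToyArena, the PowerFailure exception, and the fail_before_map_update switch are invented for illustration, the flog is reduced to a plain list holding a single spare block, and the real on-media log entries are omitted.

```python
class PowerFailure(Exception):
    """Simulated loss of power in the middle of a write (hypothetical)."""


class ToyArena:
    """Toy model of an arena: data blocks, a map, and a one-entry free list."""

    def __init__(self, logical_blocks, block_size=4):
        # One spare physical block in addition to the logical blocks.
        self.blocks = [bytes(block_size) for _ in range(logical_blocks + 1)]
        self.btt_map = list(range(logical_blocks))   # LBA -> physical block
        self.free = [logical_blocks]                 # stand-in for the flog

    def read(self, lba):
        return self.blocks[self.btt_map[lba]]

    def write(self, lba, data, fail_before_map_update=False):
        new_block = self.free[-1]          # pick a free block, leave it listed
        self.blocks[new_block] = data      # step 1: write the data there
        if fail_before_map_update:
            raise PowerFailure("power lost before the map was updated")
        old_block = self.btt_map[lba]
        self.btt_map[lba] = new_block      # step 2: switch the map entry
        self.free[-1] = old_block          # the old block becomes the free one


if __name__ == "__main__":
    arena = ToyArena(logical_blocks=2)
    arena.write(0, b"AAAA")
    try:
        arena.write(0, b"BBBB", fail_before_map_update=True)
    except PowerFailure:
        pass
    print(arena.read(0))   # the map was never switched, so the old data is read
```

Because the map entry changes only after the data has reached the free block, a reader sees either the complete old block or the complete new one, never a mixture.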