NVDIMM and the Linux kernel

Steadfast Storage

New Options

The access semantics of NVDIMMs differ enormously from those of block-oriented devices such as SSDs and hard drives [4]. NVDIMMs are connected like memory and behave like memory; the Linux kernel addresses them via RAM-style load/store operations. Drivers can also turn them into ultra-fast block devices, which are addressed via read/write operations, opening up new possibilities for custom programs and the kernel.

At the moment, XFS and ext4 are the only filesystems that use direct access (DAX) [5]. They can bypass the classic kernel request semantics for block devices and communicate directly with the NVDIMM modules. This eliminates the detour through conventional RAM, and the CPU no longer needs to handle DMA completion IRQs.
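The difference between load/store and read/write access can be sketched with a memory-mapped file. The following is only an illustration in Python: it runs against an ordinary temporary file, and the helper names store_via_mapping and load_via_mapping are made up for this example. On a filesystem mounted with DAX, the same kind of mapping would point at the NVDIMM itself rather than at a copy in RAM.

```python
import mmap
import os
import tempfile


def store_via_mapping(path, offset, payload):
    """Place payload in a mapped file with memory-style stores (no write() call)."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mem:            # map the whole file
            mem[offset:offset + len(payload)] = payload  # plain store into the mapping
            mem.flush()                                  # ask the kernel to make it durable


def load_via_mapping(path, offset, length):
    """Fetch bytes from a mapped file with memory-style loads (no read() call)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
            return bytes(mem[offset:offset + length])


# Demo on a regular 4KB temporary file (hypothetical stand-in for a DAX file).
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"\0" * 4096)
os.close(fd)

store_via_mapping(demo_path, 128, b"hello pmem")
result = load_via_mapping(demo_path, 128, 10)
os.remove(demo_path)
```

The program never issues a read or write call for the payload; it only touches addresses inside the mapping, which is the access pattern that DAX-capable filesystems pass straight through to the modules.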

Not with RAID

The approach has one disadvantage: Because the filesystems address storage directly, duplicating data in the style of software RAID is no longer an option. If you need to store your data redundantly, you cannot use DAX. The software RAID drivers cannot access the data [6], because it is ideally created and saved directly on the NVDIMMs, which removes the need to copy it from DRAM to the NVDIMMs.

If you also want to bypass the kernel, then check out the NVM Library (NVML) project [7]. The library for C and C++ abstracts the specifics of NVDIMMs for the programmer and thus supports direct access to the hardware. This is especially suitable for back-end databases that require the speed of in-memory databases as well as a short startup time: The records already reside on the NVDIMMs and do not need to be copied into main memory. Figure 3 provides an overview of the software architecture of Linux NVDIMM support.

Figure 3: The architecture of the NVDIMM software [8] on newer Linux kernels and in userspace.

Integrity in the Arena

If the integrity of your data is more important to you than speed, you need to check out the NVDIMM Block Translation Table (BTT) subsystem in the kernel. BTT describes a method for accessing NVDIMMs atomically, in the manner typical of SCSI devices. An operation either completes in full or does not happen at all; there are no in-between states.

If a BTT resides on the NVDIMM, the kernel divides the available storage space into areas of 512GB, known as arenas. Each arena starts with the 4KB arena info block and ends with the BTT map, the BTT flog (a portmanteau of "free list" and "log"), and a 4KB copy of the arena info block. The actual data is located in between.
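The arithmetic behind this layout can be sketched in a few lines of Python. The 512GB arena size and the 4KB info block come from the description above; the function names and the rule that the last arena simply takes whatever capacity is left are assumptions made for this illustration, not a statement about the kernel code.

```python
GIB = 1024 ** 3
ARENA_SIZE = 512 * GIB   # arena size named in the text
INFO_BLOCK = 4096        # 4KB arena info block


def split_into_arenas(capacity):
    """Return a list of (offset, size) tuples, one per arena."""
    arenas = []
    offset = 0
    while offset < capacity:
        # Assumption of this sketch: the last arena takes the remainder.
        size = min(ARENA_SIZE, capacity - offset)
        arenas.append((offset, size))
        offset += size
    return arenas


def info_block_offsets(arena_offset, arena_size):
    """Offsets of the info block at the start and of its copy at the end."""
    return arena_offset, arena_offset + arena_size - INFO_BLOCK


# Hypothetical device with 1280GB of capacity.
layout = split_into_arenas(1280 * GIB)
first_info, first_copy = info_block_offsets(*layout[0])
```

For the hypothetical 1280GB device, the sketch yields two full arenas followed by one 256GB arena, each framed by an info block and its copy.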

The BTT map is a simple table that translates a logical block address (LBA) to the internal blocks of the NVDIMM. The BTT flog is a list that keeps track of the free blocks on the NVDIMM. Any write access to a BTT-managed NVDIMM first goes to a free block from the flog. Linux only updates the BTT map if the write was successful. If a power failure occurs during the write process, the old version of the data remains in place, because the BTT map still points to it until the update happens.
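A toy model in Python can make the write sequence easier to follow. It is a simplified sketch, not the kernel implementation: the class ToyBTT, the exception PowerFailure, and the single-entry free list are all invented for this example, and the power failure is simulated with a flag.

```python
class PowerFailure(Exception):
    """Simulated loss of power in the middle of a write."""


class ToyBTT:
    def __init__(self, logical_blocks, spare_blocks=1):
        total = logical_blocks + spare_blocks
        self.blocks = [b""] * total                 # internal blocks on the module
        self.map = list(range(logical_blocks))      # LBA -> internal block
        self.free = list(range(logical_blocks, total))  # stand-in for the flog

    def read(self, lba):
        return self.blocks[self.map[lba]]

    def write(self, lba, data, fail_before_map_update=False):
        new_block = self.free[0]           # 1. take a free block
        self.blocks[new_block] = data      # 2. write the new data there
        if fail_before_map_update:
            raise PowerFailure("power lost before the map update")
        old_block = self.map[lba]
        self.map[lba] = new_block          # 3. switch the map entry
        self.free[0] = old_block           # 4. the old block becomes free


btt = ToyBTT(logical_blocks=4)
btt.write(2, b"v1")

try:
    btt.write(2, b"v2", fail_before_map_update=True)
except PowerFailure:
    pass
after_failure = btt.read(2)    # map was never switched, so the old data is returned

btt.write(2, b"v2")
after_retry = btt.read(2)
```

Because the map entry is switched only after the data has landed in the free block, a reader sees either the complete old version or the complete new version of a block, never a mixture.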
