How Old Is That Data?

One of the fundamental theorems of system administration is: “Users will find a way to use space faster than new space can be added to systems.” The corollary to this theorem is: “Users will always insist that all of their data is critical and must be retained online.” To get their arms around this data boom, system administrators can use tools that scan filesystems to determine how much data is being used and to “age” that data.

In this article, I want to introduce an essential admin application named agedu that you can use to get a snapshot of the age of files and directories. From this information, you can get a general sense of which directories hold older data that hasn't been accessed (or modified) in a while. agedu can also be used in scripts to create reports about systems or simply to understand what’s going on with your storage, even on your desktop or laptop.

Studies of Data Age

A few years ago, a study from the University of California, Santa Cruz, and NetApp examined CIFS storage within NetApp itself. One part of the storage was deployed in the corporate data center, where the hosts were used by more than 1,000 marketing, sales, and finance employees. The second part was a high-end file server deployed in the engineering data center and used by more than 500 engineering employees. From this study, a few observations can be made:

  • Workloads are more write oriented
  • Files are 10x bigger than in previous studies
  • Files are rarely reopened: >66 percent are reopened just once and 95 percent are opened fewer than five times
  • <1 percent of clients account for 50 percent of requests
  • >76 percent of files are opened by just one client
  • Only 5 percent of files are opened by multiple clients, and 90 percent of those are read just once

The big Vegas finish for the study was that more than 90 percent of the active storage was untouched during the study.

If you combine these results with those of other studies, it becomes apparent that users are creating more data than before, they are keeping it around, and they are not reusing much of it. However, if you ask the users, they will naturally tell you that all of the data is important and cannot be erased. If the data hasn't been touched in two years, is it still needed? To answer that question, you need to be able to scan a user’s filesystem(s) to determine the age of files.
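As a quick first pass before reaching for a dedicated tool, plain find can flag candidates. The following is a minimal sketch, assuming GNU find and a hypothetical /home/user path (730 days approximates two years):

    # List regular files whose last access was more than ~2 years ago,
    # oldest first. %A@ prints the access time in epoch seconds.
    find /home/user -type f -atime +730 -printf '%A@ %p\n' | sort -n | head

Note that -atime keys off the access time; substitute -mtime to key off the last modification instead.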

Data Comes in All Ages

On *nix systems, “age” is not a single number. The age of a file or directory is measured in three ways:

  • change time (ctime)
  • access time (atime)
  • modify time (mtime)

The first, ctime, is the time that changes were last made to the file or directory’s inode. This can include changes to the data, file or directory permissions, file or directory ownership, and so on. The ctime can be viewed with the command ls -lc. The second, atime, is the time the file was last accessed; access times can be found with the command ls -lu. The third, mtime, is the modify time, or the time the actual file contents were changed. You can view the modify time with the command ls -l. To get all of this information in a quasi-readable format, you can use the stat command in Linux.
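For example, you can compare the three timestamps for a single file side by side (data.dat is just a placeholder name; output formats vary by distribution):

    ls -l  data.dat    # long listing showing mtime
    ls -lu data.dat    # same listing, but showing atime
    ls -lc data.dat    # same listing, but showing ctime
    stat data.dat      # all three timestamps in one report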

When talking about the "age" of a file, you need a precise definition. Does the discussion concern ctime, atime, or mtime? Do you need to take into consideration a combination of the three metrics? Are you interested in the most recent of the three, regardless of the metric (i.e., max[ctime, atime, mtime])?
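If the last activity of any kind is what matters, the newest of the three timestamps can be extracted with GNU stat. A minimal sketch, with file.txt as a placeholder:

    # %X = atime, %Y = mtime, %Z = ctime, all in epoch seconds;
    # sort numerically and keep the largest (most recent) value.
    stat -c '%X %Y %Z' file.txt | tr ' ' '\n' | sort -n | tail -n 1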

As an aside, many users want to know when a file was first created (i.e., its “birth” time), regardless of whether the data or the inode information has changed. Some discussion has taken place about adding this time to various filesystems, but no real standard has developed around it.

Every time a file is accessed, the atime changes, forcing the filesystem to change the file inode by reading the inode, modifying the atime, and then writing the inode back to storage, even if the data in the file is not actually changed. This process generates a large number of very small I/O operations (IOPS).
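You can watch this happen with stat. A minimal sketch, assuming the filesystem is mounted with atime updates enabled (with the common relatime default, the atime is only updated on the first read after a write, or if the recorded atime is more than a day old):

    stat -c '%x' file.txt      # access time before
    cat file.txt > /dev/null   # read the file without changing it
    stat -c '%x' file.txt      # access time after (updated if atime is on)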

Many filesystems allow you to turn off atime updates to reduce the IOPS load, which can increase performance at the price of not being able to track when a file was last accessed. Although a lot of people don't care about the last access time, particularly on local systems such as a laptop or a desktop, for HPC systems, atime can be a very important number.
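Turning atime updates off is typically done with the noatime mount option. A minimal sketch of an /etc/fstab entry (the device and mount point are placeholders):

    # /etc/fstab: mount an ext4 filesystem without atime updates
    /dev/sdb1  /data  ext4  defaults,noatime  0  2

    # Or remount an existing filesystem on the fly:
    mount -o remount,noatime /data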

Before plowing into a very large filesystem with millions (or even billions) of files, it would be good to get a glimpse of the distribution of the three ages of the files and directories so you can focus on where most of the older files are located. A simple tool for this task is agedu.
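As a preview of how agedu is typically used, here is a minimal sketch (paths are placeholders; check the agedu man page for the exact options in your version):

    # Scan a directory tree and record the results in an index file (agedu.dat):
    agedu -s /home/user

    # Serve an interactive report from the index on a local web port:
    agedu -w

    # Or dump a plain-text report of usage by subdirectory:
    agedu -t /home/user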
