How Old Is That Data?
One of the fundamental theorems of system administration is: “Users will find a way to use space faster than new space can be added to systems.” The corollary to this theorem is: “Users will always insist that all of their data is critical and must be retained online.” To get their arms around the data boom, system administrators can use tools that scan filesystems to determine how much data is being used and to “age” that data.
In this article, I want to introduce an essential admin application named agedu that you can use to get a snapshot of the ages of files and directories. From this information, you can get a general sense of which directories hold older data that hasn't been accessed (or modified) in a while. You can also use agedu in scripts to create reports about systems or simply to understand what’s going on with your storage, even on your desktop or laptop.
Studies of Data Age
A few years ago, a study from the University of California, Santa Cruz, and NetApp examined CIFS storage deployed within NetApp itself. Part of the storage was deployed in the corporate data center, where the hosts were used by more than 1,000 marketing, sales, and finance employees. The second part was a high-end file server deployed in the engineering data center and used by more than 500 engineering employees. From this study, a few observations can be made:
- Workloads are more write-oriented
- Files are 10x bigger than in previous studies
- Files are rarely reopened: >66 percent are reopened just once, and 95 percent are opened fewer than five times
- <1 percent of clients account for 50 percent of requests
- >76 percent of files are opened by just one client
- Only 5 percent of files are opened by multiple clients, and 90 percent of those are read just once
The big Vegas finish for the study was that more than 90 percent of the active storage was untouched during the study.
If you combine these results with those of other studies, it becomes apparent that users are creating more data than before, they are keeping it around, and they are not reusing much of it. However, if you ask the users, they will naturally tell you that all of the data is important and cannot be erased. If the data hasn't been touched in two years, is it still needed? To answer that question, you need to be able to scan a user’s filesystem(s) to determine the age of the files.
Data Comes in All Ages
On *nix systems, “age” is not a single number. The age of a file or directory is measured in three ways:
- change time (ctime)
- access time (atime)
- modify time (mtime)
The first, ctime, is the time that changes were last made to the file or directory’s inode. This can include changes to the data, file or directory permissions, file or directory ownership, and so on. The ctime can be viewed with the command ls -lc. The second time, atime, is the time the file was last accessed. The access times can be found with the command ls -lu. The third time, mtime, is the modify time, or the time the actual file contents were changed. You can view the modify time with the command ls -l. To get all of the information in a quasi-readable format, you can use the stat command in Linux.
When talking about the "age" of a file, you need a precise definition. Does the discussion concern ctime, atime, or mtime? Do you need to take into consideration a combination of the three metrics? Are you interested in the most recent of the three times, regardless of the metric (i.e., max(ctime, atime, mtime))?
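These definitions are easy to explore programmatically. The short Python sketch below reads the same three timestamps that ls -lc, ls -lu, and ls -l report, and computes one possible definition of "age": seconds since the most recent of the three times. The choice of max() here is just one of the definitions discussed above.

```python
import os
import time

def file_age_seconds(path):
    """One possible definition of a file's age: seconds since the
    most recent of its ctime, atime, and mtime."""
    st = os.stat(path)
    newest = max(st.st_ctime, st.st_atime, st.st_mtime)
    return time.time() - newest

# Inspect the three timestamps of the current directory
st = os.stat(".")
print("ctime:", time.ctime(st.st_ctime))  # inode change time (ls -lc)
print("atime:", time.ctime(st.st_atime))  # last access time  (ls -lu)
print("mtime:", time.ctime(st.st_mtime))  # modification time (ls -l)
```

If you instead care about the oldest sign of activity, swap max() for min(); the right choice depends on which question you are asking about the file.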
As an aside, many users want to know when a file was first created (i.e., its “birth” time), regardless of whether the data or the inode information has changed. Some discussion has taken place about adding this time to various filesystems, but no real standard has developed around it.
Every time a file is accessed, the atime changes, forcing the filesystem to update the file’s inode: the inode is read, the atime is modified, and the inode is written back to storage, even if the data in the file has not actually changed. This process generates a large number of very small I/O operations (IOPS).
Many filesystems allow you to turn off atime updates to reduce the IOPS load, which can increase performance at the price of not being able to track when a file was last accessed. A lot of people don't care about the last access time, particularly on local systems such as a laptop or a desktop; for HPC systems, however, atime can be a very important number because it allows you to track when a file was last accessed.
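On Linux, atime updates are typically disabled with the noatime mount option. The fstab entry below is illustrative (the device and mount point are placeholders); relatime, the Linux default, is a middle ground that updates atime only when it is older than the mtime or ctime.

```
# /etc/fstab -- noatime suppresses access-time updates entirely
# (device and mount point are examples only)
/dev/sdb1   /home   ext4   defaults,noatime   0  2
```

The same option can be applied to a live system with `mount -o remount,noatime /home`.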
Before plowing into a very large filesystem with millions (or even billions) of files, it would be good to get a glimpse of the distribution of the three ages of the files and directories, so you can focus on where most of the older files are located. A simple tool for this task is agedu.
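To make the idea behind such a scan concrete before turning to agedu itself, here is a minimal Python sketch that walks a directory tree and bins each file by the age of its last access. The bin boundaries are arbitrary choices for illustration, not anything agedu prescribes.

```python
import os
import time
from collections import Counter

def age_histogram(root, bins=(30, 180, 365, 730)):
    """Walk 'root' and count files by atime age in days.
    'bins' are upper bounds in days; files older than the last
    bound fall into a final catch-all bucket."""
    now = time.time()
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                age_days = (now - os.stat(path).st_atime) / 86400
            except OSError:
                continue  # file vanished or is unreadable; skip it
            for bound in bins:
                if age_days <= bound:
                    counts[f"<= {bound} days"] += 1
                    break
            else:
                counts[f"> {bins[-1]} days"] += 1
    return counts

# Example: summarize your home directory (path is illustrative)
# print(age_histogram(os.path.expanduser("~")))
```

A real tool like agedu does this kind of scan once, stores the results in an index, and lets you query them repeatedly without rewalking the filesystem.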