How Old Is That Data?

agedu

Filesystems can contain thousands, millions, or even billions of files. Tracking how they are being used is very difficult and time consuming. Fortunately, a simple tool named agedu can give you a quick glimpse into the “age” of the data on a directory basis. In the case of HPC systems, you can use it to scan directories quickly for old applications or user directories.

Agedu is likely to be in your distribution repository (e.g., the CentOS 6 and CentOS 7 EPEL repositories). However, if it isn't there, it is simple to install, configure, and run. The familiar ./configure and make commands build the code, after which you install it into /usr/local as root. The other option is to build the code and install the tool in your user account; for example:

./configure --prefix=/home/laytonjb/bin/agedu

Then you just create an alias to the executable. For example, in your .bashrc file, add the line:

 alias agedu=/home/laytonjb/bin/agedu/bin/agedu

After you have installed agedu, whether from source or with your package manager, you can proceed in a number of ways. The first thing you should do is create an index of all of the files in the directory tree and their sizes. All subsequent queries can be made against the index, which is much faster than continually scanning the filesystem. Note that agedu sums the used storage for all directories and files below the scanned directory (i.e., like du -s). Once the index is built, you can then “query” it for a variety of information. Agedu also comes with a basic web server, so it can produce a graphical display of the results.

To create an index of the directory tree, you just run the command,

[laytonjb@home4 ~]$ agedu -s /home/laytonjb
Built pathname index, 748917 entries, 67182982 bytes of index
Faking directory atimes
Building index
Final index file size = 162381160 bytes
[laytonjb@home4 ~]$ ls -s agedu*
158580 agedu.dat

where -s <directory> produces an index file named agedu.dat in the current directory. (Note: If the index file is in a directory being scanned, agedu will ignore it.)

Once the index is created, you can query it. A great way to get started is to use the HTML display capabilities. Agedu will print out a URL that you can then copy into your browser. For example,

[laytonjb@home4 ~]$ agedu -w
Using Linux /proc/net magic authentication
URL: http://127.0.0.1:42821/

A screenshot of the web browser output is shown in Figure 1.

Figure 1: Agedu screenshot using access time (atime).

The web graphics display the age of the files in a specific directory, with red being the oldest and green being the newest. (For this example, notice that the oldest file is four years old.) The web page orders the directories by total space used. For this specific example, the first directory has the vast majority of the used space, as well as a large number of fairly new files. The second and third directories have some older files, as does the fifth directory.

The image also indicates the total space used by a directory (to the far left) and the percentage of the total space the directory uses (to the far right). When you are finished with the web page, just close agedu by pressing Ctrl+C.

In Figure 1, notice at the very top of the page that it states the data age is based on access time (atime), which is the default setting. However, you can easily perform the same analysis using mtime if you like (at this time ctime is not an option), with:

[laytonjb@home4 ~]$ agedu --mtime -s /home/laytonjb
Built pathname index, 751666 entries, 67486718 bytes of index
Faking directory atimes
Building index
Final index file size = 174723288 bytes
[laytonjb@home4 ~]$ ls -s agedu*
170632 agedu.dat
[laytonjb@home4 ~]$ agedu -w
Using Linux /proc/net magic authentication
URL: http://127.0.0.1:51579/

Remember that the first command produces the index; then, you need either to display the graphic output, as in the second command, or to query the index.

Figure 2 shows the resulting web page when mtime is used as the metric.

Figure 2: Agedu screenshot using modify time (mtime), even though it says “access time.”

Note that the top of the web page still says last-access time, even though mtime was used.

The range of dates is from six years to present. The oldest end of the spectrum (around six years) is fairly small, with a fairly long spread of “new” files in terms of mtime. The subdirectory AWS has the largest percentage of file capacity and some of the youngest files (lots of PDF files).
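To see how the two timestamps differ for an individual file, a small helper along the following lines can be handy. This is only a sketch: file_times is a hypothetical name, and it assumes GNU coreutils stat (the -c format option), so it will not work unchanged on BSD or macOS.

```shell
# Print the access and modify times of a file as seconds since the epoch.
# Assumes GNU stat: %X is the last access time, %Y the last modify time.
file_times() {
    stat -c 'atime=%X mtime=%Y' -- "$1"
}

# Hypothetical usage:
#   file_times ~/.bashrc
```

A file that is read often but never changed will show a recent atime and an old mtime, which is why the two agedu views of the same tree can look quite different.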

In addition to the HTML output, you can query the database to get text information (which is great for scripting). For example,

[laytonjb@home4 ~]$ agedu -s /home/laytonjb
[laytonjb@home4 ~]$ agedu -t /home/laytonjb

sends a summary of space usage (including subdirectories) as text to stdout.
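Because the text report is just a size column followed by a path, it is easy to post-process. The following sketch converts the size column to human-readable units. The function name humanize_report is hypothetical, and the code assumes the sizes are reported in KiB (as du does by default); check that against your agedu version before relying on the numbers.

```shell
# Convert "<size> <path>" lines (as printed by agedu -t) to readable units.
# ASSUMPTION: the size column is in KiB.
humanize_report() {
    awk '
        BEGIN { split("KiB MiB GiB TiB", unit, " ") }
        {
            size = $1 + 0
            sub(/^[ \t]*[0-9]+[ \t]+/, "")   # strip the size, keep the path
            u = 1
            while (size >= 1024 && u < 4) { size /= 1024; u++ }
            printf "%8.1f %s  %s\n", size, unit[u], $0
        }
    '
}

# Hypothetical usage (requires an existing agedu.dat index):
#   agedu -t /home/laytonjb | humanize_report
```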

By default, agedu uses the oldest file to set the scale displayed in the web output. With the text option, you can instead query the index for data older than an age you choose, independent of that scale. For example, you can report the amount of space in each directory that is used by data older than six months:

[laytonjb@home4 ~]$ agedu -s /home/laytonjb
[laytonjb@home4 ~]$ agedu -a 6m -t /home/laytonjb
4           /home/laytonjb/.abrt
7344        /home/laytonjb/.adobe
8           /home/laytonjb/.atom
14764       /home/laytonjb/.cache
8           /home/laytonjb/.cfncluster
16188       /home/laytonjb/.config
4           /home/laytonjb/.dbus
8           /home/laytonjb/.distlib
3524        /home/laytonjb/.e
8           /home/laytonjb/.emacs.d
44          /home/laytonjb/.fontconfig
148         /home/laytonjb/.gconf
712         /home/laytonjb/.gimp-2.2
76          /home/laytonjb/.gimp-2.6
20          /home/laytonjb/.gkrellm2
60          /home/laytonjb/.gnome2
16          /home/laytonjb/.gnote
24          /home/laytonjb/.gnupg
1704        /home/laytonjb/.icewm
2600        /home/laytonjb/.kde
19260       /home/laytonjb/.komodoedit
1852        /home/laytonjb/.libreoffice
1044        /home/laytonjb/.local
104         /home/laytonjb/.lyx
...
15305060    /home/laytonjb/src
622704576   /home/laytonjb

The output shows the space usage summary for each subdirectory below the main directory that has data older than six months. This capability can be extremely useful when searching for directories that have very old data. From a system administrator’s perspective, a prime example would be to examine all home directories for the oldest data and then use agedu to scan individual user directories for really old data. The same queries can also be run from a script, either daily, weekly, or monthly, that creates a report of the directories with the oldest data, from which you can decide what to do (e.g., archiving).
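A periodic report of this kind might be sketched as follows. Everything named here is an illustrative assumption rather than part of agedu: the function old_data_report, the AGEDU override variable, and the dated report file name. The -f option, which the agedu man page documents for choosing the index file location, is used so the temporary index does not land in the current directory.

```shell
# Build a fresh index for a directory tree and save a report of data older
# than a given age, largest directories first. Prints the report file name.
old_data_report() {
    local target="$1" age="${2:-6m}" outdir="${3:-.}"
    local agedu_cmd="${AGEDU:-agedu}"   # hypothetical override, e.g., for testing
    local workdir report rc
    workdir="$(mktemp -d)" || return 1
    report="$outdir/old-data-$(date +%Y-%m-%d).txt"

    # Scan first, then query the index in text mode and sort by size.
    "$agedu_cmd" -f "$workdir/agedu.dat" -s "$target" >/dev/null &&
        "$agedu_cmd" -f "$workdir/agedu.dat" -a "$age" -t "$target" |
            sort -k1,1nr > "$report"
    rc=$?
    rm -rf "$workdir"
    [ "$rc" -eq 0 ] && printf '%s\n' "$report"
    return "$rc"
}

# Hypothetical usage, for example from a weekly cron job:
#   old_data_report /home/laytonjb 6m /var/tmp/reports
```

Keeping the dated reports around gives you a history to compare against later.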

The savvy administrators reading this article will be quick to realize that users could simply use the touch command to update the atime and mtime of their data, obscuring the real access and modify times of the data. However, one could use agedu to run reports fairly often to catch users doing this. It doesn't stop them, but at least you have a record of the users employing this method, and if they become abusers of space, you can at least talk to them and show them the reports. Despite the tone of this article, users are not evil in any sense, but having data to explain why they should compress or delete data is much more effective than simply demanding that they delete data. These reports can also help you identify users who need more space and then work with them to understand how they are using space, and they can help you when requesting more space because you can explain how space is being used, who is using it, and how fast data is growing.
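One way to spot such changes is to compare two saved text reports taken with the same age threshold. The sketch below lists directories whose "old data" total shrank sharply between runs. The function name old_data_drops and the default 50 percent threshold are assumptions for illustration, and a drop is only a hint: the data might have been legitimately deleted or archived rather than touched.

```shell
# Compare two saved "agedu -a <age> -t" reports ("<size> <path>" per line)
# and print paths whose size fell by at least <percent> between them.
old_data_drops() {
    local previous="$1" current="$2" percent="${3:-50}"
    awk -v pct="$percent" '
        # First file: remember the size recorded for each path.
        NR == FNR {
            size = $1 + 0; sub(/^[ \t]*[0-9]+[ \t]+/, "")
            before[$0] = size
            next
        }
        # Second file: flag paths that shrank by pct percent or more.
        {
            size = $1 + 0; sub(/^[ \t]*[0-9]+[ \t]+/, "")
            seen[$0] = 1
            if (($0 in before) && before[$0] > 0 &&
                (before[$0] - size) * 100 >= pct * before[$0])
                printf "%s %d -> %d\n", $0, before[$0], size
        }
        # Paths missing from the newer report are treated as dropping to 0.
        END {
            for (path in before)
                if (!(path in seen) && before[path] > 0)
                    printf "%s %d -> 0\n", path, before[path]
        }
    ' "$previous" "$current"
}

# Hypothetical usage with two dated report files:
#   old_data_drops old-data-2024-01-01.txt old-data-2024-01-08.txt 50
```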

Summary – Check Your Space Usage Today!

As pointed out in the introduction, studies have shown that most stored data has not been accessed in a very long time or is not accessed often. Although there are legitimate reasons for keeping data online, at least knowing how much data has not been accessed in quite some time provides evidence for either adding storage or adding archiving capabilities. Knowing how data is used can also help identify the users that are using the most space, so they can be asked to delete or compress data that they are not using or have not touched in some time. (It's fairly convincing to ask the user to compress files they have not used in some time when you can show them how much disk space they are using and how long it's been since they last accessed the files.) This information can also be used to provide justification for more space, or perhaps even more importantly, it can be used to track trends in data usage.

Agedu is a tool that can give you a quick overview of your disk usage as a function of time. The tool is remarkably easy to build and use and has a great deal of flexibility, including the ability to be scripted. As pointed out, the scripting capability can be used in a variety of ways to help administrators. Even for the casual home user, this information can be very useful in understanding why your disks are getting so full, and it can be even more useful when asking the household finance committee to flip for a new 10TB SATA drive (or two).
