Metadata for Your Data

Understanding the proliferation of data in your filesystem is key to being an administrator. Understanding file sizes and file ages and their distribution helps you tune filesystems for performance and develop policies for data management.

If you are reading this article, you likely have a Linux system or a Linux cluster somewhere – or even a *nix system. Let me ask you a few simple questions. What is the largest file and how big is it? Which user has the largest capacity? Which user has the most files? What is the oldest file and how old is it? These are deceptively easy questions to answer, but what if you have 1,000 users and more than 1PB of data? Moreover, the answers constantly change because users are adding, modifying, and deleting data, but understanding – or at the very least, monitoring – your filesystem holistically can provide great benefits.

For example, in a very recent discussion on the Beowulf mailing list, the original poster said that about 40 million out of 50 million files were less than 10KB in size. He also pointed out, however, that about 500 of 700TB was used, with most of the capacity being consumed by some very, very large files. They were speculating whether the large number of small files was causing their filesystem to bottleneck on IOPS. Without knowledge of their file distribution, they would have had a difficult time looking for bottlenecks.

Furthermore, subsequent responses to the thread illustrated that other people create policies to prevent problems with their storage systems because of the large number of small files. For example, one policy from a person posting to the thread insisted that users zip their small files together before archiving them to reduce the load on the archive system.

This simple thread illustrates that having good information about the filesystem can lead to a greater understanding of your storage, which can lead to better policies or improved performance.

Several tools allow you to scan your filesystem and gather information about file metadata. For example, it’s pretty simple to gather information on size, age, and ownership. An example of such a tool is fsstats. It walks a filesystem, starting at a root directory, and produces a summary output that lists a file size histogram, capacity used histograms, directory sizes, file name lengths, link counts (links to files), symlink target links, and time histograms based on files and file size.

In a true devops manner, I decided to write my own tool for gathering filesystem statistics. Doing so gives me control over what metadata I gather and how I present it. In this article, I discuss what metadata I'm gathering, how I'm gathering it, and how I present the data.

fsscan and mdpostp

I created two tools to gather metadata about data and present a statistical analysis of it. I wrote the tools in Python because some very nice libraries and tools exist to gather and display data. Python's os module can easily walk a filesystem and gather virtually all of the same information that the stat command produces. Even better, this module is part of the Python standard library, so it is available wherever Python is installed and can easily form the basis of a tool to walk a filesystem and gather detailed file information.

The first tool, fsscan, gathers the data and writes it to a file. It simply walks the directory tree using the os.walk() library function and gathers the following information using the os.stat() library function:

  • Full file path
  • Size of file in bytes
  • The mtime (modify time)
  • The ctime (change time)
  • The atime (access time)
  • The file owner (uid and gid)

The data is gathered into Python lists and then written to a file using pickle.

The second tool, mdpostp, reads a file that contains a list of pickle files it will use in the statistical analysis. It computes some simple statistics and creates a few plots. Specifically, it reports the following information:

  • Average age of all files
  • Oldest file
  • Youngest file
  • Standard deviation of file ages
  • Top 10 oldest files
  • Top 10 largest files
  • Biggest users of capacity (based on user ID and group ID)
  • List of duplicate files (optional)
  • Histogram of file age
  • Histogram of file sizes

It does this statistical analysis for three timestamps: ctime (change time: the last time the file metadata, such as permissions, was changed), mtime (modify time: the last time the file content was modified), and atime (access time: the last time the file was read). The histograms are created using matplotlib.
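As a rough illustration of the age statistics listed above, the following Python 3 sketch converts epoch timestamps to ages in days and summarizes them. The function name age_summary is made up for this example, and the use of the population standard deviation is an assumption; the article does not say which form mdpostp uses.

```python
import statistics
import time

SECONDS_PER_DAY = 86400.0


def age_summary(timestamps, now=None):
    """Summarize file ages in days for a list of epoch timestamps.

    Hypothetical helper for illustration; not taken from mdpostp.
    """
    if not timestamps:
        raise ValueError("need at least one timestamp")
    if now is None:
        now = time.time()
    # Age of each file in days, measured from 'now'
    ages = [(now - stamp) / SECONDS_PER_DAY for stamp in timestamps]
    return {
        "average": statistics.mean(ages),
        "oldest": max(ages),
        "youngest": min(ages),
        # Assumption: population standard deviation
        "stdev": statistics.pstdev(ages),
    }
```

The same function can be applied to the mtime, ctime, and atime lists in turn.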

fsscan

The intent of this tool is to gather data and write it out in a pickle (serialized object). The current version of fsscan gathers the data in Python lists, then creates a simple dictionary (key-value) and writes that to the pickle file. You can specify the starting directory from which to gather data; if you don't, it defaults to the directory in which you run the code (cwd = current working directory). You can also specify the name of the pickle file with a command-line option; the default is file.pickle. To get the help output for the code, run ./fsscan.py -h. During a scan, the code sends a little output to standard out (stdout), such as the root directory from which it starts and the start time. However, it also prints every directory it processes, which can sometimes lead to a great deal of output:

[laytonjb@home4 HPC_031]$ ./fsscan.py -o home.pickle -d /home
start_time: 1402853413.36
Starting directory (root):  /home
start_time: 1402853413.36
Writing output data (pickle) to file: home.pickle
Starting directory /home/
Starting directory /home/laytonjb/
....

I cut off the output after one directory, because it was fairly long and pointless. Notice that the start time is given in seconds since the epoch. It’s fairly easy to convert this to something useful (this is an exercise left to the reader).
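For readers who would rather not do the conversion by hand, here is a minimal Python 3 sketch using the standard datetime module. The helper name epoch_to_utc_string is made up for this example, and it reports UTC; time.ctime() would give the local time of the machine instead.

```python
from datetime import datetime, timezone


def epoch_to_utc_string(seconds):
    """Format seconds since the epoch as a UTC date and time."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# start_time value copied from the sample fsscan output above
print(epoch_to_utc_string(1402853413.36))  # prints 2014-06-15 17:30:13 UTC
```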

One of the functions in the Python os module is called walk (os.walk). This function allows you to walk a directory tree (i.e., examine the files recursively in a directory tree) and gather information on the directories and the files. A simple example from the Python 2.7.7 documentation has been modified here:

#!/usr/bin/python
 
import os
from os.path import join, getsize
 
for root, dirs, files in os.walk('.'):
    print root, "consumes",
    print sum(getsize(join(root, name)) for name in files),
    print "bytes in", len(files), "non-directory files"

This quick code snippet displays the number of bytes taken by non-directory files in each directory under the starting directory (current working directory). This simple snippet can form the basis of a script that can walk through a directory tree and gather information about the files. [A quick note – this code snippet does not have any exception handling, so it is definitely possible you will encounter exceptions.]
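One way to add the missing exception handling is sketched below in Python 3 (the listings in this article use Python 2 syntax). The function name directory_sizes is hypothetical; the sketch relies on the onerror callback of os.walk for unreadable directories and catches OSError for files that vanish or cannot be examined.

```python
import os


def directory_sizes(top):
    """Yield (directory, total_bytes, file_count) for each directory under top."""

    def report(err):
        # os.walk passes an OSError here when it cannot list a directory
        print("skipping", err.filename, "-", err.strerror)

    for root, dirs, files in os.walk(top, onerror=report):
        total = 0
        counted = 0
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
                counted += 1
            except OSError:
                # File disappeared, is a broken symlink, or is unreadable
                continue
        yield root, total, counted
```

Calling list(directory_sizes('.')) collects the same per-directory totals that the snippet above prints, but it keeps going when a file or directory cannot be examined.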

With the ability to walk a directory tree, you can open the files in the directory and gather statistics on each file. The os module also has a function called os.fstat (the file-descriptor counterpart of os.stat) that can give you most of the information that the stat command produces. Taking the previous example and extending it a bit results in the following:

#!/usr/bin/python
 
import os
from os.path import join, getsize
 
for root, dirs, files in os.walk('.'):
    print root, "consumes",
    print sum(getsize(join(root, name)) for name in files),
    print "bytes in", len(files), "non-directory files"
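    for file in files:
        # Full path: directory currently being walked plus the file name
        fileloc = join(root, file)
        # os.fstat needs an open file descriptor
        FILE = os.open(fileloc, os.O_RDONLY)
        junk = os.fstat(FILE)
        # Positions in the stat result: 4=uid, 5=gid, 6=size, 7=atime, 8=mtime, 9=ctime
        size = junk[6]
        atime = junk[7]
        mtime = junk[8]
        ctime = junk[9]
        uid = junk[4]
        gid = junk[5]
        print "   File: %s size: %s atime: %s mtime: %s ctime: %s" % (file,size,atime,mtime,ctime)
        os.close(FILE)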

In the second for loop, the full path to the file (fileloc) is created from the root of the directory tree (root) and the file name (file). Notice that the os.fstat function returns a tuple-like object of attributes that can be indexed by position. For example, it returns the access time (atime), the modify time (mtime), and the change time (ctime), which are all in seconds since the epoch. Other attributes include the size in bytes (size) and the user ID (uid) and group ID (gid). This snippet keeps the uid and gid as numeric values; in fsscan, they are converted to a string (hopefully a real user name) rather than stored as a numeric value.
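The conversion from numeric IDs to names can be sketched with the standard pwd and grp modules (Unix only). This is a Python 3 illustration, not the code from fsscan; the function names and the fallback to the numeric ID as a string are assumptions.

```python
import grp
import pwd


def uid_to_name(uid, lookup=pwd.getpwuid):
    """Return the user name for uid, or the number as a string if unknown."""
    try:
        return lookup(uid).pw_name
    except KeyError:
        # No passwd entry (e.g., a deleted account); keep the numeric ID
        return str(uid)


def gid_to_name(gid, lookup=grp.getgrgid):
    """Return the group name for gid, or the number as a string if unknown."""
    try:
        return lookup(gid).gr_name
    except KeyError:
        return str(gid)
```

The lookup parameter exists only to make the functions easy to test; in normal use, uid_to_name(st.st_uid) is enough.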

The fsscan tool was derived from this simple example code. It does a great deal more error checking and offers command-line options (processing command lines is not always trivial). The code is not complicated, and if you know a little Python, you can easily modify it to do what you want.

Once it’s done processing, the code creates a dictionary:

    main_data["start_dir"] = start_directory
    main_data["start_time"] = start_time
    main_data["fullpath"] = fullpath_list
    main_data["size_bytes"] = size_bytes_list
    main_data["atime"] = atime_list
    main_data["mtime"] = mtime_list
    main_data["ctime"] = ctime_list
    main_data["uid_name"] = uid_list
    main_data["gid_name"] = gid_list

I hope the variables are self-explanatory, but if not, here’s the magic decoding.

  • start_directory = starting directory
  • start_time = wall clock time when the code started
  • fullpath_list = Python list of the full path to every file and directory
  • size_bytes_list = size of each file in bytes
  • atime_list = atime for each file in seconds since the epoch
  • mtime_list = mtime for each file in seconds since the epoch
  • ctime_list = ctime for each file in seconds since the epoch
  • uid_list = user ID (uid) for each file
  • gid_list = group ID (gid) for each file
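Writing a dictionary like this to a pickle file and reading it back takes only a few lines. The following Python 3 sketch shows one way to do it; the helper names save_scan and load_scan are hypothetical and are not taken from fsscan.

```python
import pickle


def save_scan(main_data, filename):
    """Serialize the scan dictionary to a pickle file."""
    with open(filename, "wb") as handle:
        pickle.dump(main_data, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_scan(filename):
    """Read a scan dictionary back from a pickle file.

    Only load pickle files you created yourself; unpickling
    untrusted data can execute arbitrary code.
    """
    with open(filename, "rb") as handle:
        return pickle.load(handle)
```

Because every list in the dictionary has one entry per file, position i in each list describes the same file.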

mdpostp

The mdpostp tool (metadata post-processing) is used to take the pickle file(s) and do a statistical analysis of the data. The code takes as input a list of pickle files you want to analyze. The command line is pretty simple:

./mdpostp [-help] [-dup|-nodup] input_file

Using -help on the command line displays the help text (it’s not very extensive). Then, you have the option to have the code search for and display duplicate files (-dup) or not search and not display duplicate files (-nodup). The final option on the command line should be the input file that contains a simple list of pickle files you want to analyze, such as these four pickle files that have been created via fsscan.py.

home1.pickle
home2.pickle
home3.pickle
data.pickle

The mdpostp code reads and processes each of the files in turn. At this time, it does not concatenate the data from the pickles. Because of this limitation, you can’t run fsscan.py on various subdirectories and have mdpostp combine the data into a report for the parent directory.

The code seems long, but it’s really simple: It reads the data in the pickle, does a simple analysis, and prints the data to stdout, as well as to an HTML file. The code creates a “report” in a subdirectory named HTML_REPORT. In this directory, you will see a file named report.html. Just open this file in your browser and you should see the HTML report. All of the figures (plots) are stored in the same subdirectory.

The code should be easy to understand, but if it looks confusing, particularly the plotting portion, take a look at the documentation for matplotlib, especially the section about plotting histograms.
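Before anything is plotted, the ages have to be sorted into intervals. The Python 3 sketch below counts ages per interval without matplotlib, using the interval edges that appear in the sample output in Appendix A. The function names are hypothetical, and treating each interval as half-open (lower edge included, upper edge excluded) is an assumption about how mdpostp bins the data.

```python
from bisect import bisect_right

# Interval edges in days, copied from the sample output in Appendix A
AGE_EDGES_DAYS = [0, 1, 2, 4, 7, 14, 28, 56, 112, 168, 252, 365, 504,
                  730, 1095, 1460, 1825, 2190, 2920, 3650, 4380, 5110, 5840]


def interval_counts(ages_days, edges=AGE_EDGES_DAYS):
    """Count how many ages fall into each [edges[i], edges[i+1]) interval."""
    counts = [0] * (len(edges) - 1)
    for age in ages_days:
        if age < edges[0] or age >= edges[-1]:
            continue  # outside the covered range; ignored in this sketch
        counts[bisect_right(edges, age) - 1] += 1
    return counts


def format_summary(counts, edges=AGE_EDGES_DAYS):
    """Return text lines with count, percentage, and cumulative percentage."""
    total = sum(counts)
    lines = []
    running = 0
    for i, count in enumerate(counts):
        running += count
        pct = 100.0 * count / total if total else 0.0
        cum = 100.0 * running / total if total else 0.0
        lines.append("[%4d-%4d days]: %6d  (%6.2f%%)  (%6.2f%% cumulative)"
                     % (edges[i], edges[i + 1], count, pct, cum))
    return lines
```

The counts list can be passed to a bar plot, or the lines from format_summary can be printed directly for a quick text report.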

Appendix A is sample output from stdout. My apologies for the length, but I think it’s useful to see all of the output because it contains some details that are contained in the HTML report. The original scan was run against /home on my workstation.

The corresponding HTML output is shown in Appendix B. It includes basically the same information, but with the plots. It also shows an analysis of the difference between ctime (the last time the file's metadata was changed, such as permissions) and mtime (the last time the file's contents were modified). The reason I’m examining ctime/mtime is because I want to see whether file metadata, but not the file content, is changed frequently. This tells me whether users are doing something like touch or chmod on a file or files.
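The ctime/mtime comparison boils down to a per-file subtraction. Here is a minimal Python 3 sketch, with a hypothetical function name, that assumes the two lists are in the same file order (as they are in the fsscan dictionary):

```python
SECONDS_PER_DAY = 86400.0


def ctime_mtime_report(ctimes, mtimes):
    """Return per-file ctime-mtime differences in days, plus summary counts."""
    diffs_days = [(c - m) / SECONDS_PER_DAY for c, m in zip(ctimes, mtimes)]
    # Files whose metadata changed at a different time than their content
    nonzero = sum(1 for diff in diffs_days if diff != 0)
    percent = 100.0 * nonzero / len(diffs_days) if diffs_days else 0.0
    return diffs_days, nonzero, percent
```

A large difference for a file suggests that its metadata was touched long after its content was last written.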

Usage Recommendations

You can use the codes however you like (they are GPLv3), but I wanted to explain how I like to use them. The first thing I like to do is run fsscan against user home directories periodically. I also will run it against any other directory trees to which users have access – perhaps something like a WORK or SCRATCH directory tree. When I run it, I include the date of the run in the name of the pickle, so I can keep track of changes to the files over time. It might be useful to write a tool that reads multiple pickle files and looks for differences, such as how many files have been deleted, how many files have been created, how many files have had their metadata changed, how many files have had their content changed, and so on.
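A comparison tool of that kind could start from something like the following Python 3 sketch. It is not part of fsscan or mdpostp; the function name compare_scans and the rules used to classify changes (size or mtime differs means content changed, only ctime differs means metadata changed) are assumptions for illustration. The inputs are two dictionaries with the keys that fsscan writes.

```python
def compare_scans(old, new):
    """Classify files as created, deleted, content-changed, or metadata-changed."""

    def index(scan):
        # Map each path to its (size, mtime, ctime) triple
        return {path: (size, mtime, ctime)
                for path, size, mtime, ctime in zip(scan["fullpath"],
                                                    scan["size_bytes"],
                                                    scan["mtime"],
                                                    scan["ctime"])}

    before, after = index(old), index(new)
    content_changed = []
    metadata_changed = []
    for path in sorted(set(before) & set(after)):
        old_size, old_mtime, old_ctime = before[path]
        new_size, new_mtime, new_ctime = after[path]
        if new_size != old_size or new_mtime != old_mtime:
            content_changed.append(path)
        elif new_ctime != old_ctime:
            metadata_changed.append(path)
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "content_changed": content_changed,
        "metadata_changed": metadata_changed,
    }
```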

Then I run mdpostp against newly created pickle files to get a quick snapshot of the state of the filesystem. I like to read the standard output first to get a feel for the data; then, I pull up the HTML output to get a better view of the state of the filesystem.

Given enough time between filesystem scans, I like to compare the HTML output of the various subdirectories, such as /home. This tells me a little about how the filesystem is “evolving.”

If your filesystem is large and you don’t want to use too much memory gathering the raw data (recall that fsscan puts everything in memory before writing the pickle file), another good practice is always to run fsscan on different subdirectories at about the same time. This works well in parallel filesystems because you can run the code on different clients or storage servers at the same time on different parts of the directories. However, to get a whole view of the entire data set, you need to write simple code that combines the different pickle files into a single pickle file, which should be very simple to write if you are interested.
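A minimal Python 3 sketch of such a combining step follows. It assumes each pickle holds a dictionary with the keys shown earlier for fsscan; the function names, the choice to keep the earliest start_time, and the caller-supplied start_dir are assumptions made for this example.

```python
import pickle

# Keys whose values are per-file lists in the fsscan dictionary
LIST_KEYS = ("fullpath", "size_bytes", "atime", "mtime", "ctime",
             "uid_name", "gid_name")


def merge_scans(scans, start_dir):
    """Combine several scan dictionaries into one."""
    merged = {key: [] for key in LIST_KEYS}
    merged["start_dir"] = start_dir
    # Assumption: report the earliest start time of the combined scans
    merged["start_time"] = min(scan["start_time"] for scan in scans)
    for scan in scans:
        for key in LIST_KEYS:
            merged[key].extend(scan[key])
    return merged


def merge_pickle_files(filenames, output, start_dir):
    """Load each pickle, merge the dictionaries, and write a new pickle."""
    scans = []
    for name in filenames:
        with open(name, "rb") as handle:
            scans.append(pickle.load(handle))
    with open(output, "wb") as handle:
        pickle.dump(merge_scans(scans, start_dir), handle)
```

Note that the merged dictionary still has to fit in memory, so this helps with scan time more than with the memory needed for post-processing.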

Summary

A good administrator has an understanding of the “status” of their filesystem. What are the really old files (using atime, ctime, mtime, or some combination)? Who is using the most data? Who has the largest number of files? What is the average file size? What is the standard deviation for file size? What does a histogram of file size look like? All of these simple questions need to be answered, because the answers allow you to watch the pulse of the filesystem and understand how it’s changing and how you can tune the storage to run better or develop better policies.

Some code out there allows you to walk a file tree and gather all of this data on files and directories. However, I wanted some flexibility, so I wrote two tools to perform this task. The code is in Python because Python has easy-to-use libraries for gathering and plotting data. The first tool, fsscan, gathers the data and writes it to a Python pickle file (key-value). The second tool, mdpostp, reads the pickle file and does some simple statistical analysis, creating both a report to standard output (stdout) and an HTML report.

Running the code in this article (or similar code) is very easy and yields a great deal of information. At the very least, running this code against user directories will be very enlightening. Don’t be afraid; give it a try.

Appendix A: mdpostp Standard Output

[laytonjb@home4 HPC_031]$ ./mdpostp1.py -nodup files.in
filename =  home.pickle
start_time: 1402854660.07
html_filename =  ./HTML_REPORT/report.html
 
**************
*** Pickle file:  home.pickle
*** Start_dir:  /home  Scan date: Sun Jun 15 13:30:13 2014
**************
 
   Number of lines in scan =  388083
 
 
   ==============
   Mtime results:
   ==============
   Average mtime age in days:  907.036  days
   Oldest mtime age file in days:  5,401.445  days
   Youngest mtime age file in days:  0.016  days
   Standard deviation mtime age in days:  590.7352  days
 
   *** Mtime interval summary 
   [   0-   1 days]:    176  (  0.05%)  (  0.05% cumulative)
   [   1-   2 days]:      0  (  0.00%)  (  0.05% cumulative)
   [   2-   4 days]:   1091  (  0.28%)  (  0.33% cumulative)
   [   4-   7 days]:    217  (  0.06%)  (  0.38% cumulative)
   [   7-  14 days]:     75  (  0.02%)  (  0.40% cumulative)
   [  14-  28 days]:   7379  (  1.90%)  (  2.30% cumulative)
   [  28-  56 days]:  10655  (  2.75%)  (  5.05% cumulative)
   [  56- 112 days]:  12079  (  3.11%)  (  8.16% cumulative)
   [ 112- 168 days]:  27551  (  7.10%)  ( 15.26% cumulative)
   [ 168- 252 days]:   9820  (  2.53%)  ( 17.79% cumulative)
   [ 252- 365 days]:  79030  ( 20.36%)  ( 38.15% cumulative)
   [ 365- 504 days]:   3717  (  0.96%)  ( 39.11% cumulative)
   [ 504- 730 days]:   5438  (  1.40%)  ( 40.51% cumulative)
   [ 730-1095 days]:  39339  ( 10.14%)  ( 50.65% cumulative)
   [1095-1460 days]: 190052  ( 48.97%)  ( 99.62% cumulative)
   [1460-1825 days]:    602  (  0.16%)  ( 99.78% cumulative)
   [1825-2190 days]:    142  (  0.04%)  ( 99.81% cumulative)
   [2190-2920 days]:    392  (  0.10%)  ( 99.92% cumulative)
   [2920-3650 days]:    160  (  0.04%)  ( 99.96% cumulative)
   [3650-4380 days]:     41  (  0.01%)  ( 99.97% cumulative)
   [4380-5110 days]:     91  (  0.02%)  ( 99.99% cumulative)
   [5110-5840 days]:     36  (  0.01%)  (100.00% cumulative)
 
   Top  10  oldest files (mtime - modify time)
   ---------------------------------------------
    Rank  File                                                     Mtime Age (Days)
    #1    /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png                 5,401.444
    #2    /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png             5,401.444
    #3    /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png            5,401.444
    #4    /home/laytonjb/.gkrellm2/themes/brushed/d                       5,401.444
    #5    /home/laytonjb/.gkrellm2/themes/brushed/gismrc~                 5,401.444
    #6    /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc               5,234.944
    #7    /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png           5,233.824
    #8    /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png      5,233.824
    #9    /home/laytonjb/.gkrellm2/themes/x17/frame_left.png              5,233.824
    #10   /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png            5,233.824
 
   ==============
   Ctime results:
   ==============
   Average ctime age in days:  205.772  days
   Oldest ctime age file in days:  248.086  days
   Youngest ctime age file in days:  0.016  days
   Standard deviation ctime age in days:  69.8168  days
 
   [   0-   1 days]:    180  (  0.05%)  (  0.05% cumulative)
   [   1-   2 days]:      0  (  0.00%)  (  0.05% cumulative)
   [   2-   4 days]:   4675  (  1.20%)  (  1.25% cumulative)
   [   4-   7 days]:    215  (  0.06%)  (  1.31% cumulative)
   [   7-  14 days]:     75  (  0.02%)  (  1.33% cumulative)
   [  14-  28 days]:  15845  (  4.08%)  (  5.41% cumulative)
   [  28-  56 days]:   4418  (  1.14%)  (  6.55% cumulative)
   [  56- 112 days]:  20402  (  5.26%)  ( 11.80% cumulative)
   [ 112- 168 days]:  72768  ( 18.75%)  ( 30.55% cumulative)
   [ 168- 252 days]: 269505  ( 69.45%)  (100.00% cumulative)
   [ 252- 365 days]:      0  (  0.00%)  (100.00% cumulative)
   [ 365- 504 days]:      0  (  0.00%)  (100.00% cumulative)
   [ 504- 730 days]:      0  (  0.00%)  (100.00% cumulative)
   [ 730-1095 days]:      0  (  0.00%)  (100.00% cumulative)
   [1095-1460 days]:      0  (  0.00%)  (100.00% cumulative)
   [1460-1825 days]:      0  (  0.00%)  (100.00% cumulative)
   [1825-2190 days]:      0  (  0.00%)  (100.00% cumulative)
   [2190-2920 days]:      0  (  0.00%)  (100.00% cumulative)
   [2920-3650 days]:      0  (  0.00%)  (100.00% cumulative)
   [3650-4380 days]:      0  (  0.00%)  (100.00% cumulative)
   [4380-5110 days]:      0  (  0.00%)  (100.00% cumulative)
   [5110-5840 days]:      0  (  0.00%)  (100.00% cumulative)
 
Top  10  oldest files (ctime - change time)
-------------------------------------------
Rank  File                                                                Ctime Age (Days)
#1    /home/laytonjb/.gconf/apps/nm-applet/%gconf.xml                              248.085
#2    /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/prefs/%gconf.xml 248.085
#3    /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/%gconf.xml       248.085
#4    /home/laytonjb/.gconf/apps/panel/applets/clock/prefs/%gconf.xml              248.085
#5    /home/laytonjb/.gconf/apps/panel/applets/clock/%gconf.xml                    248.085
#6    /home/laytonjb/.gconf/apps/panel/applets/window_list/prefs/%gconf.xml        248.085
#7    /home/laytonjb/.gconf/apps/panel/applets/window_list/%gconf.xml              248.085
#8    /home/laytonjb/.gconf/apps/panel/applets/%gconf.xml                          248.085
#9    /home/laytonjb/.gconf/apps/panel/%gconf.xml                                  248.085
#10   /home/laytonjb/.gconf/apps/gnote/%gconf.xml                                  248.085
 
   ===============================
   Ctime-Mtime Difference results:
   ===============================
   Number of non-zero difference files: 334,953 of 388,083 files: (86.31%)
   Average ctime-mtime age in days:  700.000  days
   Oldest ctime-mtime age file in days:  5,153.000  days
   Youngest ctime-mtime age file in days:  0.000  days
   Standard deviation ctime-mtime age in days:  543.8009  days
 
   [   0-   1 days]:  53756  ( 13.85%)  ( 13.85% cumulative)
   [   1-   2 days]:     73  (  0.02%)  ( 13.87% cumulative)
   [   2-   4 days]:    121  (  0.03%)  ( 13.90% cumulative)
   [   4-   7 days]:    168  (  0.04%)  ( 13.94% cumulative)
   [   7-  14 days]:   1111  (  0.29%)  ( 14.23% cumulative)
   [  14-  28 days]:   8768  (  2.26%)  ( 16.49% cumulative)
   [  28-  56 days]:  25377  (  6.54%)  ( 23.03% cumulative)
   [  56- 112 days]:   8101  (  2.09%)  ( 25.12% cumulative)
   [ 112- 168 days]:  47729  ( 12.30%)  ( 37.42% cumulative)
   [ 168- 252 days]:   3578  (  0.92%)  ( 38.34% cumulative)
   [ 252- 365 days]:   4970  (  1.28%)  ( 39.62% cumulative)
   [ 365- 504 days]:   2324  (  0.60%)  ( 40.22% cumulative)
   [ 504- 730 days]:   3270  (  0.84%)  ( 41.06% cumulative)
   [ 730-1095 days]:  37909  (  9.77%)  ( 50.83% cumulative)
   [1095-1460 days]: 189729  ( 48.89%)  ( 99.72% cumulative)
   [1460-1825 days]:    254  (  0.07%)  ( 99.78% cumulative)
   [1825-2190 days]:    255  (  0.07%)  ( 99.85% cumulative)
   [2190-2920 days]:    324  (  0.08%)  ( 99.93% cumulative)
   [2920-3650 days]:    136  (  0.04%)  ( 99.97% cumulative)
   [3650-4380 days]:      3  (  0.00%)  ( 99.97% cumulative)
   [4380-5110 days]:    122  (  0.03%)  (100.00% cumulative)
   [5110-5840 days]:      5  (  0.00%)  (100.00% cumulative)
 
   Top  10  oldest files (ctime-time)
   -------------------------------------------
    Rank  File                                                    Ctime-Mtime diff (Days)
    #1    /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png                       5,153.421
    #2    /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png                   5,153.420
    #3    /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png                  5,153.420
    #4    /home/laytonjb/.gkrellm2/themes/brushed/d                             5,153.420
    #5    /home/laytonjb/.gkrellm2/themes/brushed/gismrc~                       5,153.420
    #6    /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc                     4,986.920
    #7    /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png                 4,985.801
    #8    /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png            4,985.801
    #9    /home/laytonjb/.gkrellm2/themes/x17/frame_left.png                    4,985.801
    #10   /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png                  4,985.801
 
   ==============
   Atime results:
   ==============
   Average atime age in days:  127.981  days
   Oldest atime age file in days:  2,029.980  days
   Youngest atime age file in days:  0.015  days
   Standard deviation atime age in days:  98.5828  days
 
   [   0-   1 days]:    436  (  0.11%)  (  0.11% cumulative)
   [   1-   2 days]:      0  (  0.00%)  (  0.11% cumulative)
   [   2-   4 days]:   4838  (  1.25%)  (  1.36% cumulative)
   [   4-   7 days]:    238  (  0.06%)  (  1.42% cumulative)
   [   7-  14 days]:     60  (  0.02%)  (  1.44% cumulative)
   [  14-  28 days]:  16110  (  4.15%)  (  5.59% cumulative)
   [  28-  56 days]:   7985  (  2.06%)  (  7.64% cumulative)
   [  56- 112 days]:  19803  (  5.10%)  ( 12.75% cumulative)
   [ 112- 168 days]: 334034  ( 86.07%)  ( 98.82% cumulative)
   [ 168- 252 days]:   1288  (  0.33%)  ( 99.15% cumulative)
   [ 252- 365 days]:    425  (  0.11%)  ( 99.26% cumulative)
   [ 365- 504 days]:    923  (  0.24%)  ( 99.50% cumulative)
   [ 504- 730 days]:     25  (  0.01%)  ( 99.51% cumulative)
   [ 730-1095 days]:     78  (  0.02%)  ( 99.53% cumulative)
   [1095-1460 days]:   1831  (  0.47%)  (100.00% cumulative)
   [1460-1825 days]:      0  (  0.00%)  (100.00% cumulative)
   [1825-2190 days]:      9  (  0.00%)  (100.00% cumulative)
   [2190-2920 days]:      0  (  0.00%)  (100.00% cumulative)
   [2920-3650 days]:      0  (  0.00%)  (100.00% cumulative)
   [3650-4380 days]:      0  (  0.00%)  (100.00% cumulative)
   [4380-5110 days]:      0  (  0.00%)  (100.00% cumulative)
   [5110-5840 days]:      0  (  0.00%)  (100.00% cumulative)
 
Top  10  oldest files (atime - access time)
-------------------------------------------
Rank  File                                                                                            Atime Age (Days)
#1    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/script/.exists                        2,029.980
#2    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man3/.exists                          2,029.980
#3    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/auto/DBIx/SimplePerl/.exists      2,029.980
#4    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/DBIx/.exists                      2,029.980
#5    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/bin/.exists                           2,029.980
#6    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/auto/DBIx/SimplePerl/.exists     2,029.980
#7    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/.exists                          2,029.980
#8    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man1/.exists                          2,029.980
#9    /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/pm_to_blib                                 2,029.980
#10   /home/laytonjb/src/CFD/CFD_2/USM3D/USM3D_DATA/RAMP/ramp1.m2                                            1,440.808
 
   ==================
   File Size results:
   ==================
   Average file size in KB:  909.000  KB
   Largest file in KB:  19,804,344.000  KB
   Smallest file size in KB:  0.000  KB
   Standard deviation file size in KB:  49,832.8061  KB
 
   *** File Size Intervals (KB):
   [      0-      1 KB]: 156455  ( 40.31%)  ( 40.31% cumulative)
   [      1-      2 KB]:  38278  (  9.86%)  ( 50.18% cumulative)
   [      2-      4 KB]:  30822  (  7.94%)  ( 58.12% cumulative)
   [      4-      8 KB]:  43118  ( 11.11%)  ( 69.23% cumulative)
   [      8-     16 KB]:  24296  (  6.26%)  ( 75.49% cumulative)
   [     16-     32 KB]:  33455  (  8.62%)  ( 84.11% cumulative)
   [     32-     64 KB]:  13502  (  3.48%)  ( 87.59% cumulative)
   [     64-    128 KB]:  12083  (  3.11%)  ( 90.70% cumulative)
   [    128-    256 KB]:   8623  (  2.22%)  ( 92.93% cumulative)
   [    256-    512 KB]:  13437  (  3.46%)  ( 96.39% cumulative)
   [    512-   1024 KB]:   5456  (  1.41%)  ( 97.79% cumulative)
   [   1024-   2048 KB]:   2687  (  0.69%)  ( 98.49% cumulative)
   [   2048-   4096 KB]:   2497  (  0.64%)  ( 99.13% cumulative)
   [   4096-   8192 KB]:   1361  (  0.35%)  ( 99.48% cumulative)
   [   8192-  16384 KB]:    949  (  0.24%)  ( 99.73% cumulative)
   [  16384-  32768 KB]:    373  (  0.10%)  ( 99.82% cumulative)
   [  32768-  65536 KB]:    246  (  0.06%)  ( 99.89% cumulative)
   [  65536- 131072 KB]:    179  (  0.05%)  ( 99.93% cumulative)
   [ 131072- 262144 KB]:     76  (  0.02%)  ( 99.95% cumulative)
   [ 262144- 524288 KB]:     36  (  0.01%)  ( 99.96% cumulative)
   [ 524288-1048576 KB]:    154  (  0.04%)  (100.00% cumulative)
 
Top 10 largest files 
=====================
Rank  File                    Size (KB)
#1    /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/cesm-strace/strace.janus017.tar                                  19,804,344
#2    /home/laytonjb/src/CFD/CFD_2/cfdpp.tar.gz                                                                          8,473,315
#3    /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out     6,155,298
#4    /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out           6,155,298
#5    /home/laytonjb/src/CFD/CFD_2/overflow.tar.gz                                                                           5,863,543
#6    /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out  5,748,509
#7    /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out        5,748,509
#8    /home/laytonjb/src/CFD/CFD_2/boeing_app_tuned.tar.gz                                                               4,240,130
#9    /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out    4,225,463
#10   /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out          4,225,463
 
 
 
   Top 10 biggest users 
   =====================
    Rank  User       Total Size (KB)     % of Total
    #1    laytonjb   353,100,132           99.9998%
    #2    root       645                    0.0002%
    #3    susy       0                      0.0000%
 
 
 
   Top 10 biggest group users 
   ===========================
    Rank  Group      Total Size (KB)     % of Total
    #1    laytonjb   353,100,132           99.9998%
    #2    root       645                    0.0002%
    #3    susy       0                      0.0000%

Appendix B: HTML Report Output

Metadata Report

A “highlight” analysis of each pickle file presents the top 10 oldest files, largest files, users with the most data, and so on. Number of files in scan = 388,083.

Mtime Age Statistics

The following statistics are for the mtime age of files in the pickle. Mtime is the modify time of the file. It will change only if the actual data is changed, but not if the metadata alone is changed.

  • Average mtime age: 907.327 days
  • Oldest mtime age: 5,401.736 days
  • Youngest mtime age: 0.308 days
  • Standard deviation mtime age: 590.7352 days

Table 1 lists the mtime age intervals in days. The age is based on the mtime of the files when scanned and the current time.

Table 1: mtime Age Intervals

Interval (days)  No. of files  % of Total  Cumulative %
0–1                       151        0.04          0.04
1–2                         0        0.00          0.04
2–4                       215        0.06          0.09
4–7                       111        0.29          0.38
7–14                       75        0.02          0.40
14–28                   7,376        1.90          2.30
28–56                  10,658        2.75          5.05
56–112                 12,079        3.11          8.16
112–168                27,551        7.10         15.26
168–252                 9,819        2.53         17.79
252–365                79,031       20.36         38.15
365–504                 3,717        0.96         39.11
504–730                 5,421        1.40         40.51
730–1,095              39,356       10.14         50.65
1,095–1,460           190,014       48.96         99.61
1,460–1,825               640        0.16         99.78
1,825–2,190               142        0.04         99.81
2,190–2,920               392        0.10         99.92
2,920–3,650               160        0.04         99.96
3,650–4,380                41        0.01         99.97
4,380–5,110                91        0.02         99.99
5,110–5,840                36        0.01        100.00

Table 2 lists the top 10 oldest files based on mtime (modify time) when the files were scanned in the original pickle file.

Table 2: Top 10 Oldest Files Based on mtime

Rank File Mtime Age
1 /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png 5,401.736
2 /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png 5,401.736
3 /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png 5,401.736
4 /home/laytonjb/.gkrellm2/themes/brushed/d 5,401.736
5 /home/laytonjb/.gkrellm2/themes/brushed/gismrc~ 5,401.736
6 /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc 5,235.235
7 /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png 5,234.116
8 /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png 5,234.116
9 /home/laytonjb/.gkrellm2/themes/x17/frame_left.png 5,234.116
10 /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png 5,234.116
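
A "top 10" list such as this does not require sorting every record. The sketch below uses `heapq.nlargest` from the Python standard library, assuming the scan produced `(path, age_in_days)` pairs. The function name `top_oldest` and the record layout are hypothetical.

```python
import heapq

def top_oldest(records, n=10):
    """Return the n records with the largest age, oldest first.

    records is an iterable of (path, age_in_days) pairs.
    """
    return heapq.nlargest(n, records, key=lambda rec: rec[1])
```

The same call with file sizes in place of ages would yield a list of the largest files.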

Figure 1 is a histogram of the mtime (modify time) age of files in the pickle.

Figure 1: Mtime histogram.

Ctime Age Statistics

The next statistics are for the ctime age of the files in the pickle. Ctime is the change time of the file. It changes if the actual data is changed or if the metadata is changed, such as the ownership or permissions on the file.

  • Average ctime age: 206.064 days
  • Oldest ctime age: 248.377 days
  • Youngest ctime age: 0.308 days
  • Standard deviation ctime age: 69.8168 days
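
The difference between the two timestamps can be observed directly. The following sketch, which assumes a POSIX system (on Windows, `st_ctime` has historically reported creation time instead), creates a temporary file, changes only its permissions, and returns the `stat` results from before and after. The function name `stat_around_chmod` is hypothetical.

```python
import os
import tempfile
import time

def stat_around_chmod():
    """Return (before, after) stat results around a permissions change."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"data")
        path = tmp.name
    try:
        before = os.stat(path)
        time.sleep(0.05)        # let the filesystem clock advance
        os.chmod(path, 0o644)   # metadata-only change
        after = os.stat(path)
    finally:
        os.unlink(path)         # clean up the temporary file
    return before, after
```

Comparing `before.st_mtime_ns` with `after.st_mtime_ns` should show no change, whereas `after.st_ctime_ns` should be later than `before.st_ctime_ns` on filesystems with sufficiently fine timestamp granularity.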

Table 3 lists the ctime age intervals in days. The age is based on the ctime of the files when scanned and the current time.

Table 3: Ctime Age Intervals

Interval (days) No. of files % of Total Cumulative %
0–1 155 0.04 0.04
1–2 0 0.00 0.04
2–4 259 0.07 0.11
4–7 4,656 1.20 1.31
7–14 75 0.02 1.33
14–28 15,842 4.08 5.41
28–56 4,421 1.14 6.55
56–112 20,402 5.26 11.80
112–168 72,768 18.75 30.55
168–252 269,505 69.45 100.00
252–365 0 0.00 100.00
365–504 0 0.00 100.00
504–730 0 0.00 100.00
730–1,095 0 0.00 100.00
1,095–1,460 0 0.00 100.00
1,460–1,825 0 0.00 100.00
1,825–2,190 0 0.00 100.00
2,190–2,920 0 0.00 100.00
2,920–3,650 0 0.00 100.00
3,650–4,380 0 0.00 100.00
4,380–5,110 0 0.00 100.00
5,110–5,840 0 0.00 100.00

Table 4 lists the top 10 oldest files based on ctime (change time) when the files were scanned in the original pickle file.

Table 4: Top 10 Oldest Files Based on ctime

Rank File Ctime Age
1 /home/laytonjb/.gconf/apps/nm-applet/%gconf.xml 248.377
2 /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/prefs/%gconf.xml 248.377
3 /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/%gconf.xml 248.377
4 /home/laytonjb/.gconf/apps/panel/applets/clock/prefs/%gconf.xml 248.377
5 /home/laytonjb/.gconf/apps/panel/applets/clock/%gconf.xml 248.377
6 /home/laytonjb/.gconf/apps/panel/applets/window_list/prefs/%gconf.xml 248.377
7 /home/laytonjb/.gconf/apps/panel/applets/window_list/%gconf.xml 248.377
8 /home/laytonjb/.gconf/apps/panel/applets/%gconf.xml 248.377
9 /home/laytonjb/.gconf/apps/panel/%gconf.xml 248.377
10 /home/laytonjb/.gconf/apps/gnote/%gconf.xml 248.377

Figure 2 is a histogram of the ctime (change time) age of the files in the pickle.

Figure 2: Ctime histogram.

ctime-mtime Difference Statistics

The next set of statistics is for the difference between ctime and mtime, which can tell you how metadata changes (ctime) compare with data changes (mtime). In the following analysis, the difference between the two (ctime/mtime) is used.

  • Average ctime/mtime: 700.000 days
  • Oldest ctime/mtime file: 5,153.000 days
  • Youngest ctime/mtime file: 0.000 days
  • Standard deviation ctime/mtime: 543.8009 days
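
A minimal sketch of the per-file calculation follows, under the assumption that the difference is the ctime timestamp minus the mtime timestamp, expressed in days. The function name `ctime_mtime_diff_days` is hypothetical, and the report's exact definition may differ.

```python
import os

SECONDS_PER_DAY = 86400.0

def ctime_mtime_diff_days(path):
    """Days between the last metadata change and the last data change.

    The result is signed: it is negative if mtime was set to a time
    later than ctime (for example, an mtime in the future).
    """
    st = os.lstat(path)  # lstat: do not follow symlinks
    return (st.st_ctime - st.st_mtime) / SECONDS_PER_DAY
```

A large value indicates a file whose data has not changed for a long time but whose metadata was touched more recently, for example by a copy that preserved mtime or by a change of ownership.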

Table 5: ctime/mtime Age Intervals

Interval (days) No. of files % of Total Cumulative %
0–1 53,756 13.85 13.85
1–2 73 0.02 13.87
2–4 121 0.03 13.90
4–7 168 0.04 13.94
7–14 1,111 0.29 14.23
14–28 8,768 2.26 16.49
28–56 25,377 6.54 23.03
56–112 8,101 2.09 25.12
112–168 47,729 12.30 37.42
168–252 3,578 0.92 38.34
252–365 4,970 1.28 39.62
365–504 2,324 0.60 40.22
504–730 3,270 0.84 41.06
730–1,095 37,909 9.77 50.83
1,095–1,460 189,729 48.89 99.72
1,460–1,825 254 0.07 99.78
1,825–2,190 255 0.07 99.85
2,190–2,920 324 0.08 99.93
2,920–3,650 136 0.04 99.97
3,650–4,380 3 0.00 99.97
4,380–5,110 122 0.03 100.00
5,110–5,840 5 0.00 100.00

Table 6 lists the top 10 files with the largest ctime/mtime differences.

Table 6: Top 10 Files with the Largest ctime/mtime Difference

Rank File ctime/mtime Difference (days)
1 /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png 5,153.421
2 /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png 5,153.420
3 /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png 5,153.420
4 /home/laytonjb/.gkrellm2/themes/brushed/d 5,153.420
5 /home/laytonjb/.gkrellm2/themes/brushed/gismrc~ 5,153.420
6 /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc 4,986.920
7 /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png 4,985.801
8 /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png 4,985.801
9 /home/laytonjb/.gkrellm2/themes/x17/frame_left.png 4,985.801
10 /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png 4,985.801

Figure 3 is a histogram of the ctime/mtime differences of the files in the pickle.

Figure 3: Ctime/mtime histogram.

Atime Age Statistics

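The following statistics are for the atime age of the files in the pickle. Atime is the access time of the file. It is updated when the file's data is read, although mount options such as noatime or relatime can suppress or limit these updates.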

  • Average atime age: 128.273 days
  • Oldest atime age: 2,030.272 days
  • Youngest atime age: 0.307 days
  • Standard deviation atime age: 98.5828 days

Table 7 lists the atime age intervals. The age is based on the atime of the files when scanned and the current time.

Table 7: Atime Age Intervals

Interval (days) No. of Files % of Total Cumulative %
0–1 406 0.10 0.10
1–2 0 0.00 0.10
2–4 396 0.10 0.21
4–7 4,710 1.21 1.42
7–14 60 0.02 1.44
14–28 16,108 4.15 5.59
28–56 7,987 2.06 7.64
56–112 19,803 5.10 12.75
112–168 334,034 86.07 98.82
168–252 1,288 0.33 99.15
252–365 425 0.11 99.26
365–504 923 0.24 99.50
504–730 25 0.01 99.51
730–1,095 78 0.02 99.53
1,095–1,460 1,831 0.47 100.00
1,460–1,825 0 0.00 100.00
1,825–2,190 9 0.00 100.00
2,190–2,920 0 0.00 100.00
2,920–3,650 0 0.00 100.00
3,650–4,380 0 0.00 100.00
4,380–5,110 0 0.00 100.00
5,110–5,840 0 0.00 100.00

Table 8 lists the top 10 files based on atime (access time) when the files were scanned in the original pickle file.

Table 8: Top 10 Oldest Files Based on atime

Rank File atime Age
1 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/script/.exists 2,030.271
2 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man3/.exists 2,030.271
3 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/auto/DBIx/SimplePerl/.exists 2,030.271
4 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/DBIx/.exists 2,030.271
5 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/bin/.exists 2,030.271
6 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/auto/DBIx/SimplePerl/.exists 2,030.271
7 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/.exists 2,030.271
8 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man1/.exists 2,030.271
9 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/pm_to_blib 2,030.271
10 /home/laytonjb/src/CFD/CFD_2/USM3D/USM3D_DATA/RAMP/ramp1.m2 1,441.099

Figure 4 is a histogram of the atime (access time) age of the files in the pickle.

Figure 4: Atime histogram.

File Size Statistics

The statistics for the file sizes of the files in the pickle are:

  • Average file size: 909.000KB
  • Largest file: 19,804,344.000KB
  • Smallest file: 0.000KB
  • Standard deviation file size: 49,832.8061KB

Table 9: File Size Intervals

Interval (KB) No. of Files % of Total Cumulative %
0–1 156,455 40.31 40.31
1–2 38,278 9.86 50.18
2–4 30,822 7.94 58.12
4–8 43,118 11.11 69.23
8–16 24,296 6.26 75.49
16–32 33,455 8.62 84.11
32–64 13,502 3.48 87.59
64–128 12,083 3.11 90.70
128–256 8,623 2.22 92.93
256–512 13,437 3.46 96.39
512–1,024 5,456 1.41 97.79
1,024–2,048 2,687 0.69 98.49
2,048–4,096 2,497 0.64 99.13
4,096–8,192 1,361 0.35 99.48
8,192–16,384 949 0.24 99.73
16,384–32,768 373 0.10 99.82
32,768–65,536 246 0.06 99.89
65,536–131,072 179 0.05 99.93
131,072–262,144 76 0.02 99.95
262,144–524,288 36 0.01 99.96
524,288–1,048,576 154 0.04 100.00

Table 10: Top 10 Largest Files

Rank File Size (KB)
1 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/cesm-strace/strace.janus017.tar 19,804,344
2 /home/laytonjb/src/CFD/CFD_2/cfdpp.tar.gz 8,473,315
3 /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out 6,155,298
4 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out 6,155,298
5 /home/laytonjb/src/CFD/CFD_2/overflow.tar.gz 5,863,543
6 /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out 5,748,509
7 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out 5,748,509
8 /home/laytonjb/src/CFD/CFD_2/boeing_app_tuned.tar.gz 4,240,130
9 /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out 4,225,463
10 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out 4,225,463

Figure 5: File size (KB) histogram.
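
Because every size interval in Table 9 after the first doubles in width, the interval for a file can be found with integer arithmetic instead of a search. The sketch below assumes sizes in bytes, 1KB = 1,024 bytes, and that files larger than the last interval are counted in it. The last assumption is a guess based on Table 10, which lists files far larger than 1,048,576KB. The function names are hypothetical.

```python
LAST_BIN = 20  # index of the 524,288–1,048,576KB interval in Table 9

def size_bin_index(size_bytes):
    """Index of the Table 9 interval that holds a file of size_bytes.

    Bin 0 is 0–1KB, bin 1 is 1–2KB, bin 2 is 2–4KB, and so on.
    """
    whole_kb = size_bytes >> 10                 # integer KB, rounded down
    return min(whole_kb.bit_length(), LAST_BIN)  # clamp oversized files

def size_bin_bounds(index):
    """Lower and upper bound in KB of the interval with the given index."""
    low = 0 if index == 0 else 2 ** (index - 1)
    return low, 2 ** index
```

For example, a 3,000-byte file has `whole_kb` equal to 2, whose bit length is 2, so it lands in the 2–4KB interval.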

Biggest Users

Table 11 shows the top 10 biggest users.

Table 11: Top 10 Biggest Users

Rank User Total Size (KB) % of Total
1 laytonjb 353,100,132 99.9998
2 root 645 0.0002
3 susy 0 0.0000

Biggest Group Users

Table 12 shows the top 10 largest group users.

Table 12: Top 10 Largest Group Users

Rank Group Total Size (KB) % of Total
1 laytonjb 353,100,132 99.9998
2 root 645 0.0002
3 susy 0 0.0000
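
Per-user and per-group totals like those in Tables 11 and 12 could be gathered along the following lines. This is a sketch for a Unix-like system, not the report's actual code: the function names are hypothetical, sizes are summed in bytes instead of KB, and the `pwd` and `grp` modules are used to turn numeric IDs into names.

```python
import grp
import os
import pwd
from collections import Counter

def usage_by_owner(root):
    """Sum file sizes in bytes per owning UID and per owning GID under root."""
    by_uid, by_gid = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            by_uid[st.st_uid] += st.st_size
            by_gid[st.st_gid] += st.st_size
    return by_uid, by_gid

def ranked(totals, n=10):
    """Return (id, total_bytes, percent_of_total) rows, largest first."""
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero
    return [(key, size, 100.0 * size / grand_total)
            for key, size in totals.most_common(n)]

def user_name(uid):
    """Look up a user name, falling back to the numeric ID."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)

def group_name(gid):
    """Look up a group name, falling back to the numeric ID."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)
```

Passing either counter to `ranked` gives the rows of a "top 10" table. Counting files per owner instead of bytes would only require adding 1 instead of `st.st_size`.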
