Metadata for Your Data
Understanding the proliferation of data in your filesystem is key to being an administrator. Understanding file sizes and file ages and their distribution helps you tune filesystems for performance and develop policies for data management.
If you are reading this article, you likely have a Linux system or a Linux cluster somewhere – or even a *nix system. Let me ask you a few simple questions. What is the largest file and how big is it? Which user is consuming the most capacity? Which user has the most files? What is the oldest file and how old is it? These questions sound easy to answer, but what if you have 1,000 users and more than 1PB of data? Moreover, the answers constantly change because users are adding, modifying, and deleting data. Even so, understanding – or at the very least, monitoring – your filesystem holistically can provide great benefits.
For example, in a very recent discussion on the Beowulf mailing list, the original poster said that about 40 million out of 50 million files were less than 10KB in size. He also pointed out, however, that about 500 of 700TB was used, with most of the capacity being consumed by some very, very large files. They were speculating whether the large number of small files was causing their filesystem to bottleneck on IOPS. Without knowledge of their file distribution, they would have had a difficult time looking for bottlenecks.
Furthermore, subsequent responses to the thread illustrated that other people create policies to prevent problems with their storage systems because of the large number of small files. For example, one policy from a person posting to the thread insisted that users zip their small files together before archiving them to reduce the load on the archive system.
This simple thread illustrates that having good information about the filesystem can lead to a greater understanding of your storage, which can lead to better policies or improved performance.
Several tools allow you to scan your filesystem and gather information about file metadata. For example, it’s pretty simple to gather information on size, age, and ownership. An example of such a tool is fsstats. It walks a filesystem, starting at a root directory, and produces a summary output that lists a file size histogram, capacity used histograms, directory sizes, file name lengths, link counts (links to files), symlink target links, and time histograms based on files and file size.
In a true devops manner, I decided to write my own tool for gathering filesystem statistics. In doing this, I have control over what metadata I gather and how I present it. In this article, I want to discuss what metadata I'm gathering, how I'm gathering it, and how I present the data.
fsscan and mdpostp
I created two tools to gather metadata about data and present a statistical analysis of it. I wrote the tools in Python because some very nice libraries and tools exist to gather and display data. Python has a nice module called os that can easily walk a filesystem and gather virtually all of the same information that the stat command produces. Even better, this module is part of the Python standard library, so it is available in the Python packages of most distributions and can easily form the basis of a tool to walk a filesystem and gather detailed file information.
The first tool, fsscan, gathers the data and writes it to a file. It simply walks the directory tree using the os.walk() library function and gathers the following information using the os.stat() library function:
- Full file path
- Size of file in bytes
- The mtime (modify time)
- The ctime (change time)
- The atime (access time)
- The file owner (uid and gid)
The data is gathered into Python lists and then written to a file using pickle.
The second tool, mdpostp, reads a file that contains a list of the pickle files to use in the statistical analysis. It computes some simple statistics and creates a few plots. Specifically, it reports the following information:
- Average age of all files
- Oldest file
- Youngest file
- Standard deviation of file ages
- Top 10 oldest files
- Top 10 largest files
- Biggest users of capacity (based on user ID and group ID)
- List of duplicate files (optional)
- Histogram of file age
- Histogram of file sizes
It performs this statistical analysis for three timestamps: ctime (change time – the last time the file metadata was changed – e.g., permissions), mtime (modify time – the last time the file was modified – e.g., the content), and atime (access time – the last time the file was read). The histograms are created using matplotlib.
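The age statistics listed above can be sketched in a few lines. This is only an illustration, not the actual mdpostp code: it assumes Python 3, the function name age_stats is hypothetical, and it uses the population standard deviation, which may differ from what mdpostp computes.

```python
import math
import time

SECONDS_PER_DAY = 86400.0


def age_stats(timestamps, now=None):
    """Return (average, oldest, youngest, std_dev) of file ages in days.

    timestamps: epoch times (e.g., a list of mtimes); now: reference epoch time.
    """
    if not timestamps:
        raise ValueError("no timestamps given")
    if now is None:
        now = time.time()
    # Age of each file in days relative to the reference time
    ages = [(now - t) / SECONDS_PER_DAY for t in timestamps]
    mean = sum(ages) / len(ages)
    # Population standard deviation (an assumption; mdpostp may differ)
    variance = sum((a - mean) ** 2 for a in ages) / len(ages)
    return mean, max(ages), min(ages), math.sqrt(variance)
```

The same function can be applied to the ctime, mtime, and atime lists in turn.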
fsscan
The intent of this tool is to gather data and write it out in a pickle (a serialized object). The current version of fsscan gathers the data in Python lists, then creates a simple dictionary (key-value) and writes that to the pickle file. You can specify the starting directory from which to gather data; if you don't, it defaults to the directory in which you run the code (cwd = current working directory). You can also specify the name of the pickle file with a command-line option; the default is file.pickle. To get the help output for the code, run ./fsscan.py -h. During a scan, the code sends a little output to standard out (stdout), such as the root directory from which it starts and the start time. However, it also prints every directory it processes, which can sometimes lead to a great deal of output:
    [laytonjb@home4 HPC_031] ./fsscan.py -o home.pickle -d /home
    start_time: 1402853413.36
    Starting directory (root): /home
    start_time: 1402853413.36
    Writing output data (pickle) to file: home.pickle
    Starting directory /home/
    Starting directory /home/laytonjb/
    ....
I cut off the output after one directory, because it was fairly long and pointless. Notice that the start time is given in seconds since the epoch. It’s fairly easy to convert this to something useful (this is an exercise left to the reader).
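For readers who would rather skip the exercise, one possible conversion is sketched below. It assumes Python 3 and reports the time in UTC; the helper name epoch_to_utc is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Format seconds since the epoch as a human-readable UTC timestamp."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


# The start time printed by the fsscan run above
print(epoch_to_utc(1402853413.36))  # prints 2014-06-15 17:30:13 UTC
```

Dropping the tz argument gives the local time of the machine running the code instead.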
One of the functions in the Python os module is called walk (os.walk). This function allows you to walk a directory tree (i.e., examine the files recursively in a directory tree) and gather information on the directories and the files. A simple example from the Python 2.7.7 documentation has been modified here:
    #!/usr/bin/python
    import os
    from os.path import join, getsize

    for root, dirs, files in os.walk('.'):
        print root, "consumes",
        print sum(getsize(join(root, name)) for name in files),
        print "bytes in", len(files), "non-directory files"
This quick code snippet displays the number of bytes taken by non-directory files in each directory under the starting directory (current working directory). This simple snippet can form the basis of a script that can walk through a directory tree and gather information about the files. [A quick note – this code snippet does not have any exception handling, so it is definitely possible you will encounter exceptions.]
With the ability to walk a directory tree, you can open the files in the directory and gather statistics on each file. The os module also has a function (method) called os.fstat that can give you most of the information that the stat command produces. Taking the previous example and extending it a bit results in the following:
    #!/usr/bin/python
    import os
    from os.path import join, getsize

    for root, dirs, files in os.walk('.'):
        print root, "consumes",
        print sum(getsize(join(root, name)) for name in files),
        print "bytes in", len(files), "non-directory files"
        for file in files:
            # Full path = directory being walked + file name
            fileloc = root + "/" + file
            FILE = os.open(fileloc, os.O_RDONLY)
            junk = os.fstat(FILE)   # stat result, indexable like a tuple
            size = junk[6]          # st_size (bytes)
            atime = junk[7]         # st_atime (seconds since the epoch)
            mtime = junk[8]         # st_mtime
            ctime = junk[9]         # st_ctime
            uid = junk[4]           # st_uid (numeric)
            gid = junk[5]           # st_gid (numeric)
            print "   File: %s  size: %s  atime: %s  mtime: %s  ctime: %s" % (file,size,atime,mtime,ctime)
            os.close(FILE)
In the second for loop, the full path to the file is created (fileloc) using the root of the directory tree (root) and the file name (file). Notice that the os.fstat function returns a sequence of attributes. For example, it returns the access time (atime), the modify time (mtime), and the change time (ctime), which are all in seconds since the epoch. Other attributes include the size in bytes (size) and the user ID (uid) and group ID (gid). Note that this snippet leaves the uid and gid as numeric values; in fsscan, they are converted to a string (hopefully a real user name) rather than stored as a numeric value.
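One way to turn numeric IDs into names is with the standard pwd and grp modules, which are available on Unix-like systems only. The sketch below assumes Python 3; the helper names uid_to_name and gid_to_name are hypothetical and are not taken from fsscan. If an ID has no entry (e.g., the account was deleted), the numeric value is returned as a string.

```python
import grp
import os
import pwd


def uid_to_name(uid):
    """Return the user name for a numeric uid, or the uid as a string."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this uid
        return str(uid)


def gid_to_name(gid):
    """Return the group name for a numeric gid, or the gid as a string."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        # No group entry for this gid
        return str(gid)


# Example: owner and group of the current directory
st = os.stat(".")
print(uid_to_name(st.st_uid), gid_to_name(st.st_gid))
```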
The fsscan tool was derived from this simple example code. It does a great deal more error checking and offers command-line options (processing command lines is not always trivial). The code is not complicated, and if you know a little Python, you can easily modify it to do what you want.
Once it’s done processing, the code creates a dictionary:
    main_data["start_dir"] = start_directory
    main_data["start_time"] = start_time
    main_data["fullpath"] = fullpath_list
    main_data["size_bytes"] = size_bytes_list
    main_data["atime"] = atime_list
    main_data["mtime"] = mtime_list
    main_data["ctime"] = ctime_list
    main_data["uid_name"] = uid_list
    main_data["gid_name"] = gid_list
I hope the variables are self-explanatory, but if not, here’s the magic decoding.
- start_directory = starting directory
- start_time = wall clock time when the code started
- fullpath_list = Python list of the full path to every file and directory
- size_bytes_list = size of each file in bytes
- atime_list = atime for each file in seconds since the epoch
- mtime_list = mtime for each file in seconds since the epoch
- ctime_list = ctime for each file in seconds since the epoch
- uid_list = user ID (uid) for each file
- gid_list = group ID (gid) for each file
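Putting the pieces together, a stripped-down scanner that produces a dictionary with the same keys might look like the following. This is a minimal sketch, not fsscan itself: it assumes Python 3, the function names scan_tree and save_scan are hypothetical, it records regular directory entries returned as files by os.walk (not the directories themselves), and it stores numeric IDs under uid_name and gid_name rather than converting them to names.

```python
import os
import pickle
import time


def scan_tree(start_dir):
    """Walk start_dir and collect per-file metadata in parallel lists."""
    data = {
        "start_dir": start_dir,
        "start_time": time.time(),
        "fullpath": [], "size_bytes": [],
        "atime": [], "mtime": [], "ctime": [],
        "uid_name": [], "gid_name": [],
    }
    for root, dirs, files in os.walk(start_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                # lstat does not follow symlinks and does not open the file
                st = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it
                continue
            data["fullpath"].append(path)
            data["size_bytes"].append(st.st_size)
            data["atime"].append(st.st_atime)
            data["mtime"].append(st.st_mtime)
            data["ctime"].append(st.st_ctime)
            data["uid_name"].append(st.st_uid)   # numeric in this sketch
            data["gid_name"].append(st.st_gid)   # numeric in this sketch
    return data


def save_scan(data, filename="file.pickle"):
    """Write the scan dictionary to a pickle file."""
    with open(filename, "wb") as fh:
        pickle.dump(data, fh)
```

A call such as save_scan(scan_tree("/home"), "home.pickle") would then produce a file with the same layout described above.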
mdpostp
The mdpostp tool (metadata post-processing) is used to take the pickle file(s) and do a statistical analysis of the data. The code takes as input a list of pickle files you want to analyze. The command line is pretty simple:
./mdpostp [-help] [-dup|-nodup]
Using -help on the command line displays the help text (it’s not very extensive). Then, you have the option to have the code search for and display duplicate files (-dup) or not search and not display duplicate files (-nodup). The final option on the command line should be the input file that contains a simple list of pickle files you want to analyze, such as these four pickle files that have been created via fsscan.py.
    home1.pickle
    home2.pickle
    home3.pickle
    data.pickle
The mdpostp code reads and processes each of the files in turn. At this time, it does not concatenate the data from the pickles. Because of this limitation, you can’t run fsscan.py on various subdirectories and have mdpostp combine the data into a report for the parent directory.
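Reading the input file and loading each pickle in turn could be sketched as follows. This assumes Python 3, that the input file holds whitespace-separated pickle file names, and that the pickles come from a trusted source (unpickling untrusted data is unsafe). The function names are hypothetical and are not taken from mdpostp.

```python
import pickle


def read_pickle_list(list_file):
    """Return the pickle file names listed in list_file."""
    with open(list_file) as fh:
        # Names may be separated by newlines or spaces
        return fh.read().split()


def load_scans(list_file):
    """Yield (file name, scan dictionary) for each listed pickle, in order."""
    for name in read_pickle_list(list_file):
        with open(name, "rb") as fh:
            yield name, pickle.load(fh)
```

Because load_scans is a generator, only one scan needs to be held in memory at a time.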
The code seems long, but it’s really simple: It reads the data in the pickle, does a simple analysis, prints the results to stdout, and writes an HTML file. The code creates a “report” in a subdirectory named HTML_REPORT. In this directory, you will see a file named report.html. Just open this file in your browser and you should see the HTML report. All of the figures (plots) are stored in the same subdirectory.
The code should be easy to understand, but if it looks confusing, particularly the plotting portion, take a look at the documentation for matplotlib, especially the section about plotting histograms.
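The interval counts behind a histogram can also be computed without matplotlib. The sketch below assumes Python 3 and uses the age interval edges that appear in the sample output in Appendix A. The half-open intervals [low, high) and the function names are my assumptions, so boundary handling may differ from mdpostp.

```python
from bisect import bisect_right

# Interval edges in days, as they appear in the sample report
AGE_EDGES_DAYS = [0, 1, 2, 4, 7, 14, 28, 56, 112, 168, 252, 365, 504, 730,
                  1095, 1460, 1825, 2190, 2920, 3650, 4380, 5110, 5840]


def interval_counts(values, edges=AGE_EDGES_DAYS):
    """Count values per half-open interval [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):   # values outside all intervals are ignored
            counts[i] += 1
    return counts


def print_interval_summary(values, edges=AGE_EDGES_DAYS):
    """Print count, percentage, and cumulative percentage per interval."""
    counts = interval_counts(values, edges)
    total = len(values)
    cumulative = 0
    for low, high, n in zip(edges, edges[1:], counts):
        cumulative += n
        pct = 100.0 * n / total if total else 0.0
        cum_pct = 100.0 * cumulative / total if total else 0.0
        print("[%4d-%4d days]: %8d (%6.2f%%) (%6.2f%% cumulative)"
              % (low, high, n, pct, cum_pct))
```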
Appendix A is sample output from stdout. My apologies for the length, but I think it’s useful to see all of the output because it includes some of the details that also appear in the HTML report. The original scan was run against /home on my workstation.
The corresponding HTML output is shown in Appendix B. It includes basically the same information, but with the plots. It also shows an analysis of the difference between ctime (the last time the file's metadata was changed, such as permissions) and mtime (the last time the file's contents were modified). I’m examining ctime/mtime because I want to see whether file metadata, but not the file content, is changed frequently. This tells me whether users are doing something like touch or chmod on a file or files.
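The ctime/mtime comparison boils down to a per-file subtraction. A minimal sketch, assuming Python 3 and two equal-length lists of epoch times such as the ctime and mtime lists in the pickle (the function names are hypothetical):

```python
SECONDS_PER_DAY = 86400.0


def ctime_mtime_diff_days(ctimes, mtimes):
    """Return ctime - mtime for each file, in days."""
    return [(c - m) / SECONDS_PER_DAY for c, m in zip(ctimes, mtimes)]


def count_nonzero(diffs):
    """Count files whose ctime differs from their mtime."""
    return sum(1 for d in diffs if d != 0)


# Example: the first file had its metadata changed two days after its content
diffs = ctime_mtime_diff_days([3 * 86400, 100], [1 * 86400, 100])
print(diffs, count_nonzero(diffs))  # prints [2.0, 0.0] 1
```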
Usage Recommendations
You can use the code however you like (it is GPLv3), but I wanted to explain how I like to use it. The first thing I like to do is run fsscan against user home directories periodically. I also run it against any other directory trees to which users have access – perhaps something like a WORK or SCRATCH directory tree. When I run it, I include the date of the run in the name of the pickle, so I can keep track of changes to the files over time. It might be useful to write a tool that reads multiple pickle files and looks for differences, such as how many files have been deleted, how many files have been created, how many files have had their metadata changed, how many files have had their content changed, and so on.
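Such a comparison tool could start from something like the sketch below. It assumes Python 3 and two scan dictionaries with the fullpath, mtime, and ctime keys described earlier. The function name compare_scans is hypothetical, and the classification is a heuristic: a changed mtime is treated as a content change, and a changed ctime with an unchanged mtime as a metadata-only change.

```python
def compare_scans(old, new):
    """Compare two scan dictionaries and classify the differences by path."""
    old_mtime = dict(zip(old["fullpath"], old["mtime"]))
    old_ctime = dict(zip(old["fullpath"], old["ctime"]))
    new_mtime = dict(zip(new["fullpath"], new["mtime"]))
    new_ctime = dict(zip(new["fullpath"], new["ctime"]))

    old_paths = set(old_mtime)
    new_paths = set(new_mtime)
    common = old_paths & new_paths

    return {
        "created": sorted(new_paths - old_paths),
        "deleted": sorted(old_paths - new_paths),
        # Heuristic: a different mtime means the content changed
        "content_changed": sorted(
            p for p in common if old_mtime[p] != new_mtime[p]),
        # Heuristic: ctime moved but mtime did not (e.g., chmod or chown)
        "metadata_only_changed": sorted(
            p for p in common
            if old_mtime[p] == new_mtime[p] and old_ctime[p] != new_ctime[p]),
    }
```

The lengths of the four lists give the counts of created, deleted, and changed files between two dated scans.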
Then I run mdpostp against newly created pickle files to get a quick snapshot of the state of the filesystem. I like to read the standard output first to get a feel for the data; then, I pull up the HTML output to get a better view of the state of the filesystem.
Given enough time between filesystem scans, I like to compare the HTML output of the various subdirectories, such as /home. This tells me a little about how the filesystem is “evolving.”
If your filesystem is large and you don’t want to use too much memory gathering the raw data (recall that fsscan holds everything in memory before writing the pickle file), a good practice is to run fsscan on different subdirectories at about the same time. This works well on parallel filesystems because you can run the code on different clients or storage servers simultaneously, each covering a different part of the directory tree. However, to get a view of the entire data set, you need to write code that combines the different pickle files into a single pickle file, which should be simple to do if you are interested.
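A combining step could be as small as the following sketch, which assumes Python 3 and scan dictionaries that use the keys described earlier. The function name merge_scans is hypothetical, and taking the earliest start time for the merged scan is my choice rather than anything prescribed by fsscan.

```python
import pickle

# Keys that hold one entry per file
LIST_KEYS = ("fullpath", "size_bytes", "atime", "mtime", "ctime",
             "uid_name", "gid_name")


def merge_scans(scans, start_dir):
    """Concatenate several scan dictionaries into one for start_dir."""
    merged = {
        "start_dir": start_dir,
        # Use the earliest scan start as the start of the merged scan
        "start_time": min(s["start_time"] for s in scans),
    }
    for key in LIST_KEYS:
        merged[key] = []
        for scan in scans:
            merged[key].extend(scan[key])
    return merged


def merge_pickles(pickle_files, start_dir, out_file):
    """Load the given pickles, merge them, and write a single pickle."""
    scans = []
    for name in pickle_files:
        with open(name, "rb") as fh:
            scans.append(pickle.load(fh))
    with open(out_file, "wb") as fh:
        pickle.dump(merge_scans(scans, start_dir), fh)
```

Note that the merged data still has to fit in memory on the machine that does the analysis.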
Summary
A good administrator has an understanding of the “status” of their filesystem. What are the really old files (using atime, ctime, mtime, or some combination)? Who is using the most data? Who has the largest number of files? What is the average file size? What is the standard deviation for file size? What does a histogram of file size look like? All of these simple questions need to be answered, because the answers allow you to watch the pulse of the filesystem and understand how it’s changing and how you can tune the storage to run better or develop better policies.
Some code out there allows you to walk a file tree and gather all of this data on files and directories. However, I wanted some flexibility, so I wrote two tools to perform this task. The code is in Python because Python has easy-to-use libraries for gathering and plotting data. The first tool, fsscan, gathers the data and writes it to a Python pickle file (key-value). The second tool, mdpostp, reads the pickle file and does some simple statistical analysis, creating both a report to standard output (stdout) and an HTML report.
Running the code in this article (or similar code) is very easy and yields a great deal of information. At the very least, running this code against user directories will be very enlightening. Don’t be afraid; give it a try.
Appendix A: mdpostp Standard Output
    [laytonjb@home4 HPC_031]$ ./mdpostp1.py -nodup files.in
    filename = home.pickle
    start_time: 1402854660.07
    html_filename = ./HTML_REPORT/report.html
    **************
    *** Pickle file: home.pickle
    *** Start_dir: /home Scan date: Sun Jun 15 13:30:13 2014
    **************
    Number of lines in scan = 388083

    ==============
    Mtime results:
    ==============
    Average mtime age in days: 907.036 days
    Oldest mtime age file in days: 5,401.445 days
    Youngest mtime age file in days: 0.016 days
    Standard deviation mtime age in days: 590.7352 days
    *** Mtime interval summary
    [ 0- 1 days]: 176 ( 0.05%) ( 0.05% cumulative)
    [ 1- 2 days]: 0 ( 0.00%) ( 0.05% cumulative)
    [ 2- 4 days]: 1091 ( 0.28%) ( 0.33% cumulative)
    [ 4- 7 days]: 217 ( 0.06%) ( 0.38% cumulative)
    [ 7- 14 days]: 75 ( 0.02%) ( 0.40% cumulative)
    [ 14- 28 days]: 7379 ( 1.90%) ( 2.30% cumulative)
    [ 28- 56 days]: 10655 ( 2.75%) ( 5.05% cumulative)
    [ 56- 112 days]: 12079 ( 3.11%) ( 8.16% cumulative)
    [ 112- 168 days]: 27551 ( 7.10%) ( 15.26% cumulative)
    [ 168- 252 days]: 9820 ( 2.53%) ( 17.79% cumulative)
    [ 252- 365 days]: 79030 ( 20.36%) ( 38.15% cumulative)
    [ 365- 504 days]: 3717 ( 0.96%) ( 39.11% cumulative)
    [ 504- 730 days]: 5438 ( 1.40%) ( 40.51% cumulative)
    [ 730-1095 days]: 39339 ( 10.14%) ( 50.65% cumulative)
    [1095-1460 days]: 190052 ( 48.97%) ( 99.62% cumulative)
    [1460-1825 days]: 602 ( 0.16%) ( 99.78% cumulative)
    [1825-2190 days]: 142 ( 0.04%) ( 99.81% cumulative)
    [2190-2920 days]: 392 ( 0.10%) ( 99.92% cumulative)
    [2920-3650 days]: 160 ( 0.04%) ( 99.96% cumulative)
    [3650-4380 days]: 41 ( 0.01%) ( 99.97% cumulative)
    [4380-5110 days]: 91 ( 0.02%) ( 99.99% cumulative)
    [5110-5840 days]: 36 ( 0.01%) (100.00% cumulative)

    Top 10 oldest files (mtime - modify time)
    ---------------------------------------------
    Rank File Mtime Age (Days)
    #1 /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png 5,401.444
    #2 /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png 5,401.444
    #3 /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png 5,401.444
    #4 /home/laytonjb/.gkrellm2/themes/brushed/d 5,401.444
    #5 /home/laytonjb/.gkrellm2/themes/brushed/gismrc~ 5,401.444
    #6 /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc 5,234.944
    #7 /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png 5,233.824
    #8 /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png 5,233.824
    #9 /home/laytonjb/.gkrellm2/themes/x17/frame_left.png 5,233.824
    #10 /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png 5,233.824

    ==============
    Ctime results:
    ==============
    Average ctime age in days: 205.772 days
    Oldest ctime age file in days: 248.086 days
    Youngest ctime age file in days: 0.016 days
    Standard deviation ctime age in days: 69.8168 days
    [ 0- 1 days]: 180 ( 0.05%) ( 0.05% cumulative)
    [ 1- 2 days]: 0 ( 0.00%) ( 0.05% cumulative)
    [ 2- 4 days]: 4675 ( 1.20%) ( 1.25% cumulative)
    [ 4- 7 days]: 215 ( 0.06%) ( 1.31% cumulative)
    [ 7- 14 days]: 75 ( 0.02%) ( 1.33% cumulative)
    [ 14- 28 days]: 15845 ( 4.08%) ( 5.41% cumulative)
    [ 28- 56 days]: 4418 ( 1.14%) ( 6.55% cumulative)
    [ 56- 112 days]: 20402 ( 5.26%) ( 11.80% cumulative)
    [ 112- 168 days]: 72768 ( 18.75%) ( 30.55% cumulative)
    [ 168- 252 days]: 269505 ( 69.45%) (100.00% cumulative)
    [ 252- 365 days]: 0 ( 0.00%) (100.00% cumulative)
    [ 365- 504 days]: 0 ( 0.00%) (100.00% cumulative)
    [ 504- 730 days]: 0 ( 0.00%) (100.00% cumulative)
    [ 730-1095 days]: 0 ( 0.00%) (100.00% cumulative)
    [1095-1460 days]: 0 ( 0.00%) (100.00% cumulative)
    [1460-1825 days]: 0 ( 0.00%) (100.00% cumulative)
    [1825-2190 days]: 0 ( 0.00%) (100.00% cumulative)
    [2190-2920 days]: 0 ( 0.00%) (100.00% cumulative)
    [2920-3650 days]: 0 ( 0.00%) (100.00% cumulative)
    [3650-4380 days]: 0 ( 0.00%) (100.00% cumulative)
    [4380-5110 days]: 0 ( 0.00%) (100.00% cumulative)
    [5110-5840 days]: 0 ( 0.00%) (100.00% cumulative)

    Top 10 oldest files (ctime - change time)
    -------------------------------------------
    Rank File Ctime Age (Days)
    #1 /home/laytonjb/.gconf/apps/nm-applet/%gconf.xml 248.085
    #2 /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/prefs/%gconf.xml 248.085
    #3 /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/%gconf.xml 248.085
    #4 /home/laytonjb/.gconf/apps/panel/applets/clock/prefs/%gconf.xml 248.085
    #5 /home/laytonjb/.gconf/apps/panel/applets/clock/%gconf.xml 248.085
    #6 /home/laytonjb/.gconf/apps/panel/applets/window_list/prefs/%gconf.xml 248.085
    #7 /home/laytonjb/.gconf/apps/panel/applets/window_list/%gconf.xml 248.085
    #8 /home/laytonjb/.gconf/apps/panel/applets/%gconf.xml 248.085
    #9 /home/laytonjb/.gconf/apps/panel/%gconf.xml 248.085
    #10 /home/laytonjb/.gconf/apps/gnote/%gconf.xml 248.085

    ===============================
    Ctime-Mtime Difference results:
    ===============================
    Number of non-zero difference files: 334,953 of 388,083 files: (86.31%)
    Average ctime-mtime age in days: 700.000 days
    Oldest ctime-mtime age file in days: 5,153.000 days
    Youngest ctime-mtime age file in days: 0.000 days
    Standard deviation ctime-mtime age in days: 543.8009 days
    [ 0- 1 days]: 53756 ( 13.85%) ( 13.85% cumulative)
    [ 1- 2 days]: 73 ( 0.02%) ( 13.87% cumulative)
    [ 2- 4 days]: 121 ( 0.03%) ( 13.90% cumulative)
    [ 4- 7 days]: 168 ( 0.04%) ( 13.94% cumulative)
    [ 7- 14 days]: 1111 ( 0.29%) ( 14.23% cumulative)
    [ 14- 28 days]: 8768 ( 2.26%) ( 16.49% cumulative)
    [ 28- 56 days]: 25377 ( 6.54%) ( 23.03% cumulative)
    [ 56- 112 days]: 8101 ( 2.09%) ( 25.12% cumulative)
    [ 112- 168 days]: 47729 ( 12.30%) ( 37.42% cumulative)
    [ 168- 252 days]: 3578 ( 0.92%) ( 38.34% cumulative)
    [ 252- 365 days]: 4970 ( 1.28%) ( 39.62% cumulative)
    [ 365- 504 days]: 2324 ( 0.60%) ( 40.22% cumulative)
    [ 504- 730 days]: 3270 ( 0.84%) ( 41.06% cumulative)
    [ 730-1095 days]: 37909 ( 9.77%) ( 50.83% cumulative)
    [1095-1460 days]: 189729 ( 48.89%) ( 99.72% cumulative)
    [1460-1825 days]: 254 ( 0.07%) ( 99.78% cumulative)
    [1825-2190 days]: 255 ( 0.07%) ( 99.85% cumulative)
    [2190-2920 days]: 324 ( 0.08%) ( 99.93% cumulative)
    [2920-3650 days]: 136 ( 0.04%) ( 99.97% cumulative)
    [3650-4380 days]: 3 ( 0.00%) ( 99.97% cumulative)
    [4380-5110 days]: 122 ( 0.03%) (100.00% cumulative)
    [5110-5840 days]: 5 ( 0.00%) (100.00% cumulative)

    Top 10 oldest files (ctime-time)
    -------------------------------------------
    Rank File Ctime-Mtime diff (Days)
    #1 /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png 5,153.421
    #2 /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png 5,153.420
    #3 /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png 5,153.420
    #4 /home/laytonjb/.gkrellm2/themes/brushed/d 5,153.420
    #5 /home/laytonjb/.gkrellm2/themes/brushed/gismrc~ 5,153.420
    #6 /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc 4,986.920
    #7 /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png 4,985.801
    #8 /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png 4,985.801
    #9 /home/laytonjb/.gkrellm2/themes/x17/frame_left.png 4,985.801
    #10 /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png 4,985.801

    ==============
    Atime results:
    ==============
    Average atime age in days: 127.981 days
    Oldest atime age file in days: 2,029.980 days
    Youngest atime age file in days: 0.015 days
    Standard deviation atime age in days: 98.5828 days
    [ 0- 1 days]: 436 ( 0.11%) ( 0.11% cumulative)
    [ 1- 2 days]: 0 ( 0.00%) ( 0.11% cumulative)
    [ 2- 4 days]: 4838 ( 1.25%) ( 1.36% cumulative)
    [ 4- 7 days]: 238 ( 0.06%) ( 1.42% cumulative)
    [ 7- 14 days]: 60 ( 0.02%) ( 1.44% cumulative)
    [ 14- 28 days]: 16110 ( 4.15%) ( 5.59% cumulative)
    [ 28- 56 days]: 7985 ( 2.06%) ( 7.64% cumulative)
    [ 56- 112 days]: 19803 ( 5.10%) ( 12.75% cumulative)
    [ 112- 168 days]: 334034 ( 86.07%) ( 98.82% cumulative)
    [ 168- 252 days]: 1288 ( 0.33%) ( 99.15% cumulative)
    [ 252- 365 days]: 425 ( 0.11%) ( 99.26% cumulative)
    [ 365- 504 days]: 923 ( 0.24%) ( 99.50% cumulative)
    [ 504- 730 days]: 25 ( 0.01%) ( 99.51% cumulative)
    [ 730-1095 days]: 78 ( 0.02%) ( 99.53% cumulative)
    [1095-1460 days]: 1831 ( 0.47%) (100.00% cumulative)
    [1460-1825 days]: 0 ( 0.00%) (100.00% cumulative)
    [1825-2190 days]: 9 ( 0.00%) (100.00% cumulative)
    [2190-2920 days]: 0 ( 0.00%) (100.00% cumulative)
    [2920-3650 days]: 0 ( 0.00%) (100.00% cumulative)
    [3650-4380 days]: 0 ( 0.00%) (100.00% cumulative)
    [4380-5110 days]: 0 ( 0.00%) (100.00% cumulative)
    [5110-5840 days]: 0 ( 0.00%) (100.00% cumulative)

    Top 10 oldest files (atime - access time)
    -------------------------------------------
    Rank File Atime Age (Days)
    #1 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/script/.exists 2,029.980
    #2 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man3/.exists 2,029.980
    #3 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/auto/DBIx/SimplePerl/.exists 2,029.980
    #4 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/DBIx/.exists 2,029.980
    #5 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/bin/.exists 2,029.980
    #6 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/auto/DBIx/SimplePerl/.exists 2,029.980
    #7 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/.exists 2,029.980
    #8 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man1/.exists 2,029.980
    #9 /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/pm_to_blib 2,029.980
    #10 /home/laytonjb/src/CFD/CFD_2/USM3D/USM3D_DATA/RAMP/ramp1.m2 1,440.808

    ==================
    File Size results:
    ==================
    Average file size in KB: 909.000 KB
    Largest file in KB: 19,804,344.000 KB
    Smallest file size in KB: 0.000 KB
    Standard deviation file size in KB: 49,832.8061 KB
    *** File Size Intervals (KB):
    [ 0- 1 KB]: 156455 ( 40.31%) ( 40.31% cumulative)
    [ 1- 2 KB]: 38278 ( 9.86%) ( 50.18% cumulative)
    [ 2- 4 KB]: 30822 ( 7.94%) ( 58.12% cumulative)
    [ 4- 8 KB]: 43118 ( 11.11%) ( 69.23% cumulative)
    [ 8- 16 KB]: 24296 ( 6.26%) ( 75.49% cumulative)
    [ 16- 32 KB]: 33455 ( 8.62%) ( 84.11% cumulative)
    [ 32- 64 KB]: 13502 ( 3.48%) ( 87.59% cumulative)
    [ 64- 128 KB]: 12083 ( 3.11%) ( 90.70% cumulative)
    [ 128- 256 KB]: 8623 ( 2.22%) ( 92.93% cumulative)
    [ 256- 512 KB]: 13437 ( 3.46%) ( 96.39% cumulative)
    [ 512- 1024 KB]: 5456 ( 1.41%) ( 97.79% cumulative)
    [ 1024- 2048 KB]: 2687 ( 0.69%) ( 98.49% cumulative)
    [ 2048- 4096 KB]: 2497 ( 0.64%) ( 99.13% cumulative)
    [ 4096- 8192 KB]: 1361 ( 0.35%) ( 99.48% cumulative)
    [ 8192- 16384 KB]: 949 ( 0.24%) ( 99.73% cumulative)
    [ 16384- 32768 KB]: 373 ( 0.10%) ( 99.82% cumulative)
    [ 32768- 65536 KB]: 246 ( 0.06%) ( 99.89% cumulative)
    [ 65536- 131072 KB]: 179 ( 0.05%) ( 99.93% cumulative)
    [ 131072- 262144 KB]: 76 ( 0.02%) ( 99.95% cumulative)
    [ 262144- 524288 KB]: 36 ( 0.01%) ( 99.96% cumulative)
    [ 524288-1048576 KB]: 154 ( 0.04%) (100.00% cumulative)

    Top 10 largest files
    =====================
    Rank File Size (KB)
    #1 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/cesm-strace/strace.janus017.tar 19,804,344
    #2 /home/laytonjb/src/CFD/CFD_2/cfdpp.tar.gz 8,473,315
    #3 /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out 6,155,298
    #4 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out 6,155,298
    #5 /home/laytonjb/src/CFD/CFD_2/overflow.tar.gz 5,863,543
    #6 /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out 5,748,509
    #7 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out 5,748,509
    #8 /home/laytonjb/src/CFD/CFD_2/boeing_app_tuned.tar.gz 4,240,130
    #9 /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out 4,225,463
    #10 /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out 4,225,463

    Top 10 biggest users
    =====================
    Rank User Total Size (KB) % of Total
    #1 laytonjb 353,100,132 99.9998%
    #2 root 645 0.0002%
    #3 susy 0 0.0000%

    Top 10 biggest group users
    ===========================
    Rank Group Total Size (KB) % of Total
    #1 laytonjb 353,100,132 99.9998%
    #2 root 645 0.0002%
    #3 susy 0 0.0000%
Appendix B: HTML Report Output
Metadata Report
A “highlight” analysis of each pickle file presents the top 10 oldest files, largest files, users with the most data, and so on. Number of files in scan = 388,083.
Mtime Age Statistics
The following statistics are for the mtime age of files in the pickle. Mtime is the modify time of the file. It will change only if the actual data is changed, but not if the metadata alone is changed.
- Average mtime age: 907.327 days
- Oldest mtime age: 5,401.736 days
- Youngest mtime age: 0.308 days
- Standard deviation mtime age: 590.7352 days
Table 1 lists the mtime age intervals in days. The age is based on the mtime of the files when scanned and the current time.
Table 1: mtime Age Intervals
Interval (days) | No. of files | % of Total | Cumulative % |
---|---|---|---|
0–1 | 151 | 0.04 | 0.04 |
1–2 | 0 | 0.00 | 0.04 |
2–4 | 215 | 0.06 | 0.09 |
4–7 | 111 | 0.29 | 0.38 |
7–14 | 75 | 0.02 | 0.40 |
14–28 | 7,376 | 1.90 | 2.30 |
28–56 | 10,658 | 2.75 | 5.05 |
56–112 | 12,079 | 3.11 | 8.16 |
112–168 | 27,551 | 7.10 | 15.26 |
168–252 | 9,819 | 2.53 | 17.79 |
252–365 | 79,031 | 20.36 | 38.15 |
365–504 | 3,717 | 0.96 | 39.11 |
504–730 | 5,421 | 1.40 | 40.51 |
730–1,095 | 39,356 | 10.14 | 50.65 |
1,095–1,460 | 190,014 | 48.96 | 99.61 |
1,460–1,825 | 640 | 0.16 | 99.78 |
1,825–2,190 | 142 | 0.04 | 99.81 |
2,190–2,920 | 392 | 0.10 | 99.92 |
2,920–3,650 | 160 | 0.04 | 99.96 |
3,650–4,380 | 41 | 0.01 | 99.97 |
4,380–5,110 | 91 | 0.02 | 99.99 |
5,110–5,840 | 36 | 0.01 | 100.00 |
Table 2 lists the top 10 oldest files based on mtime (modify time) when the files were scanned in the original pickle file.
Table 2: Top 10 Oldest Files Based on mtime
Rank | File | Mtime Age |
---|---|---|
1 | /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png | 5,401.736 |
2 | /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png | 5,401.736 |
3 | /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png | 5,401.736 |
4 | /home/laytonjb/.gkrellm2/themes/brushed/d | 5,401.736 |
5 | /home/laytonjb/.gkrellm2/themes/brushed/gismrc~ | 5,401.736 |
6 | /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc | 5,235.235 |
7 | /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png | 5,234.116 |
8 | /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png | 5,234.116 |
9 | /home/laytonjb/.gkrellm2/themes/x17/frame_left.png | 5,234.116 |
10 | /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png | 5,234.116 |
Figure 1 is a histogram of the mtime (modify time) age of files in the pickle.
Ctime Age Statistics
The next statistics are for the ctime age of the files in the pickle. Ctime is the change time of the file. It will change if the actual data is changed or if the metadata is changed, such as the ownership or permissions on the file.
- Average ctime age: 206.064 days
- Oldest ctime age: 248.377 days
- Youngest ctime age: 0.308 days
- Standard deviation ctime age: 69.8168 days
Table 3 lists the ctime age intervals in days. The age is based on the ctime of the files when scanned and the current time.
Table 3: Ctime Age Intervals
Interval (days) | No. of files | % of Total | Cumulative % |
---|---|---|---|
0–1 | 155 | 0.04 | 0.04 |
1–2 | 0 | 0.00 | 0.04 |
2–4 | 259 | 0.07 | 0.11 |
4–7 | 4,656 | 1.20 | 1.31 |
7–14 | 75 | 0.02 | 1.33 |
14–28 | 15,842 | 4.08 | 5.41 |
28–56 | 4,421 | 1.14 | 6.55 |
56–112 | 20,402 | 5.26 | 11.80 |
112–168 | 72,768 | 18.75 | 30.55 |
168–252 | 269,505 | 69.45 | 100.00 |
252–365 | 0 | 0.00 | 100.00 |
365–504 | 0 | 0.00 | 100.00 |
504–730 | 0 | 0.00 | 100.00 |
730–1,095 | 0 | 0.00 | 100.00 |
1,095–1,460 | 0 | 0.00 | 100.00 |
1,460–1,825 | 0 | 0.00 | 100.00 |
1,825–2,190 | 0 | 0.00 | 100.00 |
2,190–2,920 | 0 | 0.00 | 100.00 |
2,920–3,650 | 0 | 0.00 | 100.00 |
3,650–4,380 | 0 | 0.00 | 100.00 |
4,380–5,110 | 0 | 0.00 | 100.00 |
5,110–5,840 | 0 | 0.00 | 100.00 |
Table 4 lists the top 10 files based on ctime, change time, when the files were scanned in the original pickle file.
Table 4: Top 10 Oldest Files Based on ctime
Rank | File | Ctime Age |
---|---|---|
1 | /home/laytonjb/.gconf/apps/nm-applet/%gconf.xml | 248.377 |
2 | /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/prefs/%gconf.xml | 248.377 |
3 | /home/laytonjb/.gconf/apps/panel/applets/workspace_switcher/%gconf.xml | 248.377 |
4 | /home/laytonjb/.gconf/apps/panel/applets/clock/prefs/%gconf.xml | 248.377 |
5 | /home/laytonjb/.gconf/apps/panel/applets/clock/%gconf.xml | 248.377 |
6 | /home/laytonjb/.gconf/apps/panel/applets/window_list/prefs/%gconf.xml | 248.377 |
7 | /home/laytonjb/.gconf/apps/panel/applets/window_list/%gconf.xml | 248.377 |
8 | /home/laytonjb/.gconf/apps/panel/applets/%gconf.xml | 248.377 |
9 | /home/laytonjb/.gconf/apps/panel/%gconf.xml | 248.377 |
10 | /home/laytonjb/.gconf/apps/gnote/%gconf.xml | 248.377 |
Figure 2 is a histogram of the ctime (change time) age of the files in the pickle.
ctime-mtime Difference Statistics
The next set of statistics covers the difference between ctime and mtime, which shows how metadata changes (ctime) relate to data changes (mtime). The following analysis uses the difference between the two (ctime/mtime).
- Average ctime/mtime: 700.000 days
- Oldest ctime/mtime file: 5,153.000 days
- Youngest ctime/mtime file: 0.000 days
- Standard deviation ctime/mtime: 543.8009 days
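As a rough illustration of where such a difference comes from, the sketch below reads both timestamps with `os.lstat` and subtracts them. The helper names `ctime_mtime_diff_days` and `walk_diffs` are hypothetical, and the code assumes a Linux or other Unix-like system, where `st_ctime` is the inode change time; it is not the scanner used to build the pickle.

```python
import os

SECONDS_PER_DAY = 86400.0

def ctime_mtime_diff_days(path):
    """Days between the last metadata change and the last data change."""
    st = os.lstat(path)  # lstat: do not follow symbolic links
    return (st.st_ctime - st.st_mtime) / SECONDS_PER_DAY

def walk_diffs(root):
    """Yield (path, difference in days) for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                yield path, ctime_mtime_diff_days(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
```

A large positive difference typically means the data is old but the inode changed recently, for example, when files were copied or unpacked from an archive with their modification times preserved.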
Table 5: ctime/mtime Age Intervals
Interval (days) | No. of files | % of Total | Cumulative % |
---|---|---|---|
0–1 | 53,756 | 13.85 | 13.85 |
1–2 | 73 | 0.02 | 13.87 |
2–4 | 121 | 0.03 | 13.90 |
4–7 | 168 | 0.04 | 13.94 |
7–14 | 1,111 | 0.29 | 14.23 |
14–28 | 8,768 | 2.26 | 16.49 |
28–56 | 25,377 | 6.54 | 23.03 |
56–112 | 8,101 | 2.09 | 25.12 |
112–168 | 47,729 | 12.30 | 37.42 |
168–252 | 3,578 | 0.92 | 38.34 |
252–365 | 4,970 | 1.28 | 39.62 |
365–504 | 2,324 | 0.60 | 40.22 |
504–730 | 3,270 | 0.84 | 41.06 |
730–1,095 | 37,909 | 9.77 | 50.83 |
1,095–1,460 | 189,729 | 48.89 | 99.72 |
1,460–1,825 | 254 | 0.07 | 99.78 |
1,825–2,190 | 255 | 0.07 | 99.85 |
2,190–2,920 | 324 | 0.08 | 99.93 |
2,920–3,650 | 136 | 0.04 | 99.97 |
3,650–4,380 | 3 | 0.00 | 99.97 |
4,380–5,110 | 122 | 0.03 | 100.00 |
5,110–5,840 | 5 | 0.00 | 100.00 |
Table 6 lists the top 10 files with the largest ctime/mtime differences.
Table 6: Top 10 Files with the Largest ctime/mtime Differences
Rank | File | ctime/mtime Difference (days) |
---|---|---|
1 | /home/laytonjb/.gkrellm2/themes/x17/bg_grid.png | 5,153.421 |
2 | /home/laytonjb/.gkrellm2/themes/brushed/bg_grid.png | 5,153.420 |
3 | /home/laytonjb/.gkrellm2/themes/brushed/bg_chart.png | 5,153.420 |
4 | /home/laytonjb/.gkrellm2/themes/brushed/d | 5,153.420 |
5 | /home/laytonjb/.gkrellm2/themes/brushed/gismrc~ | 5,153.420 |
6 | /home/laytonjb/.gkrellm2/themes/brushed/gkrellmrc | 4,986.920 |
7 | /home/laytonjb/.gkrellm2/themes/x17/host/bg_panel.png | 4,985.801 |
8 | /home/laytonjb/.gkrellm2/themes/x17/net/decal_net_leds.png | 4,985.801 |
9 | /home/laytonjb/.gkrellm2/themes/x17/frame_left.png | 4,985.801 |
10 | /home/laytonjb/.gkrellm2/themes/x17/frame_bottom.png | 4,985.801 |
Figure 3 is a histogram of the ctime/mtime differences of the files in the pickle.
Atime Age Statistics
Atime is the access time of the file: the time the file's data was last read. Mount options such as noatime and relatime can limit how often it is updated.
- Average atime age: 128.273 days
- Oldest atime: 2,030.272 days
- Youngest atime: 0.307 days
- Standard deviation atime: 98.5828 days
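The four summary values reported for each timestamp can be reproduced from a list of ages with the Python standard library. This is a minimal sketch with a hypothetical function name, `summarize_ages`; whether the original analysis used the population or the sample form of the standard deviation is not stated, so the population form here is an assumption.

```python
import statistics

def summarize_ages(ages):
    """Return average, oldest, youngest, and standard deviation in days."""
    ages = list(ages)
    return {
        "average": statistics.fmean(ages),
        "oldest": max(ages),
        "youngest": min(ages),
        # Population standard deviation (assumption); the sample form,
        # statistics.stdev, differs slightly for small inputs.
        "stddev": statistics.pstdev(ages),
    }
```

The same function works for ctime ages, ctime/mtime differences, and file sizes, because it only needs a sequence of numbers.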
Table 7 lists the atime age intervals. The age is the difference between the current time and the atime recorded when the files were scanned.
Table 7: Atime Age Intervals
Interval (days) | No. of Files | % of Total | Cumulative % |
---|---|---|---|
0–1 | 406 | 0.10 | 0.10 |
1–2 | 0 | 0.00 | 0.10 |
2–4 | 396 | 0.10 | 0.21 |
4–7 | 4,710 | 1.21 | 1.42 |
7–14 | 60 | 0.02 | 1.44 |
14–28 | 16,108 | 4.15 | 5.59 |
28–56 | 7,987 | 2.06 | 7.64 |
56–112 | 19,803 | 5.10 | 12.75 |
112–168 | 334,034 | 86.07 | 98.82 |
168–252 | 1,288 | 0.33 | 99.15 |
252–365 | 425 | 0.11 | 99.26 |
365–504 | 923 | 0.24 | 99.50 |
504–730 | 25 | 0.01 | 99.51 |
730–1,095 | 78 | 0.02 | 99.53 |
1,095–1,460 | 1,831 | 0.47 | 100.00 |
1,460–1,825 | 0 | 0.00 | 100.00 |
1,825–2,190 | 9 | 0.00 | 100.00 |
2,190–2,920 | 0 | 0.00 | 100.00 |
2,920–3,650 | 0 | 0.00 | 100.00 |
3,650–4,380 | 0 | 0.00 | 100.00 |
4,380–5,110 | 0 | 0.00 | 100.00 |
5,110–5,840 | 0 | 0.00 | 100.00 |
Table 8 lists the 10 oldest files by atime (access time), as recorded when the files were scanned into the original pickle file.
Table 8: Top 10 Oldest Files Based on atime
Rank | File | Atime Age (days) |
---|---|---|
1 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/script/.exists | 2,030.271 |
2 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man3/.exists | 2,030.271 |
3 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/auto/DBIx/SimplePerl/.exists | 2,030.271 |
4 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/lib/DBIx/.exists | 2,030.271 |
5 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/bin/.exists | 2,030.271 |
6 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/auto/DBIx/SimplePerl/.exists | 2,030.271 |
7 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/arch/.exists | 2,030.271 |
8 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/blib/man1/.exists | 2,030.271 |
9 | /home/laytonjb/CLUSTERBUFFER/STRACE/DB/DBIx-SimplePerl-1.90/pm_to_blib | 2,030.271 |
10 | /home/laytonjb/src/CFD/CFD_2/USM3D/USM3D_DATA/RAMP/ramp1.m2 | 1,441.099 |
File Size Statistics
The statistics for the sizes of the files in the pickle are:
- Average file size: 909.000KB
- Largest file: 19,804,344.000KB
- Smallest file: 0.000KB
- Standard deviation file size: 49,832.8061KB
Table 9: File Size Intervals
Interval (KB) | No. of Files | % of Total | Cumulative % |
---|---|---|---|
0–1 | 156,455 | 40.31 | 40.31 |
1–2 | 38,278 | 9.86 | 50.18 |
2–4 | 30,822 | 7.94 | 58.12 |
4–8 | 43,118 | 11.11 | 69.23 |
8–16 | 24,296 | 6.26 | 75.49 |
16–32 | 33,455 | 8.62 | 84.11 |
32–64 | 13,502 | 3.48 | 87.59 |
64–128 | 12,083 | 3.11 | 90.70 |
128–256 | 8,623 | 2.22 | 92.93 |
256–512 | 13,437 | 3.46 | 96.39 |
512–1,024 | 5,456 | 1.41 | 97.79 |
1,024–2,048 | 2,687 | 0.69 | 98.49 |
2,048–4,096 | 2,497 | 0.64 | 99.13 |
4,096–8,192 | 1,361 | 0.35 | 99.48 |
8,192–16,384 | 949 | 0.24 | 99.73 |
16,384–32,768 | 373 | 0.10 | 99.82 |
32,768–65,536 | 246 | 0.06 | 99.89 |
65,536–131,072 | 179 | 0.05 | 99.93 |
131,072–262,144 | 76 | 0.02 | 99.95 |
262,144–524,288 | 36 | 0.01 | 99.96 |
524,288–1,048,576 | 154 | 0.04 | 100.00 |
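The size intervals double from one row to the next, so the interval for a given file can be found from the position of the highest set bit in its size. The sketch below shows one way to do that; `size_bin_kb` and `size_histogram` are hypothetical names, and rounding sizes down to whole kilobytes is an assumption about how the intervals were formed.

```python
from collections import Counter

def size_bin_kb(size_bytes):
    """Return (low, high) in KB of the power-of-two bin holding the size."""
    kb = size_bytes // 1024  # whole kilobytes, rounded down (assumption)
    if kb < 1:
        return (0, 1)
    low = 1 << (kb.bit_length() - 1)  # largest power of two <= kb
    return (low, low * 2)

def size_histogram(sizes_bytes):
    """Count how many files fall into each power-of-two bin."""
    return Counter(size_bin_kb(size) for size in sizes_bytes)
```

Unlike the table, this sketch does not cap the last interval, so very large files end up in bins beyond 1,048,576KB.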
Table 10: Top 10 Largest Files
Rank | File | Size (KB) |
---|---|---|
1 | /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/cesm-strace/strace.janus017.tar | 19,804,344 |
2 | /home/laytonjb/src/CFD/CFD_2/cfdpp.tar.gz | 8,473,315 |
3 | /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out | 6,155,298 |
4 | /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.26702.call_variants_1.out | 6,155,298 |
5 | /home/laytonjb/src/CFD/CFD_2/overflow.tar.gz | 5,863,543 |
6 | /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out | 5,748,509 |
7 | /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24999.count_covariates_1.out | 5,748,509 |
8 | /home/laytonjb/src/CFD/CFD_2/boeing_app_tuned.tar.gz | 4,240,130 |
9 | /home/laytonjb/LAPTOP/CLUSTERBUFFER/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out | 4,225,463 |
10 | /home/laytonjb/CLUSTERBUFFER2/STRACE_PY/EXAMPLES/23andme-jeff-runs/all2/strace.24084.realign_indels_1.out | 4,225,463 |
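A top 10 list like this one does not require sorting every file. The following sketch keeps only the largest entries with `heapq.nlargest` and also groups files that share a size; the `(path, size)` record layout and both function names are hypothetical and are not the format of the pickle file.

```python
import heapq
from collections import defaultdict

def largest_files(records, n=10):
    """records: iterable of (path, size). Return the n largest, biggest first."""
    return heapq.nlargest(n, records, key=lambda rec: rec[1])

def same_size_groups(records):
    """Map each size that occurs more than once to the paths with that size."""
    groups = defaultdict(list)
    for path, size in records:
        groups[size].append(path)
    return {size: paths for size, paths in groups.items() if len(paths) > 1}
```

Several entries in the table share both a size and a file name under different directories, which suggests duplicate copies. Equal size is only a hint, though; comparing checksums would be needed to confirm it.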
Biggest Users
Table 11: Top 10 Biggest Users
Rank | User | Total Size (KB) | % of Total |
---|---|---|---|
1 | laytonjb | 353,100,132 | 99.9998 |
2 | root | 645 | 0.0002 |
3 | susy | 0 | 0.0000 |
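Per-user totals of this kind come from summing file sizes by owner. The sketch below assumes hypothetical `(uid, size_bytes)` records, such as the `st_uid` and `st_size` fields returned by `os.lstat`, and resolves user names with the standard `pwd` module, which is available only on Unix-like systems.

```python
import pwd
from collections import defaultdict

def user_name(uid):
    """Resolve a numeric UID to a login name, falling back to the number."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)  # UID has no entry in the password database

def usage_by_user(records):
    """records: iterable of (uid, size_bytes).

    Return rows of (user, total KB, percent of total), largest first.
    """
    totals = defaultdict(int)
    for uid, size in records:
        totals[uid] += size
    grand_total = sum(totals.values())
    rows = []
    for uid, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        pct = 100.0 * size / grand_total if grand_total else 0.0
        rows.append((user_name(uid), size / 1024.0, pct))
    return rows
```

The same approach works for groups by collecting `st_gid` instead and resolving names with `grp.getgrgid`.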
Biggest Group Users
Table 12 shows the top 10 largest group users.
Table 12: Top 10 Largest Group Users
Rank | Group | Total Size (KB) | % of Total |
---|---|---|---|
1 | laytonjb | 353,100,132 | 99.9998 |
2 | root | 645 | 0.0002 |
3 | susy | 0 | 0.0000 |