Autonomous File Recovery

Let users recover a deleted file without admin intervention by aliasing the rm command with mv or by writing your own script that moves the data to another location.

Have you ever deleted a file and immediately thought, “Ah! I needed that!”? There are some great stories about users who store data in filesystems that they know are not backed up, manage to erase all their data, and then start yelling that they need to have their data back – now! My favorite story was about a well-known university researcher who was storing data in a filesystem that was mounted with no_backup in the path name. Although warned several times that the data was not backed up, he still managed to erase his data, causing a fire drill right around Christmas time.

Although I hope this is a rare occurrence, there must be some way to help users who do their very best to shoot themselves in the foot. Users aren’t the only ones who can suffer from this problem. Admins also remove files from systems in the interest of cleaning up cruft, only to find out those files were important.

Begin at the Beginning

Coming up with ideas to help users recover data is part of an age-old question. One admin friend described this as, "How do we keep the users from hurting themselves?" As an engineer, I like to look for solutions to problems, so I started examining the data recovery request from several angles. Perhaps this problem is looking for more of a policy solution. Or perhaps it is a problem requiring a technical solution. Or is it both?

Policy Problem?

To gain a better perspective on this issue, I spoke with a number of admin friends, and one common theme emerged: whatever policies administrators created and communicated, upper management sometimes intervened and overrode them. Such interventions were infrequent – typically reserved for a critical project or a critical piece of data – but more often than not they followed the old adage “the squeaky wheel gets the grease.” Although I didn’t conduct a scientific survey, I did notice that when the problem reached the upper levels of management, managers were not aware of the policies that had been set and publicized. To my mind, this pointed to one possible solution – developing policies in conjunction with management while addressing the resource implications.

To begin developing policies, one should assume that users will erase data by accident and need it restored. The resource implications of this assumption can be quantified on the basis of past experience, data growth, and other factors. For example, will it require additional staff? Will it require additional hardware? The conclusions are then presented to management. During that discussion, management should be made aware of what it takes to restore or recover data, and of the effect of users erasing data, before a “management approved” policy is established and communicated to the user base and any changes to resources are resolved.

This approach can help alleviate the problem because management is fully aware of the resource implications of the final decision and users realize a policy is in place. The subtle context is that the entire management hierarchy is now aware of the policies so that the “squeaky wheel” approach will (should) have little effect on operations (although there will always be exceptions).

Technical Solutions

I first started using Unix on a VAX in 19.., er, a long time ago. We didn’t have much disk space, but perhaps more importantly, it was a new operating system to everyone. Using Unix, one could interactively log in to a system while sitting in front of a terminal rather than submit jobs to the big mainframe. This shift in how we worked meant an associated period of learning. Because disk space was at such a premium, one of the things people had to learn was how to erase files when they were finished, including how to use the deadly options for rm: -r, -f, and -rf *.

To help people during the adjustment period, the administrators “aliased” the rm command so that, when used, the data was actually moved to a temporary disk location instead of being erased. Then, if you had an “oops” moment, you could go to the directory at the new location and recover the files yourself. If you didn’t know the location of the “erased” files, a quick email to the administrator would allow them to recover the files for you. Because disk space was expensive, the data only lived in the temporary disk location for a certain period of time and then was removed permanently. This approach saved my bacon on several occasions.

Why not bring back this practice? You could alias the rm command to something else (e.g., mv) so that the data is not actually erased, but moved to a different location. Or, you could write a simple script that moves the data to a different location, from which users could then copy the data back if needed. For either solution, a cron job or a daemon can be used to erase the files in the “special” directory based on some policies (e.g., oldest files are erased if a certain usage level is reached – the “high water” mark, or if the files have reached a certain age). Of course, it takes disk resources to do this because you need a target location for storing the data, but that can be part of resource planning, as discussed in the previous section on policies.
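
To make the cleanup concrete, here is a minimal sketch of such a sweep, assuming a hypothetical trash location of /scratch/trash and placeholder values for the age limit and high-water mark (adjust all three to match your own policy). Note that shell aliases do not apply inside scripts or cron jobs, so the cleanup can call the real rm directly.

#!/bin/bash
# Hypothetical cleanup for the temporary "trash" area; path and limits are placeholders.
TRASH_DIR="/scratch/trash"
MAX_AGE_DAYS=14
HIGH_WATER_PCT=80
# Age-based sweep: permanently delete anything older than MAX_AGE_DAYS.
find "$TRASH_DIR" -mindepth 1 -type f -mtime +"$MAX_AGE_DAYS" -delete
# High-water mark: while the filesystem is too full, delete the oldest files first.
usage=$(df --output=pcent "$TRASH_DIR" | tail -1 | tr -dc '0-9')
while [ "$usage" -ge "$HIGH_WATER_PCT" ]; do
    oldest=$(find "$TRASH_DIR" -type f -printf '%T@ %p\n' | sort -n | head -1 | cut -d' ' -f2-)
    [ -z "$oldest" ] && break
    rm -f -- "$oldest"
    usage=$(df --output=pcent "$TRASH_DIR" | tail -1 | tr -dc '0-9')
done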

Alias rm with mv

If you want to try to alias the rm command with mv, the first step is to read the rm man page. A few key rm options are shown in Table 1. The rm command takes the form:

rm [OPTION]... FILE...

In my experience, some of these options are used fairly rarely, but to maintain compatibility with the original command, all of the options need to be considered.

Table 1: Key rm Options

Option                  Description
-f, --force             Ignore nonexistent files, never prompt
-i                      Prompt before every removal
-I                      Prompt once before removing more than three files, or when removing recursively; less intrusive than -i, while still giving protection against most mistakes
--interactive[=WHEN]    Prompt according to WHEN: never, once (-I), or always (-i); without WHEN, always prompt
--one-file-system       When removing a hierarchy recursively, skip any directory that is on a filesystem different from that of the corresponding command-line argument
--no-preserve-root      Do not treat “/” specially
--preserve-root         Do not remove “/” (default)
-r, -R, --recursive     Remove directories and their contents recursively
-v, --verbose           Explain what is being done
--help                  Display this help and exit
--version               Output version information and exit

Next, you should examine the mv man page. It, too, has a few key options (Table 2). The mv command takes the forms:

mv [OPTION]... [-T] SOURCE DEST
mv [OPTION]... SOURCE... DIRECTORY
mv [OPTION]... -t DIRECTORY SOURCE...

Table 2: Key mv Options

Option                              Description
--backup[=CONTROL]                  Make a backup of each existing destination file
-b                                  Like --backup, but does not accept an argument
-f, --force                         Do not prompt before overwriting
-i, --interactive                   Prompt before overwriting
--strip-trailing-slashes            Remove any trailing slashes from each SOURCE argument
-S, --suffix=SUFFIX                 Override the usual backup suffix
-t, --target-directory=DIRECTORY    Move all SOURCE arguments into DIRECTORY
-T, --no-target-directory           Treat DEST as a normal file
-u, --update                        Move only when the SOURCE file is newer than the destination file or when the destination file is missing
-v, --verbose                       Explain what is being done
--help                              Display this help and exit
--version                           Output version information and exit

As an admin, you have to decide whether it’s possible simply to alias rm with mv. You have to be prepared for users who apply some of the lesser-used options, and you should be prepared to tell users that the classic rm no longer exists but has been aliased to mv.
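
A plain alias such as alias rm='mv -t /scratch/trash' falls apart as soon as someone types rm -rf olddir, because mv does not understand -r. As a rough, hypothetical sketch (the trash location and the option handling are my own choices, not a drop-in replacement), a shell function gives you more room to cope with those options:

# Hypothetical rm wrapper: moves files into a per-user trash directory instead of deleting them.
rm() {
    local trash="${TRASH_DIR:-$HOME/.trash}"
    mkdir -p "$trash"
    local arg
    for arg in "$@"; do
        case "$arg" in
            -*) continue ;;   # quietly swallow rm options such as -r, -f, and -i
        esac
        mv -- "$arg" "$trash"/ || echo "rm wrapper: could not move $arg" >&2
    done
}

Users (and scripts) that truly need the original behavior can still run command rm or /bin/rm, which bypasses the function. Also note that, as written, removing a second file with the same name overwrites the first copy in the trash directory – a corner case discussed in the next section.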

Scripting

Using mv as a substitute for rm is not a perfect solution. Some corner cases will likely cause problems. For example, when a user removes a file with the aliased rm command, it is moved to the temporary disk storage and could be recovered. If the user then creates a new file with the exact same name and removes that file as well, the first file on the temporary storage would be overwritten. Perhaps this is acceptable, perhaps it is not; it could be part of the policy.

By writing your own script, you can define precisely what you want to happen when a user “removes” a file. You could incorporate versioning so that the user wouldn’t overwrite previously removed files, and you could couple a cron job with the script to clean the temporary directory – for example, by sweeping out files once they reach a certain age or grow very large.

As you write the code, be sure to consider which properties of the removed file should be kept. At a minimum, you probably want to keep the file name, the owner and group, and the file permissions. You might also want to keep the file’s three timestamps (access, modification, and change times). As mentioned previously, you might want to add versioning to the file name so that multiple removals of the same name can be stored in the temp directory; be careful, however, because this changes the file name.
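
A minimal sketch of such a script might look like the following. The trash location, the timestamp-based versioning scheme, and the metadata log are assumptions of mine, not a standard tool; mv itself preserves ownership, permissions, and modification time when the move stays within one filesystem, so the extra stat record mainly matters when the trash area lives elsewhere.

#!/bin/bash
# Hypothetical "saferm" sketch: versioned moves plus a record of the original metadata.
TRASH_DIR="${TRASH_DIR:-/scratch/trash/$USER}"
mkdir -p "$TRASH_DIR"
for f in "$@"; do
    [ -e "$f" ] || { echo "saferm: $f: No such file or directory" >&2; continue; }
    stamp=$(date +%Y%m%d-%H%M%S)
    base=$(basename -- "$f")
    dest="$TRASH_DIR/$base.$stamp"     # versioning: original name plus a timestamp
    # Record owner, group, permissions, and the three timestamps before the move.
    stat --format='%n %U:%G %a atime=%x mtime=%y ctime=%z' -- "$f" >> "$TRASH_DIR/.metadata.log"
    mv -- "$f" "$dest"
done

A cron job like the cleanup sketched earlier could then expire these versions by age or when the high-water mark is reached.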

It’s also highly recommended to keep some sort of log of what the script does. Although this might sound obvious, you would be surprised how many admins do not keep good logs of what is happening on their systems. The logs should include any cron job you use periodically to sweep the temporary directory. Be a lumberjack.
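
If you would rather log through syslog than to a flat file, the standard logger utility can be called from the script (using the hypothetical variables from the sketch above; saferm is just an arbitrary tag):

logger -t saferm "moved $f to $dest"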

This approach cannot help you save files that applications erase or remove as part of their processing. When this happens, the only real alternative is to have a second copy of the data somewhere. This scenario should be brought to the attention of management so that policies can be developed (e.g., having two copies of all data at all times, or telling users that there is no recourse if this happens).

Extended File Attributes

With modern filesystems, one key aspect that must be considered for moving or copying a file is extended attributes. Extended File Attributes (EFAs) allow you to add metadata to files beyond what is normally there. A simple example is:

$ setfattr -n user.comment -v "this is a comment" test.txt

In this example, the user adds a comment to the user namespace of the EFAs and labels it comment (i.e., user.comment). The comment itself is “this is a comment,” which you can read back with the getfattr command.
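
For example, reading the attribute back might look like this (getfattr -d dumps all attributes in the user namespace):

$ getfattr -n user.comment test.txt
# file: test.txt
user.comment="this is a comment"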

In aliasing the rm command or writing your own script, at some point you will need to address EFAs and make them part of the policies that management endorses. Do you copy them along with the file or not? Personally, I think you should copy the EFAs along with the file itself, but that decision is up to you in consultation with users and management.
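
If the trash directory is on the same filesystem, a plain mv is a rename and the EFAs stay with the inode. If the trash area lives on another filesystem, it is safer to be explicit; for example (the paths here are placeholders), GNU cp and rsync can both carry the attributes across:

$ cp --preserve=mode,ownership,timestamps,xattr test.txt /scratch/trash/
$ rsync -aX test.txt /scratch/trash/
$ getfattr -d /scratch/trash/test.txt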

Backups

One thing I haven’t touched on yet is backups. They can be beautiful things that save your bacon, but they are somewhat limited, as I’m sure all administrators are aware. Backups happen at certain intervals, whether full or incremental, and in between backups, users, scripts, and programs create, change, and remove data that the backups miss. Backups might be able to restore data, but only if the data has, in fact, been backed up. Also, how many times have administrators tried to restore a file from backup only to discover that the backup failed or the media, most likely tape, is corrupt? Fortunately, this scenario is becoming rarer, but it still happens. (Be sure to check your backup logs and test the restoration process on a small amount of data every so often.)

Backups can help with data recovery, but they are not perfect. Moreover, given the size of modern filesystems, it might be impossible to do full backups, at least economically. You might be restricted to performing a single full backup when the system is first installed and then doing incremental backups for the life of the system, which for petabyte-size filesystems could be very difficult to accomplish and might require more hardware than can be afforded.

Using backups in combination with the options discussed here can offer some data protection and perhaps reduce the likelihood that users will hurt themselves.

Summary

Normally, having users remove data and then yell about getting it recovered quickly is a fairly rare occurrence. I talk to a great number of administrators, and this scenario is something they rarely encounter. When they do, they either restore the data, talk to the user to help educate them, develop scripts or tools to help alleviate the problem, or perform some combination of these actions.

As system sizes grow, the probability of catastrophic events that require the restoration of data or other extreme measures increases. In general terms, the majority of administrators feel that this problem is only going to get worse with more users and more data, so they are looking for solutions.

I see two aspects to the problem. The first is a policy aspect, wherein upper level management needs to be brought into discussions to develop appropriate policies. As part of this, people need to remove emotion from the discussions and present real information about the frequency of data restoration requests, how much work it requires, and how much it disrupts normal operations. In essence, the discussion, like many other discussions, should be around resource allocation and associated benefits. The benefit of having upper management involved is the agreement on policies at the highest levels. The policies should be published to all users, with the implication that management is very aware of the issues surrounding data recovery and no more squeaky wheels will be tolerated. (Score one for the administrators.)

The second aspect, which really accompanies the first, is technical. Can tools help easily restore or recover erased data or prevent a user from accidentally erasing data? Backups can help, but they are only part of the solution. Going back to my early Unix education, I discussed how administrators can either alias the rm command so the data is moved to a temporary disk location or create their own script to accomplish the same thing. Coupled with normal backups, this method could help alleviate some of the problems administrators are having. All you need is some sort of temporary disk-based storage and you are off to the races. Make sure the size of the temporary storage is adjustable, so if you need more space, it’s fairly easy to add more hardware (with its associated costs).

Have I made a case for aliasing the rm command or using a script? I think the answer is unique to each system, the administrators, and the users. Many times, this approach can help users recover needed files quickly, but it takes work to develop and test the scripts.
