Incremental Backups on Linux

Sample Script

Rubel’s article included a sample script for maintaining three incremental backups as well as a full backup. The basic script is very simple, yet it packs a great deal of power into a few lines:

rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
cp -al backup.0 backup.1
rsync -a --delete source_directory/  backup.0/

To better understand the script, I’ll use it on a simple example in which I create a sample directory in my account /home/laytonjb/TEST/SOURCE that I want to back up. To begin, I’ll put a single file in the directory and then run through the script. Next, I will add files to the directory, simulating the creation of new data, and keep running through the script, tracking what happens with the backups. I will also delete a file so you can see what happens in the backups.

For the first pass through the script, only one file is in the directory to be backed up. The output from the script is:

[laytonjb@home4 TEST]$ ls -s SOURCE/
total 7784
7784 Open-MPI-SC13-BOF.pdf
[laytonjb@home4 TEST]$ du -sh
7.7M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
mv: cannot stat `backup.2': No such file or directory
[laytonjb@home4 TEST]$ mv backup.1 backup.2
mv: cannot stat `backup.1': No such file or directory
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
cp: cannot stat `backup.0': No such file or directory
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0
[laytonjb@home4 TEST]$ ls -s
total 8
4 backup.0/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
16M .
[laytonjb@home4 TEST]$ du -sh SOURCE
7.7M SOURCE
[laytonjb@home4 TEST]$ du -sh backup.0
7.7M backup.0

This is the first pass through the script, so only the directory backup.0 is created and it is the same size as the SOURCE directory. You can think of this as a full backup of the directory.

To better understand what is happening with the backups, I’ll track the inode number of the files in the SOURCE subdirectory as well as the files in the four backup directories using ls -i (Table 1).

Table 1: inode Numbers After First Rsync

File SOURCE backup.0 backup.1 backup.2 backup.3
Open-MPI-SC13-BOF.pdf 45220041 45220196 NA NA NA

Notice that the file has two different inode numbers, one in the SOURCE subdirectory and one in the first backup subdirectory, backup.0. This indicates that they are two different files (one is a copy of the other). The directory backup.0 is a “snapshot” of the SOURCE subdirectory, and the file in it is a real copy, not a hard link to the file in SOURCE. You can confirm this by running the command

stat Open-MPI-SC13-BOF.pdf

in the backup.0 directory and looking for the Links: output, which should be 1. Also notice that if a backup directory doesn't exist, or the file doesn't exist in it, the inode number is listed as NA.
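The difference between a hard link and a real copy is easy to see directly with stat (a small sketch with throwaway file names; stat -c is GNU coreutils syntax):

```shell
# Hard link vs. copy: a hard link shares the inode and raises the link
# count; cp creates a new inode with a link count of 1.
mkdir -p /tmp/link-demo && cd /tmp/link-demo
echo data > original.txt
ln original.txt hardlink.txt    # same inode, Links: becomes 2
cp original.txt copy.txt        # new inode, Links: stays 1

stat -c 'links=%h inode=%i %n' original.txt copy.txt
```

The hard-linked pair prints the same inode number with a link count of 2, whereas the copy shows a fresh inode with a link count of 1, which is exactly what the tables in this article track.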

Before executing the script a second time, I copied a second file into the SOURCE directory to simulate the creation of a new file. The output from running the backup script is:

[laytonjb@home4 TEST]$ cd SOURCE
[laytonjb@home4 SOURCE]$ ls -s
total 9376
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
7784 Open-MPI-SC13-BOF.pdf
[laytonjb@home4 SOURCE]$ du -sh
9.2M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
mv: cannot stat `backup.2': No such file or directory
[laytonjb@home4 TEST]$ mv backup.1 backup.2
mv: cannot stat `backup.1': No such file or directory
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 12
4 backup.0/ 4 backup.1/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
19M .
[laytonjb@home4 TEST]$ du -sh SOURCE/
9.2M SOURCE/
[laytonjb@home4 TEST]$ du -sh backup.0
9.2M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
7.7M backup.1

Notice that I now have two backup subdirectories: backup.0 and backup.1. The cp -al command copies the files in backup.0 to backup.1 using hard links instead of duplicating the data; the rsync command then copies new and changed files into backup.0 and deletes from backup.0 any files that have been deleted from SOURCE. Therefore, you will see one file in backup.1 (the oldest backup) and all of the current files in backup.0. The directory backup.0 becomes the most recent snapshot (full backup) of the SOURCE directory, and backup.1 becomes the incremental backup relative to backup.0.

The command ls -i is used to examine the inodes of the files in the two backup directories and the SOURCE directory (Table 2).

Table 2: inode Numbers After Second Rsync

File SOURCE backup.0 backup.1 backup.2 backup.3
Open-MPI-SC13-BOF.pdf 45220041 45220196 45220196 NA NA
easybuild_Python-BoF-SC12-lightning-talk.pdf 45220199 45220206 NA NA NA

This is the first time an incremental backup has been made. In Table 2, notice that the inode number of the first file is the same in both backup directories, which means the file is stored only once, with two hard links pointing at the same data, saving time and space. Because of the hard link, the second backup requires no additional storage for that file. To confirm this, you can run the stat command against the file in either backup directory; the Links: value should be 2.

To see whether I am actually saving space, I can examine the space used by the two backup directories and the SOURCE directory. The SOURCE directory reports using 9.2MB; backup.0, the most recent snapshot, also reports using 9.2MB (as it should); and backup.1, the previous backup, reports using only 7.7MB (as it should). That is a total of 26.1MB. However, when I ran the du -sh command at the root of the tree, it reported only 19MB in use. The difference is a result of using hard links in the backup process, saving storage space.
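You can reproduce this saving on any system with cp -al and du (a sketch using throwaway paths): du counts shared data blocks only once when both trees are measured together.

```shell
# Two trees that share all their data via hard links consume the
# space of one. Paths here are illustrative.
mkdir -p /tmp/du-demo/dir.0 && cd /tmp/du-demo
dd if=/dev/zero of=dir.0/big bs=1M count=5 2>/dev/null   # 5MB test file
cp -al dir.0 dir.1      # hard-link "copy": no new data blocks

du -s dir.0 dir.1       # each tree alone reports about 5MB
du -s .                 # together they still report about 5MB, not 10MB
```

Measured individually, each directory appears to hold the full 5MB; measured as one tree, the shared blocks are counted once, which is exactly the discrepancy seen between the per-directory du numbers and the root du above.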

Now I’ll add a third file to the SOURCE directory and run the backup script for a third time.

[laytonjb@home4 TEST]$ cd SOURCE
[laytonjb@home4 SOURCE]$ ls -s
total 12024
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf 
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf
[laytonjb@home4 SOURCE]$ du -sh
12M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
mv: cannot stat `backup.2': No such file or directory
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0
[laytonjb@home4 TEST]$ ls -s
total 16
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
24M .
[laytonjb@home4 TEST]$ du -sh SOURCE/
12M SOURCE/
[laytonjb@home4 TEST]$ du -sh backup.0
12M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
9.2M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
7.7M backup.2

Notice how the sizes of the backup directories decrease as the “backup count” increases, indicating which are the incremental backup directories. (Remember that backup.0 is the full backup at any time.)

The size of the SOURCE directory is 12MB, as is the reported size of backup.0. Also notice that the reported size of backup.1 is 9.2MB and the reported size of backup.2 is 7.7MB, as expected. These sizes total 40.9MB, but du -sh reports that the actual space used is only 24MB (about 59% of the apparent total). I love hard links!

To understand what is happening with the backups, I’ll again tabulate the inodes of the files in the various backups (Table 3).

Table 3: inode Numbers After Third Rsync

File SOURCE backup.0 backup.1 backup.2 backup.3
Open-MPI-SC13-BOF.pdf 45220041 45220196 45220196 45220196 NA
easybuild_Python-BoF-SC12-lightning-talk.pdf 45220199 45220206 45220206 NA NA
PrintnFly_Denver_SC13.pdf 45220217 45220219 NA NA NA

Notice how the file Open-MPI-SC13-BOF.pdf has the same inode in all three backup directories. This indicates the file is stored only once, with three hard links to the same inode (one per backup directory). You can verify this by running the stat command against the file in backup.0; you should see that the Links: value is 3. You can also check the stat output for the file easybuild_Python-BoF-SC12-lightning-talk.pdf, which should have a Links: value of 2.

After adding a fourth file to SOURCE , I run the backup script again:

[laytonjb@home4 TEST]$ cd SOURCE
[laytonjb@home4 SOURCE]$ ls -s
total 14116
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf 
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf 
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf
[laytonjb@home4 SOURCE]$ du -sh
14M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 20
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 backup.3/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
28M .
[laytonjb@home4 TEST]$ du -sh backup.0
14M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
12M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
9.2M backup.2
[laytonjb@home4 TEST]$ du -sh backup.3
7.7M backup.3

Notice how I now have the final directory backup.3 , but I haven’t eliminated any backups yet. Table 4 lists the inode numbers for the various files in the backup and source directories.

Table 4: inode Numbers After Fourth Rsync

File SOURCE backup.0 backup.1 backup.2 backup.3
Open-MPI-SC13-BOF.pdf 45220041 45220196 45220196 45220196 45220196
easybuild_Python-BoF-SC12-lightning-talk.pdf 45220199 45220206 45220206 45220206 NA
PrintnFly_Denver_SC13.pdf 45220217 45220219 45220219 NA NA
IL-ARG-CaseStudy-13-01_HighLift.pdf 45220266 45220268 NA NA NA

To show what happens the first time a backup is eliminated, I’ll add a fifth file and run the script again:

[laytonjb@home4 SOURCE]$ ls -s
total 14648
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
2648 PrintnFly_Denver_SC13.pdf
7784 Open-MPI-SC13-BOF.pdf
[laytonjb@home4 SOURCE]$ du -sh
15M .

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 20
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 backup.3/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
29M .
[laytonjb@home4 TEST]$ du -sh SOURCE
15M SOURCE/
[laytonjb@home4 TEST]$ du -sh backup.0
15M backup.0/
[laytonjb@home4 TEST]$ du -sh backup.1
14M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
12M backup.2
[laytonjb@home4 TEST]$ du -sh backup.3
9.2M backup.3

[laytonjb@home4 TEST]$ ls -i SOURCE/
45220199 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220220 HPCTutorial.pdf
45220266 IL-ARG-CaseStudy-13-01_HighLift.pdf
45220041 Open-MPI-SC13-BOF.pdf
45220217 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.0/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220221 HPCTutorial.pdf
45220268 IL-ARG-CaseStudy-13-01_HighLift.pdf
45220196 Open-MPI-SC13-BOF.pdf
45220219 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.1/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220268 IL-ARG-CaseStudy-13-01_HighLift.pdf
45220196 Open-MPI-SC13-BOF.pdf
45220219 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.2/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220196 Open-MPI-SC13-BOF.pdf
45220219 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -i backup.3/
45220206 easybuild_Python-BoF-SC12-lightning-talk.pdf
45220196 Open-MPI-SC13-BOF.pdf

I included the inode output (ls -i) because I wanted to point out that the oldest backup, backup.3, now has two files in it: the previous backup.3, which held only one file, was erased (rm -rf backup.3), and the old backup.2 was rotated into its place.

Table 5 shows the inode numbers for the various files in the backup and source directories for this fifth pass through the script.

Table 5: inode Numbers After Fifth Rsync

File SOURCE backup.0 backup.1 backup.2 backup.3
Open-MPI-SC13-BOF.pdf 45220041 45220196 45220196 45220196 45220196
easybuild_Python-BoF-SC12-lightning-talk.pdf 45220199 45220206 45220206 45220206 45220206
PrintnFly_Denver_SC13.pdf 45220217 45220219 45220219 45220219 NA
IL-ARG-CaseStudy-13-01_HighLift.pdf 45220266 45220268 45220268 NA NA
HPCTutorial.pdf 45220220 45220221 NA NA NA

If you compare Tables 4 and 5, you can see the progression of the hard links in the various backup directories.

I want to do one last experiment with the script, in which I erase a file from the SOURCE directory and see how it propagates into the backups.

[laytonjb@home4 SOURCE]$ rm easybuild_Python-BoF-SC12-lightning-talk.pdf 
[laytonjb@home4 SOURCE]$ ls -s
total 13056
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ rm -rf backup.3
[laytonjb@home4 TEST]$ mv backup.2 backup.3
[laytonjb@home4 TEST]$ mv backup.1 backup.2
[laytonjb@home4 TEST]$ cp -al backup.0 backup.1
[laytonjb@home4 TEST]$ rsync -a --delete /home/laytonjb/TEST/SOURCE/ backup.0/
[laytonjb@home4 TEST]$ ls -s
total 20
4 backup.0/ 4 backup.1/ 4 backup.2/ 4 backup.3/ 4 SOURCE/

[laytonjb@home4 TEST]$ du -sh
28M .
[laytonjb@home4 TEST]$ du -sh backup.0
13M backup.0
[laytonjb@home4 TEST]$ du -sh backup.1
15M backup.1
[laytonjb@home4 TEST]$ du -sh backup.2
14M backup.2
[laytonjb@home4 TEST]$ du -sh backup.3
12M backup.3

[laytonjb@home4 TEST]$ ls -s backup.0
total 13056
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -s backup.1
total 14648
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
 532 HPCTutorial.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -s backup.2
total 14116
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
2092 IL-ARG-CaseStudy-13-01_HighLift.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

[laytonjb@home4 TEST]$ ls -s backup.3
total 12024
1592 easybuild_Python-BoF-SC12-lightning-talk.pdf
7784 Open-MPI-SC13-BOF.pdf
2648 PrintnFly_Denver_SC13.pdf

Notice how the file easybuild_Python-BoF-SC12-lightning-talk.pdf no longer appears in backup.0, but it does appear from backup.1 onward. This is the expected behavior, because I used the --delete option with rsync. However, it also illustrates one of the limitations of backups: you can’t back up everything all of the time, because you would run out of space. To paraphrase Steven Wright: “You can’t back up everything. Where would you put it?” At some point an erased file will age out of the rotation and can no longer be recovered; it falls outside the scope of the backups. How long that retention period should be is a business and process decision, not a technology decision.

You can also see the deleted file in Table 6.

Table 6: inode Numbers After a File is Deleted

File SOURCE backup.0 backup.1 backup.2 backup.3
Open-MPI-SC13-BOF.pdf 45220041 45220196 45220196 45220196 45220196
easybuild_Python-BoF-SC12-lightning-talk.pdf NA NA 45220206 45220206 45220206
PrintnFly_Denver_SC13.pdf 45220217 45220219 45220219 45220219 45220219
IL-ARG-CaseStudy-13-01_HighLift.pdf 45220266 45220268 45220268 45220268 NA
HPCTutorial.pdf 45220220 45220221 45220221 NA NA

Notice that the file easybuild_Python-BoF-SC12-lightning-talk.pdf is no longer listed in SOURCE or backup.0. The file has been erased from those directories but still exists in the others. If you ran stat on that file in backup.1, you would see that its Links: count has dropped from 4 to 3.

Rsync Can Do Hard Links

Although I personally like the method of combining rsync with cp -al, rsync has an option that creates the hard links for you, so you don't have to make them manually. Most modern Linux distributions ship a fairly recent rsync that includes the very useful --link-dest= option. It tells rsync to compare the files being copied against an existing directory structure, to copy only the files that have changed relative to that directory (an incremental backup), and to create hard links for the unchanged files.

I’ll look at a quick example of using this option. Assume you have a source directory, SOURCE , and you do a full copy of the directory to SOURCE.full :

[laytonjb@home4 TEST]$ rsync -avh --delete \
/home/laytonjb/TEST/SOURCE/ /home/laytonjb/TEST/SOURCE.full
sending incremental file list
created directory /home/laytonjb/TEST/SOURCE.full
./
Open-MPI-SC13-BOF.pdf
PrintnFly_Denver_SC13.pdf
easybuild_Python-BoF-SC12-lightning-talk.pdf

sent 12.31M bytes received 72 bytes 24.61M bytes/sec
total size is 12.31M speedup is 1.00

You can then create an incremental backup based on that full copy using the following command:

[laytonjb@home4 TEST]$ rsync -avh --delete --link-dest=/home/laytonjb/TEST/SOURCE.full /home/laytonjb/TEST/SOURCE/ /home/laytonjb/TEST/SOURCE.1

When rsync creates SOURCE.1, it checks each file against SOURCE.full: changed files are copied, and unchanged files are created as hard links to the copies in SOURCE.full, producing the incremental copy.

To better use this approach, you would want to implement the backup rotation scheme discussed in the previous section. The script might look something like this:

rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
mv backup.0 backup.1
rsync -avh --delete --link-dest=../backup.1 source_directory/ backup.0/

Summary

Although I’m sure the many commercial offerings to back up data work fine, I tend to like simple open source solutions to problems. Recently, I’ve been revisiting the question of how best to do backups, and I found that it’s possible to just use rsync to make both full and incremental backups. Over time, rsync has gained the ability to copy files that have changed relative to a directory other than the source directory and use hard links for common files, effectively allowing rsync to make incremental backups. I use rsync for a variety of tasks, but using it for backups was one that I had never considered, although I’m probably behind the times in this regard.

One advantage of using rsync for backups is that the backup is an ordinary directory tree in a filesystem. You can mount this filesystem read-only on users’ systems or on a central file server, and users can then restore files themselves (a self-service file restore). Furthermore, you can use SSH to make the backups to a remote system and then use NFS to export the backups to the appropriate systems. This isolates the backups from the central server so that, in the event the server dies, you still have the user data in a different location.

Using rsync for full backups is pretty simple, but making incremental backups requires a little more work. Combined with hard links, rsync creates incremental backups, with the added benefit that the most recent backup is a full backup. I reviewed a simple script from what is likely the first article to describe using rsync and hard links for incremental backups in this way.

If you are looking for a backup tool, take a look at rsync. It has a very large feature set and is capable of making file copies to remote systems. Before you implement a backup that you, and possibly other users, rely on, be sure you test the process, and be sure you create very good logs of the backup process – then check those logs after the backup. The last suggestion I want to leave with you is that you test your backups, particularly your full backup. A backup that cannot restore data is worthless. Performing restores ensures that the backup works.
