Moving your data – It's not always pretty
Moving Day
5. Rsync
One tool I have not mentioned is rsync [11]. It is something of the do-all, dance-all tool for synchronizing files, so in many cases, people reach for it automatically for data migration. As with tar, rsync possesses a myriad of options (Table 2) [12].
Table 2
Rsync Options
Option | Description |
---|---|
-l | Copy symlinks as symlinks. |
-d | Transfer directories without recursion. |
-D | Preserve device files (superuser only). |
-g | Preserve groups. |
-o | Preserve owner. |
-t | Preserve modification times. |
-v | Verbose (it's always a good idea to use this option). |
-p | Preserve permissions. |
-X | Preserve xattr information. |
-H | Preserve hard links. |
-E | Preserve executability. |
-A | Preserve ACLs. |
-r | Recursive. |
-a | Archive mode (same as -r, -l, -p, -t, -g, -o, and -D, but not -H, -A, and -X). |
-S | Handle sparse files efficiently. |
-z | Compress file data during transfer. |
--compress-level=NUM | Explicitly set compression level. |
-h | Output human-readable numbers. |
--inplace | Update destination files in place. |
Many of these options carry implications, so it is a good idea to experiment a bit before proceeding with any data migration. Be sure to pay attention to the attributes and how they are Rsync'd to the new storage.
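One way to experiment safely is to assemble the command line in a small script and start with a dry run. The following is a minimal sketch, not a recommended set of options for every site: the helper name `build_rsync_command` and the paths are hypothetical, and it uses only flags from Table 2 plus rsync's `--dry-run` option, which reports what would be transferred without copying anything.

```python
import shlex


def build_rsync_command(src, dst, hard_links=True, acls=True, xattrs=True,
                        dry_run=True):
    """Return an rsync argument list for a one-way copy from src to dst."""
    # -a is archive mode (-rlptgoD), -v is verbose, -h prints readable sizes.
    cmd = ["rsync", "-a", "-v", "-h"]
    # Archive mode does not imply the next three, so add them explicitly.
    if hard_links:
        cmd.append("-H")
    if acls:
        cmd.append("-A")
    if xattrs:
        cmd.append("-X")
    if dry_run:
        cmd.append("--dry-run")  # show what would happen, copy nothing
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    # Hypothetical mount points; substitute your own.
    cmd = build_rsync_command("/old_storage/projects/", "/new_storage/projects/")
    print(shlex.join(cmd))
```

Once the dry-run output and the attributes on a small test copy look right, the same list can be built with `dry_run=False` and handed to `subprocess.run`.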
Marc Stearman's paper from LUG 2013 that I mentioned previously had some good observations about using Rsync to migrate data from one storage system to another. He and his team found Rsync to be very "chatty" as well as very metadata intensive (recall that they were migrating data from one storage solution to another, not syncing files between two locations).
In their migration, they also found that when Rsync migrates files, it first checks to see whether the file exists on the destination filesystem by opening it (a metadata operation). Then, it creates a temporary file, copies the data, and moves the temporary file into the correct location of the destination file.
All of these steps are very metadata intensive. The --inplace option skips the temporary file step and writes the data to the file in the correct location on the destination storage. Depending on the number of files to be migrated, this approach can reduce the migration time because it reduces the number of metadata operations.
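The difference between the two write paths can be illustrated with a few lines of Python. This is a simplified sketch of the idea only, not Rsync's actual code (which also does delta transfers, among other things); the function names are hypothetical.

```python
import os
import shutil
import tempfile


def copy_via_temp(src, dst):
    """Copy src into a temporary file next to dst, then rename it into place."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Creating the temporary file is one extra metadata operation ...
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, prefix=".migrate-")
    with os.fdopen(fd, "wb") as tmp, open(src, "rb") as source:
        shutil.copyfileobj(source, tmp)
    # ... and the rename into the final location is another.
    os.replace(tmp_path, dst)


def copy_in_place(src, dst):
    """Write the data straight into dst, with no temporary file or rename."""
    with open(src, "rb") as source, open(dst, "wb") as target:
        shutil.copyfileobj(source, target)
```

The trade-off is that the temporary-file path never leaves a half-written file under the final name, whereas an interrupted in-place write does. For a one-time migration to empty storage, that risk is usually easier to accept than for an ongoing sync.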
Note that with Rsync, you don't necessarily need a data mover between the old storage and the new storage. You just need at least one data server associated with the old storage and at least one data server associated with the new storage that are networked and able to "see" each other. Then, you can use Rsync to copy the data to the new storage. Depending on the storage systems, it might be a good idea for these data servers to be storage clients and not part of the storage solution itself. An example of this is Lustre [13], where you want the data servers to be Lustre clients rather than Object Storage Servers (OSSs).
As far as I can tell, Rsync has no formal parallel version, but lots of people want something like this (Google Summer of Code, anyone?). The following articles might help you use Rsync in a parallel manner:
- Parallel rsyncing a huge directory tree [14]
- Parallelizing RSYNC Processes [15]
- Parallelizing rsync [16]
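A common do-it-yourself approach is to start one Rsync process per top-level directory and run a handful of them at a time. The sketch below is my own illustration under that assumption, not the method of any of the articles above; the function names and paths are hypothetical.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def rsync_commands(src_root, dst_root, extra_flags=("-a",)):
    """Build one rsync command per top-level entry under src_root."""
    dst = os.path.join(dst_root, "")  # make sure dst ends with a slash
    commands = []
    for name in sorted(os.listdir(src_root)):
        # No trailing slash on the source, so rsync creates dst/<name>.
        src = os.path.join(src_root, name)
        commands.append(["rsync", *extra_flags, src, dst])
    return commands


def run_parallel(commands, workers=4):
    """Run the commands, at most `workers` at a time; return the exit codes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, commands))


if __name__ == "__main__":
    # Hypothetical mount point; only print the commands here.
    if os.path.isdir("/old_storage"):
        for cmd in rsync_commands("/old_storage", "/new_storage"):
            print(" ".join(cmd))
```

Splitting by top-level directory is simple but can be badly unbalanced if one directory holds most of the data. In that case, splitting one level deeper or by file list might work better.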
6. BitTorrent
BitTorrent [17] is not just a tool for downloading illegal copies of digital media. It is in fact a peer-to-peer sharing protocol with some associated tools that use multiple data streams to copy a single file. By using multiple streams, it can maximize network bandwidth even if the streams are relatively slow individually. It also supports parallel hashing, wherein the integrity of each piece can be verified independently and the file checked once it arrives to make sure it is correct.
BitTorrent is attractive because of its ability to download files from multiple sources at the same time. This avoids the single-stream performance limit of tools like cp or rsync and is why it is so popular for uploading and downloading files from home systems with really low network performance. The single-stream performance is fairly low, but there are so many participating systems that, in aggregate, the performance could be quite good.
If you want to copy a file from the old data storage to the new data storage using BitTorrent, you first have to create a torrent descriptor file that you use on the new data storage server(s). The data server(s) on the old storage act as a seed. The new data servers, acting as "peers," can then download the file by connecting to the seed server(s) and downloading it.
You just need a network connection between the seed servers and the peers. You can also scale the performance by adding more servers (seeds and peers) attached to the old storage, as well as more BitTorrent clients. With a centralized filesystem, scaling the file server portion of the exchange shouldn't be a problem because each node has access to all of the data.
A file for downloading is split into pieces, and each piece can be downloaded from a different BitTorrent server (seed or peer). The pieces are hashed so that a corrupted piece can be detected. The pieces are not downloaded in sequential order but in a somewhat random order, with the BitTorrent client requesting and assembling the pieces.
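The per-piece hashing idea is easy to demonstrate. The following sketch splits a byte string into fixed-size pieces, hashes each with SHA-1 (the hash used by the original BitTorrent protocol), and reports which pieces no longer match. The piece size and function names are assumptions chosen for illustration, and this is not a BitTorrent client.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # assumed piece size; real torrents choose their own


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Return the SHA-1 digest of each fixed-size piece of data."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]


def corrupted_pieces(data, expected, piece_length=PIECE_LENGTH):
    """Return the indices of pieces whose hash differs from the expected one."""
    actual = piece_hashes(data, piece_length)
    if len(actual) != len(expected):
        raise ValueError("piece count does not match the expected hash list")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    original = bytes(range(256)) * 4            # 1,024 bytes of sample data
    hashes = piece_hashes(original, piece_length=256)
    damaged = bytearray(original)
    damaged[300] ^= 0xFF                        # flip one byte in the second piece
    print(corrupted_pieces(bytes(damaged), hashes, piece_length=256))
```

Because each piece is checked on its own, only the damaged piece has to be fetched again, not the whole file.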
On the software side, you need a BitTorrent server on each of the seed nodes (old data servers) and a BitTorrent client on each of the peer servers and pure client servers. Servers and clients are available for many operating systems and distributions.
Figure 2 shows a very simple BitTorrent layout for data migration. In this layout, I've chosen a slightly different route and made each data server on the old storage a seed server for a particular set of files. By creating the torrent files accordingly, each file is served by exactly one seed server and by no other.
On the new storage are BitTorrent clients that request the files. All of the files that need to be migrated from the old storage are given to the new storage servers. It's fairly easy to split the file list between the clients so that you don't waste time recopying the files.
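Splitting the file list can be done in many ways. One simple option, sketched below, is a greedy split by size: sort the files from largest to smallest and always hand the next file to the client with the least data assigned so far. The function name and the sample file list are hypothetical, and nothing here is specific to BitTorrent.

```python
import heapq


def split_file_list(files, n_clients):
    """Split (path, size) pairs into n_clients lists of roughly equal total size."""
    # Heap of (bytes assigned so far, client index); the lightest client is on top.
    heap = [(0, i) for i in range(n_clients)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(n_clients)]
    # Place the largest files first so the small ones can even out the totals.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, idx = heapq.heappop(heap)
        assignments[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return assignments


if __name__ == "__main__":
    # Hypothetical file names with sizes in gigabytes.
    files = [("a.dat", 10), ("b.dat", 7), ("c.dat", 5), ("d.dat", 4), ("e.dat", 2)]
    for client, paths in enumerate(split_file_list(files, 2)):
        print(client, paths)
```

Each file lands on exactly one client, so nothing is copied twice, and the clients should finish at roughly the same time if their network paths are similar.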
There are many ways to create a BitTorrent network to migrate files. I have not tested any of them, so I don't know whether BitTorrent can also copy over file attributes, but to be honest, I don't believe it can. If this is true, then you will need a way to recreate all of the attributes on the transferred files.
Very recently, it was announced that BitTorrent had begun alpha testing a new "sync" tool [18]. BitTorrent Sync [19] allows you to keep files between various servers in sync. The new tool also offers one-way sync [20], which could help in data migration. Keep an eye out for BitTorrent Sync because it could be a huge benefit to data migration.
7. bbcp
Other tools floating around can handle multistream data transfers. One of them, bbcp [21], can transfer files between servers using multiple streams and do recursive data transfers as well. As with other tools, BBCP has several options, and the main page describes them in a fair amount of detail. A web page at Caltech also describes BBCP in more detail and includes some network tuning parameters to improve performance [22].
BBCP has the ability to compress data during transfer (--compress), do a recursive transfer (--recursive), and preserve some metadata (--preserve). However, the BBCP main page mentions that the --preserve option only preserves the source file's mode, group name, access time, and modification time.
If your data has xattr metadata associated with it, you will lose that data if BBCP is used. However, you can always write a script to go back and add the xattr information.
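Such a script could look something like the following sketch, which walks the old tree and copies every extended attribute onto the matching path in the new tree. It assumes Linux (Python's `os.listxattr`, `os.getxattr`, and `os.setxattr` are only available there), assumes both trees are mounted on the same host, and uses hypothetical function names. The `xattr_api` parameter exists only so the logic can be exercised without a filesystem that supports xattrs.

```python
import os


def copy_xattrs(src, dst, xattr_api=os):
    """Copy every extended attribute from src to dst; return the names copied."""
    copied = []
    for name in xattr_api.listxattr(src):
        value = xattr_api.getxattr(src, name)
        xattr_api.setxattr(dst, name, value)
        copied.append(name)
    return copied


def copy_tree_xattrs(src_root, dst_root, xattr_api=os):
    """Copy xattrs for every directory and file under src_root; return a count."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        # "." stands for the directory itself, followed by the files in it.
        for entry in ["."] + filenames:
            src = os.path.normpath(os.path.join(dirpath, entry))
            dst = os.path.normpath(os.path.join(dst_root, rel, entry))
            count += len(copy_xattrs(src, dst, xattr_api))
    return count
```

Some attribute namespaces (for example, security or trusted attributes) can require root privileges to set, so expect to run a script like this as the superuser or to filter the names it copies.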
One limitation of BBCP is that it can't migrate directories by themselves. I have not tried BBCP, so I can't say too much about how to set it up and use it. However, it sounds promising if your data files have no xattr information.