Moving your data – It's not always pretty
Moving Day
Almost everyone reading this article has, at some time in their life, moved from one place to another. This process involves packing up all your possessions, putting them into some type of truck, driving them to the new location, unloading them, and ultimately unpacking everything. I've done it several times and, to me, it's neither fun nor pleasant. A friend of mine loathes moving so much that he has threatened just to sell or give away everything and buy new things for his new home. Honestly, I'm not far from his opinion, but at the same time, it's always interesting to see what you have accumulated over the years and, perhaps more interestingly, why you have kept it.
I think the same thing is true for data storage. At some point you're going to have to move the data from your existing storage solution to a new one, for any number of reasons: perhaps your current storage is getting old and falling out of warranty, or perhaps you need additional capacity and can't expand what you have. Of course, another option is just to erase (burn) all of the data on your existing storage solution.
Approach
I recently went to the Lustre User Group conference [1], and I was impressed with a presentation by Marc Stearman from Lawrence Livermore National Laboratory. The title of his talk was "Sequoia Data Migration Experiences" [2]. Sequoia is a large petascale system with 55PB of storage running Lustre on top of ZFS [3] and reaching about 850GBps. Marc talked about moving data from older systems to Sequoia and some of the issues the team faced (it wasn't easy). His talk inspired me to examine tools that can be used to migrate data.
One of the many ways to migrate data is a block-based approach, in which you copy sets of blocks rather than individual files, using something like a snapshot or even a Copy-on-Write (COW) filesystem that writes the data to the new storage solution. A block-based approach needs to be planned very carefully because you're working with blocks rather than individual files; however, it can simplify the process, particularly in preserving metadata, although it depends on the hardware of both the old and new storage.
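For instance, the crudest form of a block-based move, assuming both volumes are visible from a single host (the device names here are purely hypothetical), is a raw device copy:

$ dd if=/dev/mapper/old_vol of=/dev/mapper/new_vol bs=4M

The target must be at least as large as the source, and the source must not change during the copy, which is why snapshots or a COW filesystem usually figure into a real block-based plan.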
The other approach, and the one that I will use in this article, is file based. In this method, files are migrated with the use of a variety of tools and techniques. File-based migration is more prevalent because it's easier to track what is migrated. On the other hand, it can be more difficult because you handle each file individually.
I talked to a number of people at various HPC centers about data migration, and I found two general approaches to migrating data via files. The first approach is fairly simple – make the users do it. The second way is to do it for the users.
With the first option, you let the users decide what data is important enough to be migrated. You have to give them good, tested tools, but the point is that the users will move data that they deem important. The general rule of thumb is that users will keep all of their data forever, but one site in particular told me they've found that users don't necessarily migrate all of their data. (Sending the Google vernacular "+1" to their users.) However, I'm not always this optimistic, partly because I still have files from 1991 on my system and files from the late 1980s on floppy disks (once a pack rat, always a pack rat).
With the second approach – doing the migration for the users – migrating data takes a bit more planning and even a bit of hardware. The first thing you need to understand is the "state" of the data on your current storage. How much of it is changing and which user directories are being used actively? You should also examine the data files for permissions, ownership, time stamps, and xattr (extended attributes). To make your life easier, you can also mount the older data as read-only so you can copy it over without fear of its changing (Stearman ran into this problem).
Ultimately what you want is a list of files (full path) that can be migrated safely, along with the attributes that need to be preserved. Be sure to run some tests before migrating the data, so you don't lose any attributes or even any data.
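For example, one rough way to build such a list on Linux, assuming GNU find and the getfattr tool are available (the paths and output files here are just placeholders), is something like:

# Record path, mode, owner, group, and mtime for every file (hypothetical paths)
find /old_storage/data -type f \
    -printf '%p|%m|%u|%g|%TY-%Tm-%Td %TH:%TM\n' > /root/migrate_candidates.txt

# Separately dump the extended attributes for later comparison
getfattr -R -d -m - /old_storage/data > /root/xattr_inventory.txt

This is only a sketch; a real inventory would probably also capture directories, symbolic links, and hard link counts.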
Data migration can succeed under many possible hardware configurations; Figure 1 shows a somewhat generic configuration. On the right is the old storage configuration, which has one or more data servers connected to a switch (red box). The new storage is on the left, with one or more new data servers connected to a switch. In between the two switches is a single data mover. The function of the data mover is to act as the conduit between the old and new storage solutions and effectively "move data." You can have as many of these as you want or need.
The simplest way to move data is to make the data mover a client of the old storage system and a client of the new storage system. This might not be possible because of conflicting software requirements, but you can use a TCP/IP network and a server to NFS-export the storage to the data mover. Then, just copy the data between mountpoints. For example:
$ cp -a --preserve=all -r /old_storage/data /new_storage/data
Be warned that if you use NFS, you will lose all xattr information because no NFS protocol can transfer it. In this case, you might have to use something else (see below).
This simple example has many variations. For example, you could skip the data mover and just NFS-export the storage from an old storage data server and mount it on a data server on the new storage, but I think you get the basic concept that somehow you have to be able to move data from one storage system to the other.
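As a quick illustration of that variation (hostnames and paths are hypothetical), the export on the old storage's data server and the matching mount on whichever machine does the copying (the data mover, or a data server on the new storage) could look something like this, with the export made read-only per the earlier suggestion:

# /etc/exports on the old storage's data server; then run exportfs -r
/old_storage   datamover(ro,no_root_squash)

# on the machine doing the copy
$ mount -t nfs oldserver:/old_storage /mnt/old_storage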
Software Options
A number of tools can be used to migrate data, ranging from common tools, to those that are a bit rough around the edges, to tools that might not be able to transfer file attributes. Some are easier to use than others, but all of them potentially can be used to migrate data, depending on your requirements.
To begin, I'll examine the first, and probably most obvious, tool that comes to mind: cp.
1. cp
In the simple layout shown in Figure 1, the most obvious way to migrate data is to use the data mover to mount both storage solutions. Then, you can just use cp between the mountpoints, as demonstrated in the cp command above. It's all pretty straightforward, but you need to notice a few things.
As I mentioned earlier, if either one of the storage solutions is NFS-mounted on the data mover, you will lose all of your xattr data. There is a way around this problem, but it's not easy and involves scripting. The script reads the xattr data from the file on the old storage and stores it in some sort of file or database. The cp command then copies the file from the old storage to the new storage. On the new storage, you need to "replay" the xattr data from the file or database for the file, and you have effectively copied the xattr. If you take this approach, be sure to test the script before applying it to user data.
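One way to sketch that save-and-replay step is with getfattr and setfattr from the attr package (paths here are hypothetical, and this handles only the xattrs; run it against scratch data before touching anything real):

# On the data mover, before the copy: dump all xattrs with paths relative to the mountpoint
$ cd /old_storage && getfattr -R -d -m - data > /root/xattr_dump.txt

# ... copy the files with cp -a between the mountpoints ...

# After the copy: replay the recorded xattrs on the new storage
$ cd /new_storage && setfattr --restore=/root/xattr_dump.txt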
Also notice that I used the -a option with the cp command [4]. This option stands for "archive," and it will do the following:
- Copy symbolic links as symbolic links.
- Preserve all of the attributes (mode, ownership, timestamps, links, context, xattr) because I used the --preserve=all option. If you don't use this option, it will default to mode, ownership, and timestamps only.
- Recursively copy all files and directories in subtrees. I also specified this with the -r option, but the archive option, -a, invokes the recursive copy option as well.
I use this option because it preserves all of the attributes, including those SELinux uses (if you haven't guessed, I'm using Linux for my examples).
If you use cp, note that it is a single-threaded application. Consequently, it is not likely to saturate the network, which wastes resources and makes the migration take a very long time. If you have a nice data-mover server, you likely have multiple cores that aren't really doing anything, so why not take advantage of them?
The easiest way to take advantage of the number of cores is to run multiple cp commands at the same time, either on specific files or specific subtrees (start at the lowest branch and work upward to root). This isn't too difficult to script if you have a list of files that need to be migrated: Just start a cp command on the data mover for each core.
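A minimal sketch of that idea, assuming a prepared (and hypothetical) migrate_list.txt with one path per line relative to /old_storage, could use xargs to keep one cp per core busy:

# Launch up to eight cp processes at a time, one input path each (hypothetical paths)
$ cd /old_storage
$ xargs -a /root/migrate_list.txt -d '\n' -P 8 -I{} cp -a --parents -- {} /new_storage/

The --parents flag recreates each file's directory structure under the destination.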
Because you're a good HPC admin, you also know to use the numactl command to pin the process to the core, so you don't start thrashing processes all over the place. Just write the script so that when a cp command is issued on a particular core, it grabs the next file that needs to be migrated (just split the list of files across the cores). A second way to take advantage of the extra cores is to use a distributed file copy tool.
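Before looking at distributed copy tools, here is a rough sketch of the pinned, per-core cp idea just described. It assumes the same hypothetical file list as above, split into one chunk per core, and that numactl is installed on the data mover; treat it as an outline rather than a finished script:

#!/bin/bash
# Split the (hypothetical) file list into one chunk per core,
# then pin one cp stream to each core with numactl.
NCORES=8
split -n l/$NCORES /root/migrate_list.txt /tmp/chunk.

core=0
for chunk in /tmp/chunk.*; do
    numactl --physcpubind=$core bash -c '
        cd /old_storage || exit 1
        while IFS= read -r f; do
            cp -a --parents -- "$f" /new_storage/
        done < "$1"
    ' _ "$chunk" &
    core=$((core + 1))
done
wait    # block until every pinned cp stream has finished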