Facebook Moves Data Center: How to Migrate 30 PB of Stored Data
In spring 2011 Facebook ran out of power and space in its data center, and it became clear the company had to move to a new one. According to a post on Facebook's engineering blog, the stored data had grown from 20 PB to 30 PB in roughly one year; that is 30 million gigabytes, or about 3,000 times the size of the Library of Congress. The data is stored in HDFS, the distributed filesystem of the Apache Hadoop project.
Physically moving the nodes from one data center to the other was not an option, because it would have meant significant downtime that Facebook could not afford. Instead, the engineers opted for a strategy of continuously replicating the data to the new data center: first they copied all of the data over, then they used a custom replication system to keep the copy up to date. For the bulk copy they relied on Hadoop's own DistCp tool, which runs the copy as a MapReduce job so that the work is parallelized as much as possible.
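The core idea behind that kind of bulk copy can be illustrated with a minimal sketch: split the file listing into independent copy tasks and run them in parallel. This is not DistCp itself, which distributes the work as a MapReduce job across the cluster; the paths, worker count, and helper names below are assumptions for illustration only.

    # Conceptual sketch of parallelized copying (not DistCp itself).
    import shutil
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def copy_one(src: Path, src_root: Path, dst_root: Path) -> Path:
        """Copy a single file, preserving its path relative to the source root."""
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        return dst

    def parallel_copy(src_root: Path, dst_root: Path, workers: int = 16) -> None:
        """Fan the file listing out over a pool of workers, DistCp-style."""
        files = [p for p in src_root.rglob("*") if p.is_file()]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for dst in pool.map(lambda p: copy_one(p, src_root, dst_root), files):
                print(f"copied {dst}")

    # Example (hypothetical paths):
    # parallel_copy(Path("/data/old_cluster"), Path("/data/new_cluster"))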
Facebook's own replication tool used a plugin for the Hive software to monitor the distributed filesystem for changes and record them in a log file. Based on that log, the replication software could synchronize the changes to the new data center. At switchover time the Facebook engineers shut down Hadoop's JobTracker to stop the old filesystem from being modified. Then they changed the DNS entries and started the JobTracker in the new data center.
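Very roughly, the log-based replication step amounts to replaying recorded filesystem changes against the destination until both sides match. The log format (one "CREATE path" or "DELETE path" entry per line) and the helper names in this sketch are assumptions for illustration; Facebook's actual tool hooked into Hive and HDFS internals.

    # Hedged sketch of replaying a change log against the destination.
    import shutil
    from pathlib import Path

    def replay_change_log(log_file: Path, src_root: Path, dst_root: Path) -> None:
        """Bring the destination up to date by replaying logged changes in order."""
        for line in log_file.read_text().splitlines():
            op, _, rel_path = line.partition(" ")
            src = src_root / rel_path
            dst = dst_root / rel_path
            if op == "CREATE" and src.is_file():
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)        # new or modified file: copy it across
            elif op == "DELETE":
                dst.unlink(missing_ok=True)   # removed at the source: drop the replica

    # Example (hypothetical paths and log file):
    # replay_change_log(Path("changes.log"), Path("/data/old_cluster"), Path("/data/new_cluster"))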
More details on the migration can be found on Facebook's engineering blog.