Flexible backup for large-scale environments
Bats in the Data Center
A network of universities (the Universities of Ulm, Constance, and Tübingen in Germany) was looking for a new backup solution, but the requirements were tough. The new solution not only had to offer a simple and predictable license model but also had to cope with billions of files and petabytes of data in all kinds of international character sets on thousands of computers running all kinds of operating systems at several locations in the country. Open and documented formats and interfaces were essential to ensure permanent, at least read-only, access to the data, and the solution had to allow the continued use of existing tape drives, including a large tape library.
In the end, the people in charge opted for a proof of concept based on Bacula Enterprise Edition [1] by Switzerland's Bacula Systems [2] and, by doing so, for a combination of open source software with extension modules and commercial support.
Starting Point
The backup software originally used was IBM's Tivoli Storage Manager [3]. However, a review revealed that the license costs were difficult to calculate: There was no central overview of how many CPUs in the decentralized systems at the branch offices were included in the backup, nor of the data volume allowed for each CPU, and the figures could change at any time. Inventorying the hardware data and constantly updating the figures looked like a very labor-intensive and time-consuming task, which is why license models based on volume or processor performance are not a good fit for university operations.
The current hardware comprises two x86 servers running Solaris 11.4, which have access to three disk arrays (just a bunch of disks, JBODs). Each array comprises 90 SAS2 disks, each with a capacity of 12TB, connected by several controllers. ZFS provides the required filesystem redundancy. Each JBOD accommodates a zpool (one or more virtual devices, vdevs), which serves as a disk cache for one of the participating universities.
A tape library with more than 2,000 slots and eight IBM 3592-60 [4] tape drives completes the storage system. Each drive can write up to 20TB of uncompressed data to a tape. Hosts are connected to the tape drives over two redundant Fibre Channel paths at 32Gbps. Another storage server at the Tübingen site serves as a local backup-to-disk medium for selected systems, without a connection to the tape drives in Ulm; however, it would be easy to implement a link at a logical level with Bacula if required.
Bacula Architecture
The Bacula system comprises, among other things, what is known as a Director, which is responsible for controlling all the processes. The Director controls the tape library, initiates client backups, migrates data from disk to tape, and restores data.
To handle this task, the Director always needs to know what versions of which files have been saved by which clients and where the data is stored in the disk cache or on which tape it is stored at which location. This information is stored in a PostgreSQL database, which currently runs to 1.4TB and uses NVMe devices for performance reasons. Experience has shown that the database server also needs plenty of RAM – 512GB in this specific case.
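In the Director configuration, this catalog is declared as a Catalog resource. A minimal sketch – the resource name, host, and credentials here are hypothetical, not taken from the Ulm setup – might look like this:

# Catalog resource in bacula-dir.conf: where the Director finds its metadata
Catalog {
  Name = uniulm-catalog                  # hypothetical name
  dbname = "bacula"
  dbuser = "bacula"
  dbpassword = "<XXXXXXXXXXXXXXXXXX>"
  DB Address = catalogdb.rz.uni-ulm.de   # hypothetical PostgreSQL host
}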
Almost all the processes within the Bacula system (Figure 1) are driven by the Director and its configuration. The only exception to this is the backup encryption, which is handled by the clients themselves. All of the Bacula components can be configured with simple text files that contain full details of the objects and parameters. The first category of objects is the client systems for which addresses, backup jobs, and file sets need to be defined. It makes sense to classify the systems by size and to pursue different backup strategies as a function of this attribute.
On some systems, too many files or too large a data volume prohibit a conventional backup. The initial backup of a mail system with more than 100 million files would take more than three weeks, running parallel to normal operations. Above all, however, restoring the system in an emergency would take a similar amount of time – several days at least. These clients therefore require alternative strategies.
The next category of clients comprises systems that are so large or slow that a full backup would take several days, but on which an incremental backup can be completed in a single night. From experience, the limits of this category in the Ulm installation were defined as follows: The size of a full backup is between 500GB and 25TB, and an incremental backup takes less than eight hours. Experience has shown that such systems can be restored within three to five days. However, it is advisable to test this regularly, because the bottlenecks usually originate outside the backup system.
One of Bacula's features, known as Virtual Full backups, helps back up such clients. After an initial backup, these systems never run full backups again – only daily incremental backups. Additionally, Virtual Full backups are performed every two months, which Bacula creates from the data already available on the server without any intervention on the part of the client. A Virtual Full backup consolidates the last (virtual) full backup with the incremental and differential backups created by the client since then. A Virtual Full backup can be seven or more times faster than a physical full backup.
This arrangement results in two scheduling classes. On large servers, the client only creates daily incremental backups, with a Virtual Full backup every two months and a differential backup in between. On smaller systems, monthly full backups are created in addition to the daily incremental backups.
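Translated into Bacula terms, the two classes could be expressed as two Schedule resources. The resource names and times below are hypothetical, not the actual Ulm configuration:

# Class 1: large servers - incrementals daily, Virtual Full every two months
Schedule {
  Name = large-server-sched            # hypothetical name
  Run = Level=VirtualFull 1st sun jan,mar,may,jul,sep,nov at 05:00
  Run = Level=Differential 1st sun feb,apr,jun,aug,oct,dec at 05:00
  Run = Level=Incremental 2nd-5th sun at 05:00
  Run = Level=Incremental mon-sat at 05:00
}

# Class 2: smaller systems - conventional monthly full backup
Schedule {
  Name = small-server-sched            # hypothetical name
  Run = Level=Full 1st sun at 05:00
  Run = Level=Incremental 2nd-5th sun at 05:00
  Run = Level=Incremental mon-sat at 05:00
}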
All backup data, whether created by the client or by virtual backups, first ends up in the disk cache. Full backups older than one month or larger than 32GB are then migrated to tape; incremental and differential backups, on the other hand, remain on the disks. This procedure reduces restore times, because a restore usually affects data that is actively being worked on and is therefore available in the last incremental backups. The full backups needed for a complete restore can be read from tape at maximum speed, because they were written in one piece and not fragmented during migration.
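Bacula implements this kind of staging with migration jobs. The sketch below only illustrates the idea under assumed pool and job names: the age criterion maps to Selection Type = PoolTime, whereas a size criterion such as the 32GB threshold would need a different selection type, for example, SQLQuery.

# Disk cache pool: jobs land here first; full backups move on after a month
Pool {
  Name = disk-cache-pool               # hypothetical name
  Pool Type = Backup
  Storage = jbod1-storage              # disk cache storage (see below)
  Next Pool = tape-pool                # migration target
  Migration Time = 31 days
}

# Migration job: moves jobs older than Migration Time from disk to tape
# (depending on the version, Bacula may also require placeholder
# Client/FileSet/Messages directives in a Migrate job)
Job {
  Name = migrate-old-fulls             # hypothetical name
  Type = Migrate
  Pool = disk-cache-pool
  Selection Type = PoolTime
}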
Configuration
As already mentioned, the Director controls all the processes as the central instance, which is why a large share of the configuration files resides there. The storage daemon is the server responsible for moving all data from the client to the disk cache, from the disk cache to tape, and vice versa. Client data is retrieved by the file daemon, an agent running on the client that the Director contacts.
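On the Director side, each storage daemon is addressed through a Storage resource. A minimal sketch with hypothetical names and addresses:

# Storage resource in bacula-dir.conf for the disk cache on one JBOD
Storage {
  Name = jbod1-storage                 # hypothetical name
  Address = backup1.rz.uni-ulm.de      # hypothetical storage daemon host
  Password = "<XXXXXXXXXXXXXXXXXX>"
  Device = jbod1-dev                   # must match a Device in bacula-sd.conf
  Media Type = File
  Maximum Concurrent Jobs = 20
}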
The entire directory tree of the configuration could reside in /opt/bacula/etc/conf.d; for example:

# ls /opt/bacula/etc/conf.d/
clients.d           pools.d
filesets.d          schedules.d
jobs.d              storage.d
macos_excludes.inc  unix_excludes.inc
messages.d          windows_excludes.inc
[...]
The object definitions can be clearly divided into clients, jobs, data pools, backup schedules, and so on. These categories can be further subdivided, for example, by grouping the clients by organizational unit. There are few limits to what you can do.
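How the Director pulls in this tree is not shown here; one common approach – sketched below with hypothetical paths – is Bacula's @ include mechanism in bacula-dir.conf, which can include single files or run a command to enumerate whole subdirectories:

# bacula-dir.conf: include a single file ...
@/opt/bacula/etc/conf.d/storage.d/tape_library.conf
# ... or generate the include list for a whole subdirectory on the fly
@|"sh -c 'for f in /opt/bacula/etc/conf.d/clients.d/*.conf; do echo @$f; done'"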
In the client definition, you can specify the properties of the associated backup job, the schedule on which it runs, and whether or not scripts are executed on the client before and after the backup. You can also specify which files are backed up and which are excluded from the backup. The example in Listing 1 shows the definition for backing up a PostgreSQL server, where Bacula leaves out the database files and simply retrieves database dump files.
Listing 1
Client Configuration
Client {
  Name = dbserv
  Address = dbserv.rz.uni-ulm.de
  Password = "<XXXXXXXXXXXXXXXXXX>"
  @/opt/bacula/etc/conf.d/clients.d/client_defaults.inc
}

Job {
  Name = dbserv-bck
  Client = dbserv
  JobDefs = kiz-job
  FileSet = dbserv-fileset
  Schedule = daily-0500-sched
  Messages = dbserv-msg
  Client Run Before Job = "/pgsql/scripts/preschedule"
  Client Run After Job = "/pgsql/scripts/postschedule"
}

Messages {
  Name = dbserv-msg
  @/opt/bacula/etc/conf.d/messages.d/default_message.inc
  MailOnError = hostadm@uni-ulm.de = all, !skipped
}

FileSet {
  Name = dbserv-fileset
  Ignore FileSet Changes = yes
  Include {
    Options {
      exclude = yes
      @/opt/bacula/etc/conf.d/unix_excludes-exclude_cache_dir_too.inc
      WildFile = "/pgsql/log/*"
      WildFile = "/pgsql/logs/*"
      WildFile = "/pgsql/sw/logs/*"
      WildFile = "/pgsql/tmp/*"
    }
    Options {
      @/opt/bacula/etc/conf.d/filesets.d/default_fileset_options.inc
    }
    File = /pg_backup
    File = /pg_journal
    File = /pgsql
  }
  Exclude {
    @/opt/bacula/etc/conf.d/kiz_unix_file-excludes.inc
    File = /pgsql/sw/tapes/build
    File = /pgsql/data
    File = /pgsql/xlog
    File = /pgsql/tsm
  }
}
The counterpart to this configuration file on the Director is the file daemon configuration on the client side (Listing 2). It must contain the correct password and also defines the logging behavior. Encryption can be configured here as well, if required; the Director then has no access to the data or the keys. However, metadata, such as file names, is available in the clear, because this information is essential for the system to work.
Listing 2
File Daemon Configuration
FileDaemon {
  Name = "dbserv"
  FDAddress = dbserv-fl-m.rz.uni-ulm.de
  FDport = 9102
  WorkingDirectory = /opt/bacula/working
  Pid Directory = /opt/bacula/working
  Maximum Concurrent Jobs = 4
}

Director {
  Name = uniulm-dir
  Password = "<XXXXXXXXXXXXXXXXXX>"
}

Messages {
  Name = Standard
  director = uniulm-dir = all, !skipped, !restored
}
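For the client-side encryption mentioned above, Bacula provides PKI directives in the FileDaemon resource. A minimal sketch, with hypothetical key paths that are not part of the Ulm configuration:

# Additions to the FileDaemon resource for data encryption on the client
FileDaemon {
  Name = "dbserv"
  # ... directives from Listing 2 ...
  PKI Signatures = Yes                           # sign the backup data
  PKI Encryption = Yes                           # encrypt data before it leaves the client
  PKI Keypair = "/opt/bacula/etc/dbserv.pem"     # hypothetical client certificate + key
  PKI Master Key = "/opt/bacula/etc/master.cert" # hypothetical recovery key (public part)
}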