Backup tools




Imagine you've got a PC at home, and you've just installed Linux in a spare partition. You've been playing with graphics tools like The Gimp, and you want to be sure your flaky hard disk isn't going to lose your masterpiece.

Alternatively, imagine you're the senior system administrator in charge of a bet-the-company mainframe installation: an IBM zSeries mainframe with ten active processor units, a 40-terabyte storage array, and about another hundred terabytes of cheaper but slower network-attached storage. If you lose data, your company may lose millions -- because all its UNIX applications have been consolidated onto this mainframe, which is running Linux. You'll probably lose your job into the bargain.

Both these scenarios revolve around Linux, and Linux backup tools. Whether you just want to back up user files on a home machine, or disaster-proof a corporate mainframe, all Linux distributions come with tools to make the job easier. In this feature we're going to examine the tools, albeit with an emphasis on the small jobs (because more Shopper readers own PCs than mainframes).

What needs backing up, anyway?

Linux (like other UNIX-type systems) stores data in files in a filesystem. A filesystem is an organising schema applied to data in a partition on a hard disk. Hang onto this distinction -- it's important! We can back up both the data contained in the files (on a file by file basis) and the entire filesystem (as a lump, by copying a raw disk partition). But there's nothing else to back up -- unlike Windows, Linux doesn't have a registry that exists outside the filesystem.

There are just two situations in the UNIX world where you'll find data that isn't stored in the filesystem. The first is data that only exists in memory allocated to a running program (or process). This stuff simply can't be backed up without taking extraordinary measures -- but it's virtually never used for anything other than transient working data (such as a password for accessing encrypted files). The other situation is data in a raw disk partition (one with no filesystem). A few applications -- notably databases such as Oracle -- prefer to read and write disk blocks directly, rather than storing data in files. In this case, tools exist that let a Linux system administrator dump the entire partition to tape or another disk (dd, described later, is the main one), but for more selective backup situations you'll need to use the application's own tools.

So in general we can focus on backing up and restoring files or filesystems, and treat everything else as a special case.

While it's possible to install an entire Linux system in a single filesystem, this isn't necessarily a good idea. Extra filesystems give extra flexibility. For example, if we put the directories belonging to users in a separate filesystem mounted on /home, we can unmount it (when nobody's using the machine) and dump it to tape or CD, certain that nobody is working with a file on the partition while it's being backed up. Again: consider what happens if a power failure takes down the machine. When we reboot, having multiple filesystems compartmentalises any filesystem damage: any of /home or /opt or /var (mounted on different filesystems) may be damaged, but if one of them is lost the system as a whole can be rebuilt quite easily. In contrast, a single filesystem that stores everything is much more vulnerable to filesystem damage. The laptop this article is coming to you from has four filesystems: one for /boot (where boot files and kernel images live), one for / (where the core OS software lives), one for /home (to make repairing user files easier), and one for /var (where spool files and some configuration stuff lives).
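
Just to make that concrete, a hypothetical /etc/fstab for such a four-filesystem layout might look something like the following (the device names and filesystem types are examples only -- yours will depend on how the disk was partitioned):

  # /etc/fstab -- hypothetical four-filesystem layout
  /dev/hda1   /boot   ext3   defaults   1 2
  /dev/hda2   /       ext3   defaults   1 1
  /dev/hda3   /home   ext3   defaults   1 2
  /dev/hda4   /var    ext3   defaults   1 2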

In addition to partitioning into separate filesystems, we've got the question of what files to back up -- or whether to back up in the first place. If you're a home user, and you're using a stock Linux distribution, you probably don't want to bother backing up anything except your home directory and personal files, the configuration files in /etc and /var, and any special applications you installed in /opt. That's because backups take time, and use tapes (or disks) that cost money. If a home PC goes down, it's probably easier to re-install the operating system and restore the user data, than to laboriously restore the OS (only to discover that the new hard disk is larger, and the partitioning scheme doesn't match perfectly, or something). Note that per-user application configuration files are almost always stored in the user's home directory (because applications for UNIX systems, being designed for multi-user operation by default, find this the most logical place to store them). So there's not usually much of a problem chasing down user settings for, say, StarOffice -- just back up the whole home directory and it's there.

Large servers have a different set of problems. Users frequently delete files by accident then ask for them to be restored. A file-level backup (rather than partition-level backup) makes this sort of request easier to handle, but it tends to take longer and be less efficient. In addition, a complete backup of the operating system image may be useful if you run a large server, because you know you can get an exact replacement for the damaged machine and time is of the essence.

When planning backups, the starting point should always be to ask what you need to get back after a disaster, how fast you need it, and whether it's got to be accessible from some other Linux system or just an exact replica of the one that succumbed to disk failure or abuse.

Backing up files

Backing up individual files or directories on Linux is usually done from the command line using a number of tools -- the preferred one being GNU tar. tar (short for Tape ARchive) is the traditional UNIX tool for distributing files; while older versions of tar had some shortcomings, the GNU version of tar (which comes with almost all Linux distributions) is a robust, flexible, and powerful archiving tool.

Tar's job is to take a bunch of files in a filesystem, and serialise them -- emitting a stream of data consisting of file descriptions (name, date stamp, and so on) and file contents. This stream can end up in a file (called a tar archive), as with the more familiar Windows or Mac archivers (WinZip or Stuffit). However, it can also end up on a raw disk partition, or on a magnetic tape instead -- unlike the Windows/Mac archivers, it isn't limited to creating another file. Going in the opposite direction, tar can read through a tarfile or a tape until it sees a file description, create a file of that name, and then copy the subsequent blocks on the tape into the new file -- if necessary repeating the operation many times.

On UNIX (and Linux) everything is a file. As a result, there are some weird files! Interfaces to hardware peripherals -- such as disks, tape drives, terminals, Palm Pilots, and sound cards -- look like files; by convention they're stashed in the special directory /dev. The /proc filesystem contains files that actually provide access to the allocated memory of running programs. And UNIX provides real files, called sparse files, that have huge gaps in them -- holes spanning regions where no disk space is actually allocated. Older versions of tar didn't handle these, but GNU tar can save device files and store sparse files efficiently. In addition, GNU tar can compress its output (or decompress its input) using the gzip or bzip2 compression algorithms, or external compression filters. At its best, tar with gzip compression produces archive files that are about 15% smaller than a Windows ZIP archive containing the same files.

Tar is a command-line tool. To create an archive containing the contents of, say, the directory /home, you issue a command like this:

  tar -cf /tmp/home-backup.tar /home

(Where the 'c' flag means 'create' and the 'f' flag means 'save to the following filename'). You could equally well dump the archive onto a SCSI DAT tape drive connected to the device file /dev/st0:

  tar -cf /dev/st0 /home

To add gzip compression (squishing the output archive as it's created) and a verbose listing of what's going into the file, add the 'z' and 'v' flags:

  tar -cvzf /dev/st0 /home

You can scan through a tar archive, listing its contents, using the 't' (list contents) and 'v' (verbose) flags:

  tar -tvf my-archive.tar

Or, for a compressed tar archive:

  tar -tvzf my-archive.tar.gz

Listing an archive like this also doubles as a basic integrity check, because tar has to read all the way through the archive to do it. To actually extract files, you use the 'x' (extract) mode flag:

  tar -xvzf my-archive.tar.gz 

This defaults to unpacking the contents of my-archive.tar.gz in the current directory.

There are a lot more things you can do with tar: archive files by name rather than directory, strip directory information, dereference symbolic links, exclude named files from the archive (archive everything under /home except files called core or '*~', for instance), only archive files on one named filesystem, and so on. The various options are listed in the man page (type 'man tar' at a command prompt) or via the GNU info help system. The main points to remember about GNU tar are that it's a successor to (and superset of) the original tar archiver, it's used for archiving files and directories on a named basis, and it's a command-line tool -- usually used with a backup script or scheduler (see "Backup schedulers" below).
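
To make a couple of those options concrete, here's a hedged example (the exclude patterns and tape device are the same ones used above; --exclude and --one-file-system are GNU tar options):

  # archive /home to tape, skipping core dumps and editor backup files,
  # and refusing to cross into other filesystems mounted below /home
  tar -cvzf /dev/st0 --exclude='core' --exclude='*~' --one-file-system /home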

Before moving on to look at partition-level backups, it's worth glancing at the other UNIX file-level backup tools -- they're not so often encountered, but they're still there.

The earliest versions of tar had shortcomings: they couldn't do compression on the fly, were unable to do multi-volume archives (where a large archive of, say, 32 Mb of data could be written to a series of 1.44Mb floppy disks or 10Mb tapes), and didn't get on well with device nodes or other weird file types. To address these shortcomings, a second file-level archiver -- called cpio -- was created.

cpio differs from tar in that it can cope with device nodes, can block its output to span multiple output media (disks or tapes), uses a different archive output format, and has an arcane command line syntax that differs from anything else you'll ever meet on UNIX -- rather than specifying an archive destination and a bunch of filenames to put in the archive, cpio reads a list of files to process from its standard input. It was designed to be used with other tools, such as find: for example:

  find /home -type f -name '*.sdw' -print | cpio -o > /dev/st0

The find command -- everything up to the pipe '|' symbol -- searches under /home for anything of type 'f' (a regular file) named '*.sdw' (a StarOffice Writer file), and prints its name into a pipe. The pipe is read by a cpio process that generates archive output and squirts it onto /dev/st0. So this command selectively backs up all StarOffice Writer files under /home onto a SCSI tape drive.
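
Going the other way uses cpio's copy-in mode. A minimal sketch, assuming the archive is sitting on the same tape drive:

  # list the contents of the archive on the tape
  cpio -itv < /dev/st0
  # extract it, creating directories as needed
  cpio -idv < /dev/st0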

Just to muddy the waters, the GNU version of cpio can read and write GNU tar archives as well as cpio archives, and GNU tar can cope with device nodes and multi-volume archives! In point of fact, there's no longer much to choose between them for file-level backup purposes -- the distinction is now historical. cpio's more complex syntax does, however, make it easier to write scripts that generate lists of specific filenames to back up (compared to tar, which expects a list of filenames on its command line).

Before the GNU versions of tar and cpio overlapped at the edges, some attempts were made to merge the two. The result was a program called pax, the portable archive exchanger -- pax can read and write both cpio and tar archive formats and has its own command syntax. However, pax isn't commonly found on Linux distributions (which generally prefer the GNU solution to any given problem).

Note that both tar and cpio attempt to honour UNIX filesystem access permissions. You can back up files that are readable by your own user ID, and you can restore into directories that you can write to. Root (the administrative super-user) can back up files belonging to other users while preserving their ownership attributes in the archive. Because the owner and group IDs are stored as numeric identifiers, if your UID on the system you back up from is, say, 501, and you restore the archive (with root privileges) on a machine where your UID is 502, don't be surprised if the files show up as being owned by someone completely different! This shortcoming is worked around in a network environment by using a system like NIS+ that provides a central service for setting usernames and user IDs, but it can bite you when moving tar or cpio archives between unrelated machines. For reasons of security, files owned by an unknown UID are treated as if they're owned by nobody (the lowest, least-trusted permissions apply).
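
As a sketch of what that means in practice, a root-run backup and restore of /home might look like this (GNU tar; the 'p' flag asks for permissions to be restored exactly, which is in any case the default when extracting as root):

  # as root: archive /home, recording ownership and permissions in the archive
  tar -czf /dev/st0 /home
  # as root: restore it, re-applying the stored permissions and owners
  tar -xpzf /dev/st0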

Finally, various DOS and MacOS based archivers have dipped a toe in the UNIX waters from time to time. On the Mac side, Aladdin Systems, makers of the Stuffit Deluxe compression suite, have released a free (but not open source) binary-only Stuffit extractor for Linux; there's also a free (as in free software) Stuffit expander project, although it can't cope with the latest Stuffit file format version. From the DOS side of things, the Info-ZIP project's free implementation of the ZIP compression/archiving tool runs happily on Linux and can be found in all distributions. There are also (usually free) ports of the Zoo, LHarc, and Arj archivers from the DOS world (although ZIP is the de facto standard). If all you need to do is to back up some personal files, and if you're used to using pkzip, you may find the Linux zip and unzip commands reassuringly familiar.

Backing up partitions

UNIX and Linux provide a couple of tools for backing up entire partitions. The two commands to know about are dump and restore (which dump an entire filesystem to tape, and restore it as an image), and the dreaded dd.

Dump scans files on a Linux ext2 (or ext3) filesystem and determines which files need to be backed up. These files are copied to a backup disk, tape or other storage medium for safekeeping; dump can also send its output to a remote machine over TCP/IP using the rmt (remote magtape protocol) tool.

A dump that is larger than the output medium can be broken into multiple volumes. Dump can also execute scripts at the end of each volume (for example, to cause a tape robot to unload the drive and load the next tape in a dump sequence).

The biggest difference at the user level between dump and a file-level tool like tar or cpio is that dump is designed to do incremental backups. Dump recognizes up to ten backup levels; if you tell it to dump at level 7, it will back up only files that are new or have changed since the last backup at level 6 or lower. Because most of the files in a filesystem aren't changed regularly, the idea is that you start by making a level 0 -- complete -- backup. You then cycle through a series of incremental backups at different backup levels: a detailed plan for executing these is given in the dump manpage (type 'man dump' at a shell prompt to read it). If you follow this dump schedule, which provides a sequence of monthly, weekly, and daily backups, then in the event of a disaster you can restore your filesystem by loading three tapes -- the most recent monthly, weekly, and daily dumps.
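
A minimal sketch of how the levels work in practice, assuming /home lives on /dev/hda3 and the tape drive is /dev/st0 (the 'u' flag records each dump in /etc/dumpdates so that later levels know what has changed):

  # Sunday: full (level 0) dump of the /home filesystem
  dump -0u -f /dev/st0 /dev/hda3
  # Monday: level 1 dump -- only what has changed since the level 0
  dump -1u -f /dev/st0 /dev/hda3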

Restore does the opposite of dump. You create a partition, format it using mke2fs, mount it, then run 'restore -rf /dev/st0' from the root of that filesystem to rebuild its contents from the dump tapes in the drive at /dev/st0.
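
Putting those steps together, a from-scratch restore of a /home filesystem might look something like this (the device names are the same hypothetical ones as above):

  # recreate the filesystem, mount it, then rebuild it from the dump tape(s)
  mke2fs /dev/hda3
  mount /dev/hda3 /home
  cd /home
  restore -rf /dev/st0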

Note that while you can restore individual files, it's a pain in the neck; you feed the dump tapes to 'restore -i', which scans an index of the entire backup and then gives you an interactive shell that lets you cd around the directory tree of the dump, selecting files to extract. Alternatively, you can name individual files to restore on the command line -- but you may need to work your way through two or three backup levels to find the version you want. Either way, there's lots of tape or disk swapping involved.

Dump comes into its own when you're dealing with a big system -- possibly a network of workstations that need to be backed up to a central tape robot on a regular basis. We'll discuss network backups under "Network backup systems" (below). Shell scripts to automate the dump process can be written to cycle through filesystems and perform incremental backups regularly. You can find out more about dump and restore at the dump project page (dump.sourceforge.net). One example of a dump automation script (which also works with cpio and tar) is flexbackup. Basically, it reads a commented configuration file (actually a Perl file that's included by the Perl code that runs it!) to find out where to put the backups, what to back up, how to log results, and so on. It then provides a much friendlier command line interface for backup and restore operations.
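
A minimal sketch of the sort of home-grown automation script meant here -- it dumps a couple of filesystems at a level given on the command line (the devices are assumptions, and a real script would add error checking, logging, and tape handling):

  #!/bin/sh
  # usage: nightly-dump.sh <level>      e.g. nightly-dump.sh 3
  LEVEL=${1:-0}
  TAPE=/dev/nst0                   # non-rewinding tape device, so the dumps stack up
  for FS in /dev/hda3 /dev/hda4    # say, the filesystems holding /home and /var
  do
      dump -${LEVEL}u -f $TAPE $FS
  done
  mt -f $TAPE rewoffl              # rewind and eject the tape when we're done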

In contrast to tar and cpio (which archive files) or dump (which is a backup tool that dumps/restores filesystems), dd is really low-level: it's a tool for copying disk or tape data blocks. dd has no concept of files or inodes -- all it knows about is a raw block device like a partition or a tape drive. And it uses a bizarrely deviant syntax on the command line (which was allegedly the idea of Ken Thompson, who wrote it to emulate the horrors of IBM OS/360 just to remind the early UNIX developers what they'd escaped from).

To dump a filesystem in 1024-byte chunks onto a SCSI tape drive you might do something like this:

  dd if=/dev/hda1 of=/dev/nst0 bs=1024

This reads from the input file (if) /dev/hda1, writes to the output file (of) /dev/nst0, using a block size (bs) of 1024 bytes.

To create a file consisting of 100,000 4KB blocks of random data:

  dd if=/dev/urandom of=/tmp/lots_of_noise bs=4096 count=100000

(This tells dd to copy 'count' blocks of 'bs' bytes from the input file 'if' to the output file 'of'.)

To copy the contents of a magnetic tape into a partition, converting from IBM EBCDIC encoding to ASCII and skipping a 380-byte tape header (with bs=380, skip=1 skips exactly one 380-byte block):

  dd if=/dev/nst0 of=/dev/hdb3 conv=ascii bs=380 skip=1

And so on.

dd is a disk duplicator. For example, you can duplicate a floppy disk (via an intermediate image file) like this:

  # insert floppy disk in first 1.44MB floppy disk drive -- /dev/fd0h1440
  dd if=/dev/fd0h1440 of=/tmp/disk.img
  read -p "Eject the floppy and stick the new one in" TMP
  dd if=/tmp/disk.img of=/dev/fd0h1440

(The 'read' command just waits for you to hit a key before continuing with the second dd command, to copy the disk image back onto the new medium.)

Or you can take an image of a floppy disk and mount it as a loopback filesystem, so you can browse its contents without the disk in the drive:

  TMP=$$                                    # use the shell's process ID as a unique suffix
  dd if=/dev/fd0h1440 of=/tmp/disk.img.$TMP
  mkdir /mnt/$TMP
  mount -t auto -o loop /tmp/disk.img.$TMP /mnt/$TMP

And, of course, you can dump whole partitions to tape using dd. (Did I explain that it also makes the tea?)

The drawback of dd is that it has no concept of "backup" -- it's just a block copying tool. You can use it to back up weird partition types, but you have to dump and restore the entire partition every time. This also makes it less than useful for dealing with things like raw disk partitions allocated to an Oracle 8i database -- you'd have to shut the database server down completely before using dd to copy it to tape. Finally, dd can't write to CD-R or CD-RW media -- for that you need the CD-writing tools described below.

Backup media

Speaking of backup media, there are several to choose between, and one category -- CDs -- uses different software tools from all the others.

Floppy disks are now effectively obsolete. Among backup media still in use there are removable hard disks, Zip/Jaz disks (which are basically a variation on the same theme), tapes, and CDs (soon to be augmented with DVDs).

Tapes are the cheapest backup medium per gigabyte, but also the least convenient. To locate a file in a given tape backup requires, on average, a linear search through half the tape. Some backup software tools (not the ones described so far) maintain indexes of the backup tapes, recording the block offsets at which files are located so that they can fast-forward to the right point: even so, these systems are far from fast.
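
Even without an index, you can skip between multiple archives stored on one tape by using the non-rewinding device (/dev/nst0) and the mt tape-control tool -- a minimal sketch:

  mt -f /dev/nst0 rewind     # go back to the start of the tape
  mt -f /dev/nst0 fsf 2      # fast-forward over the first two archives
  tar -tvf /dev/nst0         # list the third archive on the tape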

Hard disks are the fastest but most expensive backup media. If you've got a removable USB or Firewire drive or a hot-swappable ATAPI drive (such as a Zip drive) you can mount the drive as a filesystem, copy files back and forth at a speed limited by your bus (12Mbit/s for USB 1.1, faster for built-in ATAPI, Firewire, or SCSI drives, slower for parallel-port drives), and unmount again.
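
In practice that can be as simple as the following (the device name, mount point, and user directory are assumptions -- a SCSI or USB Zip disk conventionally appears as partition 4, but check where your drive really shows up):

  # mount the removable disk, copy the files across, then unmount it again
  mount /dev/sda4 /mnt/zip
  cp -a /home/alice /mnt/zip/
  umount /mnt/zip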

CDs are dealt with differently under Linux. Standard CD-ROMs use a filesystem defined by an ISO committee -- the ISO9660 format. Linux can read this format, along with the derivatives adapted to support long filenames (Rock Ridge, the UNIX version; Joliet, Microsoft's variant; and even Apple's HFS) and bootable (El Torito) discs. ISO9660 was not designed to provide a writable filesystem; to create a CD you use a tool called mkisofs that takes a snapshot of a directory tree and generates an ISO9660 filesystem image as a single huge file. You then use a specialised tool to burn it onto a CD. For example:

  mkisofs -r -o /tmp/backup.img /home

This outputs a file called /tmp/backup.img containing a copy of everything below /home, formatted as a CDROM image with Rock Ridge extensions.

Note that mkisofs can generate secondary images for multi-session CDs (CD-R), where the secondary image "overlays" an earlier disk image and updates it with changes. To figure out how to do this you need to read the CD-Writing HOWTO.

Mkisofs doesn't talk directly to the CD writer: for that you need a tool called cdrecord. cdrecord's job is to blast raw ISO9660 disk images (or audio tracks) onto a CD in a writer, and optionally to blank CD-RW disks. It's possible to write a script that pipes the output of mkisofs straight into cdrecord, but this isn't how they're usually used.
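
A hedged example of both approaches -- burning a pre-built image, and piping mkisofs straight into cdrecord (the dev= address is whatever 'cdrecord -scanbus' reports for your writer):

  # burn the image created by the mkisofs example above
  cdrecord -v dev=0,6,0 speed=4 /tmp/backup.img
  # or pipe the image straight to the writer without storing it on disk first
  mkisofs -r /home | cdrecord -v dev=0,6,0 speed=4 -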

The main weakness of using mkisofs/cdrecord for backups is that they can't really cope with backing up more than a CD's worth of data (around 650-700MB) in one go -- you'd overflow the disk -- and they require as much free disk space as is needed to store the backup image. However if you just want to copy your personal files to a CD this might be the best tool out there. If the command line versions look a bit intimidating, there's a graphical tool -- Xcdroast -- that acts as a front end to mkisofs and cdrecord; you can get it from www.xcdroast.org (or as part of many Linux distributions).

Of course, cdrecord isn't limited to writing only disk images: it can write tar archives or dump images, too! As long as you're willing to think of CDs as backup disks rather than something you can shove in your drive and browse like a normal CD-ROM, you can store just about anything on a CD. A number of scripts exist that split tar or dump images across multiple CD-Rs: yacdback, multiCD, and cdbackup, for example, all do more or less this.
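
The underlying trick is roughly this (a minimal sketch of the idea, not how any particular one of those scripts works): split the archive stream into CD-sized pieces, burn each piece to its own disk, and join the pieces back up at restore time:

  # split a compressed archive of /home into roughly CD-sized pieces
  tar -czf - /home | split -b 650m - /tmp/homedump.
  # ...burn each /tmp/homedump.* piece to its own CD (with cdrecord)...
  # at restore time, copy the pieces back and concatenate them into tar:
  cat /tmp/homedump.* | tar -xzf -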

Network backup systems

If you're backing up your personal files on a home machine, something like Xcdroast -- or a shell script running tar or dump once a night with a tape left in the drive -- is more than sufficient. But what if you do this for a living and have to ride herd on fifty machines?

One of the key tools in the system administrator's arsenal is network backup software. We mentioned earlier that dump can send its output to the remote tape device of another suitably configured machine, but this doesn't help much if the tape device is already in use. To deal with this, we need a proper client/server backup system -- such as Amanda. AMANDA (the Advanced Maryland Automatic Network Disk Archiver) allows a network administrator to configure a single master backup server that backs up multiple hosts to a single large-capacity tape drive. (You can get it, or read about it, at www.amanda.org.) Amanda uses dump or tar to back up UNIX workstations, and has recently gained the ability to use SAMBA to back up Windows clients; there's a handy introduction to Amanda on the project's website.

A tool like Amanda is the solution of choice for larger servers; a number of commercial workalikes with various additional features exist. For example, IBM's Tivoli file migration system supports Linux workstation and server clients, and Sun's Solstice backup system provides a networked backup tool for UNIX and Solaris systems -- with a cute graphical front-end as well. However, all these systems are expensive; they're marketed at corporate network administrators and priced accordingly.

