February 2002 Column




A bestiary of filesystems

In January 2001 I wrote about exotic filesystems on Linux -- AFS, Coda, and InterMezzo, distributed systems designed to let users connect to and disconnect from a file server across a network transparently.

This was probably a mistake, or the first sign of creeping senility, because I haven't actually written about filesystems in general yet. As kernel 2.4.15 (slouching towards stability) adds ext3 and InterMezzo to the previous bestiary, it's about time I did that. Filesystems cause nothing but confusion to newbies, and they're one of the major compatibility stumbling blocks when moving between operating systems. So here's my attempt to explain where they came from, why you shouldn't ignore them, and what your options are.

What is a filesystem?

A filesystem is a piece of software (usually built into the operating system or loaded as a module) that takes a symbolic name (such as "/etc/passwd") and retrieves the data associated with it. The data is usually stored on a physical medium such as a disk in the form of blocks -- chunks of storage that exist at a physical address (such as track #40014, sector #12, side #2). In addition to providing access to a file by name, a filesystem provides access to specific positions within the file ("give me 512 bytes starting at an offset 5667346 bytes from the beginning, please"), and an organisational abstraction that lets you partition files into manageable groups (directories or folders).
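
By way of illustration, the dd command will happily pull exactly that sort of byte range out of a file, leaving the filesystem to work out which disk blocks are involved; the file name here is arbitrary, and it would need to be at least that long:

   dd if=/var/log/messages bs=1 skip=5667346 count=512   # 512 bytes, starting 5667346 bytes in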

The first computer operating systems didn't bother with much in the way of filesystems -- they just used magnetic tapes where a header block contained metainformation describing the contents of some following blocks (such as their name and creation date and what sort of bits the empty areas were padded with). To retrieve a file you'd have to read through the entire tape until you came across its header. When disk-based systems showed up -- such as Digital Research's CP/M, the first real personal computer operating system and a spiritual descendant of DEC's early minicomputer operating systems -- this wasn't good enough. Instead, a group of blocks in a set position on the disk were used to store the file metainformation, including the location of those blocks on the disk that were assigned to the file -- like an index in a book. ("Turn to the index, look up the file 'fred.txt' and tell me what blocks it is stored on.")

UNIX (and its near-clone, Linux) takes a more sophisticated view of a filesystem. Firstly, a distinction is made between metainformation describing files (such as their creation date, ownership, and what disk blocks they occupy), the file contents, and the human-readable name assigned to the file. (We'll see how this is important shortly.) Secondly, the Linux kernel provides a virtual filesystem layer. The virtual filesystem provides a uniform public interface to programs that want to read or write files; real filesystems plug in underneath the VFS, and the VFS does the heavy lifting of translating user program requests into actual calls to the underlying filesystem. This allows Linux to talk to a huge range of different filesystems -- such as MSDOS, traditional UNIX, RiscOS, MacOS, Amiga, CDROM, Windows NT filesystems, or even a disk filesystem mapped onto a tape drive or a network socket -- without the need for applications to be aware of the underlying filesystem type.
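
You can see which filesystem types the VFS in your running kernel currently knows about by looking at /proc/filesystems; the exact list depends on what you compiled in or loaded as modules, but it looks something like this (nodev marks types that don't live on a block device):

   cat /proc/filesystems
   nodev   proc
           ext2
           iso9660
           vfat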

Why use different filesystem types? Well, every operating system developer has traditionally had a different way of thinking about file storage, so the wheel keeps getting reinvented in a variety of shapes and sizes. In addition, older Linux filesystems don't cope so well with huge (multi-gigabyte or multi-terabyte) files, or with home users who believe in shutting down by unplugging the computer. Fault tolerance is a big problem; a filesystem needs to minimize the risk of data corruption or loss if someone yanks out a power cable, and recover rapidly after such a crash. Journaling is the most popular technology for achieving this goal, followed by the (more expensive) RAID techniques (which rely on hardware, rather than software).

Another major reason for using different filesystem types is the availability of network filesystems -- Sun's NFS (Network File System), Microsoft's SMB system and Apple's AppleShare are the best known. These are network protocols; rather than storing data on a hard disk they allow a server to export a volume (filesystem) that can be mounted via the network on a client machine: thereafter the client reads and writes files over the network to the server's hard drive. Both NFS and SMB let you specify that a filesystem will be exported, and control which client machines can mount it. But if your server goes down for some reason, the clients that are using files on the NFS filesystem will be left high and dry. The answer to this is to use a clustering network filesystem -- one where a bunch of servers work as a team. (See Shopper 156).
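
As a sketch of what this looks like in practice with a recent nfs-utils setup (all the host names here are invented): the server lists the directories it is willing to export in /etc/exports, and the client mounts them with the usual mount command.

   # on the server, in /etc/exports:
   /home    client1.example.com(rw) client2.example.com(ro)

   # tell the NFS server to re-read its export list:
   exportfs -a

   # on a client:
   mount -t nfs server.example.com:/home /mnt/home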

I'm going to look at these categories in turn -- at conventional filesystems, journaling filesystems, and network filesystems -- and see what the state of play is on Linux.

ext2, the standard Linux filesystem

The ext2 filesystem is descended from the traditional Unix filesystems, which tended to be fast but temperamental -- if you sneezed at them they tended to break. Ext2 is best understood as two layers: a metadata layer and a file-data layer. All Unix filesystems rely on a basic data structure called the inode (information node). An inode has a unique ID number, and some information about creation time, number of links (filenames) attached to it, and pointers to the disk blocks that store its associated data; in fact, an inode encapsulates everything we need to know about a file except its name. Inodes live in a table on the disk at a well-known location -- they're basically the core metadata of the system. Lose your inode table and you can kiss your data goodbye.

A directory, in contrast, is just a special file that consists of a list of filenames, each of which is accompanied by an inode number. It's data, not metadata. Lose your directory files and you can (in principle) reconstruct your data -- although you'll have lost the identifying names.
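
You can see the split between names and inodes for yourself: ls's -i flag prints each file's inode number alongside its name, and every entry in a directory listing (including . and ..) is just a name pointing at an inode.

   ls -li /etc/passwd    # the first column is the inode number
   ls -ia /etc           # . and .. are ordinary directory entries too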

To run a program like /usr/bin/bash, the kernel needs to open the root directory's inode, read through its data and locate the inode for "usr". It then reads that inode's data, locates the inode for "bin" and so on until it finds the code stored in the data blocks referred to by the inode referred to by "bash". If this sounds long-winded, it is -- that's why Linux has a buffer cache, an area of RAM that contains copies of recently read disk blocks. Memory access is several orders of magnitude faster than disk access, so if we're trying to access an often-read chunk of data it makes a lot of sense to keep a copy hanging about in the cache. (The buffer cache is dynamically resizable and uses up all free memory that other programs aren't asking for. So if you've got a 1GB machine running nothing in particular, it'll have a buffer cache taking up most of its RAM -- and if you then load ten copies of StarOffice the buffer cache will downsize itself to almost nothing. If only StarOffice behaved in such a civilized manner ...)
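
You can watch the buffer cache at work with free; the "buffers" and "cached" figures are memory the kernel is using for disk caching, which it hands back as soon as applications ask for it.

   free -m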

As well as speeding up reads and writes, the cache has an undesirable side-effect: the state of the filesystem isn't committed to disk as it changes. If someone pulls the plug there is a chance that some data may have been modified in the cache without the disk being updated. If a file is being appended to when the computer crashes, data may have been written to disk blocks that have not yet been added to the inode's block list or removed from the free list. In the most extreme cases this can result in files being lost or metadata being corrupted. Older Unix filesystems were very vulnerable to this sort of damage and a crash could render them unmountable; ext2 can take a lot more punishment, but still isn't perfect.

To fix consistency problems caused by a crash, Unix systems use a program called fsck (filesystem check). fsck ensures that the metadata structures are up to date and consistent -- it checks that no data blocks in the free list are assigned to inodes, that each inode has the right number of filenames for its link count, and so on. But fsck is slow; even on a fast machine it can take up to a minute per gigabyte. This is intolerable on a server with multiple 30Gb volumes that needs to minimize downtime, and such systems are increasingly common.

A secondary problem of ext2 is the inode list. A fundamental weakness of Unix filesystems is that the inode structures are all stored in a table at the start of the filesystem's partition. You can't add inodes to an existing filesystem. Some filesystems -- such as a usenet spool directory, or some kinds of database -- need to store huge numbers of very small files. In this sort of situation not being able to expand the inode table is a major headache.
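
df's -i option reports inode usage rather than block usage, so you can see how close a filesystem is to hitting this limit (the news spool path is just an example):

   df -i /var/spool/news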

Finally, it's not easy to add arbitrary metadata to an ext2 filesystem. In addition to the access permissions and file attributes stored in the inode, ext2 has some extended file attributes, intended to support facilities like dynamic file compression and access control lists. But beyond these, there's no easy way to extend ext2 to support new kinds of attribute.
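
The attributes ext2 does have are set with chattr and listed with lsattr, both part of e2fsprogs; the file name below is made up:

   chattr +a /var/log/important.log   # make the file append-only
   lsattr /var/log/important.log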

As a result, for the past year or so work has been underway on a number of possible replacements for ext2. And they're now showing up: kernel 2.4.15 supports both ReiserFS and ext3, giving you a choice of solutions to the metadata journaling problem.

(Note: you can find a full description of ext2 in the file Documentation/filesystems/ext2.txt in the Linux kernel source tree.)

Journaling filesystems

Journaling filesystems maintain a special file called a log, or journal, the contents of which are not cached. Every time the filesystem is updated, a record describing the transaction is added to the log. When the filesystem is idle, a background thread processes these transactions, writes the data to the filesystem, then flags each transaction as completed. If someone pulls the plug while outstanding transactions exist in the journal file, then when the machine reboots the background process kicks in and simply finishes copying updates from the journal to the filesystem. Incomplete transactions in the journal file aren't attempted, so the filesystem's internal consistency is maintained.

What this means in practice is that a journaling filesystem should virtually never need a full-blown consistency check, as carried out by fsck; the time taken to restore a filesystem after a reboot is cut by a couple of orders of magnitude.

Journaling filesystems are reliable but slow: anything that goes through the journal is written to the disk twice, first to the journal file and then to the filesystem itself. Other approaches such as tux2's phase trees promise performance within 10% of a cached filesystem, and ReiserFS appears to be slightly faster than ext2.

ReiserFS: radical redesign

One option is to replace ext2 with a totally different filesystem. The best known of these is ReiserFS, and it's a radical departure from the traditional Unix filesystems, which are block-structured. Hans Reiser writes: "In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes." Operations like the ext2 filename lookup described above are more efficient, tiny files are stored more efficiently too, and ReiserFS never runs out of inodes. ReiserFS uses a scheme called "preserve lists" to update metadata, ensuring that old metadata isn't overwritten directly -- this reduces the risk of inconsistencies occurring in the event of a crash. ReiserFS isn't a true journaling filesystem, but because it journals metadata it's far less likely to lose files or corrupt filesystems in the event of a crash.
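
Creating and mounting a ReiserFS volume looks much like any other Linux filesystem; mkreiserfs comes with the reiserfsprogs package, and the partition and mount point below are examples only (mkreiserfs will destroy whatever is already on the partition):

   mkreiserfs /dev/hda3
   mount -t reiserfs /dev/hda3 /var/spool/news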

ReiserFS's advantages currently include rapid restart after a crash and efficient storage of large numbers of small files -- indeed, part of its designer's intention was to make it possible to store objects much smaller than those that are normally saved as separate files. Future design plans include adding set-theoretic semantics, making it possible to retrieve files by specifying their attributes rather than an explicit pathname.

tux2 -- ext2 goes tree-shaped

tux2 takes a different approach, but has a similar effect. In tux2, both metadata and files are stored as a tree structure on disk. tux2 uses a phase tree algorithm to ensure that metadata and data are updated synchronously and that changes propagate up the tree from the modified leaves without introducing inconsistencies, even in the event of a sudden crash. The tux2 filesystem is based on ext2, and it's possible to take an ext2 filesystem and update it to tux2 -- but it's not downward compatible in the same way as ext3.

Tux2 is probably best described as a branch off the main tree of ext2 development, because the clear successor to ext2 is ext3 -- the journaling remix.

ext3 -- ext2 gets journaling support

Possibly the best approach to fixing ext2's deficiencies is the ext3 filesystem developed by Stephen Tweedie.

ext3 is ext2 with added journaling. An ext3 filesystem on disk is simply an ext2 filesystem that contains a funny journal file. You can mount a damaged ext3 filesystem, or one on removable media, as ext2; you can use ext2's fsck tools and filesystem debugger. The magic that makes ext3 fault tolerant is confined to the journal file and the ext3 kernel module. This means that it's not only upward-compatible with ext2, it's downward compatible too -- almost unique in the world of filesystems, and a good argument for switching to it immediately.

To switch on ext3, you need to first have a kernel that supports it (or apply the ext3 patch to your kernel and recompile it). Briefly, you can convert any ext2 filesystem to ext3 by running tune2fs on it to create a journal file, then remounting it as ext3:

   tune2fs -j /dev/hda2
   mount -t ext3 /dev/hda2 /home

(by way of example). You can also unmount an ext3 filesystem and remount it as ext2. If you need to fix a damaged ext3 filesystem after a bad shutdown, use the updated fsck:

   e2fsck -fy /dev/hda2

This carries out journal replay rather than the traditional fsck scan.
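
To have the filesystem mounted as ext3 automatically at boot, the matching /etc/fstab entry (using the same example device and mount point as above) looks something like this:

   /dev/hda2   /home   ext3   defaults   1 2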

For full details of how to manage an ext3 filesystem, see the ext3 usage FAQ.

ext3 isn't the only journaling filesystem for Linux. SGI (Silicon Graphics) recently released the source to their XFS journaling filesystem; a patch for the Linux 2.4 development kernels is available from SGI. XFS is a 64-bit filesystem, able to store millions of terabytes of data, millions of files, and a million or more files per directory. It, too, uses journaling to reduce consistency checking time; internally it stores files and metadata using b-trees, which massively reduces the time taken to search through sparse (largely empty) files or large directories.
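
As with the other filesystems, once the XFS kernel patch and the xfsprogs tools are installed, creating and mounting an XFS volume is a couple of commands; the partition and mount point are, again, only examples (and mkfs.xfs will wipe the partition):

   mkfs.xfs /dev/hda4
   mount -t xfs /dev/hda4 /data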

Alien filesystem types

If you have to deal with Windows systems, you'll have used some form of FAT filesystem (probably VFAT, for Windows 95 or 98 disks, or NTFS, for Windows NT/2000), while MacOS uses a system called HFS or HFS+. All these filesystems have certain features in common: they only provide file storage for a single machine, they usually reside on a single hard disk partition, and they use caching to speed up performance.

You can mount non-native filesystems on Linux using the -t option to the mount command; for example:

  mount -t hfs /dev/fd0 /mnt/floppy

mounts a floppy disk with an HFS (Macintosh) filesystem.

The Linux kernel supports a huge range of filesystems from other operating systems. A standard 2.4.15 kernel, for example, comes with support for:

MSDOS: FAT12 and FAT16 types, FAT-CVF (compressed FAT, e.g. DoubleSpace)
Windows 95/98: VFAT
Windows NT/2000: NTFS
Macintosh: HFS
OS/2: HPFS
Minix: Minix FS
CDROM: ISO9660, Rock Ridge, UDF, Joliet (and others)
BSD/SysV family Unixes: UFS, SYSV, Coda, BFS (UnixWare/SCO)
Linux-specific: ext2, ext3, ReiserFS, cramfs (for RAMdisks), romfs (for ROMable embedded systems), IBM JFS
Novell: NCPFS
In addition, it's possible to compile up support for Atari ST, Acorn Archimedes, and Amiga filesystems (although these may be omitted from standard distributions).
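
If support for one of these was built as a module rather than into the kernel itself, it needs loading before you can mount anything of that type; for instance, assuming your distribution ships the Amiga affs module and you have an Amiga-formatted partition lying about:

   modprobe affs
   mount -t affs /dev/hdb5 /mnt/amiga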

Network filesystems

There are three standard network filesystems on Linux today: NFS version 2 (the traditional Unix network filesystem), Samba (which implements Windows SMB, also known as CIFS, the Common Internet File System), and Netatalk (which provides an AppleTalk-over-IP implementation, allowing Linux to act as a fileserver for Macintosh networks). NFS v3 support has just found its way into the 2.4 development kernel and should provide some performance improvements over the older Linux NFS system; meanwhile, work is just commencing on defining and implementing NFS v4, which will for the first time provide a usable file locking mechanism.
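
On the client side, mounting an SMB share exported by a Windows machine or a Samba server uses the familiar mount syntax with the smbfs type; the server, share, and user names here are invented:

   mount -t smbfs -o username=fred //server/shared /mnt/shared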

All of these filesystems share two problems: they rely on a single server and they don't support disconnected operation. Disconnected operation is needed by laptop users who want their mobile machine to see the same home directory as their desktop, even when it's not connected to the network, and by clusters of machines that serve web files for a business and need to provide the same data even when one machine is down for maintenance.

Coda -- an academic research filesystem

There are a number of experimental multi-server network filesystems that support disconnected operation on the client side. The best known is Coda, an academic project from CMU. Coda is designed to support server replication, so that a cluster of servers can stay in lockstep and continue operating even when the network is suffering a partial failure. Each Coda client runs a cache application called Venus; this communicates with the kernel, and answers queries for data either from a local cache file or by requesting a copy of the file over the network. Coda has been present (quietly!) in the Linux kernel since 2.4.10.

InterMezzo -- light and fast

InterMezzo is a clean-sheet design inspired by Coda, but stripped down and simpler. Its goal is to provide support for flexible replication of directories, with disconnected operation and a persistent cache. InterMezzo uses an existing ext2 filesystem as the storage location for all data. When an ext2 filesystem is mounted as type InterMezzo instead of ext2, the InterMezzo software starts monitoring all access to the filesystem. It manages the journals of modification records and negotiates permits to modify the disk filesystem, to avoid conflicting updates during connected operation. When disconnected, the contents of the ext2 filesystem are still accessible to the client. There are two components: the Presto kernel module (which handles journaling and permit negotiation) and the Lento server (which replicates filesystem information between hosts).

InterMezzo is currently under development but was added to the kernel in 2.4.15; at the time of writing it looks most promising, but your reviewer hasn't had a chance to play with it in earnest yet.

OpenAFS -- industrial strength clustering

The other major contender in the clustering filesystem sector is OpenAFS. AFS began life as an academic project at CMU and was developed as a commercial product by Transarc, later acquired by IBM; OpenAFS is the version of AFS that IBM has donated back to the open source community under the IBM Public License. The current release is available from OpenAFS.org; it's still in the early stages of porting, and takes rather a lot of technical competence to install and set up, but promises an industrial-strength distributed filesystem with backup tools, ACLs, replication support, and secure transactions; expect it to show up in the 2.5 kernel. As AFS is cross-compatible with MacOS X and Windows, this is a promising platform for a large roll-out in the near future.

