The HyperNews Linux KHG Discussion Pages

A tour of the Linux VFS

I'm not an expert on this topic. I've never written a filesystem from scratch; I've only worked on the proc filesystem, and I didn't do much real filesystem hacking there, only extensions to what was already there.

So if you see any mistakes or omissions here (there are bound to be omissions in a piece this short on a topic this large), please respond, so that I can fix them and let other people know about them.

In Linux, all files are accessed through the Virtual Filesystem Switch, or VFS. This is a layer of code which implements generic filesystem actions and vectors requests to the correct specific code to handle the request. Two main types of code modules take advantage of the VFS services: device drivers and filesystems. Because device drivers are covered elsewhere in the KHG, we won't cover them explicitly here. This tour will focus on filesystems. Because the VFS doesn't exist in a vacuum, we'll show its relationship with the favorite Linux filesystem, the ext2 filesystem.

One warning: without a decent understanding of the system calls that have to do with files, you are not likely to be able to make heads or tails of filesystems. Most of the VFS and most of the code in a normal Linux filesystem is pretty directly related to completing normal system calls, and you will not be able to understand how the rest of the system works without understanding the system calls on which it is based.

Where to find the code

The source code for the VFS is in the fs/ subdirectory of the Linux kernel source, along with a few other related pieces, such as the buffer cache and code to deal with each executable file format. Each specific filesystem is kept in a lower subdirectory; for example, the ext2 filesystem source code is kept in fs/ext2/.

This table gives the names of the files in the fs/ subdirectory and explains the basic purpose of each one. The middle column, labeled system, shows which major subsystem each file is (mainly) dedicated to. EXE means that it is used for recognizing and loading executable files. DEV means that it is for device driver support. BUF means buffer cache. VFS means that it is a part of the VFS and delegates some functionality to filesystem-specific code. VFSg means that the code is completely generic, never delegates any of its operation to filesystem-specific code (none that I noticed, anyway), and is code you shouldn't have to worry about while writing a filesystem.

binfmt_aout.c    EXE   Recognize and execute old-style a.out executables.
binfmt_elf.c     EXE   Recognize and execute new ELF executables.
binfmt_java.c    EXE   Recognize and execute Java apps and applets.
binfmt_script.c  EXE   Recognize and execute #!-style scripts.
block_dev.c      DEV   Generic read(), write(), and fsync() functions for block devices.
buffer.c         BUF   The buffer cache, which caches blocks read from block devices.
dcache.c         VFS   The directory cache, which caches directory name lookups.
devices.c        DEV   Generic device support functions, such as registries.
dquot.c          VFS   Generic disk quota support.
exec.c           VFSg  Generic executable support. Calls functions in the binfmt_* files.
fcntl.c          VFSg  fcntl() handling.
fifo.c           VFSg  fifo handling.
file_table.c     VFSg  Dynamically-extensible list of open files on the system.
filesystems.c    VFS   All compiled-in filesystems are initialized from here by calling init_name_fs().
inode.c          VFSg  Dynamically-extensible list of open inodes on the system.
ioctl.c          VFS   First-stage handling for ioctl()s; passes handling to the filesystem or device driver if necessary.
locks.c          VFSg  Support for fcntl() locking, flock() locking, and mandatory locking.
namei.c          VFS   Fills in the inode, given a pathname. Implements several name-related system calls.
noquot.c         VFS   No quotas: optimization to avoid #ifdefs in dquot.c.
open.c           VFS   Lots of system calls, including (surprise) open(), close(), and vhangup().
read_write.c     VFS   read(), write(), readv(), writev(), lseek().
readdir.c        VFS   Several different interfaces for reading directories.
select.c         VFS   The guts of the select() system call.
stat.c           VFS   stat() and readlink() support.
super.c          VFS   Superblock support, filesystem registry, mount()/umount().

Attaching a filesystem to the kernel

If you look at the code in any filesystem for init_name_fs(), you will find that it probably contains about one line of code. For instance, in the ext2fs, it looks like this (from fs/ext2/super.c):

   int init_ext2_fs(void)
   {
           return register_filesystem(&ext2_fs_type);
   }
All it does is register the filesystem with the registry kept in fs/super.c. ext2_fs_type is a pretty simple structure:
   static struct file_system_type ext2_fs_type = {
           ext2_read_super, "ext2", 1, NULL
   };
The ext2_read_super entry is a pointer to a function which allows a filesystem to be mounted (among other things; more later). "ext2" is the name of the filesystem type, which is used (when you type mount ... -t ext2) to determine which filesystem code to use to mount a device. The 1 says that the filesystem needs a device to be mounted on (unlike the proc filesystem or a network filesystem), and the NULL fills the next field, which is used to keep a linked list of filesystem types in the filesystem registry, kept in (look it up in the table!) fs/super.c.
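To make the registry concrete, here is a minimal user-space sketch of how register_filesystem() can chain file_system_type structures through that next field. The struct layout is simplified and get_fs_type() is an invented helper standing in for the name lookup mount does; none of this is the kernel's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's file_system_type. */
struct file_system_type {
        void *(*read_super)(void *sb, void *data, int silent);
        const char *name;
        int requires_dev;
        struct file_system_type *next;  /* the NULL slot in the example above */
};

static struct file_system_type *file_systems = NULL;

/* Chain a new type onto the registry, refusing duplicate names. */
int register_filesystem(struct file_system_type *fs)
{
        struct file_system_type **p = &file_systems;

        while (*p) {
                if (strcmp((*p)->name, fs->name) == 0)
                        return -1;      /* already registered */
                p = &(*p)->next;
        }
        fs->next = NULL;
        *p = fs;
        return 0;
}

/* Find a registered type by name, as "mount -t <name>" must. */
struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *p;

        for (p = file_systems; p; p = p->next)
                if (strcmp(p->name, name) == 0)
                        return p;
        return NULL;
}
```

The design point is simply that each registered structure donates its own next pointer, so the registry needs no allocation of its own.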

It's possible for a single filesystem implementation to support more than one filesystem type. For instance, in fs/sysv/inode.c, three filesystem types are supported by one body of code, like this:

   static struct file_system_type sysv_fs_type[3] = {
           {sysv_read_super, "xenix", 1, NULL},
           {sysv_read_super, "sysv", 1, NULL},
           {sysv_read_super, "coherent", 1, NULL}
   };

   int init_sysv_fs(void)
   {
           int i;
           int ouch = 0;

           for (i = 0; i < 3; i++) {
                   if ((ouch = register_filesystem(&sysv_fs_type[i])) != 0)
                           return ouch;
           }
           return ouch;
   }

Connecting the filesystem to a disk

The rest of the communication between the filesystem code and the kernel doesn't happen until a device bearing that type of file system is mounted. When you mount a device containing an ext2 file system, ext2_read_super() is called. If it succeeds in reading the superblock and is able to mount the filesystem, it fills in the super_block structure with information that includes a pointer to a structure called super_operations, which contains pointers to functions which do common operations related to superblocks; in this case, pointers to functions specific to ext2.

A superblock is the block that defines an entire filesystem on a device. It is sometimes mythical, as in the case of the DOS filesystem--that is, the filesystem may or may not actually have a block on disk that is the real superblock. If not, it has to make something up. Operations that pertain to the filesystem as a whole (as opposed to individual files) are considered superblock operations. The super_operations structure contains pointers to functions which manipulate inodes, the superblock, and which refer to or change the status of the filesystem as a whole (statfs() and remount()).

You have probably noticed that there are a lot of pointers here, and especially pointers to functions. The good news is that all the messy pointer work is done for you; that's the VFS's job. All the filesystem author needs to do is fill in (usually static) structures with pointers to functions, and pass pointers to those structures back to the VFS so it can get at the filesystem and the files.

For example, the super_operations structure looks like this (from <linux/fs.h>):

   struct super_operations {
           void (*read_inode) (struct inode *);
           int (*notify_change) (struct inode *, struct iattr *);
           void (*write_inode) (struct inode *);
           void (*put_inode) (struct inode *);
           void (*put_super) (struct super_block *);
           void (*write_super) (struct super_block *);
           void (*statfs) (struct super_block *, struct statfs *, int);
           int (*remount_fs) (struct super_block *, int *, char *);
   };
That's the VFS part. Here's the much simpler declaration of the ext2 instance of that structure, in fs/ext2/super.c:
   static struct super_operations ext2_sops = {
           ext2_read_inode,
           NULL,                   /* notify_change */
           ext2_write_inode,
           ext2_put_inode,
           ext2_put_super,
           ext2_write_super,
           ext2_statfs,
           ext2_remount
   };
First, notice that an unneeded entry has simply been set to NULL. That's pretty normal Linux behavior; whenever there is a sensible default behavior for a function pointer, and that sensible default is what you want, you are almost sure to be able to provide a NULL pointer and get the default painlessly. Second, notice how simple and clean the declaration is. All the painful stuff like sb->s_op->write_super(sb); is hidden in the VFS implementation.
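The NULL-means-default convention is easy to sketch in user space. The structures and function names below are invented for illustration (a toy super_block with a dirty flag, not the kernel's definitions); the point is only the shape of the generic-side check before each call through the pointer.

```c
#include <stddef.h>

struct super_block;

/* Toy operations table in the style of super_operations. */
struct super_operations {
        void (*write_super)(struct super_block *);
        int  (*notify_change)(struct super_block *);    /* often left NULL */
};

struct super_block {
        struct super_operations *s_op;
        int dirty;      /* invented field, for demonstration only */
};

/* Generic VFS-side helpers: call the filesystem's hook if one was
 * provided, otherwise fall through to the sensible default. */
void vfs_write_super(struct super_block *sb)
{
        if (sb->s_op && sb->s_op->write_super)
                sb->s_op->write_super(sb);
        /* default: nothing to flush */
}

int vfs_notify_change(struct super_block *sb)
{
        if (sb->s_op && sb->s_op->notify_change)
                return sb->s_op->notify_change(sb);
        return 0;       /* default: accept the change */
}

/* An "ext2-like" instance: one hook implemented, one left NULL,
 * just as ext2_sops leaves notify_change NULL. */
static void ext2_like_write_super(struct super_block *sb)
{
        sb->dirty = 0;
}

static struct super_operations ext2_like_sops = {
        ext2_like_write_super,
        NULL,
};
```

All the pointer-chasing lives on the generic side; the filesystem only fills in a table.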

The details of how the filesystem actually reads and writes the blocks, including the superblock, from and to the disk will be covered in a different section. There will actually be (I hope) two descriptions--a simple, functional one in a section on how to write filesystems, and a more detailed one in a tour through the buffer cache. For now, assume that it is done by magic...

Mounting a filesystem

When a filesystem is mounted (which file is in charge of mounting a filesystem? Look at the table above, and find that it is fs/super.c; you might want to follow along there), do_mount() calls read_super(), which ends up calling (in the case of the ext2 filesystem) ext2_read_super(), which returns the superblock. That superblock includes a pointer to that structure of pointers to functions that we see in the definition of ext2_sops above. It also includes a lot of other data; you can look at the definition of struct super_block in include/linux/fs.h if you like.

Finding a file

Once a filesystem is mounted, it is possible to access files on that filesystem. There are two main steps here: looking up the name to find what inode it points to, and then accessing the inode.

A filename handed to the VFS includes a path. Unless the filename is absolute (it starts with a / character), it is relative to the current directory of the process that made the system call. The VFS uses filesystem-specific code to look up files on the filesystems involved. It takes the path name one component at a time (filename components are separated with / characters) and looks each one up. If a component is a directory, then the next component is looked up in the directory returned by the previous lookup. Every component which is looked up, whether it is a file or a directory, yields an inode number which uniquely identifies it and through which its contents are accessed.
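The component-at-a-time walk can be sketched like this. The two-level directory tables and the lookup() helper are entirely invented for the example; in the real kernel each step goes through the filesystem's own lookup function, which we meet below under inode operations.

```c
#include <string.h>

/* A toy "filesystem": inode 1 is the root, inode 2 is /usr. */
struct entry { const char *name; int ino; };

static struct entry root_dir[] = { { "usr", 2 }, { "etc", 3 }, { NULL, 0 } };
static struct entry usr_dir[]  = { { "bin", 4 }, { NULL, 0 } };

/* Look up one component in the directory identified by dir_ino. */
static int lookup(int dir_ino, const char *name)
{
        struct entry *dir = (dir_ino == 1) ? root_dir :
                            (dir_ino == 2) ? usr_dir : NULL;

        if (!dir)
                return -1;
        for (; dir->name; dir++)
                if (strcmp(dir->name, name) == 0)
                        return dir->ino;
        return -1;
}

/* Walk an absolute path, one '/'-separated component at a time. */
int path_to_ino(const char *path)
{
        char buf[256];
        char *comp;
        int ino = 1;                    /* start at the root inode */

        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        for (comp = strtok(buf, "/"); comp; comp = strtok(NULL, "/")) {
                ino = lookup(ino, comp);
                if (ino < 0)
                        return -1;      /* roughly, ENOENT */
        }
        return ino;
}
```

Each iteration uses the inode found by the previous one as the directory to search, which is exactly the structure of the loop described above.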

If the file turns out to be a symbolic link to another file, then the VFS starts over with the new name which is retrieved from the symbolic link. In order to prevent infinite recursion, there's a limit on the depth of symlinks; the kernel will only follow so many symlinks in a row before giving up.

When the VFS and the filesystem together have resolved a name into an inode number (that's the namei() function in namei.c), then the inode can be accessed. The iget() function finds and returns the inode specified by an inode number. The iput() function is later used to release access to the inode. It is kind of like malloc() and free(), except that more than one process may hold an inode open at once, and a reference count is maintained to know when it's free and when it's not.
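Here is a sketch of that reference-counted behavior over a small in-memory table. The table size and fields are invented; the real iget() hashes into a cache and may call the filesystem's read_inode() operation to fill a fresh inode, details this sketch glosses over.

```c
#include <stddef.h>

#define NR_INODES 8

struct inode {
        int i_ino;
        int i_count;    /* number of holders; 0 means the slot is free */
};

static struct inode inode_table[NR_INODES];

/* Return the in-core inode for ino, bumping its reference count;
 * "read it in" to a free slot if it isn't cached yet. */
struct inode *iget(int ino)
{
        struct inode *free_slot = NULL;
        int i;

        for (i = 0; i < NR_INODES; i++) {
                if (inode_table[i].i_count > 0 &&
                    inode_table[i].i_ino == ino) {
                        inode_table[i].i_count++;       /* already in core */
                        return &inode_table[i];
                }
                if (inode_table[i].i_count == 0 && !free_slot)
                        free_slot = &inode_table[i];
        }
        if (!free_slot)
                return NULL;            /* table full */
        free_slot->i_ino = ino;         /* stand-in for read_inode() */
        free_slot->i_count = 1;
        return free_slot;
}

/* Drop one reference; the slot becomes reusable at count 0. */
void iput(struct inode *inode)
{
        if (inode && inode->i_count > 0)
                inode->i_count--;
}
```

This is the malloc()/free() analogy from the text made literal: two iget() calls for the same inode return the same structure, and it is only freed when every holder has called iput().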

The integer file handle which is passed back to the application code is an offset into a file table for that process. That file table slot holds the inode number that was looked up with the namei() function until the file is closed or the process terminates. So whenever a process does anything to a ``file'' using a file handle, it is really manipulating the inode in question.
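A sketch of that per-process file table, with invented sizes and fields (and a plain int standing in for the inode pointer), might look like this:

```c
#include <stddef.h>

#define NR_OPEN 16

struct file {
        int in_use;
        int inode;      /* stand-in for a pointer to the in-core inode */
        long pos;       /* current offset, advanced by read()/write() */
};

static struct file fd_table[NR_OPEN];

/* "open": claim the lowest free slot and remember the inode;
 * the slot index is the handle the application sees. */
int fd_open(int inode)
{
        int fd;

        for (fd = 0; fd < NR_OPEN; fd++) {
                if (!fd_table[fd].in_use) {
                        fd_table[fd].in_use = 1;
                        fd_table[fd].inode = inode;
                        fd_table[fd].pos = 0;
                        return fd;
                }
        }
        return -1;      /* roughly, EMFILE */
}

/* Every later operation maps the handle back to the inode. */
int fd_to_inode(int fd)
{
        if (fd < 0 || fd >= NR_OPEN || !fd_table[fd].in_use)
                return -1;
        return fd_table[fd].inode;
}

/* "close": release the slot. */
void fd_close(int fd)
{
        if (fd >= 0 && fd < NR_OPEN)
                fd_table[fd].in_use = 0;
}
```

The handle itself carries no information beyond being an index; everything interesting lives in the table slot it names.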

inode Operations

That inode number and inode structure have to come from somewhere, and the VFS can't make them up on its own. They have to come from the filesystem. So how does the VFS look up the name in the filesystem and get an inode back?

It starts at the beginning of the path name and looks up the inode of the first directory in the path. Then it uses that inode to look up the next directory in the path. When it reaches the end, it has found the inode of the file or directory it is trying to look up. But since it needs an inode to get started, how does it get started with the first lookup? There is an inode pointer kept in the superblock called s_mounted which points at an inode structure for the filesystem. This inode is allocated when the filesystem is mounted and de-allocated when the filesystem is unmounted. Normally, as in the ext2 filesystem, the s_mounted inode is the inode of the root directory of the filesystem. From there, all the other inodes can be looked up.

Each inode includes a pointer to a structure of pointers to functions. Sound familiar? This is the inode_operations structure. One of the elements of that structure is called lookup(), and it is used to look up another inode on the same filesystem. In general, a filesystem has only one lookup() function that is the same in every inode on the filesystem, but it is possible to have several different lookup() functions and assign them as appropriate for the filesystem; the proc filesystem does this because different directories in the proc filesystem have different purposes. The inode_operations structure looks like this (defined, like most everything we are looking at, in <linux/fs.h>):

   struct inode_operations {
           struct file_operations * default_file_ops;
           int (*create) (struct inode *,const char *,int,int,struct inode **);
           int (*lookup) (struct inode *,const char *,int,struct inode **);
           int (*link) (struct inode *,struct inode *,const char *,int);
           int (*unlink) (struct inode *,const char *,int);
           int (*symlink) (struct inode *,const char *,int,const char *);
           int (*mkdir) (struct inode *,const char *,int,int);
           int (*rmdir) (struct inode *,const char *,int);
           int (*mknod) (struct inode *,const char *,int,int,int);
           int (*rename) (struct inode *,const char *,int,struct inode *,const char *,int);
           int (*readlink) (struct inode *,char *,int);
           int (*follow_link) (struct inode *,struct inode *,int,int,struct inode **);
           int (*readpage) (struct inode *, struct page *);
           int (*writepage) (struct inode *, struct page *);
           int (*bmap) (struct inode *,int);
           void (*truncate) (struct inode *);
           int (*permission) (struct inode *, int);
           int (*smap) (struct inode *,int);
   };

Most of these functions map directly to system calls.

In the ext2 filesystem, directories, files, and symlinks have different inode_operations (this is normal). The file fs/ext2/dir.c contains ext2_dir_inode_operations, the file fs/ext2/file.c contains ext2_file_inode_operations, and the file fs/ext2/symlink.c contains ext2_symlink_inode_operations.

There are many system calls related to files (and directories) which aren't accounted for in the inode_operations structure; those are found in the file_operations structure. The file_operations structure is the same one used when writing device drivers and contains operations that work specifically on files, rather than inodes:

   struct file_operations {
           int (*lseek) (struct inode *, struct file *, off_t, int);
           int (*read) (struct inode *, struct file *, char *, int);
           int (*write) (struct inode *, struct file *, const char *, int);
           int (*readdir) (struct inode *, struct file *, void *, filldir_t);
           int (*select) (struct inode *, struct file *, int, select_table *);
           int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
           int (*mmap) (struct inode *, struct file *, struct vm_area_struct *);
           int (*open) (struct inode *, struct file *);
           void (*release) (struct inode *, struct file *);
           int (*fsync) (struct inode *, struct file *);
           int (*fasync) (struct inode *, struct file *, int);
           int (*check_media_change) (kdev_t dev);
           int (*revalidate) (kdev_t dev);
   };
There are also a few functions which aren't directly related to system calls--and where they don't apply, they can simply be set to NULL.


The role of the VFS

The interaction between the VFS and specific filesystem types occurs through two main data structures, the super_block structure and the inode structure, and their associated data structures, including super_operations, inode_operations, file_operations, and others, which are kept in the include file <linux/fs.h>.

Therefore, the role of a specific filesystem code is to provide a superblock for each filesystem mounted and a unique inode for each file on the filesystem, and to provide code to carry out actions specific to filesystems and files that are requested by system calls and sorted out by the VFS.

Copyright (C) 1996 Michael K. Johnson,

