Linux 2.4 Filesystem Porting Issues

This document is a partial comparison of Linux kernels 2.2.18 and 2.4.0 focusing on changes in filesystem code. Kernel version references are found in endnotes. Please send any thoughts regarding errors or improvements to Jay Miller.

Modules are handled differently
LFS (Large File Support)
New error handling
Global filesystem stuff
The inode structure
The file structure
Concerns about the dentry cache
VFS operations
*_ops specified differently
VFS file_operations
VFS inode_operations
VFS super_operations
VFS dquot_operations
Miscellaneous
Buffer/Page Caches
Other References
Endnotes

Change Log

Date	Version	Author
2001-02-19	v0.3	Jay Miller (jnmiller, cryptofreak dot org)
Changes:	Conversion to HTML.
2001-01-23	v0.2	Jay Miller (jnmiller, cryptofreak dot org)
Changes:	Added kernel version endnotes and a few VFS ops additions.
2001-01-19	v0.1	Jay Miller (jnmiller, cryptofreak dot org)
Changes:	Initial release.

Modules are handled differently

Module initialization is now handled differently.[1] The old method (for a fake fs called 'myfs') is shown here:

   static struct file_system_type myfs_fs_type =
         { "myfs", FS_REQUIRES_DEV, myfs_read_super, NULL };
   
   __initfunc(int init_myfs_fs(void)) {
      return register_filesystem(&myfs_fs_type);
   }

   #ifdef MODULE
   EXPORT_NO_SYMBOLS;

   int init_module(void) {
      return init_myfs_fs();
   }

   void cleanup_module(void) {
      unregister_filesystem(&myfs_fs_type);
   }
   #endif

With the old method, a MOD_INC_USE_COUNT; call is also required in the FSD's read_super() function, and a matching MOD_DEC_USE_COUNT; call is required in put_super(). The new method is shown below:

   static DECLARE_FSTYPE_DEV(myfs_fs_type, "myfs", myfs_read_super);

   static int __init init_myfs_fs(void) {
      return register_filesystem(&myfs_fs_type);
   }

   static void __exit exit_myfs_fs(void) {
      unregister_filesystem(&myfs_fs_type);
   }

   EXPORT_NO_SYMBOLS;
   module_init(init_myfs_fs);
   module_exit(exit_myfs_fs);

The module use count (MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT) is now handled by the VFS as part of filesystem registration, so FSDs should no longer call these macros themselves.

LFS (Large File Support)

The VFS now supports 64-bit files (x86 and Sparc only).[2]

File offsets use the 64-bit type loff_t.
The kernel doesn't support a 64-bit rlimit(2) system call yet.
glibc supports getrlimit64(2) and setrlimit64(2), but wraps values that are too large to RLIM_INFINITY.

For more complete information on LFS support (including the source of the above info), head to Andreas Jaeger's LFS page.

New error handling

Code is moving toward returning errors encoded in the returned pointer itself, using the (slightly changed) helpers ERR_PTR(), PTR_ERR() and IS_ERR(), instead of passing back a separate error argument:

Old: if (!dir || !dir->i_nlink) { *err = -EPERM; return NULL; }

New: if (!dir || !dir->i_nlink) return ERR_PTR(-EPERM);
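The idiom can be tried outside the kernel. The following is a minimal user-space sketch, not kernel code: the three helpers are re-declared here and only modelled on the kernel's versions (treat the exact -1000 cut-off as an assumption), and struct dir_stub and myfs_check_dir() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* User-space stand-ins modelled on the kernel helpers (assumption). */
static inline void *ERR_PTR(long error)
{
    return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)ptr;
}

/* Small negative errno values land in the top ~1000 addresses. */
static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr > (unsigned long)-1000L;
}

/* Hypothetical stand-in for an inode; only the link count matters here. */
struct dir_stub {
    unsigned int i_nlink;
};

/* Hypothetical check: returns the directory or an encoded error. */
static struct dir_stub *myfs_check_dir(struct dir_stub *dir)
{
    if (!dir || !dir->i_nlink)
        return ERR_PTR(-EPERM);
    return dir;
}
```

A caller would test the result with IS_ERR() and, on failure, recover the error code with PTR_ERR() instead of inspecting a separate *err argument.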

Global filesystem stuff

The dynamically tunable fs parameters nr_files, nr_free_files, and max_files are now part of a new structure:[3]

   struct files_stat_struct {
      int nr_files;
      int nr_free_files;
      int max_files;
   };

New inode flags: S_SYNC, S_NOATIME (to replace the need to use mount flags MS_SYNCHRONOUS and MS_NOATIME in inodes) and S_DEAD[4] for a removed but still open directory (and IS_DEADDIR() to check).

And some new global filesystem flags:[5]

FS_SINGLE: Filesystem that can have only one superblock
FS_NOMOUNT: Never mount from userland
FS_LITTER: Keeps the tree in dcache

The inode structure

The file_operations pointer has been moved from the inode_operations structure to the actual inode structure.[6]

New: struct file_operations *i_fop;

Referenced with something like:

   inode->i_fop;

Also, the count on the inode is now of type atomic_t:[7]

Old: int i_count;

New: atomic_t i_count;

This type's definition is architecture dependent, so these variables must not be read or assigned directly. To read the variable or to set it equal to value, respectively, use the following accessors, where atom is a pointer to the atomic_t (for example, &inode->i_count):

   atomic_read(atom);
   atomic_set(atom, value);
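To make the calling convention concrete, here is a small user-space sketch built on C11 <stdatomic.h>. It is only an imitation: the real atomic_t and its helpers are provided per architecture by the kernel, and struct inode_stub and myfs_grab() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* User-space imitation of the kernel's atomic_t (assumption: the real
 * type and helpers are architecture specific and live in the kernel). */
typedef struct {
    atomic_int counter;
} atomic_t;

static inline int atomic_read(atomic_t *v)
{
    return atomic_load(&v->counter);
}

static inline void atomic_set(atomic_t *v, int i)
{
    atomic_store(&v->counter, i);
}

static inline void atomic_inc(atomic_t *v)
{
    atomic_fetch_add(&v->counter, 1);
}

/* Hypothetical cut-down inode carrying only the reference count. */
struct inode_stub {
    atomic_t i_count;
};

/* Where 2.2 code wrote "inode->i_count++", ported code goes through
 * the helpers and passes the address of the counter. */
static int myfs_grab(struct inode_stub *inode)
{
    atomic_inc(&inode->i_count);
    return atomic_read(&inode->i_count);
}
```

The same pattern applies to f_count and d_count: pass the address of the member, never the member itself.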

See also the sections on the file structure and the dentry cache.

The file structure

The count on the file is now of type atomic_t:[8]

Old: int f_count;

New: atomic_t f_count;

See the section on the inode structure for instructions on modifying this variable.

Concerns about the dentry cache

The count on the dentry is now of type atomic_t:[9]

Old: int d_count;

New: atomic_t d_count;

See the section on the inode structure for instructions on modifying this variable.

The d_delete operation now returns an int rather than void; in 2.4's dput(), a nonzero return causes the dentry to be unhashed instead of kept in the cache:[10]

Old: void (*d_delete)(struct dentry *);

New: int (*d_delete)(struct dentry *);

And d_alloc_root has changed in the following way:[11]

Old: struct dentry *d_alloc_root(struct inode *, struct dentry *);

New: struct dentry *d_alloc_root(struct inode *);

VFS operations

The filldir helper function takes a new unsigned argument that carries the entry's file type:[12]

Old: typedef int (*filldir_t)(void *, const char *, int, off_t, ino_t);

New: typedef int (*filldir_t)(void *, const char *, int, off_t, ino_t, unsigned);

The new argument is meant to be one of the following file type constants:

DT_UNKNOWN	DT_FIFO
DT_CHR	DT_DIR
DT_BLK	DT_REG
DT_LNK	DT_SOCK
DT_WHT
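A filesystem's readdir() has to come up with one of these constants for every entry it reports. The sketch below is user-space code under stated assumptions: it borrows the DT_* and S_IF* definitions from the C library headers, relies on the DT_* values being the file-type bits of the mode shifted right by 12 (true for the usual Linux definitions), and uses the hypothetical names myfs_dtype(), struct last_entry and record_entry().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose DT_* and S_IF* on glibc */
#endif
#include <assert.h>
#include <dirent.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* The 2.4 callback shape, re-declared here for user space. */
typedef int (*filldir_t)(void *, const char *, int, off_t, ino_t, unsigned);

/* Map an inode mode to a DT_* constant: the file-type bits sit in
 * the top four bits of the 16-bit mode. */
static unsigned myfs_dtype(mode_t mode)
{
    return (mode & S_IFMT) >> 12;
}

/* Hypothetical buffer that remembers the last entry reported. */
struct last_entry {
    char name[32];
    unsigned d_type;
};

/* Hypothetical filldir implementation with the six-argument signature. */
static int record_entry(void *buf, const char *name, int len,
                        off_t pos, ino_t ino, unsigned d_type)
{
    struct last_entry *e = buf;

    (void)pos;
    (void)ino;
    if (len < 0 || (size_t)len >= sizeof(e->name))
        return -1;
    memcpy(e->name, name, (size_t)len);
    e->name[len] = '\0';
    e->d_type = d_type;
    return 0;
}
```

A filesystem that cannot cheaply determine the type of an entry can pass DT_UNKNOWN instead.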

*_ops specified differently

All operations structures are specified differently now.[13] The old structure form (again, for our fake filesystem) might have looked like:

   struct file_operations myfs_file_operations = {
      myfs_file_lseek,
      generic_file_read,
      generic_file_write,
      NULL,
      NULL,
      myfs_ioctl,
      NULL,
   };

Now you can use:

   struct file_operations myfs_file_operations = {
      llseek: myfs_file_lseek,
      read: generic_file_read,
      write: generic_file_write,
      ioctl: myfs_ioctl,
   };

This uses a GNU C language extension that the ISO C99 standard has made obsolete. C99 designated initializers look something like this:

   struct foo {
      int foo;
      long bar;
   };

   struct foo x = { .bar = 3, .foo = 4 };

The GNU C extension we use in the fs code is called 'labeled initializer elements'. gcc supports both the extension (duh) and the C99 compatible .member syntax.[*]
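For comparison, here is how an operations table reads in the C99 .member form. This is a user-space sketch with a deliberately reduced, hypothetical structure (struct myfs_ops_stub and its simplified function signatures are not the kernel's); it only demonstrates the initializer syntax and the fact that members left out are zero-initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced operations table. */
struct myfs_ops_stub {
    long (*llseek)(long offset, int whence);
    long (*read)(char *buf, unsigned long count);
    long (*write)(const char *buf, unsigned long count);
    int (*ioctl)(unsigned int cmd, unsigned long arg);
};

static long myfs_file_lseek(long offset, int whence)
{
    (void)whence;
    return offset;
}

static int myfs_ioctl(unsigned int cmd, unsigned long arg)
{
    (void)arg;
    return (int)cmd;
}

/* C99 form: ".member =" instead of the GNU "member:" label.
 * Members that are not named (read, write) end up as NULL. */
static struct myfs_ops_stub myfs_file_operations = {
    .llseek = myfs_file_lseek,
    .ioctl  = myfs_ioctl,
};
```

Either spelling removes the need to count NULL placeholders, which is what makes the tables survive members being added or reordered.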

VFS file_operations

fsync() gained a new int argument: if set, don't bother flushing timestamps (see the Miscellaneous section for more on this function).[14]

Old: int (*fsync) (struct file *, struct dentry *);

New: int (*fsync) (struct file *, struct dentry *, int);

Two operations have been removed from this structure:[15]

Old: int (*check_media_change) (kdev_t dev); int (*revalidate) (kdev_t dev);

These functions were moved to a new structure:

   struct block_device_operations {
      int (*open) (struct inode *, struct file *);
      int (*release) (struct inode *, struct file *);
      int (*ioctl)(struct inode *, struct file *, unsigned, unsigned long);
      int (*check_media_change)(kdev_t);
      int (*revalidate) (kdev_t);
   };

This structure is referenced through a new member now found in the inode structure:

New: struct block_device *i_bdev;

Which in turn has a pointer to the operations structure as shown here:

   struct block_device {
      struct list_head bd_hash;
      atomic_t bd_count;
      dev_t bd_dev;
      atomic_t bd_openers;
      const struct block_device_operations *bd_op;
      struct semaphore bd_sem;
   };
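The chain from an inode to these operations is therefore inode, then i_bdev, then bd_op. The following user-space sketch walks that chain with stub types; everything in it is cut down for illustration (kdev_t is faked as an unsigned short, the structures carry only the members used here, and media_changed() and the mydev_ names are hypothetical).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short kdev_t; /* stand-in for the kernel type */

/* Stub versions holding only the members this sketch touches. */
struct block_device_operations {
    int (*check_media_change)(kdev_t);
    int (*revalidate)(kdev_t);
};

struct block_device {
    const struct block_device_operations *bd_op;
};

struct inode {
    kdev_t i_rdev;
    struct block_device *i_bdev;
};

/* Hypothetical driver method: always reports a media change. */
static int mydev_check_media_change(kdev_t dev)
{
    (void)dev;
    return 1;
}

static const struct block_device_operations mydev_bdops = {
    .check_media_change = mydev_check_media_change,
};

/* Hypothetical helper: follow inode -> i_bdev -> bd_op and call the
 * method if every link in the chain is present; otherwise report 0. */
static int media_changed(const struct inode *inode)
{
    const struct block_device_operations *ops;

    if (!inode->i_bdev || !inode->i_bdev->bd_op)
        return 0;
    ops = inode->i_bdev->bd_op;
    if (!ops->check_media_change)
        return 0;
    return ops->check_media_change(inode->i_rdev);
}
```

Code that used to reach check_media_change() or revalidate() through file_operations needs to be pointed at this chain instead.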

Two further operations have been added.[16] They can be called without the big kernel lock held in all filesystems. They implement the readv(2) and writev(2) system calls.

New: ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

VFS inode_operations

The file_operations pointer has been moved from the inode_operations structure to the actual inode structure.

follow_link is changed:[17]

Old: struct dentry * (*follow_link) (struct dentry *, struct dentry *, unsigned int);

New: int (*follow_link) (struct dentry *, struct nameidata *);

The first argument remains the same, while a new structure contains the previous final two arguments. This new structure looks like:

   struct nameidata {
      struct dentry *dentry;
      struct vfsmount *mnt;
      struct qstr last;
      unsigned int flags;
      int last_type;
   };

TODO: describe this structure?

This is all part of a rewrite of the symbolic link handling. These are the rules (and the order in which they are applied):

inside the path - always follow
in the last component in creation/removal/renaming - never follow
if LOOKUP_FOLLOW passed - follow
if the pathname has trailing slashes - follow
otherwise - don't follow
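Because the order of the rules matters, it may help to see them as a chain of early returns. This is only a sketch of the decision as listed above, not the kernel's lookup code: should_follow_link() and its boolean parameters are hypothetical, and the value given to LOOKUP_FOLLOW here is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define LOOKUP_FOLLOW 1u /* illustrative value, assumed for this sketch */

/* Decide whether a symbolic link found during a path walk is followed.
 * Each test corresponds to one rule, in the order they are applied. */
static bool should_follow_link(bool last_component,
                               bool create_remove_rename,
                               unsigned flags,
                               bool trailing_slash)
{
    if (!last_component)
        return true;   /* inside the path: always follow */
    if (create_remove_rename)
        return false;  /* last component of create/remove/rename: never */
    if (flags & LOOKUP_FOLLOW)
        return true;   /* caller asked for it */
    if (trailing_slash)
        return true;   /* "link/" means the target */
    return false;      /* otherwise: don't follow */
}
```

Note that the second rule wins over LOOKUP_FOLLOW and over trailing slashes, which is why it is checked first.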

Two new functions now appear:[18]

New: int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct dentry *, struct iattr *);

These functions (really just setattr()) replace the old superblock operation notify_change(). Their use is just as it used to be. In addition, the following five functions have all disappeared, but see the section on caches, because they've really only moved:[19]

Old: int (*readpage) (struct file *, struct page *); int (*writepage) (struct file *, struct page *); int (*updatepage) (struct file *, struct page *, unsigned long, unsigned int, int); int (*bmap) (struct inode *,int); int (*smap) (struct inode *,int);

VFS super_operations

The write_inode function has gained an extra parameter:[20]

Old: void (*write_inode) (struct inode *);

New: void (*write_inode) (struct inode *, int);

The added parameter is a boolean flag used to decide whether to sync the inode to disk. See also write_inode_now(), in the Miscellaneous section.

On the other hand, statfs() lost a parameter because the size of the statfs structure is no longer needed.[21]

Old: int (*statfs) (struct super_block *, struct statfs *, int);

New: int (*statfs) (struct super_block *, struct statfs *);

Finally, one function has been removed:[22]

Old: int (*notify_change) (struct dentry *, struct iattr *);

This functionality now appears in the inode_operations structure as getattr() and setattr(). See also the section on inode operations.

VFS dquot_operations

alloc_block(), alloc_inode() and transfer() all no longer require the uid as an argument:[23]

Old: int (*alloc_block) (const struct inode *, unsigned long, uid_t, char);

New: int (*alloc_block) (const struct inode *, unsigned long, char);

Old: int (*alloc_inode) (const struct inode *, unsigned long, uid_t);

New: int (*alloc_inode) (const struct inode *, unsigned long);

Old: int (*transfer) (struct dentry *, struct iattr *, uid_t);

New: int (*transfer) (struct dentry *, struct iattr *);

Miscellaneous

do_truncate() now takes its length argument as an loff_t:[24]

Old: int do_truncate(struct dentry *, unsigned long);

New: int do_truncate(struct dentry *, loff_t);

file_fsync() gained a new argument: if set, don't bother flushing timestamps (see the section on file operations).[25]

Old: int file_fsync(struct file *, struct dentry *);

New: int file_fsync(struct file *, struct dentry *, int);

__iget() now takes the place of the old iget_in_use() function:[26]

Old: struct inode *iget_in_use(struct super_block *, unsigned long);

New: static inline void __iget(struct inode *);

write_inode_now() now requires a 'sync' flag, like write_inode().[27]

Old: void write_inode_now(struct inode *);

New: void write_inode_now(struct inode *, int);

Buffer/Page Caches

The old buffer cache is still used for metadata, but use has changed a bit:[28]

Old: void mark_buffer_dirty(struct buffer_head *, int);

New: void mark_buffer_dirty(struct buffer_head *);

The page cache now handles file-content data by replacing two inode members with a third:[29]

Old: unsigned long i_nrpages; struct list_head i_pages;

New: struct address_space i_data;

It is a generic page cache, and each group of pages belonging to an object is described by an address_space structure:

   struct address_space {
      struct list_head clean_pages;
      struct list_head dirty_pages;
      struct list_head locked_pages;
      unsigned long nrpages;
      struct address_space_operations *a_ops;
      struct inode *host;
      struct vm_area_struct *i_mmap;
      struct vm_area_struct *i_mmap_shared;
      spinlock_t i_shared_lock;
   };

host points to the object that owns these pages, such as an inode or a block device. i_mmap and i_mmap_shared point to the private and shared mappings, respectively. i_shared_lock is a spinlock protecting the address space. a_ops points to a new table of function pointers, the address_space_operations:[30]

   struct address_space_operations {
      int (*writepage)(struct page *);
      int (*readpage)(struct file *, struct page *);
      int (*sync_page)(struct page *);
      int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
      int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
      int (*bmap)(struct address_space *, long);
   };

Of these, readpage, writepage and bmap used to reside in the inode_operations structure (see the VFS inode_operations section).

Other References

For more information on the 2.4 kernel, you might try any of the following links.

Richard Gooch has a page on the wait queue system's changes.
Tigran Aivazian has a large amount on the 2.4 kernel including info on booting, process and interrupt management, VFS internals, and the page cache.
Andreas Jaeger maintains a page on large file support (LFS) in the new Linux kernel.

Footnotes

[*] The 'obsolete since GCC 2.5' clauses in the GCC extensions document are fairly recent additions.

Endnotes

[1]  Done throughout the 2.3 process.
[2]  2.4.0-test7.
[3]  2.4.0-test3.
[4]  2.4.0-test6.
[5]  single/nomount: 2.3.99-pre7; litter: 2.4.0-test3.
[6]  2.3.48.
[7]  2.4.0-test2.
[8]  2.3.9.
[9]  2.4.0-test3.
[10]  2.3.99-pre9.
[11]  2.3.0.
[12]  2.4.0-test7.
[13]  Done throughout the 2.3 process.
[14]  2.4.0-test3.
[15]  All this block device stuff happened during 2.3.38.
[16]  2.3.44.
[17]  2.3.99-pre4.
[18]  2.3.48.
[19]  2.3.43.
[20]  2.4.0-test3.
[21]  2.3.51.
[22]  2.3.48.
[23]  2.3.30.
[24]  2.3.30.
[25]  2.4.0-test3.
[26]  2.3.0.
[27]  2.4.0-test3.
[28]  2.4.0-test8.
[29]  This business started in 2.3.24.
[30]  First appearing in 2.3.43.

