The VM bug that delayed Linux-2.6.34-rc4

Arguably, the most complex area in operating systems is memory management, i.e., managing virtual memory (VM).

Linux-2.6.34-rc3 came out on March 30, and nearly 2 weeks later -rc4 had still not appeared. The reason? Well, we are going to look at it.

The anon_vma scalability patches submitted by Rik van Riel were merged in the -rc1 phase. Borislav Petkov, using -rc3, hit a bug that caused a crash while suspending to disk (hibernate). Linus chimed in, suspecting that this could be caused by Rik's new scalable anon_vma linking code. The bug usually appears under severe memory pressure. Borislav explains that the procedure to trigger it consistently is to run 3 KVM guests, open Firefox and load a huge HTML file, and then try to s2disk: kaboom! Though Borislav himself suspected a hardware issue, since not many people had observed the crash, Linus disagreed because he had seen a similar oops on the Mac Mini his kids use. So it was likely a real bug that needed to be identified and fixed. And since the anon_vma code is very complex, with various levels of locking and RCU usage, Linus wants to simplify mm/rmap.c considerably.

So the bug hunt began, led by Linus Torvalds and Rik van Riel, joined by Johannes Weiner, Kosaki Motohiro and Minchan Kim, with every patch tested by Borislav Petkov. After 10 days of debugging, many patches flying around, {,in}validating various theories, and finding and fixing 3 other independent bugs (1, 2, 3) in the VM area (though the second fix may not be required), Linus came up with a new theory, which he explains along with a small patch. And Borislav confirmed that his netbook survived more than 20 suspend cycles, even under severe memory pressure.

Had the bug not been isolated and fixed, Linus was planning to revert the whole set of anon_vma scalability patches, which didn’t sound good. Dropping the effort to fix the bug when they felt so close to a fix didn’t sound good either. All 4 patches can now be found here: 1, 2, 3, 4. And with that, -rc4 is out in the wild.


Update: The-as-usual-excellent article:

Ext3 ‘data=guarded’ mode coming for Linux kernel 2.6.30?

The recent discussions of the “Ext3 fsync” problem on the Linux kernel mailing list, involving many experts in the field, have led to quite a few improvements. There were patches from Theodore Ts’o (the Ext4 maintainer), Jens Axboe (the block layer maintainer), Chris Mason (the Btrfs developer) and others. An overview of the discussion can be found here.

The Ext3 filesystem by default mounts the disk in “data=ordered” mode. This basically means that the actual data is written to the disk before the corresponding metadata is written. But it causes long delays in the “fsync()” system call. The usual suggested improvement is to change the mount option to “data=writeback”. Whenever I install a Fedora system, one of the first things I do to speed up file operations is to change the Ext3 mount option to “writeback”. But that has its own share of problems too: after a crash, data is much more likely to be corrupted than in “ordered” mode. Having risked data integrity for performance, I ran into issues myself. The system sometimes locked up when I used a compositing manager like Compiz or KWin, and I had to use Alt-SysRq to reboot. After the crash, when I logged back in, some of the GNOME settings would be lost or slightly corrupted.
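To make the fsync() cost concrete, here is a minimal sketch of the pattern an application might use to protect a settings file from the kind of corruption described above. It is an illustration under stated assumptions, not what GNOME or any particular program actually does: the function name `durable_write` is made up for this example, and the directory fsync assumes a Linux/POSIX system. The `os.fsync()` calls are where an Ext3 “data=ordered” mount can stall for a long time.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at `path` with `data` (bytes) and return only
    after the new contents have been pushed to disk.

    `durable_write` is a hypothetical helper written for this example.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Write to a temporary file in the same directory, so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # flush Python's userspace buffer
            os.fsync(tmp.fileno())  # ask the kernel to write the data out
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Sync the directory so the rename itself is on disk
    # (assumption: Linux/POSIX, where a directory can be opened read-only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A call such as `durable_write("settings.conf", b"theme=dark\n")` leaves either the old file or the complete new one in place, at the price of waiting for the fsync() calls to finish.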

Well, Chris Mason has proposed another mode, enabled by a new mount option called “data=guarded”, which takes care of most of the issues with performance and data integrity. Version 3 of the patchset was posted yesterday. And it looks very probable that Linus will pull it for inclusion into kernel 2.6.30, which is in the rc2 state now.

Update: The first 2 patches in the series (of 3 patches), which add the infrastructure for the “data=guarded” mode, have been merged by Linus. The 3rd patch, which adds the functionality, was tested by Mike Galbraith, who found a data corruption issue; Chris Mason quickly found the root cause and fixed it.