Follow up to 'I booted Linux 292,612 times'
Well that blew up. It was supposed to be just a silly off-the-cuff comment about how some bugs are very tedious to bisect.
To answer a few questions people had, here’s what actually happened. As they say, don’t believe everything you read in the press.
[...]
At that point I thought I had the right commit, but Paolo Bonzini suggested that I boot the kernel in parallel, in a loop, for 24 hours at the point immediately before the commit, to try to show that there was no latent issue in the kernel before it. (As it turns out, while this is a good idea, the analysis is subtly flawed, as we’ll see.)
So I did just that. After 21 hours I got bored (plus this was using a lot of electricity and generating huge amounts of heat, and we’re in the middle of a heatwave here in the UK). I killed the test after 292,612 successful boots.
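For readers wondering what "boot in parallel, in a loop" looks like in practice, here is a minimal sketch of that kind of harness. It is not the script I actually ran: the function names, the worker count, and the rule that a boot which doesn't finish within a timeout counts as a hang are all assumptions made for illustration, and `cmd` stands in for whatever command boots your test kernel and exits.

```python
import subprocess
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def boot_once(cmd, timeout_s):
    """Run one boot attempt; True if it exits with status 0 before the timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs past this
        )
    except subprocess.TimeoutExpired:
        return False  # assumption: no exit within the timeout means a hang
    return result.returncode == 0


def stress_boot(cmd, workers, timeout_s, max_seconds):
    """Keep `workers` boots in flight until one fails or the time budget ends.

    Returns (successful_boots, saw_failure).
    """
    deadline = time.monotonic() + max_seconds
    successes = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(boot_once, cmd, timeout_s) for _ in range(workers)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                if not fut.result():
                    # Stop at the first failure; boots still in flight are
                    # allowed to finish (or time out) as the pool shuts down.
                    return successes, True
                successes += 1
                # Replace the finished boot only while there is time left.
                if time.monotonic() < deadline:
                    pending.add(pool.submit(boot_once, cmd, timeout_s))
    return successes, False
```

You would call it as something like `stress_boot(cmd, workers=8, timeout_s=60, max_seconds=24 * 3600)`, where the numbers are placeholders. For scale, 292,612 boots in 21 hours works out to roughly 3.9 boots per second on average, which is why running them in parallel matters.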
I had a commit that looked suspicious, but what to do now? I posted my findings on LKML.
We still didn’t fully understand how to trigger the hang. It was annoying and rare, seemed to happen with different frequencies on AMD and Intel, and could be reproduced by several independent people, but crucially kernel developer Peter Zijlstra could not reproduce it.