Troubleshooting GRUB hangs

This post describes one possible scenario in which GRUB hangs (freezes) with the four letters GRUB and a blinking cursor displayed on the screen, before the boot menu even appears. It might also display "Loading stage2..." and then hang. You might or might not be able to restart the computer with CTRL+ALT+DEL when this occurs. An explanation and a procedure for identifying or excluding the scenario are provided.

When GRUB hangs before displaying the boot menu, the immediate cause is most likely a failed attempt to load the stage2 file from the root partition ("root" refers to whatever was specified when installing GRUB into the boot sector). If you get no error messages at all, just a blinking cursor as mentioned above, then the underlying problem may lie in the implementation of BIOS interrupt 13h, on which the installed GRUB (i.e. stage1 stored in the boot sector) depends to read raw data from disk. GRUB cannot avoid the crash if its humble attempt to read sector n from disk locks up the computer. When this happens, some garbage characters might also appear on the screen (printed by the BIOS, not GRUB).

It seems that some BIOSes, even versions released as late as 2005, have big trouble reading sectors (blocks) beyond a certain boundary. In my test case (AMIBIOS, 80 GB Maxtor disk) the last readable sector turned out to be 66059279.
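
The boundary above has a suggestive structure. The following is only a sketch of the arithmetic: the 512-byte sector size and the geometry of 16 heads by 63 sectors per track are assumptions, not values reported by the BIOS. Under those assumptions, the first unreadable sector (66059280) falls exactly on a cylinder boundary at roughly 33.8 GB:

```python
# Assumptions: 512-byte sectors and a translated CHS geometry of
# 16 heads x 63 sectors per track; neither was reported by the BIOS.
SECTOR_SIZE = 512
HEADS = 16
SECTORS_PER_TRACK = 63

LAST_READABLE = 66059279          # last readable sector in the test case above
first_bad = LAST_READABLE + 1     # first sector that could not be read

# How many whole cylinders fit below the boundary, and what is left over?
cylinders, remainder = divmod(first_bad, HEADS * SECTORS_PER_TRACK)
boundary_bytes = first_bad * SECTOR_SIZE

print(f"first unreadable sector: {first_bad}")
print(f"cylinders below the boundary: {cylinders} (remainder {remainder})")
print(f"boundary in bytes: {boundary_bytes} (~{boundary_bytes / 1e9:.1f} GB)")
```

If you repeat the calculation with your own boundary and the remainder is again zero, that hints at a cylinder-count limit in the BIOS rather than a defect on the disk, though it does not prove it.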

In order to find out whether or not you are experiencing the same problem, you should perform these additional tests:

  • Check the sector number where your root partition begins and ends (using fdisk, type u to change display units to sectors). (If it is the first disk partition, it's unlikely that you have the problem discussed here.)
  • Insert a boot CD with a working GRUB menu (a simple /usr/sbin/grub shell is not enough; see note below). Press the key 'c' after the GRUB menu appears while booting from the CD. Now enter the command root (hdm,n), replacing m with the disk number (in Linux: 0 = hda, 1 = hdc) and n with the partition number (in Linux: 0 = hdm1, 1 = hdm2, etc.) on which the file /boot/grub/stage2 is supposed to be found. If it hangs immediately after entering the command, but does not hang for root (hd0,0), it is likely that you have the described problem.
  • You can refine the diagnosis further by attempting to read individual sectors using the command cat (hd0)sector_num+1. For example, in my case cat (hd0)66059280+1 produced the error message "Error 18: Selected cylinder exceeds maximum supported by BIOS". Attempts with a smaller sector number worked, and attempts to read a much higher sector number (where the actual root partition started) caused a hang. When you see this behavior, you can be quite certain that you have a broken BIOS.
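
Finding the exact boundary by hand takes fewer probes if you bisect. The sketch below is an illustration only: can_read is a hypothetical callback that stands in for you typing the command at the GRUB prompt, the simulated BIOS simply reuses the boundary from the test case above, and the total sector count is an assumed figure for an 80 GB disk.

```python
def grub_cat_command(sector, disk="hd0"):
    """Format the GRUB command that reads one sector starting at `sector`."""
    return f"cat ({disk}){sector}+1"


def find_last_readable(can_read, low, high):
    """Binary-search the highest readable sector.

    Assumes `low` is readable, `high` is not, and that every sector up to
    the boundary reads fine while every sector above it fails.
    `can_read` is a hypothetical callback: on a real machine it is you,
    typing the command at the GRUB prompt and noting the outcome.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if can_read(mid):
            low = mid
        else:
            high = mid
    return low


def simulated_bios(sector):
    """Stand-in for a real probe, using the boundary from the test case above."""
    return sector <= 66059279


probes = []  # the GRUB commands you would have typed, in order


def logged_probe(sector):
    probes.append(grub_cat_command(sector))
    return simulated_bios(sector)


# 160086528 is an assumed sector count for an 80 GB disk, not a measured one.
last = find_last_readable(logged_probe, 0, 160086528)
print("last readable sector:", last)
print("commands tried:", len(probes))
```

On real hardware every probe that hangs costs a reboot from the CD, so it pays to start with a narrow range around the partition boundaries reported by fdisk.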

Note that performing the above tests from a grub shell after having successfully booted (say, into another OS on the same PC or from a Live CD) will not reproduce the symptoms and therefore will not help with troubleshooting. The grub shell binary uses different system calls to read sectors than the actual stage2 binary, and these may work well where the native BIOS interrupt fails.

In my case, upgrading the BIOS didn't help. However, repartitioning and placing the root partition into lower sectors (swapping hdc2 with hdc3) solved the problem... or so I thought.

In a cruel twist of fate, the problem recurred with exactly the same symptoms just a few days later. It turned out that I had moved the root partition toward the beginning of the disk but had not resized it, so its end still stretched beyond the fatal sector. Editing menu.lst, or possibly simply rebooting (OpenSolaris), moved one of the files required by GRUB to higher sectors, making the partition unbootable again. Lesson: if you have the problem described here (which, due to its nature, you might only discover when your system suddenly and inexplicably stops booting), the whole partition that contains the GRUB files must lie in low sectors!
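
The lesson can be turned into a quick check against the start and end sectors that fdisk prints. This is a minimal sketch: the boundary is the one from the test case above and may differ on your machine, and the partition layouts are made-up examples, not my actual partition table.

```python
LAST_READABLE = 66059279  # boundary from the test case above; yours may differ


def classify_partition(start, end, last_readable=LAST_READABLE):
    """Classify a partition given its first and last sector (inclusive)."""
    if end <= last_readable:
        return "safe"        # every sector is readable through the BIOS
    if start <= last_readable:
        return "straddles"   # boots only while GRUB's files sit in low sectors
    return "unreadable"      # lies entirely beyond the boundary


# Hypothetical layouts as (start, end) in sectors, for illustration only.
examples = {
    "small partition at disk start": (63, 2104514),
    "moved but not resized": (40000000, 90000000),
    "entirely in high sectors": (100000000, 140000000),
}

for name, (start, end) in examples.items():
    print(f"{name}: {classify_partition(start, end)}")
```

A partition that "straddles" the boundary is the trap described above: it boots until a file GRUB needs is rewritten into the unreadable part.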

Update: the recommendation of moving the boot partition into low sectors is apparently not enough with OpenSolaris. My machine won't boot again. The file system code in GRUB actually reads sectors that lie outside the boot partition boundary. And even after hacking it to stop reading such sectors, the final configuration doesn't boot. Although both kernel$ and module$ can be set and the files are reported as found, all I get on a boot attempt is a blinking cursor. I suppose some part of the OpenSolaris kernel or boot loader may be trying to read the high sectors as well. This is where I say farewell to OpenSolaris (on that machine ;-)).


Anonymous said...

This very clearly describes the problem. We were experiencing the same issue when we tried to have multiple instances of GRUB on the same drive. One GRUB would just chainload to the other and it would be seemingly random whether a new install of the OS would cause this issue. We instead went with using one instance of GRUB with its files located in the very first partition that gets created on the system.

Anonymous said...

Excellent explanation.

Billy Cassanova said...

Sometimes the IRQ settings in your BIOS conflict.
Try changing the IRQ or unplugging all PCI devices.


