Tuesday, October 7, 2014

And now for something completely different: bad blocks, bad trip

You've already met thule, my recently resurrected Macintosh IIci running NetBSD. The peril of running something old is that things fail. No sooner had I gotten a newly recapped motherboard into it than the hard disk started throwing a couple of bad sectors.

Bad sectors happen; magnetic media occasionally gets glitchy. (I do have a backup, but I like running things until they die, so I'm going to keep this drive running until it don't run no more.) Modern drives can often recruit from a pool of sectors and transparently redirect a write to a bad sector to a good sector, but no hard drive can do this on read if the media is bad, and on a drive this geriatric (a 2GB Quantum Fireball from oh-my-gawd-it's-old) even the former is not a given. We want some way to tell the operating system to never use those sectors again.

Some vintage operating systems do this more or less overtly. For example, my Alpha Micro Eagle 300 has an actual bad block file that tells the operating system hands off these sectors (one wonders what would happen if a bad block occurred in the sectors that actually contain the bad block file). Classic Mac OS doesn't really support this, though tools like Norton Disk Doctor can quietly allocate them out of reach. However, Norton Disk Doctor doesn't understand a NetBSD FFS volume. NetBSD does have a tool called badsect(8) that can take a list of sectors and create bogus file descriptors that fsck_ffs will turn into invalid files soaking up those sectors; similar tools exist on other BSDs and even some Linux distributions. This is a little scary and there's not much documentation on this because there aren't many freaks like me running NetBSD on something this old. So here goes.

The console will tell you the bad sector (either on screen or through dmesg). Here's the actual output thule was spitting out:

sd0(ncrscsi0:0:0):
Check Condition on CDB: 0x28 00 00 32 3d 61 00 00 01 00
    SENSE KEY:  Media Error
   INFO FIELD:  3292513
     ASC/ASCQ:  Unrecovered Read Error

sd0(ncrscsi0:0:0):
Check Condition on CDB: 0x28 00 00 32 3d 62 00 00 01 00
    SENSE KEY:  Media Error
   INFO FIELD:  3292514
     ASC/ASCQ:  Unrecovered Read Error

sd0(ncrscsi0:0:0):
Check Condition on CDB: 0x28 00 00 32 3d 63 00 00 01 00
    SENSE KEY:  Media Error
   INFO FIELD:  3292515
     ASC/ASCQ:  Unrecovered Read Error

Eeewww.

What this means is that relative sectors 3292513-5 inclusive are kaput. I noticed this because the periodic full system backup had started failing on certain files, so I knew which files were bad, and I could restore them from the previous backup. But before I could do that, I wanted to make sure the OS didn't try to reuse the bogus sectors.
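If you want to double-check the numbers before doing anything irreversible, the sector is also sitting in the command block itself. The sketch below assumes the standard 10-byte SCSI READ(10) layout (opcode 0x28, a four-byte big-endian block address in bytes 2-5, a two-byte block count in bytes 7-8); the function name decode_read10 is made up for illustration. It only confirms that the CDB and the INFO FIELD agree with each other, nothing more.

```python
def decode_read10(cdb_hex):
    """Return (block_address, block_count) from a READ(10) CDB given as hex text."""
    # int("0x28", 16) and int("28", 16) both parse, so the leading 0x is harmless.
    cdb = bytes(int(tok, 16) for tok in cdb_hex.split())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: number of blocks requested
    return lba, count


# The three command blocks from thule's console output above.
for cdb in (
    "0x28 00 00 32 3d 61 00 00 01 00",
    "0x28 00 00 32 3d 62 00 00 01 00",
    "0x28 00 00 32 3d 63 00 00 01 00",
):
    print(decode_read10(cdb))
```

Each command asked for a single block, and the decoded addresses match the three INFO FIELD values.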

First, I rebooted it in single user mode. This is very important; you are tinkering with a live filesystem and run not only the risk of kernel panic but serious data loss if other things are writing to the disk while you are. (Yes, I've panicked systems at least once by cheating on this. Learn from my mistakes.) Next, I ran fsck on the disk and allowed it to remove the wrecked files, generating more media errors in the process. It did so, and instructed me to reboot. But we don't want to do that yet; we still have those sectors to account for.

Now we use badsect. We create a directory /BAD and tell badsect to create the placeholder files for those crashed sectors:

badsect /BAD 3292513 3292514 3292515

BE VERY CAREFUL. There is no going back. If you type the sector wrong, you may only waste a perfectly good unallocated sector, or you may destroy a file mid-stream. Check your typing twice, and then check it again. If you did this right, the sectors will be marked and badsect will tell you to run fsck again. Do so.
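One way to take the typing out of the equation is to build the command line from the console messages instead of by hand. This is only a sketch, run on some other machine against a saved copy of the log: it assumes the errors look like the ones shown earlier (an "INFO FIELD:" line followed by a decimal number), it takes those numbers at face value just as this post does, and the names bad_sectors and badsect_command are invented for the example. It prints the command; it does not run it.

```python
import re

# A saved copy of the console output (abbreviated to the lines that matter).
SAMPLE_LOG = """\
sd0(ncrscsi0:0:0):
    SENSE KEY:  Media Error
   INFO FIELD:  3292513
sd0(ncrscsi0:0:0):
    SENSE KEY:  Media Error
   INFO FIELD:  3292514
sd0(ncrscsi0:0:0):
    SENSE KEY:  Media Error
   INFO FIELD:  3292515
"""


def bad_sectors(log_text):
    """Collect the unique sector numbers reported in INFO FIELD lines, sorted."""
    found = {int(n) for n in re.findall(r"INFO FIELD:\s*(\d+)", log_text)}
    return sorted(found)


def badsect_command(directory, sectors):
    """Format a badsect command line for review; nothing is executed."""
    return " ".join(["badsect", directory] + [str(s) for s in sectors])


print(badsect_command("/BAD", bad_sectors(SAMPLE_LOG)))
```

Repeated errors for the same sector collapse into one entry, so retrying a failed read a dozen times won't pad the list. You still need to eyeball the result before you type it in.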

This time, fsck will notice the bogus files and ask HOLD BAD BLOCK? The answer is yes. It should ask you once for each sector. If you correctly reserved all the bad sectors, there should be no more media errors on the console. If the bad sectors were part of the file system, you may have some DUPs to resolve; answer yes to those as well if it asks.

Now the part that threw me initially: when fsck proceeds to the second phase and checks pathnames, it will (correctly) notice that the files in /BAD point to bad blocks ... and offer to delete them! Do not let it!

DUP/BAD I=277769 OWNER=root MODE=100600
SIZE=1024 MTIME=Oct 7 19:58 2014
FILE=/BAD/3292513

REMOVE?

The answer this time, and for anything in /BAD, is NO -- or you'll remove the linkage to the bad sector, and you'll have to create it all over again. This will be true when you fsck this volume again, and you will, because bad sectors will eventually occur somewhere else.

By the end, after you've allowed fsck to complete all the remaining salvage, you shouldn't see any more media errors on the console. Now reboot and restore the files that were lost, if they were important, and tell your backup script to exclude anything in /BAD.

I'm toying with resurrecting thule a second time, but this time running A/UX. That might be fun.

1 comment:

  1. I used to do this the simpleton way, when I couldn't read a file anymore, I would move that file to a BAD directory and just rename it BAD001 and then restore the original file and the computer would continue as normal.
    Also back in the day the advanced SCSI formatting utilities had all kinds of extra things like disk encryption and mapping out bad blocks. I miss those fun days. :)
