DirecTiVo Rebooting! Hard Drive Possibly Failing? (Just Sharing My Pain)



JakFrost
04-16-2004, 11:19 AM
Long story short, my Sony SAT-T60 DirecTiVo started rebooting every few minutes or every few hours the day before yesterday. It has a Maxtor 120GB IDE (Model 4G120J6) hard drive in it.

When it was first installed I would wake up in the morning sometimes and hear the hard drive doing the clank-clank sound, but after I pulled the power and put it back the system would work again. This only happened once or twice ever and then the hard drive would run like a champ for months on end.

I pulled the hard drive out, hooked it up to my spare PC and did a backup on it.

mfsbackup -6so /mnt/dos/tivo-2004-04-14.bak /dev/hdb

Then I ran Maxtor PowerMax 4.09 in Quick Test mode, which reported a problem and said to run the Advanced Test (Full Test) mode. I did that and, lo and behold, there were errors on the disk, but the utility was able to correct them (I'm guessing it just marked the sectors as bad).

Well, since the test did not erase the hard drive, I put it back into the DirecTiVo and let it run overnight. Everything was cool, the system was up for 6 hours, but then it went down right when I connected to it over the network to check the uptime using TivoWeb. But it was the morning and I had to go to work, so I just left the system like that. It continued its crashing every few minutes to every few hours.

Last night I pulled out the old Quantum Fireball 40GB hard drive that came with the TiVo and tried to do a restore to it using the current backup.

I think that I'm prepared; I have backup images that I've done every few months.

Backup Images

DirecTiVo - Sony SAT-T60 - 2.5.2-01-1-011 - 2002-09-05.bak
DirecTiVo - Sony SAT-T60 - 2.5.2-01-1-011 - 2002-09-05.bak.md5
DirecTiVo - Sony SAT-T60 - 3.1.0-01-1-011 - 2003-01-09.bak
DirecTiVo - Sony SAT-T60 - 3.1.0-01-1-011 - 2003-01-09.bak.md5
DirecTiVo - Sony SAT-T60 - 3.1.0b-02-1-011 - 2003-11-08.bak
DirecTiVo - Sony SAT-T60 - 3.1.0b-02-1-011 - 2003-11-08.bak.md5
DirecTiVo - Sony SAT-T60 - 3.1.0b-02-1-011 - 2004-01-10.bak
DirecTiVo - Sony SAT-T60 - 3.1.0b-02-1-011 - 2004-01-10.bak.md5

Yesterday's backup too.

DirecTiVo - Sony SAT-T60 - 3.1.0b-02-1-011 - 2004-04-14.bak
DirecTiVo - Sony SAT-T60 - 3.1.0b-02-1-011 - 2004-04-14.bak.md5

So I get ready to do the restore.

mfsrestore -pxzi /mnt/dos/tivo-2004-04-14.bak /dev/hdb

But instead of a successful restore I get an error.

Decompression failed: 39%

I'm like WTF! I just did the backup damn it, why doesn't it work.

I go back to try and restore the 2004-01-10 backup image and get the same damn error, but at 37% instead.

Now I'm really peeved since these backups aren't working. I know that they are not corrupted because every time that I do a "mfsbackup" I always do a "md5sum" on it to store the checksum to check for corruption and I checked the checksums on these two backup files and they are fine.
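For anyone following along, the kind of check described above looks roughly like this. It is only a sketch, assuming GNU md5sum; the file name is a stand-in, not one of the real backup images.

```shell
#!/bin/sh
# Sketch of storing and later re-checking a checksum with GNU md5sum.
# example.bak is a stand-in file, not a real mfsbackup image.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'stand-in for a backup image\n' > example.bak

# Store the checksum next to the file right after the backup is written.
md5sum example.bak > example.bak.md5

# Later: re-read the file and compare it against the stored checksum.
if md5sum -c example.bak.md5 >/dev/null 2>&1; then
    check_status="match"
else
    check_status="mismatch"
fi
echo "$check_status"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

A match only says the file is unchanged since the checksum was taken; it says nothing about whether the data was good at the moment it was written.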

I end up going back to the 2003-11-08 image and finally that one works and restores, but I get a different error message instead.

Inode 43507 something something error something

Luckily this image worked well enough on the old hard drive that I was able to boot the TiVo with it. I did the service call and the system downloaded the new 3.1.0b-02-01-011 software.

Now I just have to set my Season Passes up again.

Anyway, below are the short excerpts of the crash times from the kernel log. There are a lot more crashes in there but I'm only posting a few.

The stupid thing is that the logs overflow because of these damn "tcp_keepalive" messages. Are these from the TurboNet driver, from TivoWeb, or from something else? I gotta get rid of them this time around when I get on that hard drive.

Kernel Log




First Crash

Apr 14 05:28:50 (none) kernel: tcp_keepalive: call keepopen(0x81426c40)
Jan 1 00:00:18 (none) kernel: Loaded 11 symbols from 7 modules.

Following Crashes

Apr 14 06:39:46 (none) kernel: tcp_keepalive: call keepopen(0x8141ac40)
Jan 1 00:00:18 (none) kernel: Loaded 11 symbols from 7 modules.

Apr 14 10:07:46 (none) kernel: tcp_keepalive: call keepopen(0x81467c40)
Jan 1 00:00:18 (none) kernel: Cannot find map file.

Apr 14 10:17:04 (none) kernel: tcp_keepalive: call keepopen(0x8140dc40)
Jan 1 00:00:18 (none) kernel: DMA Channel IDESG, next irq 54, idle irq 62, channel 21

Apr 14 10:26:59 (none) kernel: tcp_keepalive: call keepopen(0x8145bc40)
Jan 1 00:00:18 (none) kernel: memstart = 0x801bd000 : size = 0x280000

Apr 14 14:37:58 (none) kernel: tcp_keepalive: call keepopen(0x813ffc40)
Jan 1 00:00:17 (none) kernel: Loaded 11 symbols from 7 modules.

Apr 14 14:48:13 (none) kernel: tcp_keepalive: call keepopen(0x81413c40)
Jan 1 00:00:18 (none) kernel: Loaded 11 symbols from 7 modules.
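In the excerpts above each crash shows up as the timestamp jumping back to Jan 1 00:00. A rough way to pull those points out of a long log, and to hide the keepalive noise when reading the rest, might look like the sketch below. The helper name find_reboots is made up for illustration, and it assumes the clock always resets to Jan 1 at boot and is set again within the first hour.

```shell
#!/bin/sh
# find_reboots is a hypothetical helper: it prints the last line logged
# before each point where the timestamp resets to "Jan 1 00:..".
# Assumption: the box boots with its clock at Jan 1 and sets it soon after.
find_reboots() {
    awk '
        $1 == "Jan" && $2 == "1" && $3 ~ /^00:/ {
            # First line after a reset: report the line that preceded it.
            if (!in_boot && prev != "") print prev
            in_boot = 1
            next
        }
        { in_boot = 0; prev = $0 }
    ' "$@"
}

# Sample input copied from the first two crashes in the excerpt.
log=$(mktemp)
cat > "$log" <<'EOF'
Apr 14 05:28:50 (none) kernel: tcp_keepalive: call keepopen(0x81426c40)
Jan 1 00:00:18 (none) kernel: Loaded 11 symbols from 7 modules.
Apr 14 06:39:46 (none) kernel: tcp_keepalive: call keepopen(0x8141ac40)
Jan 1 00:00:18 (none) kernel: Loaded 11 symbols from 7 modules.
EOF

# Last line seen before each reboot.
reboots=$(find_reboots "$log")
echo "$reboots"

# Separately, drop the keepalive messages when reading the rest of the log.
quiet=$(grep -v 'tcp_keepalive:' "$log")
echo "$quiet"

rm -f "$log"
```

The keepalive filter is kept separate on purpose: in this log the last line before every reboot is itself a tcp_keepalive message, so filtering first would hide the crash times.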

JakFrost
04-16-2004, 11:26 AM
Well the moral of the story is that even having good backups sometimes just isn't enough and you need a lot of luck instead.

I thought that mfsbackup and mfsrestore would be damn reliable, but it doesn't appear so with these decompression errors. I wonder if the new mfsrestore 2.1a beta from Tiger that I found on the net fixes these issues. I might try to make it work if I can get the right libraries for it.

Now, I'm just low level formatting the Maxtor 120GB, it's about 15% done. I'll try to salvage this hard drive and see if I can make it work.

forecast8
04-17-2004, 05:08 PM
I really think your drive is the culprit. Obviously you don't have another one on hand to test.

I think that the TiVo file system makes assumptions about drives that the Maxtor util can't fix. It could be that the restore utility doesn't allow sectors to be marked as bad.

A low level format basically tests every sector and logically maps in replacements from a special area reserved for this. The size of this area is fixed and not very big, so if many sectors are bad it may be impossible to make the drive whole. I think in your case this will probably resolve the issue for a while. New drives have no bad sectors since they are low level formatted before shipment.

However, if a hard drive starts developing bad spots, I bet it will continue to go bad. These things are low level formatted at the factory with the assumption that the drive quality is fixed.

alldeadhomiez
04-17-2004, 05:28 PM
Now I'm really peeved since these backups aren't working. I know that they are not corrupted because every time that I do a "mfsbackup" I always do a "md5sum" on it to store the checksum to check for corruption and I checked the checksums on these two backup files and they are fine.

If the file got corrupted as it was being written out, md5sum won't tell you there was a problem.

Unfortunately, subtle data corruption issues are notoriously difficult to isolate. Likely culprits along the data path include: system RAM, CPU (possibly overclocked/overheating), external cache if any, the drive controller, IDE cables, or the target drive. I wouldn't initially suspect the source drive if you are getting decompression errors, since the compression happens after the data has been read into memory. You may wish to start troubleshooting by running memtest86 for a few hours.
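To illustrate the md5sum point: one partial workaround is to hash the data while it is being written and then hash the file again from disk. This is only a sketch. The helper name write_and_verify is made up, the producer here is stand-in data, and the assumption that mfsbackup can write to stdout with "-o -" should be checked against your copy of MFS Tools.

```shell
#!/bin/sh
# write_and_verify is a hypothetical helper: it reads data on stdin,
# writes it to the given file, and compares the checksum of the stream
# with the checksum of the file as read back afterwards.
write_and_verify() {
    out="$1"
    # tee writes the file while md5sum hashes the same bytes in flight.
    stream_sum=$(tee "$out" | md5sum | cut -d' ' -f1)
    # Hash what is read back from the file (may be served from cache).
    file_sum=$(md5sum < "$out" | cut -d' ' -f1)
    if [ "$stream_sum" = "$file_sum" ]; then
        echo "OK $file_sum"
    else
        echo "MISMATCH stream=$stream_sum file=$file_sum"
        return 1
    fi
}

# Stand-in data. With MFS Tools the producer would be something like
# "mfsbackup -6so - /dev/hdb", assuming "-o -" sends the image to stdout.
tmp=$(mktemp)
result=$(printf 'example backup data\n' | write_and_verify "$tmp")
echo "$result"
rm -f "$tmp"
```

This only covers the path from the pipe to the disk. If the data is damaged in RAM before or during compression, both checksums will agree on the bad data, so a memory test is still the first thing to run.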