Need help troubleshooting random reboots


kadiir
01-15-2008, 01:15 AM
Hello:

I have the 2 S2 DTivos listed in my sig. I had recently put in new hard drives and installed 4.0.1b fresh and then upgraded to 6.2a via slices. I am running 6.2 by using Jamie's custom kernel and using a somewhat modified version of AlphaWolf's framework (I removed some of the things that I don't need).

Everything was working fine for a few weeks and then one of them started rebooting somewhat randomly. A couple of weeks later, the other one started, too.

Every time it happens, a menu is "refreshing." All but one occurrence came after watching a show and deleting it via the end-of-show prompt: while the NowShowing list was refreshing, it paused for a moment and then rebooted.

The one exception was when I was looking at the To Do list and went into the log at the top. When I hit back (left) to return to the To Do list, it paused while refreshing and then rebooted.

In the first scenario, it doesn't happen every time - it's maybe once a week with watching & deleting anywhere from 1 to 10 shows in that timeframe. I have not noticed it happening if I skip the delete at the first prompt, let the Tivo go back to NowShowing, and then delete the show there, but I'm not sure that's conclusive given how infrequent the reboots are.

Looking through the files in var/log I do see this in the kernel log:

Kernel stack overflow might have happened in the sys_call sequence

How do I go about troubleshooting (or even fixing) this? I searched the site on 'stack overflow' and sys_call but didn't find anything.
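As a first troubleshooting step, it can help to see how often the marker shows up in the log and where. The following is a minimal sketch, not a fix: the function names are made up for illustration, and the default path /var/log/kernel is taken from the message quoted later in the thread.

```shell
#!/bin/sh
# List every kernel stack overflow marker in a log, with its line number.
# The default path is an assumption; pass another path (for example a copy
# pulled off the box) as the first argument.
find_overflows() {
    logfile="${1:-/var/log/kernel}"
    grep -n "Kernel stack overflow" "$logfile"
}

# Count how many times the marker appears in the log.
count_overflows() {
    logfile="${1:-/var/log/kernel}"
    grep -c "Kernel stack overflow" "$logfile"
}

# Usage:
#   find_overflows /var/log/kernel
#   count_overflows /var/log/kernel
```

The line numbers make it easier to jump to the stack dump that follows each marker.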

TIA!

kadiir
01-19-2008, 11:10 PM
It's not just at deletes, apparently - I hit back (left) while watching a show (not paused) that I had played directly from Now Showing and it rebooted :(

kadiir
02-04-2008, 12:32 PM
After quite a bit of searching through the forums, I stumbled across a post that stated that 6.2 has "issues" with doing a long fakecall. AW's framework had a fakecall.tcl 21 in it (although I'm using a version of fakecall40.tcl that I corrected for 6.2), so I removed the '21' from it and let it use the default.

However, I also decided to take out the reboot cron job (safereboot) that is in AW's framework as well. I had it set to reboot every week at 1:15 am (I also called fakecall40.tcl 5 mins. prior). While the reboots either never worked or only worked the first week (I'm not sure), I just took it out (and the fakecall40.tcl call) anyway.
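For reference, the schedule described above (a weekly reboot at 1:15 am, with the fakecall script 5 minutes earlier) could be expressed as two crontab lines. This is only a sketch: the script paths and the day of the week are hypothetical, since the post does not give them, and the helper function name is made up.

```shell
#!/bin/sh
# Print two crontab lines: a pre-reboot script 5 minutes before the reboot,
# then the reboot itself, once a week.
# Arguments: hour, minute (must be >= 5), day of week (0-6, 0 = Sunday),
#            path to the pre-reboot script, path to the reboot script.
weekly_reboot_cron() {
    hour="$1"; minute="$2"; dow="$3"; pre="$4"; reboot_cmd="$5"
    pre_minute=$((minute - 5))
    # crontab fields: minute hour day-of-month month day-of-week command
    echo "$pre_minute $hour * * $dow $pre"
    echo "$minute $hour * * $dow $reboot_cmd"
}

# Hypothetical paths; Sunday is also an assumption.
weekly_reboot_cron 1 15 0 /hypothetical/path/fakecall40.tcl /hypothetical/path/safereboot
```

Removing both lines, as done here, takes the scheduled fakecall and the scheduled reboot out of the picture at the same time, so it does not show which of the two mattered.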

So far, I'm about 10 days with no reboots (previously, it would reboot within a week).

Edit 2008-02-11: still no reboots. In the near future, I'll put back in the safereboot cron job and see what happens.

kadiir
03-20-2008, 12:28 AM
Well, after just over a month of no reboots, one of my Tivos has started rebooting again with the same symptoms (while 'printing' out a listing - Now Playing in the two cases so far). Still the same dump in the kernel log.

Does anyone have any ideas?

Jamie
03-20-2008, 01:25 AM
Attach all your logs (messages, tvlog, tverr, etc) to a post.

What tivoapp patches do you have? Please list VMA address and replacement code. A bad patch is one of the most common reasons for seemingly "random" reboots.

kadiir
03-21-2008, 03:06 AM
Thanks Jamie.

After upgrading to 6.2a via slices, I ran superpatch-67all-NutKase-1.2.tcl patched to "1.13" as found in one of the threads (I'd have to dig it up) after booting to the 6.2a partition. I'm attaching that as superpatch-1.13.tcl.

FYI, the patch I used is *not* the one that comes in AW's busybox but one posted here as well (I can also dig that post up if need be).

Looking at my notes, I also ran Superpatch67Standby.tcl with the default option.

edit 1: shoot, I just realized that the reboot I had this evening doesn't have any log file entries pre-reboot. When the next reboot happens I'll grab those files & attach.

edit 2: I just realized I shouldn't have posted the patched superpatch tcl so I removed that and attached just the patch diff file.

kadiir
03-21-2008, 03:07 AM
Sheesh, I've been trying to crash it for 4 days now - no crashes. I've watched & deleted (at the end of watching, just like when it reboots unexpectedly) about 10 shows and nothing goes wrong. I'll keep trying, though.

kadiir
04-09-2008, 11:45 PM
I finally had a crash again last night (never thought I'd hope for a crash :) ). I'm attaching the log files in a zip file to make it easier & fit into one post.

The kernel log file starts with the last reboot (manually rebooted after running the fakecall40.tcl).

The timestamp on the crash is at Apr 9 04:32:54.

Jamie
04-10-2008, 01:10 AM
Well, as you observed, this:

Apr 9 04:32:54 DTMB1 kernel: ***Kernel stack overflow might have happened in the sys_call sequence***
Apr 9 04:32:54 DTMB1 kernel: ***Contact kernel-dev@tivo.com***
Apr 9 04:32:54 DTMB1 kernel: ***Please see /var/log/kernel for the stack dump***

in the kernel log makes it look like a kernel issue. It's possible there is some incompatibility between the 8.x custom kernels and the 6.x software releases.
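To see what the kernel logged just before each of these markers, something like the following sketch may help. The function name is made up, and it assumes a grep with -B support (GNU grep has it; a BusyBox build may or may not), so it may be easier to run on a copy of the log on another machine.

```shell
#!/bin/sh
# Print each stack overflow marker together with the lines just before it.
# Arguments: log file, number of preceding lines to show (default 20).
# Assumes grep supports -B (context before a match).
crash_context() {
    logfile="$1"; before="${2:-20}"
    grep -B "$before" "Kernel stack overflow" "$logfile"
}

# Usage:
#   crash_context /var/log/kernel 40
```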

Perhaps go back to a stock kernel and see if it goes away? Or try a different custom kernel, maybe one built from 6.x sources?

kadiir
04-11-2008, 11:05 AM
I had a suspicion I might have to change kernels. I'll go to a stock kernel and start learning how to do my own custom kernel.

Thanks!

Jamie
04-11-2008, 11:55 AM
The newer kernel configs (starting in 8.x?) have a CONFIG_KSTACK_GUARD option that adds code to detect kernel stack overflows. It's that code that is causing this kernel crash. I'm not sure what is causing the kernel stack overflow.
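If the .config used to build a given kernel is available, checking whether an option such as CONFIG_KSTACK_GUARD is set is a simple text search. A minimal sketch, assuming the usual Linux .config layout ("OPTION=y" when enabled, "# OPTION is not set" when disabled); the function name is made up:

```shell
#!/bin/sh
# Report whether a kernel .config enables an option.
# Prints "enabled", "disabled" or "absent".
# Arguments: path to the .config file, option name.
config_state() {
    config_file="$1"; option="$2"
    if grep -q "^${option}=[ym]" "$config_file"; then
        echo enabled
    elif grep -q "^# ${option} is not set" "$config_file"; then
        echo disabled
    else
        echo absent
    fi
}

# Usage (run from the kernel source directory):
#   config_state .config CONFIG_KSTACK_GUARD
```

Note that turning the guard off would only hide the detection; it would not remove whatever is overflowing the stack.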

kadiir
10-19-2008, 10:26 PM
Well, I finally had a chance to work on this (recovering from a back injury).

First, I tried your (Jamie's) custom 7.2 kernel found on DDB and it went into a reboot loop (serial output attached). I was able to revert back to what I had before (thanks to AW's chainload having an early bash prompt :) ) and so I thought well, let me just go back to a stock and make sure.

So, I went to a 3.1.5 kernel and my Tivo is now completely hosed. Somehow the 'chain of trust' was broken and it started wiping files like crazy (serial attached - it ends when I pulled the plug). After pulling the plug, when I powered it back on it hangs after:

VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 60k freed

The only thing I can think of is that the 3.1.5 I used wasn't really a 3.1.5 kernel. The filesize is 2024056 bytes.

So, I'm about to pull the drive and see if I can save it (especially since there's about 400 hours of recordings on it :( ).

jt1134
10-20-2008, 10:38 PM
If you were initially using jamie's 8.x custom kernel and then switched to a 7.2 kernel, then your initial crashes were caused by an incompatibility between your usb drivers and the kernel. (don't mix/match 7.x/8.x kernels/drivers)

Looks like you didn't run killhdinitrd on the 3.1.5 kernel, and since you're not running 3.1.5, everything in your root fs was hosed. You could probably get away with dd'ing a good root partition over from another drive, or extracting a fresh filesystem from mfs and then rehacking it.

kadiir
10-21-2008, 03:28 PM
Back up & running (on 8.x) after using my other one to restore files.

> If you were initially using jamie's 8.x custom kernel and then switched to a 7.2 kernel, then your initial crashes were caused by an incompatibility between your usb drivers and the kernel. (don't mix/match 7.x/8.x kernels/drivers)

I'm using the USB backport drivers (usbobj2.4.27-20071023, 2.4.20 directory) which I thought worked with any kernel. Do I have that wrong?

That could explain why it crashed right after the "usb.c: USB device 2 (vend/prod 0x846/0x1040) is not claimed by any active driver." line.

> Looks like you didn't run killhdinitrd on the 3.1.5 kernel, and since you're not running 3.1.5, everything in your root fs was hosed. You could probably get away with dd'ing a good root partition over from another drive, or extracting a fresh filesystem from mfs and then rehacking it.

I thought that monte'ing to a kernel didn't require the target to be killhdinitrd'd. I thought of trying that last night when I got it restored but didn't have time to play around like that.

jt1134
10-21-2008, 04:10 PM
a kernel >= 8.x requires kernel modules compiled from newer sources. your initial boot kernel always needs to have killhdinitrd applied, though a monte target kernel (usually a custom kernel or otherwise neutered) doesn't have any initrd to kill.

kadiir
10-21-2008, 08:21 PM
> a kernel >= 8.x requires kernel modules compiled from newer sources.

I could've sworn 7.2 was 2.4.20. I'll have to double-check - maybe it's 2.4.18 then that would completely make sense.

> your initial boot kernel always needs to have killhdinitrd applied, though a monte target kernel (usually a custom kernel or otherwise neutered) doesn't have any initrd to kill.

My initial kernel is killhdinitrd'd (3.1.1c - it's what's in partition 6 (7 is my active partition for bootpage)). It's actually the same kernel I was running when I was on 4.0.1b and then upgraded (I dd'd it from my then-active part. 3 over to 6 so I could chainload as I wasn't doing that with 4.0.1b).

To change kernels, I simply copied over the vmlinux.px in /chainload (to which the check.sh points) from one version's px file followed by a reboot. I ftp'd the various versions onto the box and then did 'cp 72.px vmlinux.px' for example. When the 3.1.5 monte target blew up I did the same thing ('cp 315.px vmlinux.px') and I copied back the 8.1 px to go back ('cp 81.px vmlinux.px').
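The copy-and-reboot procedure above can be wrapped so that the previous image is kept and the copy is verified before rebooting. This is a sketch only: the function name and the .bak suffix are made up, and it does nothing to make an unsafe target kernel safe (it will not stop an initrd from wiping files).

```shell
#!/bin/sh
# Replace the chainloaded kernel image, keeping the previous one as a backup
# so it can be restored from an early shell prompt if the new one fails.
# Arguments: directory holding vmlinux.px, path of the new .px image.
swap_kernel() {
    dir="$1"; new_image="$2"
    [ -f "$new_image" ] || { echo "missing: $new_image" >&2; return 1; }
    if [ -f "$dir/vmlinux.px" ]; then
        cp "$dir/vmlinux.px" "$dir/vmlinux.px.bak" || return 1
    fi
    cp "$new_image" "$dir/vmlinux.px" || return 1
    # Confirm the copy is byte-identical to the source before rebooting.
    cmp -s "$new_image" "$dir/vmlinux.px"
}

# Usage:
#   swap_kernel /chainload /chainload/72.px && echo "ok to reboot"
```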

I was trying to use "off the shelf" custom kernels (and stock in the case of 3.1.5) as I haven't been able to get my cygwin cross-compiler working to build my own (but that's a topic for another thread).

Oh, and I also figured out my source for the 3.1.5 kernel: the PTVUpgrade v4.04 disc. So, killhdinitrd is not applied to that one, but it isn't applied to Jamie's 7.2 custom kernel, either.

BTW, I hope I don't come across as all agro - your comments/help are much appreciated!

jt1134
10-21-2008, 08:36 PM
> I could've sworn 7.2 was 2.4.20. I'll have to double-check - maybe it's 2.4.18 then that would completely make sense.
Everything since 5.x has been 2.4.20. Since 8.x however, tivo changed some things and now newer kernels aren't compatible with older usb modules. jamie made a short mention of this in the backport readme, and the "8.1 and my NIC" thread covers this issue in some detail.

> I was trying to use "off the shelf" custom kernels (and stock in the case of 3.1.5) as I haven't been able to get my cygwin cross-compiler working to build my own (but that's a topic for another thread).
ah... you don't want to monte to a virgin kernel. If you do, then there's not really much point in doing so, since the initrd in the target kernel will run and wipe out anything it doesn't like. At the least you'll want to use replace_initrd on the target kernel. I used to run a custom 6.2 kernel on my boxes, if you want I *may* be able to dig it up.

> Oh, and I also figured out my source for the 3.1.5 kernel: the PTVUpgrade v4.04 disc. So, killhdinitrd is not applied to that one, but it isn't applied to Jamie's 7.2 custom kernel, either.
Hmmm... the kernels on that disc should have had killhdinitrd applied already (AFAIK anyways). Custom kernels don't need to be changed, since typically they're compiled with CONFIG_BLK_DEV_INITRD disabled, and thus aren't in "search and destroy" mode when they first boot.

> BTW, I hope I don't come across as all agro - your comments/help are much appreciated!
HA! You're troubleshooting man! Helping out is what a PDA and a slow day at work is meant for. ;)

kadiir
10-21-2008, 08:57 PM
> Everything since 5.x has been 2.4.20. Since 8.x however, tivo changed some things and now newer kernels aren't compatible with older usb modules. jamie made a short mention of this in the backport readme, and the "8.1 and my NIC" thread covers this issue in some detail.

D'oh - I didn't re-read it (haven't read it since last year) far enough and I made the mistake of relying on memory :(

> ah... you don't want to monte to a virgin kernel. If you do, then there's not really much point in doing so, since the initrd in the target kernel will run and wipe out anything it doesn't like. At the least you'll want to use replace_initrd on the target kernel.

Wow, I wonder how I missed that in all the kernel related threads I read. Proven empirically, of course, but always nice to know the theory is well documented :)

> I used to run a custom 6.2 kernel on my boxes, if you want I *may* be able to dig it up.

If it's got the netfilter tweak for better network performance & it's not too much trouble, I'll try it out. Earlier in the thread Jamie recommended doing exactly that. If not, I'll try the 7.2 kernel again (with the right USB drivers ;) ) and if that still doesn't work then I'll have to dig in and build my own if I can - I'm a network/Windows guy that works with network gear running linux/unix so I have some *nix knowledge, but I can't for the life of me figure out how to fix my cygwin cross-compiler (using Jamie's script, it won't extract the cmd file correctly). I may just throw together a linux 'system' using VMWare.

> Hmmm... the kernels on that disc should have had killhdinitrd applied already (AFAIK anyways).

I'm relying on memory again here. If so, then what the heck was it doing?

> Custom kernels don't need to be changed, since typically they're compiled with CONFIG_BLK_DEV_INITRD disabled, and thus aren't in "search and destroy" mode when they first boot.

Thanks for the exact answer - I knew that in concept but didn't know what the specific parameter was.

> HA! You're troubleshooting man! Helping out is what a PDA and a slow day at work is meant for. ;)

lol - of course, I'm troubleshooting my Tivo during working hours (PCI remediation among other things) since most of our recordings are in prime time :). My excursions turned my day yesterday into a very long one :) I've almost flipped my day around - fiddling w/ my Tivo during the day in between work emails and doing my real work at night.

kadiir
10-21-2008, 10:19 PM
Hmmm, it looks like the PTVUpgrade kernel is already killhdinitrd'd. I based my statement of source on file size. Maybe it wasn't from the CD after all.
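File size alone is a weak way to tell kernel images apart, since two different builds can be the same size. Comparing a checksum as well is more reliable. A minimal sketch using POSIX cksum (md5sum would do equally well where it is available); the function names are made up:

```shell
#!/bin/sh
# Print "<checksum> <size-in-bytes>" for a file, using POSIX cksum.
image_fingerprint() {
    cksum < "$1" | awk '{print $1, $2}'
}

# Succeed only if two images have the same checksum and the same size.
same_image() {
    [ "$(image_fingerprint "$1")" = "$(image_fingerprint "$2")" ]
}

# Usage (the second path is a placeholder for a copy from a known source):
#   same_image /chainload/315.px /path/to/known/copy && echo "same image"
```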

jt1134
10-21-2008, 10:54 PM
> If it's got the netfilter tweak for better network performance & it's not too much trouble, I'll try it out. Earlier in the thread Jamie recommended doing exactly that. If not, I'll try the 7.2 kernel again (with the right USB drivers ;) ) and if that still doesn't work then I'll have to dig in and build my own if I can - I'm a network/Windows guy that works with network gear running linux/unix so I have some *nix knowledge, but I can't for the life of me figure out how to fix my cygwin cross-compiler (using Jamie's script, it won't extract the cmd file correctly). I may just throw together a linux 'system' using VMWare.
I found my notes from back then but can't compile it since 6.2 requires a gcc-3.0 compiler, and I no longer have one. There's a binary version here (http://tivoutils.sourceforge.net/) built for linux, but I no longer have a fully-functioning linux box to use it on. If you get linux running, you can easily use that precompiled toolchain to build a kernel. If you want it, pm me and I'll send you some patches for the 6.2 source.

kadiir
10-27-2008, 11:18 PM
jt1134 & I have been PM'ing - I'm posting back here so I can upload the output from the commands he sent me:

Here's the output.

In order (1.txt through 3.txt) after the commands (and "export ROOT=/usr/local/mips-tivo/cmd/root" and "export TIVO_SYSTEM=release-mips" were done first):

yes "" | make config
make dep
make -j2 TV_FEATURE_STRONG_CRYPTO=0 vmlinux.px

Also, while running the 3rd one & re-directing it to 3.txt, I noticed some additional errors appeared on screen - these are 4.txt.
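A likely reason some errors showed up on screen instead of in 3.txt is that a plain "> file" redirect captures only standard output, while make and the compiler write errors to standard error. A small wrapper that captures both streams in one file might look like this (the function name is made up; the make commands in the usage comment are the ones listed above):

```shell
#!/bin/sh
# Run a command and send both stdout and stderr to one log file,
# so error messages do not appear only on screen.
# Arguments: log file, then the command and its arguments.
# The exit status is that of the command itself.
run_logged() {
    logfile="$1"; shift
    "$@" > "$logfile" 2>&1
}

# Usage:
#   run_logged 1.txt sh -c 'yes "" | make config'
#   run_logged 2.txt make dep
#   run_logged 3.txt make -j2 TV_FEATURE_STRONG_CRYPTO=0 vmlinux.px
```

With -j2 the output of parallel jobs can interleave, so building the failing step again without -j may give a log that is easier to read.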

Thanks!

jt1134
10-28-2008, 12:43 AM
do a 'make mrproper' and then try again with this patch and .config file. you can reverse the previous patch with patch -R < "patchfile"

kadiir
10-28-2008, 03:01 AM
I'm not sure what "do a 'make mrproper'" means - is it just literally run that command (i.e., type those 2 words & hit enter)?

jt1134
10-28-2008, 09:31 AM
yep.............

kadiir
10-29-2008, 03:55 AM
I had a nice long, detailed reply that was lost - hopefully I can remember it......

make mrproper failed; however, I forgot to run it before "unpatching" and I then promptly forgot to grab the errors :(

Then, I couldn't re-patch - it couldn't find any of the files (files were there). So, I "patched" by manually making the edits.
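When patch reports that it cannot find files that are clearly present, one common cause is running it from a different directory than the patch expects, or with the wrong -p strip level. Whether that was the cause here is not known from the thread. The sketch below tries a few strip levels without changing anything; it assumes GNU patch (for --dry-run and -f), and the function name is made up.

```shell
#!/bin/sh
# Try a patch at strip levels 0, 1 and 2 without modifying any file, and
# print the first level at which the whole patch would apply cleanly.
# Run it from the top of the source tree. Assumes GNU patch.
# Argument: path to the patch file.
find_strip_level() {
    patchfile="$1"
    for level in 0 1 2; do
        # --dry-run: change nothing; -f: never prompt for a file name.
        if patch --dry-run -f -p"$level" < "$patchfile" >/dev/null 2>&1; then
            echo "$level"
            return 0
        fi
    done
    return 1
}

# Usage:
#   level=$(find_strip_level "patchfile") && patch -p"$level" < "patchfile"
```

If no level works, the tree probably differs from what the patch was made against (for example, a partly reversed earlier patch), and starting from a freshly extracted source tree is usually simpler than editing by hand.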

I then ran the commands (exports & makes) but it failed. In the attached, 1-4 are the same as before (1-3 command redirected to a text file plus additional on-screen output (4)).

I thought what the heck, let me run make mrproper to get that error and it actually ran (output in make_proper.txt).

So, I "unpatched" which mostly worked and then patched again (which also mostly worked). And, re-ran the commands. Failed - attached as 5-8 (representing 1-4 as before).

There were some additional on-screen errors that are in 8.

I'm beginning to wonder if I have something fundamentally wrong with my setup. If you are out of ideas, I'll try it under Cygwin (at a minimum, it failed before as I didn't have the right cmd tarball) and if it fails there I'll re-install my CentOS VM and start over and post a summary of all steps performed to make sure I'm not doing something amazingly stupid.