Bug 25765

Summary: X server crash with linux 2.6.32 (KMS), xorg-intel 2.9.1, libdrm 2.4.15 on EeePC 900 (915GM/GMS/910GML)
Product: xorg Reporter: Daniel Kahn Gillmor <dkg>
Component: Driver/intel Assignee: Carl Worth <cworth>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: axet, n-roeser
Version: 7.4 (2008.09)   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
system state after crash and automatic (but weirdly broken) gdm restart none
system state after second such crash none
system state after another crash none

Description Daniel Kahn Gillmor 2009-12-22 15:37:31 UTC
Created attachment 32256 [details]
system state after crash and automatic (but weirdly broken) gdm restart

I just had an X server crash with linux 2.6.32 (KMS enabled), xorg-intel 2.9.1, libdrm 2.4.15 on an EeePC 900 (lspci reports 915GM/GMS/910GML).

after the X server crashed, i got dumped back to vt1, and then gdm restarted a new X server.  But gdm was really oddly-behaved: the screen would update when i switched to it (with ctrl-alt-F8) but then when i interacted with it, nothing would visually update (only the mouse cursor would move), and the input box for gdm had a black bar over it (it's usually white).  switching away to vt1 and then back would repaint the screen with the outcome of whatever activity i'd done, but it would still not be an active screen.

a second restart of gdm didn't resolve the problem either.

i grabbed a gpu dump after the automated gdm restart (but before i'd restarted it by hand).  i'm attaching a dump of the system state including that gpu dump, gathered by the script i posted here as attachment 31967 [details].

Some interesting output in the gdm log from the crashed session (full log included in the attached tarball):

Errors from xkbcomp are not fatal to the X server
../../../libdrm/intel/intel_bufmgr_gem.c:899: Error setting domain 835: Input/output error
../../../libdrm/intel/intel_bufmgr_gem.c:825: Error setting to CPU domain 518: Input/output error

Fatal server error:
Failed to map batchbuffer: Input/output error


Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

X: ../../src/i830_batchbuffer.h:79: intel_batch_emit_dword: Assertion `pI830->batch_ptr != ((void *)0)' failed.


the relevant bit of dmesg appears to be:

[172107.332058] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[172107.332071] render error detected, EIR: 0x00000000
[172107.332077] i915: Waking up sleeping processes
[172107.332094] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 7486395 at 7486371)
[172107.332377] reboot required
[172107.335458] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[172107.632013] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[172107.632025] render error detected, EIR: 0x00000000
[172107.632030] i915: Waking up sleeping processes
[172107.833838] reboot required

full dmesg is also within the attached tarball if you want more.  Is that "reboot required" something that should be propagated out to the user more directly somehow?
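
One way the "reboot required" notice could be surfaced outside the kernel log is to scan dmesg output for the messages quoted above. The following is only a minimal sketch: the marker strings are copied from the excerpt in this report, while the labels and the function names (scan_dmesg, needs_reboot) are made up for illustration and are not part of any kernel or X.org interface.

```python
import re

# Marker strings copied from the dmesg excerpt in this report; the labels
# on the right are illustrative names, not part of any kernel interface.
HANG_MARKERS = {
    "Hangcheck timer elapsed": "gpu_hang",
    "Execbuf while wedged": "gpu_wedged",
    "reboot required": "reboot_required",
}

# Matches the "[seconds.micros]" timestamp that prefixes each kernel message.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def scan_dmesg(text):
    """Return a list of (timestamp, label) pairs for known hang markers."""
    events = []
    for line in text.splitlines():
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # skip lines that carry no timestamp (e.g. wrapped text)
        for marker, label in HANG_MARKERS.items():
            if marker in line:
                events.append((float(match.group(1)), label))
    return events


def needs_reboot(events):
    """True if any scanned event is the kernel's 'reboot required' notice."""
    return any(label == "reboot_required" for _, label in events)


SAMPLE = """\
[172107.332058] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[172107.332071] render error detected, EIR: 0x00000000
[172107.332377] reboot required
[172107.335458] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
"""

if __name__ == "__main__":
    found = scan_dmesg(SAMPLE)
    for stamp, label in found:
        print(f"{stamp:.6f} {label}")
    print("reboot needed:", needs_reboot(found))
```

A script like this could be fed the text of `dmesg` after a crash. Matching on message text is fragile, since the wording may differ between kernel versions.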

After a full system reboot, things seem to be back to normal.

please let me know if i can do anything to help debug this.
Comment 1 Daniel Kahn Gillmor 2010-01-06 12:22:17 UTC
Created attachment 32479 [details]
system state after second such crash

So i've had another X.org crash with this same setup :(  again, gdm tried to restart, but failed.  Attached is the system state after that gdm restart (i've also copied in :0.log.3, which i think is the gdm log from the crashed session).

fwiw, the first crash was done without having gfxpayload set from the bootloader.  The system run from this latest crash was started with gfxpayload=800x600 (see bug 25919 for why this might affect the state of the card).  I think the problem is still basically identical, however.
Comment 2 Daniel Kahn Gillmor 2010-01-06 12:36:21 UTC
just to clarify: after the crash, gdm tried to restart, and successfully got kind-of started.  but it was with a weird, poorly-updated screen.  moving the mouse over the screen revealed (via changes in the cursor from pointer to text-entry) that the login field was present, but the screen itself was black.

The screen only got redrawn with the intended content piecemeal during events like when a tooltip would be triggered: the tooltip would not show up at all, but when it disappeared, the "correct" login screen background would be present in the region where the tooltip was supposed to have been.
Comment 3 Daniel Kahn Gillmor 2010-01-19 10:50:02 UTC
Created attachment 32720 [details]
system state after another crash

here is another dump from the same system, after another X11 crash, where gdm restarted with the same wacky behavior.

I worry a bit that i'm reporting to the void here, as this is the third instance i'm reporting of this exact crash, but i haven't gotten any response.  i'm ready and willing to try to provide more information if that would be useful.  i'm also willing to try patches or other suggestions.

but i'm not a [graphics|X11|kernel] hacker, so i'm kind of at loose ends until i get some direction from someone who has a better idea of what to look for.

what can i do to make my problem more appealing to work on? ;)
Comment 4 Carl Worth 2010-02-17 09:26:59 UTC
(In reply to comment #3)
> I worry a bit that i'm reporting to the void here, as this is the third
> instance i'm reporting of this exact crash, but i haven't gotten any response. 
> i'm ready and willing to try to provide more information if that would be
> useful.  i'm also willing to try patches or other suggestions.
> 
> but i'm not a [graphics|X11|kernel] hacker, so i'm kind of at loose ends until
> i get some direction from someone who has a better idea of what to look for.
> 
> what can i do to make my problem more appealing to work on? ;)

Hi Dan,

Sorry about the lack of response. We get a lot of reports so we're not always
really quick to be able to reply.

It's not obvious what the cause of the original crash was. One thing you might
be able to do to make the problem more appealing to work on would be to find
a set of versions under which things work, and perform a git-bisect to identify
what commits caused the problem. This can be hard though, (both finding a known-
good version and knowing which of the modules to bisect).
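
For anyone unfamiliar with bisection, the search itself can be sketched roughly as below. This is a simplified illustration over a linear list of versions, not how git implements it: the names first_bad and is_bad and the commit labels are hypothetical. It assumes the first entry is known good, the last is known bad, and everything after the first bad entry is also bad.

```python
def first_bad(versions, is_bad):
    """Binary search for the first bad entry.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    every version after the first bad one is also bad.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return versions[bad]


if __name__ == "__main__":
    # Hypothetical commit labels; in practice is_bad would mean building
    # and running that commit and checking whether the X server crashes.
    commits = [f"c{i}" for i in range(16)]
    culprit = first_bad(commits, lambda c: int(c[1:]) >= 11)
    print(culprit)
```

Each step halves the remaining range, so 16 candidates need about 4 tests. With a crash that only shows up after days or weeks, each "good" verdict is uncertain, which is part of what makes bisecting a bug like this one hard.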

Meanwhile, there have also been some recent fixes, (including one in libdrm
2.4.18), that may have addressed your bug. So you might have some luck updating
to newer versions.

Finally, the strange behavior after restarting sounds like it may be a result
of the automatic reset code in the driver. Jesse has recently done work to make
the driver detect an error and reset itself, which often means you can still
have a usable system after the error. (Prior to this reset logic your system
would have been completely unusable after the error and until the next
reboot).

-Carl
Comment 5 Daniel Kahn Gillmor 2010-02-17 10:04:45 UTC
Thanks for the feedback, Carl.  it's been several weeks since this crash has re-occurred, and i'm now on libdrm 2.4.17.  if things get bad again, i'll try the move to 2.4.18.
Comment 6 Chris Wilson 2010-05-11 10:38:36 UTC
If the crash became unreproducible with an update to libdrm, it was probably the poor EINTR handling that was fixed. Closing as fixed, until proven otherwise.
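
For context, EINTR is the error a system call returns when a signal interrupts it before it completes. The usual remedy is to retry the call rather than treat it as a failure. The sketch below illustrates that general pattern only. It is not the actual libdrm change, which is not shown in this report, and the names retry_on_eintr and FlakyCall are hypothetical.

```python
import errno


def retry_on_eintr(call, *args):
    """Call call(*args), repeating while it fails with EINTR."""
    while True:
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # a real failure such as EIO must still propagate


class FlakyCall:
    """Hypothetical stand-in for a system call interrupted by signals."""

    def __init__(self, interruptions):
        self.remaining = interruptions
        self.attempts = 0

    def __call__(self, value):
        self.attempts += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise OSError(errno.EINTR, "Interrupted system call")
        return value


if __name__ == "__main__":
    call = FlakyCall(interruptions=2)
    print(retry_on_eintr(call, "mapped"), call.attempts)
```

Python 3.5 and later already retry most interrupted system calls automatically (PEP 475), so the explicit loop here is purely illustrative. In C the equivalent is a do/while loop around the call that checks errno.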
