Created attachment 34122 [details]
intel_gpu_dump.txt.gz

After a number of S4 cycles (the number varies and is pretty high, about 600), the machine resumes with only a black screen (the cursor is visible). I have kernel messages, but unfortunately no X backtrace (no debuginfo packages installed).

The machine is equipped with an IGDNG_M_G. The kernel is 2.6.32.9, the intel driver is 2.10.0, libdrm is 2.4.18. I am currently testing with kernel 2.6.33 to see whether the bug still occurs.

The bug is persistent over reboots. A power cycle is required to get the chip into a working state again.

Basically, directly after resume, the DRM module spits out:

Mar 13 02:00:29 linux-nc5s kernel: [ 7.744398] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 13 02:00:29 linux-nc5s kernel: [ 7.744407] render error detected, EIR: 0x00000000
Mar 13 02:00:29 linux-nc5s kernel: [ 7.744411] i915: Waking up sleeping processes
[repeated 2x]
Mar 13 02:00:29 linux-nc5s kernel: [ 8.059984] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[repeated 10x]

then both types intermixed (several hundred times), about two hangcheck timer elapses per second. EIR is always 0.

After several hundred of these messages, I get kernel errors:

Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736874] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736880] render error detected, EIR: 0x00000000
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736883] i915: Waking up sleeping processes
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.736895] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 181915 at 172405)
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.737988] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739393] vmap allocation failed - use vmalloc=<size> to increase size.
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739399] vmalloc size=6000 start=f77fe000 end=feffe000 node=-1 gfp=80d2
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739405] Pid: 2046, comm: X Tainted: P NX 2.6.32.5-0.1.1.1026.0.PTF-pae #1
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739408] Call Trace:
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739427]  [<c0206921>] try_stack_unwind+0x1b1/0x1f0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739435]  [<c020589f>] dump_trace+0x3f/0xe0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739442]  [<c020652b>] show_trace_log_lvl+0x4b/0x60
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739449]  [<c0206558>] show_trace+0x18/0x20
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739457]  [<c056a3c9>] dump_stack+0x6d/0x74
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739466]  [<c02e3df9>] alloc_vmap_area+0x2d9/0x2f0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739474]  [<c02e3f11>] __get_vm_area_node+0x101/0x1c0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739481]  [<c02e493e>] __vmalloc_node+0x9e/0xe0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739487]  [<c02e4b76>] __vmalloc+0x36/0x50
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739514]  [<f82f4817>] i915_gem_execbuffer+0x247/0xe40 [i915]
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739549]  [<f81d568c>] drm_ioctl+0x15c/0x340 [drm]
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739561]  [<c030e1e8>] vfs_ioctl+0x78/0x90
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739567]  [<c030e663>] do_vfs_ioctl+0x373/0x3f0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739573]  [<c030e78a>] sys_ioctl+0xaa/0xb0
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739579]  [<c02030a4>] sysenter_do_call+0x12/0x22
Mar 13 03:01:08 linux-nc5s kernel: [ 3640.739598]  [<ffffe424>] 0xffffe424

Again, about 2 per second, intermixed with (fewer) hangcheck timer messages.

The attached intel_gpu_dump buffer dump shows that there is no batch buffer (?), but only a ring buffer with a few commands included. As the commands refer to a batch buffer, this looks odd, but this may be related to the kernel errors (alloc failed).
Jesse, care to take a look at this one? -Carl
Hm, no error reported; that sounds like our hangcheck timer might be buggy. Maybe hibernate needs some special hangcheck handling. Can you instrument the i915 irq handler to see whether we're getting a spurious user interrupt at resume? If so, we might need an if (!dev_priv->mm.suspended) check in there somewhere.
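To illustrate the kind of instrumentation Jesse is asking for, here is a rough sketch of what a check in the irq handler's user-interrupt path could look like. This is not the actual upstream code; the surrounding structure, the GT_USER_INTERRUPT bit and the DRM_WAKEUP call only approximate the 2.6.32-era i915_irq.c, and the guard on dev_priv->mm.suspended is exactly the hypothetical check mentioned above:

/* Sketch only: approximate hunk for the GT interrupt path in i915_irq.c.
 * Log when a user interrupt arrives while GEM still considers the device
 * suspended (i.e. before resume has re-enabled GEM), instead of silently
 * waking up waiters. */
if (gt_iir & GT_USER_INTERRUPT) {
	if (dev_priv->mm.suspended)
		printk(KERN_WARNING
		       "i915: spurious user interrupt while suspended\n");
	else
		DRM_WAKEUP(&dev_priv->irq_queue);
}

If the warning shows up right after resume, that would support the spurious-interrupt theory.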
As the issue is persistent over reboots, this cannot just be the hangcheck timer. Some state must not be getting initialized correctly after a reset. I'm trying to reproduce it on the machine where I last encountered it; this will take some time.
I was just able to reproduce it after 1975 hibernate cycles. The effect is persistent over reboots *and* over hibernate; initially I thought it would not survive hibernate. This indicates that some state is not initialized correctly, but is saved and restored across hibernate. Weird.

Jesse, as this only occurs after oh so many hibernate cycles (there are supposedly machines that exhibit this issue earlier, but none that I have access to), can you elaborate a bit on what information would be helpful? Is there anything that could be extracted from the broken state before I power cycle the machine?
Yuck. It could also be that we're initializing some chip state out of order on resume and getting lucky, avoiding a hardware race most of the time. We could also be hitting one of the bugs fixed since libdrm 2.4.18; have you tested with 2.4.20?
I haven't tested an updated libdrm yet (will do), but even if that fixes the issue for good, drm is still doing something wrong (userspace should never be able to mess things up like that). In the meantime, any ideas for reasonable post-mortem analysis? A GPU dump from an earlier breakage (same machine) is already attached to this bug.

I have to correct myself: it seems this issue isn't seen on many other machines. Is it plausible that this is a single-unit failure, i.e. a hardware issue? OTOH the machine regularly works fine, and the persistence speaks against this theory.
It's possible it's a hw problem on this specific machine, but I'd be more inclined to believe it's a sw problem that just doesn't trigger very often. As for debugging, there may be more state that changes across suspend/resume than is tracked by our reg dumper. You could capture the whole register map from sysfs before and after (you may need to write a simple mmap + dump program for this). I know there are MCHBAR regs we don't bother with that we probably should, but it would be interesting to see exactly what's changed.
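A minimal sketch of such an mmap + dump tool is given below. It assumes the IGD is at PCI address 0000:00:02.0 and that its register BAR is exposed as resource0 in sysfs; both are assumptions to verify with lspci on the actual machine. Run it as root once in the working and once in the broken state and diff the two outputs:

/* regdump.c: dump the GPU MMIO BAR word by word via sysfs.
 * Note that reading some registers can have side effects, so only use
 * this for before/after comparison, not on a machine you care about. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:00:02.0/resource0";
	int fd = open(path, O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

	volatile uint32_t *regs = mmap(NULL, st.st_size, PROT_READ,
				       MAP_SHARED, fd, 0);
	if (regs == MAP_FAILED) { perror("mmap"); return 1; }

	/* Print every 32-bit register as "offset: value". */
	for (off_t off = 0; off < st.st_size; off += 4)
		printf("0x%08lx: 0x%08x\n", (long)off, (unsigned)regs[off / 4]);

	munmap((void *)regs, st.st_size);
	close(fd);
	return 0;
}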
I managed to copy the installed image on the machine with the broken state into a second partition (with separate swap), so I can now resume the machine in either the broken or the working state as I wish. I'm analyzing the machine right now.

intel_stepping says: Device 0x0046, Revision: 0x12 (??)

There are a few differences; I'll post them here. Bear in mind that the two states were captured across two different boots (the image was cloned, though, so same driver version etc.), so quite a few of the differences could be irrelevant.
Ok, the gpu dump in the attachment isn't telling much - a new gpu dump shows a different story:

broken state:

ACTHD: 0x00000000
EIR: 0x00000000
EMR: 0xffffff3f
ESR: 0x00000000
PGTBL_ER: 0x00000000
IPEHR: 0x00000000
IPEIR: 0x00000000
INSTDONE: 0xfffffffe
INSTDONE1: 0xffffffff
Ringbuffer:
Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000: HEAD 0x00000000: MI_NOOP
0x00000004:      0x00000000: MI_NOOP
[all MI_NOOP up to the end]

working state:

ACTHD: 0x00003e88
EIR: 0x00000000
EMR: 0xffffff3f
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x01000000
IPEIR: 0x00000000
INSTDONE: 0xfffffffe
INSTDONE1: 0xffffffff
Ringbuffer:
Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000: 0x02000000: MI_FLUSH
0x00000004: 0x00000000: MI_NOOP
0x00000008: 0x18800180: MI_BATCH_BUFFER_START
0x0000000c: 0x0243c000:    dword 1
0x00000010: 0x02000004: MI_FLUSH
0x00000014: 0x00000000: MI_NOOP
0x00000018: 0x10800001: MI_STORE_DATA_INDEX
0x0000001c: 0x00000080:    dword 1
0x00000020: 0x00000001:    dword 2
0x00000024: 0x01000000: MI_USER_INTERRUPT
[etc. pp. until 0x00003e84:, then MI_NOOP at HEAD]

I don't see a TAIL in either of those dumps, so either I don't understand, or the tool still has a bug here.
Comment on attachment 34122 [details]
intel_gpu_dump.txt.gz

Old dump is obsolete.
Created attachment 35309 [details]
intel_reg_dumper output in broken state

Only difference to working state:

--- broken.reg_dumper	2010-04-27 16:38:03.000000000 +0200
+++ works.reg_dumper	2010-04-27 16:35:10.000000000 +0200
@@ -117,1 +117,1 @@
- TRANSC_DP_LINK_N2: 0x00ffffff (val 0xffffff 16777215)
+ TRANSC_DP_LINK_N2: 0x00000000 (val 0x0 0)
Created attachment 35310 [details]
intel_reg_read -f on the broken machine
Created attachment 35311 [details]
Diff of intel_reg_read -f between broken and working state
Sorry for the neglect, Matthias; this looks like another ILK mode setting problem. Maybe Zhenyu has an idea.
Given that drm:i915_gem_execbuffer fails when the screen is black, I doubt this is a mode setting issue, but rather that the rendering engine is stalled.
(In reply to comment #15)
> Given that drm:i915_gem_execbuffer fails when the screen is black, I doubt this
> is a mode setting issue, but rather that the rendering engine is stalled.

Fails how? EIO or EBUSY? If EIO, please attach the i915_error_state.
(In reply to comment #16)
> (In reply to comment #15)
> > Given that drm:i915_gem_execbuffer fails when the screen is black, I doubt this
> > is a mode setting issue, but rather that the rendering engine is stalled.
> Fails how? EIO or EBUSY? If EIO, please attach the i915_error_state.

With "fails" I mean

Mar 13 02:00:29 linux-nc5s kernel: [ 8.059984] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

which I conclude is neither.
That's EIO, but also implies that your kernel is too old to have a meaningful i915_error_state. :(
I can *try* to update the kernel - the issue is persistent over reboots. OTOH it is difficult to reproduce. I'll post when I have new results.
Is this still there with Linus's recent fix?
Matthias, this is most likely the memory corruption bug, so I am marking it as fixed; please reopen if it reoccurs. Thanks for the report.
Will do. Sorry for not having had time to re-test this lately.