Created attachment 34108 [details]
Xorg.0.log

This is no joke.

On some machines the Xserver freezes inside the intel driver in an ioctl (all output ceases) until the mouse is moved. I assume moving the mouse triggers some interrupt that unfortunately was lost in the drm module, which in turn frees up drm resources that block the ioctl.

Triggering the bug is extremely difficult, and seems to depend on weird circumstances (I considered air pressure or state of the moon for a while). We had one machine shipped to us that exposed the bug; when it arrived (it even woke up from suspend, so it was in exactly the same state) we weren't able to reproduce it even on this machine. But colleagues we trust have seen the bug with their very eyes.

By remote debugging I was able to get quite some information from the machine in frozen state - each time the issue occurred it froze with equivalent stack frames.

The machine is running a 2.6.32.5 32bit linux kernel. Xserver version is 1.6.5, intel driver version is 2.10.0. libdrm is 2.4.17 with the most important fixes in 2.4.18 included.

Xserver/driver backtrace during freeze (local vars and code in driver only, context and/or exact source is available if needed):

#0  0xffffe424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7274909 in ioctl () from /lib/libc.so.6
No symbol table info available.
#2  0xb713993f in drm_intel_gem_bo_start_gtt_access (bo=0x96cb9d0, write_enable=0) at intel_bufmgr_gem.c:1145
        bufmgr_gem = (drm_intel_bufmgr_gem *) 0x8234aa0
        set_domain = {handle = 598, read_domains = 64, write_domain = 0}
        ret = <value optimized out>
1145        ret = ioctl(bufmgr_gem->fd,
#3  0xb71399f5 in drm_intel_gem_bo_wait_rendering (bo=0x96cb9d0) at intel_bufmgr_gem.c:1123
No locals.
1123        drm_intel_gem_bo_start_gtt_access(bo, 0);
#4  0xb71366f2 in drm_intel_bo_wait_rendering (bo=0x96cb9d0) at intel_bufmgr.c:133
No locals.
133         bo->bufmgr->bo_wait_rendering(bo);
#5  0xb708c445 in i830_uxa_block_handler (screen=0x8234d20) at i830_uxa.c:789
        intel = (intel_screen_private *) 0x82325c0
789         dri_bo_wait_rendering(intel->front_buffer->bo);
#6  0xb707a92c in I830BlockHandler (i=0, blockData=0x0, pTimeout=0xbff4a478, pReadmask=0x821c3a0) at i830_driver.c:1004
        screen = (ScreenPtr) 0x8234d20
        scrn = (ScrnInfoPtr) 0x8231dd8
        intel = (intel_screen_private *) 0x82325c0
#7  0x0817dc1b in AnimCurScreenBlockHandler (screenNum=0, blockData=0x0, pTimeout=0xbff4a478, pReadmask=0x821c3a0) at animcur.c:222
#8  0x081465a8 in compBlockHandler (i=0, blockData=0x0, pTimeout=0xbff4a478, pReadmask=0x821c3a0) at compinit.c:158
#9  0x08091168 in BlockHandler (pTimeout=0xbff4a478, pReadmask=0x821c3a0) at dixutils.c:384
#10 0x08132e4d in WaitForSomething (pClientsReady=0x9994640) at WaitFor.c:216
#11 0x0808d156 in Dispatch () at dispatch.c:386
#12 0x08071fbd in main (argc=9, argv=0xbff4a5c4, envp=Cannot access memory at address 0x400c6467) at main.c:397

Kernel thread backtrace:

[  804.351460] X             S addc8eee     0  2995   2994 0x00400000
[  804.351465]  f3625e04 00003082 00000004 addc8eee 000000b8 f875e044 00003202 f8789b2a
[  804.351474]  c084cb60 c084cb60 f4eb0d30 f4eb0fe4 c1207b60 00000000 c1207b60 f875e044
[  804.351481]  00000001 f4eb0fe4 f4eb0d30 f3625e30 f41d3000 00028d46 f875d290 f41d3e48
[  804.351489] Call Trace:
[  804.351504]  [<f875d290>] i915_do_wait_request+0x2d0/0x350 [i915]
[  804.351531]  [<f875e174>] i915_gem_object_set_to_gtt_domain+0x34/0xc0 [i915]
[  804.351558]  [<f875e4c4>] i915_gem_set_domain_ioctl+0xc4/0x150 [i915]
[  804.351585]  [<f823e68c>] drm_ioctl+0x15c/0x340 [drm]
[  804.351596]  [<c030e1e8>] vfs_ioctl+0x78/0x90
[  804.351602]  [<c030e663>] do_vfs_ioctl+0x373/0x3f0
[  804.351608]  [<c030e78a>] sys_ioctl+0xaa/0xb0
[  804.351613]  [<c02030a4>] sysenter_do_call+0x12/0x22
[  804.351622]  [<ffffe424>] 0xffffe424

I will attach Xorg.0.log (which contains nothing obvious) and intel_gpu_dump output.
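For readers decoding the set_domain locals in frame #2 of the backtrace above: read_domains = 64 is 0x40, which as far as I recall is the GTT domain bit in i915_drm.h, so the blocked call is a read-only "move this buffer to the GTT domain" request. The following is only a small sketch for turning such a mask into names. The bit values are copied from memory of the kernel header and should be treated as an assumption, and describe_gem_domains is a hypothetical helper, not part of libdrm.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Domain bits as recalled from i915_drm.h (assumption, verify against
 * the header of the kernel in use). */
static const struct {
    uint32_t bit;
    const char *name;
} sketch_domains[] = {
    { 0x01, "CPU" },         { 0x02, "RENDER" }, { 0x04, "SAMPLER" },
    { 0x08, "COMMAND" },     { 0x10, "INSTRUCTION" },
    { 0x20, "VERTEX" },      { 0x40, "GTT" },
};

/* Hypothetical helper: write a '|'-separated list of the domain names
 * set in 'domains' into 'out', or "none" if no known bit is set. */
void describe_gem_domains(uint32_t domains, char *out, size_t outlen)
{
    size_t used = 0;
    size_t i;

    if (outlen == 0)
        return;
    out[0] = '\0';

    for (i = 0; i < sizeof sketch_domains / sizeof sketch_domains[0]; i++) {
        int n;

        if (!(domains & sketch_domains[i].bit))
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     used ? "|" : "", sketch_domains[i].name);
        if (n < 0 || (size_t)n >= outlen - used)
            return; /* output truncated, stop here */
        used += (size_t)n;
    }

    if (used == 0)
        snprintf(out, outlen, "none");
}
```

Called with the values from the backtrace, describe_gem_domains(64, buf, sizeof buf) fills buf with "GTT", and the write_domain value 0 yields "none".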
Created attachment 34109 [details]
intel_gpu_dump.log.gz

Batchbuffer contains the following entries:

0x04123000:      0x54300804: XY_COLOR_BLT (rgb enabled, alpha enabled, dst tile 1)
0x04123004:      0x03f00580:    format 8888, pitch 1408, clipping disabled
0x04123008:      0x025b0147:    (327,603)
0x0412300c:      0x026b014c:    (332,619)
0x04123010:      0x04436000:    offset 0x04436000
0x04123014:      0x00ffffff:    color
0x04123018:      0x02000000: MI_FLUSH
0x0412301c:      0x05000000: MI_BATCH_BUFFER_END

Ringbuffer seems to have HEAD == TAIL, at least I cannot find TAIL. It's completely filled up with patterns similar (but not 100% identical) to

0x000025e8: HEAD 0x02000000: MI_FLUSH
0x000025ec:      0x00000000: MI_NOOP
0x000025f0:      0x10800001: MI_STORE_DATA_INDEX
0x000025f4:      0x00000080:    dword 1
0x000025f8:      0x00027a18:    dword 2
0x000025fc:      0x01000000: MI_USER_INTERRUPT
0x00002600:      0x18800180: MI_BATCH_BUFFER_START
0x00002604:      0x04094000:    dword 1
[... continues with MI_FLUSH]
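As a cross-check of the batchbuffer dump above, the annotated values can be recomputed from the raw dwords. This is a minimal sketch that assumes the layout implied by the dump's own annotations: x in the low 16 bits and y in the high 16 bits of a coordinate dword, and the pitch in the low 16 bits of the second dword. The function and struct names are hypothetical, and the fields are treated as unsigned for simplicity.

```c
#include <stdint.h>

/* Hypothetical container for one corner of the blit rectangle. */
struct blt_point {
    unsigned x;
    unsigned y;
};

/* Split a coordinate dword as annotated in the dump:
 * low 16 bits = x, high 16 bits = y (assumed layout). */
struct blt_point decode_blt_point(uint32_t dword)
{
    struct blt_point p;

    p.x = dword & 0xffffu;
    p.y = dword >> 16;
    return p;
}

/* Extract the pitch field from the second dword of the blit
 * (low 16 bits, matching the "pitch 1408" annotation). */
unsigned decode_blt_pitch(uint32_t dword)
{
    return dword & 0xffffu;
}
```

With the dump's values, decode_blt_point(0x025b0147) gives (327, 603), decode_blt_point(0x026b014c) gives (332, 619), and decode_blt_pitch(0x03f00580) gives 1408, so under this reading the last batch is a small 5x16 pixel fill.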
For internal records: this bug is associated with Novell bug https://bugzilla.novell.com/show_bug.cgi?id=567723
One additional thought: mouse moves create SIGIO, right? In that case the ioctl() would return with EINTR, so this pretty much explains why moving the mouse has an effect.
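The general mechanism can be shown outside the X server. The sketch below is not driver code: it blocks in read() on an empty pipe and lets SIGALRM (standing in for SIGIO) interrupt it. It assumes the handler is installed without SA_RESTART; whether the Xserver's SIGIO handler is set up that way is not verified here. interrupted_read_errno is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocking syscall. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Block in read() on an empty pipe and let SIGALRM interrupt it.
 * Returns the errno of the interrupted read (expected: EINTR),
 * 0 if read() unexpectedly succeeded, or -1 on setup failure. */
int interrupted_read_errno(void)
{
    int fds[2];
    struct sigaction sa, old;
    char byte;
    int saved = 0;

    if (pipe(fds) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    alarm(1); /* deliver SIGALRM in about one second */
    if (read(fds[0], &byte, 1) < 0)
        saved = errno;
    alarm(0);

    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

A caller that does not retry on EINTR simply carries on after such a return, which would match the observation that the block handler resumes as soon as the mouse generates a signal.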
Kernel configured without MSI?
I see a similar, maybe identical, at least I believe related issue when starting mutter on top of a plain Xserver. After some mouse movements the screen no longer gets repainted including mouse pointer.

How to reproduce:

  X &
  xlock -update 1 &
  mutter
  <move the mouse around until screen no longer gets repainted>

Stack trace for mutter process:

cat /proc/955/stack
[<c0433924>] i915_wait_request+0x13a/0x1ba
[<c0433a30>] i915_gem_object_wait_rendering+0x28/0x2a
[<c0433a5e>] i915_gem_object_set_to_gtt_domain+0x2c/0x6f
[<c0433f8c>] i915_gem_set_domain_ioctl+0x94/0x108
[<c04203c4>] drm_ioctl+0x206/0x286
[<c02c5bfe>] vfs_ioctl+0x50/0x69
[<c02c5feb>] do_vfs_ioctl+0x326/0x34f
[<c02c6054>] sys_ioctl+0x40/0x5a
[<c0202bc9>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff

This is on Ironlake (8086:0046) with
- xf86-video-intel 2.8.1
- Mesa 7.7
- libdrm 2.4.17 (with commit 4f0f871)
- xorg-server 1.6.3.901
- Kernel 2.6.31.12

After attaching with gdb to mutter process and doing a 'continue' in gdb repainting works fine again. If it hangs again, pressing Ctrl-C followed by 'continue' in gdb fixes the issue reliably. It seems the issue only occurs when you move the mouse cursor to the top of the screen, where some mutter menu pops up.

> Kernel configured without MSI?

Not sure what you mean. One of these?

# zcat /proc/config.gz |grep -i msi
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_MSI_LAPTOP=m
CONFIG_MSI_WMI=m
Looks like my issue is fixed in SLE11-SP1-RC1 kernel 2.6.32.9-0.5, whereas it's still broken in SLE11-SP1-Beta5 kernel 2.6.32.8-0.3. AFAICS we didn't add any additional patches between these kernel packages. Thus it appears to be fixed upstream between 2.6.32.8 and 2.6.32.9.
My issue is NOT fixed with 2.6.33.
Just for the record. My issue *is* fixed with the same 2.6.33 kernel package (2.6.33-5-pae).
Are all the machines in question Ironlake? The interrupt handling on those is different, and there may still be bugs.
Yes, this issue has only been seen on Ironlake so far.
OK. Zhenyu has done most of the work on the Ironlake interrupt handler, but I think he's busy with other modesetting stuff right now. I'm running Ironlake hardware daily for my GL development now, and haven't run into this, though.
We haven't been able to reproduce it here either, but colleagues *did* see it with their own eyes at a partner's site. I used scripts for debugging, and can get remote access if there's anything to try out. The effect seems to be really rare (machine-wise), but on machines where it is reproducible it seems that you can trigger it easily. But it also seems to depend on the air pressure or whatever: after shipping a laptop (it even arrived suspended, and did resume successfully) the effect wasn't reproducible. Sigh.
Matthias, I have a recent irq patch for ILK at http://lists.freedesktop.org/archives/intel-gfx/2010-April/006444.html, which hopefully makes first level irq enable/disable more reliable on ILK. Our media team seems to require that patch, so please help to test it. If it is ok, we can push this for stable kernels.
(In reply to comment #13)
> Matthias, I have a recent irq patch for ILK at
> http://lists.freedesktop.org/archives/intel-gfx/2010-April/006444.html, which

Thanks, we had that patch tried, with no effect.

For what it's worth, 2.6.34rc3 seems to be much more stable, but the issue still pops up from time to time.
commit e552eb7038a36d9b18860f525aa02875e313fe16
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Wed Apr 21 11:39:23 2010 -0700

    drm/i915: use PIPE_CONTROL instruction on Ironlake and Sandy Bridge

    Since 965, the hardware has supported the PIPE_CONTROL command, which
    provides fine grained GPU cache flushing control. On recent chipsets,
    this instruction is required for reliable interrupt and sequence number
    reporting in the driver.

    So add support for this instruction, including workarounds, on Ironlake
    and Sandy Bridge hardware.

    https://bugs.freedesktop.org/show_bug.cgi?id=27108

    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Eric Anholt <eric@anholt.net>

commit 1918ad77f7f908ed67cf37c505c6ad4ac52f1ecf
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Apr 23 09:32:23 2010 -0700

    drm/i915: fix non-Ironlake 965 class crashes

    My PIPE_CONTROL fix (just sent via Eric's tree) was buggy; I was
    testing a whole set of patches together and missed a conversion to the
    new HAS_PIPE_CONTROL macro, which will cause breakage on non-Ironlake
    965 class chips. Fortunately, the fix is trivial and has been tested.

    Be sure to use the HAS_PIPE_CONTROL macro in i915_get_gem_seqno, or
    we'll end up reading the wrong graphics memory, likely causing hangs,
    crashes, or worse.

    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    Reported-by: Toralf Förster <toralf.foerster@gmx.de>
    Tested-by: Toralf Förster <toralf.foerster@gmx.de>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that the issue seems to be related to a BIOS change. In any case, the latest patches are currently being checked to see whether they fix the issue as well, and whether they have any side effects. Thanks so far.