I'm forwarding this bug from an Ubuntu reporter: https://bugs.edge.launchpad.net/bugs/568138 [Problem] Reporter gets occasional X freezes with no obvious trigger. The system remains responsive to SSH. Attempting to restart X fails. There are no errors recorded in Xorg.0.log when these freezes occur, but dmesg contains hung task backtraces ending in __mutex_lock_slowpath. Reporter is using libdrm 2.4.18, drm 2.6.33 and xf86-video-intel 2.9.1. I have also experienced freezes with the same dmesg backtrace using libdrm 2.4.20, drm 2.6.35-rc3 and xf86-video-intel 2.11.0 on a GM45 card. [lspci] 00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02) Xorg.0.log: http://launchpadlibrarian.net/44988804/Xorg.0.log Example dmesg outputs containing blocked task backtrace: http://launchpadlibrarian.net/45481306/dmesg http://launchpadlibrarian.net/44988795/dmesg Blocked task & held locks output from lock-debugging kernel: http://launchpadlibrarian.net/47665797/locks.txt intel_gpu_dump output: http://launchpadlibrarian.net/44988870/gpudump [Original report] Occasionally X hangs completely, not even SAK will kill it, however SAK followed by ctrl-alt-del successfully switches to the shutdown splash. When this happens, the following shows up in dmesg: [ 5169.457964] INFO: task i915:302 blocked for more than 120 seconds. [ 5169.457970] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5169.457974] i915 D 00000000ffffffff 0 302 2 0x00000000 [ 5169.457981] ffff88022e0d9d20 0000000000000046 0000000000015bc0 0000000000015bc0 [ 5169.457988] ffff88022f791ab0 ffff88022e0d9fd8 0000000000015bc0 ffff88022f7916f0 [ 5169.457994] 0000000000015bc0 ffff88022e0d9fd8 0000000000015bc0 ffff88022f791ab0 [ 5169.458000] Call Trace: [ 5169.458015] [<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170 [ 5169.458021] [<ffffffff8153f75b>] mutex_lock+0x2b/0x50 [ 5169.458063] [<ffffffffa00d5550>] intel_idle_update+0x60/0x1d0 [i915] [ 5169.458085] [<ffffffffa00d54f0>] ? intel_idle_update+0x0/0x1d0 [i915] [ 5169.458095] [<ffffffff81080777>] run_workqueue+0xc7/0x1a0 [ 5169.458101] [<ffffffff810808f3>] worker_thread+0xa3/0x110 [ 5169.458107] [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40 [ 5169.458113] [<ffffffff81080850>] ? worker_thread+0x0/0x110 [ 5169.458117] [<ffffffff81084fa6>] kthread+0x96/0xa0 [ 5169.458124] [<ffffffff810141ea>] child_rip+0xa/0x20 [ 5169.458129] [<ffffffff81084f10>] ? kthread+0x0/0xa0 [ 5169.458133] [<ffffffff810141e0>] ? child_rip+0x0/0x20 [ 5169.458154] INFO: task Xorg:1364 blocked for more than 120 seconds. [ 5169.458157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5169.458160] Xorg D 0000000000000000 0 1364 1263 0x00400004 [ 5169.458166] ffff88020bd13ca8 0000000000000082 0000000000015bc0 0000000000015bc0 [ 5169.458172] ffff880208c74890 ffff88020bd13fd8 0000000000015bc0 ffff880208c744d0 [ 5169.458177] 0000000000015bc0 ffff88020bd13fd8 0000000000015bc0 ffff880208c74890 [ 5169.458183] Call Trace: [ 5169.458189] [<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170 [ 5169.458195] [<ffffffff8153f75b>] mutex_lock+0x2b/0x50 [ 5169.458214] [<ffffffffa00c3435>] i915_gem_madvise_ioctl+0x55/0x160 [i915] [ 5169.458231] [<ffffffffa006de2a>] drm_ioctl+0x27a/0x480 [drm] [ 5169.458250] [<ffffffffa00c33e0>] ? i915_gem_madvise_ioctl+0x0/0x160 [i915] [ 5169.458256] [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40 [ 5169.458264] [<ffffffff81152a92>] vfs_ioctl+0x22/0xa0 [ 5169.458269] [<ffffffff81152d41>] do_vfs_ioctl+0x81/0x380 [ 5169.458274] [<ffffffff811431df>] ? vfs_read+0x12f/0x1a0 [ 5169.458279] [<ffffffff811530c1>] sys_ioctl+0x81/0xa0 [ 5169.458286] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
I believe this is the relevant commit: commit e552eb7038a36d9b18860f525aa02875e313fe16 Author: Jesse Barnes <jbarnes@virtuousgeek.org> Date: Wed Apr 21 11:39:23 2010 -0700 drm/i915: use PIPE_CONTROL instruction on Ironlake and Sandy Bridge Since 965, the hardware has supported the PIPE_CONTROL command, which provides fine grained GPU cache flushing control. On recent chipsets, this instruction is required for reliable interrupt and sequence number reporting in the driver. So add support for this instruction, including workarounds, on Ironlake and Sandy Bridge hardware. https://bugs.freedesktop.org/show_bug.cgi?id=27108 Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Tested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Eric Anholt <eric@anholt.net>
I have been experiencing this issue on 2.6.35-rc3 when a bunch of kernel debug options are enabled - let me know if you'd like the .config (tuned for an Intel desktop board). It reproduces reliably usually within an hour of activity and we get a similar signature [1] to other reports. Commit e552eb7038a36d9b18860f525aa02875e313fe16 doesn't seem to address the issue, as this commit was introduced in 2.6.34. --- [1] INFO: task i915:774 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. i915 D ffff88010f0ba610 6280 774 2 0x00000000 ffff88010f22dcb0 0000000000000086 ffff88010e742a00 ffff88010f0ba610 ffff88010f22dfd8 00000000001d4440 ffff88010f22dfd8 ffff88010f0ba610 00000000001d4440 00000000001d4440 ffff88010f22dfd8 00000000001d4440 Call Trace: [<ffffffff813a41ba>] ? intel_idle_update+0x4a/0x220 [<ffffffff816425fa>] mutex_lock_nested+0x1ea/0x4c0 [<ffffffff813a41ba>] ? intel_idle_update+0x4a/0x220 [<ffffffff813a41ba>] ? intel_idle_update+0x4a/0x220 [<ffffffff81644d03>] ? _raw_spin_unlock_irqrestore+0x53/0xa0 [<ffffffff813a4170>] ? intel_idle_update+0x0/0x220 [<ffffffff813a41ba>] intel_idle_update+0x4a/0x220 [<ffffffff813a4170>] ? intel_idle_update+0x0/0x220 [<ffffffff81069b20>] worker_thread+0x220/0x390 [<ffffffff81069ace>] ? worker_thread+0x1ce/0x390 [<ffffffff8108475d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8106eb70>] ? autoremove_wake_function+0x0/0x40 [<ffffffff81069900>] ? worker_thread+0x0/0x390 [<ffffffff8106e65e>] kthread+0xae/0xc0 [<ffffffff81004024>] kernel_thread_helper+0x4/0x10 [<ffffffff81645114>] ? restore_args+0x0/0x30 [<ffffffff8106e5b0>] ? kthread+0x0/0xc0 [<ffffffff81004020>] ? kernel_thread_helper+0x0/0x10 INFO: lockdep is turned off. INFO: task Xorg:2004 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Xorg D ffffffff8180a360 4608 2004 1936 0x00400004 ffff8801112d5c58 0000000000000086 ffff8801112d5bd8 ffffffffffffffff ffff8801112d5fd8 00000000001d4440 ffff8801112d5fd8 ffff8801078f0000 00000000001d4440 00000000001d4440 ffff8801112d5fd8 00000000001d4440 Call Trace: [<ffffffff813958fb>] ? i915_gem_throttle_ioctl+0x3b/0x90 [<ffffffff816425fa>] mutex_lock_nested+0x1ea/0x4c0 [<ffffffff813958fb>] ? i915_gem_throttle_ioctl+0x3b/0x90 [<ffffffff814f27f5>] ? sock_aio_write+0x125/0x140 [<ffffffff813958fb>] ? i915_gem_throttle_ioctl+0x3b/0x90 [<ffffffff813958fb>] i915_gem_throttle_ioctl+0x3b/0x90 [<ffffffff8137734a>] drm_ioctl+0x33a/0x4c0 [<ffffffff812d3d9c>] ? debug_object_deactivate+0x5c/0x110 [<ffffffff812d333c>] ? do_raw_spin_unlock+0x6c/0xc0 [<ffffffff8108280d>] ? trace_hardirqs_off+0xd/0x10 [<ffffffff81644d03>] ? _raw_spin_unlock_irqrestore+0x53/0xa0 [<ffffffff812d3deb>] ? debug_object_deactivate+0xab/0x110 [<ffffffff81072ffc>] ? lock_hrtimer_base+0x2c/0x60 [<ffffffff8112eff8>] vfs_ioctl+0x38/0xd0 [<ffffffff8112f1ca>] do_vfs_ioctl+0x8a/0x5b0 [<ffffffff81058dbc>] ? do_setitimer+0x1cc/0x1f0 [<ffffffff810e8d72>] ? might_fault+0x72/0xd0 [<ffffffff8112f73a>] sys_ioctl+0x4a/0x80 [<ffffffff812d3d9c>] ? debug_object_deactivate+0x5c/0x110 [<ffffffff81003182>] system_call_fastpath+0x16/0x1b
Can you confirm that is waiting on a request and not stuck polling registers? i.e. cpu time is 0%. Can you also get the full stack traces of all processes so we can see which one is holding the mutex. The contents of /sys/kernel/debug/dri/0/ (in particular the interrupts and requests) will be useful as would an intel_reg_dumper.
The 6 pipe-control flushes [actually repeated dword stores] are what the docs suggest as the work-around for fun hardware. Can you confirm if you are indeed still seeing this on 2.6.35? And if so give me the info I requested. Thanks.
Still not sure what the cause actually is, but 2.6.36-rc3 contains a patch that will break any waits on a missed interrupt.
Hello, I think I have exactly the same problem. The logfile is really similar. At a certain moment my X hangs, I can do a ssh shell and login and reboot my machine. I think this problem might be caused because the GPU gets to warm. I have had this problem 3 times now, yesterday I got it after only 10 minutes working with my system or so... So it is really become a serious problem for me! I'm using Ubuntu 10.04.1 LTS. Kernel: 2.6.32-24-generic. I have a D945GSEJT board (ATOM N270, 945G intel onboard graphics). This board does not have a FAN, it is cooled passively. I have had many troubles with my first board which at a certain moment did not work at all anymore. So I had to send it back and I got a replacement but I think this is a repaired board. I now have again a problem. This one. Can this problem be caused because the GPU gets to warm because it is not sufficiently cooled? One other thing: It this problem really fixed already? What do I need to do to get the fix? And is the cause clear? I mean is it fixed as a workaround or was it really a bug... Please let me know. I found this page, by googling on the error that I have... Thanks, Pieter Langendonck
We are fairly confident that we have this fixed, so I am marking the report fixed. If the new hangcheck fires or if the machine is still freezing on 2.6.36, then please re-open. Pietr, you have a separate bug; google is mistaken. Unlikely to be overheating since the Atom has such a low TDP that passive cooling should be sufficient, if it was it should be throwing up more errors [MCE] than just hangs.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.