Bug 28805

Summary: [Arrandale] Intermittent freezes - missed interrupt?
Product: xorg Reporter: Christopher James Halse Rogers <chalserogers>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: daniel, ilmari
Version: unspecifiedKeywords: NEEDINFO
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:

Description Christopher James Halse Rogers 2010-06-28 23:35:38 UTC
I'm forwarding this bug from an Ubuntu reporter:
https://bugs.edge.launchpad.net/bugs/568138

[Problem]
Reporter gets occasional X freezes with no obvious trigger.  The system remains responsive to SSH.  Attempting to restart X fails.

There are no errors recorded in Xorg.0.log when these freezes occur, but dmesg contains hung task backtraces ending in __mutex_lock_slowpath.

Reporter is using libdrm 2.4.18, drm 2.6.33 and xf86-video-intel 2.9.1.
I have also experienced freezes with the same dmesg backtrace using libdrm 2.4.20, drm 2.6.35-rc3 and xf86-video-intel 2.11.0 on a GM45 card.

[lspci]
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02)

Xorg.0.log: http://launchpadlibrarian.net/44988804/Xorg.0.log
Example dmesg outputs containing blocked task backtrace: 
http://launchpadlibrarian.net/45481306/dmesg
http://launchpadlibrarian.net/44988795/dmesg
Blocked task & held locks output from lock-debugging kernel: http://launchpadlibrarian.net/47665797/locks.txt
intel_gpu_dump output: http://launchpadlibrarian.net/44988870/gpudump

[Original report]
Occasionally X hangs completely, not even SAK will kill it, however SAK followed by ctrl-alt-del successfully switches to the shutdown splash.

When this happens, the following shows up in dmesg:

[ 5169.457964] INFO: task i915:302 blocked for more than 120 seconds.
[ 5169.457970] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5169.457974] i915 D 00000000ffffffff 0 302 2 0x00000000
[ 5169.457981] ffff88022e0d9d20 0000000000000046 0000000000015bc0 0000000000015bc0
[ 5169.457988] ffff88022f791ab0 ffff88022e0d9fd8 0000000000015bc0 ffff88022f7916f0
[ 5169.457994] 0000000000015bc0 ffff88022e0d9fd8 0000000000015bc0 ffff88022f791ab0
[ 5169.458000] Call Trace:
[ 5169.458015] [<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
[ 5169.458021] [<ffffffff8153f75b>] mutex_lock+0x2b/0x50
[ 5169.458063] [<ffffffffa00d5550>] intel_idle_update+0x60/0x1d0 [i915]
[ 5169.458085] [<ffffffffa00d54f0>] ? intel_idle_update+0x0/0x1d0 [i915]
[ 5169.458095] [<ffffffff81080777>] run_workqueue+0xc7/0x1a0
[ 5169.458101] [<ffffffff810808f3>] worker_thread+0xa3/0x110
[ 5169.458107] [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
[ 5169.458113] [<ffffffff81080850>] ? worker_thread+0x0/0x110
[ 5169.458117] [<ffffffff81084fa6>] kthread+0x96/0xa0
[ 5169.458124] [<ffffffff810141ea>] child_rip+0xa/0x20
[ 5169.458129] [<ffffffff81084f10>] ? kthread+0x0/0xa0
[ 5169.458133] [<ffffffff810141e0>] ? child_rip+0x0/0x20
[ 5169.458154] INFO: task Xorg:1364 blocked for more than 120 seconds.
[ 5169.458157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5169.458160] Xorg D 0000000000000000 0 1364 1263 0x00400004
[ 5169.458166] ffff88020bd13ca8 0000000000000082 0000000000015bc0 0000000000015bc0
[ 5169.458172] ffff880208c74890 ffff88020bd13fd8 0000000000015bc0 ffff880208c744d0
[ 5169.458177] 0000000000015bc0 ffff88020bd13fd8 0000000000015bc0 ffff880208c74890
[ 5169.458183] Call Trace:
[ 5169.458189] [<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
[ 5169.458195] [<ffffffff8153f75b>] mutex_lock+0x2b/0x50
[ 5169.458214] [<ffffffffa00c3435>] i915_gem_madvise_ioctl+0x55/0x160 [i915]
[ 5169.458231] [<ffffffffa006de2a>] drm_ioctl+0x27a/0x480 [drm]
[ 5169.458250] [<ffffffffa00c33e0>] ? i915_gem_madvise_ioctl+0x0/0x160 [i915]
[ 5169.458256] [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
[ 5169.458264] [<ffffffff81152a92>] vfs_ioctl+0x22/0xa0
[ 5169.458269] [<ffffffff81152d41>] do_vfs_ioctl+0x81/0x380
[ 5169.458274] [<ffffffff811431df>] ? vfs_read+0x12f/0x1a0
[ 5169.458279] [<ffffffff811530c1>] sys_ioctl+0x81/0xa0
[ 5169.458286] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
Comment 1 Chris Wilson 2010-07-05 11:57:06 UTC
I believe this is the relevant commit:

commit e552eb7038a36d9b18860f525aa02875e313fe16
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Wed Apr 21 11:39:23 2010 -0700

    drm/i915: use PIPE_CONTROL instruction on Ironlake and Sandy Bridge
    
    Since 965, the hardware has supported the PIPE_CONTROL command, which
    provides fine grained GPU cache flushing control.  On recent chipsets,
    this instruction is required for reliable interrupt and sequence number
    reporting in the driver.
    
    So add support for this instruction, including workarounds, on Ironlake
    and Sandy Bridge hardware.
    
    https://bugs.freedesktop.org/show_bug.cgi?id=27108
    
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Eric Anholt <eric@anholt.net>
Comment 2 Daniel J Blueman 2010-07-11 02:08:21 UTC
I have been experiencing this issue on 2.6.35-rc3 when a bunch of kernel debug options are enabled - let me know if you'd like the .config (tuned for an Intel desktop board). It reproduces reliably usually within an hour of activity and we get a similar signature [1] to other reports.

Commit e552eb7038a36d9b18860f525aa02875e313fe16 doesn't seem to address the issue, as this commit was introduced in 2.6.34.

--- [1]

INFO: task i915:774 blocked for more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

i915          D ffff88010f0ba610  6280   774      2 0x00000000

 ffff88010f22dcb0 0000000000000086 ffff88010e742a00 ffff88010f0ba610

 ffff88010f22dfd8 00000000001d4440 ffff88010f22dfd8 ffff88010f0ba610

 00000000001d4440 00000000001d4440 ffff88010f22dfd8 00000000001d4440

Call Trace:

 [<ffffffff813a41ba>] ? intel_idle_update+0x4a/0x220

 [<ffffffff816425fa>] mutex_lock_nested+0x1ea/0x4c0

 [<ffffffff813a41ba>] ? intel_idle_update+0x4a/0x220

 [<ffffffff813a41ba>] ? intel_idle_update+0x4a/0x220

 [<ffffffff81644d03>] ? _raw_spin_unlock_irqrestore+0x53/0xa0

 [<ffffffff813a4170>] ? intel_idle_update+0x0/0x220

 [<ffffffff813a41ba>] intel_idle_update+0x4a/0x220

 [<ffffffff813a4170>] ? intel_idle_update+0x0/0x220

 [<ffffffff81069b20>] worker_thread+0x220/0x390

 [<ffffffff81069ace>] ? worker_thread+0x1ce/0x390

 [<ffffffff8108475d>] ? trace_hardirqs_on+0xd/0x10

 [<ffffffff8106eb70>] ? autoremove_wake_function+0x0/0x40

 [<ffffffff81069900>] ? worker_thread+0x0/0x390

 [<ffffffff8106e65e>] kthread+0xae/0xc0

 [<ffffffff81004024>] kernel_thread_helper+0x4/0x10

 [<ffffffff81645114>] ? restore_args+0x0/0x30

 [<ffffffff8106e5b0>] ? kthread+0x0/0xc0

 [<ffffffff81004020>] ? kernel_thread_helper+0x0/0x10

INFO: lockdep is turned off.

INFO: task Xorg:2004 blocked for more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Xorg          D ffffffff8180a360  4608  2004   1936 0x00400004

 ffff8801112d5c58 0000000000000086 ffff8801112d5bd8 ffffffffffffffff

 ffff8801112d5fd8 00000000001d4440 ffff8801112d5fd8 ffff8801078f0000

 00000000001d4440 00000000001d4440 ffff8801112d5fd8 00000000001d4440

Call Trace:

 [<ffffffff813958fb>] ? i915_gem_throttle_ioctl+0x3b/0x90

 [<ffffffff816425fa>] mutex_lock_nested+0x1ea/0x4c0

 [<ffffffff813958fb>] ? i915_gem_throttle_ioctl+0x3b/0x90

 [<ffffffff814f27f5>] ? sock_aio_write+0x125/0x140

 [<ffffffff813958fb>] ? i915_gem_throttle_ioctl+0x3b/0x90

 [<ffffffff813958fb>] i915_gem_throttle_ioctl+0x3b/0x90

 [<ffffffff8137734a>] drm_ioctl+0x33a/0x4c0

 [<ffffffff812d3d9c>] ? debug_object_deactivate+0x5c/0x110

 [<ffffffff812d333c>] ? do_raw_spin_unlock+0x6c/0xc0

 [<ffffffff8108280d>] ? trace_hardirqs_off+0xd/0x10

 [<ffffffff81644d03>] ? _raw_spin_unlock_irqrestore+0x53/0xa0

 [<ffffffff812d3deb>] ? debug_object_deactivate+0xab/0x110

 [<ffffffff81072ffc>] ? lock_hrtimer_base+0x2c/0x60

 [<ffffffff8112eff8>] vfs_ioctl+0x38/0xd0

 [<ffffffff8112f1ca>] do_vfs_ioctl+0x8a/0x5b0

 [<ffffffff81058dbc>] ? do_setitimer+0x1cc/0x1f0

 [<ffffffff810e8d72>] ? might_fault+0x72/0xd0

 [<ffffffff8112f73a>] sys_ioctl+0x4a/0x80

 [<ffffffff812d3d9c>] ? debug_object_deactivate+0x5c/0x110

 [<ffffffff81003182>] system_call_fastpath+0x16/0x1b
Comment 3 Chris Wilson 2010-07-22 10:06:19 UTC
Can you confirm that is waiting on a request and not stuck polling registers? i.e. cpu time is 0%.

Can you also get the full stack traces of all processes so we can see which one is holding the mutex.

The contents of /sys/kernel/debug/dri/0/ (in particular the interrupts and requests) will be useful as would an intel_reg_dumper.
Comment 4 Chris Wilson 2010-08-06 00:33:14 UTC
The 6 pipe-control flushes [actually repeated dword stores] are what the docs suggest as the work-around for fun hardware. Can you confirm if you are indeed still seeing this on 2.6.35? And if so give me the info I requested. Thanks.
Comment 5 Chris Wilson 2010-09-06 10:23:30 UTC
Still not sure what the cause actually is, but 2.6.36-rc3 contains a patch that will break any waits on a missed interrupt.
Comment 6 Pieter L. 2010-09-14 01:54:28 UTC
Hello,

I think I have exactly the same problem. The logfile is really similar. 
At a certain moment my X hangs, I can do a ssh shell and login and reboot my machine.

I think this problem might be caused because the GPU gets to warm. 

I have had this problem 3 times now, yesterday I got it after only 10 minutes working with my system or so... So it is really become a serious problem for me!

I'm using Ubuntu 10.04.1 LTS. Kernel: 2.6.32-24-generic.
I have a D945GSEJT board (ATOM N270, 945G intel onboard graphics).
This board does not have a FAN, it is cooled passively.

I have had many troubles with my first board which at a certain moment did not work at all anymore. So I had to send it back and I got a replacement but I think this is a repaired board. 
I now have again a problem. This one. 

Can this problem be caused because the GPU gets to warm because it is not sufficiently cooled?


One other thing:
It this problem really fixed already?
What do I need to do to get the fix? And is the cause clear? I mean is it fixed as a workaround or was it really a bug... 

Please let me know.
I found this page, by googling on the error that I have...

Thanks,
Pieter Langendonck
Comment 7 Chris Wilson 2010-09-19 11:33:53 UTC
We are fairly confident that we have this fixed, so I am marking the report fixed. If the new hangcheck fires or if the machine is still freezing on 2.6.36, then please re-open.

Pietr, you have a separate bug; google is mistaken. Unlikely to be overheating since the Atom has such a low TDP that passive cooling should be sufficient, if it was it should be throwing up more errors [MCE] than just hangs.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.