Bug 22336

Summary: [i965] GPU hang with compiz, active system use
Product: xorg
Reporter: Bryce Harrington <bryce>
Component: Driver/intel
Assignee: Eric Anholt <eric>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: critical
Priority: medium
CC: albertomilone, cmsj, jbarnes, jwbaker, mdz, yingying.zhao, zOOmER.gm
Version: 7.4 (2008.09)
Keywords: NEEDINFO
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:

Attachments (description, flags):
- intel_gpu_dump.txt.gz (none)
- dmesg (none)
- intel_gpu_dump output from a subsequent hang (none)
- output of intel_gpu_dump.gz after hang (none)
- intel gpu dump, dmesg and system info (none)
- Avoid wrapping mid-instruction. (none)

Description Bryce Harrington 2009-06-17 10:33:47 UTC
Created attachment 26894 [details]
intel_gpu_dump.txt.gz

Forwarding this Ubuntu bug:
https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/388357

[Problem]
GPU hang with call trace in dmesg occurs subsequent to full screen activity (video playback); also seen by other users after resuming from screen blanking via DPMS and after resuming from screensaver.

[Call Trace]
[ 6000.528124] INFO: task events/1:10 blocked for more than 120 seconds.
[ 6000.528133] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6000.528140] events/1 D 0000000100151496 0 10 2
[ 6000.528152] ffff8800bded1db0 0000000000000046 ffff8800bded1d30 0000000000013000
[ 6000.528163] ffff8800bdec83a8 0000000000013000 0000000000013000 0000000000013000
[ 6000.528173] 0000000000013000 0000000000013000 ffff8800bdec83a8 0000000000013000
[ 6000.528183] Call Trace:
[ 6000.528203] [<ffffffff806d9467>] __mutex_lock_slowpath+0xd7/0x160
[ 6000.528216] [<ffffffff802436b1>] ? finish_task_switch+0x51/0x110
[ 6000.528225] [<ffffffff806d9186>] mutex_lock+0x26/0x50
[ 6000.528260] [<ffffffffa0251ec8>] i915_gem_retire_work_handler+0x38/0x90 [i915]
[ 6000.528283] [<ffffffffa0251e90>] ? i915_gem_retire_work_handler+0x0/0x90 [i915]
[ 6000.528292] [<ffffffff802643d5>] run_workqueue+0x95/0x170
[ 6000.528300] [<ffffffff80264554>] worker_thread+0xa4/0x120
[ 6000.528310] [<ffffffff80268e90>] ? autoremove_wake_function+0x0/0x40
[ 6000.528318] [<ffffffff802644b0>] ? worker_thread+0x0/0x120
[ 6000.528327] [<ffffffff80268a35>] kthread+0x55/0xa0
[ 6000.528335] [<ffffffff802130ca>] child_rip+0xa/0x20
[ 6000.528344] [<ffffffff802689e0>] ? kthread+0x0/0xa0
[ 6000.528351] [<ffffffff802130c0>] ? child_rip+0x0/0x20

[Original Report]
I had finished watching a video in totem, and had been writing email using mutt and vim in a terminal for some time, when the screen stopped updating. My music was still playing, though; everything seemed to be running except for the X server (symptoms similar to bug 359392).

I was able to ssh in from another system and collect intel_gpu_dump output, which I will attach.

/proc/interrupts showed no change in the number of interrupts for i915.
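
The interrupt check above can be sketched as follows. This is only a minimal illustration, not a tool used in this report: the function names are hypothetical, and it assumes the usual /proc/interrupts layout (an IRQ label, one count per CPU, then the handler names) with "i915" appearing in the handler name.

```python
import time


def i915_interrupt_count(interrupts_text):
    """Sum the per-CPU counts on every /proc/interrupts line that mentions i915."""
    total = 0
    for line in interrupts_text.splitlines():
        if "i915" not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label (e.g. "16:"); the per-CPU counts follow it.
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the interrupt controller / handler names
            total += int(field)
    return total


def interrupts_advanced(before_text, after_text):
    """Return True if the i915 interrupt count grew between two snapshots."""
    return i915_interrupt_count(after_text) > i915_interrupt_count(before_text)


def sample_twice(path="/proc/interrupts", delay=5.0):
    """Take two snapshots `delay` seconds apart and compare them (hypothetical helper)."""
    with open(path) as f:
        before = f.read()
    time.sleep(delay)
    with open(path) as f:
        after = f.read()
    return interrupts_advanced(before, after)
```

If `sample_twice()` returns False while the display is frozen, that matches the symptom described here: the i915 interrupt count is not advancing.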

The kernel logged a page allocation failure while intel_gpu_dump was running(!), which will be shown in the attached dmesg.

I've seen it happen twice now (in the span of 2 hours), and both times, dmesg shows the above trace.

ProblemType: Bug
Architecture: amd64
Date: Wed Jun 17 10:20:15 2009
DistroRelease: Ubuntu 9.10
MachineType: LENOVO 6465CTO
Package: xserver-xorg-video-intel 2:2.7.99.1+git20090602.ec2fde7c-0ubuntu2
ProcCmdLine: root=UUID=305dde78-d20a-4248-aaf4-09447b7c5791 ro quiet splash
ProcEnviron:
 LC_COLLATE=C
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/zsh
ProcVersionSignature: Ubuntu 2.6.30-9.10-generic
RelatedPackageVersions:
 xserver-xorg 1:7.4~5ubuntu21
 libgl1-mesa-glx 7.4.1-1ubuntu2
 libdrm2 2.4.11-0ubuntu1
 xserver-xorg-video-intel 2:2.7.99.1+git20090602.ec2fde7c-0ubuntu2
 xserver-xorg-video-ati 1:6.12.2-2ubuntu1
SourcePackage: xserver-xorg-video-intel
Uname: Linux 2.6.30-9-generic x86_64
dmi.bios.date: 01/21/2008
dmi.bios.vendor: LENOVO
dmi.bios.version: 7LETB0WW (2.10 )
dmi.board.name: 6465CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr7LETB0WW(2.10):bd01/21/2008:svnLENOVO:pn6465CTO:pvrThinkPadT61:rvnLENOVO:rn6465CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 6465CTO
dmi.product.version: ThinkPad T61
dmi.sys.vendor: LENOVO
fglrx: Not loaded
system:
 distro: Ubuntu
 architecture: x86_64
 kernel: 2.6.30-9-generic
Comment 1 Bryce Harrington 2009-06-17 10:34:08 UTC
Created attachment 26895 [details]
dmesg
Comment 2 Bryce Harrington 2009-06-17 11:07:32 UTC
Ubuntu bugs with similar backtraces which I suspect are dupes:

  https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/383973
  Freeze when trying to resume from a blanked screen

  https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/383822
  Freeze when trying to resume from screensaver

  https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/384242
  Freeze after DPMS has kicked in

  https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/384865
  kernel oops with intel graphics when screensaver turns screen off

Comment 3 Bryce Harrington 2009-06-17 11:08:23 UTC
*** Bug 22318 has been marked as a duplicate of this bug. ***
Comment 4 Eric Anholt 2009-06-17 11:33:48 UTC
That backtrace is a generic "the gpu is hung" backtrace.  Don't use it for classifying bugs.

The dump in this report is broken because there were too many batchbuffers queued up and seqfile failed thanks to its use of kmalloc (the page allocation failure warning).  If you can find a way to reliably reproduce the problem, and any 3D applications are in use, running them with INTEL_DEBUG=sync in the environment may help get successful dumping.
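
One way to launch an application with that variable set is sketched below. The helper names are hypothetical; the only detail taken from this comment is the INTEL_DEBUG=sync setting itself, and the command to run is whatever 3D application is in use.

```python
import os
import subprocess


def env_with_sync(base_env=None):
    """Return a copy of the environment with INTEL_DEBUG=sync added."""
    env = dict(os.environ if base_env is None else base_env)
    env["INTEL_DEBUG"] = "sync"
    return env


def run_with_sync(argv):
    """Run a command (list of arguments) with INTEL_DEBUG=sync; return its exit code."""
    return subprocess.run(argv, env=env_with_sync()).returncode
```

From a shell the equivalent is to prefix the command with `INTEL_DEBUG=sync`.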

Comment 5 Bryce Harrington 2009-06-17 11:47:55 UTC
Here is a backtrace of the X server at the time of the hang:

#0 0x00007feedbda0ec7 in ioctl () from /lib/libc.so.6
#1 0x00007feeda9812e3 in drmIoctl () from /usr/lib/libdrm.so.2
#2 0x00007feeda9815e6 in drmCommandNone () from /usr/lib/libdrm.so.2
#3 0x00007feeda50b370 in I830BlockHandler (i=0,
    blockData=<value optimized out>, pTimeout=0x7fff7b671df8,
    pReadmask=0x7dff80) at ../../src/i830_driver.c:2281
#4 0x0000000000536885 in AnimCurScreenBlockHandler (
    screenNum=<value optimized out>, blockData=<value optimized out>,
    pTimeout=<value optimized out>, pReadmask=<value optimized out>)
    at ../../render/animcur.c:222
#5 0x0000000000500d86 in compBlockHandler (i=0, blockData=0x0,
    pTimeout=0x7fff7b671df8, pReadmask=<value optimized out>)
    at ../../composite/compinit.c:158
#6 0x00000000004520e0 in BlockHandler (pTimeout=0x7fff7b671df8,
    pReadmask=0x7dff80) at ../../dix/dixutils.c:384
#7 0x00000000004eed31 in WaitForSomething (
    pClientsReady=<value optimized out>) at ../../os/WaitFor.c:215
#8 0x000000000044dd52 in Dispatch () at ../../dix/dispatch.c:367
#9 0x0000000000433f15 in main (argc=<value optimized out>,
    argv=0x7fff7b672018, envp=<value optimized out>) at ../../dix/main.c:397

Looks the same as in
https://bugs.freedesktop.org/show_bug.cgi?id=20560
Comment 6 Matt Zimmerman 2009-06-18 14:02:27 UTC
I have experienced this problem several times without the use of any 3D applications (I switched from compiz to metacity in hopes of a workaround).

I'll attach another GPU dump.  This one was taken sooner after the hang, so perhaps it will be more useful.  How can I tell a useful dump from a useless one?
Comment 7 Matt Zimmerman 2009-06-18 14:03:01 UTC
Created attachment 26932 [details]
intel_gpu_dump output from a subsequent hang
Comment 8 Eric Anholt 2009-06-22 08:57:47 UTC
The dump there looks pretty sane.  Do you have a way to reproduce this bug?
Comment 9 Matt Zimmerman 2009-06-23 03:47:25 UTC
"sane" as in "not broken like the previous one" or "sane" as in "contains no indication of any problem"?  Has this information provided any clue as to where the problem lies?

I've switched from compiz to metacity to get my life back, but was seeing this recur a couple of times per day while using compiz.  I expect I could reproduce it by going back to compiz.  Is there something more I can do to help diagnose the problem if indeed I can reproduce it?  I am happy to try.

If you want to have a go at reproducing it on your own hardware, I recommend trying with Ubuntu Karmic alpha 2: http://cdimage.ubuntu.com/releases/karmic/alpha-2/
Comment 10 Eric Anholt 2009-06-24 14:03:36 UTC
"contains no indication of any problem"

There are changes in intel-gpu-tools git that improve dump reporting and might have more information, but I don't expect it to help.  We just need to figure out how to reliably reproduce the problem in a short period of time, so we can fix it.
Comment 11 jwbaker 2009-07-05 12:02:59 UTC
*** Bug 22624 has been marked as a duplicate of this bug. ***
Comment 12 jwbaker 2009-07-05 12:04:43 UTC
I have a 100% reliable way to reproduce this on Ubuntu Karmic x86_64.  On any normal system with all defaults, kms and compiz enabled, just log in and wait for the screen to blank.  That hangs the display right there, and I get these same stacks in dmesg (see my duplicated bug, which actually has two hung stacks, not just the one noted here).
Comment 13 jwbaker 2009-07-05 12:22:50 UTC
Created attachment 27401 [details]
output of intel_gpu_dump.gz after hang

My GPU dump after hang ... looks similar to previous but I'm not smart enough to tell the difference.
Comment 14 Eric Anholt 2009-07-06 13:37:50 UTC
jwbaker, you have a completely different bug.  We still need to figure out how to reliably reproduce this one.
Comment 15 Matt Zimmerman 2009-07-08 02:37:59 UTC
What additional information can I provide?  The best recipe I have so far is:

- Install a recent Ubuntu snapshot
- Boot the system
- Work normally in X for a while

I've re-enabled compiz to confirm that it still happens with the latest bits.  However, unless you can provide instructions for diagnosis, the best I'll be able to do is run intel_gpu_dump and attach another (probably useless) dump.
Comment 16 Achim Frase 2009-08-02 08:52:40 UTC
Created attachment 28262 [details]
intel gpu dump, dmesg and system info

I am not 100% sure if I have the same problem but I hope the attached information will help to clarify this.

How I produced the GPU hang:

(steps which were used to produce the gpu_dump)
1. compiz is in use
2. open 25-30 pictures 1920x1200 with EOG
3. press <Alt>+<Tab>
4. the screen should now be frozen except the mouse cursor, but it could be that the mouse cursor is also frozen.

I was not able to reproduce this with metacity as window manager (30 pictures).

Another way to hang the GPU is to change the wallpaper in gnome while compiz is active (1920x1200). The GPU doesn't always hang immediately.

1. compiz is in use
2. 1-3 workspaces, each with a full-screen window open
3. 4th workspace to change the wallpaper
4. the screen should be frozen immediately or after some time.

If it doesn't hang immediately, you should choose different wallpapers until the system is frozen. I think the system freezes much more easily if it has been in use for some time before I try to change the wallpaper.

I hope this information is somehow helpful.

Regards
Achim
Comment 17 Eric Anholt 2009-08-07 19:07:04 UTC
I can't say for sure in your case: you didn't mention using any other 3D apps, it looks like you've got a screen with an appropriately aligned height, and it looks like compiz doesn't use a depth buffer. It may still be worth trying with this commit series:


xf86-video-intel:
commit e8f0763d405a8152c74c28792c52fe12c1d41dd5
Author: Eric Anholt <eric@anholt.net>
Date:   Fri Aug 7 18:24:44 2009 -0700

    Fix math in the tiling alignment fix.

commit 222b52ef16895823fbf3a0fc0be4eb23b930ed1b
Author: Eric Anholt <eric@anholt.net>
Date:   Fri Aug 7 18:05:29 2009 -0700

    Align tiled pixmap height so we don't address beyond the end of our buffers.

Mesa:
commit ceb8afcca5b0a52b005a782ea54b301beaee1a15
Author: Eric Anholt <eric@anholt.net>
Date:   Fri Aug 7 18:09:31 2009 -0700

    intel: Align region height as required for tiled regions.

    Otherwise, we would address beyond the end of our buffers.  Fixes reliable
    GPU segfault with texture_tiling=true and oglconform shadow.c.

    Bug #22406.
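
The idea behind these alignment commits can be sketched as rounding a buffer's height up to a whole number of tile rows, so that the allocation covers every row the hardware may address. This is an illustration only, not the code from the commits: the function name is hypothetical, and the tile heights used in the examples (8 rows and 32 rows) are assumptions about the tiling layouts on this hardware.

```python
def align_height(height, tile_height):
    """Round height up to the next multiple of tile_height (both in rows)."""
    if tile_height <= 0:
        raise ValueError("tile_height must be positive")
    return ((height + tile_height - 1) // tile_height) * tile_height


# Assumed tile heights, for illustration only.
ASSUMED_X_TILE_ROWS = 8
ASSUMED_Y_TILE_ROWS = 32

aligned_screen = align_height(1200, ASSUMED_X_TILE_ROWS)  # already a multiple of 8
aligned_odd = align_height(1201, ASSUMED_X_TILE_ROWS)     # padded up to the next tile row
```

Under these assumptions a 1200-row buffer needs no padding with 8-row tiles, while a 1201-row buffer would be allocated as 1208 rows.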
Comment 18 Chris Wilson 2009-09-07 02:06:39 UTC
Created attachment 29294 [details] [review]
Avoid wrapping mid-instruction.

The first gpu dump shows that we wrapped the ringbuffer mid-instruction, which is invalid according to the docs. I've posted this patch for review.
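
A rough sketch of the general technique, not of the attached patch: before writing a multi-dword command, check whether it fits in the space left before the end of the ring, and if not, pad the remainder with no-ops and start the command at offset 0. The class, its fields, and the use of 0 as the no-op value are assumptions for illustration; the sketch also ignores the head pointer and free-space tracking that a real ring needs.

```python
NOOP = 0  # assumed no-op dword used for padding in this illustration


class Ring:
    """Toy ring buffer that never splits a command across the wrap point."""

    def __init__(self, size_dwords):
        self.buf = [NOOP] * size_dwords
        self.tail = 0

    def emit(self, command):
        """Write one command (a list of dwords); return the offset where it starts."""
        size = len(self.buf)
        if len(command) > size:
            raise ValueError("command is larger than the ring")
        remaining = size - self.tail
        if len(command) > remaining:
            # The command would straddle the end of the ring: fill the
            # remainder with no-ops and restart at the beginning instead.
            for i in range(self.tail, size):
                self.buf[i] = NOOP
            self.tail = 0
        start = self.tail
        for i, dword in enumerate(command):
            self.buf[start + i] = dword
        self.tail = (start + len(command)) % size
        return start
```

With an 8-dword ring and three 3-dword commands, the third command lands at offset 0 and the last two slots hold padding rather than the first two dwords of a split command.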
Comment 19 Gordon Jin 2009-09-21 20:11:53 UTC
Decreasing priority so that this does not block the Q3 release, due to lack of response.
Comment 20 Eric Anholt 2009-09-25 11:14:13 UTC
Closing this due to lack of response.  If the problem continues with the components updated for the other hangs we've fixed, please reopen.
