Bug 34307

Summary: [i945gme] GPU lockup (ESR: 0x00000001 IPEHR: 0x00007272)
Product: xorg
Component: Driver/intel
Version: 7.6 (2010.12)
Hardware: x86 (IA32)
OS: Linux (All)
Status: RESOLVED FIXED
Severity: critical
Priority: high
Keywords: regression
Reporter: Bryce Harrington <bryce>
Assignee: Chris Wilson <chris>
QA Contact: Xorg Project Team <xorg-team>
CC: davidcoggins1, jeeves_bond
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description            Flags
BootDmesg.txt          none
CurrentDmesg.txt       none
XorgLog.txt            none
i915_error_state.txt   none
dmesg                  none

Description Bryce Harrington 2011-02-15 12:10:47 UTC
Forwarding this bug from Ubuntu reporter Liam McDermott:
http://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/718767

[Problem]
GPU freeze during regular system usage.  User reports having seen freezes off and on since upgrading to Ubuntu 11.04 a few weeks ago.

We have been getting a variety of gpu dumps similar to this one, with ESR: 0x00000001 and some IPEHR value that varies from report to report.  They generally have dmesg output similar to this one, with no specific error message.

[Original Description]
The notification that this had crashed appeared just after rebooting. The bug reporting tool was also crashing at the same time so it's hard to say when this happened/what the cause was.

ACTHD: 0xffffffff
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x00007272
IPEIR: 0x00000000
INSTDONE: 0x7fffffc1
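
Because these reports are grouped by their ESR and IPEHR values, it can help to pull the registers out of a dump mechanically. The following is a minimal sketch, not the apport hook itself; the function names `parse_registers` and `signature_label` are hypothetical, and the input is just the register block quoted above.

```python
import re

# Register block copied from the description above.
DUMP = """\
ACTHD: 0xffffffff
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x00007272
IPEIR: 0x00000000
INSTDONE: 0x7fffffc1
"""

# One "NAME: 0xVALUE" pair per line.
REG_LINE = re.compile(r"^\s*([A-Z_0-9]+):\s*(0x[0-9a-fA-F]+)\s*$")


def parse_registers(text):
    """Return a dict mapping register name to its integer value."""
    regs = {}
    for line in text.splitlines():
        m = REG_LINE.match(line)
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs


def signature_label(regs):
    """Build the human-readable part of the bug title from ESR and IPEHR."""
    return "ESR: 0x%08x IPEHR: 0x%08x" % (regs["ESR"], regs["IPEHR"])


regs = parse_registers(DUMP)
print(signature_label(regs))
```

Two dumps with the same label can then be compared register by register to see whether anything besides IPEHR differs.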

[   13.413946] mtrr: no more MTRRs available
[   13.413958] [drm] MTRR allocation failed.  Graphics performance may suffer.
[   13.425214] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[   13.425225] [drm] Driver supports precise vblank timestamp query.
[   13.507981] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[   13.508920] [drm] initialized overlay support
...
[ 1120.564076] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1120.571548] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 123059 at 123042, next 123237)
[ 1120.578527] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 1120.672949] show_signal_msg: 6 callbacks suppressed
[ 1120.672964] compiz[1262]: segfault at 0 ip 003690e0 sp bff00110 error 6 in libc-2.12.2.so[255000+15a000]
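
The i915_do_wait_request line carries three sequence numbers, and on Linux a return value of -11 corresponds to -EAGAIN. A small sketch of extracting them follows. It assumes the three numbers are, in order, the awaited seqno, the last seqno the GPU had reached, and the next seqno to be issued; that reading of the message format is an assumption, and `parse_wait_request` is a hypothetical helper name.

```python
import re

# Message text copied from the dmesg excerpt above (timestamp dropped).
LINE = ("[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 "
        "(awaiting 123059 at 123042, next 123237)")

PATTERN = re.compile(
    r"returns (-?\d+) \(awaiting (\d+) at (\d+), next (\d+)\)")


def parse_wait_request(line):
    """Extract the return code and seqnos; None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ret, awaiting, at, nxt = (int(g) for g in m.groups())
    return {
        "ret": ret,
        "awaiting": awaiting,
        "at": at,
        "next": nxt,
        # Requests still outstanding before the awaited one completes
        # (under the assumed meaning of the fields).
        "behind": awaiting - at,
        # Requests queued after the awaited one.
        "queued_after": nxt - awaiting,
    }


info = parse_wait_request(LINE)
print(info["behind"], info["queued_after"])
```

Under that reading, the GPU stopped 17 requests short of the one being waited on, with a further 178 queued behind it.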

ProblemType: Crash
DistroRelease: Ubuntu 11.04
Package: xserver-xorg-video-intel 2:2.14.0-1ubuntu7
ProcVersionSignature: Ubuntu 2.6.38-3.30-generic 2.6.38-rc4
Uname: Linux 2.6.38-3-generic i686
Architecture: i386
Chipset: i945gme
DRM.card0.LVDS.1:
 status: connected
 enabled: enabled
 dpms: On
 modes: 1024x600
 edid-base64: AP///////wAGr9IwAAAAAAETAQOAFg14CmaVlllXkSgfUFQAAAABAQEBAQEBAQEBAQEBAQEBsBMAQEFYGSAYiDEA330AAAAYAAAADwAAAAAAAAAAAAAAAAAgAAAA/gBBVU8KICAgICAgICAgAAAA/gBCMTAxQVcwMyBWMCAKAFI=
DRM.card0.VGA.1:
 status: disconnected
 enabled: disabled
 dpms: Off
 modes:
 edid-base64:
Date: Mon Feb 14 08:44:56 2011
DistUpgraded: Yes, recently upgraded Log time: 2011-01-27 10:36:04.407155
DistroCodename: natty
DistroVariant: ubuntu
DumpSignature: 1d5b69ea (ESR: 0x00000001 IPEHR: 0x00007272)
ExecutablePath: /usr/share/apport/apport-gpu-error-intel.py
GraphicsCard:
 Subsystem: QUANTA Computer Inc Device [152d:1777]
   Subsystem: QUANTA Computer Inc Device [152d:1777]
InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Alpha i386 (20110122)
InterpreterPath: /usr/bin/python2.7
MachineType: Quanta UW1
ProcCmdline: /usr/bin/python /usr/share/apport/apport-gpu-error-intel.py
ProcEnviron:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-3-generic root=UUID=fd4da622-36c1-4d74-811d-a8a5c90f2738 ro quiet splash vt.handoff=7
ProcKernelCmdLine_: BOOT_IMAGE=/boot/vmlinuz-2.6.38-3-generic root=UUID=fd4da622-36c1-4d74-811d-a8a5c90f2738 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 xserver-xorg 1:7.6~3ubuntu4
 libdrm2 2.4.23-1ubuntu3
 xserver-xorg-video-intel 2:2.14.0-1ubuntu7
SourcePackage: xserver-xorg-video-intel
Title: [i945gme] GPU lockup 1d5b69ea (ESR: 0x00000001 IPEHR: 0x00007272)
UserGroups:

dmi.bios.date: 05/19/2009
dmi.bios.vendor: INSYDE
dmi.bios.version: Q3F21
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: Base Board Product Name
dmi.board.vendor: Quanta
dmi.board.version: 03
dmi.chassis.asset.tag: 
dmi.chassis.type: 1
dmi.chassis.vendor: Chassis Manufacturer
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnINSYDE:bvrQ3F21:bd05/19/2009:svnQuanta:pnUW1:pvr04:rvnQuanta:rnBaseBoardProductName:rvr03:cvnChassisManufacturer:ct1:cvrChassisVersion:
dmi.product.name: UW1
dmi.product.version: 04
dmi.sys.vendor: Quanta
version.compiz: compiz 1:0.9.2.1+glibmainloop4-0ubuntu11
version.libdrm2: libdrm2 2.4.23-1ubuntu3
version.libgl1-mesa-glx: libgl1-mesa-glx 7.10-1ubuntu1
version.xserver-xorg: xserver-xorg 1:7.6~3ubuntu4
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:6.13.2+git20110124.fadee040-0ubuntu4
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.14.0-1ubuntu7
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:0.0.16+git20110107+b795ca6e-0ubuntu4
Comment 1 Bryce Harrington 2011-02-15 12:12:01 UTC
Created attachment 43393 [details]
BootDmesg.txt
Comment 2 Bryce Harrington 2011-02-15 12:12:24 UTC
Created attachment 43394 [details]
CurrentDmesg.txt
Comment 3 Bryce Harrington 2011-02-15 12:12:46 UTC
Created attachment 43395 [details]
XorgLog.txt
Comment 5 Bryce Harrington 2011-02-15 12:22:32 UTC
Btw, what is 'IPEHR'?  Is it significant that two otherwise similar gpu crash reports would have differing values?
Comment 6 Chris Wilson 2011-02-19 10:59:45 UTC
IPEHR is the 'instruction pointer error header', i.e. the first dword of the last instruction parsed.

This looks like memory corruption that has nothing to do with i915.ko. Something wrote garbage into the physical memory we are using for the ringbuffer:

0x000078a0:      0x00007272: MI_NOOP
0x000078a4:      0xf1ecfc44:    UNKNOWN
0x000078a8:      0xf1ecfc44:    UNKNOWN
0x000078ac:      0x00000000: MI_NOOP
0x000078b0:      0x00000000: MI_NOOP
0x000078b4:      0x00000000: MI_NOOP

That doesn't match any pattern used by i915.ko, mesa, or the ddx. It could be a wild write from an unrelocated target surface, but that usually clobbers a whole lot more (and starting from the beginning of the ringbuffer).
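
To illustrate why the decoder labels those dwords the way it does, here is a rough sketch of a header classifier. It assumes the usual gen3 command layout, with the client type in bits 31:29 (0 for MI, 2 for 2D, 3 for 3D) and the MI opcode in bits 28:23; those bit positions are an assumption stated from memory rather than taken from this report, and `classify` is a hypothetical name, not the real decoder.

```python
# Ringbuffer dwords quoted above, starting at offset 0x000078a0.
RING = [0x00007272, 0xf1ecfc44, 0xf1ecfc44,
        0x00000000, 0x00000000, 0x00000000]


def classify(dword):
    """Coarsely classify a command header dword (assumed gen3 layout)."""
    client = (dword >> 29) & 0x7          # bits 31:29: command client
    if client == 0:                       # memory interface (MI) commands
        opcode = (dword >> 23) & 0x3f     # bits 28:23: MI opcode
        if opcode == 0:
            return "MI_NOOP"
        return "MI opcode 0x%02x" % opcode
    if client == 2:
        return "2D"
    if client == 3:
        return "3D"
    return "UNKNOWN"                      # no such client type


for offset, dword in zip(range(0x78a0, 0x78a0 + 4 * len(RING), 4), RING):
    print("0x%08x: 0x%08x: %s" % (offset, dword, classify(dword)))
```

Under this layout 0x00007272 only parses as MI_NOOP because its upper bits happen to be zero; the low bits are payload that a driver would not normally set, which fits the reading that the value is stray data rather than a command.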
Comment 7 Chris Wilson 2011-02-19 11:21:54 UTC
Bryce, for all the 915/945 bugs can you please have the reporters test the latest kernel with the enlarged unfenced alignment. That's the most likely cause of random writes, though I don't suspect it in this case.
Comment 8 Bryce Harrington 2011-02-22 15:17:35 UTC
(In reply to comment #7)
> Bryce, for all the 915/945 bugs can you please have the reporters test the
> latest kernel with the enlarged unfenced alignment. That's the most likely
> cause of random writes, though I don't suspect it in this case.

Alright, doing so for both i915 and i945.  I am pointing them at this package repository, which has daily snapshots of the kernel, and currently provides linux-image-2.6.38-999-generic_2.6.38-999.201102221357

  http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/

For reference, what commit(s) provide the enlarged unfenced alignment?  I was not able to locate commit messages referring to unfenced alignments in either the current linus tree or in your drm-intel-next tree.  If the patches help, I'd like to forward them to our kernel team to look at including.
Comment 9 Bryce Harrington 2011-02-22 15:19:20 UTC
Created attachment 43683 [details]
dmesg

Fwiw, I also got this user to test your debug patch on bug #34014.  Attached is his dmesg from after reproducing the lockup.

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/718767/+attachment/1861287/+files/dmesg.txt
Comment 10 Bryce Harrington 2011-02-22 15:49:46 UTC
(In reply to comment #8)
> For reference, what commit(s) provide the enlarged unfenced alignment?  I was
> not able to locate commit messages referring to unfenced alignments in either
> the current linus tree or in your drm-intel-next tree.  If the patches help,
> I'd like to forward them to our kernel team to look at including.

Looks like perhaps kernel commit 5e7833?
Comment 11 Bryce Harrington 2011-03-02 17:37:14 UTC
Chris, I've had multiple i915 and i945 reporters test the current daily kernel.  Universally all say it makes no difference; they all still see these same freezes.

I also have verified we've had that enlarged unfenced alignment (commit 5e7833) in our kernel for some time.
Comment 12 Chris Wilson 2011-03-08 03:01:52 UTC
(In reply to comment #11)
> Chris, I've had multiple i915 and i945 reporters test the current daily kernel.
>  Universally all say it makes no difference; they all still see these same freezes.
> 
> I also have verified we've had that enlarged unfenced alignment (commit 5e7833)
> in our kernel for some time.

That's a relief in one sense. Can you keep the error states coming? Establishing a pattern would be most useful. There's only been one related fix so far:


commit 467cffba85791cdfce38c124d75bd578f4bb8625
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Mar 7 10:42:03 2011 +0000

    drm/i915: Rebind the buffer if its alignment constraints changes with tiling
    
    Early gen3 and gen2 chipset do not have the relaxed per-surface tiling
    constraints of the later chipsets, so we need to check that the GTT
    alignment is correct for the new tiling. If it is not, we need to
    rebind.
    
    Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
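
The constraint that commit message describes can be sketched roughly as follows. This is not the patch: it is an illustration that assumes pre-gen4 tiled objects need a power-of-two fence region (starting at 1 MiB on gen3 and 512 KiB on gen2) and must be aligned to that region's size, while untiled or gen4+ objects only need page alignment. All names here are hypothetical.

```python
PAGE_SIZE = 4096


def fence_region_size(gen, tiled, obj_size):
    """Size of the fence region an object would occupy (assumed rules)."""
    if gen >= 4 or not tiled:
        return obj_size
    size = 1024 * 1024 if gen == 3 else 512 * 1024
    while size < obj_size:
        size <<= 1                      # round up to the next power of two
    return size


def gtt_alignment(gen, tiled, obj_size):
    """Required GTT alignment for the object under the assumed rules."""
    if gen >= 4 or not tiled:
        return PAGE_SIZE
    return fence_region_size(gen, tiled, obj_size)


def needs_rebind(gen, gtt_offset, obj_size, new_tiled):
    """True if the current binding violates the new tiling's alignment."""
    return gtt_offset % gtt_alignment(gen, new_tiled, obj_size) != 0


# A 3 MiB object bound at a 3 MiB offset on gen3:
three_mib = 3 * 1024 * 1024
print(needs_rebind(3, three_mib, three_mib, new_tiled=False))
print(needs_rebind(3, three_mib, three_mib, new_tiled=True))
```

In this example the binding is acceptable while the object is untiled, but enabling tiling raises the required alignment to 4 MiB, so the object would have to be rebound.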
Comment 13 Chris Wilson 2011-03-20 04:05:07 UTC
Can you give drm-intel-staging, and in particular,

commit 0faba0d4e49361886b16c703995a3477951b14e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 17 15:23:22 2011 +0000

    drm/i915: Fix tiling corruption from pipelined fencing
    
    ... even though it was disabled. A mistake in the handling of fence reuse
    caused us to skip the vital delay of waiting for the object to finish
    rendering before changing the register.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
    Cc: Andy Whitcroft <apw@canonical.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

a whirl?
Comment 14 Chris Wilson 2011-03-22 23:53:55 UTC
Working on the theory that it is one and the same bug:

commit b5b5ac2dec49ea5ae033434efa90863aa5cdfb2c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 17 15:23:22 2011 +0000

    drm/i915: Fix tiling corruption from pipelined fencing
    
    ... even though it was disabled. A mistake in the handling of fence reuse
    caused us to skip the vital delay of waiting for the object to finish
    rendering before changing the register.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
    Cc: Andy Whitcroft <apw@canonical.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Dave Airlie <airlied@linux.ie>
