Bug 32396 - GPU lockup - Invalid GTT entry during Display B Fetch or Host Access
Summary: GPU lockup - Invalid GTT entry during Display B Fetch or Host Access
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Daniel Vetter
QA Contact:
URL:
Whiteboard:
Keywords: regression
Duplicates: 35974 35975 35976
Depends on:
Blocks:
 
Reported: 2010-12-14 10:38 UTC by Bryce Harrington
Modified: 2016-11-03 12:20 UTC
CC List: 4 users

See Also:
i915 platform:
i915 features:


Attachments
BootDmesg.txt (67.64 KB, text/plain)
2010-12-14 10:42 UTC, Bryce Harrington
CurrentDmesg.txt (3.49 KB, text/plain)
2010-12-14 10:42 UTC, Bryce Harrington
i915_error_state.txt (672.27 KB, text/plain)
2010-12-14 10:43 UTC, Bryce Harrington
IntelGpuDump.txt (110.10 KB, text/plain)
2011-01-18 09:58 UTC, Bryce Harrington
i915_error_state.txt (672.27 KB, text/plain)
2011-01-20 14:46 UTC, Bryce Harrington
Disable outputs before KMS takeover (5.33 KB, patch)
2011-03-29 03:44 UTC, Chris Wilson

Description Bryce Harrington 2010-12-14 10:38:14 UTC
Forwarding this bug from Ubuntu reporter pschonmann:
http://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/686388

[Problem]
One-time crash occurred while using firefox.  Steps to reproduce are unknown.

[i965gm] GPU lockup - Invalid GTT entry during Display B Fetch

[Original Description]
Crashed while I was reporting https://bugs.launchpad.net/ubuntu/+source/apt-xapian-index/+bug/686386

This bug has appeared only once so far.
No changes in HW / SW, just updating to the latest natty packages.
I did not do anything unusual, just ran Update Manager and updated packages.

---
Time: 1291713055 s 678993 us
PCI ID: 0x2a02
EIR: 0x00000010
  PGTBL_ER: 0x00000100
    Invalid GTT entry during Display B Fetch
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xffe5fafe
    busy: Projection and LOD
    busy: Bypass FIFO
    busy: Color calculator
  ACTHD: 0x00000000
  INSTPS: 0x00100000
  INSTDONE1: 0x000fffff
---
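
For anyone comparing dumps across the duplicate reports, here is a minimal Python sketch that reads the register lines of a dump like the one above and lists which bit positions are set. The register values are copied from this report; the helper names (parse_registers, set_bits) are made up for illustration and are not part of apport or intel-gpu-tools. It does not attempt to name the bits, since the only bit meaning given here is the dump's own annotation of PGTBL_ER 0x00000100.

```python
import re

# Register lines copied from the error dump in this report.
DUMP = """\
EIR: 0x00000010
  PGTBL_ER: 0x00000100
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xffe5fafe
  ACTHD: 0x00000000
  INSTPS: 0x00100000
  INSTDONE1: 0x000fffff
"""

# Matches lines of the form "NAME: 0xHEX", with optional indentation.
REG_LINE = re.compile(r"^\s*([A-Z0-9_]+):\s*0x([0-9a-fA-F]+)\s*$")


def parse_registers(text):
    """Return a dict mapping register name to its integer value."""
    regs = {}
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def set_bits(value):
    """Return the positions of the bits set in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


regs = parse_registers(DUMP)
for name, value in regs.items():
    print(f"{name:10s} 0x{value:08x} bits set: {set_bits(value)}")
```

With these values, EIR has only bit 4 set and PGTBL_ER has only bit 8 set, which is the bit the dump annotates as "Invalid GTT entry during Display B Fetch".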

ProblemType: Crash
DistroRelease: Ubuntu 11.04
Package: xserver-xorg-video-intel 2:2.13.901-2ubuntu1
ProcVersionSignature: Ubuntu 2.6.37-7.19-generic 2.6.37-rc3
Uname: Linux 2.6.37-7-generic x86_64
Architecture: amd64
Chipset: i965gm
DRM.card0.DVI.D.1:
 status: disconnected
 enabled: disabled
 dpms: Off
 modes:
 edid-base64:
DRM.card0.LVDS.1:
 status: connected
 enabled: enabled
 dpms: On
 modes: 1440x900 1440x900 1024x768 800x600 640x480
 edid-base64: AP///////wAwrjNAAAAAAAAPAQOAHhN46s11kVVPiyYhUFQhCAABAQEBAQEBAQEBAQEBAQEBMiagQFGEGjAwIDYAL74QAAAZ1R+gQFGEGjAwIDYAL74QAAAZAAAADwCQCjKQCigUAQBMo1dEAAAA/gBMVE4xNDFXRC1MMDUKAKY=
DRM.card0.VGA.1:
 status: disconnected
 enabled: disabled
 dpms: Off
 modes:
 edid-base64:
Date: Tue Dec  7 10:10:58 2010
DumpSignature: 94ada031 (EIR: 0x00000010)
ExecutablePath: /usr/share/apport/apport-gpu-error-intel.py
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
InterpreterPath: /usr/bin/python2.6
MachineType: LENOVO 7661CH8
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdline: /usr/bin/python /usr/share/apport/apport-gpu-error-intel.py
ProcEnviron:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.37-7-generic root=UUID=a3d37e1f-bbf7-4766-b8cb-22b445b6e7e0 ro quiet splash
ProcKernelCmdLine_: BOOT_IMAGE=/boot/vmlinuz-2.6.37-7-generic root=UUID=a3d37e1f-bbf7-4766-b8cb-22b445b6e7e0 ro quiet splash
SourcePackage: xserver-xorg-video-intel
Title: [i965gm] GPU lockup 94ada031 (EIR: 0x00000010)
UserGroups:

dmi.bios.date: 04/08/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 7LETC7WW (2.27 )
dmi.board.name: 7661CH8
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: BB70301820
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr7LETC7WW(2.27):bd04/08/2010:svnLENOVO:pn7661CH8:pvrThinkPadT61:rvnLENOVO:rn7661CH8:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 7661CH8
dmi.product.version: ThinkPad T61
dmi.sys.vendor: LENOVO
glxinfo: Error: [Errno 2] No such file or directory
system:
  codename:           natty
 architecture:       x86_64
 kernel:             2.6.37-7-generic
Comment 1 Bryce Harrington 2010-12-14 10:42:22 UTC
Created attachment 41117
BootDmesg.txt
Comment 2 Bryce Harrington 2010-12-14 10:42:40 UTC
Created attachment 41118
CurrentDmesg.txt
Comment 3 Bryce Harrington 2010-12-14 10:43:00 UTC
Created attachment 41119
i915_error_state.txt
Comment 5 Chris Wilson 2010-12-14 14:07:39 UTC
That PGTBL_ER only makes sense in conjunction with a mode change. I can't see the actual crash dmesg to confirm that it was the only error detected along with the crash running firefox.
Comment 6 Bryce Harrington 2011-01-03 12:29:00 UTC
The same user saw a similar GPU lockup (same PGTBL error code, at least) during a resume from suspend.

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/694244
Comment 7 Bryce Harrington 2011-01-18 09:58:51 UTC
Created attachment 42174
IntelGpuDump.txt

Here's the GPU dump from the last freeze.  This includes several calls to XY_COLOR_BLT() - we've had several other bug reports with this in their batchbuffers, although the gpu dumps differ from bug to bug.

Bugzilla won't allow the raw file to be attached, so I've attached it gzipped and here's a link:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/694244/+attachment/1775717/+files/IntelGpuDump.txt
Comment 8 Chris Wilson 2011-01-19 01:17:51 UTC
(In reply to comment #7)
> Here's the GPU dump from the last freeze.  This includes several calls to
> XY_COLOR_BLT() - we've had several other bug reports with this in their
> batchbuffers, although the gpu dumps differ from bug to bug.

The contents of the batch buffers are more or less irrelevant to the reported error, since we only manipulate the display surfaces directly from the kernel. The latest i915_error_state is more interesting with regard to capturing these errors, since it includes the display settings and the pinned buffers.

Pinning down the timing to that of a mode change and confirming that on 2.6.38-rc1, which has extra paranoia with regard to the timing of the display surface removal and the improved error state, would be most useful.
Comment 9 Bryce Harrington 2011-01-20 14:46:26 UTC
Created attachment 42241
i915_error_state.txt

No chance, kernel team says we have 2.6.38-rc1 but it's too horridly borked to foist on users; the kernel guys say it doesn't even boot.

But here's the i915_error_state.txt you asked for.
Comment 10 Chris Wilson 2011-03-29 03:44:54 UTC
Created attachment 44991
Disable outputs before KMS takeover

If the reported hang was occurring early in the boot process, then the attached might be an answer. But afaics, this hang is much later...
Comment 11 Chris Wilson 2011-04-05 00:30:20 UTC
Hmm, there is a second patch required to fix an oops. In drm-intel-staging:

commit ea1167d6601f370f5d7e425eb0b3c7577edd02cd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 29 13:19:09 2011 +0100

    drm/i915: Move the irq wait queue initialisation into the ring init
    
    Required so that we don't obliterate the queue if initialising the
    rings after the global IRQ handler is installed.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit f8acdf5aa142926961e1f7ddb9e86490c50f8e6a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 29 10:40:27 2011 +0100

    drm/i915: Disable all outputs early, before KMS takeover
    
    If the outputs are active and continuing to access the GATT when we
    teardown the PTEs, then there is a potential for us to hang the GPU.
    The hang tends to be a PGTBL_ER with either an invalid host access or
    an invalid display plane fetch.
    
    Reported-by: Pekka Enberg <penberg@kernel.org>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
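
The ordering hazard described in the second commit message can be illustrated with a toy model. This is only a sketch in Python, not the kernel code and not the contents of the patch: ToyGtt, ToyPlane and takeover are hypothetical names, and the "GTT" is just a list in which None stands for a torn-down PTE. It shows that a plane left enabled keeps fetching through entries that have been cleared, while disabling the outputs first leaves nothing to fault.

```python
class ToyGtt:
    """Toy page table: index -> backing page, None means the PTE is invalid."""

    def __init__(self, num_pages):
        self.ptes = ["page%d" % i for i in range(num_pages)]

    def teardown(self):
        # Stand-in for clearing the PTEs during takeover.
        self.ptes = [None] * len(self.ptes)

    def lookup(self, index):
        entry = self.ptes[index]
        if entry is None:
            raise LookupError("invalid GTT entry at index %d" % index)
        return entry


class ToyPlane:
    """Toy display plane that fetches its scanout page through the GTT."""

    def __init__(self, name, gtt, base_index):
        self.name = name
        self.gtt = gtt
        self.base_index = base_index
        self.enabled = True  # assumed to be left active before takeover

    def fetch(self):
        if not self.enabled:
            return None  # a disabled plane does not touch the GTT
        return self.gtt.lookup(self.base_index)


def takeover(disable_outputs_first):
    """Tear down the PTEs and return the faults seen on the next fetch."""
    gtt = ToyGtt(4)
    planes = [ToyPlane("A", gtt, 0), ToyPlane("B", gtt, 1)]
    if disable_outputs_first:
        for plane in planes:
            plane.enabled = False
    gtt.teardown()
    faults = []
    for plane in planes:
        try:
            plane.fetch()
        except LookupError as err:
            faults.append("plane %s: %s" % (plane.name, err))
    return faults


print(takeover(disable_outputs_first=False))
print(takeover(disable_outputs_first=True))
```

The first call reports one fault per active plane; the second reports none, which is the behaviour the patch title ("Disable all outputs early, before KMS takeover") is aiming for.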
Comment 12 Chris Wilson 2011-04-05 00:30:49 UTC
*** Bug 35976 has been marked as a duplicate of this bug. ***
Comment 13 Chris Wilson 2011-04-05 00:30:54 UTC
*** Bug 35975 has been marked as a duplicate of this bug. ***
Comment 14 Chris Wilson 2011-04-05 00:30:59 UTC
*** Bug 35974 has been marked as a duplicate of this bug. ***
Comment 15 Chris Wilson 2011-04-05 00:36:02 UTC
Those two patches have been reported by one user to have fixed the issue for him, but I need a few more testers since they seem to foul up a MacBook (but then there is more than one issue at play with MacBooks...)
Comment 16 Bryce Harrington 2011-04-05 12:17:02 UTC
Thanks Chris, the early output disablement especially sounds promising.

Our kernel team does daily builds of drm-intel-next and drm-next, but not drm-intel-staging, so it may take a while before we can produce something for the reporters to test (and I doubt the reporters would be patching their kernel manually, although who knows).
Comment 17 Chris Wilson 2011-04-05 12:34:30 UTC
I know, catch 22. They can't be accepted into -fixes unless we know they fix the bug. And whilst they are in -staging, only the foolhardy will try them.
Comment 18 Bryce Harrington 2011-04-05 13:01:58 UTC
One of our kernel engineers was kind enough to do a quick package of the patches for user testing.  This is a cherrypick of the natty kernel with these two patches, not a package of drm-intel-staging:

Please install the Natty test kernel 2.6.32-32.61~lp719446.1 from https://launchpad.net/~timg-tpi/+archive/ppa

echo "deb http://ppa.launchpad.net/timg-tpi/ppa/ubuntu natty main"|sudo tee /etc/apt/sources.list.d/timg-ppa.list
sudo apt-get update
sudo apt-get -u dist-upgrade

Hopefully one of the reporters of this bug will test the kernel and give feedback.
Comment 19 Bryce Harrington 2011-04-06 10:43:45 UTC
There has been one tester of this patched kernel so far, Daniel G. Taylor, who writes:

"""
Above commands caused my display to not work and I had to reboot into an older kernel selected in grub to get things working again. It does NOT fix the issue for me. I'm on a ~2007 Macbook.  The display was off and showed no graphics whatsoever. I can't tell if the boot process succeeded or failed and had to do a hard-reboot.

I installed 2.6.38-8-generic_2.6.38-8.42~lp686388 for i386 and the associated linux-image-generic, linux-generic, linux-libc-dev. That's the one that caused the issue.

I tried booting both with an external monitor attached and without an external monitor. Bug LP #749784 is apparently my dupe of this one so you can see my system information in that one. Let me know what else I can do to help.
"""
Comment 20 Chris Wilson 2011-04-06 11:42:39 UTC
MacBooks seem to enter the kernel with a PGTBL_ER already pending. You need the v2 patch to survive, but as stated it looks like MacBooks have a separate issue.
Comment 21 Chris Wilson 2012-05-09 02:24:51 UTC
commit c7bd4c25650704d4d065eb4ce2a122d2a80ce804
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 24 16:36:50 2012 +0100

    drm/i915: Remove too early plane enable on pre-PCH hardware
    
    Enabling the plane before we have assigned valid address means that it
    will access random PTE (often with conflicting memory types) and cause
    GPU lockups. However, enabling the plane too early appears to workaround
    a number of bugs in our modesetting code.
    
    Cc: Franz Melchior <melchior.franz@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=39947
    References: https://bugs.freedesktop.org/show_bug.cgi?id=41091
    References: https://bugs.freedesktop.org/show_bug.cgi?id=49041
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 22 Jari Tahvanainen 2016-11-03 12:20:39 UTC
Closing resolved+fixed. CommitDate: Thu May 3 2012.

