Bug 26746

Summary: [i855] drm-intel-next Freeze shortly after X startup (i915_error_state)
Product: DRI
Component: DRM/Intel
Status: CLOSED DUPLICATE
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86 (IA32)
OS: Linux (All)
Reporter: Geir Ove Myhr <gomyhr>
Assignee: Chris Wilson <chris>
QA Contact:
CC: bill.farrow
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                                               Flags
Batch buffer dump from drm-intel.git kernel                               none
Batch buffer dump with v8 patch on top of Linus' kernel as of 2010-02-21  none
Xorg.0.log                                                                none
lspci -vvnn                                                               none
crash2: dmesg log                                                         none
crash2: Xorg log                                                          none
crash2: batch buffer dump                                                 none
msleep(magic_delay)                                                       none
Batch buffer dump from Crash 3                                            none
Xorg log from crash 3                                                     none
Xorg log after X restarted but with black screen                          none
Batch buffer dump from Crash 4                                            none
Xorg.0.old log from before the freeze                                     none
Xorg log from freeze                                                      none

Description Geir Ove Myhr 2010-02-24 23:36:19 UTC
Created attachment 33551 [details]
Batch buffer dump from drm-intel.git kernel

Forwarding bug report from Ubuntu user Bill Farrow:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/511001

[Problem]
GPU hang on 855GM under Ubuntu Lucid. The hang also occurs with the newest intel-drm-next and kernel.org kernels and with the xorg-edgers PPA. The GPU error state was captured using Chris Wilson's "Record batch buffer following GPU hang" patch.

[Original report]
Testing with Lucid Lynx Alpha 2 Netbook Remix on USB stick. The laptop is an Asus M5200N with Intel i855GM graphics chip.

I have the same graphics freezing bug when running 9.10 Karmic. There is already an open bug for Karmic, https://bugs.launchpad.net/bugs/447892, but since this bug has not been fixed in Lucid yet, I am raising a separate bug report.

00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
---
Architecture: i386
DistroRelease: Ubuntu 10.04
DkmsStatus: Error: [Errno 2] No such file or directory
InstallationMedia: Error: [Errno 13] Permission denied: '/var/log/installer/media-info'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: ASUSTeK Computer Inc. M5N
Package: xserver-xorg-video-intel 2:2.10.0+git20100220.c2c670ef-0ubuntu0sarvatt
PackageArchitecture: i386
PccardctlIdent:
 Socket 0:
   no product info available
 Socket 1:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
 Socket 1:
   no card
ProcCmdLine: auto BOOT_IMAGE=Linux ro root=/dev/sda1
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-14.20-generic
RelatedPackageVersions:
 xserver-xorg 1:7.5+1ubuntu6
 libgl1-mesa-glx 7.8.0~git20100219.496724b8-0ubuntu0sarvatt
 libdrm2 2.4.18+git20100217.2d9990c7-0ubuntu0sarvatt
 xserver-xorg-video-intel 2:2.10.0+git20100220.c2c670ef-0ubuntu0sarvatt
Tags: lucid
Uname: Linux 2.6.32-14-generic i686
UnreportableReason: This is not a genuine Ubuntu package
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare
dmi.bios.date: 12/08/2004
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0212
dmi.board.name: M5N
dmi.board.vendor: ASUSTeK Computer Inc.
dmi.board.version: 1.0
dmi.chassis.asset.tag: ATN12345678901234567
dmi.chassis.type: 10
dmi.chassis.vendor: ASUSTeK Computer Inc.
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0212:bd12/08/2004:svnASUSTeKComputerInc.:pnM5N:pvr1.0:rvnASUSTeKComputerInc.:rnM5N:rvr1.0:cvnASUSTeKComputerInc.:ct10:cvr1.0:
dmi.product.name: M5N
dmi.product.version: 1.0
dmi.sys.vendor: ASUSTeK Computer Inc.
system:
 distro: Ubuntu
 architecture: i686
 kernel: 2.6.32-14-generic
Comment 1 Geir Ove Myhr 2010-02-24 23:40:48 UTC
Created attachment 33552 [details]
Batch buffer dump with v8 patch on top of Linus' kernel as of 2010-02-21
Comment 2 Geir Ove Myhr 2010-02-24 23:41:18 UTC
Created attachment 33553 [details]
Xorg.0.log
Comment 3 Geir Ove Myhr 2010-02-24 23:41:41 UTC
Created attachment 33554 [details]
lspci -vvnn
Comment 4 Geir Ove Myhr 2010-02-24 23:46:47 UTC
Assigning to Chris Wilson since I assume he may be interested in looking at the error state captured by his patch "drm/i915: Record batch buffer following GPU error". Hope this is okay.
Comment 5 Bill Farrow 2010-02-25 06:34:09 UTC
Created attachment 33561 [details]
crash2: dmesg log
Comment 6 Bill Farrow 2010-02-25 06:34:56 UTC
Created attachment 33562 [details]
crash2: Xorg log
Comment 7 Bill Farrow 2010-02-25 06:35:40 UTC
Created attachment 33563 [details]
crash2: batch buffer dump
Comment 8 Bill Farrow 2010-02-25 06:49:55 UTC
At the suggestion from Geir, I have collected dmesg [1], Xorg.0.log [2], and the batch buffer dump [3] from a single boot up and crash/freeze instance.  This is much better than the previous log files which came from differing runs and maybe even different kernel builds.

The kernel was built from the drm-intel.git repository [4], which includes Chris Wilson's gpu debug code.
  kernel = 2.6.33-rc8-v2.6.29-rc1-51333-g9df3079
  git describe = v2.6.29-rc1-51333-g9df3079


[1]: attachment 33561 [details]  dmesg
[2]: attachment 33562 [details]  /var/log/Xorg.0.log
[3]: attachment 33563 [details]  cat /sys/kernel/debug/dri/0/i915_error_state
[4]: http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git
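The collection step described above (all three artifacts taken from the same boot and crash instance) could be scripted roughly as follows. This is only a sketch: the function name `collect_crash_artifacts` and the output file names are hypothetical, the default paths are the ones listed in [2] and [3], and reading the debugfs file normally requires root with debugfs mounted.

```python
import subprocess
from pathlib import Path


def collect_crash_artifacts(
    dest_dir,
    xorg_log="/var/log/Xorg.0.log",
    error_state="/sys/kernel/debug/dri/0/i915_error_state",
    dmesg_cmd=("dmesg",),
):
    """Save dmesg, the Xorg log and the i915 error state into one directory.

    Hypothetical helper: names and layout are illustrative only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = {}

    # dmesg is a command rather than a file, so capture its output.
    result = subprocess.run(list(dmesg_cmd), capture_output=True, check=True)
    saved["dmesg"] = dest / "dmesg.txt"
    saved["dmesg"].write_bytes(result.stdout)

    # Read and rewrite the files explicitly, since debugfs entries
    # report a size of zero and are safest to read in one go.
    for name, src in (("Xorg.0.log", xorg_log),
                      ("i915_error_state.txt", error_state)):
        target = dest / name
        target.write_bytes(Path(src).read_bytes())
        saved[name] = target
    return saved
```

Running it once right after a freeze (for example over ssh, before rebooting) keeps the three files tied to the same crash.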
Comment 9 Chris Wilson 2010-02-27 04:01:49 UTC
Thanks, this is another cache flushing bug. The telltale here is:

  IPEHR: 0x40c00000
...
0x02618194:      0x7c09c0cc: 3DSTATE_MAP_COORD_SET_I830
0x02618198:      0x7d020000: 3DSTATE_MAP_COORD_SETBIND_I830
0x0261819c: HEAD 0x00000098:    dword 1
0x026181a0:      0x7c291099: 3DSTATE_MAP_TEX_STREAM_I830

i.e. the last instruction header does not match the previous dword of the command stream -- the GPU is seeing a different state of memory wrt the CPU.
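The comparison described above can be sketched as a small script over a decoded dump. The assumptions here are not from the report: the line format is taken only from the excerpt quoted in this comment (real i915_error_state output may differ between kernel versions), the function names are hypothetical, and "match" is interpreted as the IPEHR value appearing among the few dwords immediately before HEAD.

```python
import re

# Line shapes assumed from the excerpt in this comment only.
DWORD_RE = re.compile(r"^(0x[0-9a-fA-F]{8}):\s+(HEAD\s+)?(0x[0-9a-fA-F]{8}):")
IPEHR_RE = re.compile(r"^\s*IPEHR:\s*(0x[0-9a-fA-F]+)")

SAMPLE = """\
  IPEHR: 0x40c00000
...
0x02618194:      0x7c09c0cc: 3DSTATE_MAP_COORD_SET_I830
0x02618198:      0x7d020000: 3DSTATE_MAP_COORD_SETBIND_I830
0x0261819c: HEAD 0x00000098:    dword 1
0x026181a0:      0x7c291099: 3DSTATE_MAP_TEX_STREAM_I830
"""


def parse_error_state(text):
    """Return (ipehr, [(address, value), ...], index of the HEAD dword)."""
    ipehr = None
    dwords = []
    head_index = None
    for line in text.splitlines():
        m = IPEHR_RE.match(line)
        if m:
            ipehr = int(m.group(1), 16)
            continue
        m = DWORD_RE.match(line.strip())
        if m:
            if m.group(2):  # this line carries the HEAD marker
                head_index = len(dwords)
            dwords.append((int(m.group(1), 16), int(m.group(3), 16)))
    return ipehr, dwords, head_index


def ipehr_seen_before_head(text, window=4):
    """True if IPEHR equals one of the `window` dwords just before HEAD.

    Returns None when the dump lacks an IPEHR line or a HEAD marker.
    """
    ipehr, dwords, head_index = parse_error_state(text)
    if ipehr is None or head_index is None:
        return None
    start = max(0, head_index - window)
    recent = [value for _, value in dwords[start:head_index]]
    return ipehr in recent
```

For the excerpt above, 0x40c00000 is not among the dwords preceding HEAD, so the check reports a mismatch, which is the symptom pointed out in this comment.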
Comment 10 Chris Wilson 2010-02-27 04:03:00 UTC
Created attachment 33616 [details] [review]
msleep(magic_delay)

This patch has proven vital to work-around more obvious cache-flushing bugs. I'd appreciate much wider testing...
Comment 11 René Gabriëls 2010-02-27 17:17:21 UTC
(In reply to comment #10)
> Created an attachment (id=33616) [details]
> msleep(magic_delay)
> 
> This patch has proven vital to work-around more obvious cache-flushing bugs.
> I'd appreciate much wider testing...
> 

I've tested this patch for over an hour and my GPU is still up and running.  I'm running latest intel-drm-next kernel from git, libdrm-2.4.18, Xorg 1.7.5, latest xf86-video-intel from git. Furthermore, this patch also fixes the render errors that are reported in bug #26346.

That said, rendering (both 2d and 3d) is now quite slow, as expected.
Comment 12 Bill Farrow 2010-02-28 10:30:40 UTC
Chris, the msleep patch [1] works: I can log in with gdm and get to the desktop now.  Moving and redrawing windows is slow, as expected.

I had one weird freeze when closing firefox where the mouse pointer still moved, and clicking on panel icons changed the mouse pointer to the spinning circle as if it was launching the application, but then the mouse pointer returned to an arrow and no application was displayed.  Unfortunately I did not grab the logs, and I have been unable to reproduce it since.

So how do we clean this up and fix this cache flush problem properly?  I'm happy to code if you give me some pointers.

[1]: attachment 33616 [details] [review]  msleep(magic_delay)
Comment 13 Chris Wilson 2010-03-02 08:41:09 UTC
As it is clearly the CPU/GPU coherency issue, I'm duping this so as to consolidate the reports...

As to how to fix it, I've yet to find a suitable solution. The key is to ensure that the ICH has finished its writes prior to the GPU starting to DMA from memory. Sounds like it should be a fairly trivial, well-documented problem... But I've yet to find this precise scenario mentioned.

*** This bug has been marked as a duplicate of bug 26345 ***
Comment 14 Bill Farrow 2010-03-03 19:23:31 UTC
Created attachment 33742 [details]
Batch buffer dump from Crash 3

Crash with msleep() patch applied
Comment 15 Bill Farrow 2010-03-03 19:24:44 UTC
Created attachment 33743 [details]
Xorg log from crash 3

Crash with msleep() patch applied
Comment 16 Bill Farrow 2010-03-03 19:25:36 UTC
Created attachment 33744 [details]
Xorg log after X restarted but with black screen

Crash with msleep() patch applied
Comment 17 Bill Farrow 2010-03-03 19:26:23 UTC
Created attachment 33745 [details]
Batch buffer dump from Crash 4

Crash with msleep() patch applied
Comment 18 Bill Farrow 2010-03-03 19:27:23 UTC
Created attachment 33746 [details]
Xorg.0.old log from before the freeze

Crash with msleep() patch applied
Comment 19 Bill Farrow 2010-03-03 19:28:02 UTC
Created attachment 33747 [details]
Xorg log from freeze

Crash with msleep() patch applied
Comment 20 Bill Farrow 2010-03-03 19:31:42 UTC
Tonight I updated my Ubuntu packages, including xserver-xorg-*, while keeping the kernel with the msleep() patch. I had an Xorg crash and restart, and on the next boot an Xorg freeze.  I have captured the batch buffer and Xorg log files if that helps.
