23967 – [945GM] GPU hang randomly, garbage in batchbuffer.


Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	7.4 (2008.09)
Hardware:	x86 (IA32) Linux (All)

Importance:	medium critical
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:	NEEDINFO

Depends on:
Blocks:

Reported:	2009-09-15 11:55 UTC by Fryderyk Dziarmagowski
Modified:	2010-08-16 11:26 UTC
CC List:	3 users

See Also:
i915 platform:
i915 features:

Attachments
gpu dump (bzip2) (324.28 KB, application/x-bzip) 2009-09-15 11:57 UTC, Fryderyk Dziarmagowski	no flags	Details
xorg log (24.03 KB, text/plain) 2009-09-15 11:57 UTC, Fryderyk Dziarmagowski	no flags	Details
kernel log (71.58 KB, application/octet-stream) 2009-09-15 11:58 UTC, Fryderyk Dziarmagowski	no flags	Details
xorg.conf (997 bytes, application/octet-stream) 2009-09-15 11:58 UTC, Fryderyk Dziarmagowski	no flags	Details
gpu dump 2 (174.95 KB, application/x-bzip) 2009-09-16 13:08 UTC, Fryderyk Dziarmagowski	no flags	Details
gpu dump before suspend (119.06 KB, application/x-bzip) 2009-09-16 13:41 UTC, Fryderyk Dziarmagowski	no flags	Details
gpu dump after suspend (46.57 KB, application/x-bzip) 2009-09-16 13:42 UTC, Fryderyk Dziarmagowski	no flags	Details
stellarium showing big triangles after s2ram (compressed with xz) (969.15 KB, application/octet-stream) 2009-09-16 13:45 UTC, Fryderyk Dziarmagowski	no flags	Details
bzipped gpu dump (78.60 KB, application/octet-stream) 2009-10-18 10:31 UTC, Fryderyk Dziarmagowski	no flags	Details
on more GPU dump (xz compressed) (50.63 KB, application/x-xz) 2009-12-23 07:10 UTC, Fryderyk Dziarmagowski	no flags	Details
Record batch buffer at time of error (15.37 KB, patch) 2010-02-23 03:16 UTC, Chris Wilson	no flags	Details \| Splinter Review
recorded error state (84.23 KB, application/x-tar) 2010-02-25 10:16 UTC, Fryderyk Dziarmagowski	no flags	Details
kernel log (10.85 KB, application/x-tar) 2010-02-25 10:18 UTC, Fryderyk Dziarmagowski	no flags	Details
Rebind fbo if unaligned. (1.84 KB, patch) 2010-03-04 10:29 UTC, Chris Wilson	no flags	Details \| Splinter Review
i915_error_state running skyrocket (760.61 KB, text/plain) 2010-04-03 01:52 UTC, Fryderyk Dziarmagowski	no flags	Details
one more error_state (926.94 KB, text/plain) 2010-04-13 09:02 UTC, Fryderyk Dziarmagowski	no flags	Details
fresh error state (760.47 KB, application/octet-stream) 2010-07-21 12:45 UTC, Fryderyk Dziarmagowski	no flags	Details

Description Fryderyk Dziarmagowski 2009-09-15 11:55:46 UTC

Bug description:
The GPU hangs randomly, roughly once a day.

System environment:
-- chipset:
Integrated Graphics Chipset: Intel(R) G45/G43

-- system architecture: 32-bit, i686
-- xf86-video-intel: 2.8.99.901
-- xserver: 2.8.99.901
-- mesa: 7_6_branch
-- libdrm: 2.4.13
-- kernel: 2.6.31
-- Linux distribution: custom
-- Machine or mobo model:
  system.board.product = '0MG532'  (string)
  system.board.serial = '.3WFXS2J.CN701666BF024M.'  (string)
  system.board.vendor = 'Dell Inc.'  (string)
  system.board.version = ''  (string)
  system.chassis.manufacturer = 'Dell Inc.'  (string)
  system.chassis.type = 'Portable'  (string)
  system.firmware.release_date = '04/02/2007'  (string)
  system.firmware.vendor = 'Dell Inc.'  (string)
  system.firmware.version = 'A10'  (string)
  system.formfactor = 'laptop'  (string)
  system.hardware.primary_video.product = 10146  (0x27a2)  (int)
  system.hardware.primary_video.vendor = 32902  (0x8086)  (int)
  system.hardware.product = 'MXC061'  (string)
  system.hardware.serial = '3WFXS2J'  (string)
  system.hardware.uuid = '44454C4C-5700-1046-8058-B3C04F53324A'  (string)
  system.hardware.vendor = 'Dell Inc.'  (string)
  system.hardware.version = ''  (string)
  system.kernel.machine = 'i686'  (string)
  system.kernel.name = 'Linux'  (string)
  system.kernel.version = '2.6.31-desktop-1'  (string)
  system.kernel.version.major = 2  (0x2)  (int)
  system.kernel.version.micro = 31  (0x1f)  (int)
  system.kernel.version.minor = 6  (0x6)  (int)

-- Display connector:
LVDS

Additional info:
% lspci
00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03)
00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)
00:03.0 Communication controller: Intel Corporation 4 Series Chipset HECI Controller (rev 03)
00:19.0 Ethernet controller: Intel Corporation 82567LF-2 Gigabit Network Connection
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
00:1f.5 IDE interface: Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller
01:01.0 FireWire (IEEE 1394): Agere Systems FW322/323 (rev 70)

After the hang, the kernel is not really happy about it:
Sep 15 20:19:17 aragorn kernel: INFO: task i915/0:978 blocked for more than 120 seconds.                                                 
Sep 15 20:19:17 aragorn kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                
Sep 15 20:19:17 aragorn kernel: i915/0        D c1bee400     0   978      2 0x00000000                                                   
Sep 15 20:19:17 aragorn kernel: f68279e0 00000046 c1002065 c1bee400 000001e0 c1240400 c1bf085c 0001c3d9                                  
Sep 15 20:19:17 aragorn kernel: 00000000 c138b800 f73f3ac0 c123772a 00000000 c1388724 c138b800 f6827b90                                  
Sep 15 20:19:17 aragorn kernel: 00000000 00000000 c1bf0800 83e228c4 000007e7 c1016295 c1c00800 f6a85c14                                  
Sep 15 20:19:17 aragorn kernel: Call Trace:                                                                                              
Sep 15 20:19:17 aragorn kernel: [<c1002065>] ? __switch_to+0xa8/0x178                                                                    
Sep 15 20:19:17 aragorn kernel: [<c123772a>] ? _spin_unlock_irq+0x5/0x23                                                                 
Sep 15 20:19:17 aragorn kernel: [<c1016295>] ? smp_apic_timer_interrupt+0x54/0x7f                                                        
Sep 15 20:19:17 aragorn kernel: [<c12364b1>] ? __mutex_lock_slowpath+0xc9/0x149                                                          
Sep 15 20:19:17 aragorn kernel: [<c1236370>] ? mutex_lock+0x10/0x20                                                                      
Sep 15 20:19:17 aragorn kernel: [<c103bec9>] ? queue_delayed_work+0x18/0x24                                                              
Sep 15 20:19:17 aragorn kernel: [<f8249aa1>] ? i915_gem_retire_work_handler+0x1c/0x239 [i915]                                            
Sep 15 20:19:17 aragorn kernel: [<c103b4c5>] ? worker_thread+0x105/0x1cb                                                                 
Sep 15 20:19:17 aragorn kernel: [<c1023dd5>] ? __wake_up_common+0x41/0x63                                                                
Sep 15 20:19:17 aragorn kernel: [<f8249a85>] ? i915_gem_retire_work_handler+0x0/0x239 [i915]                                             
Sep 15 20:19:17 aragorn kernel: [<c103e90d>] ? autoremove_wake_function+0x0/0x37                                                         
Sep 15 20:19:17 aragorn kernel: [<c103b3c0>] ? worker_thread+0x0/0x1cb                                                                   
Sep 15 20:19:17 aragorn kernel: [<c103e69e>] ? kthread+0x74/0x78                                                                         
Sep 15 20:19:17 aragorn kernel: [<c103e62a>] ? kthread+0x0/0x78                                                                          
Sep 15 20:19:17 aragorn kernel: [<c1004057>] ? kernel_thread_helper+0x7/0x10                                                             
Sep 15 20:21:17 aragorn kernel: INFO: task i915/0:978 blocked for more than 120 seconds.
Reproducing steps:
I can't trigger it manually; it happens randomly after waking up from s2ram.

Comment 1 Fryderyk Dziarmagowski 2009-09-15 11:57:37 UTC

Created attachment 29572 [details]
gpu dump (bzip2)

Comment 2 Fryderyk Dziarmagowski 2009-09-15 11:57:55 UTC

Created attachment 29573 [details]
xorg log

Comment 3 Fryderyk Dziarmagowski 2009-09-15 11:58:14 UTC

Created attachment 29574 [details]
kernel log

Comment 4 Fryderyk Dziarmagowski 2009-09-15 11:58:33 UTC

Created attachment 29575 [details]
xorg.conf

Comment 5 Gordon Jin 2009-09-15 19:52:41 UTC

You're already using the latest mesa with this patch, right?
http://cgit.freedesktop.org/mesa/mesa/commit/?id=acfea5c705f383692e661d37c5cd7da2f3db559b

Can you check whether this kernel patch provides more info:
http://lists.freedesktop.org/archives/intel-gfx/2009-September/004243.html

Comment 6 Fryderyk Dziarmagowski 2009-09-16 11:06:53 UTC

The above Mesa patch is for i965, and I have no such hardware ;-)
(it's i915, but... there are no apps using 3D when it hangs)

I've applied the mentioned kernel patch. I will post results once something occurs.

Comment 7 Fryderyk Dziarmagowski 2009-09-16 11:08:38 UTC

By the way, it smells like #23699

Comment 8 Fryderyk Dziarmagowski 2009-09-16 13:08:32 UTC

Created attachment 29603 [details]
gpu dump 2

Comment 9 Fryderyk Dziarmagowski 2009-09-16 13:16:03 UTC

I've just triggered a hang by running stellarium after resuming from s2ram (dump attached).

Comment 10 Fryderyk Dziarmagowski 2009-09-16 13:41:23 UTC

Created attachment 29604 [details]
gpu dump before suspend

Comment 11 Fryderyk Dziarmagowski 2009-09-16 13:42:00 UTC

Created attachment 29605 [details]
gpu dump after suspend

Comment 12 Fryderyk Dziarmagowski 2009-09-16 13:45:25 UTC

Created attachment 29606 [details]
stellarium showing big triangles after s2ram (compressed with xz)

This screencast shows that 3D rendering is also broken after resuming from s2ram.

Comment 13 Fryderyk Dziarmagowski 2009-09-16 13:49:55 UTC

Unfortunately, the kernel with the patch from comment #5 stays as quiet as before.

Comment 14 Eric Anholt 2009-09-21 15:44:13 UTC

00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev
03)

that's a 965, so please test with the mesa patch.

Comment 15 Fryderyk Dziarmagowski 2009-09-22 00:44:17 UTC

I'm really sorry, but I attached the wrong lspci output :-[

00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 2 (rev 01)
00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7 Family) SATA IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
02:00.0 Ethernet controller: Broadcom Corporation BCM4401-B0 100Base-TX (rev 02)
02:01.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller
02:01.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 19)
02:01.2 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev 0a)
02:01.3 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 05)
02:01.4 System peripheral: Ricoh Co Ltd xD-Picture Card Controller (rev ff)
0c:00.0 Network controller: Intel Corporation PRO/Wireless 3945ABG [Golan] Network Connection (rev 02)

Comment 16 Fryderyk Dziarmagowski 2009-09-29 11:49:16 UTC

Just tested the Intel 2009Q3 release; the problem still exists.
This time I ran some tests using the "skyrocket" 3D application (just a screensaver, http://rss-glx.sourceforge.net/). It hangs the GPU within a few seconds (e.g. right after starting it or on resizing the application's window); with stellarium it was a bit harder.

Comment 17 Fryderyk Dziarmagowski 2009-10-16 11:18:16 UTC

Upgrading the kernel to 2.6.32rc5 with Eric's "for-linus" branch solves the problems described above :-)

Comment 18 Fryderyk Dziarmagowski 2009-10-18 10:30:31 UTC

Unfortunately, I was a bit too fast in closing this bug.
This time it happened while working in gnome-terminal. First, a vertical bar of artifacts appeared on the left side of my laptop screen (about 5 mm x 100 mm; I had the impression that a text line in midnight commander was not rendered correctly). A few seconds later the whole screen stopped responding to mouse movements and clicks.
(A good thing about 2.6.32 is that I was able to switch to the console with ctrl+alt+Fx. Running to my second machine and connecting remotely is no longer needed.)

This time the kernel log shows something:
Oct 18 18:55:18 aragorn kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung                                                                      
Oct 18 18:55:18 aragorn kernel: render error detected, EIR: 0x00000000                                                                                                        
Oct 18 18:55:18 aragorn kernel: i915: Waking up sleeping processes                                                                                                            
Oct 18 18:55:18 aragorn kernel: reboot required
Oct 18 18:55:18 aragorn kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 1386051 at 1386039)                                                    
Oct 18 18:55:18 aragorn kernel: [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

The last message repeats a few times per second until reboot.

I'm attaching a new GPU dump, taken just after the hang.

Once more, some details:
linux 2.6.32rc5 with Eric's "for Linus" branch
Mesa 7_6 git branch
intel 2d driver 2.9 git branch
xorg xserver 1.7.0.901
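
A note on reading the i915_wait_request line in the log above: the return value -5 looks like a negated errno, and errno 5 on Linux is EIO. The two numbers in parentheses appear to be the awaited and the current request sequence numbers, although that reading is an assumption taken from the message wording. The sketch below uses a hypothetical helper, `parse_wait_request`, to pull those values out of the line; it is only an illustration, not part of the driver.

```python
import errno
import re

# Log line copied from the kernel log in this comment.
LINE = ("[drm:i915_wait_request] *ERROR* i915_wait_request returns -5 "
        "(awaiting 1386051 at 1386039)")

# Hypothetical pattern: return code, awaited seqno, current seqno.
PATTERN = re.compile(r"returns (-?\d+) \(awaiting (\d+) at (\d+)\)")


def parse_wait_request(line):
    """Return the decoded fields of an i915_wait_request error line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    ret, awaited, current = (int(group) for group in match.groups())
    return {
        # Kernel functions return negated errno values, so flip the sign.
        "errno_name": errno.errorcode.get(-ret, "unknown"),
        "awaited": awaited,
        "current": current,
        # Assumption: the gap is the number of requests not yet completed.
        "outstanding": awaited - current,
    }


info = parse_wait_request(LINE)
print(info)
```

Under that reading, the GPU stopped 12 requests short of the one being waited for.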

Comment 19 Fryderyk Dziarmagowski 2009-10-18 10:31:54 UTC

Created attachment 30536 [details]
bzipped gpu dump

Comment 20 Fryderyk Dziarmagowski 2009-12-23 07:09:43 UTC

Small upgrade:

Mesa 7.6.1
xorg 1.7.3.901
xorg intel driver 2.9.99.902
kernel 2.6.32.2

Unfortunately, the GPU lockup is still here:
Dec 23 15:56:13 aragorn kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung                                                                              
Dec 23 15:56:13 aragorn kernel: render error detected, EIR: 0x00000000                                                                                                                
Dec 23 15:56:13 aragorn kernel: i915: Waking up sleeping processes                                                                                                                    
Dec 23 15:56:13 aragorn kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 330264 at 330250)                                                              
Dec 23 15:56:13 aragorn kernel: [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

Attaching a new GPU dump.

Comment 21 Fryderyk Dziarmagowski 2009-12-23 07:10:44 UTC

Created attachment 32263 [details]
on more GPU dump (xz compressed)

Comment 22 Fryderyk Dziarmagowski 2009-12-30 10:47:02 UTC

Upgraded Mesa to 7.7, the problem is still present.

I'm now able to trigger the same lockup even on a freshly started machine: it happens while using a web browser (midori, WebKit based). It is still random, but browsing phoronix benchmarks almost always locks up my GPU. I'm getting the same drm messages as described above.

Comment 23 Fryderyk Dziarmagowski 2009-12-30 11:06:02 UTC

One more thing from Xorg.0.log (triggered by phoronix.com ;-)

(WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

Comment 24 Fryderyk Dziarmagowski 2010-01-05 09:53:43 UTC

Adding Option "DebugWait" "true" to xorg.conf solves the problem. 2D performance suffers a bit, but at least it is as stable as in the EXA days.

Comment 25 Fryderyk Dziarmagowski 2010-01-09 14:14:32 UTC

ping

Comment 26 Chris Wilson 2010-02-23 03:16:24 UTC

Created attachment 33504 [details] [review]
Record batch buffer at time of error

The curse of the empty gpu dump. Can you try the attached patch and upload the i915_error_state following a hang? The fact that DebugWait fixes the issue for you suggests a form of corruption I've been hunting for in the wild.

Comment 27 Fryderyk Dziarmagowski 2010-02-24 12:12:00 UTC

Even more cursed is the fact that, after the patch was applied, I was able to lock up the GPU only once. Since debugfs has been mounted permanently, no matter how hard I try I'm no longer able to reproduce it (and that was really easy before).

The first and only time so far produced something like this:
(debugfs mounted after lock up)

Time: 1266959695 s 294930 us
PCI ID: 0x27a2
EIR: 0x00000010
  PGTBL_ER: 0x00000003
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0x7fffffc0
  ACTHD: 0x00000000
seqno: 0x00035937
--- ringbuffer = 0x007bf000
00000000 :  00000000
00000004 :  00000000
00000008 :  00000000
... (cut: only 00000000 here!)
0001fffc :  00000000

Seeing only zeros in the second column, I don't really believe it is anything valuable.

Comment 28 Chris Wilson 2010-02-24 12:20:42 UTC

(In reply to comment #27)
> Seeing only zeros in second column I don't really believe it is something
> valuable.

It is. But I guess it is reporting an earlier bug, that should be fixed with

commit fd2e8ea597222b8f38ae8948776a61ea7958232e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 9 14:14:36 2010 +0000

    drm/i915: Increase fb alignment to 64k
    
    An untiled framebuffer must be aligned to 64k. This is normally handled
    by intel_pin_and_fence_fb_obj(), but the intelfb_create() likes to be
    different and do the pinning itself. However, it aligns the buffer
    object incorrectly for pre-i965 chipsets causing a PGTBL_ERR when it is
    installed onto the output.

(Bug #22936)

Can you either apply that patch or use the .33-rc8 + record buffers?

Comment 29 Fryderyk Dziarmagowski 2010-02-24 12:24:58 UTC

sure, I need some more time to do it 8-)

Comment 30 Fryderyk Dziarmagowski 2010-02-25 10:16:16 UTC

OK, upgraded to 2.6.33 with the patch from comment #26.

It looks like the error state is triggered directly after resuming from s2ram.

Just before s2ram:

cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected

and just after:
Time: 1267121097 s 317013 us
PCI ID: 0x27a2
EIR: 0x00000010
  PGTBL_ER: 0x00000003
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0x7fffffc0
  ACTHD: 0x00000000
seqno: 0x00000cc6
--- ringbuffer = 0x00bc8000
00000000 :  00000000
00000004 :  00000000
00000008 :  00000000
...

dmesg shows something I hadn't seen until now:
...
i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
i915 0000:00:02.0: setting latency timer to 64
render error detected, EIR: 0x00000010
page table error
  PGTBL_ER: 0x00000003
[drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
render error detected, EIR: 0x00000010
page table error
  PGTBL_ER: 0x00000003
...

Should I provide something more?
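
For anyone comparing several of these dumps, the register header can be parsed mechanically. The sketch below is a hypothetical helper, not a tool from the driver: it turns the `NAME: 0x...` lines into a dictionary and tests EIR against 0x10. Treating 0x10 as the page table error bit is an assumption based on the dmesg output above, which pairs EIR 0x00000010 with "page table error".

```python
import re

# Register header as pasted in this comment.
ERROR_STATE = """\
PCI ID: 0x27a2
EIR: 0x00000010
  PGTBL_ER: 0x00000003
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0x7fffffc0
  ACTHD: 0x00000000
seqno: 0x00000cc6
"""

# Assumption: this EIR bit marks a page table error (inferred from dmesg).
EIR_PAGE_TABLE_ERROR = 0x10

REGISTER_LINE = re.compile(r"\s*([A-Za-z_ ]+?):\s*0x([0-9a-fA-F]+)\s*$")


def parse_registers(text):
    """Map each 'NAME: 0xVALUE' line to an integer value."""
    registers = {}
    for line in text.splitlines():
        match = REGISTER_LINE.match(line)
        if match:
            registers[match.group(1)] = int(match.group(2), 16)
    return registers


regs = parse_registers(ERROR_STATE)
page_table_error = bool(regs["EIR"] & EIR_PAGE_TABLE_ERROR)
print(f"EIR=0x{regs['EIR']:08x} page_table_error={page_table_error} "
      f"PGTBL_ER=0x{regs['PGTBL_ER']:08x}")
```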

Comment 31 Fryderyk Dziarmagowski 2010-02-25 10:16:39 UTC

Created attachment 33566 [details]
recorded error state

Comment 32 Fryderyk Dziarmagowski 2010-02-25 10:18:42 UTC

Created attachment 33567 [details]
kernel log

Comment 33 Chris Wilson 2010-03-04 10:29:00 UTC

Created attachment 33764 [details] [review]
Rebind fbo if unaligned.

Hmm, the bo is unaligned following a resume... Please try this patch to rebind the framebuffer with the appropriate alignment, if required.

Comment 34 Fryderyk Dziarmagowski 2010-03-10 07:10:02 UTC

2.6.33 build fails with the patch from #33:

drivers/gpu/drm/i915/intel_display.c: In function 'intel_pin_and_fence_fb_obj':
drivers/gpu/drm/i915/intel_display.c:1257: error: implicit declaration of function 'i915_gem_object_fence_offset_ok'
make[4]: *** [drivers/gpu/drm/i915/intel_display.o] Error 1

Comment 35 Fryderyk Dziarmagowski 2010-03-12 11:38:41 UTC

How to fix it?
i915_gem_object_fence_offset_ok seems to be defined in drivers/gpu/drm/i915/i915_gem_tiling.c

Comment 36 Fryderyk Dziarmagowski 2010-04-03 01:51:23 UTC

I've upgraded my setup a bit:

xserver 1.8
mesa 7.8.0
linux 2.6.33.2 (unfortunately still can't apply rebind_fbo patch)

and changed the suspend method from a plain echo mem > /sys/power/state to the highly sophisticated pm-utils (it does some magic around suspending; i915_error_state stays clean after resume).
With this shiny new setup it is still very easy to trigger a GPU hang, but for the first time I caught an i915_error_state with some content.

Comment 37 Fryderyk Dziarmagowski 2010-04-03 01:52:40 UTC

Created attachment 34640 [details]
i915_error_state running skyrocket

Comment 38 Fryderyk Dziarmagowski 2010-04-13 09:02:11 UTC

Created attachment 34968 [details]
one more error_state

Caught an error_state one more time. It looks quite different from the last one.

Comment 39 Fryderyk Dziarmagowski 2010-04-23 03:42:42 UTC

ping

Comment 40 Chris Wilson 2010-06-22 05:23:09 UTC

The rebind bo is now upstream (at last!). Fryderyk, if you can reproduce the skyrocket crash on current trees, please upload a new i915_error_state. I can think of a similar bug that was a result of memory corruption through a batch buffer overrun in mesa that has since been fixed -- so I am optimistic that this bug is now unreproducible!


commit ac0c6b5ad3b3b513e1057806d4b7627fcc0ecc27
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 27 13:18:18 2010 +0100

    drm/i915: Rebind bo if currently bound with incorrect alignment.
    
    Whilst pinning the buffer, check that that its current alignment
    matches the requested alignment. If it does not, rebind.
    
    This should clear up any final render errors whilst resuming,
    for reference:
    
      Bug 27070 - [i915] Page table errors with empty ringbuffer
      https://bugs.freedesktop.org/show_bug.cgi?id=27070
    
      Bug 15502 -  render error detected, EIR: 0x00000010
      https://bugzilla.kernel.org/show_bug.cgi?id=15502
    
      Bug 13844 -  i915 error: "render error detected"
      https://bugzilla.kernel.org/show_bug.cgi?id=13844
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Signed-off-by: Eric Anholt <eric@anholt.net>
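
The idea in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel code: `BufferObject`, `pin` and the `next_free` allocator argument are hypothetical names, the starting offset is made up, and the 64k figure comes from the fb alignment commit quoted in comment 28.

```python
from dataclasses import dataclass
from typing import Optional

FB_ALIGNMENT = 64 * 1024  # 64k requirement for an untiled framebuffer


@dataclass
class BufferObject:
    size: int
    offset: Optional[int] = None  # None means the object is not bound


def align_up(value, alignment):
    """Round value up to the next multiple of a power-of-two alignment."""
    return (value + alignment - 1) & ~(alignment - 1)


def pin(bo, alignment, next_free):
    """Bind bo at an aligned offset; return True if a (re)bind was needed."""
    if bo.offset is not None and bo.offset % alignment == 0:
        return False  # already bound with a suitable alignment
    # Unbound or misaligned: bind again at the next aligned free offset.
    bo.offset = align_up(next_free, alignment)
    return True


# Hypothetical offset: page aligned (4k) but not 64k aligned.
fb = BufferObject(size=4 * 1024 * 1024, offset=0x00BC9000)
rebound = pin(fb, FB_ALIGNMENT, next_free=0x00BC9000)
print(rebound, hex(fb.offset))
```

A second `pin` call on the same object returns False, because the offset already satisfies the requested alignment.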

Comment 41 Fryderyk Dziarmagowski 2010-06-22 11:37:42 UTC

Glad to hear it! I hope I will find some time to try something more recent than 2.6.23.x...

Comment 42 Chris Wilson 2010-07-17 07:19:09 UTC

There is a working theory that 945GM hangs when trying to perform lots of XY_SRC_COPY_BLT. A reproducible test case on a t60 (Core2/i945) is to run x11perf -copypixwin500, which hangs after one pass. The i915_error_state in these cases tends to be fairly random.

Can you check to see if your machine is also susceptible to x11perf -copypixwin500?

Comment 43 Fryderyk Dziarmagowski 2010-07-21 12:09:29 UTC

The only issue I have with this x11perf test is a massive slowdown when the mouse pointer stays idle... even after running it multiple times, I was not able to hang the GPU.

Funny thing:
mouse untouched
280 reps @   0.8198 msec (  1220.0/sec): Copy 500x500 from pixmap to window

and now with some mouse movement:
8000 reps @   0.9451 msec (  1060.0/sec): Copy 500x500 from pixmap to window
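
To put numbers on the two results above, here is a small sketch that parses the x11perf lines (the `parse` helper is hypothetical). The per-operation time is similar in both runs; what differs is the repetition count, which, as far as I understand, x11perf picks from a calibration pass before the timed run.

```python
import re

# Result lines copied from this comment.
LINES = {
    "idle": " 280 reps @   0.8198 msec (  1220.0/sec): "
            "Copy 500x500 from pixmap to window",
    "moving": "8000 reps @   0.9451 msec (  1060.0/sec): "
              "Copy 500x500 from pixmap to window",
}

PATTERN = re.compile(r"\s*(\d+) reps @\s+([\d.]+) msec \(\s*([\d.]+)/sec\)")


def parse(line):
    """Extract reps, per-op time and reported rate from one x11perf line."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"not an x11perf result line: {line!r}")
    reps = int(match.group(1))
    msec = float(match.group(2))
    return {
        "reps": reps,
        "msec_per_op": msec,
        "reported_rate": float(match.group(3)),
        # Time spent in the measured operations themselves.
        "total_sec": reps * msec / 1000.0,
    }


results = {name: parse(line) for name, line in LINES.items()}
for name, result in results.items():
    print(name, result)
print("reps ratio:", results["moving"]["reps"] / results["idle"]["reps"])
```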

Comment 44 Chris Wilson 2010-07-21 12:16:43 UTC

> --- Comment #43 from Fryderyk Dziarmagowski <freetz@gmx.net> 2010-07-21 12:09:29 PDT ---
> The only issue I have with this x11perf test is a massive slow down when mouse
> pointer stays idle... after multiple times running it, I was no able to hang
> the GPU

So not suffering from the missing bit, but sounds like missing interrupts.
Are you using a compositing WM? Does it make a difference to switch to a
non-compositing WM (or vice versa)? I suspect/hope that the patches Jesse
pushed [2.6.35-rc4] to fix page-flipping on i945 are the answer here.

Comment 45 Fryderyk Dziarmagowski 2010-07-21 12:25:21 UTC

I've upgraded my kernel to 2.6.34.1 (Xorg to 1.8.2, driver to 2.12.0, Mesa 7.8.2) and I got new surprising results regarding this bug:

frozen screen was replaced with a nice X crash!

(This is done with GPU killer - skyrocket)

[ 81834.638] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81834.651] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 81837.926]
Backtrace:
[ 81837.926] 0: /usr/bin/X (xorg_backtrace+0x3b) [0x809b1c7]
[ 81837.926] 1: /usr/bin/X (0x8047000+0x5410e) [0x809b10e]
[ 81837.926] 2: (vdso) (__kernel_rt_sigreturn+0x0) [0xffffe40c]
[ 81837.926] 3: /usr/lib/xorg/modules/extensions/libdri2.so (0xb77c3000+0x3511) [0xb77c6511]
[ 81837.926] 4: /usr/bin/X (0x8047000+0x266ee) [0x806d6ee]
[ 81837.926] 5: /usr/bin/X (0x8047000+0x1f7e5) [0x80667e5]
[ 81837.926] 6: /lib/libc.so.6 (__libc_start_main+0xe6) [0x47fccb62]
[ 81837.926] 7: /usr/bin/X (0x8047000+0x1f401) [0x8066401]
[ 81837.926] Segmentation fault at address (nil)
[ 81837.926]
Fatal server error:
[ 81837.926] Caught signal 11 (Segmentation fault). Server aborting

What I'm observing now is random screen (text) corruption in firefox and hard locks when running stellarium (this one kills my laptop; only a hard reset helps...)

Comment 46 Fryderyk Dziarmagowski 2010-07-21 12:30:42 UTC

(In reply to comment #44)
> > --- Comment #43 from Fryderyk Dziarmagowski <freetz@gmx.net> 2010-07-21 12:09:29 PDT ---
> > The only issue I have with this x11perf test is a massive slow down when mouse
> > pointer stays idle... after multiple times running it, I was no able to hang
> > the GPU
> 
> So not suffering from the missing bit, but sounds like missing interrupts.
> Are you using a compositing WM? Does it make a difference to switch to a
> non-compositing WM (or vice versa)? I suspect/hope that the patches Jesse
> pushed [2.6.35-rc4] to fix page-flipping on i945 are the answer here.

No, I don't use compositing at all (due to mplayer tearing).
Switching on xcompmgr helps: the slowdown goes away (still without a hang).

Comment 47 Chris Wilson 2010-07-21 12:33:36 UTC

> --- Comment #45 from Fryderyk Dziarmagowski <freetz@gmx.net> 2010-07-21 12:25:21 PDT ---
> I've upgraded my kernel to 2.6.34.1 (Xorg to 1.8.2, driver to 2.12.0, Mesa
> 7.8.2) and I got new surprising results regarding this bug:
> 
> frozen screen was replaced with a nice X crash!
> 
> (This is done with GPU killer - skyrocket)
> 
> [ 81834.638] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[snip]
> Fatal server error:
> [ 81837.926] Caught signal 11 (Segmentation fault). Server aborting

That crash in particular has been fixed, but we need to find the cause of
the GPU hang.

> What I'm observing now are random screen (text) corruptions in firefox and hard
> locks when running stellarium (this one kills my laptop, only hard reset
> helps...)

Can you upload some i915_error_state for these hangs? Thanks.
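
Capturing the state amounts to copying the debugfs file before rebooting. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the card is dri/0 as in comment 30; `save_error_state` is a hypothetical helper name, and reading the file normally needs root.

```python
from pathlib import Path

# Path as used earlier in this report; adjust the card index if needed.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")
# Text the file contains when nothing has been recorded.
EMPTY_MARKER = "no error state collected"


def save_error_state(dest, source=ERROR_STATE):
    """Copy the error state to dest; return False if none was recorded."""
    text = Path(source).read_text(errors="replace")
    if text.strip() == EMPTY_MARKER:
        return False
    Path(dest).write_text(text)
    return True
```

Calling `save_error_state("i915_error_state.txt")` right after a hang, before rebooting, should leave a file that can be attached here.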

Comment 48 Fryderyk Dziarmagowski 2010-07-21 12:37:05 UTC

(In reply to comment #46)
> (In reply to comment #44)
> > > --- Comment #43 from Fryderyk Dziarmagowski <freetz@gmx.net> 2010-07-21 12:09:29 PDT ---
> > > The only issue I have with this x11perf test is a massive slow down when mouse
> > > pointer stays idle... after multiple times running it, I was no able to hang
> > > the GPU
> > 
> > So not suffering from the missing bit, but sounds like missing interrupts.
> > Are you using a compositing WM? Does it make a difference to switch to a
> > non-compositing WM (or vice versa)? I suspect/hope that the patches Jesse
> > pushed [2.6.35-rc4] to fix page-flipping on i945 are the answer here.
> 
> No, I don't use compositing at all (due to mplayer tearing)
> switching on xcompmgr helps: slowdown goes away (still without hang)

ok, forget what I wrote. It looks even worse now:
240 reps @   0.8121 msec (  1230.0/sec): Copy 500x500 from pixmap to window

Comment 49 Fryderyk Dziarmagowski 2010-07-21 12:39:13 UTC

(In reply to comment #47)
> > --- Comment #45 from Fryderyk Dziarmagowski <freetz@gmx.net> 2010-07-21 12:25:21 PDT ---
> > I've upgraded my kernel to 2.6.34.1 (Xorg to 1.8.2, driver to 2.12.0, Mesa
> > 7.8.2) and I got new surprising results regarding this bug:
> > 
> > frozen screen was replaced with a nice X crash!
> > 
> > (This is done with GPU killer - skyrocket)
> > 
> > [ 81834.638] (EE) intel(0): Detected a hung GPU, disabling acceleration.
> [snip]
> > Fatal server error:
> > [ 81837.926] Caught signal 11 (Segmentation fault). Server aborting
> 
> That crash in particular has been fixed, but we need to find the cause of
> the GPU hang.

Could you point me to the fix?

> > What I'm observing now are random screen (text) corruptions in firefox and hard
> > locks when running stellarium (this one kills my laptop, only hard reset
> > helps...)
> 
> Can you upload some i915_error_state for these hangs? Thanks.

give me some minutes...

Comment 50 Fryderyk Dziarmagowski 2010-07-21 12:45:36 UTC

Created attachment 37277 [details]
fresh error state

Comment 51 Fryderyk Dziarmagowski 2010-07-23 09:37:40 UTC

Applying Dave's "enable low power render writes on GEN3 hardware" miracle patch does not seem to help here.

Comment 52 Chris Wilson 2010-08-08 07:01:02 UTC

Hmm, another instance of:

0x0dc07878:      0x7d000003: 3DSTATE_MAP_STATE
0x0dc0787c:      0x00000001:    mask
0x0dc07880:      0x00000000:    map 0 MS2
0x0dc07884:      0x00000000:    map 0 MS3 [width=1, height=1, tiling=none]
0x0dc07888:      0x00000000:    map 0 MS4 [pitch=4]
0x0dc0788c:      0x00000000: MI_NOOP
0x0dc07890:      0x00000000: MI_NOOP
0x0dc07894:      0x00000000: MI_NOOP
0x0dc07898:      0x00000000: MI_NOOP
0x0dc0789c:      0x00000000: MI_NOOP
0x0dc078a0:      0x00000000: MI_NOOP
0x0dc078a4:      0x00000000: MI_NOOP
0x0dc078a8:      0x00000000: MI_NOOP
0x0dc078ac:      0x00000000: MI_NOOP
0x0dc078b0:      0x00000000: MI_NOOP
0x0dc078b4:      0x00000000: MI_NOOP
0x0dc078b8:      0x00000000: MI_NOOP
0x0dc078bc:      0x00000000: MI_NOOP
0x0dc078c0:      0x00000000: MI_NOOP
0x0dc078c4:      0x00000000: MI_NOOP
0x0dc078c8:      0x00000000: MI_NOOP
0x0dc078cc:      0x00000000: MI_NOOP
0x0dc078d0:      0x00000000: MI_NOOP
0x0dc078d4:      0x00000000: MI_NOOP
0x0dc078f8:      0x00000000: MI_NOOP
0x0dc078fc:      0x00000000: MI_NOOP
0x0dc07900:      0x00000000: MI_NOOP
0x0dc07904:      0x00000000: MI_NOOP
0x0dc07908:      0x00000000: MI_NOOP
0x0dc0790c:      0x00000000: MI_NOOP
0x0dc07910:      0x00000000: MI_NOOP
0x0dc07914:      0x00000000: MI_NOOP
0x0dc07918:      0x00000000: MI_NOOP
0x0dc0791c:      0x00000000: MI_NOOP
0x0dc07920:      0x00000000: MI_NOOP
0x0dc07924:      0x00000000: MI_NOOP
0x0dc07928:      0x00000000: MI_NOOP
0x0dc0792c:      0x00000000: MI_NOOP
0x0dc07930:      0x00000000: MI_NOOP
0x0dc07934:      0x00000000: MI_NOOP
0x0dc07938:      0x00000000: MI_NOOP
0x0dc0793c:      0x00000000: MI_NOOP
0x0dc07940:      0x06060000: MI UNKNOWN
0x0dc07944:      0x7f800006: 3DPRIMITIVE sequential indirect TRILIST, 6 starting from 0
0x0dc07948:      0x00000000:               start

Comment 53 Fryderyk Dziarmagowski 2010-08-16 11:26:11 UTC

This bug is no longer present with latest kernel releases... (tested .33.7 and .35.1). Closing... :)
