Bug 45332

Summary: [snb] hangcheck reports render ring IRQ miss
Product: DRI
Reporter: Sami Farin <hvtaifwkbgefbaei>
Component: DRM/Intel
Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED
QA Contact:
Severity: major
Priority: medium
CC: ben, chris, daniel, jbarnes
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description Flags
i915_error_state.mesa_eaf360e5bffc56307-intel_ed1c1a7468.txt none
intel_reg_dumper.mesa_eaf360e5bffc56307-intel_ed1c1a7468.txt none
3.3.0-rc5-1f033c1a6e-error_state.txt none
3.3.0-rc5-1f033c1a6e-regs.txt none

Description Sami Farin 2012-01-28 03:55:23 UTC
Sometimes when using Firefox, the screen stops redrawing and only the mouse cursor moves. I can use sysrq-k to kill X and then log in and restart X.
Sometimes Firefox hangs X within a couple of seconds, sometimes not for many minutes. RawTherapee also triggered this once. Chrome does not seem to trigger this. ;)

I do not remember this happening with libxcb 1.7. This has happened with many intel driver versions since January 14. (I installed libxcb on Jan 12. Hmm, xorg on Jan 14, maybe I should try upgrading it ;) )

System environment: 
-- chipset: Intel(R) Sandybridge Desktop (GT2)
-- system architecture: 64-bit
-- xf86-video-intel: 2afd49a28429cdeb36583cfc31cc9b1742c1fb83
-- xserver: xorg-x11-server-Xorg-1.11.99.901-2.20120103.fc17.x86_64
-- mesa: 5ce741873954a40e48
-- libdrm: 2.4.30
-- kernel: 3.0.17
-- Linux distribution: Fedora
-- Machine or mobo model: Asus P8Z68-V PRO GEN3
-- CPU: Intel Core i5-2500K
-- Display connector: DVI

Backtrace:
0: X (xorg_backtrace+0x28) [0x4642d8]
1: X (0x400000+0x47b72) [0x447b72]
2: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f0164eb4000+0x534c) [0x7f0164eb934c]
3: X (0x400000+0x7f6f7) [0x47f6f7]
4: X (0x400000+0xa35f8) [0x4a35f8]
5: /lib64/libpthread.so.0 (0x7f016895e000+0xef90) [0x7f016896cf90]
6: /lib64/libc.so.6 (ioctl+0x7) [0x30004e8bf7]
7: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x7f0166cc9468]
8: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f0166788000+0x4de06) [0x7f01667d5e06]
9: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f0166788000+0x52446) [0x7f01667da446]
10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f0166788000+0x5246c) [0x7f01667da46c]
11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f0166788000+0x77088) [0x7f01667ff088]
12: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f0166788000+0x779cf) [0x7f01667ff9cf]
13: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f0166788000+0x8a32f) [0x7f016681232f]
14: X (BlockHandler+0x4a) [0x437f9a]
15: X (WaitForSomething+0x11c) [0x4617fc]
16: X (0x400000+0x34011) [0x434011]
17: X (0x400000+0x2340a) [0x42340a]
18: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x3000421745]
19: X (0x400000+0x236b1) [0x4236b1]
[mi] EQ overflow continuing.  1000 events have been dropped.
[mi] No further overflow reports will be reported until the clog is cleared.

probably useless:
# addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so.2afd49a28429c 0x4de06 0x52446 0x5246c 0x77088 0x779cf 0x8a32f
/wrk/safari/cvs/xf86-video-intel/src/sna/kgem.c:1579
/wrk/safari/cvs/xf86-video-intel/src/sna/kgem.h:237
/wrk/safari/cvs/xf86-video-intel/src/sna/kgem.h:244
/wrk/safari/cvs/xf86-video-intel/src/sna/sna_accel.c:11478
/wrk/safari/cvs/xf86-video-intel/src/sna/sna_accel.c:11730
/wrk/safari/cvs/xf86-video-intel/src/sna/sna_driver.c:594
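
The offsets given to addr2line above are the second number inside the parentheses of each intel_drv.so frame, where load base plus offset equals the absolute address in brackets. Below is a minimal Python sketch of that extraction, assuming the backtrace format shown in this report. The name `module_offsets` is hypothetical and not part of any Xorg or binutils tooling.

```python
import re

# One backtrace frame of the form:
#   8: /path/to/module.so (0xBASE+0xOFFSET) [0xADDRESS]
# Frames that name a symbol instead, such as "(ioctl+0x7)", do not match.
FRAME = re.compile(
    r"^\s*\d+: (\S+) \((0x[0-9a-f]+)\+(0x[0-9a-f]+)\) \[(0x[0-9a-f]+)\]"
)

def module_offsets(backtrace, module_suffix):
    """Return the in-module offsets (as hex strings) for frames whose
    path ends with module_suffix, in the order they appear."""
    offsets = []
    for line in backtrace.splitlines():
        m = FRAME.match(line)
        if not m:
            continue
        path = m.group(1)
        base = int(m.group(2), 16)
        offset = int(m.group(3), 16)
        address = int(m.group(4), 16)
        # Sanity check: base + offset must equal the absolute address.
        if path.endswith(module_suffix) and base + offset == address:
            offsets.append(hex(offset))
    return offsets
```

Joining the returned list with spaces gives the argument list for an addr2line invocation like the one quoted above. Useful line numbers still require a driver binary built with debug information.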
Comment 1 Sami Farin 2012-01-28 10:42:34 UTC
xorg-x11-server-Xorg-1.11.99.901-3.20120124.fc17.x86_64
stalled too, and this time I could not log in at tty1 for some reason.

<3>[180030.123002] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 9664668, at 9664667], missed IRQ?
I got these every 1.5s.

Now I am running 1.11.99.902 without the Fedora patches, fingers crossed.
Comment 2 Chris Wilson 2012-01-28 11:07:20 UTC
Daniel, another render ring IRQ miss surfaces!
Comment 3 Chris Wilson 2012-01-28 11:07:54 UTC
I suspect we have a further issue, but we can at least fix the render ring IRQ miss and see what remains...
Comment 4 Daniel Vetter 2012-01-28 11:25:52 UTC
Please test the for-nkalkhof branch available in the kernel git repo at:

http://cgit.freedesktop.org/~danvet/drm/
Comment 5 Sami Farin 2012-01-28 12:05:26 UTC
git clone git://people.freedesktop.org/~danvet/drm
Cloning into 'drm'...
remote: Counting objects: 2301783, done.
remote: Compressing objects: 100% (356562/356562), done.
remote: Total 2301783 (delta 1929730), reused 2295032 (delta 1923035)
Receiving objects: 100% (2301783/2301783), 549.20 MiB | 1.50 MiB/s, done.
Resolving deltas: 100% (1929730/1929730), done.
warning: remote HEAD refers to nonexistent ref, unable to checkout.
Comment 6 Daniel Vetter 2012-01-28 12:11:31 UTC
On Sat, Jan 28, 2012 at 21:05,  <bugzilla-daemon@freedesktop.org> wrote:
> warning: remote HEAD refers to nonexistent ref, unable to checkout.

That's expected; you need to check out my branch:

git checkout origin/for-nkalkhof
Comment 7 Sami Farin 2012-01-28 12:59:19 UTC
So it was the kernel. At first I thought I was cloning libdrm; no wonder it was so humongous ;)

Isn't there a spinoff of the missed-IRQs patch for the 3.0.x kernel?
I don't really have time to start tinkering with this nkalkhof kernel. Maybe I will just wait until the fixes land in 3.2.stable.
Comment 8 Daniel Vetter 2012-01-28 13:14:58 UTC
Currently that patch will land in 3.4 at the earliest, so trying out that git kernel is highly advised if you don't want to wait a few months for a possible fix ...
Comment 9 Sami Farin 2012-01-28 13:52:55 UTC
OK, thanks, I guess I have to start using it if I start getting serious hangs (requiring reboot) more regularly (>2 times a week).
Comment 10 Sami Farin 2012-01-30 13:25:02 UTC
OK, I had to use sysrq-k three times within 24h, so I compiled the -nkalkhof kernel.
It is not very stable either: it crashed the instant I ran mplayer -fs -vo gl some.mkv.


[ 3380.615530] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3380.615595] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 3380.618070] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 177997 at 177995, next 178001)
[ 3386.979091] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3386.979105] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 178004 at 178003, next 178011)
[ 3393.286694] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3393.286711] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 178012 at 178011, next 178017)
[ 3399.646258] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3399.646273] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 178030 at 178023, next 178031)
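
The two kinds of i915 messages quoted in this report both carry a pair of sequence numbers: the one being waited for and the one the ring has reached. Below is a minimal Python sketch for pulling those pairs out of a saved dmesg log, assuming the exact message wording shown here. The name `seqno_gaps` and the kind labels are hypothetical. The return value -11 in the i915_wait_request lines corresponds to -EAGAIN on Linux.

```python
import re

# "render ring idle [waiting on 9664668, at 9664667], missed IRQ?"
IDLE = re.compile(r"\[waiting on (\d+), at (\d+)\], missed IRQ\?")
# "i915_wait_request returns -11 (awaiting 177997 at 177995, next 178001)"
WAIT = re.compile(
    r"i915_wait_request returns (-?\d+) \(awaiting (\d+) at (\d+), next (\d+)\)"
)

def seqno_gaps(log):
    """Return a list of (kind, wanted, current, gap) tuples, one per
    matching log line, where gap = wanted - current."""
    results = []
    for line in log.splitlines():
        m = IDLE.search(line)
        if m:
            wanted, current = int(m.group(1)), int(m.group(2))
            results.append(("missed-irq?", wanted, current, wanted - current))
            continue
        m = WAIT.search(line)
        if m:
            wanted, current = int(m.group(2)), int(m.group(3))
            results.append(("wait-failed", wanted, current, wanted - current))
    return results
```

Counting the entries by kind separates the "missed IRQ?" reports tracked by this bug from the "GPU hung" wait failures, without relying on the timestamps.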
Comment 11 Sami Farin 2012-01-30 13:25:45 UTC
Created attachment 56339 [details]
i915_error_state.mesa_eaf360e5bffc56307-intel_ed1c1a7468.txt
Comment 12 Sami Farin 2012-01-30 13:26:04 UTC
Created attachment 56340 [details]
intel_reg_dumper.mesa_eaf360e5bffc56307-intel_ed1c1a7468.txt
Comment 13 Sami Farin 2012-01-30 13:27:14 UTC
(WW) intel(0): sna_dri_get_msc:1650 get vblank counter failed: Invalid argument
(WW) intel(0): flip queue failed: Device or resource busy
Comment 14 Chris Wilson 2012-01-30 14:57:39 UTC
The residual bug is bug 44364, so retitling as appropriate to separate the issues.
Comment 15 Sami Farin 2012-02-07 03:05:16 UTC
Now with 7½ days of uptime it has had only one more GPU hang, not bad.

But I am wondering: how often do you merge the Linus stable tree into for-nkalkhof?
Comment 16 Sami Farin 2012-02-19 04:18:40 UTC
This is with mesa df5963c25641a7c3a4bbfcb81cc3dc771581590e and the nkalkhof kernel:

<3>[599034.091545] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
<6>[599034.091620] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
<3>[599034.094118] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 35040628 at 35040627, next 35040632)
<3>[599040.403130] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
<3>[599040.403144] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 35040635 at 35040634, next 35040639)
<3>[599046.698729] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
<3>[599046.698744] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 35040641 at 35040640, next 35040646)
<3>[599053.006316] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
<3>[599053.006331] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 35040649 at 35040647, next 35040652)
<0>[599053.006346] ------------[ cut here ]------------
<2>[599053.006358] kernel BUG at drivers/gpu/drm/drm_irq.c:929!
<0>[599053.006366] invalid opcode: 0000 [#1] SMP
<4>[599053.006375] CPU 3
<4>[599053.006379] Modules linked in: nfnetlink_log nls_iso8859_1 nls_cp437 xt_NOTRACK xt_CLASSIFY ipt_ECN xt_connmark xt_length xt_connlimit xt_set xt_multiport ip_set_bitmap_port ip_set_hash_net sch_sfb nf_conntrack_netlink snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi snd_seq_device arptable_filter arp_tables ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT ip6t_LOG xt_limit ipt_LOG xt_hashlimit xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state xt_tcpudp ip6table_filter ip6table_mangle ip6_tables iptable_filter iptable_mangle iptable_raw ip_tables bridge stp llc snd_hda_intel ip_set nfnetlink sch_netem sch_hfsc sch_htb sch_sfq cls_fw cls_u32 cls_route sch_ingress sch_red sch_tbf sch_teql sch_prio sch_gred cls_rsvp cls_rsvp6 cls_tcindex sch_cbq sch_dsmark at24 at25 uvcvideo videodev v4l2_compat_ioctl32 pps_ldisc pps_core pl2303 af_key xfrm4_tunnel esp4 ah4 ipcomp xfrm_ipcomp ipip tun udf vfat fat oprofile nf_conntrack x_tables dccp_diag dccp cmtp kernelcapi loop ppdev parport_pc lp parport i2c_dev ppp_async ppp_generic slhc ftdi_sio usbserial rfcomm lockd sunrpc bnep w83627ehf hwmon_vid coretemp hwmon snd_hda_codec_hdmi snd_hda_codec_realtek btusb bluetooth i2c_i801 rfkill snd_hda_codec snd_pcm e1000e iTCO_wdt iTCO_vendor_support snd_timer snd soundcore snd_page_alloc mxm_wmi wmi rtc_cmos kvm binfmt_misc tcp_cubic autofs4 firewire_ohci firewire_core i915 drm_kms_helper button video [last unloaded: pcspkr]
<4>[599053.006669]
<4>[599053.006673] Pid: 20231, comm: X Not tainted 3.2.0-nkalkhof-44e435f49f5b+ #1 System manufacturer System Product Name/P8Z68-V PRO GEN3
<4>[599053.006693] RIP: 0010:[<ffffffff813c937b>]  [<ffffffff813c937b>] drm_vblank_put+0x5e/0x60
<4>[599053.006710] RSP: 0018:ffff880330917ce0  EFLAGS: 00010246
<4>[599053.006719] RAX: ffff88041d3d1200 RBX: ffff88041c8b5000 RCX: 00000000000048c0
<4>[599053.006730] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88041c8b5000
<4>[599053.006741] RBP: ffff880330917d48 R08: 0000000000000000 R09: 0000000000000000
<4>[599053.006752] R10: ffff88036a4ec9d0 R11: 0000000000000246 R12: ffff88006aa84b80
<4>[599053.006763] R13: ffff88041c8b5578 R14: ffff880417b9c000 R15: 00000000fffffff5
<4>[599053.006775] FS:  00007f5f2787e8c0(0000) GS:ffff88041f200000(0000) knlGS:0000000000000000
<4>[599053.006787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[599053.006796] CR2: 00007f4f51e8b000 CR3: 000000036eb56000 CR4: 00000000000406e0
<4>[599053.006807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[599053.006818] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[599053.006829] Process X (pid: 20231, threadinfo ffff880330916000, task ffff88036a4ec340)
<0>[599053.006840] Stack:
<4>[599053.006844]  ffffffffa003d208 ffff880330917d28 0000000000000001 ffff88004f06a400
<4>[599053.006859]  ffff88041c8b5068 ffff880417e40000 ffff88006aa84480 ffff880093a70000
<4>[599053.006873]  ffff880330917df8 ffff88041c8b5000 ffff88041c8b5720 ffff880093a70000
<0>[599053.006887] Call Trace:
<4>[599053.006902]  [<ffffffffa003d208>] ? intel_crtc_page_flip+0x20c/0x338 [i915]
<4>[599053.006915]  [<ffffffff813d509b>] drm_mode_page_flip_ioctl+0x174/0x1ec
<4>[599053.006926]  [<ffffffff813c5bcc>] drm_ioctl+0x3c0/0x483
<4>[599053.006936]  [<ffffffff813d4f27>] ? drm_mode_gamma_get_ioctl+0x106/0x106
<4>[599053.006948]  [<ffffffff81121937>] do_vfs_ioctl+0x8a/0x4f9
<4>[599053.006958]  [<ffffffff81121de0>] sys_ioctl+0x3a/0x7a
<4>[599053.006967]  [<ffffffff816963e2>] system_call_fastpath+0x16/0x1b
<0>[599053.006976] Code: 00 8d 04 80 8d 04 80 8d 14 80 c1 e2 03 be d3 4d 62 10 89 d0 f7 e6 c1 ea 06 48 8d 34 0a 48 81 c7 e0 04 00 00 e8 68 66 c8 ff 5d c3 <0f> 0b 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 41 89 f5 4c 63
<1>[599053.007049] RIP  [<ffffffff813c937b>] drm_vblank_put+0x5e/0x60
<4>[599053.007060]  RSP <ffff880330917ce0>
Comment 17 Sami Farin 2012-03-02 08:31:28 UTC
Because nkalkhof no longer seems to be maintained, I tried torvalds-1f033c1a6ec1a6815e9c. mplayer -fs -vo gl was too much for it:

[ 1826.552898] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1826.552965] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 1826.555467] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 154707 at 154706, next 154709)
[ 1832.876460] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1832.876478] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 154711 at 154710, next 154717)
Comment 18 Sami Farin 2012-03-02 08:32:29 UTC
Created attachment 57925 [details]
3.3.0-rc5-1f033c1a6e-error_state.txt
Comment 19 Sami Farin 2012-03-02 08:32:53 UTC
Created attachment 57926 [details]
3.3.0-rc5-1f033c1a6e-regs.txt
Comment 20 Chris Wilson 2012-03-02 09:06:30 UTC
(In reply to comment #17)
> because nkalkhof seems not to  be maintained anymore, I tried
> torvalds-1f033c1a6ec1a6815e9c, mplayer -fs -vo gl was too much for it:

Which is still your *mesa* bug 44364; this bug is tracking the occurrence of the IRQ miss on SNB, which does have a patch going into 3.4.
Comment 21 Florian Mickler 2012-04-05 06:49:11 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 99ffa1629d737295e569267cf5940758139f9ddb
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 25 14:04:00 2012 +0100

    drm/i915: enable forcewake voodoo also for gen6
Comment 22 Daniel Vetter 2012-04-15 04:37:33 UTC
Patch has landed, closing this bug. The residual GPU hang looks like a separate mesa issue; missed IRQs seem to be gone for good. If they pop up again, please reopen.
