Summary: | [SNB rc6/vt-d] hang unless i915_enable_rc6=0 | ||
---|---|---|---|
Product: | DRI | Reporter: | Ted Phelps <phelps> |
Component: | DRM/Intel | Assignee: | Eugeni Dodonov <eugeni> |
Status: | CLOSED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | daniel, eugeni, florian, jbarnes, xhejtman |
Version: | DRI git | Keywords: | NEEDINFO |
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Description
Ted Phelps
2011-06-22 05:37:42 UTC
Created attachment 48282 [details]
i915_error_state
The plain text version is too large, so I've bzipped.
Right, there doesn't appear to be anything unusual in the error state. The only indication is that it is an older mesa, and you may find solace in some of the recent bug fixes. In particular,

commit f6e5230b2614cc91e4c849c07781b2230878d274
Author: Eric Anholt <eric@anholt.net>
Date: Fri Jun 17 18:44:26 2011 -0700

    i965/gen6: Apply documented workaround for nonpipelined state packets.

    Fixes a 100% reproducible GPU hang in topogun-1.06-orc-84k.trace.

    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

sounds like it could cause quite a few random crashes. First let's rule out the known bugs and, failing that, we have an unfortunate side-effect of rc6 that we need to pin down.

I'm having a great deal of difficulty getting a more recent Mesa to function. When I fire up the gears demo, I see some new textual output that looks vaguely like source code and then X11 hangs for a few seconds. A window appears with no content and the machine hangs for a few seconds. I can eventually recover by killing the gears demo. I've tried recompiling X, xf86-video-intel and the gears demo but the problem still persists. So I'm afraid that this issue is going to have to go on hold until that's sorted out.
-Ted

Facing the same problem:

[ 377.856455] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 377.866509] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 13484 at 13483, next 13485)

I made the following observation: I am running an MSI H67MA-ED55, i7 2600k, 2.6.38. The first time I had this "GPU hung" issue was after changing the BIOS from 1.4 to 1.5. With 1.4 Ubuntu ran without any problems. Testing 2.6.39 and 3.0rc4 made no difference. I imagine this could help narrow down the cause of the fault if the delta between BIOS 1.4 and 1.5 is not too big.
Good luck,
Peter

Re: comment #3, I had configured mesa to use the i965 gallium driver, and that appears to have been the source of that issue.
I ran with enable_rc6=0 overnight and something went wrong, but I'm not sure what. My keyboard was being ignored (even Num Lock didn't light its LED) and neither it nor my mouse could get me out of DPMS mode. Network was functional; I was able to ssh to the machine and there was no sign of trouble in dmesg or Xorg.0.log. My attempt to wake the machine from DPMS via xset just hung. strace of X itself showed an endless series of restarted system calls:

--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe) = -1 EINTR (Interrupted system call)
ioctl(10, 0x40406469, 0x7fffb8965040) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe) = -1 EINTR (Interrupted system call)
ioctl(10, 0x40406469, 0x7fffb8965040) = ? ERESTARTSYS (To be restarted)
...

I've since rebooted the machine with i915_enable_rc6=0 to see if that has any effect on this new issue. In short, I'm still having problems, but they appear to be different problems. I was unable to provoke the hangcheck timer by running 3D applications. I'll leave this bug open a few more days in case I do manage to reproduce the old symptoms.
Thanks,
-Ted

(In reply to comment #5)
> My attempt to wake the machine from DPMS via xset just hung.
> strace of X itself showed an endless series of restarted system calls:
>
> --- SIGALRM (Alarm clock) @ 0 (0) ---
> rt_sigreturn(0xe) = -1 EINTR (Interrupted system call)
> ioctl(10, 0x40406469, 0x7fffb8965040) = ? ERESTARTSYS (To be restarted)

Interesting. A mutex livelock perhaps? It would be useful to know the stacktrace (so that I don't have to work out which ioctl 0x40406469 is ;) and to get the kernel stacks for anything else that may still be inside i915.ko.

Ideally that should never have happened, as the hangcheck is supposed to kick in, force whatever is holding the lock to return, reset the device, then everything can continue merrily on as if nothing went wrong. (I did say ideally.)

Ted, did you reproduce the busy-spin?
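For reference, the ioctl number in the strace output can be split into its fields without a stack trace. The sketch below assumes the generic Linux ioctl encoding used on x86 (8-bit command number, 8-bit type, 14-bit size, 2-bit direction); the helper name `decode_ioctl` is made up for illustration. Under that assumption the value is a DRM ('d') write ioctl with a 64-byte argument and driver-private command 0x29, which appears to correspond to DRM_I915_GEM_EXECBUFFER2 in i915_drm.h, though that is worth confirming against the header for the kernel in use.

```python
# Decode a Linux ioctl request number, assuming the generic _IOC layout
# (as on x86): bits 0-7 nr, 8-15 type, 16-29 size, 30-31 direction.
# decode_ioctl is a hypothetical helper, not part of any library.

DRM_COMMAND_BASE = 0x40  # start of driver-private DRM ioctl numbers

_DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}


def decode_ioctl(request: int) -> dict:
    """Split an ioctl request number into its encoded fields."""
    return {
        "nr": request & 0xFF,
        "type": chr((request >> 8) & 0xFF),
        "size": (request >> 16) & 0x3FFF,
        "direction": _DIRECTIONS[(request >> 30) & 0x3],
    }


fields = decode_ioctl(0x40406469)
print(fields)
# Offset of the command within the driver-private range.
print(hex(fields["nr"] - DRM_COMMAND_BASE))
```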
I'm curious as to what the timings were. Do you still have the strace handy?

Sorry for the silence. I had a hardware failure on another machine and my test machine was pressed into service in a role where I couldn't easily reboot it. I tried the test again with a newer Linux kernel (git/keithp 902daf6), Mesa snapshot (git/master 576f489) and xorg-server (1.10.3) over the weekend. The machine locked up solid on me -- nothing from netconsole -- after about 2 hours. I rebooted and it has been running without issues for 2.5 days. Both times were with i915_enable_rc6=1. So something's still wrong, but I don't have anything useful to add.
-Ted

Created attachment 49000 [details]
Another GPU hung
Jul 12 06:35:29 orpheus -- MARK --
Jul 12 06:48:24 orpheus kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 12 06:48:24 orpheus kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 38663880 at 38663877, next 38663881)
Jul 12 06:48:24 orpheus kernel: [drm:init_ring_common] *ERROR* render ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000
Jul 12 06:48:25 orpheus kernel: [drm:init_ring_common] *ERROR* gen6 bsd ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000
Jul 12 06:48:25 orpheus kernel: [drm:init_ring_common] *ERROR* blt ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000
Jul 12 06:48:32 orpheus kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 12 06:48:32 orpheus kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 38663887 at 38663880, next 38663888)
Jul 12 06:55:29 orpheus -- MARK --
Jul 12 07:15:29 orpheus -- MARK --
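When comparing hang reports like the one above, it can help to pull the sequence numbers out of the i915_wait_request lines and see how far the GPU was behind the request being waited on. This is only a rough sketch: the regular expression is written against the message format quoted in this report, and the function name `parse_wait_request` is invented for illustration.

```python
import re
from typing import Optional

# Matches the message format quoted in this report, e.g.
# "i915_wait_request returns -11 (awaiting 38663880 at 38663877, next 38663881)"
_WAIT_RE = re.compile(
    r"i915_wait_request returns (-?\d+) "
    r"\(awaiting (\d+) at (\d+), next (\d+)\)"
)


def parse_wait_request(line: str) -> Optional[dict]:
    """Extract return code and seqnos from one log line, or None if no match."""
    match = _WAIT_RE.search(line)
    if match is None:
        return None
    ret, awaiting, at, nxt = (int(group) for group in match.groups())
    return {
        "ret": ret,                # -11 is reported as -EAGAIN later in this thread
        "awaiting": awaiting,      # seqno the caller was waiting for
        "at": at,                  # last seqno the GPU had completed
        "next": nxt,               # next seqno to be allocated
        "behind": awaiting - at,   # requests still outstanding at the hang
    }


sample = (
    "Jul 12 06:48:24 orpheus kernel: [drm:i915_wait_request] *ERROR* "
    "i915_wait_request returns -11 (awaiting 38663880 at 38663877, next 38663881)"
)
print(parse_wait_request(sample))
```

The helper expects each message on a single line, so logs wrapped by a terminal or mail client need to be rejoined first.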
Sorry for the incoherent comment on the attachment. I seem to be especially inept with the keyboard today. I just wanted to mention that this latest hang happened whilst I was asleep, so presumably not much was happening at the time. There was another hang a few hours later when I was at work:

Jul 12 09:45:58 orpheus kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 12 09:45:58 orpheus kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 40661957 at 40661954, next 40661958)

Same Xorg/mesa as in comment #8.
-Ted

The batch looks pretty innocuous. Yet the GPU is very much upset; the ring and error registers returning 0 is a very bad sign. And rc6 still appears to be the only difference between stable and unstable systems.

On a whim, I tweaked the latest drm-intel-next (git-6e96e77) so that __gen6_gt_wait_for_fifo would wait for 5000 iterations rather than 500, and would warn if the loop counter was less than 4500 when the loop exited, in addition to the existing warning on the loop counter being less than zero. I'll attach this trivial patch in a moment.

In the following day, I observed the new warning twice without the original warning -- it took between 500 and 5000 iterations to get an acceptable value from GT_FIFO_FREE_ENTRIES. The third time, both warnings were encountered, indicating that over 5000 iterations passed. Immediately following that, the first warning was hit again but without the second. About 16 hours later, there was another batch of warnings indicating 10,500 iterations were required before the fifo had a sufficient number of free entries. Three hours later, 100,000 cycles of waits ensued, the hangcheck timer expired and i915_wait_request returned -EAGAIN. I'll attach the stack traces and i915_error_state for this last one.

No 3-D applications were running on the machine at the time. This is all with i915_enable_rc6=1, mesa git-450f486, xf86-video-intel-2.15.0.
-Ted

Created attachment 49193 [details]
Error state corresponding to comment #12.

Created attachment 49194 [details]
First warning from __gen6_gt_wait_for_fifo
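To make the instrumentation from comment #12 easier to follow, here is a simplified user-space model of the wait loop with the two warnings. It is not the kernel code or the attached patch: the register read is replaced by a caller-supplied function, the per-iteration delay is omitted, and the names `wait_for_fifo`, `read_free_entries` and the example `reserved` value are placeholders chosen for illustration.

```python
from typing import Callable, Tuple


def wait_for_fifo(
    read_free_entries: Callable[[], int],
    reserved: int,
    max_loops: int = 5000,   # raised from 500 in the experiment
    warn_below: int = 4500,  # extra warning when the countdown ends below this
) -> Tuple[int, int, bool, bool]:
    """Poll until more than `reserved` entries are free or the budget runs out.

    Returns (free_entries, iterations_used, slow_warning, timeout_warning).
    """
    remaining = max_loops
    free = read_free_entries()
    while free <= reserved and remaining > 0:
        remaining -= 1
        free = read_free_entries()
    timeout_warning = free <= reserved      # budget exhausted, fifo still full
    slow_warning = remaining < warn_below   # needed more than the old 500 polls
    return free, max_loops - remaining, slow_warning, timeout_warning


# Example: a fake register that reports a full fifo for 600 polls, then drains.
readings = iter([0] * 600 + [30])
print(wait_for_fifo(lambda: next(readings), reserved=20))
```

In this model a timeout always implies the slow warning as well, which is consistent with the report of both warnings firing together once more than 5000 iterations had passed.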
Created attachment 49195 [details]
last batch of kernel warnings for comment #12

Created attachment 49196 [details] [review]
Patch to i915_drv.c described in comment #12

This patch doesn't fix anything; it simply highlights how long we can end up waiting for GT_FIFO_FREE_ENTRIES to reach GT_FIFO_NUM_RESERVED_ENTRIES.

Created attachment 49915 [details]
Hung GPU following "Try enabling RC6 by default (again)"
Yet another i915_error_state. This one with:
- Linux version 3.0.0-00175-g07b7ddd (on keithp/drm-intel-next)
- Mesa-7.11rc4
- xf86-video-intel-2.15.0
Cheers,
-Ted
A patch referencing this bug report has been merged in Linux v3.1-rc1:

commit 4e20fa65a3ea789510eed1a15deb9e8aab2b8202
Author: Keith Packard <keithp@keithp.com>
Date: Wed Aug 3 10:52:24 2011 -0700

    drm/i915: Try enabling RC6 by default (again)

A patch referencing a commit referencing this bug report has been merged in Linux v3.1-rc1:

commit 39060a07781b4930656752943cf5d66376d0533c
Author: Dave Airlie <airlied@redhat.com>
Date: Fri Aug 5 10:56:29 2011 +0100

    Revert "drm/i915: Try enabling RC6 by default (again)"

It looks like you're booting with i915.use_semaphores=1. Can you please disable semaphores and enable rc6 and see what happens? Just to check whether the problem only lies in the combination of the two.

Sorry for the late reply. I've seen this behavior both with and without semaphores enabled.
-Ted

Can you please check whether you're using VT-d/DMAR? If so, please try disabling that in the bios. Also please attach your full dmesg after boot.
Thanks, Daniel

Hi, could you also post the results of dmidecode and lspci -vv for the machine where the issue happens please?

Apologies again for the delay. My motherboard seems to have died; I've replaced that with a borrowed Intel DZ68DB motherboard and verified that I still see the hangcheck timer get wedged. I'm now running with virtualization disabled in the BIOS; I'll post again if I see it hang in this configuration.
-Ted

Created attachment 52131 [details]
dmesg output after reboot with virtualization disabled.
Created attachment 52132 [details]
lspci -vv output
Created attachment 52133 [details]
dmidecode output
Ok, I think Daniel is on to something. I've been running for 38 hours with i915_enable_rc6=1 and no GPU hangs, which is about 6 times longer than it typically takes to hang. I'll keep an eye on it for the rest of the week, but I think we have a winner. So, now that we think you're a genius, would you like to tell us why you thought VT-d/DMAR might be relevant?
-Ted

Ok, hopefully we'll have an angle on this now. Next thing to try is to reenable dmar in the bios and disable it on the kernel cmdline with

intel_iommu=off

This /should/ give the same results, but there are some slight variations possible. So please test carefully (i.e. if you can, let it run an entire week with this).

Oh, and the genius thing is a bit too much - we've simply tracked down another seemingly obscure bug to bad interaction with VT-d and I thought a shot in the dark rarely hurts ;-) And the proof is still out there, I won't (yet) call this "tracked down".

(In reply to comment #29)
> Ok, hopefully we'll have an angle on this now. Next thing to try is to reenable
> dmar in the bios and disable it on the kernel cmdline with
>
> intel_iommu=off
>
> This /should/ give the same results, but there are some slight variations
> possible. So please test carefully (i.e. if you can, let it run an entire week
> with this).
>
> Oh, and the genius thing is a bit too much - we've simply tracked down another
> seemingly obscure bug to bad interaction with VT-d and I thought a shot in the
> dark rarely hurts ;-) And the proof is still out there, I won't (yet) call this
> "tracked down".

I disabled VT-d in BIOS but I got screen corruption if rc6 enabled. Is it related?
-- Lukas Hejtmanek

> --- Comment #30 from Lukas Hejtmanek <xhejtman@fi.muni.cz> 2011-10-13
> I disabled VT-d in BIOS but I got screen corruption if rc6 enabled. Is it
> related?
Maybe. We have another report (#41682) implicating rc6 in render
glitches. Can you post a screenshot/screencast of it happening?
(In reply to comment #31)
> > --- Comment #30 from Lukas Hejtmanek <xhejtman@fi.muni.cz> 2011-10-13
> > I disabled VT-d in BIOS but I got screen corruption if rc6 enabled. Is it
> > related?
>
> Maybe. We have another report (#41682) implicating rc6 in render
> glitches. Can you post a screenshot/screencast of it happening?

OK, I will follow #41682 and provide a screenshot there. Btw, corruption seems to be related only to glyphs.

> --- Comment #32 from Lukas Hejtmanek <xhejtman@fi.muni.cz> 2011-10-13 06:20:53 PDT ---
> OK, I will follow #41682 and provide a screenshot there. Btw, corruption seems
> to be related only to glyphs.
Please post your screenshot here on this bug. With such
hard-to-track-down issues like this it's usually better to keep
reports separate till there's proof in form of a fix that they're
indeed the same issue. Too much risk of (needless) confusion.
Created attachment 52292 [details]
Render glitch with rc6=1
(In reply to comment #33)
> > --- Comment #32 from Lukas Hejtmanek <xhejtman@fi.muni.cz> 2011-10-13 06:20:53 PDT ---
> > OK, I will follow #41682 and provide a screenshot there. Btw, corruption seems
> > to be related only to glyphs.
>
> Please post your screenshot here on this bug. With such
> hard-to-track-down issues like this it's usually better to keep
> reports separate till there's proof in form of a fix that they're
> indeed the same issue. Too much risk of (needless) confusion.

Attached..

(In reply to comment #35)
> (In reply to comment #33)
> > > --- Comment #32 from Lukas Hejtmanek <xhejtman@fi.muni.cz> 2011-10-13 06:20:53 PDT ---
> > > OK, I will follow #41682 and provide a screenshot there. Btw, corruption seems
> > > to be related only to glyphs.
> >
> > Please post your screenshot here on this bug. With such
> > hard-to-track-down issues like this it's usually better to keep
> > reports separate till there's proof in form of a fix that they're
> > indeed the same issue. Too much risk of (needless) confusion.
>
> Attached..

hmm, looks like this is not *that* issue, this screenshot looks like glyphs rendered to a completely wrong surface. I'll try to catch another screenshot that is directly related to the rc6 issue.

I guess so as running forcewaked does not prevent the issue in the attachment.

Created attachment 52293 [details]
True rc6 render glitch
On Thu, Oct 13, 2011 at 15:52, <bugzilla-daemon@freedesktop.org> wrote:
> hmm, looks like this is not *that* issue, this screenshot looks like glyphs
> rendered to completely wrong surface. I try to catch another screenshot that is
> directly related to rc6 issue.
>
> I guess so as running forcewaked does not prevent the issue in the attachment.

Can you elaborate a bit on what you exactly mean here? I.e. is the "true" rc6 glitch prevented by running forcewaked, whereas the previous screenshot is only prevented by disabling rc6 on the kernel cmdline?

(In reply to comment #38)
> On Thu, Oct 13, 2011 at 15:52, <bugzilla-daemon@freedesktop.org> wrote:
> > hmm, looks like this is not *that* issue, this screenshot looks like glyphs
> > rendered to completely wrong surface. I try to catch another screenshot that is
> > directly related to rc6 issue.
> >
> > I guess so as running forcewaked does not prevent the issue in the attachment.
>
> Can you elaborate a bit on what you exactly mean here? I.e. is the
> "true" rc6 glitch prevented by running forcewaked, whereas the previous
> screenshot is only prevented by disabling rc6 on the kernel cmdline?

It looks like:
attachment id=52293 happens when rc6=1 AND forcewaked is not running.
attachment id=52292 happens independently of rc6 (either on kernel cmd or forcewaked) - just checked this and I am able to reproduce it.
For the render corruptions, can you please try the latest git version of xf86-video-intel, specifically

commit d0184b59095d5b8fab1a65ceba075d29189130d4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Sun Oct 9 18:43:14 2011 +0200

    snb: implement PIPE_CONTROL workaround

(In reply to comment #40)
> For the render corruptions, can you please try the latest git version of
> xf86-video-intel, specifically
>
> commit d0184b59095d5b8fab1a65ceba075d29189130d4
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date: Sun Oct 9 18:43:14 2011 +0200
>
> snb: implement PIPE_CONTROL workaround

I am running:

commit 823a4272c50247482428a16cb08741bf87a302ea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Oct 11 13:51:41 2011 +0100

    sna/gen3: Avoid RENDER/BLT context switch for fill boxes

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

and it is still bad.

The first bad commit is:

commit c5414ec992d935e10156a2b513d5ec2dded2f689
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Oct 2 12:02:41 2011 +0100

    sna: Use BLT operations to avoid fallbacks in core glyph rendering

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

(In reply to comment #41)
> The first bad commit is:
> commit c5414ec992d935e10156a2b513d5ec2dded2f689
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Sun Oct 2 12:02:41 2011 +0100
>
> sna: Use BLT operations to avoid fallbacks in core glyph rendering
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Lukas, are you using SNA? If so, can you file a separate bug report, as the original predates SNA and in particular that bisection. Thanks.

On Thu, Oct 13, 2011 at 16:42, <bugzilla-daemon@freedesktop.org> wrote:
> Lukas, are you SNA? If so can you file a separate bug report as the original
> predates SNA and in particular that bisection. Thanks.

Also please recheck the rc6 related issues reported here with sna disabled.
(In reply to comment #42)
> (In reply to comment #41)
> > The first bad commit is:
> > commit c5414ec992d935e10156a2b513d5ec2dded2f689
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date: Sun Oct 2 12:02:41 2011 +0100
> >
> > sna: Use BLT operations to avoid fallbacks in core glyph rendering
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> Lukas, are you SNA? If so can you file a separate bug report as the original
> predates SNA and in particular that bisection. Thanks.

Yes, I use SNA. I added comments to #41718, is that OK?

(In reply to comment #43)
> On Thu, Oct 13, 2011 at 16:42, <bugzilla-daemon@freedesktop.org> wrote:
> > Lukas, are you SNA? If so can you file a separate bug report as the original
> > predates SNA and in particular that bisection. Thanks.
>
> Also please recheck the rc6 related issues reported here with sna disabled.

It seems that it is SNA related. I don't see the issue without SNA. I will do more testing..

(In reply to comment #45)
> (In reply to comment #43)
> > On Thu, Oct 13, 2011 at 16:42, <bugzilla-daemon@freedesktop.org> wrote:
> > > Lukas, are you SNA? If so can you file a separate bug report as the original
> > > predates SNA and in particular that bisection. Thanks.
> >
> > Also please recheck the rc6 related issues reported here with sna disabled.
>
> it seems that it is SNA related. I don't see the issue without SNA. I will do
> more testing..

Well, it is not. See the attachment. This is a screenshot of corruption *without* SNA and *with* rc6=1 and *without* forcewaked running. I'll try to reproduce with forcewaked.

Created attachment 52299 [details]
rc6 render glitch without SNA
Re: comment #29, I've re-enabled virtualization and disabled the IOMMU (intel_iommu=off) and haven't seen a GPU hang after 10 days. Please let me know if there's anything further I can test for you.
-Ted

A patch referencing this bug report has been merged in Linux v3.2-rc6:

commit c0f372b3746d4ede07b2ace2beabd38d9c045b25
Author: Keith Packard <keithp@keithp.com>
Date: Wed Nov 16 22:24:52 2011 -0800

    drm/i915: By default, enable RC6 on IVB and SNB when reasonable

I am pretty sure this should be resolved with the 3.3 kernel, where we disabled RC6p on Sandy Bridge. But if it is still an issue, please reopen so we can investigate it once again.

A patch referencing this bug report has been merged in Linux v3.4-rc2:

commit aa46419186992e6b8b8010319f0ca7f40a0d13f5
Author: Eugeni Dodonov <eugeni.dodonov@intel.com>
Date: Fri Mar 23 11:57:19 2012 -0300

    drm/i915: enable plain RC6 on Sandy Bridge by default
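Since several results in this thread depend on which combination of intel_iommu and rc6 settings a machine was booted with, a small check of the kernel command line can avoid mix-ups when retesting. The sketch below parses a command-line string such as the contents of /proc/cmdline. The function names are made up, the sample string is hypothetical, and the `i915.` module prefix on the rc6 option is an assumption about how the parameter is spelled on the command line (the thread itself writes it as i915_enable_rc6).

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Map each kernel command-line token to its value (None for bare flags)."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def summarize(cmdline: str) -> Dict[str, Optional[object]]:
    """Report the two settings that mattered for this bug."""
    params = parse_cmdline(cmdline)
    # Accept the option with or without the module prefix (assumed spellings).
    rc6 = params.get("i915.i915_enable_rc6", params.get("i915_enable_rc6"))
    return {
        "iommu_disabled": params.get("intel_iommu") == "off",
        "rc6": rc6,  # None means the option was not given explicitly
    }


# Hypothetical command line, for illustration only.
example = "root=/dev/sda1 ro quiet intel_iommu=off i915.i915_enable_rc6=1"
print(summarize(example))
```

On a live system the input would come from reading /proc/cmdline; a missing option only means it was not set at boot, not what the driver default is.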