Bug 75724

Summary: [ivb] garbage from ctx restore
Product: DRI
Reporter: Kamil Bar <nevehanter>
Component: DRM/Intel
Assignee: Ben Widawsky <ben>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: medium
CC: dvabje0+mfvd7w, evul.troll, fox6x6x6, intel-gfx-bugs, mb_mail, patrik.plihal, pedro, richih.mailinglist, svenstaro
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                     Flags
/sys/class/drm/card0/error      none
dmesg                           none
Dump more of the hw context     none
Prevent context corruption      none

Description Kamil Bar 2014-03-03 19:23:39 UTC
Created attachment 95053 [details]
/sys/class/drm/card0/error

I recently started getting this error often when trying to run 3D apps, games, or anything else that uses 3D. I can get out of the video hang using Alt+F4, and once the app window is closed everything is back to normal. It occurs about once in every 1 to 8 launches, so it is not reliably reproducible, and it does not occur once the app is already running.

I'm using kernel version 3.13.5 with latest git mesa and libgl. GPU is HD4600 from i7-4770K.
Comment 1 Kamil Bar 2014-03-03 19:24:57 UTC
Created attachment 95054 [details]
dmesg
Comment 2 Chris Wilson 2014-03-03 20:48:26 UTC
Junk loaded from context. The simplest theory for this is that the context gets overwritten by another batch. Wildest speculation is that the hw reads the wrong pages.

In the past we found that caching the context in L3 improved the MTBF, but it did not completely eliminate this bug.

If you can, please try drm-intel-nightly; I believe it should contain more error state information for debugging context issues.
Comment 3 Kamil Bar 2014-03-04 08:52:12 UTC
I've compiled the latest drm-intel-nightly kernel and all the problems are gone, so I cannot post any additional errors. Presumably that was fixed. Could you tell me when it is expected to be merged into the stable branch?
Comment 4 Chris Wilson 2014-03-04 09:49:53 UTC
Ben, ideas?

Kamil, you can test drm-intel-fixes, which is what will be in 3.14. drm-intel-nightly is currently targeting 3.15.

If you can indeed identify a single commit that makes everything just work, we can backport that. However, there is one massive change in drm-intel-nightly, full ppgtt, that will prevent userspace from overwriting context objects - that itself is not backportable.
Comment 5 Daniel Vetter 2014-03-04 19:23:25 UTC
ctx objects should only be bound in the global gtt with aliasing ppgtt. Well except in dinq, since full ppgtt broke this :(
Comment 6 Chris Wilson 2014-03-10 21:10:50 UTC
*** Bug 75994 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2014-03-10 21:11:49 UTC
Kenneth mentioned seeing these during piglit runs as well. mesa overwriting the ctx object is as good a working theory as any.
Comment 8 Chris Wilson 2014-03-14 10:20:54 UTC
*** Bug 76133 has been marked as a duplicate of this bug. ***
Comment 9 Chris Wilson 2014-03-20 11:41:06 UTC
*** Bug 76395 has been marked as a duplicate of this bug. ***
Comment 10 Ben Widawsky 2014-03-24 23:26:38 UTC
Right, so the context is garbage with at least 2 cachelines of f's. This of course should explain IPEHR. I am surprised CCID reflects the garbage context, I would have expected that to not get loaded until after MI_SET_CONTEXT completes, but, whatever.

The fact that we have a few corrupt cachelines as opposed to big blocks of corruption makes me want to blame HW. The biggest problem of course is the LRI at the top of the context is missing, so no state is actually restored.

We could pretty easily try to detect this specific case, and then just abort the context restore if it's present. Would such a patch be interesting to anyone? It would only solve the case where the first cacheline is corrupt.

Kamil btw, does it still occur on -nightly? PPGTT is now turned off there, so the delta should be less.
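
The detection idea from comment 10 (notice that the LRI at the top of the context is gone and skip the restore) could be sketched roughly as below. This is a standalone illustration, not the attached patch and not i915 code: the function names are hypothetical, the context image is assumed to be already available as an array of dwords, a cacheline is assumed to be 64 bytes, and the LRI header is assumed to sit somewhere in the first cacheline. The opcode value 0x22 in bits 28:23 follows the MI_LOAD_REGISTER_IMM encoding as I understand it from the driver headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines, i.e. 16 dwords each. */
#define CACHELINE_DWORDS 16u

/* MI commands carry the command type in bits 31:29 (0 for MI) and the
 * opcode in bits 28:23. 0x22 is assumed to be MI_LOAD_REGISTER_IMM. */
#define MI_HEADER_MASK 0xff800000u
#define MI_LRI_HEADER  (0x22u << 23)

/* Hypothetical helper: does this dword look like an LRI command header? */
bool is_lri_header(uint32_t dw)
{
    return (dw & MI_HEADER_MASK) == MI_LRI_HEADER;
}

/* Hypothetical check: returns true if an LRI header is found anywhere in
 * the first cacheline of the saved context image. A cacheline filled with
 * 0xffffffff fails this check, which is the case discussed in comment 10. */
bool ctx_first_cacheline_looks_sane(const uint32_t *image, size_t ndwords)
{
    if (image == NULL || ndwords < CACHELINE_DWORDS)
        return false;

    for (size_t i = 0; i < CACHELINE_DWORDS; i++) {
        if (is_lri_header(image[i]))
            return true;
    }
    return false;
}
```

A caller would run the check on the saved image before emitting MI_SET_CONTEXT and fall back to not restoring the context when it fails. As noted in comment 10, this only covers corruption of the first cacheline.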
Comment 11 Ben Widawsky 2014-03-24 23:36:44 UTC
Kamil, also, is this reproducible without rc6?
Comment 12 Ben Widawsky 2014-03-25 00:38:12 UTC
Created attachment 96332 [details] [review]
Dump more of the hw context

Please try this patch and attach the error state. I would like to know if the corrupted cachelines have any address pattern.
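
For anyone who wants to look for an address pattern themselves, here is a minimal sketch of the kind of scan that could be run over the dumped context. It assumes the dump has already been converted into an array of dwords (parsing the error state file is not shown), that cachelines are 64 bytes, and that "corrupt" means a cacheline made up entirely of 0xffffffff as described in comment 10. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines, i.e. 16 dwords each. */
#define CACHELINE_DWORDS 16u

/* Hypothetical helper: scan a context image for cachelines that contain
 * nothing but 0xffffffff. Byte offsets of the first max_offsets matches
 * are stored in offsets[]; the return value is the total number of
 * matching cachelines. A trailing partial cacheline is ignored. */
size_t find_ff_cachelines(const uint32_t *image, size_t ndwords,
                          size_t *offsets, size_t max_offsets)
{
    size_t found = 0;

    if (image == NULL)
        return 0;

    for (size_t base = 0; base + CACHELINE_DWORDS <= ndwords;
         base += CACHELINE_DWORDS) {
        bool all_ff = true;

        for (size_t i = 0; i < CACHELINE_DWORDS; i++) {
            if (image[base + i] != 0xffffffffu) {
                all_ff = false;
                break;
            }
        }

        if (all_ff) {
            if (offsets != NULL && found < max_offsets)
                offsets[found] = base * sizeof(uint32_t);
            found++;
        }
    }
    return found;
}
```

Comparing the returned byte offsets across several error states would show whether the corrupt cachelines recur at the same addresses or at a fixed stride.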
Comment 13 Chris Wilson 2014-03-25 08:01:42 UTC
Be aware that we have other bug reports where mesa is writing 0xffffffff into random locations (e.g. framebuffer, other pixmaps, ringbuffers).
Comment 14 Chris Wilson 2014-03-25 21:30:27 UTC
*** Bug 76606 has been marked as a duplicate of this bug. ***
Comment 15 Ben Widawsky 2014-03-26 01:55:32 UTC
Created attachment 96392 [details] [review]
Prevent context corruption

Another patch to try. This one should help prevent context corruption from userspace. Seems to not blow things up on my IVB. YMMV
Comment 16 Chris Wilson 2014-03-26 02:19:42 UTC
*** Bug 76608 has been marked as a duplicate of this bug. ***
Comment 17 Ben Widawsky 2014-03-27 17:56:36 UTC
Nobody wants to try the patch?
Comment 18 towo 2014-03-28 10:03:13 UTC
Against which kernel branch does this patch apply?
I'm not able to get this patch to apply to 3.13.7 or 3.14-rc8.
Comment 19 Ben Widawsky 2014-03-28 16:23:02 UTC
It applies to drm-intel-nightly. AFAIK the issue still occurs there now that we disabled full PPGTT.
Comment 20 Chris Wilson 2014-04-08 18:06:08 UTC
*** Bug 77195 has been marked as a duplicate of this bug. ***
Comment 21 Chris Wilson 2014-05-16 16:26:20 UTC
I am assuming that these turn out to have been the mesa bug fixed recently.
