Summary: | [ivb] garbage from ctx restore
---|---
Product: | DRI
Component: | DRM/Intel
Status: | CLOSED FIXED
Severity: | major
Priority: | medium
Version: | XOrg git
Hardware: | x86-64 (AMD64)
OS: | Linux (All)
Reporter: | Kamil Bar <nevehanter>
Assignee: | Ben Widawsky <ben>
QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: | dvabje0+mfvd7w, evul.troll, fox6x6x6, intel-gfx-bugs, mb_mail, patrik.plihal, pedro, richih.mailinglist, svenstaro
Whiteboard: |
i915 platform: |
i915 features: |
Attachments: |
Created attachment 95054: dmesg
Junk loaded from context. The simplest theory for this is that the context gets overwritten by another batch. The wildest speculation is that the hw reads the wrong pages. In the past we found that caching the context in L3 improved the MTBF, but it did not completely eliminate this bug.

If you can, please try a drm-intel-nightly kernel; I believe it should contain more error state information for debugging context issues.

I've compiled the latest drm-intel-nightly kernel, and all the problems are gone, so I cannot post any additional errors. It was probably fixed there. When is the merge window to the stable branch expected?

Ben, ideas?

Kamil, you can test drm-intel-fixes, which is what will be in 3.14. drm-intel-nightly is currently targeting 3.15. If you can indeed identify a single commit that makes everything just work, we can backport that. However, there is one massive change in drm-intel-nightly, full ppgtt, that will prevent userspace from overwriting context objects, and that change itself is not backportable.

ctx objects should only be bound in the global gtt with aliasing ppgtt. Well, except in dinq, since full ppgtt broke this :(

*** Bug 75994 has been marked as a duplicate of this bug. ***

Kenneth mentioned seeing these during piglit runs as well. mesa overwriting the ctx object is as good a working theory as any.

*** Bug 76133 has been marked as a duplicate of this bug. ***

*** Bug 76395 has been marked as a duplicate of this bug. ***

Right, so the context is garbage with at least 2 cachelines of f's. This of course should explain IPEHR. I am surprised CCID reflects the garbage context; I would have expected that to not get loaded until after MI_SET_CONTEXT completes, but, whatever. The fact that we have a few corrupt cachelines as opposed to big blocks of corruption makes me want to blame HW. The biggest problem of course is that the LRI at the top of the context is missing, so no state is actually restored.
We could pretty easily try to detect this specific case, and then just abort the context restore if it's present. Would such a patch be interesting to anyone? It would only solve the case where the first cacheline is corrupt.

Kamil, btw, does it still occur on -nightly? PPGTT is now turned off there, so the delta should be smaller.

Kamil, also, is this reproducible without rc6?

Created attachment 96332: Dump more of the hw context

Please try this patch and attach the error state. I would like to know if the corrupted cachelines have any address pattern. Be aware that we have other bug reports where mesa is writing 0xffffffff into random locations (e.g. framebuffer, other pixmaps, ringbuffers).

*** Bug 76606 has been marked as a duplicate of this bug. ***

Created attachment 96392: Prevent context corruption

Another patch to try. This one should help prevent context corruption from userspace. Seems to not blow things up on my IVB. YMMV.

*** Bug 76608 has been marked as a duplicate of this bug. ***

Nobody wants to try the patch?

Against which kernel branch does this patch apply? I'm not able to get this patch applied to 3.13.7 or 3.14-rc8.

It applies to drm-intel-nightly. AFAIK the issue still occurs there now that we disabled full PPGTT.

*** Bug 77195 has been marked as a duplicate of this bug. ***

I am assuming that these turn out to have been the mesa bug fixed recently.
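For anyone inspecting a dumped context image by hand, the "cachelines of f's" check discussed in the thread can be sketched roughly as below. This is only an illustration, not the proposed kernel patch: it assumes the context image has already been parsed into a list of 32-bit dwords and that a cacheline is 64 bytes, and the function names are made up for this example.

```python
# Hypothetical helper for offline inspection of a dumped context image.
# Assumptions (not from the bug report): the image is a list of 32-bit
# dwords, and a cacheline is 64 bytes (16 dwords).
CACHELINE_BYTES = 64
DWORDS_PER_CACHELINE = CACHELINE_BYTES // 4


def corrupt_cachelines(dwords, pattern=0xFFFFFFFF):
    """Return the indices of cachelines in which every dword equals `pattern`."""
    bad = []
    last_start = len(dwords) - DWORDS_PER_CACHELINE
    for start in range(0, last_start + 1, DWORDS_PER_CACHELINE):
        line = dwords[start:start + DWORDS_PER_CACHELINE]
        if all(d == pattern for d in line):
            bad.append(start // DWORDS_PER_CACHELINE)
    return bad


def should_abort_restore(dwords):
    """True only when the first cacheline is filled with the pattern.

    This mirrors the limitation mentioned in the thread: only corruption
    of the first cacheline is caught.
    """
    return 0 in corrupt_cachelines(dwords)
```

Printing the indices returned by `corrupt_cachelines` for several error states would be one way to look for an address pattern in the corrupted cachelines.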
Created attachment 95053: /sys/class/drm/card0/error

I recently started getting this error often when running 3D apps, games, or anything else that uses 3D. I can get out of the video hang with Alt+F4, and once the app window is closed everything is back to normal. It occurs roughly once in every 1 to 8 launches, so it is not reliably reproducible, and it does not occur once the app is already running. I'm using kernel version 3.13.5 with the latest git mesa and libgl. The GPU is the HD 4600 in an i7-4770K.