| Summary: | [BDW] GPU hang in Shogun2 | | |
| --- | --- | --- | --- |
| Product: | Mesa | Reporter: | Pavel Ondračka <pavel.ondracka> |
| Component: | Drivers/DRI/i965 | Assignee: | Matt Turner <mattst88> |
| Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
| Severity: | normal | | |
| Priority: | medium | CC: | pavel.e.popov |
| Version: | git | Keywords: | bisected, regression |
| Hardware: | Other | | |
| OS: | All | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Bug Depends on: | | | |
| Bug Blocks: | 93185 | | |
| Attachments: | Shaders output with and without hangs | | |
Description
Pavel Ondračka
2015-10-02 07:54:52 UTC
Thanks for this bug, Pavel! I also observed GPU hangs playing traces recorded for similar games: "Total War: Empire" and "Total War: Napoleon". I observed this issue only on BDW with Mesa 10.6. No GPU hangs were observed on BDW with Mesa 10.4 (it doesn't have NIR). No GPU hangs were observed on HSW with Mesa 10.4 or Mesa 10.6. I tried to revert commit f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff, but without success.

As I said previously, I see hangs with the "Total War: Napoleon" and "Total War: Empire" traces on BDW with Mesa 10.6. I tried to revert commit "nir: Recognize (a < c || b < c) as min(a, b) < c." but without success. So I tried to remove all optimizations in nir_opt_algebraic.py: no hangs were observed on "Total War: Napoleon", but hangs were still observed on "Total War: Empire". Somehow the NIR optimizations from nir_opt_algebraic.py lead to hangs in "Total War: Napoleon". It looks like Pavel Ondračka observed a similar issue with "Total War: Shogun2".

I could reproduce the hangs on BDW using the "Total War: Shogun2" trace: http://pavel.ondracka.cz/Shogun2.trace

I found that NO hangs are observed with Intel NIR disabled for this trace:

    export INTEL_USE_NIR=0

However, this approach didn't work with my "Total War: Napoleon" and "Total War: Empire" traces (unfortunately, I can't share them; they are not apitrace traces), so most likely it's not a NIR issue. I found that the hangs on "Total War: Napoleon" and "Total War: Empire" are gone without this commit (nir/opt_algebraic: Add some constant bcsel reductions): http://cgit.freedesktop.org/mesa/mesa/commit/?id=604ae33c8b95a97ba586780324566fd21c59b695

These additional optimizations somehow lead to hangs:

    +# Add optimizations to handle the case where the result of a ternary is
    +# compared to a constant. This way we can take things like
    +#
    +# (a ? 0 : 1) > 0
    +#
    +# and turn it into
    +#
    +# a ? (0 > 0) : (1 > 0)
    +#
    +# which constant folding will eat for lunch. The resulting ternary will
    +# further get cleaned up by the boolean reductions above and we will be
    +# left with just the original variable "a".
    +for op in ['flt', 'fge', 'feq', 'fne',
    +           'ilt', 'ige', 'ieq', 'ine', 'ult', 'uge']:
    +   optimizations += [
    +      ((op, ('bcsel', 'a', '#b', '#c'), '#d'),
    +       ('bcsel', 'a', (op, 'b', 'd'), (op, 'c', 'd'))),
    +      ((op, '#d', ('bcsel', a, '#b', '#c')),
    +       ('bcsel', 'a', (op, 'd', 'b'), (op, 'd', 'c'))),
    +   ]
    +

I also made sure that the hangs on "Total War: Shogun2" are gone without this commit (nir: Recognize (a < c || b < c) as min(a, b) < c.): http://cgit.freedesktop.org/mesa/mesa/commit/?id=f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff

I wasn't able to reproduce this on my Broadwell GT2. I tried both Shogun2.trace from this bug and "Empire: Total War". I used Mesa master from today, and also tried 604ae33c8b95a97ba586780324566fd21c59b695. Both worked OK for me, no hangs. Is this still happening for either of you?

We still see these GPU hangs in our environment. I used a common configuration to make sure it's not an issue on our side. I tried Shogun2.trace and could reproduce the GPU hang.

    OS: Ubuntu 15.04
    Kernel: 3.19.0-22-generic
    GPU: Mesa DRI Intel(R) Iris Pro P6300 (Broadwell GT3e)
    Mesa: 11.0.4~git20151026+11.0.ec14e6f8-0ubuntu0ricotz~vivid

Kenneth, could you share your configuration?

I'm using kernel 4.4.0-rc2 on a BDW GT2 (Lenovo X250) with Mesa 11.2.0-devel (git-cf97544).

Somehow this issue isn't reproduced with Mesa master, but I'm not sure the problem won't appear again.
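Both rewrites quoted above are sound as scalar algebra, which already hints that the problem lies in the generated code rather than in the rules themselves. A minimal numeric sketch, not taken from the bug report, that checks the two identities on sample values:

    # Sanity-check the two NIR rewrites discussed above using plain Python
    # floats. This only demonstrates the scalar math; it says nothing about
    # the GEN code the driver generates, which is where the hang came from.
    import itertools

    def bcsel(cond, x, y):
        # NIR bcsel: pick x when cond is true, otherwise y
        return x if cond else y

    values = [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]

    # 604ae33: (bcsel(a, b, c) < d)  ->  bcsel(a, b < d, c < d)
    for a in (True, False):
        for b, c, d in itertools.product(values, repeat=3):
            assert (bcsel(a, b, c) < d) == bcsel(a, b < d, c < d)

    # f5cf74d8: (a < c || b < c)  ->  (min(a, b) < c)
    for a, b, c in itertools.product(values, repeat=3):
        assert (a < c or b < c) == (min(a, b) < c)

    print("both identities hold on the sampled values")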
I used 41e82f4f96f87e3b5bd3e7a3dc221cf6e6b6ae0b from Mesa master and couldn't reproduce the hangs in any of the Total War titles.

Kenneth, I didn't try Mesa up to 604ae33c8b95a97ba586780324566fd21c59b695 as you did; I observed these hangs on Mesa 10.6 and found they are gone on my workloads if just the single patch 604ae33c8b95a97ba586780324566fd21c59b695 is reverted. That can be the reason why you didn't see the hangs. It looks like some ordering of the NIR optimizations leads to hangs on BDW. For example, on Mesa 10.6 I reverted patch 604ae33c8b95a97ba586780324566fd21c59b695 and the hangs were gone on my Empire and Napoleon workloads, but when I also reverted patch f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff for Shogun2, the hangs appeared again on Empire. I could hide all the hangs only when I removed all NIR optimizations in nir_opt_algebraic.py (comment 2 is wrong; some optimizations weren't removed during that experiment).

There's a lot of information in this bug, and I want to make sure I understand it all.

- Pavel Ondračka can reproduce the hang in Total War: Shogun2 on Mesa 11.0.something (from commit c0722be9).
- Pavel Popov can reproduce the hang in Total War: Napoleon and Total War: Empire on Mesa 11.0.4 (from commit ec14e6f8).
- Pavel Popov can reproduce the hang in TW:N and TW:E on Mesa 10.6 (unknown commit).
- Pavel Popov cannot reproduce the hang in TW:N or TW:E on Mesa 10.4 (unknown commit).
- Ken cannot reproduce the hang in TW:E on Mesa 10.2 (commit 604ae33).
- Ken cannot reproduce the hang in TW:E on Mesa master (unknown commit from 3-December-2015).
- Pavel Popov cannot reproduce the hang in TW:N or TW:E on Mesa master (from commit 41e82f4f).
- Reverting some patches (604ae33c8 or f5cf74d8, but *not both*) that add algebraic optimizations, or disabling nir_opt_algebraic.py, eliminates the hangs.

Assuming that the information in comment #10 is at least mostly correct, I'd like to see several pieces of additional information:

1. Do the hangs occur on Mesa 11.0.6 or the current tip of the 11.0 stable branch? I suspect that they will, but I want to be thorough.
2. If the hangs occur on Mesa 11.0.4 and do not occur on master, can someone bisect to see when this was fixed? There may be some backend patch that we want to cherry-pick back to 11.0.
3. Can someone attach the GEN assembly of the shaders that trigger the hang? It sounds like running Shogun2.trace with the environment variable INTEL_DEBUG=vs,gs,fs should do the trick.

I've managed to bisect the patch which fixes the hangs for all Total War cases (TW:E, TW:N, TW:S):

i965: always run the post-RA scheduler
http://cgit.freedesktop.org/mesa/mesa/commit/?id=486268bdb03a36faf09d84e0458ff49dd1325c40

I also applied this patch to Mesa 11.0.6 and made sure that the hangs are gone. Please double-check me; I have a very specific environment which can affect the results. I think that the apitrace for Shogun2 is enough to do this.

Created attachment 120414 [details]
Shaders output with and without hangs
I used INTEL_DEBUG=vs,gs,fs for the Shogun2 trace and obtained two logs: with (good) and without (bad) the patch "i965: always run the post-RA scheduler".
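For anyone trying to regenerate such logs: assuming apitrace's glretrace is installed and Shogun2.trace sits in the working directory (both assumptions, adjust for your setup), a small Python wrapper along these lines would replay the trace with shader dumping enabled and capture everything to a file:

    # Replay the trace with i965 shader dumps enabled and capture the output.
    # glretrace availability and the file names are assumptions; adjust them.
    import os
    import subprocess

    env = dict(os.environ, INTEL_DEBUG="vs,gs,fs")

    with open("shogun2_shaders.log", "w") as log:
        subprocess.run(["glretrace", "-b", "Shogun2.trace"],
                       env=env, stdout=log, stderr=subprocess.STDOUT,
                       check=False)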
I believe I came across a bug that is the cause of these problems. I sent a four-patch series:

- i965/fs: Rename opt_copy_propagate -> opt_copy_propagation.
- i965/fs: Add unit tests for copy propagation pass.
- i965/fs: Reject copy propagation into SEL if not min/max.
- nir: Move fsat outside of fmin/fmax if second arg is 0 to 1.

where the third is the bug fix. My theory is that enabling NIR caused the code we optimize in the backend to hit this bug.

I've committed those four patches.

    commit 7bed52bb5fb4cfd5f91c902a654b3452f921da17
    Author: Matt Turner <mattst88@gmail.com>
    Date:   Mon Nov 28 15:21:51 2016 -0800

        i965/fs: Reject copy propagation into SEL if not min/max.

This and the previous three should fix a codegen bug seen in a number of games, including Shogun 2. I would love it if we were able to confirm that this was the culprit, but it may not be possible. Please do not hesitate to reopen if you can reproduce.
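As a side note on the fourth patch in that series: moving fsat outside of fmin/fmax is a plain scalar identity whenever the second operand stays in [0, 1]. A minimal numeric sketch of that identity, not taken from the patches themselves:

    # Check fmin(fsat(a), b) == fsat(fmin(a, b)) and the fmax analogue,
    # which holds when the constant b lies in [0, 1]. Illustration only.
    import itertools

    def fsat(x):
        return min(max(x, 0.0), 1.0)

    a_vals = [-3.0, -0.5, 0.0, 0.25, 0.75, 1.0, 2.5]
    b_vals = [0.0, 0.3, 0.5, 1.0]   # second argument restricted to [0, 1]

    for a, b in itertools.product(a_vals, b_vals):
        assert min(fsat(a), b) == fsat(min(a, b))
        assert max(fsat(a), b) == fsat(max(a, b))

    print("fsat can be moved outside fmin/fmax for these values")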