Summary: | Processing of SPIR-V shader causes device hang, sometimes leading to system reboot | ||
---|---|---|---|
Product: | Mesa | Reporter: | Alastair Donaldson <afdx> |
Component: | Drivers/Vulkan/intel | Assignee: | Ian Romanick <idr> |
Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | danylo.piliaiev, jason |
Version: | git | Keywords: | bisected, regression |
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Amber test exposing the problem
output_after_test_running |
Description
Alastair Donaldson
2019-07-05 10:52:52 UTC
Created attachment 144712 [details] output_after_test_running hey, looks like you opened pandora box with a lot of issues. Adding one more bisect result below. (and attaching test output during the hang) upd - interesting that mesa-18.0.0 - test crashes mesa-18.3.0 - test passes mesa-git - test hangs the system 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99 is the first bad commit commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99 Author: Ian Romanick <ian.d.romanick@intel.com> Date: Tue May 22 18:56:41 2018 -0700 intel/compiler: More peephole select Shader-db results: The one shader hurt for instructions is a compute shader that had both spills and fills hurt. I think I'm close to finding the issue which results in shader using uninitialized memory which in turn leads to hang. I'm unable to come up with any seemingly good solution at the moment so here what I have found: The hang happens due to the usage of uninitialized memory in calculations of loop terminator. Bisected commit (8fb8ebfbb05d3323481c8ba6d320b3a3580bad99) just makes other optimization go wrong. The usage of uninitialized memory starts with nir_lower_regs_to_ssa_impl which happens after opt_if_cf_list. During the nir_lower_regs_to_ssa_impl have (pay attention to r23): > decl_reg vec1 32 r23 > ... > /* succs: block_1 */ > loop { > block block_1: > /* preds: block_0 block_12 */ > ... > /* succs: block_5 */ > loop { > block block_5: > /* preds: block_4 block_9 block_11 */ > vec1 32 ssa_32 = phi block_4: ssa_5, block_9: r23, block_11: ssa_168 > r23 = phi block_4: ssa_5, block_9: r27, block_11: ssa_170 > ... > /* succs: block_6 block_7 */ > if r25 { > block block_6: > /* preds: block_5 */ > ... > break > /* succs: block_12 */ > } else { > block block_7: > /* preds: block_5 */ > /* succs: block_8 */ > } > block block_8: > /* preds: block_7 */ > vec1 32 ssa_47 = i2f32 r23 > /* succs: block_9 block_10 */ > if r26 { > block block_9: > /* preds: block_8 */ > r27 = iadd r23, ssa_5 > continue > /* succs: block_5 */ > } else { > block block_10: > /* preds: block_8 */ > break > /* succs: block_12 */ > } > block block_11: > /* preds: */ > /* succs: block_5 */ > } > block block_12: > /* preds: block_6 block_10 */ > ... > /* succs: block_1 */ > } r23 is used in calculations of ssa_32 thus when we are calling nir_phi_builder_value_get_block_def with block_9 and r23 in rewrite_src we are trying to find the closest ssa_def by traversing the dominance tree. ssa_def is found in block_5 - the only place where r23 is assigned, however it is NEEDS_PHI since we haven't processed the r23 assignment yet. So empty phi is created, placed in the list and then added to all blocks between block_9 and block_5 in dominance tree. Next, the r23 assignment is processed, nir_phi_builder_value_set_block_def is called with new ssa for this block however the later blocks e.g. block_8 still has empty phi for r23. So when assignment to ssa_47 is processed it fetches the empty phi which corresponds to this block for r23, now ssa_47 uses incorrect source: > block block_5: > /* preds: block_4 block_9 block_11 */ > vec1 32 ssa_32 = phi block_4: ssa_5, block_9: ssa_4294967295, block_11: ssa_168 > vec1 32 ssa_223 = phi block_4: ssa_5, block_9: ssa_4294967295, block_11: ssa_170 > /* succs: block_6 block_7 */ > if ssa_225 { > block block_6: > /* preds: block_5 */ > break > /* succs: block_12 */ > } else { > block block_7: > /* preds: block_5 */ > /* succs: block_8 */ > } > block block_8: > /* preds: block_7 */ > vec1 32 ssa_47 = i2f32 ssa_4294967295 Later in nir_phi_builder_finish "virtual" phi blocks are materialized not taking into account the fact that r23 was assigned. So we get: >block block_0: >vec1 32 ssa_248 = undefined >loop { > block block_1: > vec1 32 ssa_249 = phi block_0: ssa_248, block_12: ssa_223 > > loop { > block block_5: > /* preds: block_4 block_9 block_11 */ > vec1 32 ssa_247 = phi block_4: ssa_249, block_9: ssa_247, block_11: ssa_246 > vec1 32 ssa_32 = phi block_4: ssa_5, block_9: ssa_247, block_11: ssa_168 > vec1 32 ssa_223 = phi block_4: ssa_5, block_9: ssa_255, block_11: ssa_170 > /* succs: block_6 block_7 */ > if ssa_225 { > block block_6: > /* preds: block_5 */ > break > /* succs: block_12 */ > } else { > block block_7: > /* preds: block_5 */ > /* succs: block_8 */ > } > block block_8: > /* preds: block_7 */ > vec1 32 ssa_47 = i2f32 ssa_247 In this last NIR ssa_47 uses the value which is essentially ssa_248 during the first loop iteration and is undefined. I thought that the initial NIR arrangement violates some unwritten rule but commenting out last block in nir_phi_builder_value_get_block_def resulted in a correct shader and no other code has issues with such NIR. I'm unsure if it should take into account the situation described above in some way or if such NIR arrangement shouldn't happen. Good triage work! The real problem here isn't with visiting phi destinations but visiting phi sources. In particular, they don't exist in the block with the phi but at the end of the predecessor block. https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1331 Fixed by the following commit in master: commit 6fb685fe4b762c8030f86895707516e2481e9ece (HEAD -> master, origin/master, origin/HEAD) Author: Jason Ekstrand <jason@jlekstrand.net> Date: Fri Jul 12 11:01:40 2019 -0500 nir/regs_to_ssa: Handle regs in phi sources properly Sources of phi instructions act as if they occur at the very end of the predecessor block not the block in which the phi lives. In order to handle them correctly, we have to skip phi sources on the normal instruction walk and handle them as a separate walk over the successor phis. While registers in phi instructions is a bit of an oddity it can happen when we temporarily go out-of-SSA for control-flow manipulations. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111075 Cc: mesa-stable@lists.freedesktop.org Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.