Bug 111075

Summary: Processing of SPIR-V shader causes device hang, sometimes leading to system reboot
Product: Mesa Reporter: Alastair Donaldson <afdx>
Component: Drivers/Vulkan/intelAssignee: Ian Romanick <idr>
Status: RESOLVED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: danylo.piliaiev, jason
Version: gitKeywords: bisected, regression
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Amber test exposing the problem
output_after_test_running

Description Alastair Donaldson 2019-07-05 10:52:52 UTC
Created attachment 144711 [details]
Amber test exposing the problem

Running the attached test using Amber (https://github.com/google/amber):

amber return-in-loop-in-function.amber

should lead to the test passing.

Instead, a GPU hang is reported, and this sometimes leads to a system reboot.

Build: Mesa 19.2.0-devel (git-243db4980c) (Debug)
Device: Intel(R) HD Graphics 630 (Kaby Lake GT2)

Interestingly, using Mesa 18.0.3, the GPU hang does not occur; instead there is a segmentation fault in libvulkan_intel.so.
Comment 1 Denis 2019-07-05 13:26:36 UTC
Created attachment 144712 [details]
output_after_test_running

hey, looks like you opened pandora box with a lot of issues.
Adding one more bisect result below. (and attaching test output during the hang)
upd - interesting that

mesa-18.0.0 - test crashes
mesa-18.3.0 - test passes
mesa-git - test hangs the system


8fb8ebfbb05d3323481c8ba6d320b3a3580bad99 is the first bad commit
commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue May 22 18:56:41 2018 -0700

    intel/compiler: More peephole select
    
    Shader-db results:
    
    The one shader hurt for instructions is a compute shader that had both
    spills and fills hurt.
Comment 2 Danylo 2019-07-11 10:20:41 UTC
I think I'm close to finding the issue which results in shader using uninitialized memory which in turn leads to hang.
Comment 3 Danylo 2019-07-12 15:05:45 UTC
I'm unable to come up with any seemingly good solution at the moment so here what I have found:

The hang happens due to the usage of uninitialized memory in calculations of loop terminator.
Bisected commit (8fb8ebfbb05d3323481c8ba6d320b3a3580bad99) just makes other optimization go wrong.

The usage of uninitialized memory starts with nir_lower_regs_to_ssa_impl which happens after opt_if_cf_list. During the nir_lower_regs_to_ssa_impl have (pay attention to r23):

> decl_reg vec1 32 r23
> ...
> /* succs: block_1 */
> loop {
>     block block_1:
>     /* preds: block_0 block_12 */
>     ...
>     /* succs: block_5 */
>     loop {
>         block block_5:
>         /* preds: block_4 block_9 block_11 */
>         vec1 32 ssa_32 = phi block_4: ssa_5, block_9: r23, block_11: ssa_168
>         r23 = phi block_4: ssa_5, block_9: r27, block_11: ssa_170
>         ...
>         /* succs: block_6 block_7 */
>         if r25 {
>             block block_6:
>             /* preds: block_5 */
>             ...
>             break
>             /* succs: block_12 */
>         } else {
>             block block_7:
>             /* preds: block_5 */
>             /* succs: block_8 */
>         }
>         block block_8:
>         /* preds: block_7 */
>         vec1 32 ssa_47 = i2f32 r23
>         /* succs: block_9 block_10 */
>         if r26 {
>             block block_9:
>             /* preds: block_8 */
>             r27 = iadd r23, ssa_5
>             continue
>             /* succs: block_5 */
>         } else {
>             block block_10:
>             /* preds: block_8 */
>             break
>             /* succs: block_12 */
>         }
>         block block_11:
>         /* preds: */
>         /* succs: block_5 */
>     }
>     block block_12:
>     /* preds: block_6 block_10 */
>     ...
>     /* succs: block_1 */
> }

r23 is used in calculations of ssa_32 thus when we are calling nir_phi_builder_value_get_block_def with block_9 and r23 in rewrite_src we are trying to find the closest ssa_def by traversing the dominance tree.

ssa_def is found in block_5 - the only place where r23 is assigned, however
it is NEEDS_PHI since we haven't processed the r23 assignment yet. So empty phi
is created, placed in the list and then added to all blocks between block_9 and block_5 in dominance tree.

Next, the r23 assignment is processed, nir_phi_builder_value_set_block_def is called with new ssa for this block however the later blocks e.g. block_8 still has empty phi for r23. So when assignment to ssa_47 is processed it fetches the empty phi which corresponds to this block for r23, now ssa_47 uses incorrect source:

> block block_5:
> /* preds: block_4 block_9 block_11 */
> vec1 32 ssa_32 = phi block_4: ssa_5, block_9: ssa_4294967295, block_11: ssa_168
> vec1 32 ssa_223 = phi block_4: ssa_5, block_9: ssa_4294967295, block_11: ssa_170
> /* succs: block_6 block_7 */
> if ssa_225 {
>     block block_6:
>     /* preds: block_5 */
>     break
>     /* succs: block_12 */
> } else {
>     block block_7:
>     /* preds: block_5 */
>     /* succs: block_8 */
> }
> block block_8:
> /* preds: block_7 */
> vec1 32 ssa_47 = i2f32 ssa_4294967295

Later in nir_phi_builder_finish "virtual" phi blocks are materialized not taking into account the fact that r23 was assigned. So we get:

>block block_0:
>vec1 32 ssa_248 = undefined
>loop {
>	block block_1:
>    vec1 32 ssa_249 = phi block_0: ssa_248, block_12: ssa_223
>
>    loop {
>        block block_5:
>        /* preds: block_4 block_9 block_11 */
>        vec1 32 ssa_247 = phi block_4: ssa_249, block_9: ssa_247, block_11: ssa_246
>        vec1 32 ssa_32 = phi block_4: ssa_5, block_9: ssa_247, block_11: ssa_168
>        vec1 32 ssa_223 = phi block_4: ssa_5, block_9: ssa_255, block_11: ssa_170
>        /* succs: block_6 block_7 */
>        if ssa_225 {
>            block block_6:
>            /* preds: block_5 */
>            break
>            /* succs: block_12 */
>        } else {
>            block block_7:
>            /* preds: block_5 */
>            /* succs: block_8 */
>        }
>        block block_8:
>        /* preds: block_7 */
>        vec1 32 ssa_47 = i2f32 ssa_247

In this last NIR ssa_47 uses the value which is essentially ssa_248 during the first loop iteration and is undefined.

I thought that the initial NIR arrangement violates some unwritten rule but
commenting out last block in nir_phi_builder_value_get_block_def resulted in a correct shader and no other code has issues with such NIR.

I'm unsure if it should take into account the situation described above in some way or if such NIR arrangement shouldn't happen.
Comment 4 Jason Ekstrand 2019-07-12 16:08:10 UTC
Good triage work!  The real problem here isn't with visiting phi destinations but visiting phi sources.  In particular, they don't exist in the block with the phi but at the end of the predecessor block.

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1331
Comment 5 Jason Ekstrand 2019-07-16 23:28:58 UTC
Fixed by the following commit in master:

commit 6fb685fe4b762c8030f86895707516e2481e9ece (HEAD -> master, origin/master, origin/HEAD)
Author: Jason Ekstrand <jason@jlekstrand.net>
Date:   Fri Jul 12 11:01:40 2019 -0500

    nir/regs_to_ssa: Handle regs in phi sources properly
    
    Sources of phi instructions act as if they occur at the very end of the
    predecessor block not the block in which the phi lives.  In order to
    handle them correctly, we have to skip phi sources on the normal
    instruction walk and handle them as a separate walk over the successor
    phis.  While registers in phi instructions is a bit of an oddity it can
    happen when we temporarily go out-of-SSA for control-flow manipulations.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111075
    Cc: mesa-stable@lists.freedesktop.org
    Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.