Bug 110796

Summary: [REGRESSION] [BISECTED] [GLESCTS] race between destruction of types and shader compilation (?)
Product: Mesa Reporter: Andrés Gómez García <agomez>
Component: Drivers/DRI/i965Assignee: Tapani Pälli <lemody>
Status: RESOLVED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: agoldmints, agomez, gpoo+bfdo, jasuarez, lemody, mark.a.janes, t_arceri
Version: gitKeywords: bisected, regression
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 102590    
Attachments: BT for the core running KHR-GLES31.core.gpu_shader5.texture_gather_offsets_color
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat without mesa_glthread
move glsl_type dtor back to atexit
fix

Description Andrés Gómez García 2019-05-29 23:15:21 UTC
Created attachment 144376 [details]
BT for the core running KHR-GLES31.core.gpu_shader5.texture_gather_offsets_color

This is a hard one to explain. Let's try ...

After:

--

commit 624789e3708c87ea2a4c8d2266266b489b421cba (gitlabcom/master)
Author: Tapani Pälli <tapani.palli@intel.com>
Date:   Fri Mar 15 09:47:49 2019 +0200

    compiler/glsl: handle case where we have multiple users for types
    
    Both Vulkan and OpenGL might be using glsl_types simultaneously or we
    can also have multiple concurrent Vulkan instances using glsl_types.
    Patch adds a one time init to track number of users and will release
    types only when last user calls _glsl_type_singleton_decref().
    
    This change fixes glsl_type memory leaks we have with anv driver.
    
    v2: reuse hash_mutex, cleanup, apply fix also to radv driver and
        rename helper functions (Jason)
    
    v3: move init, destroy to happen on GL context init and destroy
    
    Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
    Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

--

We can hit a situation in which we try to get a glsl_type that has been already freed, leading to a SIGSEV.

Timothy seems to have solved this very same problem for radeonsi at:

--

commit a6b7068ff5fbf4694a45a6e07adac5047e574514
Author: Timothy Arceri <tarceri@itsqueeze.com>
Date:   Tue Apr 23 12:54:38 2019 +1000

    st/mesa/radeonsi: fix race between destruction of types and shader compilation
    
    Commit 624789e3708c moved the destruction of types out of atexit() and
    made use of a ref count instead. This is useful for avoiding a crash
    where drivers such as radeonsi are still compiling in a thread when the app
    exits and has not called MakeCurrent to change from the current context.
    
    While the above scenario is technically an app bug we shouldn't crash.
    However that change caused another race condition between the shader
    compilation tread in radeonsi and context teardown functions.
    
    This patch makes two changes to fix this new problem:
    
    First we explicitly call _mesa_destroy_shader_compiler_types() when destroying
    the st context rather than calling it indirectly via _mesa_free_context_data().
    We do this as we must call it after st_destroy_context_priv() so that we don't
    destory the glsl types before the compilation threads finish.
    
    Next wait for the shader threads to finish in si_destroy_context() this
    also means we need to call context destroy before destroying the queues
    in si_destroy_screen().
    
    Fixes: 624789e3708c ("compiler/glsl: handle case where we have multiple users for types")
    
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>

--

Potentially, this problem is not only also present in i965 but in others (?)

--

The conditions in which I've detected this problem are a bit tricky.

While playing with bug 110357, I realized that, after the inclusion of the offending commit above, reverting dacb11a585 was causing systematic SIGSEVs in KBL, SKL, BDW and HSW while running cts-runner for es32 in the working branch.

**Notice that this cannot be reproduce by testing just the problematic test or the CTS configuration in which the test exists. I've only been able to reproduce through running with cts-runner for es32**

The "problematic" tests in which the execution could SIGSEV are:

dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat
KHR-GLES31.core.gpu_shader5.texture_gather_offsets_color

The SIGSEV happens while trying to invoke:

textureGatherOffsets(isampler2D, vec2, ivec2[4], int)

Checking the obtained cores, my guess is that the invokation fails because, while checking the funtion signature to identify the proper function pointer, it fails as it cannot find the "ivec2[4]" array type in the array types hash table.

It cannot find it, because the hast table doesn't exist any more (?) because it has been freed (?).

However, the signature is correct, but mesa tries to print the possible function signature candidates. While doing so, it reaches once again to the array types hash table for the name of the "ivec2[4]" type of the variable defining the signature. As the hash table is bogus already, when trying to print the name of the type, strlen leads us to a SIGSVEV.

So, it really seems like a similar race condition situation as the one Timothy fixed for radeonsi.
Comment 1 Andrés Gómez García 2019-05-29 23:16:57 UTC
Created attachment 144377 [details]
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat
Comment 2 Andrés Gómez García 2019-05-29 23:17:41 UTC
If needed, I can provide a docker image with the needed mesa and VK-GL-CTS versions with which to reproduce this SIGSEV.
Comment 3 Tapani Pälli 2019-05-30 11:07:20 UTC
*sigh* will try to reproduce this
Comment 4 Andrés Gómez García 2019-05-30 13:54:22 UTC
(In reply to Andrés Gómez García from comment #0)

> **Notice that this cannot be reproduce by testing just the problematic test
> or the CTS configuration in which the test exists. I've only been able to
> reproduce through running with cts-runner for es32**

With VK-GL-CTS compiled against Wayland.
Comment 5 Tapani Pälli 2019-06-07 06:37:34 UTC
Andrés, could you try if reverting following commit has any effect on this:

https://gitlab.freedesktop.org/mesa/mesa/commit/dca36d5516d0fdaf012b4476975c5d585c2d1a09

(or maybe just disabling gl thread option)
Comment 6 Andrés Gómez García 2019-06-28 19:10:50 UTC
(In reply to Tapani Pälli from comment #5)
> Andrés, could you try if reverting following commit has any effect on this:
> 
> https://gitlab.freedesktop.org/mesa/mesa/commit/
> dca36d5516d0fdaf012b4476975c5d585c2d1a09
> 
> (or maybe just disabling gl thread option)

mesa_glthread is not activated by default.

Still, I've also reverted that commit and tried again and I can confirm that the problem still is reproducible.
Comment 7 Andrés Gómez García 2019-06-28 19:12:32 UTC
Created attachment 144677 [details]
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat without mesa_glthread
Comment 8 Andrés Gómez García 2019-06-30 16:12:03 UTC
(In reply to Andrés Gómez García from comment #0)
> The "problematic" tests in which the execution could SIGSEV are:
> 
> dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.
> size_pot.clamp_to_edge_repeat

I can also replicate this problem with Mesa master, and applying Ken's two-configs fix attempt for bug 110357:
https://gitlab.freedesktop.org/kwg/mesa/tree/pb565-two-configs

> KHR-GLES31.core.gpu_shader5.texture_gather_offsets_color
Comment 9 Andrés Gómez García 2019-07-12 13:02:16 UTC
*** Bug 111114 has been marked as a duplicate of this bug. ***
Comment 10 Mark Janes 2019-07-15 18:26:47 UTC
In my opinion, this should have blocked the 19.1 release.

I'm altering the title to make it clear that this prevents i965 from being compliant with the gles cts.

Intel CI is an effective sieve for unfixed bisected regressions, and prevents them from going out in a release.  However bugs (like this one) which are not seen in CI have no mechanism that ensures they block the next release.

The bug originator is unlikely to remember that the bug is unfixed and add it to the release tracker.  The release manager may search in bugzilla for unfixed bugs affecting the release, but will find a lot of noise in the results.

Juan, do you have any suggestions for how we can catch these bugs in the release process?  I would suggest relying on the "bisected, regression" keywords, but this list seems too long for me:

https://bugs.freedesktop.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=bisected%2C%20regression%2C%20&keywords_type=allwords&list_id=676302&order=bug_id%20DESC&product=Mesa&query_format=advanced

We should be marking some of those bugs as WONTFIX etc.
Comment 11 Juan A. Suarez 2019-07-17 11:41:57 UTC
(In reply to Mark Janes from comment #10)
> 
> 
> Juan, do you have any suggestions for how we can catch these bugs in the
> release process?  I would suggest relying on the "bisected, regression"
> keywords, but this list seems too long for me:
> 
> https://bugs.freedesktop.org/buglist.
> cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=bisected%
> 2C%20regression%2C%20&keywords_type=allwords&list_id=676302&order=bug_id%20DE
> SC&product=Mesa&query_format=advanced
> 


Maybe we can short the list if we search for all the bugs created since the latest branch point (for 19.1, they would be all the tests created after 19.0 branchpoint). The rationale is that all the previous bugs didn't block 19.0, so they shouldn't block 19.1.
Comment 12 Mark Janes 2019-07-17 15:40:05 UTC
(In reply to Juan A. Suarez from comment #11)
> (In reply to Mark Janes from comment #10)
> > https://bugs.freedesktop.org/buglist.
> > cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=bisected%
> > 2C%20regression%2C%20&keywords_type=allwords&list_id=676302&order=bug_id%20DE
> > SC&product=Mesa&query_format=advanced
> 
> Maybe we can short the list if we search for all the bugs created since the
> latest branch point (for 19.1, they would be all the tests created after
> 19.0 branchpoint). The rationale is that all the previous bugs didn't block
> 19.0, so they shouldn't block 19.1.

Makes sense to me.  Auditing new bugs before release will improve the quality of our releases.
Comment 13 Andrés Gómez García 2019-07-18 11:27:08 UTC
(In reply to Mark Janes from comment #10)
> In my opinion, this should have blocked the 19.1 release.

I think you are right.

> I'm altering the title to make it clear that this prevents i965 from being
> compliant with the gles cts.

Thanks!
 
> The bug originator is unlikely to remember that the bug is unfixed and add
> it to the release tracker.  The release manager may search in bugzilla for
> unfixed bugs affecting the release, but will find a lot of noise in the
> results.

Unfortunately, at the time of reporting it was unclear to me whether this was a regression of just a problem happening in a very specific condition. Notice that I was not using master directly but a patched master :(
Comment 14 Tapani Pälli 2019-07-29 11:16:31 UTC
Andrés, have you been able to reproduce this using 'deqp-gles31' directly instead of 'cts-runner'? I'm asking because I'm experiencing weird issues with 'cts-runner', for some reason Mesa loader goes nuts and refuses to load or find driver, this seems to happen when 'cts-runner' is going through EGL module tests.
Comment 15 Tapani Pälli 2019-07-29 12:01:16 UTC
(In reply to Tapani Pälli from comment #14)
> Andrés, have you been able to reproduce this using 'deqp-gles31' directly
> instead of 'cts-runner'? I'm asking because I'm experiencing weird issues
> with 'cts-runner', for some reason Mesa loader goes nuts and refuses to load
> or find driver, this seems to happen when 'cts-runner' is going through EGL
> module tests.

OK, got that solved. Now using 'cts-runner' and trying to reproduce.
Comment 16 Tapani Pälli 2019-07-30 06:34:16 UTC
(In reply to Andrés Gómez García from comment #2)
> If needed, I can provide a docker image with the needed mesa and VK-GL-CTS
> versions with which to reproduce this SIGSEV.

Which versions should be used to reproduce? I cannot seem to reproduce this with:

vk-gl-cts: 7f6762b95232c0959f6f96f88fe91391c773d960
mesa: 6fc7384fd44f0b42c6decac4468bba06b28a8186

I've configured vk-gl-cts for wayland and running this under Weston. I've tried running with vblank_mode=0 and without. For the run I've edited txt files so that the run starts near texture.gather tests for quicker reproduce.
Comment 17 Mark Janes 2019-07-30 09:22:40 UTC
Tapani: build the gles 3.2.5 branch of the cts:

afb0b4b17f0969c7e9353b622a02658877cb0e7e
Suppress Clang 7 self-assignment warnings
Comment 18 Tapani Pälli 2019-07-30 10:18:22 UTC
Created attachment 144911 [details] [review]
move glsl_type dtor back to atexit

Here's something that could be the solution. Having said that, I'm not still clear what the actual issue is :/ But will attempt to dig and debug more.
Comment 19 Lionel Landwerlin 2019-07-30 11:21:47 UTC
(In reply to Tapani Pälli from comment #16)
> (In reply to Andrés Gómez García from comment #2)
> > If needed, I can provide a docker image with the needed mesa and VK-GL-CTS
> > versions with which to reproduce this SIGSEV.
> 
> Which versions should be used to reproduce? I cannot seem to reproduce this
> with:
> 
> vk-gl-cts: 7f6762b95232c0959f6f96f88fe91391c773d960
> mesa: 6fc7384fd44f0b42c6decac4468bba06b28a8186
> 
> I've configured vk-gl-cts for wayland and running this under Weston. I've
> tried running with vblank_mode=0 and without. For the run I've edited txt
> files so that the run starts near texture.gather tests for quicker reproduce.

Tapani & I are failing to reproduce. What distribution are you running this on?
Comment 20 Lionel Landwerlin 2019-07-30 17:19:30 UTC
Can get some hints in valgrind now :

==26855== Invalid read of size 8
==26855==    at 0x6192980: prototype_string(glsl_type const*, char const*, exec_list*) (ast_function.cpp:92)
==26855==    by 0x619440C: print_function_prototypes(_mesa_glsl_parse_state*, YYLTYPE*, ir_function*) (ast_function.cpp:777)
==26855==    by 0x61945CE: no_matching_function_error(char const*, YYLTYPE*, exec_list*, _mesa_glsl_parse_state*) (ast_function.cpp:812)
==26855==    by 0x6198F77: ast_function_expression::hir(exec_list*, _mesa_glsl_parse_state*) (ast_function.cpp:2386)
==26855==    by 0x619C056: ast_expression::do_hir(exec_list*, _mesa_glsl_parse_state*, bool) (ast_to_hir.cpp:1408)
==26855==    by 0x619BE9A: ast_expression::hir_no_rvalue(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:1310)
==26855==    by 0x619F18F: ast_expression_statement::hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:2236)
==26855==    by 0x619F202: ast_compound_statement::hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:2252)
==26855==    by 0x61A6FFD: ast_function_definition::hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:6223)
==26855==    by 0x6199A07: _mesa_ast_to_hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:159)
==26855==    by 0x60E0BE9: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2157)
==26855==    by 0x5DBFCE0: _mesa_compile_shader (shaderapi.c:1200)
==26855==  Address 0x13a78550 is 16 bytes inside a block of size 224 free'd
==26855==    at 0x48369AB: free (vg_replace_malloc.c:540)
==26855==    by 0x5D00BC8: unsafe_free (ralloc.c:299)
==26855==    by 0x5D00B8D: unsafe_free (ralloc.c:292)
==26855==    by 0x5D00B8D: unsafe_free (ralloc.c:292)
==26855==    by 0x5D00AB2: ralloc_free (ralloc.c:262)
==26855==    by 0x60E163B: glsl_symbol_table::operator delete(void*) (glsl_symbol_table.h:43)
==26855==    by 0x60E0DAF: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2191)
==26855==    by 0x5DBFCE0: _mesa_compile_shader (shaderapi.c:1200)
==26855==    by 0x5DC0D55: _mesa_CompileShader (shaderapi.c:1569)
==26855==    by 0x12DDA16: glcts::TestCaseBase::buildProgramVA(unsigned int, bool*, unsigned int, ...) (esextcTestCaseBase.cpp:424)
==26855==    by 0x12DE1C9: glcts::TestCaseBase::buildProgram(unsigned int, unsigned int, unsigned int, char const* const*, unsigned int, unsigned int, char const* const*, unsigned int, unsigned int, char const* const*, bool*) (esextcTestCaseBase.cpp:581)
==26855==    by 0x1393DAE: glcts::GeometryShaderFlatInterpolationTest::initProgram() (esextcGeometryShaderQualifiers.cpp:181)
==26855==  Block was alloc'd at
==26855==    at 0x483577F: malloc (vg_replace_malloc.c:309)
==26855==    by 0x5D00665: ralloc_size (ralloc.c:119)
==26855==    by 0x5D00721: rzalloc_size (ralloc.c:151)
==26855==    by 0x5FBF655: exec_node::operator new(unsigned long, void*) (list.h:58)
==26855==    by 0x60C806C: (anonymous namespace)::builtin_variable_generator::add_const(char const*, int, int) (builtin_variables.cpp:655)
==26855==    by 0x60C7968: (anonymous namespace)::builtin_variable_generator::add_const(char const*, int) (builtin_variables.cpp:454)
==26855==    by 0x60C8813: (anonymous namespace)::builtin_variable_generator::generate_constants() (builtin_variables.cpp:817)
==26855==    by 0x60CB436: _mesa_glsl_initialize_variables(exec_list*, _mesa_glsl_parse_state*) (builtin_variables.cpp:1543)
==26855==    by 0x619993E: _mesa_ast_to_hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:131)
==26855==    by 0x60E0BE9: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2157)
==26855==    by 0x5DBFCE0: _mesa_compile_shader (shaderapi.c:1200)
==26855==    by 0x5DC0D55: _mesa_CompileShader (shaderapi.c:1569)
==26855==
Comment 21 Mark Janes 2019-07-30 17:57:34 UTC
(In reply to Tapani Pälli from comment #18)
> Created attachment 144911 [details] [review] [review]
> move glsl_type dtor back to atexit
> 
> Here's something that could be the solution. Having said that, I'm not still
> clear what the actual issue is :/ But will attempt to dig and debug more.

Tapani: with this patch, I can't reproduce the crash.  In my environment, the crash reproduces immediately without your patch.

Andrés: can you try the attachment to see if it resolves the bug in your environment?
Comment 22 Anuj Phogat 2019-07-30 22:26:30 UTC
(In reply to Tapani Pälli from comment #18)
> Created attachment 144911 [details] [review] [review]
> move glsl_type dtor back to atexit
> 
> Here's something that could be the solution. Having said that, I'm not still
> clear what the actual issue is :/ But will attempt to dig and debug more.

I can't reproduce anymore after applying this patch. I'm using X11 in Fedora.
Comment 24 Tapani Pälli 2019-07-31 04:39:17 UTC
(In reply to Lionel Landwerlin from comment #23)
> https://gitlab.freedesktop.org/mesa/piglit/merge_requests/103

Reproduced! \o/ I guess ideally we would solve this without using atexit(), I'll try to see if there would be another way.
Comment 25 Tapani Pälli 2019-07-31 06:02:45 UTC
I have another alternative fix, will do some proper testing and CI
Comment 26 Tapani Pälli 2019-07-31 07:43:38 UTC
Created attachment 144916 [details] [review]
fix

Here's the fix, needs still some careful testing.
Comment 27 Anuj Phogat 2019-07-31 16:48:27 UTC
(In reply to Tapani Pälli from comment #26)
> Created attachment 144916 [details] [review] [review]
> fix
> 
> Here's the fix, needs still some careful testing.

Tested the fix. Don't get the segfault in cts with patch.
Comment 28 Tapani Pälli 2019-08-07 10:57:50 UTC
This is now resolved in staging/19.1 branch (meaning next 19.1 release) with following commit. There will be a different fix for master branch.

------- 8< ------
Author: Tapani Pälli <tapani.palli@intel.com>
Date:   Tue Aug 6 13:10:10 2019 +0300

    mesa: add glsl_type ref to one_time_init and decref to atexit
    
    This fixes problems spotted within vk-gl-cts. Problem is that the builtin
    functions refer to types and we should not release types before builtins
    are released.
    
    Fixes: 624789e3708c ("compiler/glsl: handle case where we have multiple users for types")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110796
    Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
    Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Comment 29 Andrés Gómez García 2019-08-08 11:26:16 UTC
(In reply to Mark Janes from comment #21)
> (In reply to Tapani Pälli from comment #18)
> > Created attachment 144911 [details] [review] [review] [review]
> > move glsl_type dtor back to atexit
> > 
> > Here's something that could be the solution. Having said that, I'm not still
> > clear what the actual issue is :/ But will attempt to dig and debug more.
> 
> Tapani: with this patch, I can't reproduce the crash.  In my environment,
> the crash reproduces immediately without your patch.
> 
> Andrés: can you try the attachment to see if it resolves the bug in your
> environment?

Tested the patch with my original environment and I can confirm it is fixing the SIGSEV.

I also tested against the newly added piglit test at https://gitlab.freedesktop.org/mesa/piglit/merge_requests/103 and I can also confirm that it solves the issue.
Comment 30 Tapani Pälli 2019-08-26 06:25:03 UTC
Fixed in master by:

--- 8< ---
commit e4da8b9c331cc3ae1b86b3a7860231e202463db0
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date:   Wed Jul 31 12:12:10 2019 +0300

    mesa/compiler: rework tear down of builtin/types
    
    The issue we're running into when running CTS is that glsl types are
    deleted while builtins depending on them are not.
    
    This happens because on one hand we have glsl types ref counted, but
    builtins are not. Instead builtins are destroyed when unloading libGL
    or explicitly calling glReleaseShaderCompiler().
    
    This change removes almost entirely any dealing with glsl types
    ref/unref by letting the builtins deal with it instead. In turn we
    introduce a builtin ref count mechanism. Each GL context takes a
    reference on the builtins when compiling a shader for the first time.
    It releases the reference when the context is destroyed. It can also
    explicitly release those when glReleaseShaderCompiler() is called.
    
    Finally we also take a reference on the glsl types when loading libGL
    to avoid recreating glsl types too often.
    
    v2: Ensure we take a reference if we don't have one in link step (Lionel)
    
    Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110796
    Reviewed-by: Eric Anholt <eric@anholt.net>
    Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.