Summary: | Regression: GLB 2.7 & Glmark-2 GLES versions segfault due to linker precision error (259fc505) on dead variable | ||
---|---|---|---|
Product: | Mesa | Reporter: | Eero Tamminen <eero.t.tamminen> |
Component: | glsl-compiler | Assignee: | Ian Romanick <idr> |
Status: | VERIFIED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | tfiga |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
See Also: |
https://github.com/glmark2/glmark2/issues/25 https://bugs.freedesktop.org/show_bug.cgi?id=97804 |
||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Bug Depends on: | |||
Bug Blocks: | 103491 | ||
Attachments: |
patch
Patch for environment variable that turns error into warning A patch that implements a drirc option |
Description
Eero Tamminen
2016-08-29 10:46:07 UTC
My GLBenchmark 2.7 egypt/trex automation continues to provide results. Based on this bug report, I would expect it to crash. Can you provide the invocation of the crashing benchmark? I'd like to reproduce. For example: ./build_x86_64/binaries/GLBenchmark -data data -w 1366 -h 768 -t GLB27_TRex_C24Z16_FixedTimeStep But basically anything should crash right at start when GLB 2.7 compiles all of its shaders. Maybe you have somehow different version of GLB 2.7. If you use MESA_SHADER_DUMP_PATH to get shaders out of GLB, do you see the indicated variable in shaders: $ grep fragmentColorVP * FS_d049e6b822a18cb092e02530736ad6229104e9bb.glsl:uniform vec3 fragmentColorVP; FS_d049e6b822a18cb092e02530736ad6229104e9bb.glsl: gl_FragColor = tex * vec4(fragmentColorVP, 1.0); VS_51d5e7cdfb8adc088cc5592d75b01fea7bf0ee69.glsl:uniform vec3 fragmentColorVP; ? My copy of GLB 2.7 continues to work fine as well, both in GL and ES builds. The shaders in my copy do not contain that code. (In reply to Eero Tamminen from comment #0) > Question: Is this linker error relevant/appropriate for GLES v2.x, or should > it be generated only for v3.x? GLSL ES 1.0.17 states that precision qualifiers for uniforms must match (in section 10 Issues) so looks appropriate. (In reply to Tapani Pälli from comment #4) > (In reply to Eero Tamminen from comment #0) > > Question: Is this linker error relevant/appropriate for GLES v2.x, or should > > it be generated only for v3.x? > > GLSL ES 1.0.17 states that precision qualifiers for uniforms must match (in > section 10 Issues) so looks appropriate. having said that, when linking we should know if uniform was used at all by the earlier shader stage and could utilize this information to print only warning in this case .. should first verify if this would be ok by the spec and check other vendor implementations how they behave in such situation (In reply to Kenneth Graunke from comment #3) > My copy of GLB 2.7 continues to work fine as well, both in GL and ES builds. > The shaders in my copy do not contain that code. Because GfxBench T-Rex version doesn't contain that either, I guess your GLB 2.7 version is newer than the GLB 2.7 we're using. GLB test-cases shaders contain huge amounts of dead code, so it's not surprise if some of that is removed in minor version updates (some other driver may have complained about this issue also). If your GLB 2.7 is also an official version, we can update GLB 2.7. I think the error is still a bug though. (I was wondering why the dead uniform hadn't been removed before linking, but Tapani said any shader stage can define uniform location used in the shader stages. I.e. I guess info for all declared uniforms is retained to linking stage in case other stages were missing location info.) Created attachment 126156 [details] [review] patch Here's somewhat experimental patch to make it warning instead of error if existing uniform was not read/written in the validate globals check. You have to enforce language rules regardless of whether or not code is used. Nobody would suggest that a C compiler should compile: void main() { int x = "lol, who needs rules?"; return 0; } just because 'x' isn't used. That's crazy. (In reply to Kenneth Graunke from comment #8) > You have to enforce language rules regardless of whether or not code is used. > > Nobody would suggest that a C compiler should compile: > > void main() > { > int x = "lol, who needs rules?"; > return 0; > } > > just because 'x' isn't used. That's crazy. Except that's legal C. :) Either way... I added that check specifically for a CTS test. If, as Tapani says, the same text exists in earlier versions of GLSL ES, then we should keep the err. (In reply to Kenneth Graunke from comment #8) > You have to enforce language rules regardless of whether or not code is used. > > Nobody would suggest that a C compiler should compile: > > void main() > { > int x = "lol, who needs rules?"; > return 0; > } > > just because 'x' isn't used. That's crazy. agreed, I sent the patch only because spec language is a bit vague (surprise!), it says: "If uniforms are used in both the vertex and fragment shaders, developers should be warned if the precisions are different. Conversion of precision should never be implicit. ... blah blah .. RESOLUTION: Yes, precision qualifiers for uniforms must match." so ... conversion of precision should never happen implicitly, and precision qualifiers _must_ match, however developers should be 'warned' (In reply to Kenneth Graunke from comment #8) > You have to enforce language rules regardless of whether or not code is used. > > Nobody would suggest that a C compiler should compile: > > void main() > { > int x = "lol, who needs rules?"; > return 0; > } > > just because 'x' isn't used. That's crazy. C-code variant of this would be: ----------- foo.c ----------- #include <stdint.h> extern int32_t x; ----------- bar.c ----------- #include <stdint.h> extern int64_t x; void do_something_with_x(...) { ... } ----------------------------- And question would be whether C-linker should warn that global variable 'x' has been declared with different type/size in different objects, because program where this library would be linked can only be providing one type. Whether things work in above example, depends only on whether the int64_t type matches the (at this moment unknown) "external" type, because that's the only type that's actually used by the code. Usage here (in above C-code, and Egypt shaders) isn't dependent on whether the code is optimized, because the other declared type is NOT used at all, it's just declared. Spec talks only about mismatch in use, not definition. One can say that linker doesn't know whether the reason why uniform isn't used in the linked code is result of shader stage optimization or not, i.e. is it in use in unoptimized code. That's why it must definitely be warned about. However, only the case where different types are actually in use, is the only case where it's certain that it's not going to work -> clearly an error where program should be stopped. The reason why I don't think there should be an error about the first case is that it can easily be a regression on fully functional programs (like here), and those regressions will go unnoticed because Mesa's regular testing is mainly for GL, there's not much testing for GLES (WebGL etc). (In reply to Eero Tamminen from comment #11) > The reason why I don't think there should be an error about the first case > is that it can easily be a regression on fully functional programs (like > here), and those regressions will go unnoticed because Mesa's regular > testing is mainly for GL, there's not much testing for GLES (WebGL etc). That's not true at all...we run dEQP and the ES conformance suite on every commit. That's something like 90,000 GLES-only tests, which is definitely tests than we run for GL. You have a broken application. If the latest release works, you should upgrade. If it doesn't work, we should contact Kishonti and ask them to fix it. It's a simple fix and I believe they would be responsive. Eero is right, the GLES version of GLBenchmark breaks. The GL version works fine. I double checked the rules and Ian's patch is absolutely correct - the ES 3.2 spec says: "Uniforms in shaders all share a single global name space when linked into a program or separable program. Hence, the types, precisions and any location specifiers of all declared uniform variables with the same name must match across shaders that are linked into a single program." In addition to the uniform not actually being used in the vertex shader...neither declaration is actually qualified with a specific precision. It's just using the default precision in both cases. One thought would be to enforce the "must match" rule when explicitly qualified, and take the maximum if all declarations are inheriting a default precision. (I'm trying to think if there's some weasily way this could actually be legal - it seems surprising that this app has been out for 3 years and no other ES vendor has hit this bug...) Or perhaps we just need to add a driconf option to either resort to "take the max" or just force everything to highp (like in desktop GL). Ian, thoughts? (In reply to Kenneth Graunke from comment #12) > That's not true at all...we run dEQP and the ES conformance suite on every > commit. That's something like 90,000 GLES-only tests, which is definitely > tests than we run for GL. Sorry, I was unclear. I wasn't thinking of API & unit test-cases. Such tests are naturally maintained & fixed (Mesa commit breaking GLB was done to fix conformance test). By "use-cases" I meant real world GLES apps, games and benchmarks, things that users may be using years after buying them and their maintenance having stopped. E.g. stuff in Android & ChromeOS app stores that's still sold but not anymore maintained, or something like GLB 2.7. (In reply to Kenneth Graunke from comment #13) > Eero is right, the GLES version of GLBenchmark breaks. The GL version works > fine. > > I double checked the rules and Ian's patch is absolutely correct - the ES > 3.2 spec says: > > "Uniforms in shaders all share a single global name space when > linked into a program or separable program. Hence, the types, > precisions and any location specifiers of all declared uniform > variables with the same name must match across shaders that > are linked into a single program." 3.1 & 3.2 specs are explicit that declarations must match. 3.0 was a bit more vague whether this applies to non-active uniform declarations. 2.0 spec doesn't say anything about matching, just a bit about whether uniforms are active or not. [...] > - it seems surprising that this app has been out for 3 years and no > other ES vendor has hit this bug...) GLB is GLES 2.x, and the matching requirement is only in GLES 3.x spec. > Or perhaps we just need to add a driconf option to either resort to "take > the max" or just force everything to highp (like in desktop GL). Or give warnings for apps requesting GLES 2.x context, when both of the compared stage's uniforms aren't used, and error only when both are in use, or context is GLES 3.x+? Tricky thing here is that AFAIK Mesa gives GLES 3.x context also for apps requesting 2.x one (which also causes them to be much slower when they use 1x1 cube maps, due to GLES 3.x sampling rule about cube map corners). Glmark2 GLES version in latest Ubuntu (16.04) also segfaults to this in following subtests: - ideas (modelview uniform declared also in FS but not used) - jellyfish (uCurrentTime uniform used both in VS & FS) - terrain (viewMatrix uniform declared also in FS but not used) For example: ------------------------- $ apt-get install glmark2-es2 $ glmark2-es2 -b terrain ... Error: Failed to link program created from files None and None: error: declarations for uniform `viewMatrix` have mismatching precision qualifiers Segmentation fault (core dumped) ------------------------- Like GLB, it's GLES v2 program. Ideas & terrain can be run by fixing Tapani's patch to check whether uniform is unused in either of the stages (not just VS), but jellyfish test is using the uniform in both stages with different precision, so it has a real bug (which might show on backends that implement FP16 support). Glmark2 is open source so it can be (eventually) fixed: https://github.com/glmark2/glmark2 But I don't think this and GLB are the only GLES 2.x programs with the problem. Filed bug against glmark2 and checked the conformance test suite. CTS tests check that also unused uniforms precision check fails and has check also for GLSL_VERSION_100_ES. I.e. giving just warning on unused declarations, or with GLES <v3 won't fly. I guess this needs another drirc/env setting, which is specified in drirc e.g. for "GLBenchmark" and "glmark2-es2" binaries. My proposal for setting name would be "linker_precision_check", which defaults to true. Upstream fixed glmark2-es2 test-suite shader issues. Jellyfish one turned out to be bug in Mesa precision handling (see bug 97804). Created attachment 126582 [details]
Patch for environment variable that turns error into warning
I experimented with adding drirc option for this, but had no good solution for that [1]. So I opted for just adding an environment variable. Does that seem OK?
[1] Compiler part of Mesa didn't have any drirc stuff yet, so it would need drirc parsing somewhere higher, then either adding member to shader struct for the option, or propagating options stuff along function calls. Or making some drirc options global available.
I have working patch also of last "solution", but it's really awful (couldn't decide where options are initially read/cached, so I stuck it into library constructor, but it seems that xmlconfig.c gets included into a lot libs generated by Mesa...). I can provide it for entertainment value if requested.
With Mesa bug 97804 and glmark2 upstream code fixed, GLB 2.7 is the only currently known test-suite/program this affects on *Linux desktop* [1]. Kenneth marked this early on as "NOTOURBUG" which is valid, but not necessarily best option. GLB 2.7 is still used (despite its problems, it's probably most widely used GLES 2.x benchmark) and Mesa now enforcing GLES 3.x error for harmless & earlier accepted inconsistency in GLES 2.0 programs, is a functionality regression (although Intel backend doesn't yet even support FP16). Kenneth, I'm opening this to get (your or somebody else) opinion on adding user workaround for this, e.g. the attached trivial patch. [1] Linux desktop has only handful of GLES programs (and most of them compositors i.e. not really using complex shaders). There could be other regressing cases on ChromeOS (WebGL www-pages) or Android. Eero, we talked about this bug in our weekly meeting today, and agreed that it should be fixed as soon as possible, and put into the 13.0.1 release. Everyone agreed that your fix was the right thing to do, but that it should be put behind a drirc option, so regular users benefit without knowing about an environment variable. If you can put your patch behind a drirc option, then it will be reviewed and merged. If you don't want to do that, we can assign it to someone else. You've put a lot of work into this bug, and I'd like for you to get the credit for it in the git history. (In reply to Mark Janes from comment #20) > Eero, we talked about this bug in our weekly meeting today, and agreed that > it should be fixed as soon as possible, and put into the 13.0.1 release. > Everyone agreed that your fix was the right thing to do, but that it should > be put behind a drirc option, so regular users benefit without knowing about > an environment variable. > > If you can put your patch behind a drirc option, then it will be reviewed > and merged. If you don't want to do that, we can assign it to someone else. I think that should be done by somebody who has better opinion on what's best place in the compiler side to initialize drirc stuff, and what's the best place to pass the info from there to the linker, than me. Even better would be refactoring all the separate places in Mesa where drirc is initialized sp that it's done only in one place, but that would be a *lot* more work (Martin had some initial patches for that from long ago, but he gave up on it)... Just discovered another reason I didn't notice this bug: this shader is only used for the loading screen. If you run it with -skip_load_frames, it works fine. Created attachment 127853 [details] [review] A patch that implements a drirc option Here's a patch that implements a drirc option. We implement the specification. In cases where the specification does not match the reality of shipping applications and shipping implementations, we work to change the specification. The only application that does not work is one that has been abandoned by the developer because it was replaced by a newer version that does work. Moreover, there is a workaround to allow the older version to work. (In reply to Eero Tamminen from comment #19) > [1] Linux desktop has only handful of GLES programs (and most of them > compositors i.e. not really using complex shaders). There could be other > regressing cases on ChromeOS (WebGL www-pages) or Android. I suspect this will not be a problem for WebGL. WebGL conformance is even more picky about such restrictions that the native API CTS. There have been no reports of Android applications failing due to this change. Should there be any problems with Android applications, they will be handled on a case-by-case basis. (In reply to Kenneth Graunke from comment #22) > Just discovered another reason I didn't notice this bug: this shader is only > used for the loading screen. If you run it with -skip_load_frames, it works > fine. Thanks, I didn't know about that, it indeed works fine. (In reply to Kenneth Graunke from comment #23) > Created attachment 127853 [details] [review] [review] > A patch that implements a drirc option > > Here's a patch that implements a drirc option. Thanks, this works fine too, and the patch is much smaller than I expected! Comment added at the end of patch would need to be updated before commiting. This same override would be needed also for currently available glmark2-es2 releases (fixes for the bugs are only in glmark2 git repo). (In reply to Ian Romanick from comment #24) > (In reply to Eero Tamminen from comment #19) > > [1] Linux desktop has only handful of GLES programs (and most of them > > compositors i.e. not really using complex shaders). There could be other > > regressing cases on ChromeOS (WebGL www-pages) or Android. > > I suspect this will not be a problem for WebGL. WebGL conformance is even > more picky about such restrictions that the native API CTS. > > There have been no reports of Android applications failing due to this > change. Should there be any problems with Android applications, they will > be handled on a case-by-case basis. With the GLB workaround option available, this is fine for me. If there are other buggy proprietary apps, Kenneth's patch can be applied. -> Verified --- We just don't know whether there are other problematic apps: * Different WebGL scripts on random www-pages aren't conformance tested. They can have this bug if they're developed against Browsers & 3D driver versions that haven't (had) this check * AFAIK there aren't Android devices shipping with Mesa 13.0 (which is the first version with this issue), so there cannot be any reports of issues yet * I didn't find (by googling) any reports of GLB 2.x failing with any other driver (and it works fine with GLES version of Intel Windows driver) * Glmark2 had the same issue and nobody had noticed anything on Linux or Android with other drivers (at least I was the first one to file a bug about it) this issue is hit by 'Gears4Android', if we will have some more interesting apps hitting it then we will likely implement workaround like in comment #23 Hi, several Android apps are affected besides Gears, as examples: Rajawali and Dragonball Z Dokkan Battle featuring black screens like Gears, because of linker_error() As a Proof of Concept linker_error() was replaced by linker_warning() and Rajawali and Dragonball Z Dokkan Battle are working again. For Android drirc option is unpractical Mauro (In reply to Mauro Rossi from comment #27) > several Android apps are affected besides Gears, > as examples: Rajawali and Dragonball Z Dokkan Battle > featuring black screens like Gears, because of linker_error() > > As a Proof of Concept linker_error() was replaced by linker_warning() > and Rajawali and Dragonball Z Dokkan Battle are working again. Was it enough to replace error with warning only in the case where the mismatching variable declarations were unused, or did you need to disable error also for variables that were actually used? This is also affecting Forge of Empires on Android. With Kenneth's patch and the drirc option enabled for <application name="Default"> the game starts working again. (In reply to Tomasz Figa from comment #29) > This is also affecting Forge of Empires on Android. With Kenneth's patch and > the drirc option enabled for <application name="Default"> the game starts > working again. IMHO it would be good the check whether the problem is just with a dead variable, or is there a real functional issue with the shader. (If other drivers have accepted latter, they're pretty broken too...) If env vars work with Android, you can do that either by using my original patch: https://bugs.freedesktop.org/attachment.cgi?id=126582 Otherwise you could add the check from it to Kenneth's: if ((existing->data.used && var->data.used) -> still error else -> warning Or dump & inspect the shader code directly. (In reply to Eero Tamminen from comment #30) > (In reply to Tomasz Figa from comment #29) > > This is also affecting Forge of Empires on Android. With Kenneth's patch and > > the drirc option enabled for <application name="Default"> the game starts > > working again. > > IMHO it would be good the check whether the problem is just with a dead > variable, or is there a real functional issue with the shader. > [...] > > Or dump & inspect the shader code directly. I have a full trace captured with GAPID, which includes shaders code. Can't share here due to potential legal issues, but I can share my observations: - vertex shader starts with "precision highp float", - fragment shader starts with "precision mediump float", - both shaders define "uniform mat4 M1" after that, - both shaders define "uniform mat4 M2" after that, - "M1" is not used in either of the shaders, - "M2" is used only in vertex shader, - the link fails with "error: declarations for uniform `M1` have mismatching precision qualifiers" (In reply to Tomasz Figa from comment #31) > (In reply to Eero Tamminen from comment #30) > > (In reply to Tomasz Figa from comment #29) > > > This is also affecting Forge of Empires on Android. With Kenneth's patch and > > > the drirc option enabled for <application name="Default"> the game starts > > > working again. > > > > IMHO it would be good the check whether the problem is just with a dead > > variable, or is there a real functional issue with the shader. > > > [...] > > > > Or dump & inspect the shader code directly. > > I have a full trace captured with GAPID, which includes shaders code. Can't > share here due to potential legal issues, but I can share my observations: > > - vertex shader starts with "precision highp float", > - fragment shader starts with "precision mediump float", > - both shaders define "uniform mat4 M1" after that, > - both shaders define "uniform mat4 M2" after that, > - "M1" is not used in either of the shaders, > - "M2" is used only in vertex shader, > - the link fails with "error: declarations for uniform `M1` have mismatching > precision qualifiers" Replying myself, looks like the application is buggy or we are misunderstanding the spec (my understanding is exactly as found earlier in thhis bug). Ok, so it's the same case as with GLB (and git version of fixed Glmark), the shaders are violating the spec, but it's not a functionality problem because variables in question are just declared with conflicting precision, but not used with conflicting precision. It would be interesting to know whether the other drivers that allow this behavior, would fail when variables with mismatched precision are actually used across shader stages. And does that behavior change depending on whether application is using GLES 2.x or GLES 3.x context (GLES 2.x spec didn't require this check, it came afterwards, but Mesa promotes all GLES 2.x contexts to 3.x). The case seems to be developing further and it turns out some versions of GLES implementation for ARM Mali also prohibits such behavior: https://github.com/cocos2d/cocos2d-x/issues/17256 It's likely that the engine gets fixed, but I wonder if app developers upgrade their apps, though. By the way, it is not true that GLES 2.x didn't require this check, at least according to following thread: https://github.com/AnalyticalGraphicsInc/cesium/issues/817 It refers to GLSL ES 1.00 specification (http://www.khronos.org/registry/OpenGL/specs/es/2.0/GLSL_ES_Specification_1.00.pdf) and quotes the following part: > Do precision qualifiers for uniforms need to match? > Option 1: Yes. > Uniforms are defined to behave as if they are using the same storage in the > vertex and fragment processors and may be implemented this way. > If uniforms are used in both the vertex and fragment shaders, developers > should be warned if the precisions are different. Conversion of precision > should never be implicit. > Option 2: No. > Uniforms may be used by both shaders but the same precision may not be > available in both so there is a justification for allowing them to be > different. > Using the same uniform in the vertex and fragment shaders will always require > the precision to be specified in the vertex shader (since the default > precision is highp). This is an unnecessary burden on developers. > RESOLUTION: Yes, precision qualifiers for uniforms must match. (In reply to Tomasz Figa from comment #35) > By the way, it is not true that GLES 2.x didn't require this check, at least > according to following thread: > > https://github.com/AnalyticalGraphicsInc/cesium/issues/817 > > It refers to GLSL ES 1.00 specification > (http://www.khronos.org/registry/OpenGL/specs/es/2.0/GLSL_ES_Specification_1. > 00.pdf) and quotes the following part: [...] If you read it more carefully, it talks only about *used* uniforms. GLES 3.x is more explicit and requires that *declared* uniforms have to match. Different drivers could interpret *used* in different ways, as the match requirement being for uniforms that are: * declared (same as GLES 3.x) * referred after post-processing, or * linked after all optimizations are done (Last one is the only state that can cause real errors, but checking also declarations makes things easier for developers, things behave more consistently between drivers and different compiler versions.) That's why I asked whether the problematic Android games actually use the uniforms or not. Good point. I was an obvious example of interpreting "used" as "declared". (In reply to Eero Tamminen from comment #36) > (In reply to Tomasz Figa from comment #35) > > By the way, it is not true that GLES 2.x didn't require this check, at least > > according to following thread: > > > > https://github.com/AnalyticalGraphicsInc/cesium/issues/817 > > > > It refers to GLSL ES 1.00 specification > > (http://www.khronos.org/registry/OpenGL/specs/es/2.0/GLSL_ES_Specification_1. > > 00.pdf) and quotes the following part: > [...] > > If you read it more carefully, it talks only about *used* uniforms. > > GLES 3.x is more explicit and requires that *declared* uniforms have to > match. > > > Different drivers could interpret *used* in different ways, as the match > requirement being for uniforms that are: > * declared (same as GLES 3.x) > * referred after post-processing, or > * linked after all optimizations are done For the most part, the GLSL solves this problem by requiring things that are statically referenced to match. This means rules are applied to some things that may be unnecessary, but it avoids the problem of depending on how hard a particular compiler will optimize things. > (Last one is the only state that can cause real errors, but checking also > declarations makes things easier for developers, things behave more > consistently between drivers and different compiler versions.) > > That's why I asked whether the problematic Android games actually use the > uniforms or not. It sounds like we're going to take Tomasz's patch to relax this in GLSL ES 1.00. We're not certain what the right behavior is for ES 3.00 / 3.10 / 3.20. Ian is looking into this. I could not tell if any fix for this bug has been made to master, let alone the stable branches. Looks like Kenneth just pushed the fix: https://cgit.freedesktop.org/mesa/mesa/commit/src/compiler/glsl?id=0886be093fb871b0b6169718277e0f4d18df3ea7 All my use-cases work nowadays without this fix. -> Could somebody from the Android side (where the fix is still needed) verify this instead? Verified that Android games that used to suffer from this problem are fine on 17.3 branch. Thanks, marking the bug as verified. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.