Bug 104335

Summary: [OpenGL CTS][SKL,KBL] KHR-GL45.vertex_attrib_64bit.limits_test occasionally fails
Product: Mesa Reporter: Iago Toral <itoral>
Component: Drivers/DRI/i965Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: RESOLVED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: itoral
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 102590    

Description Iago Toral 2017-12-19 11:22:52 UTC
In my tests I have found this to have a low fail between 1% and 2%. If the test is executed in a loop, then chances of it failing can be significantly increased if we scroll the terminal fast for periods of 10+ seconds.

The problem seems somehow related to gl_InstanceID since every time I got this to fail, the failure happens in subtests that use gl_InstanceID in the shaders (with and without actual instancing draw calls being used).

Interestingly, replacing gl_InstanceID with 0 in the shader code seems to make the problem go away entirely, which is quite surprising seeing that this test has otherwise a very high pass rate and does actual instanced draw calls that woudl rely on gl_InstanceID taking appropriate values.

FWIW, we have tested these in 4 different machines (SKL and KBL), and found that on two of these which had 4.9 kernels we were unable to reproduce the problem, where the other two where the problem existed the kernel was 4.10.
Comment 1 Iago Toral 2017-12-19 12:56:26 UTC
I have confirmed that when this test fails it does because we read completely bogus values of gl_instanceID in the shader.
Comment 2 Iago Toral 2017-12-19 13:00:22 UTC
(In reply to Iago Toral from comment #0)

> Interestingly, replacing gl_InstanceID with 0 in the shader code seems to
> make the problem go away entirely, which is quite surprising seeing that
> this test has otherwise a very high pass rate and does actual instanced draw
> calls that woudl rely on gl_InstanceID taking appropriate values.

I have tracked this part where we can make the test pass consistently  when we replace gl_InstanceID by 0 down to a bug in CTS that was only validating the first instance worth of results (See gerrit CL#2037 for a patch).
Comment 3 Iago Toral 2017-12-20 11:58:31 UTC
Another thing I noticed is that if I modify the vertex shader in these tests to also use gl_VertexID besides gl_InstanceID, for example replacing gl_InstanceID in gl4cVertexAttrib64bitTest.cpp:1826 with this expression:

(gl_InstanceID + gl_VertexID * 2 - gl_VertexID - gl_VertexID)

so the end result is the same but it makes the compiler read gl_VertexID, then I can no longer reproduce  the problem, even if I do the scrolling thing I mention in the first comment which most definitely makes the issue reproducible much more easily.

Based on this, I thought there may be some piece of state that is only set when we resad  gl_VertexID that could be affecting the test but setting prog_data->uses_vertexid when we set prog_data->uses_instanceid doesn't seem to fix the problem and I did not identify any other place in the code where reading gl_VertexID affects state setup...
Comment 4 Iago Toral 2018-01-02 13:12:44 UTC
It seems that programming the hardware to include a dummy SGVS vertex element fixes the problem.

In theory, this is not required since gen8 because 3DSTATE_VF_SGVS allows us to store VertexID and InstanceID in an element beyond the last vertex element programmed, but just doing this seems to get rid of the spurious fails completely, at least on my KBL laptop.

I'll do some more testing and send an RFC patch for review if I don't find any issues. It could be that this is just papering over something else though.
Comment 5 Iago Toral 2018-02-07 10:12:52 UTC
Fixed with:

commit f474b19875a7c51ced6bb986e5733379b2780dcf
Author: Iago Toral Quiroga <itoral@igalia.com>
Date:   Thu Jan 4 03:55:13 2018 +0100

    i965: allocate a SGVS element when VertexID or InstanceID are read
    
    Although on gen8+ platforms we can in theory use 3DSTATE_VF_SGVS
    to put these beyond the last vertex element it seems that we still
    need to allocate the SVGS element, otherwise we have observed cases
    where we end up reading garbage. Specifically, the CTS test mentioned
    below was flaky with a fail rate of ~1% on some gen9+ platforms caused
    by reading garbage for the gl_InstanceID value. The flakyness goes
    away as soon as we start allocating the SVGS element.
    
    v2:
      - Do this for gen8+, not just gen9+, and pull the boolean
        outside the #if block (Jason)
    
    Fixes flaky test:
    KHR-GL45.vertex_attrib_64bit.limits_test
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104335
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.