Summary: | The big SKQP bug | ||
---|---|---|---|
Product: | Mesa | Reporter: | Dylan Baker <baker.dylan.c> |
Component: | Drivers/DRI/i965 | Assignee: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | baker.dylan.c, chadversary, chenglei.ren, danylo.piliaiev, dongseong.hwang, jmcasanova, kenneth |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
GPU hang error state & dmesg for mixedtextblobs
GPU hang dmesg for mixedtextblobs dmesg with 0x3F enabled i915 error state of the GPU hang apitrace Patch to detect bug in mixedtextblobs test Random vk failure on APL platform error state decoded by latest decoder apitrace of mixedtextblobs test |
Description
Dylan Baker
2018-02-28 23:27:45 UTC
Android P CTS CtsSkQPTestCases "#unitTest_EGLImageTest" fails as well. I'll paste here my initial thoughts from internal mail thread: "First test creates a regular texture (GL_TEXTURE_2D) and then EGLImage resource out of that. After this, test creates another texture ID (that has type GL_TEXTURE_EXTERNAL) and attempts to bind/source this one from the original EGLImage (using glEGLImageTargetTexture2D). Since the original EGLImage was not created from dmabuf (and is not considered external resource) we hit this check. TBH I'm not 100% certain if the 2nd step is considered 'legal' according to the spec." After going through the specs for OES_EGL_image_external and external textures in OpenGL spec, my understanding is that there are no limitations/restriction or special conditions to binding existing texture object to External texture. Only limitation from the spec is that "If the new texture object is bound to TEXTURE_2D, TEXTURE_CUBE_MAP, or TEXTURE_EXTERNAL_OES it is and remains a 2-D, cube map or external texture respectively until it is deleted". Another point I would like to add is - In the specs it is not mentioned that EGLImages should only be created with EGL_EXT_image_dma_buf_import. I think we(Intel) are imposing the restriction that EGL image will only be supported through Linux dma buf by using extension EGL_LINUX_DMA_BUF_EXT. Also, this test passes on ARM(Pixel 2 phones) and seems to be only affecting Intel platforms. I understand that 2D textures are different from external textures. From the specs it seems to be wrt usage of following operations TexImage2D, TexSubImage2D,CompressedTexImage2D, CompressedTexSubImage2D, CopyTexImage2D, or CopyTexSubImage2D not permitted for external textures(which should be an INVALID operation for external textures in any case and since the texture object is bound to TEXTURE_EXTERNAL_OES, it shouldn't matter if EGLImage resource was created with a regular texture. It would be considered an external texture till it is deleted). Tapani, Aditya, I think SKQP has an unnecessary restriction. "The 2nd step" is how chromium works with media driver in following code. https://cs.chromium.org/chromium/src/media/gpu/vaapi/vaapi_picture_native_pixmap_egl.cc?type=cs&q=VaapiPictureNativePixmapEgl::Allocate&l=72 * chromium creates regular texture * chromium exports eglImage from the regular texture using eglCreateImageKHR with EGL_GL_TEXTURE_2D_KHR as EGL_KHR_gl_image spec allows it. * chromium export dma_buf from eglImage via EGL_MESA_image_dma_buf_export * chromium gives the dma_buf to libva as external buffer. The dma_buf can be bound in "another" egl context, so TEXTURE_EXTERNAL_OES with the eglImage is legit in my opinion. Correction: in Chromium case above, it uses GL_TEXTURE_2D. The source of all confusion is EGL_KHR_gl_image spec, which allows to create eglImage from regular texture. https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_gl_image.txt Unfortunately, the spec doesn't define which texture <target> parameter we must use for EGLImageTargetTexture2DOES API. ARM driver allows TEXTURE_BINDING_EXTERNAL_OES but Mesa allows only TEXTURE_2D in this case. in EGL_KHR_gl_image case, we create eglImage by eglCreateImageKHR with EGL_GL_TEXTURE_2D_KHR <target> parameter, but the spec doesn't explicitly say which TEXTURE_BINDING_EXTERNAL_OES or TEXTURE_2D must be used for EGLImageTargetTexture2DOES. Worth to note, EGL_EXT_image_dma_buf_import spec also doesn't explicitly say which TEXTURE_BINDING_EXTERNAL_OES or TEXTURE_2D must be used for EGLImageTargetTexture2DOES. in this case, we create eglImage by eglCreateImageKHR with EGL_LINUX_DMA_BUF_EXT <target> parameter, and use TEXTURE_BINDING_EXTERNAL_OES for EGLImageTargetTexture2DOES without any spec support. It's really gray area. I'm not sure which skia or mesa we should change. Thanks all for the debugging. On Android IVI, we also encounter this issue. We have collect the info in this bug, and will create one bug to Google. If there is any update, will share here. (In reply to Aditya Swarup from comment #2) > Another point I would like to add is - In the specs it is not mentioned that > EGLImages should only be created with EGL_EXT_image_dma_buf_import. I think > we(Intel) are imposing the restriction that EGL image will only be supported > through Linux dma buf by using extension EGL_LINUX_DMA_BUF_EXT. There is no such restriction. We do support creating EGLImages from regular 2D textures, native platform pixmaps and also renderbuffers and pass the tests written for this exact purpose (Piglit, deqp). However we don't support "mixmatching" of resources in the way this test does. > Also, this test passes on ARM(Pixel 2 phones) and seems to be only affecting > Intel platforms. > > I understand that 2D textures are different from external textures. > > From the specs it seems to be wrt usage of following operations TexImage2D, > TexSubImage2D,CompressedTexImage2D, CompressedTexSubImage2D, CopyTexImage2D, > or > CopyTexSubImage2D not permitted for external textures(which should be an > INVALID operation for external textures in any case and since the texture > object is bound to TEXTURE_EXTERNAL_OES, it shouldn't matter if EGLImage > resource was created with a regular texture. It would be considered an > external texture till it is deleted). But in practice from implementation point of view it can make a big difference how do we treat the resource. Note that at same time, we would need to make sure the resource is good for external use-cases but also all of those APIs would need to work when using the regular handle. It seems removing the restriction causes problems on Android so I believe there are some corner cases that do not work correctly without this restriction in place. I don't know what the regression on Android side is. The only way I see us solving this problem with minimum fuss is to use Linux DMA BUF IMPORT EXT to create EGLImage(CreateEGLImage) instead of using regular texture in our case.(Don't know if this implementation will sit well with Android folks) The change could be something like: if(support LINUX DMA BUF IMPORT EXT extension) CreateEGLImage using linux dma buf extension else CreateEGLImage(using regular texture) <----- For other platforms Created attachment 142256 [details]
GPU hang error state & dmesg for mixedtextblobs
For another issue gles_mixedtextblobs, we caught GPU hang, please refer to the attachment for the error state and dmesg log.
Created attachment 142257 [details]
GPU hang dmesg for mixedtextblobs
Hi, Here are more details to the CTS issue we are trying to solve. We are able to repro 100% in the Android container under Chrome OS running Linux kernel 4.4 and Mesa 18. We can also repro under Celadon/NUC/drm tip running kernel 4.19 and Mesa 18. The code from the upper layer is here : http://androidxref.com/9.0.0_r3/xref/external/skia/gm/mixedtextblobs.cpp#20 There are 3 main draw calls() 26 SkPaint paint(skPaint); 28 canvas->drawRect(clipRect, clipHairline); 30 canvas->drawTextBlob(blob, 0, 0, paint); 33 canvas->drawTextBlob(blob, 0, 0, paint); We are still trying to see what these functions are in GL apis (more soon). The IPHER, which points to the command that hung is at IPEHR: 0x78260000, and that address point to 3DSTATE_BINDING_TABLE_POINTERS_VS 0xe60003ac: 0x784f0000: 3D UNKNOWN: 3d_965 opcode = 0x784f 0xe60003b0: 0x90000100: UNKNOWN 0xe60003b4: 0x78230000: 3D UNKNOWN: 3d_965 opcode = 0x7823 0xe60003b8: 0x00000140: MI_NOOP 0xe60003bc: 0x78260000: 3DSTATE_BINDING_TABLE_POINTERS_VS 0xe60003c0: 0x00000000: dword 1 0xe60003c4: 0x78270000: 3DSTATE_BINDING_TABLE_POINTERS_HS 0xe60003c8: 0x00000000: dword 1 0xe60003cc: 0x78280000: 3DSTATE_BINDING_TABLE_POINTERS_DS 0xe60003d0: 0x00000000: dword 1 0xe60003d4: 0x78290000: 3DSTATE_BINDING_TABLE_POINTERS_GS 0xe60003d8: 0x00000000: dword 1 0xe60003dc: 0x782a0000: 3DSTATE_BINDING_TABLE_POINTERS_PS 0xe60003e0: 0x00000160: dword 1 The documentation says this: The 3DSTATE_BINDING_TABLE_POINTERS_VS command is used to define the location of fixed functions' BINDING_TABLE_STATE. Only some of the fixed functions utilize binding tables. Created attachment 142263 [details]
dmesg with 0x3F enabled
Created attachment 142264 [details]
i915 error state of the GPU hang
I am also attaching more logs Created attachment 142263 [details] dmesg with 0x3F enabled Created attachment 142264 [details] i915 error state of the GPU hang I'm testing a fix for SRGBMipMaps now. It looks like SkiaGM_gles.mixedtextblobs may be hitting an issue with dual source blending. Looking into that. Created attachment 142267 [details]
apitrace
let me add apitrace when this test run on ubuntu with egl backend. Even though it was captured on Ubuntu, the OpenGLES2 usage should be similar.
i915 error state complains about 3DSTATE_BINDING_TABLE_POINTERS_VS and there are some unique pattern of vertex shader attrib binding.
1902 @0 glVertexAttribPointer(index = 0, size = 4, type = GL_FLOAT, normalized = GL_FALSE, stride = 20, pointer = NULL)
1904 @0 glVertexAttribDivisor(index = 0, divisor = 1)
1906 @0 glVertexAttribPointer(index = 1, size = 1, type = GL_FLOAT, normalized = GL_FALSE, stride = 20, pointer = 0x10)
1908 @0 glVertexAttribDivisor(index = 1, divisor = 1)
1910 @0 glDrawArraysInstanced(mode = GL_TRIANGLE_STRIP, first = 0, count = 4, instancecount = 1)
(In reply to Kenneth Graunke from comment #14) > I'm testing a fix for SRGBMipMaps now. Patch for SRGBMipMaps test: https://patchwork.freedesktop.org/patch/259089/ (In reply to Dongseong Hwang from comment #15) > Created attachment 142267 [details] > apitrace > > let me add apitrace when this test run on ubuntu with egl backend. Even > though it was captured on Ubuntu, the OpenGLES2 usage should be similar. > > i915 error state complains about 3DSTATE_BINDING_TABLE_POINTERS_VS and there > are some unique pattern of vertex shader attrib binding. Pretty sure this is a complete red herring. The illegal GPU programming I'm seeing is that we're submitting a shader with regular render target messages, instead of dual source blending render target messages. But, BLEND_STATE has SRC1 blend factors. That could easily cause a hang. About mixedtextblobs... https://gitlab.freedesktop.org/kwg/mesa/commits/skia has a patch, "i965: Always use dual source RT writes if src1 blend factors are enabled", which makes the mixedtextblobs tests pass. Note that the SKQP test is broken. It's enabled dual source blend factors (SRC1_COLOR etc.) but provided a shader that only writes one color output, when it ought to write two colors. For reference, the one that is hanging is the fragment shader with the 'discard', which only outputs a single color. In fact, none of the fragment shaders appear to have a layout(index = 1) qualifier indicating which is the second color. Nor does SQKP call glBindFragDataLocationIndexed to set one (which is the other mechanism for doing this). So, the results are undefined. Search https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_blend_func_extended.txt for "not defined" to find the spec citation. That said, we should not hang in this case - just render some undefined garbage. Doing so appears to suit SKQP just fine (as does rendering nothing at all, amusingly). If you guys could pass that on to the SKQP authors, it would be great to get the bug in the test fixed. (In reply to Kenneth Graunke from comment #18) > About mixedtextblobs... > > https://gitlab.freedesktop.org/kwg/mesa/commits/skia has a patch, "i965: > Always use dual source RT writes if src1 blend factors are enabled", which > makes the mixedtextblobs tests pass. Sorry, forgot to mention. This patch is pretty broken. (It works with shader_precompile=false set, but crashes otherwise.) I don't think we want to go with this patch in the long term. Will need to talk with a few people about what the best solution is. Thanks, Kenneth. Let me try to fix skqp based on your explanation tmr. beside, I printf all gl calls in skia and glReadPixels causes gpu hangs. Does it make sense with your theory? ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(137) : GL: CreateProgram() ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(124) : GL: CreateShader(type) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(128) : GL: ShaderSource(shaderId, 1, &glsl, &glslLength) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(131) : GL: CompileShader(shaderId) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(140) : GL: GetShaderiv(shaderId, 0x8B81, &compiled) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(174) : GL: AttachShader(programId, shaderId) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(124) : GL: CreateShader(type) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(128) : GL: ShaderSource(shaderId, 1, &glsl, &glslLength) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(131) : GL: CompileShader(shaderId) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(140) : GL: GetShaderiv(shaderId, 0x8B81, &compiled) ../../src/gpu/gl/builders/GrGLShaderStringBuilder.cpp(174) : GL: AttachShader(programId, shaderId) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(227) : GL: BindAttribLocation(programID, i, primProc.getAttrib(i).fName) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(227) : GL: BindAttribLocation(programID, i, primProc.getAttrib(i).fName) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(227) : GL: BindAttribLocation(programID, i, primProc.getAttrib(i).fName) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(311) : GL: BindFragDataLocation(programID, 0, GrGLSLFragmentShaderBuilder::DeclaredColorOutputName()) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(250) : GL: LinkProgram(programID) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(333) : GL: GetProgramiv(programID, 0x8B82, &linked) ../../src/gpu/gl/GrGLUniformHandler.cpp(154) : GL: GetUniformLocation(programID, fUniforms[i].fVariable.c_str()) ../../src/gpu/gl/GrGLUniformHandler.cpp(154) : GL: GetUniformLocation(programID, fUniforms[i].fVariable.c_str()) ../../src/gpu/gl/GrGLUniformHandler.cpp(159) : GL: GetUniformLocation(programID, fSamplers[i].fVariable.c_str()) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(381) : GL: DeleteShader(shaderIDs[i]) ../../src/gpu/gl/builders/GrGLProgramBuilder.cpp(381) : GL: DeleteShader(shaderIDs[i]) ../../src/gpu/gl/GrGLProgram.cpp(50) : GL: UseProgram(fProgramID) ../../src/gpu/gl/GrGLProgramDataManager.cpp(59) : GL: Uniform1i(sampler.fLocation, i + startUnit) ../../src/gpu/gl/GrGLGpu.cpp(1828) : GL: UseProgram(programID) ../../src/gpu/gl/GrGLGpu.cpp(2885) : GL: Disable(0x0BE2) ../../src/gpu/gl/GrGLProgramDataManager.cpp(173) : GL: Uniform4fv(uni.fLocation, arrayCount, v) ../../src/gpu/gl/GrGLProgramDataManager.cpp(224) : GL: UniformMatrix3fv(loc, cnt, false, m) ../../src/gpu/gl/GrGLGpu.cpp(3267) : GL: ActiveTexture(0x84C0 + unit) ../../src/gpu/gl/GrGLGpu.cpp(3032) : GL: BindTexture(target, texture->textureID()) ../../src/gpu/gl/GrGLGpu.cpp(3238) : GL: TexParameteri(target, 0x8E42, swizzle[0]) ../../src/gpu/gl/GrGLGpu.cpp(3239) : GL: TexParameteri(target, 0x8E43, swizzle[1]) ../../src/gpu/gl/GrGLGpu.cpp(3240) : GL: TexParameteri(target, 0x8E44, swizzle[2]) ../../src/gpu/gl/GrGLGpu.cpp(3241) : GL: TexParameteri(target, 0x8E45, swizzle[3]) ../../src/gpu/gl/GrGLGpu.cpp(2488) : GL: BindFramebuffer(0x8D40, target->renderFBOID()) ../../src/gpu/gl/GrGLGpu.cpp(2496) : GL: CheckFramebufferStatus(0x8D40) ../../src/gpu/gl/GrGLIRect.h(41) : GL: Viewport(fLeft, fBottom, fWidth, fHeight) ../../src/gpu/gl/GrGLVertexArray.cpp(130) : GL: VertexAttribPointer(index, layout.fCount, layout.fType, layout.fNormalized, stride, offsetAsPtr) ../../src/gpu/gl/GrGLVertexArray.cpp(130) : GL: VertexAttribPointer(index, layout.fCount, layout.fType, layout.fNormalized, stride, offsetAsPtr) ../../src/gpu/gl/GrGLVertexArray.cpp(130) : GL: VertexAttribPointer(index, layout.fCount, layout.fType, layout.fNormalized, stride, offsetAsPtr) ../../src/gpu/gl/GrGLGpu.cpp(2651) : GL: DrawRangeElements(glPrimType, minIndexValue, maxIndexValue, indexCount, 0x1403, indices) ../../src/gpu/gl/GrGLGpu.cpp(2400) : GL: PixelStorei(0x0D05, config_alignment(dstAsConfig)) <-------------- GPU HANG ../../src/gpu/gl/GrGLGpu.cpp(2404) : GL: ReadPixels(readRect.fLeft, readRect.fBottom, readRect.fWidth, readRect.fHeight, externalFormat, externalType, readDst) GrContextPriv::readSurfacePixels 3 ../../src/gpu/gl/GrGLStencilAttachment.cpp(25) : GL: DeleteRenderbuffers(1, &fRenderbufferID) Created attachment 142270 [details] [review] Patch to detect bug in mixedtextblobs test (In reply to Dongseong Hwang from comment #20) > Thanks, Kenneth. Let me try to fix skqp based on your explanation tmr. > > beside, I printf all gl calls in skia and glReadPixels causes gpu hangs. > Does it make sense with your theory? No, glReadPixels would not hit this case. GL calls batch up GPU commands in a buffer, which isn't submitted immediately. glReadPixels will demand the results of the earlier rendering, so it causes the earlier commands to be submitted, so they actually execute. (The batch buffer may also fill up at a random point, causing us to submit it to the GPU.) So, it appears to be at fault, but it's actually innocent. If you want to do that kind of debugging, you should set always_flush_batch=true and INTEL_DEBUG=sync to avoid batching and wait for rendering to finish (although this can hide many bugs). I have attached a Mesa patch you can apply to make it assert fail in the undefined case. Then, the backtrace should point to the actual draw that's going wrong. Hello Kenneth, I feel I have something to add: > Note that the SKQP test is broken. It's enabled dual source blend factors > (SRC1_COLOR etc.) but provided a shader that only writes one color output, when > it ought to write two colors. For reference, the one that is hanging is the > fragment shader with the 'discard', which only outputs a single color I have reported this hang some time ago in GL and in Vulkan: https://bugs.freedesktop.org/show_bug.cgi?id=107088 https://bugs.freedesktop.org/show_bug.cgi?id=107306 (It looks like the case you are describing) And made two patches respectively: https://patchwork.freedesktop.org/series/45779/ https://patchwork.freedesktop.org/series/46931/ Hope this would help Created attachment 142277 [details]
Random vk failure on APL platform
Thanks all for the help! On Android, we also have target for APL platform(You could install Celadon on APL NUC). On APL platform, there will be random vk failure. Similar to mixedtextblobs, there will be no image render out. But no GPU hang found. I also tried Danylo's vulkan patch, but it doesn't work.
This issue could be reproduced with following:
1. adb install -r CtsSkQPTestCases.apk
2. adb shell am instrument -e class 'org.skia.skqp.SkQPRunner#vk_modecolorfilters' -w org.skia.skqp
vk_modecolorfilters is very easy to reproduce. APL platform is very important for Android. Could you help take a look of this issue?
(In reply to Carlos Santa from comment #10) > Hi, > > Here are more details to the CTS issue we are trying to solve. > > We are able to repro 100% in the Android container under Chrome OS running > Linux kernel 4.4 and Mesa 18. We can also repro under Celadon/NUC/drm tip > running kernel 4.19 and Mesa 18. > > The code from the upper layer is here : > http://androidxref.com/9.0.0_r3/xref/external/skia/gm/mixedtextblobs.cpp#20 > > There are 3 main draw calls() > > 26 SkPaint paint(skPaint); > 28 canvas->drawRect(clipRect, clipHairline); > 30 canvas->drawTextBlob(blob, 0, 0, paint); > 33 canvas->drawTextBlob(blob, 0, 0, paint); > > We are still trying to see what these functions are in GL apis (more soon). > > The IPHER, which points to the command that hung is at IPEHR: 0x78260000, > and that address point to 3DSTATE_BINDING_TABLE_POINTERS_VS > > 0xe60003ac: 0x784f0000: 3D UNKNOWN: 3d_965 opcode = 0x784f > 0xe60003b0: 0x90000100: UNKNOWN > 0xe60003b4: 0x78230000: 3D UNKNOWN: 3d_965 opcode = 0x7823 > 0xe60003b8: 0x00000140: MI_NOOP > 0xe60003bc: 0x78260000: 3DSTATE_BINDING_TABLE_POINTERS_VS > 0xe60003c0: 0x00000000: dword 1 > 0xe60003c4: 0x78270000: 3DSTATE_BINDING_TABLE_POINTERS_HS > 0xe60003c8: 0x00000000: dword 1 > 0xe60003cc: 0x78280000: 3DSTATE_BINDING_TABLE_POINTERS_DS > 0xe60003d0: 0x00000000: dword 1 > 0xe60003d4: 0x78290000: 3DSTATE_BINDING_TABLE_POINTERS_GS > 0xe60003d8: 0x00000000: dword 1 > 0xe60003dc: 0x782a0000: 3DSTATE_BINDING_TABLE_POINTERS_PS > 0xe60003e0: 0x00000160: dword 1 > > The documentation says this: The 3DSTATE_BINDING_TABLE_POINTERS_VS command > is used to define the location of fixed functions' > BINDING_TABLE_STATE. Only some of the fixed functions utilize binding tables. The progress could pass this stage but just want to keep the record straight. I believe this is from the crash dump provided in the following comment #12. I think the correct address is another location in the batch buffer: 0xe60034bc: 0x00000000: MI_NOOP 0xe60034c0: 0x00000000: MI_NOOP 0xe60034c4: 0xd00103c0: UNKNOWN 0xe60034c8: 0xffffffff: UNKNOWN 0xe60034cc: 0x78260000: 3DSTATE_BINDING_TABLE_POINTERS_VS 0xe60034d0: 0x00000000: dword 1 0xe60034d4: 0x782a0000: 3DSTATE_BINDING_TABLE_POINTERS_PS Correct me if I am wrong. For one, people are using bad decoders - probably the ancient one in public libdrm that hasn't been updated in many generations. Use aubinator_error_decode from Mesa to read error states. You'll get far fewer UNKNOWNs and MI_NOOPs and other garbage. IPEHR is the command DWord that triggered an error. ACTHD is the address of the command. You can look at ACTHD then jump there to get to the batch. Kenneth, actually skqp assigns SRC1 by glBindFragDataLocationIndexed. However, skia bindw two varying out variables to (number=0, index=0) and (number=0, index=1). Is it legal? I cannot find the spec preventing this usage. https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_blend_func_extended.txt In the spec, LinkProgram to fail "if more than one varying out variable is bound to the same number and index". However, it doesn't explicitly prohibit that more than one varying out variable is bound to the same number but different index.. Here's how skia work * FBO with only one color attachment 3104 @0 glBindFramebuffer(target = GL_FRAMEBUFFER, framebuffer = 1) 3106 @0 glFramebufferTexture2D(target = GL_FRAMEBUFFER, attachment = GL_COLOR_ATTACHMENT0, textarget = GL_TEXTURE_2D, texture = 1, level = 0) * Compile fs and bind two varying out variables to (number=0, index=0) and (number=0, index=1). 3418 @0 glShaderSource(shader = 9, count = 1, string = ["#version 320 es precision mediump float; uniform highp vec4 sk_RTAdjust; uniform highp vec2 uAtlasSizeInv_Stage0; in highp vec2 inPosition; in mediump vec4 inColor; in mediump uvec2 inTextureCoords; out highp vec2 vTextureCoords_Stage0; flat out highp int vTexIndex_Stage0; out mediump vec4 vinColor_Stage0; void main() { highp ivec2 signedCoords = ivec2(int(inTextureCoords.x), int(inTextureCoords.y)); highp int texIdx = 2 * (signedCoords.x & 1) + (signedCoords.y & 1); highp vec2 unormTexCoords = vec2(float(signedCoords.x / 2), float(signedCoords.y / 2)); vTextureCoords_Stage0 = unormTexCoords * uAtlasSizeInv_Stage0; vTexIndex_Stage0 = texIdx; vinColor_Stage0 = inColor; gl_Position = vec4(inPosition.x, inPosition.y, 0.0, 1.0); gl_Position = vec4(gl_Position.xy * sk_RTAdjust.xz + gl_Position.ww * sk_RTAdjust.yw, 0.0, gl_Position.w); } "], length = [877]) 3432 @0 glBindFragDataLocationEXT(program = 7, color = 0, name = sk_FragColor) 3434 @0 glBindFragDataLocationIndexedEXT(program = 7, colorNumber = 0, index = 1, name = fsSecondaryColorOut) 3439 @0 glLinkProgram(program = 7) * set BlendFunc with SRC1 3459 @0 glBlendFunc(sfactor = GL_ONE, dfactor = GL_ONE_MINUS_SRC1_COLOR) * draw and gpu hang 3489 @0 glVertexAttribPointer(index = 0, size = 2, type = GL_FLOAT, normalized = GL_FALSE, stride = 16, pointer = 0x30) 3491 @0 glVertexAttribPointer(index = 1, size = 4, type = GL_UNSIGNED_BYTE, normalized = GL_TRUE, stride = 16, pointer = 0x38) 3493 @0 glVertexAttribDivisor(index = 1, divisor = 0) 3495 @0 glVertexAttribIPointer(index = 2, size = 2, type = GL_UNSIGNED_SHORT, stride = 16, pointer = 0x3c) 3497 @0 glVertexAttribDivisor(index = 2, divisor = 0) 3499 @0 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 31, count = 48, type = GL_UNSIGNED_SHORT, indices = NULL) #26 - oops I paste wrong fs. The problematic shader is as follow. #version 320 es #extension GL_EXT_blend_func_extended : require precision mediump float; out mediump vec4 sk_FragColor; uniform lowp sampler2D uTextureSampler_0_Stage0; in highp vec2 vTextureCoords_Stage0; flat in highp int vTexIndex_Stage0; in mediump vec4 vinColor_Stage0; out mediump vec4 fsSecondaryColorOut; void main() { mediump vec4 outputColor_Stage0; mediump vec4 outputCoverage_Stage0; { outputColor_Stage0 = vinColor_Stage0; mediump vec4 texColor; { texColor = texture(uTextureSampler_0_Stage0, vTextureCoords_Stage0); } outputCoverage_Stage0 = texColor; } { outputCoverage_Stage0.w = max(max(outputCoverage_Stage0.x, outputCoverage_Stage0.y), outputCoverage_Stage0.z); fsSecondaryColorOut = outputColor_Stage0.w * outputCoverage_Stage0; sk_FragColor = outputColor_Stage0 * outputCoverage_Stage0; } } Created attachment 142282 [details]
error state decoded by latest decoder
btw, let me upload the decoded error state decoded by latest decoder, which belongs to chromeos itself.
I got this error state in kernel version v4.14 chromeos and decode it in the device.
Kenneth, Zhang, does it look sane?
(In reply to Dongseong Hwang from comment #26) > Kenneth, actually skqp assigns SRC1 by glBindFragDataLocationIndexed. > However, skia bindw two varying out variables to (number=0, index=0) and > (number=0, index=1). Is it legal? I cannot find the spec preventing this > usage. Yes, that's exactly what you're supposed to do. Location should be the same for both (render target 0), index should select color 0 vs. color 1. That all seems fine. The GPU hang is definitely on a later draw with a shader that has 'discard' in it. I believe the GL_ONE_MINUS_SRC1_COLOR blend factor is still set, but the 'discard' shader doesn't write two color outputs. Like, they meant to reset the blend factor and forgot. (In reply to Dongseong Hwang from comment #28) > Created attachment 142282 [details] > error state decoded by latest decoder > > btw, let me upload the decoded error state decoded by latest decoder, which > belongs to chromeos itself. > I got this error state in kernel version v4.14 chromeos and decode it in the > device. > > Kenneth, Zhang, does it look sane? That's still a garbage decoder. We're also not going to get anywhere looking at the error state at this point. (In reply to Kenneth Graunke from comment #30) > (In reply to Dongseong Hwang from comment #28) > > Created attachment 142282 [details] > > error state decoded by latest decoder > > > > btw, let me upload the decoded error state decoded by latest decoder, which > > belongs to chromeos itself. > > I got this error state in kernel version v4.14 chromeos and decode it in the > > device. > > > > Kenneth, Zhang, does it look sane? > > That's still a garbage decoder. We're also not going to get anywhere > looking at the error state at this point. By the way, you can tell because of all the "Bad length X in (null)" and "3D UNKNOWN: 3d_965 opcode = 0x7824" messages. It's clearly struggling to decode. That's because intel_error_decode uses a libdrm decoder that hasn't been updated in roughly 6 years. aubinator_error_decode works. Most people just post the raw decodes, so people can decode them using whatever tools, though :) I tested patch in #22 https://patchwork.freedesktop.org/patch/235939/ and it fixes the issue for gles_mixedtextblobs Can this be accepted as a fix upstream? Rather than application program screwing up dual src blending, we are taking care of the corner case by disabling Dual source blending when shader doesn't have second color output. Seems logical to me. (In reply to Aditya Swarup from comment #32) > I tested patch in #22 > https://patchwork.freedesktop.org/patch/235939/ > and it fixes the issue for gles_mixedtextblobs > > Can this be accepted as a fix upstream? > > Rather than application program screwing up dual src blending, we are taking > care of the corner case by disabling Dual source blending when shader > doesn't have second color output. Seems logical to me. Yes, that's the plan, but the test is still hitting undefined behavior, so presumably they'd like to fix that. :) Also, tested patch in #16 for SRGBMipMaps unit test. It passes. :) @Ken Can you please pitch in for the debate regarding EGLImage test failure where the skia test is failing as we are creating EGLImage out of a 2D texture and then converting it to an external texture. It hits the following check in MESA in file intel_tex_image.c: In intel_image_target_texture_2d(): /* We support external textures only for EGLImages created with * EGL_EXT_image_dma_buf_import. We may lift that restriction in the future. */ if (target == GL_TEXTURE_EXTERNAL_OES && !image->dma_buf_imported) { _mesa_error(ctx, GL_INVALID_OPERATION, "glEGLImageTargetTexture2DOES(external target is enabled only " "for images created with EGL_EXT_image_dma_buf_import"); return; } Removing this check and the flag dma_buf_imported solves the problem. I have not submitted the patch upstream but uploaded it for chromium: https://chromium-review.googlesource.com/c/chromiumos/third_party/mesa/+/1285296 I don't know the logic behind restricting external textures to be only created using dmabuf. Do you feel this can be accepted upstream or Skqp is doing illegal operation wrt external texture? Okay, fixes for SRGBMipMaps and mixedtextblobs are now in master. Thanks so much for fixing the latter, Danylo! :D (In reply to Aditya Swarup from comment #34) > Also, tested patch in #16 for SRGBMipMaps unit test. It passes. :) > > @Ken > Can you please pitch in for the debate regarding EGLImage test failure where > the skia test is failing as we are creating EGLImage out of a 2D texture and > then converting it to an external texture. I'm not the best person for that but I can try and fetch some people who are familiar with the code and have strong opinions... Thank you for all quick response and fixing this issue. Kenneth, I carefully checked skqp. skia never do something wrong. When it uses double blend src, skia explicitly uses the program and sets the blend func. glUseProgram(program = 7) glBlendFunc(sfactor = GL_ONE, dfactor = GL_ONE_MINUS_SRC1_COLOR) Whenever it uses single blend src, skia explicitly uses the program and sets the blend func again. glUseProgram(program = 16) glBlendFunc(sfactor = GL_ONE, dfactor = GL_ONE_MINUS_SRC_ALPHA) I think it's the hidden bug of mesa and the following patch accidentally fix it. https://patchwork.freedesktop.org/patch/235939/ I guess |wm_prog_data->dual_src_blend| is stale at the point, although skia calls glBlendFunc(sfactor = GL_ONE, dfactor = GL_ONE_MINUS_SRC_ALPHA). What do you think? Created attachment 142291 [details] apitrace of mixedtextblobs test (In reply to Dongseong Hwang from comment #36) > Thank you for all quick response and fixing this issue. > > Kenneth, I carefully checked skqp. skia never do something wrong. I have attached an apitrace of mixedtextblobs. unxz then apitrace dump mixedtextblobs.trace to see the GL calls it's making. 3794 glBlendFunc(sfactor = GL_ONE, dfactor = GL_ONE_MINUS_SRC1_COLOR) ... 3804 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 47, count = 72, type = GL_UNSIGNED_SHORT, indices = NULL) ... 3808 glUseProgram(program = 4) ... 3836 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 33, count = 90, type = GL_UNSIGNED_SHORT, indices = 0xad4) 3838 glEnable(cap = GL_SAMPLE_SHADING) 3841 glUseProgram(program = 7) ... 3853 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 95, count = 96, type = GL_UNSIGNED_SHORT, indices = 0xba0) Call #3853 does a draw with GL_ONE_MINUS_SRC1_COLOR and program 7, which is the "discard" program, which only writes one color, triggering undefined behavior. I'm not making this up. :( Thank you all for the quick response and fixing. Hi Kenneth & Dongseong, there are also random failure for vk backend on APL platform. For example, vk_modecolorfilters has a very low pass rate. Most of time, there is nothing render out. Could you help check if this is one HW limitation or vk driver issue? Thanks a lot! This issue could be reproduced with following: 1. adb install -r CtsSkQPTestCases.apk 2. adb shell am instrument -e class 'org.skia.skqp.SkQPRunner#vk_modecolorfilters' -w org.skia.skqp Kenneth, thank you for sharing apitrace. The call flows is quite different compared to egl backend in Ubuntu. It's why this issue happened in only android. With this apitrace, it is quite easy to fix this issue in skia. Appreciate it! I submitted skia fix https://skia-review.googlesource.com/c/skia/+/167220 Kenneth, Thank you! Sent a patch to mesa mailing list for solving EGLImage test failure: https://patchwork.freedesktop.org/patch/259601/ Tested this patch by running deqp and we didn't see any failures. (In reply to Dongseong Hwang from comment #40) > I submitted skia fix https://skia-review.googlesource.com/c/skia/+/167220 > Kenneth, Thank you! Great, thank you! I suppose undefined blending results doesn't matter if you aren't writing anything anyway. Still, probably best to avoid it nonetheless - thanks for submitting that patch. Hi Ken, Could you please help in reviewing https://patchwork.freedesktop.org/patch/259601/ or help me getting in touch with relevant people who will be able to review this patch? I submitted a skia change which makes it use GL_TEXTURE_2D, instead of GL_TEXTURE_EXTERNAL target for EGL_KHR_gl_image. https://skia-review.googlesource.com/c/skia/+/167456 Tapani, Kenneth, what do you think? Is it better than relieving mesa's restriction? Kenneth, I changed my mind. Now I think the EGLImage based on a regular texture by EGL_KHR_gl_image can be bound by EGLImageTargetTexture2DOES with TEXTURE_EXTERNAL_OES. I thought some EGLImage (i.e. EGL_KHR_gl_image) should be considered as regular EGLImage and other EGLImage (i.e. dma_buf_egl_image) should be considered as external EGLImage. However I think I was wrong. It's up to the <target> parameter to EGLImageTargetTexture2DOES(). If it's TEXTURE_2D, TexImage2D and TexSubImage2D are available. If it's TEXTURE_EXTERNAL_OES, TexImage2D and TexSubImage2D cannot be used. EGL_KHR_gl_image can be used with both TEXTURE_2D and TEXTURE_EXTERNAL_OES, although dma_buf_egl_image works with only TEXTURE_EXTERNAL_OES. OES_EGL_image_external spec explicitly says TEXTURE_EXTERNAL_OES can be used to EGLImageTargetTexture2DOES() on any EGLImage. https://www.khronos.org/registry/OpenGL/extensions/OES/OES_EGL_image_external.txt Dependencies on GL_OES_EGL_image If GL_OES_EGL_image is supported then change the text in both extensions to allow either TEXTURE_2D or TEXTURE_EXTERNAL_OES to be passed as the <target> parameter to EGLImageTargetTexture2DOES(). When <target> is TEXTURE_2D, behavior of EGLImageTargetTexture2DOES() is as described in the GL_OES_EGL_image spec. When <target> is TEXTURE_EXTERNAL_OES, behavior of EGLImageTargetTexture2DOES() is as described in this spec. On the other hands, from implementation side, I think Aditya's change makes sense and this change would be safe because 1. it passes Piglit and deqp 2. it restricts available functionality. About 2, all available functionalities of OES_EGL_image_external is subset of all available functionalities of EGL_KHR_gl_image. Simply speaking, TEXTURE_EXTERNAL_OES just prevents all write operations (e.g. TexImage2D and TexSubImage2D), which EGL_KHR_gl_image already allows. Using TEXTURE_EXTERNAL_OES for EGL_KHR_gl_image EGLImage doesn't add any functionalities but prohibits some functionalities. So I think this is the safe change. Kenneth, what do you think? (In reply to Dongseong Hwang from comment #45) > On the other hands, from implementation side, I think Aditya's change makes > sense and this change would be safe because > 1. it passes Piglit and deqp > 2. it restricts available functionality. How does removing a restriction "restrict available functionality"? (In reply to Kenneth Graunke from comment #46) > (In reply to Dongseong Hwang from comment #45) > > On the other hands, from implementation side, I think Aditya's change makes > > sense and this change would be safe because > > 1. it passes Piglit and deqp > > 2. it restricts available functionality. > > How does removing a restriction "restrict available functionality"? I guess what he means is that when using TEXTURE_EXTERNAL_OES handle, there is less functionality exposed. Of course at that at same time client can still use the TEXTURE_2D handle with all functionality. So we are not really restricting any functionality here. I'm not sure if we have fast clear and compression state stored so that one could do resolve then switch them off (and never turn on again). If we allow this then this is something we should probably do in this use case? From client perspective this will result in un-optimal performance I'm not sure if there are sensible options here :/ Hi Tapani, I don't know the complexity of the task but definitely the spec doesn't say to restrict external texture creation using dmabuf. This is a policy we are introducing in our driver and other platforms don't exhibit this behavior. If you think what we are doing in the driver is confirming to specs, then can you or somebody from the MESA team take up this discussion with Google to resolve this Skqp failure by explaining to them what they are doing in the test code is wrong. The problem over here is we are juggling the ball from one end to the other. Google says there is no problem in their test code and if the driver is written according to spec, it should pass. But you/Intel-MESA team have a different opinion on this. As a result we are going nowhere wrt resolving this issue :( Can you/Ken please post a reply to this thread with your suggestion: https://partnerissuetracker.corp.google.com/issues/80123702 Also, this comment in the code doesn't make sense to me: /* We support external textures only for EGLImages created with * EGL_EXT_image_dma_buf_import. We may lift that restriction in the future. */ Why are we saying that "We may lift that restriction in the future" when we are not going to do it? (In reply to Tapani Pälli from comment #47) > (In reply to Kenneth Graunke from comment #46) > > (In reply to Dongseong Hwang from comment #45) > > > On the other hands, from implementation side, I think Aditya's change makes > > > sense and this change would be safe because > > > 1. it passes Piglit and deqp > > > 2. it restricts available functionality. > > > > How does removing a restriction "restrict available functionality"? > > I guess what he means is that when using TEXTURE_EXTERNAL_OES handle, there > is less functionality exposed. Of course at that at same time client can > still use the TEXTURE_2D handle with all functionality. So we are not really > restricting any functionality here. Yes, correct. EGL image based on EGL_KHR_gl_image has the full potential of functionality. The only read functionality is exposed with TEXTURE_EXTERNAL_OES target. > I'm not sure if we have fast clear and compression state stored so that one > could do resolve then switch them off (and never turn on again). If we allow > this then this is something we should probably do in this use case? From > client perspective this will result in un-optimal performance I'm not sure > if there are sensible options here :/ If unfortunately it's un-optimal performance from mesa driver perspective (we don't know it's true tho), the driver should respect the spec and the spec explicitly says TEXTURE_EXTERNAL_OES target can be used "ANY EGL image" described by GL_OES_EGL_image. https://www.khronos.org/registry/OpenGL/extensions/OES/OES_EGL_image_external.txt Dependencies on GL_OES_EGL_image If GL_OES_EGL_image is supported then change the text in both extensions to allow either TEXTURE_2D or TEXTURE_EXTERNAL_OES to be passed as the <target> parameter to EGLImageTargetTexture2DOES(). When <target> is TEXTURE_2D, behavior of EGLImageTargetTexture2DOES() is as described in the GL_OES_EGL_image spec. When <target> is TEXTURE_EXTERNAL_OES, behavior of EGLImageTargetTexture2DOES() is as described in this spec. In my opinion, Intel has no excuse to Google about this conformance test failure, and a major product has been broken. (In reply to Aditya Swarup from comment #49) > Why are we saying that "We may lift that restriction in the future" when we > are not going to do it? No one has said that we're not going to do it. People need to think through the ramifications and make sure that it's safe to do so - or if there's additional work required for correctness (beyond fixing one unit test). (In reply to Dongseong Hwang from comment #50) > In my opinion, Intel has no excuse to Google about this conformance test > failure, and a major product has been broken. Thank you for the analysis. I believe you're correct that we should allow this. The specs were fairly new when we first implemented this, and there were very few users, so it makes sense that we'd need to adjust things now that it's in more widespread use. Please avoid grandstanding about how we're "breaking major products", "have no excuse", and other implications that we're doing a terrible job. In the open source world, it only makes people less likely to care about your problem and listen to you. Thank you. (In reply to Kenneth Graunke from comment #51) > (In reply to Dongseong Hwang from comment #50) > > In my opinion, Intel has no excuse to Google about this conformance test > > failure, and a major product has been broken. > > Thank you for the analysis. I believe you're correct that we should allow > this. The specs were fairly new when we first implemented this, and there > were very few users, so it makes sense that we'd need to adjust things now > that it's in more widespread use. Thank you. > Please avoid grandstanding about how we're "breaking major products", "have > no excuse", and other implications that we're doing a terrible job. In the > open source world, it only makes people less likely to care about your > problem and listen to you. Thank you. Sorry for bad sentence. As the second English speaker, subtle nuance is always difficult for me. I just translated my language to English. I'll be more careful. (In reply to Aditya Swarup from comment #48) > Hi Tapani, > > I don't know the complexity of the task but definitely the spec doesn't say > to restrict external texture creation using dmabuf. This is a policy we are > introducing in our driver and other platforms don't exhibit this behavior. > > If you think what we are doing in the driver is confirming to specs, then > can you or somebody from the MESA team take up this discussion with Google > to resolve this Skqp failure by explaining to them what they are doing in > the test code is wrong. > > The problem over here is we are juggling the ball from one end to the other. > Google says there is no problem in their test code and if the driver is > written according to spec, it should pass. But you/Intel-MESA team have a > different opinion on this. As a result we are going nowhere wrt resolving > this issue :( My suggestion (in internal mail thread) was to use the patch for some time in our product trees and see if there's any smoke coming out in full validation or general usage with apps. This was also meant to unblock you guys, I'm sorry if this approach did not work out. Google here :-) Git-blame claims that I wrote the pivotal comment /* We support external textures only for EGLImages created with * EGL_EXT_image_dma_buf_import. We may lift that restriction in the future. */ in April 2015. Even if git-blame is wrong (due to refactors), I'm the person who insisted that we enforce that restriction in the initial rollout of GL_OES_EGL_image_external. It's finally time to lift the restriction :) Copying what I said on mesa-dev today: ==== The original reason was that the driver was not yet robust enough to handle all the corner cases of external images, auxiliary surfaces, etc. But, I believe the driver became sufficiently robust enough near September last year. To clarify... I believe the driver is robust regarding auxiliary surfaces and EGLImages (solution: The driver disables auxiliary surfaces on EGLImages created via this codepath), but not necessarily fully correct around all the non-compression corner cases of the EGLImage api. But, as Ken said, we probably still have problems lurking around the orphaning logic, and we can fix those corner cases as we find them. ==== The EGLImage issue should be fixed by this commit in master: commit a5c39ed974402c6a40d51c6189547d1f29581fbe Author: Aditya Swarup <aditya.swarup@intel.com> AuthorDate: Wed Oct 31 17:12:40 2018 -0700 Commit: Chad Versace <chadversary@chromium.org> CommitDate: Thu Nov 8 12:33:06 2018 -0800 i965: Lift restriction in external textures for EGLImage support Fixes Skqp's unitTest_EGLImageTest test. For Intel platforms, we support external textures only for EGLImages created with EGL_EXT_image_dma_buf_import. This restriction seems to be Intel specific and not present for other platforms. While running SKQP test - unitTest_EGLImageTest, GL_INVALID is sent to the test because of this restriction. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105301 Signed-off-by: Aditya Swarup <aditya.swarup@intel.com> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> Reviewed-by: Chad Versace <chadversary@chromium.org> There's bug is quite noisy, I'm resolving it as fixed. Please file separate new bugs against each new individual failure. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.