Setup:
* SKL GT2 / GT3e
* Ubuntu 18.04
* *drm-tip* v4.19 kernel
* Mesa & X git head

Test-case:
* Run a test-case using compute shaders

Expected output:
* No GPU hangs (like with earlier Mesa commits)

Actual output:
* Recoverable GPU hangs in compute shader using test-cases:
  - GfxBench Aztec Ruins, CarChase and Manhattan 3.1
  - Sascha Willems' Vulkan compute demos
  - SynMark CSDof / CSCloth
* Vulkan compute demos fail to run (other tests run successfully despite hangs)

This seems to be SKL specific; it's not visible on other HW.

This regression happened between the following Mesa commits:
* dca35c598d: 2018-11-19 15:57:41: intel/fs,vec4: Fix a compiler warning
* a999798daa: 2018-11-20 17:09:22: meson: Add tests to suites

It also seems to be specific to the *drm-tip* v4.19.0 kernel, as I don't see it with the latest drm-tip v4.20.0-rc3 kernel. So it's also possible that it's a bug in i915 that just gets triggered by the Mesa change, and which got fixed later.

Sascha Willems' Vulkan Raytracing demo outputs the following on first run:
---------------------------------
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    14920 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    10300 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 269 vs 256
    10944 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    11920 bytes into the SPIR-V binary
INTEL-MESA: error: src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
vulkan_raytracing: base/vulkanexamplebase.cpp:651: void VulkanExampleBase::submitFrame(): Assertion `res == VK_SUCCESS' failed.
-----------------------------
(Other runs show just the error and assert.)
Since this bug is limited to a drm-tip kernel, it seems likely that the problem is in the kernel, not in mesa. Can you reproduce it on any released kernel?
(In reply to Eero Tamminen from comment #0)
> It also seems to be specific to the *drm-tip* v4.19.0 kernel, as I don't see
> it with the latest drm-tip v4.20.0-rc3 kernel. So it's also possible that
> it's a bug in i915 that just gets triggered by the Mesa change, and which
> got fixed later.

I've now seen hangs also with the drm-tip v4.20.0-rc3 kernel.

However, these GPU hangs don't happen anymore with this or a later Mesa commit (regardless of whether they're with the v4.19 or v4.20-rc4 drm-tip kernels):
3c96a1e3a97ba 2018-11-26 08-29-39: radv: Fix opaque metadata descriptor last layer
-> FIXED?

(I'm lacking data for several previous days, so I can't give an exact time when those hangs stopped.)

The Raytracing demo SPIR-V warnings still happen, although I updated Sascha Willems' demos to the latest Git version.
Sorry, all the hangs have happened with drm-tip v4.20-rc versions, not v4.19.

Last night there were again recoverable hangs on SKL, with drm-tip v4.20-rc4:
* GfxBench v5-GOLD2 Aztec Ruins GL & Vulkan ("normal") versions
* Unigine Heaven v4.0
* SynMark v7 CSCloth

Heaven doesn't use compute shaders, so maybe the issue isn't compute related after all.
It also affects me on Skylake i7-6500U, Mesa 18.3.1 and kernel 4.19.12. I am able to reproduce it in my own vulkan app. If I don't dispatch any compute work, everything works fine. As soon as I submit a CB with dispatches, I get that same error:

INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
I should clarify: when I said "as soon as I submit", I mean the vkQueueSubmit call exits with device lost error. Before it returns, my desktop freezes for a couple seconds (maybe I can move my mouse, but it doesn't render the new position while hung).
Could you attach the /sys/class/drm/card0/error file after you notice a hang? Thanks!
Created attachment 142952 [details]
CarChase GPU hang

I'm now pretty sure it's a drm-tip kernel issue. It went away after the v4.20-rc4 kernel version at the end of November, but it still happens when using our last v4.19 drm-tip build (currently used in our Mesa tracking). It seems to happen more frequently with SKL GT3e than GT2.

Attached is the error state for a GfxBench CarChase hang with Mesa (8c93ef5de98a9) from a couple of days ago.

I've now updated our Mesa tracking to use a v4.20 drm-tip build; I'll tell next week whether that helped (as expected).
I know this is going to be painful, but it would be really good to have a bisect on what commit broke this...

Skimming through the logs, I couldn't find anything between drm-tip/4.18-rc7 and drm-tip/4.20-rc4 that indicates a hang of this kind on gen9.

A bit later (4th of December) this fix appeared that could have an impact:

commit 4a15c75c42460252a63d30f03b4766a52945fb47
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Dec 3 13:33:41 2018 +0000

    drm/i915: Introduce per-engine workarounds

    We stopped re-applying the GT workarounds after engine reset since
    commit 59b449d5c82a ("drm/i915: Split out functions for different
    kinds of workarounds").

    Issue with this is that some of the GT workarounds live in the MMIO
    space which gets lost during engine resets. So far the registers in
    0x2xxx and 0xbxxx address range have been identified to be affected.

    This losing of applied workarounds has obvious negative effects and
    can even lead to hard system hangs (see the linked Bugzilla).

    Rather than just restoring this re-application, because we have also
    observed that it is not safe to just re-write all GT workarounds
    after engine resets (GPU might be live and weird hardware states can
    happen), we introduce a new class of per-engine workarounds and move
    only the affected GT workarounds over.

    Using the framework introduced in the previous patch, we therefore
    after engine reset, re-apply only the workarounds living in the
    affected MMIO address ranges.

    v2:
     * Move Wa_1406609255:icl to engine workarounds as well.
     * Rename API. (Chris Wilson)
     * Drop redundant IS_KABYLAKE. (Chris Wilson)
     * Re-order engine wa/ init so latest platforms are first. (Rodrigo Vivi)

    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugzilla.freedesktop.org/show_bug.cgi?id=107945
    Fixes: 59b449d5c82a ("drm/i915: Split out functions for different kinds of workarounds")
Still seeing the hangs with the latest Mesa and drm-tip 4.20 kernel, on SKL GT3e & GT4e. It happens approximately on 1 out of 3 runs.

Seems to happen only with:
* Aztec Ruins (normal, FullHD resolution)
* CarChase, but only when it's run in 4K resolution (not in FullHD):
  testfw_app --gfx glfw --gl_api desktop_core --width 3840 --height 2160 --fullscreen 0 --test_id gl_4
(In reply to Jakub Okoński from comment #4)
> It also affects me on Skylake i7-6500U, Mesa 18.3.1 and kernel 4.19.12. I am
> able to reproduce it in my own vulkan app. If I don't dispatch any compute
> work, everything works fine. As soon as I submit a CB with dispatches, I get
> that same error:
>
> INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_device.c:2091: GPU
> hung on one of our command buffers (VK_ERROR_DEVICE_LOST)

Is this fully reproducible? If yes, could you either attach your test-case, or (preferably :)) try bisecting it from Mesa?

As can be seen from the above comments, this isn't reproducible enough in my test-cases that I could reliably bisect it (or even tell whether it's a Mesa or kernel issue).
I don't know about these other applications; I've been experiencing this issue in my own vulkan app. I had time today to mess around a bit more. First I removed my dispatches and it worked fine, then I brought them back and started simplifying my compute shader.

So far, I've been able to isolate the issue to a single `barrier()` GLSL call near the end of my shader. I have another barrier earlier - `memoryBarrierShared()` - and it doesn't cause any issues. Perhaps this is isolated to control flow barriers in compute shaders?

I am preparing my code to serve as a repro case. I should have it soon, but I use a Rust toolchain so it might not be the easiest.
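As an aside, here is a minimal GLSL sketch (an illustration only, not code from Jakub's application) of the difference between the two barrier kinds mentioned above: memoryBarrierShared() only orders accesses to shared variables, while barrier() is an execution (control flow) barrier that makes every invocation in the workgroup wait at that point.

```
#version 450

layout (local_size_x = 64) in;

// Hypothetical shared array, one slot per invocation in the workgroup.
shared uint partial[64];

void main() {
    // Each invocation writes its own slot in shared memory.
    partial[gl_LocalInvocationIndex] = gl_GlobalInvocationID.x;

    // Memory barrier: orders / makes visible the shared-memory writes,
    // but does NOT make invocations wait for each other.
    memoryBarrierShared();

    // Execution (control flow) barrier: every invocation in the
    // workgroup must reach this point before any may continue.
    barrier();

    // Only now is it safe to read a slot written by another invocation.
    uint neighbour = partial[(gl_LocalInvocationIndex + 1u) % 64u];
}
```

In practice the two are used together, as above: the memory barrier makes the writes visible, and the execution barrier ensures no invocation reads a slot before its writer has finished.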
OK, I think I found the precise issue. It occurs when using a control flow barrier in the shader with more than 64 items in the workgroup. To put it in concrete terms:

Shader A:

...
layout (local_size_x = 64) in;

void main() {
    // code
    barrier();
}
----
Works fine.

Shader B:

...
layout (local_size_x = 65) in;

void main() {
    // code
    barrier();
}
----
Hangs with INTEL-MESA: error: src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST).

Shader C:

...
layout (local_size_x = 65) in;

void main() {
    // code
    // barrier(); without any control flow barriers inside
}
----
Works fine as well.

This should be enough to zoom in on the issue, but if you need code you can execute and repro locally, let me know and I can deliver it.
The Vulkan spec defines a minimum of 128 items in the first dimension of a workgroup, and the driver reports maxComputeWorkGroupSize[0] = 896, so I think my application is well behaved in this case and should not hang because of limits.
(In reply to Jakub Okoński from comment #12)
> OK, I think I found the precise issue. It occurs when using a control flow
> barrier in the shader with more than 64 items in the workgroup.

Great, thanks!

(In reply to Jakub Okoński from comment #13)
> The Vulkan spec defines a minimum of 128 items in the first dimension of a
> workgroup, and the driver reports maxComputeWorkGroupSize[0] = 896, so I
> think my application is well behaved in this case and should not hang
> because of limits.

At certain workgroup size thresholds, at least the used SIMD mode can increase (the threshold differs between platforms). You can check whether there's a change in that between the working and non-working cases with something like this:

INTEL_DEBUG=cs <your test-case> 2> shader-ir-asm.txt
grep ^SIMD shader-ir-asm.txt

If the SIMD mode doesn't change, what's the diff between the shader IR/ASM output of the two versions?
I couldn't get the output of `INTEL_DEBUG=cs`: it returned an output once, but I lost it due to terminal scrollback. No matter how many times I ran it again, it never dumped the actual CS shader info.

I was successful when using `INTEL_DEBUG=cs,do32`; that combination of options prints the expected output every time. It doesn't change whether my program works or crashes, so I hope that's OK.

So with the forced SIMD32 mode, codegen is still different and the issue remains. I'm attaching both outputs below (do32-failing.txt and do32-working.txt); here's the diff for the generated native code:

 Native code for unnamed compute shader (null)
-SIMD32 shader: 496 instructions. 0 loops. 19586 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 7936 to 6560 bytes (17%)
+SIMD32 shader: 498 instructions. 0 loops. 19586 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 7968 to 6576 bytes (17%)
    START B0 (162 cycles)
 mov(8) g4<1>UW 0x76543210V { align1 WE_all 1Q };
 mov(16) g60<1>UD g0.1<0,1,0>UD { align1 1H compacted };
@@ -4354,16 +4354,18 @@
 add(16) g63<1>D g8<8,8,1>D g1.5<0,1,0>D { align1 2H };
 add(16) g3<1>UW g4<16,16,1>UW 0x0010UW { align1 WE_all 1H };
 mov(16) g58<1>D g4<8,8,1>UW { align1 1H };
-shl(16) g68<1>D g55<8,8,1>D 0x00000006UD { align1 1H };
-shl(16) g46<1>D g63<8,8,1>D 0x00000006UD { align1 2H };
+mul(16) g68<1>D g55<8,8,1>D 65D { align1 1H compacted };
+mul(16) g46<1>D g63<8,8,1>D 65D { align1 2H };
 shl(16) g56<1>D g2<0,1,0>D 0x00000005UD { align1 1H };
 shl(16) g64<1>D g2<0,1,0>D 0x00000005UD { align1 2H };
 mov(16) g66<1>D g3<8,8,1>UW { align1 2H };
 add(16) g60<1>D g58<8,8,1>D g56<8,8,1>D { align1 1H compacted };
-and(16) g62<1>UD g60<8,8,1>UD 0x0000003fUD { align1 1H compacted };
+math intmod(8) g62<1>UD g60<8,8,1>UD 0x00000041UD { align1 1Q compacted };
+math intmod(8) g63<1>UD g61<8,8,1>UD 0x00000041UD { align1 2Q compacted };
 add.z.f0(16) g76<1>D g68<8,8,1>D g62<8,8,1>D { align1 1H compacted };
 add(16) g68<1>D g66<8,8,1>D g64<8,8,1>D { align1 2H };
-and(16) g70<1>UD g68<8,8,1>UD 0x0000003fUD { align1 2H };
+math intmod(8) g70<1>UD g68<8,8,1>UD 0x00000041UD { align1 3Q };
+math intmod(8) g71<1>UD g69<8,8,1>UD 0x00000041UD { align1 4Q };
 add.z.f0(16) g50<1>D g46<8,8,1>D g70<8,8,1>D { align1 2H };
 (+f0) if(32) JIP: 416 UIP: 416 { align1 };
   END B0 ->B1 ->B2
Created attachment 143046 [details]
INTEL_DEBUG=cs,do32 output for the 65 local workgroup size case (GPU hang)
Created attachment 143047 [details]
INTEL_DEBUG=cs,do32 output for the 64 local workgroup size case (working)
(In reply to Jakub Okoński from comment #15)
> I couldn't get the output of `INTEL_DEBUG=cs`: it returned an output once,
> but I lost it due to terminal scrollback. No matter how many times I ran it
> again, it never dumped the actual CS shader info.

That's weird. If shaders come from the cache, they seem to be missing the "SIMD" line and some other info, but the actual assembly instructions should still be there.

Do you get the shader info if you also disable the cache:

MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs <use-case>

?
It helps and I get reliable output, although I don't understand how the GLSL caching option is relevant. I use vulkan without VkPipelineCache, and yet it is being cached anyway?

I'm attaching new debug outputs; the diffs look pretty similar, but they now use SIMD16 in both cases, with one more instruction in the failing shader.
Created attachment 143060 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 65 items (failing)
Created attachment 143061 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 64 items (working)
(In reply to Jakub Okoński from comment #19)
> It helps and I get reliable output, although I don't understand how the GLSL
> caching option is relevant.

The shader compiler is shared between the Vulkan and GL drivers.

> I use vulkan without VkPipelineCache, and yet it
> is being cached anyway?

Aren't pipeline objects a higher-level concept than shaders? Shader caching underneath should be invisible to the upper layers (except for performance and debug output) and not affect the resulting shader binaries (unless it's buggy).

Lionel, do you think the shader assembly changes between the working and non-working workgroup sizes have anything to do with the hangs? There's no code-flow difference, and the few instruction changes (shift -> mul, and -> 2x intmod) look innocent to me.

-> seems that minimal reproduction code would help. :-)
I'm trying to reduce the shader to the bare minimum that reproduces the hang; here it is:

```
#version 450

layout (local_size_x = 65) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    barrier();
}
```

The new piece of information is the early return that is required to trigger the hang. From this baseline shader, if I decrease local_size_x to 64, it works. Or I can remove the if statement and it will also work. Or I can remove the barrier() and it will also start to work.

It seems to be a combination of these that causes the hang. If I take away any piece of it, the problem goes away.

I'm attaching the debug info on this minimal shader.
Created attachment 143062 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs for the minimal repro shader
I spoke a bit too soon. For this minimal shader, the local_size_x seems irrelevant and decreasing it doesn't prevent the hang. I wonder if this is the same issue or if I uncovered some other issue by reducing the test case.
Lionel, is the info from Jakub enough to reproduce the issue?

FYI: Last night's git builds of drm-tip kernel, X and Mesa didn't have hangs on SKL, but all BXT devices had a hard hang in the Aztec Ruins Vulkan test-case. (Just a git version of Mesa + v4.20 drm-tip kernel wasn't enough to trigger it.)
(In reply to Eero Tamminen from comment #26)
> FYI: Last night's git builds of drm-tip kernel, X and Mesa didn't have hangs
> on SKL, but all BXT devices had a hard hang in the Aztec Ruins Vulkan
> test-case.

This was an unrelated (kernel) bug, which seems to have been introduced and fixed (a day later) by Chris.
Jakub, are you still seeing the hangs with your test-cases?

I haven't seen them for a while with my test-cases when using the latest Mesa (and drm-tip kernel 4.20 or 5.0-rc). There have been a couple of CS related Mesa fixes since I filed this bug:

----------------------------------------------------------------
commit fea5b8e5ad5042725cb52d6d37256b9185115502
Author:     Oscar Blumberg <carnaval@12-10e.me>
AuthorDate: Sat Jan 26 16:47:42 2019 +0100
Commit:     Kenneth Graunke <kenneth@whitecape.org>
CommitDate: Fri Feb 1 10:53:33 2019 -0800

    intel/fs: Fix memory corruption when compiling a CS

    Missing check for shader stage in the fs_visitor would corrupt the
    cs_prog_data.push information and trigger crashes / corruption later
    when uploading the CS state.
...
commit 31e4c9ce400341df9b0136419b3b3c73b8c9eb7e
Author:     Lionel Landwerlin <lionel.g.landwerlin@intel.com>
AuthorDate: Thu Jan 3 16:18:48 2019 +0000
Commit:     Lionel Landwerlin <lionel.g.landwerlin@intel.com>
CommitDate: Fri Jan 4 11:18:54 2019 +0000

    i965: add CS stall on VF invalidation workaround

    Even with the previous commit, hangs are still happening. The problem
    there is that the VF cache invalidate do happen immediately without
    waiting for previous rendering to complete. What happens is that we
    invalidate the cache the moment the PIPE_CONTROL is parsed but we
    still have old rendering in the pipe which continues to pull data
    into the cache with the old high address bits. The later rendering
    with the new high address bits then doesn't have the clean cache
    that it expects/needs.
----------------------------------------------------------------

If you're still seeing hangs and they're 100% reproducible, I think it would be better to file a separate bug about it and get it bisected.
Still hangs on 5.0-rc5; I will try compiling the latest Mesa from git to see if that helps.
I tried on mesa 19.1.0 git revision 64d3b148fe7 and it also hangs, do you want me to create another issue?
Created attachment 143303 [details]
SKL GT3e CarChase GPU hang

Gah. Every time I comment that this seems to have gone away, the very next day I get a new (recoverable) hang. I.e. this happens nowadays *very* rarely.

This time there was one recoverable hang in GfxBench CarChase on SKL GT3e, and no hangs on the other machines. Like earlier, it doesn't have a significant impact on performance.

(In reply to Jakub Okoński from comment #30)
> I tried on mesa 19.1.0 git revision 64d3b148fe7 and it also hangs, do you
> want me to create another issue?

Lionel, any comments?

Jakub, you could attach the i915 error state from:
/sys/class/drm/card0/error

so that Mesa developers can check whether your hangs happen in the same place as mine.
Created attachment 143305 [details]
/sys/class/drm/card0/error Jakub

Here you go.
(In reply to Jakub Okoński from comment #25)
> I spoke a bit too soon. For this minimal shader, the local_size_x seems
> irrelevant and decreasing it doesn't prevent the hang. I wonder if this is
> the same issue or if I uncovered some other issue by reducing the test case.

(In reply to Jakub Okoński from comment #32)
> Created attachment 143305 [details]
> /sys/class/drm/card0/error Jakub
>
> Here you go.

If there are still two different ways of reliably triggering the hang, could you attach the error output also for the other one, and name the attachments so that they can be differentiated? (E.g. "minimal conditional return + barrier case hang" and "large local_size_x case hang".)
I hope I'm not going crazy, but on 5.0-rc5 with Mesa 19.0-rc2, the goalpost has moved to 32 shader invocations in a local group, so comment #12 is outdated when it comes to the number. Otherwise, the behavior is the same: it's the combination of > 32 items AND conditional early return statements that causes the hang.

So in the end I think I only have one repro case; it's this:

#version 450

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    barrier();
}

From here, I can do any of:
1) comment out the barrier() call
2) comment out the return statement (the if can stay)
3) decrease local_size_x to 32

and it will prevent the crash from happening.

The drm/card0 error that I uploaded on February 5th is the only crash I can provide.
I meant to say: it's the combination of > 32 items AND conditional early return statements AND a barrier that causes the hang.

I also checked replacing the barrier() call with memory barriers, and that prevents the crash. So only execution barriers are a component/contributor to this issue.
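For reference, a hypothetical sketch of the substitution described above, applied to the minimal repro shader from the previous comments (an illustration only, not Jakub's actual code): the execution barrier is swapped for a shared-memory barrier, which only orders memory accesses and does not synchronize invocations.

```
#version 450

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    // memoryBarrierShared() orders accesses to shared variables but,
    // unlike barrier(), it is not an execution barrier, so invocations
    // that returned early are not waited on. This variant reportedly
    // does not hang.
    memoryBarrierShared();
}
```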
Should this be blocking the Mesa 19.0 release? Why wouldn't we suspect a bug in drm-tip instead of Mesa?
(In reply to Mark Janes from comment #36)
> Should this be blocking the Mesa 19.0 release?

At least it started happening for us after the previous release.

Because of bug 108787, caused by Meson, I started to wonder whether Meson is another possible cause for this (I started to see the random hangs sometime after switching to Meson). Jakub, are you building Mesa with Meson (like me) or Autotools?

> Why wouldn't we suspect a bug in drm-tip instead of Mesa?

Somebody needs to:
* bisect Jakub's 2 fully reproducible compute shader test-cases to find out whether they're the same issue or not, and whether they are Mesa or kernel issues
* look into the attached i915 error files to check whether Jakub's fully reproducible compute hangs and my very rarely happening CarChase hangs have the same cause, i.e. whether there should be separate bugs for these
* ask for new bugs to be filed where applicable and move them to drm-tip, if they're kernel issues

(If those rare CarChase hangs match neither of Jakub's reproducible compute hang cases, and the i915 error file isn't enough to locate the problem, then that particular issue can't be located / debugged and probably needs to be WONTFIX; it's nowadays so rare.)
I was using meson to build the release candidates of mesa 19.0. Using this script to be exact:
https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa#n39

I don't have much time available, but I can try bisecting. Need to come up with some scripts to build these packages on my workstation and not the mobile dual core SKU.

Should I be just bisecting the kernel with latest rc of mesa 19, just mesa, or both at the same time somehow?
(In reply to Jakub Okoński from comment #38)
> I was using meson to build the release candidates of mesa 19.0.

Found out that Meson isn't related (it was just enabling asserts in bug 108787).

> I don't have much time available, but I can try bisecting. Need to come up
> with some scripts to build these packages on my workstation and not the
> mobile dual core SKU.
>
> Should I be just bisecting the kernel with latest rc of mesa 19, just mesa,
> or both at the same time somehow?

First find out which one is the culprit. For that it should be enough to check some release versions of both that go far enough back, whatever you can test most easily (e.g. readily available distro packages). Test new Mesa with an old kernel, and a new kernel with old Mesa, to verify that the issue hasn't just moved.

Only after you've found versions that don't have the problem can you start bisecting. You might first try narrowing things down using release versions & pre-built packages, if such are available, to minimize the building needed for a real git bisect.
I have a local cache of old packages I used to have installed years ago. I tried a couple kernels down to 4.8.7 from November 2016 with mesa 19.0-rc2 and they all have the same problem as 5.0.

I also have old packages of mesa, down to 11.0.7, but I'm using a rolling release distro and it would take a lot of effort (and probably break the system) to downgrade this far back. I was unable to build even 18.x versions locally due to some incompatibilities with LLVM.

Maybe I'll try historical livecd versions of Ubuntu to check other Mesa versions. Is that a bad approach?
(In reply to Jakub Okoński from comment #40)
> I have a local cache of old packages I used to have installed years ago. I
> tried a couple kernels down to 4.8.7 from November 2016 with mesa 19.0-rc2
> and they all have the same problem as 5.0.
>
> I also have old packages of mesa, down to 11.0.7, but I'm using a rolling
> release distro and it would take a lot of effort (and probably break the
> system) to downgrade this far back. I was unable to build even 18.x versions
> locally due to some incompatibilities with LLVM.

i965 doesn't need/use LLVM. Just disable gallium & RADV and everything LLVM related from your Mesa build. Using autotools:

--with-dri-drivers=i965 --with-vulkan-drivers= --with-gallium-drivers= --disable-llvm

> Maybe I'll try historical livecd versions of Ubuntu to check other Mesa
> versions. Is that a bad approach?
(In reply to Jakub Okoński from comment #40)
> I have a local cache of old packages I used to have installed years ago. I
> tried a couple kernels down to 4.8.7 from November 2016 with mesa 19.0-rc2
> and they all have the same problem as 5.0.

Ok, so with a reproducible test-case it didn't even require a new kernel => updated summary.
Compute hangs aren't reproducible anymore with my test-cases, but recently I've very rarely seen a (system) hang on BXT in GfxBench Manhattan 3.1, which uses compute. These happen only with the Wayland version under Weston, not with the X version (under X, or Weston), so they're unlikely to be compute related though.

=> Jakub's fully reproducible test-cases are best for checking this.

Jakub, were you able to narrow down in which Mesa version your hangs happened / whether it's a 19.x regression?
Not yet, I need to find more time to do these rebuilds and bisect. I think I need to create a standalone, vulkan 1.0 test case for this, it's hard to do it in a bigger app.

Can I use the Conformance Test Suite to do this easily? I don't mean contributing to upstream CTS, just spinning off a test case with my problem.
Created attachment 143739 [details]
SKL GT3e CarChase GPU hang (mesa: b3aa37046b)

(In reply to Eero Tamminen from comment #43)
> Compute hangs aren't reproducible anymore with my test-cases

Added GPU hang tracking so that I can catch these. Attached is one with yesterday's Mesa Git.
(In reply to Jakub Okoński from comment #44)
> Not yet, I need to find more time to do these rebuilds and bisect. I think I
> need to create a standalone, vulkan 1.0 test case for this, it's hard to do
> it in a bigger app.
>
> Can I use the Conformance Test Suite to do this easily?

You might also try Piglit, as it nowadays seems to have some support for Vulkan. On a quick browse I didn't see any test for compute with Vulkan though.

> I don't mean contributing to upstream CTS, just spinning off a test case
> with my problem.

AFAIK Mesa CI runs both CTS and piglit, so getting the resulting test into the upstream version of either piglit or CTS would be good.

(In reply to Eero Tamminen from comment #41)
> i965 doesn't need/use LLVM. Just disable gallium & RADV and everything LLVM
> related from your Mesa build. Using autotools:
>
> --with-dri-drivers=i965 --with-vulkan-drivers= --with-gallium-drivers=
> --disable-llvm

Sorry, I of course meant: "--with-vulkan-drivers=intel".

(In reply to Eero Tamminen from comment #45)
> Added GPU hang tracking so that I can catch these.

Every few days there's a recoverable GPU hang on some SKL or BXT device in GfxBench Manhattan 3.1, CarChase or AztecRuins.
Finally made some progress here. I have created piglit test cases to demonstrate the problem. I still haven't done any bisecting, so I don't know if it's a regression.

Test #1: passes on my RADV desktop machine, fails on my Gen 9 6500U laptop and freezes the graphics for a couple seconds:

[require]

[compute shader]
#version 450

layout(binding = 0) buffer block {
    uint value[];
};

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    barrier();

    value[gl_GlobalInvocationID.x] = gl_GlobalInvocationID.x;
}

[test]
# 60 elements
ssbo 0 240

compute 5 1 1
probe ssbo uint 0 0 == 0
probe ssbo uint 0 16 == 4
probe ssbo uint 0 76 == 19
probe ssbo uint 0 128 == 0
probe ssbo uint 0 132 == 0

I have more variations of this; I could send a patch to piglit if you think it's valuable. Can you try to reproduce this exact case on your hardware?
I should have mentioned: the == 4 and == 19 assertions are failing for me; it acts like none of the SIMD lanes executed anything, AFAICT.
I double checked the SPIR-V specification, and I think this shader is invalid.

> 3.32.20 OpControlBarrier
>
> This instruction is only guaranteed to work correctly if placed strictly
> within uniform control flow within Execution. This ensures that if any
> invocation executes it, all invocations will execute it.
> If placed elsewhere, an invocation may stall indefinitely.

I guess RADV and/or AMD hardware can handle this case? Or maybe it's compiled differently?
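To show what the quoted requirement implies for the repro shader, here is a minimal sketch (an illustration, not from the bug report, and untested on the affected hardware) of how it can be restructured so that barrier() stays in uniform control flow: guard the per-invocation work with the condition instead of returning before the barrier.

```
#version 450

layout (local_size_x = 33) in;

void main() {
    // Compute the condition, but don't return early: all invocations
    // in the workgroup have to reach the barrier together.
    bool active = gl_GlobalInvocationID.x < 20;

    if (active) {
        // per-invocation work that must happen before the barrier
    }

    // barrier() is now in uniform control flow: every invocation
    // executes it, which is what OpControlBarrier requires.
    barrier();

    if (active) {
        // per-invocation work that must happen after the barrier
    }
}
```

With this structure, every invocation in the workgroup reaches the barrier regardless of gl_GlobalInvocationID.x.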
(In reply to Eero Tamminen from comment #46)
> (In reply to Eero Tamminen from comment #45)
> > Added GPU hang tracking so that I can catch these.
>
> Every few days there's a recoverable GPU hang on some SKL or BXT device in
> GfxBench Manhattan 3.1, CarChase or AztecRuins.

The last recoverable GfxBench i965 hangs were months ago, with older (5.0 or earlier) kernels. I've also seen Heaven hang twice on SKL in June, but not since then.

-> Marking this as WORKSFORME (as I don't know what was fixed).

(In reply to Jakub Okoński from comment #49)
> I double checked the SPIR-V specification, and I think this shader is
> invalid.

If you think there's a valid issue after all with your compute shaders, could you file a separate issue about that?

> > 3.32.20 OpControlBarrier
> >
> > This instruction is only guaranteed to work correctly if placed strictly
> > within uniform control flow within Execution. This ensures that if any
> > invocation executes it, all invocations will execute it.
> > If placed elsewhere, an invocation may stall indefinitely.
>
> I guess RADV and/or AMD hardware can handle this case? Or maybe it's
> compiled differently?