Created attachment 127134 [details]
i915 error state of GPU hang

Executing the dEQP case below may cause a GPU hang on SKL:

$ ./deqp-gles3 --deqp-case=dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.std140_instance_array_fragment
ERROR: <Text>Image comparison failed, got 16384 non-white pixels</Text>

[782161.763223] [drm] stuck on render ring
[782161.763416] [drm] GPU HANG: ecode 9:0:0x85dffffb, in deqp-gles3 [21566], reason: Engine(s) hung, action: reset
[782161.765222] drm/i915: Resetting chip after gpu hang
[782163.763513] [drm] RC6 on

See the attached log for the i915 hang state.
The same test case passes on HSW and BDW with the same driver, so the GPU hang is currently SKL-specific.
Does it hang all the time or just occasionally?
(In reply to Kenneth Graunke from comment #2)
> Does it hang all the time or just occasionally?

It can be reproduced consistently; it is not an occasional issue. I suspect it is due to memory resource misalignment.
This test passes reliably in the Mesa CI on sklgt2
I have been enabling sklgt4e in the Mesa CI and see similar gpu hangs on that platform. Randy, please specify which sku of skl you are testing.
(In reply to Mark Janes from comment #5)
> I have been enabling sklgt4e in the Mesa CI and see similar gpu hangs on
> that platform. Randy, please specify which sku of skl you are testing.

Hi Mark,

Yes, I am using GT4e; it's an Intel NUC6i7KYK. The hang can also be reproduced on kernel 4.7.

More info:

- mesa git top:
  commit 1d466b9b04662d41a403ea8fd617a5365750b1de
  Author: Steven Toth <stoth@kernellabs.com>
  Date: Thu Sep 29 08:11:00 2016 -0600
      gallium/hud: Add power sensor support

- libdrm git top:
  commit b382b22fd4aa6faa954396c94330f2c7d8428aba
  Author: Sean Paul <seanpaul@chromium.org>
  Date: Tue Jul 14 15:43:20 2015 -0400
      libdrm: Add rotation property fields

- latest publicly released kernel (4.7); the i915 top commit is:
  commit ad778f8967ea2f0bfda02701f918bcfcd495b721
  Author: Chris Wilson <chris@chris-wilson.co.uk>
  Date: Thu Aug 4 16:32:42 2016 +0100
      drm/i915: Export our request as a dma-buf fence on the reservation object

- version of the test suite, dEQP git: https://android.googlesource.com/platform/external/deqp
  top commit is ca988480be945772473f9256b6ae91fa6aa62bd1

Thanks,
Randy
I did some experimental debugging for this. The fragment shader makes a huge number of comparisons (114). If I comment out all the comparisons after the 58th, which looks like this:

result *= compare_ivec2(block[1].t[0].b[2].b[3], ivec2(1, -7));

then the hang disappears and the test passes. Not sure if this helps, but to me it seems that the hang is related to the code generated by these comparisons (?)
(In reply to Tapani Pälli from comment #7)
> I did some experimental debugging for this .. fragment shader has a huge
> number of comparisons made (114), if I comment out all the comparisons after
> 58 which looks like this:
>
> result *= compare_ivec2(block[1].t[0].b[2].b[3], ivec2(1, -7));
>
> then hang disappears and test passes. Not sure if this helps but for me it
> seems that the hang is related to the code generated by these comparisons (?)

Just to speculate a bit more: this test generates a huge number of UBO loads (there are 473 ubo_load_tmp variables in total), so the hang may be related to that.
Ben Widawsky asked me to provide card error state from a SKL GT4e to Mika Kuoppala to investigate this failure. In addition to the state attached by Randy, I captured another crash at: http://otc-mesa-ci.jf.intel.com/userContent/gt4_error/*view*/
Two more dEQP cases reproduce the GPU hang on GT4e. The signature is similar, i.e. HEAD 440, TAIL 460:

render command stream:
  START: 0x017a2000
  HEAD: 0x00000440
  TAIL: 0x00000460
  CTL: 0x00003001
  HWS: 0x007ec000
  ACTHD: 0x00000000 00000440

#deqp-gles3 --deqp-case=dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.shared_instance_array_fragment
#deqp-gles3 --deqp-case=dEQP-GLES3.functional.ubo.random.all_per_block_buffers.33
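For anyone comparing error states across these cases, here is a minimal Python sketch that pulls the ring registers out of a dump line like the one above and reports how far TAIL is ahead of HEAD. The function names are made up for illustration, and the sketch assumes the ring has not wrapped (TAIL >= HEAD); it is not part of any i915 tooling.

```python
import re

# Ring register line as quoted in the comment above (ACTHD left out,
# since it is printed as two words rather than a single 0x value).
RING_LINE = (
    "START: 0x017a2000 HEAD: 0x00000440 TAIL: 0x00000460 "
    "CTL: 0x00003001 HWS: 0x007ec000"
)


def parse_ring_registers(text):
    """Return {register name: integer value} for every 'NAME: 0x...' pair."""
    pairs = re.findall(r"(\w+):\s*0x([0-9a-fA-F]+)", text)
    return {name: int(value, 16) for name, value in pairs}


def bytes_between_head_and_tail(regs):
    """Distance from HEAD to TAIL, assuming no wrap-around (hypothetical helper)."""
    if regs["TAIL"] < regs["HEAD"]:
        raise ValueError("ring wrapped; ring size is needed to compute the distance")
    return regs["TAIL"] - regs["HEAD"]


regs = parse_ring_registers(RING_LINE)
print(hex(regs["HEAD"]), hex(regs["TAIL"]), bytes_between_head_and_tail(regs))
```

With the values quoted above, HEAD and TAIL are 0x20 (32) bytes apart, which gives a quick way to check whether another error state shows the same signature.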
27 cases failed on SKL GT4e due to GPU reset. They are:

dEQP-GLES3.functional.ubo.multi_nested_struct.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.single_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.random.all_per_block_buffers.33
dEQP-GLES3.functional.ubo.random.all_shared_buffer.23
dEQP-GLES3.functional.ubo.random.nested_structs_arrays_instance_arrays.24
dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.single_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.single_buffer.std140_instance_array_fragment
I believe this is a scratch space allocation problem. Increasing max_wm_threads from 64 * 9 to 72 * 9 in src/intel/common/gen_device_info.c seems to fix the problem.
(In reply to Kenneth Graunke from comment #12)
> I believe this is a scratch space allocation problem. Increasing
> max_wm_threads from 64 * 9 to 72 * 9 in src/intel/common/gen_device_info.c
> seems to fix the problem.

I probably spoke too soon - increasing the size of the buffer can also just move things around in the GTT so it happens to work. Ben and I think the old calculation is correct, but I'll look at this more carefully.
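To make the size change concrete, here is a rough Python sketch of the arithmetic, assuming the scratch buffer is sized as per-thread scratch size times the maximum thread count. The per-thread size of 2048 bytes and the function name are placeholders for illustration only; the real value depends on the shader and is not taken from this bug.

```python
# Thread counts discussed in comment #12.
OLD_MAX_WM_THREADS = 64 * 9
NEW_MAX_WM_THREADS = 72 * 9

# Hypothetical per-thread scratch size in bytes (assumption, not from the driver).
PER_THREAD_SCRATCH_BYTES = 2048


def total_scratch_bytes(max_threads, per_thread_bytes):
    """Scratch buffer size under the assumed 'threads * per-thread size' model."""
    return max_threads * per_thread_bytes


old_size = total_scratch_bytes(OLD_MAX_WM_THREADS, PER_THREAD_SCRATCH_BYTES)
new_size = total_scratch_bytes(NEW_MAX_WM_THREADS, PER_THREAD_SCRATCH_BYTES)
print(OLD_MAX_WM_THREADS, NEW_MAX_WM_THREADS, old_size, new_size, new_size - old_size)
```

Under this model the buffer grows by 12.5% regardless of the per-thread size, which is enough to change where neighbouring allocations land and is consistent with the caveat that the larger value may only hide the problem.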
Mika, can you reproduce this gpu hang?
Mika, don't bother reproducing this. Ken Graunke has found a bug and has a patch to address SKLGT4e instabilities.
It turns out this was our fault:

https://lists.freedesktop.org/archives/mesa-dev/2016-November/134606.html

Once again...documented...but in an obscure place. Nobody thinks to read the description of "scratch space base pointer", as that pointer has meant the same thing for 10 years...