https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4960/fi-icl-u2/igt@drv_selftest@live_hangcheck.html <3> [584.387917] process_csb:977 GEM_BUG_ON(!i915_request_completed(rq)) <4> [584.388022] ------------[ cut here ]------------ <2> [584.388024] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:977! <4> [584.388031] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4> [584.388035] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G U W 4.19.0-rc7-CI-CI_DRM_4960+ #1 <4> [584.388037] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.2352.A01.1808281852 08/28/2018 <4> [584.388096] RIP: 0010:process_csb+0x24b/0x790 [i915] <4> [584.388100] Code: 74 69 e1 e0 48 8b 35 64 db 19 00 49 c7 c0 8e ab 3c a0 b9 d1 03 00 00 48 c7 c2 f0 0b 3b a0 48 c7 c7 c6 36 2e a0 e8 c5 f8 e7 e0 <0f> 0b 49 8b 96 a8 00 00 00 48 8b 92 c0 02 00 00 8b 92 c0 00 00 00 <4> [584.388103] RSP: 0018:ffff8804b0703e20 EFLAGS: 00010082 <4> [584.388108] RAX: 000000000000000b RBX: ffff8804805a0008 RCX: 0000000000000000 <4> [584.388111] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff8804aea40aa8 <4> [584.388114] RBP: ffff8804b0703e88 R08: 00000000006ae0d9 R09: ffff8804aeb98000 <4> [584.388117] R10: 0000000000000000 R11: ffff8804aea40aa8 R12: ffff8804a51b104c <4> [584.388120] R13: 0000000000000001 R14: ffff8804229d70c0 R15: ffff8804a51b1040 <4> [584.388124] FS: 0000000000000000(0000) GS:ffff8804b0700000(0000) knlGS:0000000000000000 <4> [584.388127] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [584.388130] CR2: 0000561022285a30 CR3: 0000000005210002 CR4: 0000000000760ee0 <4> [584.388133] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [584.388136] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4> [584.388139] PKRU: 55555554 <4> [584.388141] Call Trace: <4> [584.388144] <IRQ> <4> [584.388198] __execlists_submission_tasklet+0x2c/0xc20 [i915] <4> [584.388253] execlists_submission_tasklet+0x46/0x60 [i915] <4> [584.388260] tasklet_action_common.isra.5+0x47/0xb0 <4> [584.388264] __do_softirq+0xd8/0x483 <4> [584.388268] ? _raw_spin_unlock+0x29/0x40 <4> [584.388271] irq_exit+0xa9/0xc0 <4> [584.388274] do_IRQ+0x9a/0x120 <4> [584.388278] common_interrupt+0xf/0xf <4> [584.388280] </IRQ> <4> [584.388284] RIP: 0010:cpuidle_enter_state+0xab/0x340 <4> [584.388287] Code: 44 00 00 31 ff e8 45 88 94 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 70 02 00 00 31 ff e8 1e 2b 9b ff e8 c9 fc 9e ff fb 4c 29 fb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff <4> [584.388290] RSP: 0018:ffffc9000010fe90 EFLAGS: 00000216 ORIG_RAX: ffffffffffffffda <4> [584.388295] RAX: ffff8804aeb0c040 RBX: 000000000005db6a RCX: 0000000000000000 <4> [584.388297] RDX: 0000000000000046 RSI: ffffffff82124e2a RDI: ffffffff820d386f <4> [584.388300] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000 <4> [584.388302] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8804ab720008 <4> [584.388303] R13: ffffffff8229bc18 R14: 0000000000000000 R15: 000000881035ced7
Note the bogons started in live_contexts. The gpu never seemed to recover from its lost confidence. My gut says this is a hw peculiarity.
Please make sure the bug has a i915 feature assigned before setting to ReadyForDev. Assigned to GEM/Other for lack of better understanding.
Chris, Any update here? Should I report to HW team?
Still waiting for feedback on the live_context improvement.
The problem with icl not handling the read-only flag is incidental to this death.
*** Bug 108535 has been marked as a duplicate of this bug. ***
Merging these 2 bugs as both stem from the HW behaving unusually following the execlists submission. In the latter case (bug 108535) we have a clean sequence where the CS event following the submission did not happen (the context being resubmitted was close to completion). Similar bad CS events would explain a lot of related bugs... So start by looking for common patterns.
*** Bug 108564 has been marked as a duplicate of this bug. ***
(In reply to Chris Wilson from comment #1) > Note the bogons started in live_contexts. The gpu never seemed to recover > from its lost confidence. My gut says this is a hw peculiarity. See bug 108569 for the GPU hangs icl encounters with read-only pages.
Update: Last seen this issue IGT_4682 (1 week, 5 days / 147 runs ago).
Investigating a possible reason with https://patchwork.freedesktop.org/series/51703/
@@ -910,6 +910,9 @@ static void process_csb(struct intel_engine_cs *engine) execlists->active); status = buf[2 * head]; + GEM_BUG_ON(!status); + GEM_DEBUG_EXEC(WRITE_ONCE(*(u32 *)(buf + 2 * head), 0)); -> <0> [48.247809] <idle>-0 2..s1 48257761us : execlists_submission_tasklet: rcs0 awake?=1, active=5 <0> [48.247866] <idle>-0 2d.s2 48257762us : process_csb: rcs0 cs-irq head=1, tail=2 <0> [48.247922] <idle>-0 2d.s2 48257763us : process_csb: rcs0 csb[2]: status=0x10000014:0x000001c0, active=0x5 <0> [48.247982] <idle>-0 2d.s2 48257764us : process_csb: rcs0 out[0]: ctx=448.1, global=60303 (fence 3cda:943) (current 60304), prio=0 <0> [48.248043] <idle>-0 2d.s2 48257765us : process_csb: rcs0 completed ctx=448 <0> [48.248099] <idle>-0 2d.s2 48257765us : __i915_request_submit: rcs0 fence 3cdc:943 -> global=60305, current 60304 <0> [48.248159] <idle>-0 2d.s2 48257768us : __execlists_submission_tasklet: rcs0 in[1]: ctx=512.1, global=60305 (fence 3cdc:943) (current 60304), prio=0 <0> [48.248223] <idle>-0 2d.s2 48257769us : __execlists_submission_tasklet: rcs0 in[0]: ctx=480.2, global=60304 (fence 3cdb:943) (current 60304), prio=0 <0> [48.248287] gem_ctx_-1086 6.... 48257775us : i915_request_add: rcs0 fence 3cf4:1080 <0> [48.248344] <idle>-0 2..s1 48257781us : execlists_submission_tasklet: rcs0 awake?=1, active=1 <0> [48.248402] <idle>-0 2d.s2 48257800us : process_csb: rcs0 cs-irq head=2, tail=4 <0> [48.248459] <idle>-0 2d.s2 48257801us : process_csb: rcs0 csb[3]: status=0x10000018:0x000001e0, active=0x1 <0> [48.248518] <idle>-0 2d.s2 48257802us : process_csb: rcs0 out[0]: ctx=480.2, global=60304 (fence 3cdb:943) (current 60304), prio=0 <0> [48.248580] <idle>-0 2d.s2 48257803us : process_csb: rcs0 csb[4]: status=0x10000001:0x00000000, active=0x1 <0> [48.248640] gem_ctx_-1086 6.... 48257810us : i915_request_retire: rcs0 fence 3cf5:942, global=60266, current 60304 <0> [48.248700] gem_ctx_-1086 6.... 48257810us : i915_request_retire: __retire_engine_request(rcs0) fence 3cf5:942, global=60266, current 60304 <0> [48.248762] <idle>-0 2..s1 48257817us : execlists_submission_tasklet: rcs0 awake?=1, active=5 <0> [48.248819] <idle>-0 2d.s2 48257818us : process_csb: rcs0 cs-irq head=4, tail=4 <0> [48.248875] <idle>-0 2..s1 48257839us : execlists_submission_tasklet: rcs0 awake?=1, active=5 <0> [48.248933] <idle>-0 2d.s2 48257840us : process_csb: rcs0 cs-irq head=4, tail=5 <0> [48.248989] <idle>-0 2d.s2 48257841us : process_csb: rcs0 csb[5]: status=0x10000014:0x000001e0, active=0x5 <0> [48.249049] gem_ctx_-1086 6.... 48257842us : i915_request_add: rcs0 fence 3cf5:1080 <0> [48.249106] <idle>-0 2d.s2 48257842us : process_csb: rcs0 out[0]: ctx=480.1, global=60304 (fence 3cdb:943) (current 60305), prio=0 <0> [48.249167] <idle>-0 2d.s2 48257843us : process_csb: rcs0 completed ctx=480 <0> [48.249223] <idle>-0 2d.s2 48257844us : __i915_request_submit: rcs0 fence 3cdd:943 -> global=60306, current 60305 <0> [48.249284] <idle>-0 2d.s2 48257847us : __execlists_submission_tasklet: rcs0 in[1]: ctx=544.1, global=60306 (fence 3cdd:943) (current 60305), prio=0 <0> [48.249348] <idle>-0 2d.s2 48257848us : __execlists_submission_tasklet: rcs0 in[0]: ctx=512.2, global=60305 (fence 3cdc:943) (current 60305), prio=0 <0> [48.249411] <idle>-0 2..s1 48257856us : execlists_submission_tasklet: rcs0 awake?=1, active=1 <0> [48.249468] <idle>-0 2d.s2 48257857us : process_csb: rcs0 cs-irq head=5, tail=1 <0> [48.249528] <idle>-0 2d.s2 48257858us : process_csb: rcs0 csb[0]: status=0x10000018:0x00000200, active=0x1 <0> [48.249587] <idle>-0 2d.s2 48257859us : process_csb: rcs0 out[0]: ctx=512.2, global=60305 (fence 3cdc:943) (current 60305), prio=0 <0> [48.249648] <idle>-0 2d.s2 48257860us : process_csb: rcs0 csb[1]: status=0x10000001:0x00000000, active=0x1 <0> [48.249708] gem_ctx_-1086 6.... 48257863us : i915_request_retire: rcs0 fence 3cf6:942, global=60267, current 60305 <0> [48.249768] gem_ctx_-1086 6.... 48257881us : i915_request_retire: __retire_engine_request(rcs0) fence 3cf6:942, global=60267, current 60305 <0> [48.249831] <idle>-0 2..s1 48257894us : execlists_submission_tasklet: rcs0 awake?=1, active=5 <0> [48.249889] <idle>-0 2d.s2 48257896us : process_csb: rcs0 cs-irq head=1, tail=2 <0> [48.249946] <idle>-0 2d.s2 48257897us : process_csb: rcs0 csb[2]: status=0x00000000:0x000001c0, active=0x5 <0> [48.250005] gem_ctx_-1086 6.... 48257930us : i915_request_add: rcs0 fence 3cf6:1080 <0> [48.250061] <idle>-0 2d.s2 48258061us : process_csb: process_csb:913 GEM_BUG_ON(!status)
Also seen on live_context in pre-merge: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2018/fi-icl-u/igt@drv_selftest@live_contexts.html <5> [534.212171] i915 0000:00:02.0: Resetting bcs0 for hang on bcs0 <3> [534.213048] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout <7> [534.213404] [drm:i915_reset_engine [i915]] Failed to reset bcs0, ret=-110 <7> [534.214786] [drm:i915_reset_device [i915]] resetting chip <5> [534.216220] i915 0000:00:02.0: Resetting chip for hang on bcs0 <3> [534.216386] process_csb:977 GEM_BUG_ON(!i915_request_completed(rq)) <4> [534.216719] ------------[ cut here ]------------ <2> [534.216725] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:977! <4> [534.216745] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4> [534.216755] CPU: 7 PID: 4747 Comm: drv_selftest Tainted: G U W 4.19.0-CI-CI_DRM_5054+ #1 <4> [534.216761] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018 <4> [534.216906] RIP: 0010:process_csb+0x24b/0x790 [i915] <4> [534.216914] Code: 74 53 90 e0 48 8b 35 04 07 1a 00 49 c7 c0 c0 de 8d a0 b9 d1 03 00 00 48 c7 c2 d0 3d 8c a0 48 c7 c7 e6 4c 7f a0 e8 75 e3 96 e0 <0f> 0b 49 8b 96 a8 00 00 00 48 8b 92 c0 02 00 00 8b 92 c0 00 00 00 <4> [534.216921] RSP: 0018:ffffc9000080f760 EFLAGS: 00010086 <4> [534.216930] RAX: 000000000000000b RBX: ffff8804a4c58008 RCX: 0000000000000000 <4> [534.216935] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8804ae252538 <4> [534.216941] RBP: ffffc9000080f7c8 R08: 0000000000077fb6 R09: ffff8804ad910000 <4> [534.216947] R10: 0000000000000000 R11: ffff8804ae252538 R12: ffff88049681f04c <4> [534.216952] R13: 0000000000000001 R14: ffff8804214d3640 R15: ffff88049681f040 <4> [534.216959] FS: 00007f9ae929f980(0000) GS:ffff8804affc0000(0000) knlGS:0000000000000000 <4> [534.216965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [534.216971] CR2: 0000560b357eb7b0 CR3: 0000000470868002 CR4: 0000000000760ee0 <4> [534.216976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [534.216981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4> [534.216986] PKRU: 55555554 <4> [534.216991] Call Trace: <4> [534.217137] execlists_reset_prepare+0x54/0x150 [i915] <4> [534.217268] i915_gem_reset_prepare_engine+0x20/0x40 [i915] <4> [534.217381] i915_gem_reset_prepare+0x2c/0x70 [i915] <4> [534.217504] ? i915_request_wait+0x16d/0x840 [i915] <4> [534.217612] i915_reset+0x117/0x280 [i915] <4> [534.217731] __i915_wait_request_check_and_reset.isra.8+0x44/0x50 [i915] <4> [534.217843] i915_request_wait+0x4fd/0x840 [i915] <4> [534.217857] ? lock_acquire+0xa6/0x1c0 <4> [534.217867] ? wake_up_q+0x70/0x70 <4> [534.217875] ? wake_up_q+0x70/0x70 <4> [534.217992] i915_gem_object_wait_fence+0x8a/0x110 [i915] <4> [534.218105] i915_gem_object_wait+0x2dc/0x510 [i915] <4> [534.218217] i915_gem_object_set_to_cpu_domain+0x35/0x130 [i915] <4> [534.218328] igt_vm_isolation+0xb1f/0xfe0 [i915] <4> [534.218481] __i915_subtests+0x5e/0xf0 [i915] <4> [534.218625] __run_selftests+0x10b/0x190 [i915] <4> [534.218789] i915_live_selftests+0x2c/0x60 [i915] <4> [534.218948] i915_pci_probe+0x50/0xa0 [i915] <4> [534.218967] pci_device_probe+0xa1/0x130 <4> [534.218983] really_probe+0x25d/0x3c0 <4> [534.218996] driver_probe_device+0x10a/0x120 <4> [534.219008] __driver_attach+0xdb/0x100 <4> [534.219020] ? driver_probe_device+0x120/0x120 <4> [534.219030] bus_for_each_dev+0x74/0xc0 <4> [534.219043] bus_add_driver+0x15f/0x250 <4> [534.219054] ? 0xffffffffa09fe000 <4> [534.219065] driver_register+0x56/0xe0 <4> [534.219075] ? 0xffffffffa09fe000 <4> [534.219086] do_one_initcall+0x58/0x2e0 <4> [534.219098] ? rcu_lockdep_current_cpu_online+0x8f/0xd0 <4> [534.219108] ? do_init_module+0x1d/0x1ea <4> [534.219120] ? rcu_read_lock_sched_held+0x6f/0x80 <4> [534.219132] ? kmem_cache_alloc_trace+0x264/0x290 <4> [534.219144] do_init_module+0x56/0x1ea <4> [534.219158] load_module+0x26f5/0x29d0 <4> [534.219177] ? vfs_read+0x122/0x140 <4> [534.219201] ? __se_sys_finit_module+0xd3/0xf0 <4> [534.219212] __se_sys_finit_module+0xd3/0xf0 <4> [534.219232] do_syscall_64+0x55/0x190 <4> [534.219246] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [534.219255] RIP: 0033:0x7f9ae8b68839 <4> [534.219265] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48 <4> [534.219273] RSP: 002b:00007ffc0c2b8ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 <4> [534.219286] RAX: ffffffffffffffda RBX: 000055800bb9bd40 RCX: 00007f9ae8b68839 <4> [534.219293] RDX: 0000000000000000 RSI: 000055800bb9cb30 RDI: 0000000000000006 <4> [534.219301] RBP: 000055800bb9cb30 R08: 0000000000000004 R09: 0000000000000000 <4> [534.219309] R10: 00007ffc0c2b8c50 R11: 0000000000000246 R12: 0000000000000000 <4> [534.219316] R13: 000055800bb96410 R14: 0000000000000020 R15: 000000000000003c <4> [534.219334] Modules linked in: i915(+) amdgpu chash gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ax88179_178a usbnet mii x86_pkg_temp_thermal snd_hda_codec coretemp crct10dif_pclmul crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm e1000e prime_numbers [last unloaded: i915]
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5055/fi-icl-u/igt@gem_ctx_switch@basic-default.html
Dropping the priority as this bug fix doesn't depend on the drm/i915 Intel GFX Driver.
Somehow this hasn't been reported in BAT for 3 days. Was a workaround merged, or some other fix?
Or more simply, it looks like the last BAT run was on Nov 2nd...
*** Bug 108679 has been marked as a duplicate of this bug. ***
We need to get systems back to grid, I suppose.
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4712/fi-icl-u/igt@drv_selftest@live_execlists.html <0>[ 549.241934] drv_self-4908 7d..1 549274550us : process_csb: process_csb:977 GEM_BUG_ON(!i915_request_completed(rq))
Further postulation: commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued) Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Nov 8 08:17:38 2018 +0000 drm/i915/execlists: Force write serialisation into context image vs execution Ensure that the writes into the context image are completed prior to the register mmio to trigger execution. Although previously we were assured by the SDM that all writes are flushed before an uncached memory transaction (our mmio write to submit the context to HW for execution), we have empirical evidence to believe that this is not actually the case. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656 References: https://bugs.freedesktop.org/show_bug.cgi?id=108315 References: https://bugs.freedesktop.org/show_bug.cgi?id=106887 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk Cc: stable@vger.kernel.org
*** Bug 108697 has been marked as a duplicate of this bug. ***
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10758/fi-icl-u/igt@gem_exec_nop@basic-parallel.html <6> [106.068733] [IGT] gem_exec_nop: starting subtest basic-parallel <3> [106.599535] execlists_submit_ports:434 GEM_BUG_ON(count > !n) <4> [106.599652] ------------[ cut here ]------------ <2> [106.599655] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:434! <4> [106.599681] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4> [106.599690] CPU: 2 PID: 1221 Comm: gem_exec_nop Tainted: G U W 4.20.0-rc1-CI-Patchwork_10758+ #1 <4> [106.599702] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018 <4> [106.599768] RIP: 0010:__execlists_submission_tasklet+0xa72/0xc20 [i915] <4> [106.599778] Code: bd b1 ef e0 48 8b 35 75 00 1a 00 49 c7 c0 f3 99 2e a0 b9 b2 01 00 00 48 c7 c2 10 f7 2c a0 48 c7 c7 cd 0f 20 a0 e8 0e 38 f6 e0 <0f> 0b 48 c7 c1 30 7c 30 a0 ba f3 01 00 00 48 c7 c6 50 f7 2c a0 48 <4> [106.599798] RSP: 0018:ffffc9000088f948 EFLAGS: 00010086 <4> [106.599806] RAX: 000000000000000a RBX: ffff8804a5e00480 RCX: 0000000000000000 <4> [106.599815] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8804ae26f4e8 <4> [106.599824] RBP: ffffc9000088f9a8 R08: 00000000002497c4 R09: ffff8804ae3fb000 <4> [106.599833] R10: 0000000000000000 R11: ffff8804ae26f4e8 R12: ffff8803e82b70c0 <4> [106.599842] R13: 0000000000000000 R14: ffff8804a5e00480 R15: ffff8804a5e00008 <4> [106.599851] FS: 00007f0e444d2980(0000) GS:ffff8804afe80000(0000) knlGS:0000000000000000 <4> [106.599861] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [106.599869] CR2: 0000557f2fa10920 CR3: 00000004aa93a002 CR4: 0000000000760ee0 <4> [106.599878] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [106.599887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4> [106.599896] PKRU: 55555554 <4> [106.599901] Call Trace: <4> [106.599947] execlists_submit_request+0x181/0x190 [i915] <4> [106.599991] submit_notify+0x8d/0x124 [i915] <4> [106.600023] __i915_sw_fence_complete+0x81/0x250 [i915] <4> [106.600065] i915_request_add+0x36b/0x7e0 [i915] <4> [106.600103] i915_gem_do_execbuffer+0x94f/0x1580 [i915] <4> [106.600114] ? deactivate_slab.isra.26+0x74b/0x7a0 <4> [106.600126] ? ___slab_alloc.constprop.34+0x21c/0x380 <4> [106.600134] ? ___slab_alloc.constprop.34+0x21c/0x380 <4> [106.600168] ? i915_gem_execbuffer2_ioctl+0xc4/0x3f0 [i915] <4> [106.600185] ? lock_acquire+0xa6/0x1c0 <4> [106.600193] ? __might_fault+0x38/0x90 <4> [106.600228] ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915] <4> [106.600262] i915_gem_execbuffer2_ioctl+0x21b/0x3f0 [i915] <4> [106.600296] ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915] <4> [106.600307] drm_ioctl_kernel+0x81/0xf0 <4> [106.600315] drm_ioctl+0x2de/0x390 <4> [106.600347] ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915] <4> [106.600358] ? trace_hardirqs_on_thunk+0x1a/0x1c <4> [106.600365] ? lockdep_hardirqs_on+0xe0/0x1b0 <4> [106.600373] ? trace_hardirqs_on_thunk+0x1a/0x1c <4> [106.600382] do_vfs_ioctl+0xa0/0x6e0 <4> [106.600390] ? retint_kernel+0x2d/0x2d <4> [106.600399] ksys_ioctl+0x35/0x60 <4> [106.600406] __x64_sys_ioctl+0x11/0x20 <4> [106.600412] do_syscall_64+0x55/0x190 <4> [106.600420] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [106.600428] RIP: 0033:0x7f0e4375e5d7 <4> [106.600434] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 <4> [106.600455] RSP: 002b:00007ffd0caf68c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 <4> [106.600465] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0e4375e5d7 <4> [106.600474] RDX: 00007ffd0caf6a00 RSI: 0000000040406469 RDI: 0000000000000005 <4> [106.600483] RBP: 00007ffd0caf6a00 R08: 00007ffd0cb090b0 R09: 00007ffd0cb09080 <4> [106.600492] R10: 000000000002116c R11: 0000000000000246 R12: 0000000040406469 <4> [106.600501] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000 <4> [106.600513] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal ax88179_178a coretemp usbnet mii crct10dif_pclmul snd_hda_intel snd_hda_codec crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm e1000e prime_numbers
Did these logs reveal anything new?
Francesco, this issue is seen again, this needs to be investigated by appropriate team. https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_144/fi-icl-u/igt@gem_exec_schedule@wide-render.html Dmesg-Warnings <2> [271.860952] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:965! Dmesg <6> [266.023718] Console: switching to colour dummy device 80x25 <6> [266.023776] [IGT] gem_exec_schedule: executing <5> [266.032416] Setting dangerous option reset - tainting kernel <5> [266.040641] Setting dangerous option reset - tainting kernel <5> [266.040945] Setting dangerous option reset - tainting kernel <6> [266.044643] [IGT] gem_exec_schedule: starting subtest wide-render <7> [266.045729] [drm:vgem_gem_dumb_create [vgem]] Created object of size 1 <7> [266.103938] [drm:vgem_gem_dumb_create [vgem]] Created object of size 1 <3> [271.860814] process_csb:965 GEM_BUG_ON(buf[2 * head + 1] != port->context_id) <4> [271.860950] ------------[ cut here ]------------ <2> [271.860952] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:965! <4> [271.860972] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4> [271.860979] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U 4.20.0-rc2-g2ba924da1cb3-drmtip_144+ #1 <4> [271.860989] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018 <4> [271.861040] RIP: 0010:process_csb+0x5c6/0x790 [i915] <4> [271.861047] Code: d9 19 c7 d6 48 8b 35 21 2c 1a 00 49 c7 c0 00 7d 59 c0 b9 c5 03 00 00 48 c7 c2 b0 f8 55 c0 48 c7 c7 a1 f2 48 c0 e8 7a a0 cd d6 <0f> 0b 48 c7 c1 b8 9a 57 c0 ba c7 03 00 00 48 c7 c6 b0 f8 55 c0 48 <4> [271.861062] RSP: 0018:ffff9b34efe03e20 EFLAGS: 00010082 <4> [271.861069] RAX: 000000000000000e RBX: ffff9b34da5042a8 RCX: 0000000000000000
Mika is trying to reproduce this on his C0.
It is there with C1: [20269.450705] [IGT] gem_exec_nop: starting subtest basic-parallel [20273.153995] process_csb:965 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
*** Bug 108801 has been marked as a duplicate of this bug. ***
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5262/shard-iclb6/igt@gem_exec_nop@basic-series.html <3> [1216.645106] execlists_submit_ports:446 GEM_BUG_ON(count > !n) <4> [1216.645233] ------------[ cut here ]------------ <2> [1216.645235] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:446! <4> [1216.645257] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4> [1216.645265] CPU: 5 PID: 3628 Comm: gem_exec_nop Tainted: G U 4.20.0-rc5-CI-CI_DRM_5262+ #1 <4> [1216.645276] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018 <4> [1216.645331] RIP: 0010:__execlists_submission_tasklet+0xa75/0xc20 [i915] <4> [1216.645340] Code: 6a fc f1 e0 48 8b 35 ca 9d 1a 00 49 c7 c0 38 c4 2c a0 b9 be 01 00 00 48 c7 c2 10 19 2b a0 48 c7 c7 10 d2 1d a0 e8 db 82 f8 e0 <0f> 0b 48 c7 c1 48 ac 2e a0 ba ff 01 00 00 48 c7 c6 50 19 2b a0 48 <4> [1216.645358] RSP: 0018:ffffc9000066f950 EFLAGS: 00010082 <4> [1216.645365] RAX: 000000000000000a RBX: ffff8884a0ac04c0 RCX: 0000000000000000 <4> [1216.645373] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8884ae26e4f8 <4> [1216.645381] RBP: ffffc9000066f9b0 R08: 000000000023dd30 R09: ffff8884ad893000 <4> [1216.645389] R10: 0000000000000000 R11: ffff8884ae26e4f8 R12: ffff88846bd0c3c0 <4> [1216.645396] R13: 0000000000000000 R14: ffff8884a0ac04c0 R15: ffff8884a0ac0008 <4> [1216.645405] FS: 00007fc3f2769980(0000) GS:ffff8884aff40000(0000) knlGS:0000000000000000 <4> [1216.645414] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [1216.645421] CR2: 000055a9e32b8050 CR3: 000000044dabe003 CR4: 0000000000760ee0 <4> [1216.645428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [1216.645436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4> [1216.645444] PKRU: 55555554 <4> [1216.645448] Call Trace: <4> [1216.645489] execlists_submit_request+0x181/0x190 [i915] <4> [1216.645528] submit_notify+0x8d/0x124 [i915] <4> [1216.645557] __i915_sw_fence_complete+0x81/0x250 [i915] <4> [1216.645595] i915_request_add+0x36b/0x7e0 [i915] <4> [1216.645629] i915_gem_do_execbuffer+0xe45/0x1820 [i915] <4> [1216.645642] ? ___slab_alloc.constprop.34+0x21c/0x380 <4> [1216.645649] ? ___slab_alloc.constprop.34+0x21c/0x380 <4> [1216.645680] ? i915_gem_execbuffer2_ioctl+0xc4/0x3f0 [i915] <4> [1216.645690] ? lock_acquire+0xa6/0x1c0 <4> [1216.645698] ? __might_fault+0x38/0x90 <4> [1216.645733] ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915] <4> [1216.645764] i915_gem_execbuffer2_ioctl+0x21b/0x3f0 [i915] <4> [1216.645796] ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915] <4> [1216.645806] drm_ioctl_kernel+0x81/0xf0 <4> [1216.645813] drm_ioctl+0x2de/0x390 <4> [1216.645843] ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915] <4> [1216.645852] ? lockdep_hardirqs_on+0xe0/0x1b0 <4> [1216.645859] ? trace_hardirqs_on_thunk+0x1a/0x1c <4> [1216.645868] do_vfs_ioctl+0xa0/0x6e0 <4> [1216.645875] ksys_ioctl+0x35/0x60 <4> [1216.645882] __x64_sys_ioctl+0x11/0x20 <4> [1216.645887] do_syscall_64+0x55/0x190 <4> [1216.645895] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [1216.645902] RIP: 0033:0x7fc3f19f35d7 <4> [1216.645908] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 <4> [1216.645926] RSP: 002b:00007fff23e22438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 <4> [1216.645935] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3f19f35d7 <4> [1216.645943] RDX: 00007fff23e22570 RSI: 0000000040406469 RDI: 0000000000000005 <4> [1216.645951] RBP: 00007fff23e22570 R08: 00007fff23e550b0 R09: 00007fff23e55080 <4> [1216.645959] R10: 000000000017d692 R11: 0000000000000246 R12: 0000000040406469 <4> [1216.645967] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
Optimistically but with a grain of confidence: commit d8f505311717cfc2033c44a8b26ba96c66187303 Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> Date: Wed Dec 5 15:46:12 2018 +0200 drm/i915/icl: Forcibly evict stale csb entries Gen11 fails to deliver wrt global observation point on tail/entry updates and we sometimes see old entry. Use clflush to forcibly evict our possibly stale copy of the cacheline in hopes that we get fresh one from gpu. Obviously there is something amiss in the coherency protocol so this can be consired as a workaround until real cause is found. The working hardware will do the evict without our cue anyways, so the cost in there should be ameliorated by that fact. v2: for next pass, s/flush/evict, add reset (Chris) References: https://bugzilla.freedesktop.org/show_bug.cgi?id=108315 Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20181205134612.24822-1-mika.kuoppala@linux.intel.com
CI buglog seems to report another occurrence: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5293/shard-iclb7/igt@runner@aborted.html <3>[ 2834.229599] process_csb:994 GEM_BUG_ON(!i915_request_completed(rq)) <4>[ 2834.229748] ------------[ cut here ]------------ <2>[ 2834.229750] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:994! <4>[ 2834.229774] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4>[ 2834.229782] CPU: 2 PID: 13552 Comm: i915_selftest Tainted: G U 4.20.0-rc5-CI-CI_DRM_5293+ #1 <4>[ 2834.229793] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018 <4>[ 2834.229859] RIP: 0010:process_csb+0x248/0x7a0 [i915] <4>[ 2834.229868] Code: 17 fd ee e0 48 8b 35 bf ac 1a 00 49 c7 c0 c9 d3 2f a0 b9 e2 03 00 00 48 c7 c2 e0 29 2e a0 48 c7 c7 63 d1 20 a0 e8 88 83 f5 e0 <0f> 0b 49 8b 96 a8 00 00 00 48 8b 92 c0 02 00 00 8b 92 c0 00 00 00 <4>[ 2834.229890] RSP: 0018:ffffc9000037f760 EFLAGS: 00010086 <4>[ 2834.229899] RAX: 000000000000000b RBX: ffff8884954b2158 RCX: 0000000000000000 <4>[ 2834.229910] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8884ae26f4e8 <4>[ 2834.229919] RBP: ffffc9000037f7c8 R08: 00000000006fd9b9 R09: ffff8884ae3fa000 <4>[ 2834.229927] R10: 0000000000000000 R11: ffff8884ae26f4e8 R12: ffff8884a1d1906c <4>[ 2834.229936] R13: 0000000000000005 R14: ffff88839fd31b40 R15: ffff8884a1d19040 <4>[ 2834.229945] FS: 00007f45b47cb980(0000) GS:ffff8884afe80000(0000) knlGS:0000000000000000 <4>[ 2834.229956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 2834.229963] CR2: 00007fe9badb6100 CR3: 00000003e90ae001 CR4: 0000000000760ee0 <4>[ 2834.229972] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4>[ 2834.229981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4>[ 2834.229989] PKRU: 55555554 <4>[ 2834.229994] Call Trace: <4>[ 2834.230045] execlists_reset_prepare+0x54/0x150 [i915] <4>[ 2834.230098] i915_gem_reset_prepare_engine+0x20/0x40 [i915] <4>[ 2834.230144] i915_gem_reset_prepare+0x2c/0x70 [i915] <4>[ 2834.230194] ? i915_request_wait+0x16d/0x840 [i915] <4>[ 2834.230241] i915_reset+0x117/0x280 [i915] <4>[ 2834.230286] __i915_wait_request_check_and_reset.isra.8+0x44/0x50 [i915] <4>[ 2834.230329] i915_request_wait+0x4fd/0x840 [i915] <4>[ 2834.230340] ? lock_acquire+0xa6/0x1c0 <4>[ 2834.230347] ? wake_up_q+0x70/0x70 <4>[ 2834.230354] ? wake_up_q+0x70/0x70 <4>[ 2834.230399] i915_gem_object_wait_fence+0x8a/0x110 [i915] <4>[ 2834.230440] i915_gem_object_wait+0x2d0/0x500 [i915] <4>[ 2834.230482] i915_gem_object_set_to_cpu_domain+0x35/0x130 [i915] <4>[ 2834.230528] igt_vm_isolation+0xb1f/0xfe0 [i915] <4>[ 2834.230591] __i915_subtests+0x5e/0xf0 [i915] <4>[ 2834.230647] __run_selftests+0x10b/0x190 [i915] <4>[ 2834.230696] i915_live_selftests+0x2c/0x60 [i915] <4>[ 2834.230743] i915_pci_probe+0x50/0xa0 [i915] <4>[ 2834.230752] pci_device_probe+0xa1/0x130 <4>[ 2834.230761] really_probe+0xf3/0x3e0 <4>[ 2834.230769] driver_probe_device+0x10a/0x120 <4>[ 2834.230777] __driver_attach+0xdb/0x100 <4>[ 2834.230784] ? driver_probe_device+0x120/0x120 <4>[ 2834.230791] ? driver_probe_device+0x120/0x120 <4>[ 2834.230799] bus_for_each_dev+0x74/0xc0 <4>[ 2834.230806] bus_add_driver+0x15f/0x250 <4>[ 2834.230813] ? 0xffffffffa06d1000 <4>[ 2834.230819] driver_register+0x56/0xe0 <4>[ 2834.230826] ? 0xffffffffa06d1000 <4>[ 2834.230833] do_one_initcall+0x58/0x2e0 <4>[ 2834.230841] ? do_init_module+0x1d/0x1ea <4>[ 2834.230849] ? rcu_read_lock_sched_held+0x6f/0x80 <4>[ 2834.230857] ? kmem_cache_alloc_trace+0x264/0x290 <4>[ 2834.230866] do_init_module+0x56/0x1ea <4>[ 2834.230874] load_module+0x227a/0x29c0 <4>[ 2834.230887] ? __se_sys_finit_module+0xd3/0xf0 <4>[ 2834.230895] __se_sys_finit_module+0xd3/0xf0 <4>[ 2834.230908] do_syscall_64+0x55/0x190 <4>[ 2834.230929] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4>[ 2834.230939] RIP: 0033:0x7f45b4093839 <4>[ 2834.230946] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48 <4>[ 2834.230966] RSP: 002b:00007ffc71a34958 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 <4>[ 2834.230979] RAX: ffffffffffffffda RBX: 0000555ebf178bb0 RCX: 00007f45b4093839 <4>[ 2834.230987] RDX: 0000000000000000 RSI: 0000555ebf179950 RDI: 0000000000000006 <4>[ 2834.230995] RBP: 0000555ebf179950 R08: 0000000000000004 R09: 0000000000000000 <4>[ 2834.231004] R10: 00007ffc71a34ad0 R11: 0000000000000246 R12: 0000000000000000 <4>[ 2834.231012] R13: 0000555ebf16fc00 R14: 0000000000000020 R15: 000000000000003c
The logs in my previous comment are about a different issue. Mika confirmed that the fix was successful, closing.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.