Summary: | [BAT] igt@* - dmesg-warn - *ERROR* Timeout waiting for engines to idle | *ERROR* vecs0 is not idle before parking | ||
---|---|---|---|
Product: | DRI | Reporter: | Marta Löfstedt <marta.lofstedt> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | critical | ||
Priority: | high | CC: | intel-gfx-bugs |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BSW/CHT | i915 features: | GEM/Other |
Description
Marta Löfstedt
2017-10-27 09:41:49 UTC
This is likely a side-effect of changing the wait loops in: commit 5427f207852d5b905e251a5a728c8604d3594d58 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Oct 23 22:32:34 2017 +0100 drm/i915: Bump wait-times for the final CS interrupt before parking In the idle worker we drop the prolonged GT wakeref used to cover such essentials as interrupt delivery. (When a CS interrupt arrives, we also assert that the GT is awake.) However, it turns out that 10ms is not long enough to be assured that the last CS interrupt has been delivered, so bump that to 200ms, and move the entirety of that wait to before we take the struct_mutex to avoid blocking. As this is now a potentially long wait, restore the earlier behaviour of bailing out early when a new request arrives. Afaict, the warning should be genuine, we really should have waited 200ms since the last request for the confirmation CS interrupt. Anyway, the plan for this is that we are going to move the warning into intel_engines_park() so we can do a detailed state_warn for each of the engines to identify what we think is still active. New subtest: CI_DRM_3302 fi-bsw-n3050 igt@gem_mmap_gtt@basic-read-no-prefault <3>[ 338.280747] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3302/fi-bsw-n3050/igt@gem_mmap_gtt@basic-read-no-prefault.html <3>[ 321.201920] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking <7>[ 321.202172] intel_engines_park vcs0 <7>[ 321.202185] intel_engines_park current seqno 122e2, last 122e2, hangcheck 0 [21344 ms], inflight 0 <7>[ 321.202194] intel_engines_park Reset count: 0 <7>[ 321.202246] intel_engines_park Requests: <7>[ 321.202443] intel_engines_park RING_START: 0x00000000 [0x00000000] <7>[ 321.202457] intel_engines_park RING_HEAD: 0x00000000 [0x00000000] <7>[ 321.202470] intel_engines_park RING_TAIL: 0x00000000 [0x00000000] <7>[ 321.202485] intel_engines_park RING_CTL: 0x00000000 <7>[ 321.202502] intel_engines_park RING_MODE: 0x00000200 [idle] <7>[ 321.202521] intel_engines_park ACTHD: 0x00000000_00000000 <7>[ 321.202538] intel_engines_park BBADDR: 0x00000000_00000000 <7>[ 321.202553] intel_engines_park Execlist status: 0x00000301 00000000 <7>[ 321.202566] intel_engines_park Execlist CSB read 1 [1 cached], write 1 [1 from hws], interrupt posted? no <7>[ 321.202576] intel_engines_park ELSP[0] idle <7>[ 321.202586] intel_engines_park ELSP[1] idle <7>[ 321.202595] intel_engines_park HW active? 0x0 <7>[ 321.202654] intel_engines_park commit 820c5bbbf418fba41149fdeb870d623e21be2463 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Nov 1 20:21:49 2017 +0000 drm/i915: Flush the irq and tasklets before asserting engine is idle Before we assert that the engine is idle, make sure we flush any residual tasklet. After that point, if the engine is not idle, more work may be queued despite us trying to park the engine and go to sleep. References: https://bugs.freedesktop.org/show_bug.cgi?id=103479 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171101202149.32493-1-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Still!!! <3>[ 100.640238] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking <7>[ 100.640569] intel_engines_park vecs0 <7>[ 100.640586] intel_engines_park current seqno 19a, last 19a, hangcheck 14a [5856 ms], inflight 0 <7>[ 100.640598] intel_engines_park Reset count: 0 <7>[ 100.640662] intel_engines_park Requests: <7>[ 100.640895] intel_engines_park RING_START: 0x00000000 [0x00000000] <7>[ 100.640913] intel_engines_park RING_HEAD: 0x00000000 [0x00000000] <7>[ 100.640931] intel_engines_park RING_TAIL: 0x00000000 [0x00000000] <7>[ 100.640952] intel_engines_park RING_CTL: 0x00000000 <7>[ 100.640974] intel_engines_park RING_MODE: 0x00000200 [idle] <7>[ 100.641000] intel_engines_park ACTHD: 0x00000000_00000000 <7>[ 100.641025] intel_engines_park BBADDR: 0x00000000_00000000 <7>[ 100.641046] intel_engines_park Execlist status: 0x00000301 00000000 <7>[ 100.641064] intel_engines_park Execlist CSB read 3 [3 cached], write 3 [3 from hws], interrupt posted? no <7>[ 100.641079] intel_engines_park ELSP[0] idle <7>[ 100.641092] intel_engines_park ELSP[1] idle <7>[ 100.641105] intel_engines_park HW active? 0x0 <7>[ 100.641294] intel_engines_park Afaict, that should pose no problems in being recognised as idle; so is it still idling between the check and the print? Possibly. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3313/fi-bsw-n3050/igt@kms_pipe_crc_basic@read-crc-pipe-c.html [ 484.323374] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/fi-bsw-n3050/igt@core_auth@basic-auth.html [ 95.600415] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/fi-bsw-n3050/igt@gem_exec_reloc@basic-write-cpu-noreloc.html [ 317.977566] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking Patchwork: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6987/fi-bsw-n3050/igt@prime_vgem@basic-write.html Patchwork https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6959/fi-bsw-n3050/igt@drv_module_reload@basic-reload.html [ 510.946740] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking Patchwork: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6956/fi-bsw-n3050/igt@drv_module_reload@basic-no-display.html [ 544.252270] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking Patchwork: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6997/fi-bsw-n3050/igt@gem_ctx_param@basic-default.html dmesg-warn: [ 119.345859] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking Now even the kernel agrees that this doesn't make sense: <3>[ 534.357705] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking <7>[ 534.359332] intel_engines_park vecs0 <7>[ 534.360232] intel_engines_park current seqno 10, last 10, hangcheck 0 [234603 ms], inflight 0 <7>[ 534.360240] intel_engines_park Reset count: 0 <7>[ 534.360295] intel_engines_park Requests: <7>[ 534.361646] intel_engines_park RING_START: 0x00000000 [0x00000000] <7>[ 534.361659] intel_engines_park RING_HEAD: 0x00000000 [0x00000000] <7>[ 534.361669] intel_engines_park RING_TAIL: 0x00000000 [0x00000000] <7>[ 534.361681] intel_engines_park RING_CTL: 0x00000000 <7>[ 534.361694] intel_engines_park RING_MODE: 0x00000200 [idle] <7>[ 534.361711] intel_engines_park ACTHD: 0x00000000_00000000 <7>[ 534.361725] intel_engines_park BBADDR: 0x00000000_00000000 <7>[ 534.361737] intel_engines_park Execlist status: 0x00000301 00000000 <7>[ 534.361747] intel_engines_park Execlist CSB read 1 [1 cached], write 1 [1 from hws], interrupt posted? no <7>[ 534.361755] intel_engines_park ELSP[0] idle <7>[ 534.361762] intel_engines_park ELSP[1] idle <7>[ 534.361768] intel_engines_park HW active? 0x0 <7>[ 534.362049] intel_engines_park IRQ? 0x0 (breadcrumbs? no) (execlists? no) <7>[ 534.362069] intel_engines_park Idle? yes <7>[ 534.362075] intel_engines_park We still have a race with a tasklet/irq/something here. This should suppress (not fix) the spurious !idle errors. commit 30b29406d9374989f34bce0eadaa630813049808 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Nov 10 11:25:50 2017 +0000 drm/i915: Restore the wait for idle engine after flushing interrupts So it appears that commit 5427f207852d ("drm/i915: Bump wait-times for the final CS interrupt before parking") was a little over optimistic in its belief that it had successfully waited for all residual activity on the engines before parking. Numerous sightings in CI since then of <7>[ 52.542886] [IGT] core_auth: executing <3>[ 52.561013] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking <7>[ 52.561215] intel_engines_park vcs0 <7>[ 52.561229] intel_engines_park current seqno 98, last 98, hangcheck 0 [-247449 ms], inflight 0 <7>[ 52.561238] intel_engines_park Reset count: 0 <7>[ 52.561266] intel_engines_park Requests: <7>[ 52.561363] intel_engines_park RING_START: 0x00000000 [0x00000000] <7>[ 52.561377] intel_engines_park RING_HEAD: 0x00000000 [0x00000000] <7>[ 52.561390] intel_engines_park RING_TAIL: 0x00000000 [0x00000000] <7>[ 52.561406] intel_engines_park RING_CTL: 0x00000000 <7>[ 52.561422] intel_engines_park RING_MODE: 0x00000200 [idle] <7>[ 52.561442] intel_engines_park ACTHD: 0x00000000_00000000 <7>[ 52.561459] intel_engines_park BBADDR: 0x00000000_00000000 <7>[ 52.561474] intel_engines_park Execlist status: 0x00000301 00000000 <7>[ 52.561489] intel_engines_park Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no <7>[ 52.561500] intel_engines_park ELSP[0] idle <7>[ 52.561510] intel_engines_park ELSP[1] idle <7>[ 52.561519] intel_engines_park HW active? 0x0 <7>[ 52.561608] intel_engines_park Idle? yes <7>[ 52.561617] intel_engines_park on Braswell, which indicates that the engine just needs that little bit longer after flushing the tasklet to settle. So give it a few more milliseconds before declaring an err and applying the emergency brake. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103479 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171110112550.28909-1-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Still don't know why it takes that little bit of time to settle (after already having 200ms), but it should kill the false positives. If we get a hit after another 10ms wait, that might show something interesting. Silence is golden. Still uncertain where the delay is coming from, but "drm/i915: Restore the wait for idle engine after flushing interrupts" did shut up the error in BAT. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.