Bug 103479

Summary:	[BAT] igt@* - dmesg-warn - ERROR Timeout waiting for engines to idle \| ERROR vecs0 is not idle before parking
Product:	DRI	Reporter:	Marta Löfstedt <marta.lofstedt>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	critical
Priority:	high	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	BSW/CHT	i915 features:	GEM/Other

Description Marta Löfstedt 2017-10-27 09:41:49 UTC

NOTE: this is an un-duplication from bug 103165 

2 patchwork runs on fi-bsw-n3050

https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_416/fi-bsw-n3050/igt@gem_flink_basic@bad-open.html

[  339.196712] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6166/fi-bsw-n3050/igt@drv_module_reload@basic-reload.html

[  504.799316] Setting dangerous option reset - tainting kernel
[  504.889967] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[  506.771514] Setting dangerous option reset - tainting kernel

Comment 1 Marta Löfstedt 2017-10-27 09:52:54 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6199/fi-bsw-n3050/igt@drv_getparams_basic@basic-eu-total.html

Comment 2 Chris Wilson 2017-10-27 09:58:56 UTC

This is likely a side-effect of changing the wait loops in:

commit 5427f207852d5b905e251a5a728c8604d3594d58
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Oct 23 22:32:34 2017 +0100

    drm/i915: Bump wait-times for the final CS interrupt before parking
    
    In the idle worker we drop the prolonged GT wakeref used to cover such
    essentials as interrupt delivery. (When a CS interrupt arrives, we also
    assert that the GT is awake.) However, it turns out that 10ms is not
    long enough to be assured that the last CS interrupt has been delivered,
    so bump that to 200ms, and move the entirety of that wait to before we
    take the struct_mutex to avoid blocking. As this is now a potentially
    long wait, restore the earlier behaviour of bailing out early when a new
    request arrives.

Afaict, the warning should be genuine, we really should have waited 200ms since the last request for the confirmation CS interrupt.

Anyway, the plan for this is that we are going to move the warning into intel_engines_park() so we can do a detailed state_warn for each of the engines to identify what we think is still active.

Comment 3 Marta Löfstedt 2017-11-01 07:13:40 UTC

New subtest:
CI_DRM_3302 fi-bsw-n3050 igt@gem_mmap_gtt@basic-read-no-prefault

<3>[  338.280747] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3302/fi-bsw-n3050/igt@gem_mmap_gtt@basic-read-no-prefault.html

Comment 4 Chris Wilson 2017-11-01 20:14:10 UTC

<3>[  321.201920] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking
<7>[  321.202172] intel_engines_park vcs0
<7>[  321.202185] intel_engines_park 	current seqno 122e2, last 122e2, hangcheck 0 [21344 ms], inflight 0
<7>[  321.202194] intel_engines_park 	Reset count: 0
<7>[  321.202246] intel_engines_park 	Requests:
<7>[  321.202443] intel_engines_park 	RING_START: 0x00000000 [0x00000000]
<7>[  321.202457] intel_engines_park 	RING_HEAD:  0x00000000 [0x00000000]
<7>[  321.202470] intel_engines_park 	RING_TAIL:  0x00000000 [0x00000000]
<7>[  321.202485] intel_engines_park 	RING_CTL:   0x00000000
<7>[  321.202502] intel_engines_park 	RING_MODE:  0x00000200 [idle]
<7>[  321.202521] intel_engines_park 	ACTHD:  0x00000000_00000000
<7>[  321.202538] intel_engines_park 	BBADDR: 0x00000000_00000000
<7>[  321.202553] intel_engines_park 	Execlist status: 0x00000301 00000000
<7>[  321.202566] intel_engines_park 	Execlist CSB read 1 [1 cached], write 1 [1 from hws], interrupt posted? no
<7>[  321.202576] intel_engines_park 		ELSP[0] idle
<7>[  321.202586] intel_engines_park 		ELSP[1] idle
<7>[  321.202595] intel_engines_park 		HW active? 0x0
<7>[  321.202654] intel_engines_park

Comment 5 Chris Wilson 2017-11-02 11:32:55 UTC

commit 820c5bbbf418fba41149fdeb870d623e21be2463
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Nov 1 20:21:49 2017 +0000

    drm/i915: Flush the irq and tasklets before asserting engine is idle
    
    Before we assert that the engine is idle, make sure we flush any
    residual tasklet. After that point, if the engine is not idle, more work
    may be queued despite us trying to park the engine and go to sleep.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=103479
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171101202149.32493-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Comment 6 Chris Wilson 2017-11-03 14:11:50 UTC

Still!!!

<3>[  100.640238] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking
<7>[  100.640569] intel_engines_park vecs0
<7>[  100.640586] intel_engines_park 	current seqno 19a, last 19a, hangcheck 14a [5856 ms], inflight 0
<7>[  100.640598] intel_engines_park 	Reset count: 0
<7>[  100.640662] intel_engines_park 	Requests:
<7>[  100.640895] intel_engines_park 	RING_START: 0x00000000 [0x00000000]
<7>[  100.640913] intel_engines_park 	RING_HEAD:  0x00000000 [0x00000000]
<7>[  100.640931] intel_engines_park 	RING_TAIL:  0x00000000 [0x00000000]
<7>[  100.640952] intel_engines_park 	RING_CTL:   0x00000000
<7>[  100.640974] intel_engines_park 	RING_MODE:  0x00000200 [idle]
<7>[  100.641000] intel_engines_park 	ACTHD:  0x00000000_00000000
<7>[  100.641025] intel_engines_park 	BBADDR: 0x00000000_00000000
<7>[  100.641046] intel_engines_park 	Execlist status: 0x00000301 00000000
<7>[  100.641064] intel_engines_park 	Execlist CSB read 3 [3 cached], write 3 [3 from hws], interrupt posted? no
<7>[  100.641079] intel_engines_park 		ELSP[0] idle
<7>[  100.641092] intel_engines_park 		ELSP[1] idle
<7>[  100.641105] intel_engines_park 		HW active? 0x0
<7>[  100.641294] intel_engines_park 

Afaict, that should pose no problems in being recognised as idle; so is it still idling between the check and the print? Possibly.

Comment 7 Marta Löfstedt 2017-11-06 06:49:39 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3313/fi-bsw-n3050/igt@kms_pipe_crc_basic@read-crc-pipe-c.html

[  484.323374] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking

Comment 8 Marta Löfstedt 2017-11-07 12:03:18 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/fi-bsw-n3050/igt@core_auth@basic-auth.html
[   95.600415] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/fi-bsw-n3050/igt@gem_exec_reloc@basic-write-cpu-noreloc.html
[  317.977566] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking

Comment 9 Marta Löfstedt 2017-11-07 13:51:09 UTC

Patchwork:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6987/fi-bsw-n3050/igt@prime_vgem@basic-write.html

Comment 10 Marta Löfstedt 2017-11-07 14:30:14 UTC

Patchwork

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6959/fi-bsw-n3050/igt@drv_module_reload@basic-reload.html

[  510.946740] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking

Comment 11 Marta Löfstedt 2017-11-07 14:32:16 UTC

Patchwork:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6956/fi-bsw-n3050/igt@drv_module_reload@basic-no-display.html


[  544.252270] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking

Comment 12 Marta Löfstedt 2017-11-08 08:49:10 UTC

Patchwork:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6997/fi-bsw-n3050/igt@gem_ctx_param@basic-default.html

dmesg-warn:
[  119.345859] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking

Comment 13 Chris Wilson 2017-11-09 12:42:13 UTC

Now even the kernel agrees that this doesn't make sense:

<3>[  534.357705] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking
<7>[  534.359332] intel_engines_park vecs0
<7>[  534.360232] intel_engines_park 	current seqno 10, last 10, hangcheck 0 [234603 ms], inflight 0
<7>[  534.360240] intel_engines_park 	Reset count: 0
<7>[  534.360295] intel_engines_park 	Requests:
<7>[  534.361646] intel_engines_park 	RING_START: 0x00000000 [0x00000000]
<7>[  534.361659] intel_engines_park 	RING_HEAD:  0x00000000 [0x00000000]
<7>[  534.361669] intel_engines_park 	RING_TAIL:  0x00000000 [0x00000000]
<7>[  534.361681] intel_engines_park 	RING_CTL:   0x00000000
<7>[  534.361694] intel_engines_park 	RING_MODE:  0x00000200 [idle]
<7>[  534.361711] intel_engines_park 	ACTHD:  0x00000000_00000000
<7>[  534.361725] intel_engines_park 	BBADDR: 0x00000000_00000000
<7>[  534.361737] intel_engines_park 	Execlist status: 0x00000301 00000000
<7>[  534.361747] intel_engines_park 	Execlist CSB read 1 [1 cached], write 1 [1 from hws], interrupt posted? no
<7>[  534.361755] intel_engines_park 		ELSP[0] idle
<7>[  534.361762] intel_engines_park 		ELSP[1] idle
<7>[  534.361768] intel_engines_park 		HW active? 0x0
<7>[  534.362049] intel_engines_park IRQ? 0x0 (breadcrumbs? no) (execlists? no)
<7>[  534.362069] intel_engines_park Idle? yes
<7>[  534.362075] intel_engines_park 

We still have a race with a tasklet/irq/something here.

Comment 14 Chris Wilson 2017-11-10 13:00:07 UTC

This should suppress (not fix) the spurious !idle errors.

commit 30b29406d9374989f34bce0eadaa630813049808
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Nov 10 11:25:50 2017 +0000

    drm/i915: Restore the wait for idle engine after flushing interrupts
    
    So it appears that commit 5427f207852d ("drm/i915: Bump wait-times for
    the final CS interrupt before parking") was a little over optimistic in
    its belief that it had successfully waited for all residual activity on
    the engines before parking. Numerous sightings in CI since then of
    
    <7>[   52.542886] [IGT] core_auth: executing
    <3>[   52.561013] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking
    <7>[   52.561215] intel_engines_park vcs0
    <7>[   52.561229] intel_engines_park    current seqno 98, last 98, hangcheck 0 [-247449 ms], inflight 0
    <7>[   52.561238] intel_engines_park    Reset count: 0
    <7>[   52.561266] intel_engines_park    Requests:
    <7>[   52.561363] intel_engines_park    RING_START: 0x00000000 [0x00000000]
    <7>[   52.561377] intel_engines_park    RING_HEAD:  0x00000000 [0x00000000]
    <7>[   52.561390] intel_engines_park    RING_TAIL:  0x00000000 [0x00000000]
    <7>[   52.561406] intel_engines_park    RING_CTL:   0x00000000
    <7>[   52.561422] intel_engines_park    RING_MODE:  0x00000200 [idle]
    <7>[   52.561442] intel_engines_park    ACTHD:  0x00000000_00000000
    <7>[   52.561459] intel_engines_park    BBADDR: 0x00000000_00000000
    <7>[   52.561474] intel_engines_park    Execlist status: 0x00000301 00000000
    <7>[   52.561489] intel_engines_park    Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no
    <7>[   52.561500] intel_engines_park            ELSP[0] idle
    <7>[   52.561510] intel_engines_park            ELSP[1] idle
    <7>[   52.561519] intel_engines_park            HW active? 0x0
    <7>[   52.561608] intel_engines_park Idle? yes
    <7>[   52.561617] intel_engines_park
    
    on Braswell, which indicates that the engine just needs that little bit
    longer after flushing the tasklet to settle. So give it a few more
    milliseconds before declaring an err and applying the emergency brake.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103479
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171110112550.28909-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Still don't know why it takes that little bit of time to settle (after already having 200ms), but it should kill the false positives. If we get a hit after another 10ms wait, that might show something interesting.

Comment 15 Chris Wilson 2017-12-02 00:47:23 UTC

Silence is golden. Still uncertain where the delay is coming from, but "drm/i915: Restore the wait for idle engine after flushing interrupts" did shut up the error in BAT.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.