Bug 103165

Summary: [BAT][CI] igt@* - dmesg-warn|dmesg-fail - *ERROR* Timeout waiting for engines to idle | *ERROR* [CRTC:36:pipe A] flip_done timed out
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Marta Löfstedt <marta.lofstedt>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: highest CC: intel-gfx-bugs, jaar, jani.saarinen, ricardo.vega
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: KBL i915 features: display/Other
Bug Depends on: 103163, 103170    
Bug Blocks:    

Description Marta Löfstedt 2017-10-09 12:36:35 UTC
There are a lot of dmesg-warn and/or dmesg-fail on a lot of test on the KBL-shards. This is typically when we havn't hit LSPCON issues see BUG 102295 or linktraining issues see BUG 102359


<7>[  797.729622] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 15
<7>[  799.711611] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? yes, current seqno=203b7, last=203b7
<3>[  799.840467] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
<3>[  809.951409] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
<3>[  820.191422] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
<3>[  830.431302] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out

Some examples:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3191/shard-kbl2/igt@pm_rpm@i2c.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3191/shard-kbl4/igt@gem_wait@await-bsd1.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3191/shard-kbl4/igt@kms_flip@rcs-wf_vblank-vs-modeset.html
Comment 1 shashank.sharma@intel.com 2017-10-13 08:17:23 UTC
Yep, This is not related to LSPCON/Link training, but looks a separate issue.

- Shashank
Comment 2 Marta Löfstedt 2017-10-13 10:08:01 UTC
(In reply to shashank.sharma@intel.com from comment #1)
> Yep, This is not related to LSPCON/Link training, but looks a separate issue.
> 
> - Shashank

I still think we should wait with any further investigation of this until your lspcon patches has landed.
Comment 3 Marta Löfstedt 2017-10-13 12:36:31 UTC
Shashank was right CI_DRM_3228 has the lspcon fixes.

So, lspcon issue see to be gone but the issue in this bug remain.
Comment 7 Chris Wilson 2017-10-17 08:51:54 UTC

*** This bug has been marked as a duplicate of bug 103170 ***
Comment 8 Marta Löfstedt 2017-10-17 10:47:16 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3244/shard-kbl3/dmesg29.log
has:
 <3>[  220.135728] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:46:pipe B] flip_done timed out

However:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3244/shard-kbl3/run29.log
show that the run completed. Hence, we can have the "*ERROR* [CRTC:46:pipe B] flip_done timed out" print without this causing an incomplete. This proves that this bug is NOT a duplicate of BUG 103170.

I do however, agree that a lot of KBL-shard incomplete have either/and/or the "*ERROR* Timeout waiting for engines to idle" and "*ERROR* [CRTC:36:pipe A] flip_done timed out". 

Also, since the dmesg-warns symptom obviously happened before the incomplete I believe that it is more logical to deal with the issue in this bug.
Comment 9 Marta Löfstedt 2017-10-18 07:57:47 UTC
Also, in bug 103163. There is a pattern with:

<3>[  472.866651] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
<3>[  482.784380] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
<3>[  493.024355] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
<3>[  503.264164] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out

before the softdog is triggered.
Comment 10 Marta Löfstedt 2017-10-20 11:00:59 UTC
On CI_DRM_3269 BAT-machine fi-kbl-7567u igt@prime_vgem@basic-fence-flip and igt@prime_vgem@basic-fence-mmap

[  417.909011] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  420.086987] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3269/fi-kbl-7567u/igt@prime_vgem@basic-fence-flip.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3269/fi-kbl-7567u/igt@prime_vgem@basic-fence-mmap.html
Comment 11 Marta Löfstedt 2017-10-23 12:52:35 UTC
Alo on APL-shards:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3275/shard-apl1/igt@kms_cursor_legacy@basic-flip-before-cursor-varying-size.html

[  935.542706] hpet1: lost 7161 rtc interrupts
[  936.437376] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[  936.450968] hpet1: lost 7161 rtc interrupts
Comment 12 Marta Löfstedt 2017-10-23 12:52:56 UTC
*** Bug 103218 has been marked as a duplicate of this bug. ***
Comment 13 Jani Saarinen 2017-10-24 13:44:13 UTC
Raise priority
Comment 15 Marta Löfstedt 2017-10-27 09:42:25 UTC
(In reply to Marta Löfstedt from comment #14)
> Also, filed patchwork on fi-bsw-n3050
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_416/fi-bsw-n3050/
> igt@gem_flink_basic@bad-open.html
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6166/fi-bsw-n3050/
> igt@drv_module_reload@basic-reload.html
> 
> on this bug.

I have unduplicated this to bug 103479
Comment 16 Marta Löfstedt 2017-11-02 06:37:39 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3307/shard-kbl7/igt@kms_color@legacy-gamma-reset-pipe0.html

This looks a bit different:

[  310.895874] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[  310.896773] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  310.897738] i915 0000:00:02.0: Resetting vcs1 after gpu hang
[  310.898473] i915 0000:00:02.0: Resetting vecs0 after gpu hang
[  312.881353] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  315.067619] [drm:intel_engines_park [i915]] *ERROR* rcs0 is not idle before parking
[  315.068296] [drm:intel_engines_park [i915]] *ERROR* bcs0 is not idle before parking
[  315.069155] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking
[  315.070039] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking
[  326.263182] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
Comment 17 Marta Löfstedt 2017-11-02 06:38:42 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3307/shard-kbl2/igt@gem_userptr_blits@map-fixed-invalidate-overlap.html
This also has: *ERROR* vecs0 is not idle before parking

[  276.831015] i915 0000:00:02.0: Resetting chip after gpu hang
[  279.081273] [drm:intel_engines_park [i915]] *ERROR* rcs0 is not idle before parking
[  279.081812] [drm:intel_engines_park [i915]] *ERROR* bcs0 is not idle before parking
[  279.082355] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking
[  279.082846] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking
[  286.883983] i915 0000:00:02.0: Resetting chip after gpu hang
[  289.172584] [drm:intel_engines_park [i915]] *ERROR* rcs0 is not idle before parking
[  289.173338] [drm:intel_engines_park [i915]] *ERROR* bcs0 is not idle before parking
[  289.173842] [drm:intel_engines_park [i915]] *ERROR* vcs1 is not idle before parking
[  289.174606] [drm:intel_engines_park [i915]] *ERROR* vecs0 is not idle before parking
[  299.625225] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
Comment 18 Marta Löfstedt 2017-11-02 06:41:33 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3305/shard-kbl5/igt@gem_persistent_relocs@forked-faulting-reloc-thrashing.html

[  904.165570] i915 0000:00:02.0: Resetting vecs0 after gpu hang
[  906.213198] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  908.399684] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[  908.421588] Setting dangerous option prefault_disable - tainting kernel
[  908.422662] Setting dangerous option prefault_disable - tainting kernel
[  908.426009] Setting dangerous option prefault_disable - tainting kernel
[  908.426018] Setting dangerous option prefault_disable - tainting kernel
[  914.216762] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  914.217422] i915 0000:00:02.0: Resetting vecs0 after gpu hang
[  918.186750] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[  922.208800] [drm:i915_gem_wait_for_idle [i915]] *ERROR* Failed to idle engines, declaring wedged!
[  922.220751] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  922.233956] Setting dangerous option prefault_disable - tainting kernel
[  922.234436] Setting dangerous option prefault_disable - tainting kernel
[  922.234506] Setting dangerous option prefault_disable - tainting kernel
[  932.273591] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out

why are we continuing testing when we have declared wedged?
Comment 19 Marta Löfstedt 2017-11-02 06:42:58 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3305/shard-kbl5/igt@gem_wait@write-wait-vebox.html

(gem_wait:2011) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_wait:2011) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest write-wait-vebox failed.

[  773.090105] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[  773.090392] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  773.090769] i915 0000:00:02.0: Resetting vcs1 after gpu hang
[  773.091122] i915 0000:00:02.0: Resetting vecs0 after gpu hang
[  777.124141] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  779.310206] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
Comment 21 Marta Löfstedt 2017-11-02 11:27:25 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3308/shard-kbl5/igt@kms_flip@vblank-vs-dpms-suspend.html

<7>[  176.769694] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 15
<3>[  176.972381] [drm:intel_engines_park [i915]] *ERROR* rcs0 is not idle before parking
<7>[  176.972425] intel_engines_park rcs0
<7>[  176.972428] intel_engines_park 	current seqno 1c6cf, last 1c6cf, hangcheck 0 [-123060 ms], inflight 0
<7>[  176.972430] intel_engines_park 	Reset count: 1
<7>[  176.972437] intel_engines_park 	Requests:
<7>[  176.972916] intel_engines_park 	RING_START: 0x0000b000 [0x00000000]
<7>[  176.972921] intel_engines_park 	RING_HEAD:  0x00000c10 [0x00000000]
<7>[  176.972925] intel_engines_park 	RING_TAIL:  0x00000c10 [0x00000000]
<7>[  176.972930] intel_engines_park 	RING_CTL:   0x00003000
<7>[  176.972935] intel_engines_park 	RING_MODE:  0x00000200 [idle]
<7>[  176.972942] intel_engines_park 	ACTHD:  0x00000000_00000c10
<7>[  176.972947] intel_engines_park 	BBADDR: 0x00000000_00000004
<7>[  176.972952] intel_engines_park 	Execlist status: 0x00000301 00000000
<7>[  176.972956] intel_engines_park 	Execlist CSB read 0 [-1 cached], write 1 [1 from hws], interrupt posted? no
<7>[  176.972960] intel_engines_park 	Execlist CSB[1]: 0x00000018 [0x00000018 in hwsp], context: 3 [3 in hwsp]
<7>[  176.972962] intel_engines_park 		ELSP[0] count=1, 
<7>[  176.972965] intel_engines_park rq: 1c6cf! [3:10] prio=0 @ 15379ms: kms_flip[2822]/0
<7>[  176.972968] intel_engines_park 		ELSP[1] idle
<7>[  176.972970] intel_engines_park 		HW active? 0x1
<7>[  176.972983] intel_engines_park
Comment 22 Marta Löfstedt 2017-11-02 13:25:55 UTC
I have unduplicated fi-cfl-s and APL-shards, from this issue. The frequency of this issue on those machine is so much lower, that it is not worth the super suppression caused by KBL.
Comment 23 Marta Löfstedt 2017-11-02 13:29:10 UTC
Note APL-shards bug for this issue is now:
https://bugs.freedesktop.org/show_bug.cgi?id=103545
Comment 24 shashank.sharma@intel.com 2017-11-07 16:00:11 UTC
Among the attached logs, I dont see any DP/LSPCON errors or Link training issues. Is there any LSPCON failure sighting here ?

- Shashank
Comment 25 Marta Löfstedt 2017-11-08 06:57:34 UTC
(In reply to shashank.sharma@intel.com from comment #24)
> Among the attached logs, I dont see any DP/LSPCON errors or Link training
> issues. Is there any LSPCON failure sighting here ?
> 
> - Shashank

Shashank this bug is not about DP/LSPCON issues. Check bug 102295, bug 103558, for that. The issue will hopefully be solved with new DMC FW soon.
Comment 26 Marta Löfstedt 2017-11-09 07:50:30 UTC
There may be issue filed on this bug that is not related to the DMC issue it is my intention to close this once the new DMC have arrived.
Same goes for bug 103163 and bug 103170.
Comment 30 Marta Löfstedt 2017-11-15 07:17:11 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3344/shard-kbl5/igt@prime_self_import@basic-with_one_bo_two_files.html
[  348.611179] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <0>
                 TDT                  <f>
                 next_to_use          <f>
                 next_to_clean        <0>
               buffer_info[next_to_clean]:
                 time_stamp           <fffc10e8>
                 next_to_watch        <0>
                 jiffies              <10000be80>
                 next_to_watch.status <0>
               MAC Status             <40000083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[  349.571421] i915 0000:00:02.0: Resetting chip after gpu hang
[  350.594120] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3346/shard-kbl3/igt@gem_set_tiling_vs_gtt.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3346/shard-kbl3/igt@drm_import_export@prime.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3346/shard-kbl3/igt@gem_ppgtt@flink-and-exit-vma-leak.html
[  268.828009] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  268.882233] [drm:i915_gem_wait_for_idle [i915]] *ERROR* Failed to idle engines, declaring wedged!
[  279.136624] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
[  289.381702] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
[  299.626832] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out
Comment 34 Marta Löfstedt 2017-11-20 14:31:11 UTC
*** Bug 103039 has been marked as a duplicate of this bug. ***
Comment 35 Marta Löfstedt 2017-11-20 14:39:02 UTC
*** Bug 103049 has been marked as a duplicate of this bug. ***
Comment 36 Jani Saarinen 2017-11-21 07:43:45 UTC
*** Bug 102586 has been marked as a duplicate of this bug. ***
Comment 37 Elizabeth 2017-11-21 19:43:23 UTC
On KBL sometimes we're getting same assertion with test igt@gem_exec_suspend@basic-s3:

(gem_exec_suspend:11459) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_suspend:11459) igt-aux-CRITICAL: Failed assertion: !"GPU hung"

latest commit with fail: IGT-Version: 1.20-gf8f6db9 (x86_64) (Linux: 4.14.0-drm-intel-qa-ww47-commit-f710441+ x86_64)
Comment 40 Jani Saarinen 2017-11-23 09:56:23 UTC
Please note that only from CI_DRM_3375 onwards having DMC 1.04.
This based on still CI_DRM_3373/
Comment 41 Jani Saarinen 2017-11-23 09:57:50 UTC
Resolving now as CI_DRM_3375 has 1.04:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3375/commits_short.log

Lets see results after this now.
Comment 42 Marta Löfstedt 2017-11-23 13:06:25 UTC
CI_DRM_3375 and CI_DRM_3376 has no of:
 *ERROR* Timeout waiting for engines to idle nor *ERROR* [CRTC:36:pipe A] flip_done timed out.

I will close this

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.