Bug 101074

Summary: [BAT][CTG] igt@kms_pipe_crc_basic@suspend-read-crc-pipe-* tests take 4x longer than expected
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED DUPLICATE QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: intel-gfx-bugs, tomi.p.sarvela
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: GM45 i915 features: display/Other

Description Martin Peres 2017-05-17 13:37:44 UTC
Starting with CI_DRM_2624, the igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a test started taking 109.38s instead of the usual 24.19s on fi-ctg-l9400.

Here is what changed between CI_DRM_2623 and CI_DRM_2624, so it does not look like new commits are the reason:
2a80386 drm-tip: 2017y-05m-17d-12h-46m-57s UTC integration manifest
bb2af9b drm/tilcdc: fix include notation and remove -Iinclude/drm flag
64a9dfc drm/radeon: fix include notation and remove -Iinclude/drm flag
edaf492 drm/qxl: fix include notation and remove -Iinclude/drm flag
ae95621 drm/nouveau: fix include notation and remove -Iinclude/drm flag
78f27b1 drm/msm: fix include notation and remove -Iinclude/drm flag
277b9fc drm/mgag200: fix include notation and remove -Iinclude/drm flag
385157d drm/hisilicon: fix include notation and remove -Iinclude/drm flag
f7b4cf4 drm-tip: 2017y-05m-17d-11h-32m-43s UTC integration manifest

Full logs:
- Good: https://intel-gfx-ci.01.org/CI/CI_DRM_2623/fi-ctg-l9400/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
- Bad: https://intel-gfx-ci.01.org/CI/CI_DRM_2624/fi-ctg-l9400/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

Tomi, anything changed in the hardware side?
Comment 1 Chris Wilson 2017-05-17 13:44:38 UTC
(And why is there 20s of overhead? Good test runtime 4s, piglit execution time 20s.)
Comment 2 Tomi Sarvela 2017-05-17 13:45:04 UTC
Probably related to locking issue in MOUSE_PS2. Possibly relevant splat from CI_DRM_2614:

MOUSE_PS2 disabled from kconfig/debug for now.

[  365.524011] [drm:drm_dp_dual_mode_detect] DP dual mode HDMI ID:  (err -6)
[  365.524011] [drm:drm_helper_hpd_irq_event] [CONNECTOR:57:HDMI-A-2] status updated from disconnected to disconnected
[  365.524011] [drm:intel_dp_detect [i915]] [CONNECTOR:59:DP-2]
[  365.535130] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.545347] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085

[  365.545014] ======================================================
[  365.545014] WARNING: possible circular locking dependency detected
[  365.545014] 4.12.0-rc1-CI-CI_DRM_2614+ #1 Not tainted
[  365.545014] ------------------------------------------------------
[  365.545014] kworker/0:1/22 is trying to acquire lock:
[  365.545014]  (s_active#62){++++.+}, at: [<ffffffff8127d060>] kernfs_remove_by_name_ns+0x40/0x90
[  365.545014] 
               but task is already holding lock:
[  365.545014]  (psmouse_mutex){+.+.+.}, at: [<ffffffff816a7ea7>] psmouse_disconnect+0x67/0x160
[  365.545014] 
               which lock already depends on the new lock.

[  365.545014] 
               the existing dependency chain (in reverse order) is:
[  365.545014] 
               -> #1 (psmouse_mutex){+.+.+.}:
[  365.545014]        lock_acquire+0xb5/0x210
[  365.545014]        __mutex_lock+0x83/0x980
[  365.545014]        mutex_lock_interruptible_nested+0x16/0x20
[  365.545014]        psmouse_attr_set_helper+0x2c/0x140
[  365.545014]        dev_attr_store+0x13/0x20
[  365.545014]        sysfs_kf_write+0x40/0x50
[  365.545014]        kernfs_fop_write+0x130/0x1b0
[  365.545014]        __vfs_write+0x23/0x120
[  365.545014]        vfs_write+0xc5/0x1d0
[  365.545014]        SyS_write+0x44/0xb0
[  365.545014]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  365.545014] 
               -> #0 (s_active#62){++++.+}:
[  365.545014]        __lock_acquire+0x189f/0x1960
[  365.545014]        lock_acquire+0xb5/0x210
[  365.545014]        __kernfs_remove+0x243/0x2b0
[  365.545014]        kernfs_remove_by_name_ns+0x40/0x90
[  365.545014]        remove_files.isra.0+0x31/0x70
[  365.545014]        sysfs_remove_group+0x3f/0x80
[  365.545014]        trackpoint_disconnect+0x1f/0x40
[  365.545014]        psmouse_disconnect+0x95/0x160
[  365.545014]        serio_disconnect_driver+0x2d/0x40
[  365.545014]        serio_driver_remove+0x10/0x20
[  365.545014]        device_release_driver_internal+0x158/0x210
[  365.545014]        device_release_driver+0xd/0x10
[  365.545014]        serio_disconnect_port+0x75/0xb0
[  365.545014]        serio_handle_event+0x1c6/0x260
[  365.545014]        process_one_work+0x1fe/0x670
[  365.545014]        worker_thread+0x49/0x3b0
[  365.545014]        kthread+0x10f/0x150
[  365.545014]        ret_from_fork+0x2e/0x40
[  365.545014] 
               other info that might help us debug this:

[  365.545014]  Possible unsafe locking scenario:

[  365.545014]        CPU0                    CPU1
[  365.545014]        ----                    ----
[  365.545014]   lock(psmouse_mutex);
[  365.545014]                                lock(s_active#62);
[  365.545014]                                lock(psmouse_mutex);
[  365.545014]   lock(s_active#62);
[  365.545014] 
                *** DEADLOCK ***

[  365.545014] 6 locks held by kworker/0:1/22:
[  365.545014]  #0:  ("events_long"){.+.+.+}, at: [<ffffffff8109a138>] process_one_work+0x178/0x670
[  365.545014]  #1:  (serio_event_work){+.+.+.}, at: [<ffffffff8109a138>] process_one_work+0x178/0x670
[  365.545014]  #2:  (serio_mutex){+.+.+.}, at: [<ffffffff816982ef>] serio_handle_event+0x1f/0x260
[  365.545014]  #3:  (&dev->mutex){......}, at: [<ffffffff815b9d34>] device_release_driver_internal+0x34/0x210
[  365.545014]  #4:  (&serio->drv_mutex){+.+.+.}, at: [<ffffffff8169734b>] serio_disconnect_driver+0x1b/0x40
[  365.545014]  #5:  (psmouse_mutex){+.+.+.}, at: [<ffffffff816a7ea7>] psmouse_disconnect+0x67/0x160
[  365.545014] 
               stack backtrace:
[  365.545014] CPU: 0 PID: 22 Comm: kworker/0:1 Not tainted 4.12.0-rc1-CI-CI_DRM_2614+ #1
[  365.545014] Hardware name: LENOVO 7465CTO/7465CTO, BIOS 6DET44WW (2.08 ) 04/22/2009
[  365.545014] Workqueue: events_long serio_handle_event
[  365.545014] Call Trace:
[  365.545014]  dump_stack+0x67/0x97
[  365.545014]  print_circular_bug+0x1e0/0x2d0
[  365.545014]  __lock_acquire+0x189f/0x1960
[  365.545014]  ? kernfs_remove_by_name_ns+0x25/0x90
[  365.545014]  ? __kernfs_remove+0x18b/0x2b0
[  365.545014]  lock_acquire+0xb5/0x210
[  365.545014]  ? lock_acquire+0xb5/0x210
[  365.545014]  ? kernfs_remove_by_name_ns+0x40/0x90
[  365.545014]  __kernfs_remove+0x243/0x2b0
[  365.545014]  ? kernfs_remove_by_name_ns+0x40/0x90
[  365.545014]  ? kernfs_find_ns+0x8f/0x110
[  365.545014]  kernfs_remove_by_name_ns+0x40/0x90
[  365.545014]  remove_files.isra.0+0x31/0x70
[  365.545014]  sysfs_remove_group+0x3f/0x80
[  365.545014]  trackpoint_disconnect+0x1f/0x40
[  365.545014]  psmouse_disconnect+0x95/0x160
[  365.545014]  serio_disconnect_driver+0x2d/0x40
[  365.545014]  serio_driver_remove+0x10/0x20
[  365.545014]  device_release_driver_internal+0x158/0x210
[  365.545014]  device_release_driver+0xd/0x10
[  365.545014]  serio_disconnect_port+0x75/0xb0
[  365.545014]  serio_handle_event+0x1c6/0x260
[  365.545014]  process_one_work+0x1fe/0x670
[  365.545014]  worker_thread+0x49/0x3b0
[  365.545014]  kthread+0x10f/0x150
[  365.545014]  ? process_one_work+0x670/0x670
[  365.545014]  ? kthread_create_on_node+0x40/0x40
[  365.545014]  ret_from_fork+0x2e/0x40
[  365.555007] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.565142] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.575283] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.585422] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.589462] PM: restore of devices complete after 582.090 msecs
[  365.595030] PM: Image restored successfully.
[  365.595030] PM: Basic memory bitmaps freed
[  365.595030] OOM killer enabled.
[  365.595030] Restarting tasks ... 
[  365.595442] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.611007] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x51450085
[  365.611007] done.
[  365.619008] video LNXVIDEO:00: Restoring backlight state
Comment 3 Chris Wilson 2017-05-17 13:46:53 UTC
(In reply to Chris Wilson from comment #1)
> (And why is there 20s of overhead? Good test runtime 4s, piglit execution
> time 20s.)

Ah, that's probably just the difference between monotonic time and walltime (i.e. the time we actually spent in suspend).
Comment 4 Chris Wilson 2017-05-17 13:50:19 UTC
Looks to be a core issue:

Good
[  692.066239] Suspended for 16.027 seconds
[  692.066239] Enabling non-boot CPUs ...
[  692.066239] x86: Booting SMP configuration:
[  692.066493] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  692.065884] Disabled fast string operations
[  692.473002]  cache: parent cpu1 should not be sleeping
[  692.576576] CPU1 is up

Bad
[  694.716026] Suspended for 15.789 seconds
[  694.716026] Enabling non-boot CPUs ...
[  694.716026] x86: Booting SMP configuration:
[  694.716026] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  694.715245] Disabled fast string operations
[  749.656665] perf: interrupt took too long (19602 > 19486), lowering kernel.perf_event_max_sample_rate to 10000
[  779.491744]  cache: parent cpu1 should not be sleeping
[  780.651986] CPU1 is up
Comment 5 Martin Peres 2017-05-18 08:12:14 UTC
So, should I consider this bug fixed or simply worked around by disabling the PS/2 mouse driver?
Comment 6 Jani Saarinen 2017-05-29 08:19:26 UTC
HI,
Was there any conclusion on this?
Comment 7 Martin Peres 2017-05-29 08:20:49 UTC
(In reply to Jani Saarinen from comment #6)
> HI,
> Was there any conclusion on this?

Isn't Petri bisecting this?
Comment 8 Martin Peres 2017-05-30 14:26:11 UTC

*** This bug has been marked as a duplicate of bug 101089 ***

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.