From CI_DRM_3099 and sporadically incomplete no stored data on BAT KBL machines. Note CI_DRM_3099 is first freeze on 4.14.0-rc1. last line in: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3104/fi-kbl-7500u/dmesg0.log "[ 238.367830] hpet1: lost 1 rtc interrupts" https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3099/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3104/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3101/fi-kbl-r/igt@gem_exec_suspend@basic-s3.html
Probably related to #102849; causal relationship unclear.
Sometimes we see [ 38.681288] ====================================================== [ 38.681288] WARNING: possible circular locking dependency detected [ 38.681289] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted [ 38.681290] ------------------------------------------------------ [ 38.681290] rtcwake/1414 is trying to acquire lock: [ 38.681291] ((complete)&st->done){+.+.}, at: [<ffffffff8190987d>] wait_for_completion+0x1d/0x20 [ 38.681295] but task is already holding lock: [ 38.681295] (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20 [ 38.681298] which lock already depends on the new lock. [ 38.681298] the existing dependency chain (in reverse order) is: [ 38.681299] -> #1 (sparse_irq_lock){+.+.}: [ 38.681301] __mutex_lock+0x86/0x9b0 [ 38.681302] mutex_lock_nested+0x1b/0x20 [ 38.681303] irq_lock_sparse+0x17/0x20 [ 38.681304] irq_affinity_online_cpu+0x18/0xd0 [ 38.681305] cpuhp_invoke_callback+0xa3/0x840 [ 38.681306] -> #0 ((complete)&st->done){+.+.}: [ 38.681308] check_prev_add+0x430/0x840 [ 38.681309] __lock_acquire+0x1420/0x15e0 [ 38.681310] lock_acquire+0xb0/0x200 [ 38.681311] wait_for_common+0x58/0x210 [ 38.681311] wait_for_completion+0x1d/0x20 [ 38.681312] takedown_cpu+0x89/0xf0 [ 38.681313] cpuhp_invoke_callback+0xa3/0x840 [ 38.681314] cpuhp_down_callbacks+0x42/0x80 [ 38.681314] _cpu_down+0xb9/0xf0 [ 38.681315] freeze_secondary_cpus+0xa3/0x390 [ 38.681316] hibernation_snapshot+0x24c/0x5f0 [ 38.681317] hibernate+0x14f/0x2b1 [ 38.681318] state_store+0xe5/0xf0 [ 38.681319] kobj_attr_store+0xf/0x20 [ 38.681321] sysfs_kf_write+0x45/0x60 [ 38.681322] kernfs_fop_write+0x124/0x1c0 [ 38.681323] __vfs_write+0x28/0x130 [ 38.681324] vfs_write+0xcb/0x1c0 [ 38.681324] SyS_write+0x49/0xb0 [ 38.681326] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 38.681326] other info that might help us debug this: [ 38.681326] Possible unsafe locking scenario: [ 38.681326] CPU0 CPU1 [ 38.681327] ---- ---- [ 38.681327] lock(sparse_irq_lock); [ 38.681328] lock((complete)&st->done); [ 38.681328] lock(sparse_irq_lock); [ 38.681329] lock((complete)&st->done); [ 38.681330] *** DEADLOCK *** [ 38.681330] 9 locks held by rtcwake/1414: [ 38.681331] #0: (sb_writers#5){.+.+}, at: [<ffffffff81220161>] vfs_write+0x171/0x1c0 [ 38.681333] #1: (&of->mutex){+.+.}, at: [<ffffffff812a3302>] kernfs_fop_write+0xf2/0x1c0 [ 38.681335] #2: (kn->count#206){.+.+}, at: [<ffffffff812a330b>] kernfs_fop_write+0xfb/0x1c0 [ 38.681337] #3: (pm_mutex){+.+.}, at: [<ffffffff810e79b9>] hibernate+0x59/0x2b1 [ 38.681339] #4: (device_hotplug_lock){+.+.}, at: [<ffffffff81617ff7>] lock_device_hotplug+0x17/0x20 [ 38.681342] #5: (acpi_scan_lock){+.+.}, at: [<ffffffff8153b3c7>] acpi_scan_lock_acquire+0x17/0x20 [ 38.681345] #6: (cpu_add_remove_lock){+.+.}, at: [<ffffffff8108106e>] freeze_secondary_cpus+0x2e/0x390 [ 38.681347] #7: (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff810d660b>] percpu_down_write+0x2b/0x110 [ 38.681349] #8: (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20 [ 38.681351] stack backtrace: [ 38.681353] CPU: 2 PID: 1414 Comm: rtcwake Not tainted 4.14.0-rc1-CI-CI_DRM_3099+ #1 [ 38.681353] Hardware name: /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017 [ 38.681354] Call Trace: [ 38.681355] dump_stack+0x68/0x9f [ 38.681357] print_circular_bug+0x235/0x3c0 [ 38.681358] ? lockdep_init_map_crosslock+0x20/0x20 [ 38.681359] check_prev_add+0x430/0x840 [ 38.681361] __lock_acquire+0x1420/0x15e0 [ 38.681362] ? __lock_acquire+0x1420/0x15e0 [ 38.681363] ? lockdep_init_map_crosslock+0x20/0x20 [ 38.681364] lock_acquire+0xb0/0x200 [ 38.681365] ? wait_for_completion+0x1d/0x20 [ 38.681366] wait_for_common+0x58/0x210 [ 38.681367] ? wait_for_completion+0x1d/0x20 [ 38.681368] ? cpuhp_invoke_callback+0x840/0x840 [ 38.681370] ? stop_machine_cpuslocked+0xc1/0xd0 [ 38.681370] ? cpuhp_invoke_callback+0x840/0x840 [ 38.681371] wait_for_completion+0x1d/0x20 [ 38.681372] takedown_cpu+0x89/0xf0 [ 38.681373] ? cpuhp_complete_idle_dead+0x20/0x20 [ 38.681374] cpuhp_invoke_callback+0xa3/0x840 [ 38.681375] cpuhp_down_callbacks+0x42/0x80 [ 38.681376] _cpu_down+0xb9/0xf0 [ 38.681377] freeze_secondary_cpus+0xa3/0x390 [ 38.681378] hibernation_snapshot+0x24c/0x5f0 [ 38.681379] hibernate+0x14f/0x2b1 [ 38.681380] state_store+0xe5/0xf0 [ 38.681381] kobj_attr_store+0xf/0x20 [ 38.681383] sysfs_kf_write+0x45/0x60 [ 38.681384] kernfs_fop_write+0x124/0x1c0 [ 38.681385] __vfs_write+0x28/0x130 [ 38.681386] ? rcu_read_lock_sched_held+0x7a/0x90 [ 38.681387] ? rcu_sync_lockdep_assert+0x2f/0x60 [ 38.681388] ? __sb_start_write+0x108/0x200 [ 38.681389] vfs_write+0xcb/0x1c0 [ 38.681390] SyS_write+0x49/0xb0 [ 38.681391] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 38.681392] RIP: 0033:0x7f5ed20048f0 [ 38.681393] RSP: 002b:00007ffc6aa4ba98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 38.681394] RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007f5ed20048f0 [ 38.681394] RDX: 0000000000000005 RSI: 00005572b5c20060 RDI: 0000000000000007 [ 38.681395] RBP: ffffc9000058bf88 R08: 00005572b5c1ddc0 R09: 00007f5ed24d6700 [ 38.681395] R10: 00007f5ed22cdb58 R11: 0000000000000246 R12: 00005572b5c1dce0 [ 38.681396] R13: 0000000000000001 R14: 0000000000000005 R15: 0000000000000005 [ 38.681398] ? __this_cpu_preempt_check+0x13/0x20 on a prior suspend.
(In reply to Chris Wilson from comment #2) > Sometimes we see > > [ 38.681288] ====================================================== > [ 38.681288] WARNING: possible circular locking dependency detected > [ 38.681289] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted The lock-dep is present on all runs so far with 4.14 kernel, see BUG 102505.
Incomplete reproduced on: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html Note just before there was: igt@gem_exec_store@basic-all [ 230.632776] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request. [ 230.699470] Setting dangerous option reset - tainting kernel igt@gem_exec_store@basic-default [ 232.276598] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request. [ 232.437161] Setting dangerous option reset - tainting kernel This pattern is present in most of the incompletes starting from CI_DRM_3099, but not on: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3113/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html
History from: http://benchsrv.fi.intel.com/cibuglog/?action_failures_history=287 Raising priority as causing hard hangs and causing unreliable CI data.
Faced this issue with gem_mmap_gtt and gem_tiled_blits with custom patchset. https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5800/fi-kbl-7500u/igt@gem_mmap_gtt@basic-write-read.html https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5800/fi-kbl-7500u/igt@gem_tiled_blits@basic.html
Here is a new incomplete hang hang without any: "render: timed out waiting for forcewake ack request" in dmesg: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3125/fi-kbl-7500u/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
*** Bug 103039 has been marked as a duplicate of this bug. ***
There was also dmesg fail on same test: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/igt@gem_exec_suspend@basic-s4-devices.html [ 433.598425] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle [ 443.774330] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out [ 454.014178] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out [ 464.254342] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out
And from S4 too: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/igt@gem_exec_suspend@basic-s4-devices.html [ 309.567128] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle [ 319.870125] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out [ 330.110396] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out [ 340.350265] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out <3>[ 727.934290] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out <2>[ 738.088137] softdog: Initiating panic <0>[ 738.088190] Kernel panic - not syncing: Software Watchdog Timer expired <4>[ 738.088290] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G U 4.14.0-rc2-CI-CI_DRM_3158+ #1 <4>[ 738.088399] Hardware name: Dell Inc. XPS 13 9360/093TW6, BIOS 1.3.2 01/18/2017 <4>[ 738.088475] Call Trace: <4>[ 738.088525] <IRQ> <4>[ 738.088561] dump_stack+0x68/0x9f <4>[ 738.088627] panic+0xd4/0x21d <4>[ 738.088675] ? printk+0x43/0x4b <4>[ 738.088741] softdog_fire+0x6b/0x70 <4>[ 738.088792] __hrtimer_run_queues+0xed/0x4d0 <4>[ 738.088855] ? softdog_ping+0x50/0x50 <4>[ 738.088915] hrtimer_interrupt+0xc1/0x220 <4>[ 738.088982] smp_apic_timer_interrupt+0x7d/0x2e0 <4>[ 738.089042] apic_timer_interrupt+0x9a/0xa0 <4>[ 738.089098] </IRQ> <4>[ 738.089151] RIP: 0010:cpuidle_enter_state+0x136/0x370 <4>[ 738.089205] RSP: 0018:ffffc900000c3e80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10 <4>[ 738.089306] RAX: ffff880274b62a40 RBX: 00000000000029f7 RCX: 0000000000000001 <4>[ 738.089400] RDX: 0000000000000000 RSI: ffffffff81d0e06c RDI: ffffffff81cc1ea6 <4>[ 738.089474] RBP: ffffc900000c3eb8 R08: 00000000ffffffff R09: 0000000000000008 <4>[ 738.089550] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 <4>[ 738.089643] R13: 0000000000000001 R14: ffff88027e5a58a0 R15: 000000abd97ad493 <4>[ 738.089743] cpuidle_enter+0x17/0x20 <4>[ 738.089793] call_cpuidle+0x23/0x40 <4>[ 738.089846] do_idle+0x192/0x1e0 <4>[ 738.089908] cpu_startup_entry+0x1d/0x20 <4>[ 738.089957] start_secondary+0x11c/0x140 <4>[ 738.090028] secondary_startup_64+0xa5/0xa5 <0>[ 738.090183] Kernel Offset: disabled
(In reply to Jani Saarinen from comment #10) > And from S4 too: > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/ > igt@gem_exec_suspend@basic-s4-devices.html > Jani I don't think this is duplicate. This bug is about HARD HANGS we saw on the fi-kbl-7500u after 4.14.0-rc1 there is also a possible dependency that BUG 102849 happens before this. The BUG 103039 is not a HARD HANG it has pstore. Since this issue with HARD HANG on the fi-kbl-7560u and the potentially related BUG 102849, has not been reproduced since 4.14.0-rc2 my plan was to close these after discussing with Mika Kuoppala.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.