95063 – [SKL] random system hang with RC6 enabled

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Elio
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Duplicates (2):	98488 99057
Depends on:
Blocks:

Reported:	2016-04-22 09:50 UTC by Timo Aaltonen
Modified:	2018-10-25 13:07 UTC
CC List:	9 users

See Also:
i915 platform:	SKL
i915 features:	power/Other

Attachments
dmesg 4.9.0 (74.36 KB, text/plain) 2016-12-28 23:52 UTC, Chris Murphy	no flags
journal.log (444.98 KB, text/plain) 2016-12-28 23:52 UTC, Chris Murphy	no flags
lspci vvnn (2.01 KB, text/plain) 2016-12-28 23:58 UTC, Chris Murphy	no flags
dmesg (250.87 KB, text/plain) 2017-02-22 20:31 UTC, Matt Turner	no flags
dmesg 4.9.13 (83.92 KB, text/plain) 2017-03-07 00:26 UTC, Chris Murphy	no flags
gpu crash dump 4.9.13 (755.68 KB, text/plain) 2017-03-07 00:27 UTC, Chris Murphy	no flags

Description Timo Aaltonen 2016-04-22 09:50:11 UTC

We have a machine that suffers from system hangs when RC6 is enabled. It has been tested with at least v4.6-rc1; I'll ask for a test with nightly too.

GPU is 8086:191b (Halo GT2)

Comment 1 Gary Wang 2016-05-05 09:06:47 UTC

This is Dell's switchable-GPU SKU (Intel Gen/Nvidia) on the SKL platform, running Ubuntu 14.04.4 with kernel 3.19 + SKL_gfx-driver.

Test results for i915 Gfx driver suspend/resume (echoing 'mem' into the power state):
1. Non-X, kernel with the i915 module loaded: more than 5000 test cycles without error.
2. Simple xinit environment: more than 23000 test cycles.

Mesa is mesa-10.1.3. When I ran glmark2 in the second environment, the system halted with a blank screen within 1 hour. After upgrading to mesa-11.3.0 (5/4/2016), it has been working correctly for more than 5 hours so far.

Could you help verify this issue with the latest mesa code and share the test result with us? Thanks!

Comment 2 Gary Wang 2016-05-09 02:47:38 UTC

Hi Timo,
A correction to the Mesa version described in #1: the original Mesa version used on Dell's switchable-GPU SKU is 10.5.9, not 10.1.3.

I also verified Mesa 10.5.9 on another Dell SKL SKU (parkcity-14, Gen GPU only) with Ubuntu 14.04 and the same environment setup as in #1 (kernel: drm-intel-nightly 4/26); it passed the glmark2 test for more than 48 hours. The corresponding info:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 78
model name      : Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
stepping        : 3
microcode       : 0x49

3.0 Mesa 10.5.9

Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:        14.04
Codename:       trusty

The problem seems to be related to power on that switchable-GPU SKU, not to Mesa.

Comment 3 yann 2016-12-14 08:52:13 UTC

*** Bug 99057 has been marked as a duplicate of this bug. ***

Comment 4 yann 2016-12-14 08:53:39 UTC

*** Bug 98488 has been marked as a duplicate of this bug. ***

Comment 5 Chris Murphy 2016-12-28 23:50:49 UTC

I'm still hitting this on kernel 4.9.0, even with both firmware options enabled via boot parameters:

[25414.982139] f25h kernel: WARNING: CPU: 2 PID: 1532 at drivers/gpu/drm/i915/intel_display.c:14189 intel_atomic_commit_tail+0xfd0/0xff0 [i915]
[25414.982140] f25h kernel: pipe A vblank wait timed out


BOOT_IMAGE=/vmlinuz-4.9.0-1.fc26.x86_64 root=UUID=c45caf39-a048-4c44-90c9-535dc8003c71 ro rootflags=subvol=root elevator=noop i915.enable_guc_loading=-1 i915.enable_guc_submission=-1


# cat /sys/class/drm/card0/error
no error state collected
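
When comparing reports like the one above, it can help to pull the i915 options out of the kernel command line mechanically. The following is a minimal sketch, not part of the original report; the function name `i915_params` is hypothetical, and on a live system the string would typically come from /proc/cmdline.

```python
import shlex


def i915_params(cmdline: str) -> dict:
    """Return the i915.* options found in a kernel command line as {name: value}."""
    params = {}
    for token in shlex.split(cmdline):
        if token.startswith("i915."):
            # "i915.enable_guc_loading=-1" -> ("enable_guc_loading", "-1")
            name, _, value = token[len("i915."):].partition("=")
            params[name] = value
    return params


# Command line copied from the comment above.
CMDLINE = (
    "BOOT_IMAGE=/vmlinuz-4.9.0-1.fc26.x86_64 "
    "root=UUID=c45caf39-a048-4c44-90c9-535dc8003c71 ro rootflags=subvol=root "
    "elevator=noop i915.enable_guc_loading=-1 i915.enable_guc_submission=-1"
)

print(i915_params(CMDLINE))
# {'enable_guc_loading': '-1', 'enable_guc_submission': '-1'}
```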

Comment 6 Chris Murphy 2016-12-28 23:52:12 UTC

Created attachment 128680
dmesg 4.9.0

Comment 7 Chris Murphy 2016-12-28 23:52:43 UTC

Created attachment 128681
journal.log

Comment 8 Chris Murphy 2016-12-28 23:57:28 UTC

mesa-libwayland-egl-13.0.2-2.fc25.x86_64
mesa-libGLU-9.0.0-10.fc24.x86_64
mutter-3.22.2-3.fc25.x86_64

Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping	: 3
microcode	: 0x9e

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 520 [8086:1916] (rev 07) (prog-if 00 [VGA controller])

Comment 9 Chris Murphy 2016-12-28 23:58:24 UTC

Created attachment 128682
lspci vvnn

Comment 10 Chris Murphy 2016-12-29 00:00:27 UTC

It happens maybe 1 in 10 times that gnome-shell wants to turn off the display. I come back and notice the backlight is on, with a mouse arrow visible, but the screen is otherwise black. The mouse arrow can be moved, but the system is otherwise unresponsive. I can't get to a VT on the computer itself, but I can ssh in. Previous instances sometimes left GPU crash information in /sys/class/drm/card0/error, but not this time.
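
Since the machine stays reachable over ssh, the error state can be copied out before a reboot. This is only a sketch under assumptions: the helper name `save_error_state` and the output file naming are hypothetical, reading the sysfs file usually needs root, and the "no error state collected" check is based on the output shown in comment 5.

```python
import time
from pathlib import Path

NO_ERROR = "no error state collected"


def save_error_state(src="/sys/class/drm/card0/error", dest_dir="."):
    """Copy the i915 error state to a timestamped file.

    Returns the destination path, or None if the file is unreadable
    or reports that no error state was collected.
    """
    try:
        data = Path(src).read_text(errors="replace")
    except OSError:
        return None
    stripped = data.strip()
    if not stripped or stripped.lower() == NO_ERROR:
        return None
    dest = Path(dest_dir) / time.strftime("i915-error-%Y%m%d-%H%M%S.txt")
    dest.write_text(data)
    return dest


# Example (run as root over ssh after a hang):
# print(save_error_state(dest_dir="/var/tmp"))
```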

Comment 11 Matt Turner 2017-02-22 20:31:20 UTC

Created attachment 129842
dmesg

I have just reproduced this with linux-4.10.0 and mesa-17.0.0.

The system was idle with the screen off. Attempting to wake it up by pressing a key seems to have triggered the crash.

Comment 12 Chris Murphy 2017-03-07 00:25:18 UTC

This is still happening with 4.9.13 and 4.10.0. It never happens with drm.debug=0x1e enabled, so maybe it's some kind of race; I have no idea. Is it possible to get an update on the status of this bug? What further information is needed that hasn't already been provided?

Comment 13 Chris Murphy 2017-03-07 00:26:57 UTC

Created attachment 130101
dmesg 4.9.13

Comment 14 Chris Murphy 2017-03-07 00:27:37 UTC

Created attachment 130102
gpu crash dump 4.9.13

/sys/class/drm/card0/error

Comment 15 Chris Murphy 2017-03-07 00:29:23 UTC

vendor_id	: GenuineIntel
cpu family	: 6
model		: 78
model name	: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping	: 3
microcode	: 0x9e

Comment 16 Matt Turner 2017-03-08 17:37:06 UTC

I think this is still occurring on 4.11-rc1:

[drm:intel_panel_enable_backlight] pipe A
[drm:intel_panel_actually_set_backlight] set backlight PWM = 317
[drm:intel_psr_enable] PSR disable by flag
[drm:intel_edp_drrs_enable] Panel doesn't support DRRS
[drm:intel_fbc_enable] reserved 33177600 bytes of contiguous stolen space for FBC, threshold: 1
[drm:intel_fbc_enable] Enabling FBC on pipe A
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1099 at /home/mattst88/projects/linux/drivers/gpu/drm/i915/intel_display.c:14239 intel_atomic_commit_tail+0xf6c/0xf80 
pipe A vblank wait timed out
Modules linked in: iwlmvm iwlwifi
CPU: 3 PID: 1099 Comm: Xorg Tainted: G        W       4.11.0-rc1 #1
Hardware name: LENOVO 20ENCTO1WW/20ENCTO1WW, BIOS N1EET65W (1.38 ) 02/09/2017
Call Trace:
 dump_stack+0x4d/0x66
 __warn+0xc6/0xe0
 warn_slowpath_fmt+0x4a/0x50
 ? finish_wait+0x51/0x60
 intel_atomic_commit_tail+0xf6c/0xf80 
 ? wake_atomic_t_function+0x50/0x50
 intel_atomic_commit+0x38a/0x450
 ? wake_atomic_t_function+0x50/0x50
 drm_atomic_commit+0x46/0x50
 drm_atomic_helper_set_config+0x7e/0xd0
 drm_mode_set_config_internal+0x60/0x110
 drm_mode_setcrtc+0x3cd/0x4b0
 drm_ioctl+0x1d7/0x440
 ? drm_mode_getcrtc+0x170/0x170
 ? __vfs_read+0xba/0x110
 do_vfs_ioctl+0x8f/0x5b0
 ? vfs_read+0x116/0x130
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x13/0x94
RIP: 0033:0x7f22756b7167
RSP: 002b:00007ffe91a0ddc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000019f6b90 RCX: 00007f22756b7167
RDX: 00007ffe91a0de00 RSI: 00000000c06864a2 RDI: 000000000000000b
RBP: 00000000019f0940 R08: 0000000000000000 R09: 000000000257b9e0
R10: 00007ffe91a0def0 R11: 0000000000000246 R12: 00000000019ef878
R13: 00007ffe91a0e0ec R14: 00007ffe91a0e030 R15: 00007ffe91a0ed2c
---[ end trace d31868330c63feb5 ]---

Comment 17 Chris Murphy 2017-03-08 18:28:35 UTC

Since this problem crashes the desktop environment and I lose everything I'm working on in my applications, it constitutes a data loss bug. It would be nice to have a better understanding of this nearly year-old bug and whether it can ever be fixed.

Comment 18 Annie 2017-03-08 21:05:26 UTC

Yann-Can we get someone assigned from the kernel team on this bug? Thanks.

Comment 19 yann 2017-03-10 14:05:31 UTC

(In reply to Annie from comment #18)
> Yann-Can we get someone assigned from the kernel team on this bug? Thanks.

Sure.

Elio, can you have a look at this bug and try to reproduce it?
Thanks.

Comment 20 Esokrarkose 2017-09-07 15:03:14 UTC

Happened to me too on an XPS 13 9360 kabylake:

Sep  3 22:02:20 debian kernel: [185417.238668] [drm] GPU HANG: ecode 9:0:0xfffffffe, in gnome-shell [1226], reason: Hang on render ring, action: reset
Sep  3 22:02:20 debian kernel: [185417.238672] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep  3 22:02:20 debian kernel: [185417.238673] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep  3 22:02:20 debian kernel: [185417.238675] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep  3 22:02:20 debian kernel: [185417.238676] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep  3 22:02:20 debian kernel: [185417.238678] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep  3 22:02:20 debian kernel: [185417.238779] drm/i915: Resetting chip after gpu hang
Sep  3 22:02:20 debian kernel: [185417.238850] [drm] RC6 on
Sep  3 22:02:20 debian kernel: [185417.258102] [drm] GuC firmware load skipped

Unfortunately I could not save /sys/class/drm/card0/error as this resulted in a kernel panic.

lspci -vnn | grep VGA -A 12:

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 620 [8086:5916] (rev 02) (prog-if 00 [VGA controller])
	Subsystem: Dell HD Graphics 620 [1028:075b]
	Flags: bus master, fast devsel, latency 0, IRQ 278
	Memory at db000000 (64-bit, non-prefetchable) [size=16M]
	Memory at 90000000 (64-bit, prefetchable) [size=256M]
	I/O ports at f000 [size=64]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915

00:04.0 Signal processing controller [1180]: Intel Corporation Skylake Processor Thermal Subsystem [8086:1903] (rev 02)
	Subsystem: Dell Skylake Processor Thermal Subsystem [1028:075b]
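
For anyone collecting several of these reports, the GPU HANG line in the log above can be split into fields with a regular expression. This is a rough sketch only: the group names `gen`, `ring` and `ecode` are my own labels for the three colon-separated values and are an assumption, not something stated in this thread.

```python
import re

HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a '[drm] GPU HANG' log line as a dict, or None."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["pid"] = int(info["pid"])
    return info


LINE = (
    "Sep  3 22:02:20 debian kernel: [185417.238668] [drm] GPU HANG: "
    "ecode 9:0:0xfffffffe, in gnome-shell [1226], "
    "reason: Hang on render ring, action: reset"
)

print(parse_gpu_hang(LINE))
```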

Comment 21 Elio 2017-11-10 21:28:48 UTC

I used the following hardware configuration and reached 19587 iterations without any problem.

Firmware (DMC, GuC, HuC) from 01.org was included as well.

Kernel version: 4.13.9-041309-generic


======================================
             Hardware
======================================
platform                   : Skylake Canyon
motherboard id             : NUC6i7KYB
form factor                : Desktop
cpu family                 : Core i7
cpu family id              : 6
cpu information            : Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
gpu card                   : Intel Corporation Sky Lake Integrated Graphics (rev 09) (prog-if 00 [VGA controller])
memory ram                 : 31.31 GB
max memory ram             : 32 GB
display resolution         : 3840x2160
cpu thread                 : 8
cpu core                   : 4
cpu model                  : 94
cpu stepping               : 3
socket                     : Other
signature                  : Type 0, Family 6, Model 94, Stepping 3
hard drive                 : 223GiB (240GB)
current cd clock frequency : 337500 kHz
maximum cd clock frequency : 675000 kHz
displays connected         : DP-3

Comment 22 Elio 2017-11-10 21:30:50 UTC

(In reply to Esokrarkose from comment #20)
> Happened to me too on an XPS 13 9360 kabylake:

It seems that you are missing the firmware (DMC, GuC, HuC). Please check the available firmware versions at 01.org/download

Comment 23 Esokrarkose 2017-11-10 23:13:17 UTC

So does this mean it's expected that the gpu starts hanging when I am missing proprietary firmware?

Comment 24 Chris Murphy 2017-11-10 23:23:45 UTC

I'm also confused about this, because developers sometimes comment in bug reports that the firmware isn't being loaded and suggest loading it to see if the problem still happens. But upstream does not enable firmware loading by default, and enabling it taints the kernel, which prevents any other kernel bugs from being reportable by most distro automated reporting systems.

So I don't really grok the recommendations compared to the defaults. This is from kernel 4.13.10.

parm:           enable_guc_loading:Enable GuC firmware loading (-1=auto, 0=never [default], 1=if available, 2=required) (int)
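
To see which values are actually in effect, rather than the documented defaults, the parameters of a loaded module can be read from /sys/module/<module>/parameters. A minimal sketch follows; the helper name `read_module_params` is hypothetical, and parameters that are not readable by the current user are simply skipped.

```python
from pathlib import Path


def read_module_params(module="i915", root="/sys/module"):
    """Return {parameter: value} for a loaded kernel module, read from sysfs."""
    params = {}
    param_dir = Path(root) / module / "parameters"
    if not param_dir.is_dir():
        # Module not loaded, or it exports no parameters.
        return params
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters are not readable without root.
            continue
    return params


if __name__ == "__main__":
    for name, value in read_module_params().items():
        print(f"{name} = {value}")
```

On a kernel like the 4.13.10 quoted above, an `enable_guc_loading` value of 0 in this output would correspond to the "never [default]" setting in the parameter description.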

Comment 25 Jani Nikula 2017-11-13 13:17:59 UTC

(In reply to bugzilla from comment #24)
> So I don't really grok the recommendations compared to the defaults. This is
> from kernel 4.13.10.

No matter what anyone else says, it's a bug if it fails with the default module parameter settings.

Comment 26 Jani Saarinen 2018-04-20 11:10:12 UTC

Closing; please re-open if this still occurs.

Comment 27 Tomas Janousek 2018-10-24 11:44:32 UTC

I'm getting a similar thing occasionally with a docked ThinkPad T25 (Kaby Lake). If it's not docked, this never occurs. I do have the GuC/HuC firmware loaded. The error looks like this:

[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
[drm:pipe_config_err [i915]] *ERROR* mismatch in pixel_rate (expected 154000, found 307999)
[drm:pipe_config_err [i915]] *ERROR* mismatch in shared_dpll (expected 0000000063430065, found 000000001a62afe4)
[drm:pipe_config_err [i915]] *ERROR* mismatch in dpll_hw_state.ctrl1 (expected 0x00000003, found 0x00000001)
[drm:pipe_config_err [i915]] *ERROR* mismatch in base.adjusted_mode.crtc_clock (expected 154000, found 307999)
[drm:pipe_config_err [i915]] *ERROR* mismatch in port_clock (expected 270000, found 540000)
------------[ cut here ]------------
pipe state doesn't match!
WARNING: CPU: 0 PID: 4947 at /build/linux-s3QDK1/linux-4.18.10/drivers/gpu/drm/i915/intel_display.c:11767 intel_atomic_commit_tail+0xcd6/0xd40 [i915]
Modules linked in: cmac rfcomm ctr ccm xt_CHECKSUM ipt_MASQUERADE bridge stp llc xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat
 binfmt_misc nls_ascii nls_cp437 vfat fat arc4 snd_soc_skl intel_rapl snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi kvm_intel i915 snd_soc_core kvm iwlmvm irqbypass 
 nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_h323 nf_conntrack_irc nf_conntrack_ftp nf_conntrack crc32c_generic coretemp ecryptfs loop efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto btrfs xor zstd_
CPU: 0 PID: 4947 Comm: Xorg Tainted: G     U     O      4.18.0-2-amd64 #1 Debian 4.18.10-2
Hardware name: LENOVO 20K70000GE/20K70000GE, BIOS N1QET78W (1.53 ) 09/13/2018
RIP: 0010:intel_atomic_commit_tail+0xcd6/0xd40 [i915]
Code: b6 44 24 18 e9 14 f8 ff ff e8 e6 c1 05 c3 0f 0b e9 35 f8 ff ff e8 da c1 05 c3 0f 0b 0f b6 14 24 e9 42 fd ff ff e8 ca c1 05 c3 <0f> 0b e9 a1 f8 ff ff e8 be c1 05 c3 0f 0b 49 8b 46 50 e9 ac fd ff 
RSP: 0018:ffffa1ab0854bbf0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000006
RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff8fece2416730
RBP: ffff8fecb8240328 R08: 00000000000003ea R09: 000000000000000a
R10: ffffffffc11f91b0 R11: ffffffff857cffad R12: ffff8febaa445800
R13: ffff8fecb4002800 R14: ffff8febaa443000 R15: ffff8fecb8240000
FS:  0000000000000000(0000) GS:ffff8fece2400000(0063) knlGS:00000000f7680d40
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000348dac8ff000 CR3: 000000087897e002 CR4: 00000000003606f0
Call Trace:
 intel_atomic_commit+0x29a/0x2d0 [i915]
 drm_mode_atomic_ioctl+0x822/0x9a0 [drm]
 ? drm_atomic_set_property+0x510/0x510 [drm]
 drm_ioctl_kernel+0xa1/0xf0 [drm]
 ? __switch_to_asm+0x40/0x70
 drm_ioctl+0x2eb/0x390 [drm]
 ? drm_atomic_set_property+0x510/0x510 [drm]
 ? __schedule+0x2bf/0x880
 __ia32_compat_sys_ioctl+0xce/0x240
 do_fast_syscall_32+0x98/0x1d6
 entry_SYSENTER_compat+0x7f/0x91
---[ end trace 7ca6a67aeeffddb3 ]---

Shall I open a new bug or is this related?
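
As a quick check on the pipe_config_err lines above, the numeric mismatches can be compared directly. The sketch below (the function name `clock_ratios` is hypothetical) extracts the fields whose expected and found values are plain decimal numbers and computes found/expected. For this log, the three clock-related fields all come out at about a factor of two; the code only does the arithmetic and does not explain the cause.

```python
import re

MISMATCH_RE = re.compile(r"mismatch in (\S+) \(expected (\S+), found (\S+)\)")
# Plain decimal values only; skips 0x... registers and zero-padded pointer hashes.
DECIMAL_RE = re.compile(r"[1-9][0-9]*")

LOG = """\
[drm:pipe_config_err [i915]] *ERROR* mismatch in pixel_rate (expected 154000, found 307999)
[drm:pipe_config_err [i915]] *ERROR* mismatch in shared_dpll (expected 0000000063430065, found 000000001a62afe4)
[drm:pipe_config_err [i915]] *ERROR* mismatch in dpll_hw_state.ctrl1 (expected 0x00000003, found 0x00000001)
[drm:pipe_config_err [i915]] *ERROR* mismatch in base.adjusted_mode.crtc_clock (expected 154000, found 307999)
[drm:pipe_config_err [i915]] *ERROR* mismatch in port_clock (expected 270000, found 540000)
"""


def clock_ratios(log):
    """Return {field: found / expected} for mismatches with decimal values."""
    ratios = {}
    for field, expected, found in MISMATCH_RE.findall(log):
        if DECIMAL_RE.fullmatch(expected) and DECIMAL_RE.fullmatch(found):
            ratios[field] = int(found) / int(expected)
    return ratios


for field, ratio in clock_ratios(LOG).items():
    print(f"{field}: found/expected = {ratio:.3f}")
```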

Comment 28 Lakshmi 2018-10-25 13:00:06 UTC

(In reply to Tomas Janousek from comment #27)
> I'm getting a similar thing occasionally with a docked ThinkPad T25 (Kaby
> Lake). If it's not docked, this never occurs. I do have GuC/HuC firmwares
> loaded. The error looks like this:

A few bugs related to this error (FIFO underrun) are open, and a few have been closed because the issue was resolved with the latest drm-tip.
Can you please elaborate on the issue: what steps caused this error, and what is its impact?
Also, have you verified this issue with the latest drm-tip?

Comment 29 Lakshmi 2018-10-25 13:03:01 UTC

 
> Shall I open a new bug or is this related?
This bug originally didn't have any errors related to FIFO underrun. So your issue is not related to this bug?

Comment 30 Tomas Janousek 2018-10-25 13:07:23 UTC

Hi,

(In reply to Lakshmi from comment #28)
> A few bugs related to this error (FIFO underrun) are open, and a few have
> been closed because the issue was resolved with the latest drm-tip.
> Can you please elaborate on the issue: what steps caused this error, and
> what is its impact?
> Also, have you verified this issue with the latest drm-tip?

I've only experienced it a few times; it happens about once a month, and always when the laptop is docked, driving 2 external screens and the eDP display. When I leave it like this for a while and DPMS kicks in, there's a small chance that when I later try to log back in, one of the displays won't come back online, and then I usually have a few seconds until the machine freezes completely. Yesterday I managed a "sync" before that happened, so this dmesg snippet is all I have now.

I turned DPMS off as a workaround, but I'll try to get to testing drm-tip sooner or later.
