99057 – non-recoverable hang/freeze following WARNING: CPU: 3 PID: 783 at drivers/gpu/drm/i915/intel_display.c:14189 intel_atomic_commit_tail+0xfd0/0xff0 [i915]

Status:	CLOSED DUPLICATE of bug 95063

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	Other Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:	regression

Depends on:
Blocks:

Reported:	2016-12-11 19:34 UTC by Chris Murphy
Modified:	2018-10-24 11:40 UTC
CC List:	2 users

See Also:
i915 platform:	SKL
i915 features:	GPU hang

Attachments
photo of crashed system call trace (1.54 MB, image/jpeg) 2016-12-11 19:51 UTC, Chris Murphy
journal full (523.21 KB, text/x-log) 2016-12-11 19:52 UTC, Chris Murphy
journal kernel (101.27 KB, text/x-log) 2016-12-11 19:53 UTC, Chris Murphy
lspci -vvnn (28.29 KB, text/plain) 2016-12-11 19:57 UTC, Chris Murphy
drm card error (755.71 KB, text/plain) 2016-12-12 18:51 UTC, Chris Murphy
dmesg debug (919.17 KB, text/plain) 2016-12-14 00:19 UTC, Chris Murphy

Description Chris Murphy 2016-12-11 19:34:35 UTC

kernel-4.9.0-0.rc8.git0.1.fc26.x86_64
libwayland-server-1.12.0-1.fc25.x86_64

Reproducible: non-deterministic, but it is a regression: it never happens with 4.8.x kernels. I'm uncertain whether Wayland is the trigger, since I use Wayland almost exclusively.

Summary: After I walk away from the laptop for a while, I return to find a traceback on the screen and the system unresponsive. I can't switch to a VT and I can't log in remotely via ssh. The machine must be hard reset.

Comment 1 Chris Murphy 2016-12-11 19:51:39 UTC

Created attachment 128414
photo of crashed system call trace

Comment 2 Chris Murphy 2016-12-11 19:52:44 UTC

Created attachment 128415
journal full

sudo journalctl -b -o short-monotonic

Comment 3 Chris Murphy 2016-12-11 19:53:00 UTC

Created attachment 128416
journal kernel

sudo journalctl -b -o short-monotonic -k

Comment 4 Chris Murphy 2016-12-11 19:57:55 UTC

Created attachment 128417
lspci -vvnn

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 520 [8086:1916] (rev 07) (prog-if 00 [VGA controller])
	Subsystem: Hewlett-Packard Company Device [103c:81a0]

Comment 5 Chris Murphy 2016-12-11 20:03:56 UTC

First instance in the journal there's a problem...

[31603.149518] f25h kernel: [drm] GPU HANG: ecode 9:0:0xfffffffe, in gnome-shell [7105], reason: Hang on render ring, action: reset

That comes before the cell phone photo; and the journal goes to [31723.152073] at which point there's a total crash and call trace. So it doesn't look like much information was lost in the log itself; but there was no way to get the GPU crash dump saved to /sys/class/drm/card0/error.
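
As a rough triage aid, the fields of a "GPU HANG" line like the one above can be pulled apart with sed. This is a minimal sketch and not part of the original report: the function name parse_gpu_hang is hypothetical, and the pattern assumes the exact message layout seen in this journal.

```sh
# Hypothetical helper (not from the report): split a "GPU HANG" kernel
# message into its fields. Assumes the message layout shown in this journal.
parse_gpu_hang() {
    sed -n -E 's/.*GPU HANG: ecode ([^,]+), in ([^ ]+) \[([0-9]+)\], reason: ([^,]+), action: (.+)$/ecode=\1 proc=\2 pid=\3 reason="\4" action=\5/p'
}
```

It reads log text on stdin, for example the output of `sudo journalctl -b -o short-monotonic -k`, and prints one summary line per hang message; lines that do not match are dropped.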

Comment 6 yann 2016-12-12 09:47:51 UTC

(In reply to bugzilla from comment #5)
> First instance in the journal there's a problem...
> 
> [31603.149518] f25h kernel: [drm] GPU HANG: ecode 9:0:0xfffffffe, in
> gnome-shell [7105], reason: Hang on render ring, action: reset
> 
> That comes before the cell phone photo; and the journal goes to
> [31723.152073] at which point there's a total crash and call trace. So it
> doesn't look like much information was lost in the log itself; but there was
> no way to get the GPU crash dump saved to /sys/class/drm/card0/error.

bugzilla@colorremedies.com, it would be really useful to get this error crash dump. It would also help if you could enable more verbose drm logging by setting "drm.debug=0x1e log_buf_len=1M" on your boot command line and then attach the kernel log once the issue happens again.

I would also recommend using the latest firmware (your kernel log shows GuC loading as skipped); you can download it directly from https://01.org/linuxgraphics/intel-linux-graphics-firmwares

You could also try "i915.enable_rc6=0" on your boot command line to see whether the issue still occurs.
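
The drm.debug value is a bitmask of message categories. The sketch below decodes such a mask, assuming the category bits used by kernels of this era (CORE=0x01, DRIVER=0x02, KMS=0x04, PRIME=0x08, ATOMIC=0x10, VBL=0x20). The function name decode_drm_debug is hypothetical, and newer kernels define further bits that this sketch ignores.

```sh
# Hypothetical helper: list the drm debug categories enabled by a mask.
# Assumption: bit order CORE, DRIVER, KMS, PRIME, ATOMIC, VBL (lowest bit first).
decode_drm_debug() {
    local mask bit out name
    mask=$(( $1 ))      # accepts decimal or 0x-prefixed hex
    bit=1
    out=""
    for name in CORE DRIVER KMS PRIME ATOMIC VBL; do
        if [ $(( mask & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit << 1 ))
    done
    echo "$out"
}
```

Under that assumption, `decode_drm_debug 0x1e` prints `DRIVER KMS PRIME ATOMIC`. On many kernels the same mask can also be written at runtime to /sys/module/drm/parameters/debug, although only the boot-time setting captures messages from early initialization.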

Comment 7 Chris Murphy 2016-12-12 18:51:09 UTC

[ 1272.199182] [drm] GPU HANG: ecode 9:0:0xfffffffe, in gnome-shell [1556], reason: Hang on render ring, action: reset
[ 1272.199195] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1272.199201] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1272.199206] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1272.199210] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1272.199216] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1272.199377] drm/i915: Resetting chip after gpu hang
[ 1272.201224] [drm] RC6 on
[ 1272.213478] [drm] GuC firmware load skipped
[ 1321.222138] drm/i915: Resetting chip after gpu hang
[ 1321.222547] [drm] RC6 on
[ 1321.239452] [drm] GuC firmware load skipped
[ 1333.254100] drm/i915: Resetting chip after gpu hang
[ 1333.254507] [drm] RC6 on
[ 1333.270956] [drm] GuC firmware load skipped
[ 1343.238096] drm/i915: Resetting chip after gpu hang
[ 1343.238500] [drm] RC6 on
[ 1343.257458] [drm] GuC firmware load skipped
[ 1343.274295] do_trap: 222 callbacks suppressed
[ 1343.274298] traps: gnome-software[1845] trap int3 ip:7fb92e8d7a21 sp:7fffd0941ca0 error:0
[ 1343.274303]  in libglib-2.0.so.0.5000.2[7fb92e888000+110000]
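
The spacing of the repeated resets in a log like the one above can be summarized with awk. This is a sketch only: the function name reset_gaps is hypothetical, and it assumes monotonic "[seconds.micros]" timestamps as printed by dmesg or `journalctl -o short-monotonic`, so wall-clock journal lines are skipped.

```sh
# Hypothetical helper: print each "Resetting chip" event and the time
# since the previous one. Assumes "[ 1234.567890]"-style timestamps.
reset_gaps() {
    awk '
    /Resetting chip after gpu hang/ && match($0, /\[ *[0-9]+\.[0-9]+\]/) {
        # Strip the brackets; "+ 0" converts the text to a number.
        ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
        if (n++) printf "reset %d at %.3f s, %.3f s after previous\n", n, ts, ts - prev
        else     printf "reset %d at %.3f s\n", n, ts
        prev = ts
    }'
}
```

Fed the log above, it reports four resets with gaps of roughly 49 s, 12 s, and 10 s, which suggests the GPU kept hanging again shortly after each reset.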

Comment 8 Chris Murphy 2016-12-12 18:51:56 UTC

Created attachment 128437
drm card error

# cat /sys/class/drm/card0/error

This time there was no hang or freeze. This might be a duplicate of bug 98488.
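
Because the error state lives in sysfs and does not survive a reboot, it has to be copied out while the machine is still responsive. A minimal sketch of doing that follows. The function name save_gpu_error and the output file naming are hypothetical, and the "no error state collected" placeholder text is an assumption about what the driver prints when nothing has been captured.

```sh
# Hypothetical helper: copy the i915 error state to a regular file.
# Assumption: an empty state reads as "no error state collected".
save_gpu_error() {
    local src dest
    src="${1:-/sys/class/drm/card0/error}"
    dest="${2:-gpu-error-$(date +%Y%m%d-%H%M%S).txt}"
    if grep -qi 'no error state collected' "$src"; then
        echo "no GPU error state to save" >&2
        return 1
    fi
    # Print the destination path on success so callers can attach the file.
    cat "$src" > "$dest" && echo "$dest"
}
```

Reading the sysfs file usually requires root, so in practice this would be run from a root shell as soon as a "GPU crash dump saved" message appears.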

Comment 9 Chris Murphy 2016-12-12 19:05:10 UTC

/lib/firmware/i915/skl_dmc_ver1_26.bin and /lib/firmware/i915/skl_guc_ver6_1.bin are present already on the system. Their sha256sum matches that of the two binaries listed for skylake CPUs at https://01.org/linuxgraphics/intel-linux-graphics-firmwares

So I don't understand why there's a skipped message. Does it need to be in the initramfs?

'sudo lsinitrd /boot/initramfs-4.9.0-0.rc8.git0.1.fc26.x86_64.img skl' returns nothing, so maybe that's the problem.

Comment 10 Chris Murphy 2016-12-12 19:26:34 UTC

Nope, they are in the initramfs.
[chris@f25h skl_guc_ver6_1]$ sudo lsinitrd /boot/initramfs-4.9.0-0.rc8.git0.1.fc26.x86_64.img | grep skl
-rw-r--r--   1 root     root         8928 Sep 23 05:51 usr/lib/firmware/i915/skl_dmc_ver1_26.bin
-rw-r--r--   1 root     root       129024 Sep 23 05:51 usr/lib/firmware/i915/skl_guc_ver6_1.bin
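
A likely reason the earlier `lsinitrd <image> skl` call printed nothing is that lsinitrd treats extra arguments as exact file names whose contents it should print, not as search patterns; filtering the listing, as done above, avoids that. The sketch below narrows such a listing to i915 firmware blobs. The function name list_i915_firmware is hypothetical, and it assumes the `ls -l`-style output shown above.

```sh
# Hypothetical helper: from an lsinitrd listing on stdin, print the paths
# of i915 firmware blobs. Assumes the path is the last field of each line.
list_i915_firmware() {
    awk '$NF ~ /lib\/firmware\/i915\/.*\.bin$/ { print $NF }'
}
```

Usage would look like `sudo lsinitrd /boot/initramfs-4.9.0-0.rc8.git0.1.fc26.x86_64.img | list_i915_firmware`.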

Comment 11 Chris Murphy 2016-12-12 20:10:15 UTC

[chris@f25h i915]$ modinfo i915 | grep guc
firmware:       i915/kbl_guc_ver9_14.bin
firmware:       i915/bxt_guc_ver8_7.bin
firmware:       i915/skl_guc_ver6_1.bin
parm:           enable_guc_loading:Enable GuC firmware loading (-1=auto, 0=never [default], 1=if available, 2=required) (int)
parm:           enable_guc_submission:Enable GuC submission (-1=auto, 0=never [default], 1=if available, 2=required) (int)
parm:           guc_log_level:GuC firmware logging level (-1:disabled (default), 0-3:enabled) (int)

Looks like it's set to not load this firmware by default.
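
modinfo shows only the compiled-in defaults. The values the running driver actually uses can be read from sysfs, as in the sketch below. The function name show_i915_params is hypothetical; the parameter names are the ones listed by modinfo above, and the directory argument exists only so the function can be pointed at a different location.

```sh
# Hypothetical helper: print the live values of the GuC-related i915
# module parameters named in the modinfo output above.
show_i915_params() {
    local dir p
    dir="${1:-/sys/module/i915/parameters}"
    for p in enable_guc_loading enable_guc_submission guc_log_level; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            # Missing or unreadable (these files may be root-only).
            printf '%s=<not present>\n' "$p"
        fi
    done
}
```

Running it as root before and after changing the boot command line is one way to confirm that an override such as i915.enable_guc_loading=-1 took effect.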

Comment 12 Chris Murphy 2016-12-13 22:42:57 UTC

Freeze/hang happened again just now. Black screen with mouse arrow that doesn't move, can't get to a VT either. This is all that's in the journal after rebooting.

Dec 13 15:22:11 f25h kernel: [drm] GPU HANG: ecode 9:0:0xfffffffe, in gnome-shell [1584], reason: Hang on render ring, action: reset
Dec 13 15:22:11 f25h kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Dec 13 15:22:11 f25h kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Dec 13 15:22:11 f25h kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Dec 13 15:22:11 f25h kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Dec 13 15:22:11 f25h kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Dec 13 15:22:11 f25h kernel: drm/i915: Resetting chip after gpu hang
Dec 13 15:22:11 f25h kernel: [drm] RC6 on
Dec 13 15:22:11 f25h kernel: [drm] GuC firmware load skipped
Dec 13 15:22:23 f25h kernel: drm/i915: Resetting chip after gpu hang
Dec 13 15:22:23 f25h kernel: [drm] RC6 on
Dec 13 15:22:23 f25h kernel: [drm] GuC firmware load skipped
Dec 13 15:23:00 f25h kernel: traps: gnome-terminal-[1997] trap int3 ip:7fc638631a21 sp:7ffe8ca59e60 error:0
Dec 13 15:23:00 f25h kernel:  in libglib-2.0.so.0.5000.2[7fc6385e2000+110000]
Dec 13 15:23:00 f25h kernel: traps: nautilus[5974] trap int3 ip:7f5beb288a21 sp:7ffdf124aec0 error:0
Dec 13 15:23:00 f25h kernel:  in libglib-2.0.so.0.5000.2[7f5beb239000+110000]
Dec 13 15:23:00 f25h kernel: traps: gnome-software[1862] trap int3 ip:7fe292685a21 sp:7fff3f201fc0 error:0
Dec 13 15:23:00 f25h kernel:  in libglib-2.0.so.0.5000.2[7fe292636000+110000]
Dec 13 15:23:00 f25h kernel: traps: abrt-applet[1868] trap int3 ip:7f5b977e2a21 sp:7fff9c4bcfb0 error:0
Dec 13 15:23:00 f25h kernel:  in libglib-2.0.so.0.5000.2[7f5b97793000+110000]
Dec 13 15:29:48 f25h kernel: intel_powerclamp: Start idle injection to reduce power

Comment 13 Chris Murphy 2016-12-14 00:19:36 UTC

Created attachment 128456
dmesg debug

$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.9.0-1.fc26.x86_64 root=UUID=c45caf39-a048-4c44-90c9-535dc8003c71 ro rootflags=subvol=root elevator=noop no_console_suspend ignore_loglevel i915.enable_rc6=0 drm.debug=0xe log_buf_len=1M i915.enable_guc_loading=-1 i915.enable_guc_submission=-1 i915.guc_log_level=0

No crash yet, just dmesg following about an hour with the above command line. Both firmwares appear to be loaded now.

If enable_rc6=0 is possibly inhibiting the problem, I'd rather run without it so the problem happens and hopefully the problem gets logged.
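
To check which graphics options are actually in effect, the i915 and drm entries can be filtered out of the kernel command line. This is a sketch: the function name gfx_cmdline_opts is hypothetical, and the optional file argument exists only for testing against a saved copy.

```sh
# Hypothetical helper: list the i915.* and drm.* options on the kernel
# command line, one per line. Reads /proc/cmdline unless a file is given.
gfx_cmdline_opts() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -E '^(i915|drm)\.'
}
```

For the command line above this lists drm.debug=0xe, which is not the 0x1e suggested in comment 6: assuming the usual drm debug bit layout, 0xe lacks bit 0x10 (atomic modeset messages), which may matter for a warning raised in intel_atomic_commit_tail.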

Comment 14 yann 2016-12-14 08:52:13 UTC

(In reply to bugzilla from comment #13)
> Created attachment 128456 [details]
> dmesg debug
> 
> $ cat /proc/cmdline 
> BOOT_IMAGE=/vmlinuz-4.9.0-1.fc26.x86_64
> root=UUID=c45caf39-a048-4c44-90c9-535dc8003c71 ro rootflags=subvol=root
> elevator=noop no_console_suspend ignore_loglevel i915.enable_rc6=0
> drm.debug=0xe log_buf_len=1M i915.enable_guc_loading=-1
> i915.enable_guc_submission=-1 i915.guc_log_level=0
> 
> No crash yet, just dmesg following about an hour with the above command
> line. Both firmwares appear to be loaded now.
> 
> If enable_rc6=0 is possibly inhibiting the problem, I'd rather run without
> it so the problem happens and hopefully the problem gets logged.

Thanks, bugzilla@colorremedies.com. It looks to me like this may be a duplicate of bug 95063.

*** This bug has been marked as a duplicate of bug 95063 ***