Bug 111920

Summary: NON-GuC constant i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Product: DRI
Reporter: Kenneth C <kenny>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED WORKSFORME
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: arek.burdach, chris, intel-gfx-bugs, leho, mika.kuoppala
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: CFL
i915 features: GPU hang
Attachments (Description, Flags):
/sys/class/drm/card0/error, none
/sys/class/drm/card0/error, none
/sys/class/drm/card0/error, none

Description Kenneth C 2019-10-08 00:26:16 UTC
Created attachment 145678 [details]
/sys/class/drm/card0/error

In bug 111085 (https://bugs.freedesktop.org/show_bug.cgi?id=111805) lakshminarayana.vudum@intel.com asked me to try running without the GuC enabled. 

I did that, and it's still hanging up. This is the DRM-tip right before commit c1132367 as that commit prevents my box from going into S0/s2idle suspend (see bug https://bugs.freedesktop.org/show_bug.cgi?id=111909).

Here's the worst part- if I can wrench control to a VT, I can usually "sudo systemctl hibernate" to force a power-cycle that unwedges the i915- but THIS time, right after the resume:

----
Oct  7 17:03:36 hp-x360n systemd-sleep[16719]: System resumed.
Oct  7 17:03:36 hp-x360n systemd[1]: Stopping TLP suspend/resume...
Oct  7 17:03:36 hp-x360n systemd[1]: Stopped TLP suspend/resume.
Oct  7 17:04:40 hp-x360n kernel: [20868.899672] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 17:05:16 hp-x360n kernel: [20904.931581] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 17:07:04 hp-x360n kernel: [21012.899361] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----

<facepalm>

The latest i915 changes on Sept 26th are really killing my workflow, as I can never tell when my laptop will just decide to hang up (and I can be doing such mundane tasks as viewing a webpage or building some software in a konsole- I don't game and this time I wasn't even watching video).

Is there ANYTHING I can do to help you guys diagnose, mitigate, or warn me when it's likely to occur? I've posted some 7 .../card0/error files and apparently there's not enough info in these to help figure out what's going on. Are there any debug flags (that won't ruin daily-driver performance) that I can try so when this happens again there's more info?

(Is there any way to just hack out a merge from a GIT tree?)
Comment 1 Kenneth C 2019-10-08 00:30:18 UTC
Created attachment 145679 [details]
/sys/class/drm/card0/error

This is another non-GuC hang, from yesterday. (It is not from drm-tip, however)
Comment 2 Kenneth C 2019-10-08 00:35:54 UTC
This is the dmesg from today's hang:

I did notice this, which I hadn't seen before:

Asynchronous wait on fence i915:kwin_x11[3017]:d88a4 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])

----
Oct  7 16:54:54 hp-x360n kernel: [20328.929256] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Oct  7 16:54:54 hp-x360n kernel: [20328.929260] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Oct  7 16:54:54 hp-x360n kernel: [20328.929261] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Oct  7 16:54:54 hp-x360n kernel: [20328.929262] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Oct  7 16:54:54 hp-x360n kernel: [20328.929263] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Oct  7 16:54:54 hp-x360n kernel: [20328.929265] GPU crash dump saved to /sys/class/drm/card0/error
Oct  7 16:54:54 hp-x360n kernel: [20328.930273] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:54:54 hp-x360n kernel: [20328.931019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  7 16:54:54 hp-x360n kernel: [20328.934266] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct  7 16:54:54 hp-x360n kernel: [20328.936037] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  7 16:54:54 hp-x360n kernel: [20328.936783] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  7 16:55:02 hp-x360n kernel: [20336.929187] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:10 hp-x360n kernel: [20344.929132] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:12 hp-x360n kernel: [20346.913128] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:14 hp-x360n kernel: [20348.897114] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:16 hp-x360n kernel: [20350.881102] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:18 hp-x360n kernel: [20352.929087] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:20 hp-x360n kernel: [20354.913078] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:22 hp-x360n kernel: [20356.897068] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:24 hp-x360n kernel: [20358.881055] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:26 hp-x360n kernel: [20360.929066] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:28 hp-x360n kernel: [20362.913023] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:30 hp-x360n kernel: [20364.897054] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:32 hp-x360n kernel: [20366.881001] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:34 hp-x360n kernel: [20368.928989] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:35 hp-x360n kernel: [20370.527934] mce: CPU2: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.527935] mce: CPU6: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.528006] mce: CPU1: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.528007] mce: CPU0: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.528007] mce: CPU4: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.528008] mce: CPU5: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.528009] mce: CPU3: Package temperature/speed normal
Oct  7 16:55:35 hp-x360n kernel: [20370.528010] mce: CPU7: Package temperature/speed normal
Oct  7 16:55:36 hp-x360n kernel: [20370.913036] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:38 hp-x360n kernel: [20372.897003] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:40 hp-x360n kernel: [20374.880977] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:42 hp-x360n kernel: [20376.928995] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:44 hp-x360n kernel: [20378.912956] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:46 hp-x360n kernel: [20380.896965] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:48 hp-x360n kernel: [20382.880929] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:50 hp-x360n kernel: [20384.928929] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:52 hp-x360n kernel: [20386.912919] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:54 hp-x360n kernel: [20388.896907] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:56 hp-x360n kernel: [20390.880898] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:55:58 hp-x360n kernel: [20392.929904] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:56:00 hp-x360n kernel: [20394.912873] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:56:02 hp-x360n kernel: [20396.896862] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:56:04 hp-x360n kernel: [20398.880874] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:56:06 hp-x360n kernel: [20400.928837] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Oct  7 16:56:06 hp-x360n kernel: [20400.929027] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct  7 16:56:08 hp-x360n kernel: [20402.912858] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Oct  7 16:56:08 hp-x360n kernel: [20402.913079] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct  7 16:56:16 hp-x360n kernel: [20410.912788] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:56:18 hp-x360n kernel: [20412.896789] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  7 16:56:19 hp-x360n kernel: [20414.049730] Asynchronous wait on fence i915:kwin_x11[3017]:d88a4 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])
Oct  7 16:56:20 hp-x360n kernel: [20414.880759] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----
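For anyone triaging a log like the one above, a minimal sketch (not part of the original report) for summarizing how often the driver retries a reset. It assumes only the message format visible in this log; the function name `summarize_resets` and the sample list are hypothetical.

```python
import re
from collections import Counter

# Matches the kernel timestamp and the i915 reset message as they appear
# in the log above, e.g. "[20328.930273] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0".
RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+: "
    r"Resetting (?P<target>\S+) for hang on (?P<engine>\S+)"
)


def summarize_resets(lines):
    """Return (reset count per target, gaps in seconds between consecutive resets)."""
    counts = Counter()
    stamps = []
    for line in lines:
        match = RESET_RE.search(line)
        if match is None:
            continue  # not a reset message (e.g. the mce lines)
        counts[match.group("target")] += 1
        stamps.append(float(match.group("ts")))
    gaps = [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]
    return counts, gaps


# Four lines copied from the log above: two engine resets, one chip reset, one unrelated line.
sample = [
    "Oct  7 16:54:54 hp-x360n kernel: [20328.930273] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "Oct  7 16:54:54 hp-x360n kernel: [20328.934266] i915 0000:00:02.0: Resetting chip for hang on rcs0",
    "Oct  7 16:55:02 hp-x360n kernel: [20336.929187] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "Oct  7 16:55:35 hp-x360n kernel: [20370.527934] mce: CPU2: Package temperature/speed normal",
]

counts, gaps = summarize_resets(sample)
print(dict(counts), gaps)
```

Run over the full log, this kind of summary makes the roughly two-second retry cadence and the switch from engine resets to chip resets easier to see than scrolling through the raw lines.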
Comment 3 Lakshmi 2019-10-08 14:52:11 UTC
(In reply to Kenneth C from comment #0)
> Created attachment 145678 [details]
> /sys/class/drm/card0/error
> 
> In bug 111085 (https://bugs.freedesktop.org/show_bug.cgi?id=111805)
> lakshminarayana.vudum@intel.com asked me to try running without the GuC
> enabled. 
> 
> I did that, and it's still hanging up. This is the DRM-tip right before
> commit c1132367 as that commit prevents my box from going into S0/s2idle
> suspend (see bug https://bugs.freedesktop.org/show_bug.cgi?id=111909).
> 
> Here's the worst part- if I can wrench control to a VT, I can usually "sudo
> systemctl hibernate" to force a power-cycle that unwedges the i915- but THIS
> time, right after the resume:
> 
> ----
> Oct  7 17:03:36 hp-x360n systemd-sleep[16719]: System resumed.
> Oct  7 17:03:36 hp-x360n systemd[1]: Stopping TLP suspend/resume...
> Oct  7 17:03:36 hp-x360n systemd[1]: Stopped TLP suspend/resume.
> Oct  7 17:04:40 hp-x360n kernel: [20868.899672] i915 0000:00:02.0: Resetting
> rcs0 for hang on rcs0
> Oct  7 17:05:16 hp-x360n kernel: [20904.931581] i915 0000:00:02.0: Resetting
> rcs0 for hang on rcs0
> Oct  7 17:07:04 hp-x360n kernel: [21012.899361] i915 0000:00:02.0: Resetting
> rcs0 for hang on rcs0
> ----
> 
> <facepalm>
> 
> The latest i915 changes on Sept 26th are really killing my workflow, as I
> can never tell when my laptop will just decide to hang up (and I can be
> doing such mundane tasks as viewing a webpage or building some software in a
> konsole- I don't game and this time I wasn't even watching video).
> 
> Is there ANYTHING I can do to help you guys diagnose, mitigate, or warn me
> when it's likely to occur? I've posted some 7 .../card0/error files and
> apparently there's not enough info in these to help figure out what's going
> on. Are there any debug flags (that won't ruin daily-driver performance)
> that I can try so when this happens again there's more info?
> 
> (Is there any way to just hack out a merge from a GIT tree?)

Mika, any suggestions here?
Comment 4 Manuel Tiago Pereira 2019-11-24 15:57:33 UTC
Created attachment 146018 [details]
/sys/class/drm/card0/error
Comment 5 Manuel Tiago Pereira 2019-11-24 15:57:51 UTC
Hi,

I'm seeing the same error since October if memory serves me right. As per the `dmesg` log instructions, I've attached my `/sys/class/drm/card0/error` log file, redacting what seems to be some dump of whatever was in the GPU memory at the time of the failure.

If I understand correctly the symptoms described by the original poster, I'm having the exact same behaviour: everything on the graphical interface hangs for a couple of seconds and `dmesg` outputs these messages:

```
[ 9836.991710] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 9836.991712] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 9836.991713] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 9836.991713] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 9836.991714] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 9836.991715] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 9836.992728] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
```

I'm running Arch Linux, default kernel package, on a Dell XPS 13 2012, with an "Intel Corporation UHD Graphics 620 (rev 07)" (as per `lspci` output). 

I hope that my report somehow helps understanding and fixing the issue!

Best regards,
Manuel Tiago Pereira
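
The hang line in this comment reads `ecode 9:0:0x00000000`, while the bug summary reads `ecode 9:1:0x00000000`. A minimal sketch (not part of the original comment) for comparing such lines field by field. The thread does not say what the three colon-separated fields mean, so the sketch treats them as opaque strings; `parse_hang` and `differing_fields` are hypothetical helper names.

```python
import re

# Matches the "GPU HANG" message format quoted in this thread.
# The three ecode fields are kept as opaque strings: their meaning is an
# open question here, not something this sketch tries to interpret.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<a>[0-9a-fA-F]+):(?P<b>[0-9a-fA-F]+):(?P<c>0x[0-9a-fA-F]+), "
    r"hang on (?P<engine>\S+)"
)


def parse_hang(line):
    """Return the ecode fields and engine name from a GPU HANG line, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    return {
        "fields": (match.group("a"), match.group("b"), match.group("c")),
        "engine": match.group("engine"),
    }


def differing_fields(first, second):
    """Return the indexes (0-based) of the ecode fields that differ."""
    pairs = zip(first["fields"], second["fields"])
    return [index for index, (x, y) in enumerate(pairs) if x != y]


# The first line is from the bug summary, the second from this comment.
summary_line = "i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0"
comment_line = "[ 9836.991710] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0"

summary_hang = parse_hang(summary_line)
comment_hang = parse_hang(comment_line)
print(summary_hang, comment_hang, differing_fields(summary_hang, comment_hang))
```

A difference in any field is only a hint that two reports may not be the same hang; the attached error state remains the authoritative source.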
Comment 6 Lakshmi 2019-11-27 16:03:27 UTC
(In reply to Manuel Tiago Pereira from comment #5)
> Hi,
> 
> I'm seeing the same error since October if memory serves me right. As per
> the `dmesg` log instructions, I've attached my `/sys/class/drm/card0/error`
> log file, redacting what seems to be some dump of whatever was in the GPU
> memory at the time of the failure.
> 
> If I understand correctly the symptoms described by the original poster,
> I'm having the exact same behaviour: everything on the graphical interface
> hangs for a couple of seconds and `dmesg` outputs these messages:
> 
> ```
> [ 9836.991710] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on
> rcs0
> [ 9836.991712] [drm] GPU hangs can indicate a bug anywhere in the entire gfx
> stack, including userspace.
> [ 9836.991713] [drm] Please file a _new_ bug report on bugs.freedesktop.org
> against DRI -> DRM/Intel
> [ 9836.991713] [drm] drm/i915 developers can then reassign to the right
> component if it's not a kernel issue.
> [ 9836.991714] [drm] The gpu crash dump is required to analyze gpu hangs, so
> please always attach it.
> [ 9836.991715] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [ 9836.992728] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
> ```
> 
> I'm running Arch Linux, default kernel package, on a Dell XPS 13 2012, with an
> "Intel Corporation UHD Graphics 620 (rev 07)" (as per `lspci` output). 
> 
> I hope that my report somehow helps understanding and fixing the issue!
> 
> Best regards,
> Manuel Tiago Pereira

Can you reproduce this issue with drm-tip (https://cgit.freedesktop.org/drm-tip)?
Comment 7 Lakshmi 2019-11-27 16:05:09 UTC
(In reply to Kenneth C from comment #1)
> Created attachment 145679 [details]
> /sys/class/drm/card0/error
> 
> This is another non-GuC hang, from yesterday. (It is not from drm-tip,
> however)

Chris, any comments on this issue? Would verifying with drm-tip help the user?
Comment 8 Kenneth C 2019-11-27 16:09:06 UTC
FWIW, I've been running the stuff that's been pushed into Linus' tip for a couple of weeks now and I've seen one GPU HANG and that appeared to recover, so I've switched back to those kernels now.

Good work, guys- as of now I'm satisfied.
Comment 9 Lakshmi 2019-11-28 13:00:50 UTC
(In reply to Kenneth C from comment #8)
> FWIW, I've been running the stuff that's been pushed into Linus' tip for a
> couple of weeks now and I've seen one GPU HANG and that appeared to recover,
> so I've switched back to those kernels now.
> 
> Good work, guys- as of now I'm satisfied.

Thanks for the feedback. I am closing this bug as WORKSFORME. If this issue appears again, please open a new bug, and first make sure the issue is reproducible on drm-tip (https://cgit.freedesktop.org/drm-tip).
