Created attachment 145678 [details]
/sys/class/drm/card0/error

In bug 111085 (https://bugs.freedesktop.org/show_bug.cgi?id=111805) lakshminarayana.vudum@intel.com asked me to try running without the GuC enabled.

I did that, and it's still hanging up. This is the DRM-tip right before commit c1132367 as that commit prevents my box from going into S0/s2idle suspend (see bug https://bugs.freedesktop.org/show_bug.cgi?id=111909).

Here's the worst part- if I can wrench control to a VT, I can usually "sudo systemctl hibernate" to force a power-cycle that unwedges the i915- but THIS time, right after the resume:

----
Oct 7 17:03:36 hp-x360n systemd-sleep[16719]: System resumed.
Oct 7 17:03:36 hp-x360n systemd[1]: Stopping TLP suspend/resume...
Oct 7 17:03:36 hp-x360n systemd[1]: Stopped TLP suspend/resume.
Oct 7 17:04:40 hp-x360n kernel: [20868.899672] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 17:05:16 hp-x360n kernel: [20904.931581] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 17:07:04 hp-x360n kernel: [21012.899361] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----

<facepalm>

The latest i915 changes on Sept 26th are really killing my workflow, as I can never tell when my laptop will just decide to hang up (and I can be doing such mundane tasks as viewing a webpage or building some software in a konsole- I don't game and this time I wasn't even watching video).

Is there ANYTHING I can do to help you guys diagnose, mitigate, or warn me when it's likely to occur? I've posted some 7 .../card0/error files and apparently there's not enough info in these to help figure out what's going on. Are there any debug flags (that won't ruin daily-driver performance) that I can try so when this happens again there's more info?

(Is there any way to just hack out a merge from a GIT tree?)
Created attachment 145679 [details] /sys/class/drm/card0/error This is another non-GuC hang, from yesterday. (It is not from drm-tip, however)
This is the dmesg from today's hang:

I did notice this, which I hadn't seen before:

Asynchronous wait on fence i915:kwin_x11[3017]:d88a4 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])

----
Oct 7 16:54:54 hp-x360n kernel: [20328.929256] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Oct 7 16:54:54 hp-x360n kernel: [20328.929260] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Oct 7 16:54:54 hp-x360n kernel: [20328.929261] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Oct 7 16:54:54 hp-x360n kernel: [20328.929262] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Oct 7 16:54:54 hp-x360n kernel: [20328.929263] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Oct 7 16:54:54 hp-x360n kernel: [20328.929265] GPU crash dump saved to /sys/class/drm/card0/error
Oct 7 16:54:54 hp-x360n kernel: [20328.930273] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:54:54 hp-x360n kernel: [20328.931019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct 7 16:54:54 hp-x360n kernel: [20328.934266] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct 7 16:54:54 hp-x360n kernel: [20328.936037] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct 7 16:54:54 hp-x360n kernel: [20328.936783] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct 7 16:55:02 hp-x360n kernel: [20336.929187] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:10 hp-x360n kernel: [20344.929132] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:12 hp-x360n kernel: [20346.913128] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:14 hp-x360n kernel: [20348.897114] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:16 hp-x360n kernel: [20350.881102] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:18 hp-x360n kernel: [20352.929087] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:20 hp-x360n kernel: [20354.913078] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:22 hp-x360n kernel: [20356.897068] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:24 hp-x360n kernel: [20358.881055] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:26 hp-x360n kernel: [20360.929066] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:28 hp-x360n kernel: [20362.913023] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:30 hp-x360n kernel: [20364.897054] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:32 hp-x360n kernel: [20366.881001] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:34 hp-x360n kernel: [20368.928989] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:35 hp-x360n kernel: [20370.527934] mce: CPU2: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.527935] mce: CPU6: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.528006] mce: CPU1: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.528007] mce: CPU0: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.528007] mce: CPU4: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.528008] mce: CPU5: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.528009] mce: CPU3: Package temperature/speed normal
Oct 7 16:55:35 hp-x360n kernel: [20370.528010] mce: CPU7: Package temperature/speed normal
Oct 7 16:55:36 hp-x360n kernel: [20370.913036] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:38 hp-x360n kernel: [20372.897003] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:40 hp-x360n kernel: [20374.880977] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:42 hp-x360n kernel: [20376.928995] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:44 hp-x360n kernel: [20378.912956] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:46 hp-x360n kernel: [20380.896965] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:48 hp-x360n kernel: [20382.880929] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:50 hp-x360n kernel: [20384.928929] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:52 hp-x360n kernel: [20386.912919] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:54 hp-x360n kernel: [20388.896907] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:56 hp-x360n kernel: [20390.880898] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:55:58 hp-x360n kernel: [20392.929904] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:56:00 hp-x360n kernel: [20394.912873] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:56:02 hp-x360n kernel: [20396.896862] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:56:04 hp-x360n kernel: [20398.880874] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:56:06 hp-x360n kernel: [20400.928837] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Oct 7 16:56:06 hp-x360n kernel: [20400.929027] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct 7 16:56:08 hp-x360n kernel: [20402.912858] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Oct 7 16:56:08 hp-x360n kernel: [20402.913079] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct 7 16:56:16 hp-x360n kernel: [20410.912788] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:56:18 hp-x360n kernel: [20412.896789] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct 7 16:56:19 hp-x360n kernel: [20414.049730] Asynchronous wait on fence i915:kwin_x11[3017]:d88a4 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])
Oct 7 16:56:20 hp-x360n kernel: [20414.880759] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----
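For anyone who wants to summarize a log like the one above rather than read it line by line, here is a minimal Python sketch. It is not an i915 tool and it cannot predict a hang; it only counts the message types that appear in this dmesg excerpt and reports the gaps between engine resets (about two seconds apart for most of the log above). The names `EVENT_RE` and `summarize_hangs` are hypothetical, and the regular expression assumes the exact message wording shown in this report.

```python
import re
from collections import Counter

# Assumption: message wording matches the dmesg excerpt in this report.
# Captures the kernel timestamp and one of three i915 message types.
EVENT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+ "
    r"(?P<event>GPU HANG|Resetting (?P<target>\w+) for hang|GPU recovery timed out)"
)


def summarize_hangs(log_text):
    """Count i915 hang-related messages and list gaps between engine resets.

    Returns (counts, gaps): counts is a Counter keyed by event kind, gaps is
    a list of seconds between consecutive engine (non-chip) resets.
    """
    counts = Counter()
    engine_reset_times = []
    for match in EVENT_RE.finditer(log_text):
        timestamp = float(match.group("ts"))
        target = match.group("target")
        if target == "chip":
            counts["chip_reset"] += 1
        elif target:
            # Any other target (e.g. rcs0) is treated as an engine reset.
            counts["engine_reset"] += 1
            engine_reset_times.append(timestamp)
        elif match.group("event") == "GPU HANG":
            counts["gpu_hang"] += 1
        else:
            counts["recovery_timeout"] += 1
    gaps = [b - a for a, b in zip(engine_reset_times, engine_reset_times[1:])]
    return counts, gaps
```

The function takes plain text, so it can be fed saved `dmesg` or `journalctl -k` output; a long run of short gaps followed by `recovery_timeout` entries would correspond to the wedged state described in this comment.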
(In reply to Kenneth C from comment #0)

Mika, any suggestions here?
Created attachment 146018 [details] /sys/class/drm/card0/error
Hi,

I've been seeing the same error since October, if memory serves me right. As per the `dmesg` log instructions, I've attached my `/sys/class/drm/card0/error` log file, redacting what seems to be a dump of whatever was in the GPU memory at the time of the failure.

If I understand correctly the symptoms described by the original poster, I'm having the exact same behaviour: everything on the graphical interface hangs for a couple of seconds and `dmesg` outputs these messages:

```
[ 9836.991710] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 9836.991712] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 9836.991713] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 9836.991713] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 9836.991714] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 9836.991715] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 9836.992728] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
```

I'm running Archlinux, default kernel package, on a Dell XPS 13 2012, with an "Intel Corporation UHD Graphics 620 (rev 07)" (as per `lspci` output).

I hope that my report somehow helps in understanding and fixing the issue!

Best regards,
Manuel Tiago Pereira
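One way to check whether two reports show the same hang line is to compare the `GPU HANG: ecode ...` strings directly. The original reporter's log reads `ecode 9:1:0x00000000`, while the one in this comment reads `ecode 9:0:0x00000000`. The Python sketch below extracts those strings for a side-by-side comparison. It treats the ecode as an opaque string and makes no claim about what each field means; the names `HANG_RE`, `hang_signatures`, and `same_signature` are hypothetical.

```python
import re

# Assumption: the line format matches the "GPU HANG" messages quoted in this
# thread. The ecode is kept as an opaque string; its fields are not decoded.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-F]+:[0-9a-fA-F]+:0x[0-9a-fA-F]+)"
    r"(?:, hang on (?P<engine>\w+))?"
)


def hang_signatures(log_text):
    """Return one (ecode, engine) pair per 'GPU HANG' line found in the text."""
    return [
        (match.group("ecode"), match.group("engine"))
        for match in HANG_RE.finditer(log_text)
    ]


def same_signature(log_a, log_b):
    """True if the two logs share at least one identical (ecode, engine) pair."""
    return bool(set(hang_signatures(log_a)) & set(hang_signatures(log_b)))
```

A differing string does not by itself prove the hangs are unrelated, and a matching one does not prove they are the same bug; the attached error state is still what the developers ask for.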
(In reply to Manuel Tiago Pereira from comment #5)

Can you reproduce this issue with drm-tip? (https://cgit.freedesktop.org/drm-tip)
(In reply to Kenneth C from comment #1)
> Created attachment 145679 [details]
> /sys/class/drm/card0/error
>
> This is another non-GuC hang, from yesterday. (It is not from drm-tip,
> however)

Chris, any comments on this issue? Would verifying with drm-tip help the user?
FWIW, I've been running the stuff that's been pushed into Linus' tip for a couple of weeks now and I've seen one GPU HANG and that appeared to recover, so I've switched back to those kernels now. Good work, guys- as of now I'm satisfied.
(In reply to Kenneth C from comment #8)
> FWIW, I've been running the stuff that's been pushed into Linus' tip for a
> couple of weeks now and I've seen one GPU HANG and that appeared to recover,
> so I've switched back to those kernels now.
>
> Good work, guys- as of now I'm satisfied.

Thanks for the feedback. I am closing this bug as WORKSFORME. If this issue appears again, please open a new bug, and make sure the issue is reproducible on drm-tip (https://cgit.freedesktop.org/drm-tip).