Bug 111919

Summary: Intel card (Coffeelake) short freezes (hang) after upgrade to kernel 5.3.4
Product: DRI
Component: DRM/Intel
Status: RESOLVED MOVED
Severity: major
Priority: high
Version: unspecified
Hardware: Other
OS: All
Whiteboard: Triaged, ReadyForDev
i915 platform: CFL
i915 features: GPU hang
Reporter: Stanislav Ochotnicky <freedesktop.org>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: bugsfree, chris, intel-gfx-bugs, jakov.ivkovic
Attachments:
  /sys/class/drm/card0/error output (flags: none)
  /sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null (flags: none)
  gpu crash dump (flags: none)
  Similar crash in my system (flags: none)
  Similar crash in my system (flags: none)

Description Stanislav Ochotnicky 2019-10-07 14:33:48 UTC
I updated my kernel to 5.3.4 today and had a few short (few-second) UI freezes. The freezes recovered, and I found the following in my dmesg:

[101184.000657] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[101184.000659] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[101184.000659] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[101184.000660] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[101184.000660] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[101184.000660] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[101184.001664] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[101394.022014] usb 1-11.3: reset high-speed USB device number 6 using xhci_hcd
[101752.002696] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0


I can provide a list of more system components/libraries, but as far as I can tell this is related to the 5.3.4 kernel update. I've seen the freezes in a few cases, mostly during web browser usage it seems. I am attaching the card error output from sysfs. If more info/testing etc. is needed, let me know.
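
For anyone comparing how often the hang and reset messages occur across kernels, the relevant dmesg lines can be extracted with a short script. This is a minimal sketch, assuming the message wording shown in the excerpt above (other kernel versions may phrase these lines differently); the function name `find_gpu_events` is hypothetical.

```python
import re

# Patterns modelled on the dmesg excerpt in this report; they are an
# assumption about the message format, not a stable kernel interface.
HANG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] i915 \S+ GPU HANG: ecode ([^,\s]+), hang on (\w+)"
)
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] i915 \S+ Resetting (\w+) for hang on (\w+)"
)


def find_gpu_events(log_text):
    """Return (timestamp, kind, engine) tuples for i915 hang/reset lines."""
    events = []
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            # Group 3 is the engine named after "hang on".
            events.append((float(match.group(1)), "hang", match.group(3)))
            continue
        match = RESET_RE.search(line)
        if match:
            # Group 2 is the engine being reset.
            events.append((float(match.group(1)), "reset", match.group(2)))
    return events
```

Feeding it the output of `dmesg` (or the kernel section of the journal) gives a list that can be compared between boots; unrelated lines such as the USB reset in the excerpt are ignored.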
Comment 1 Stanislav Ochotnicky 2019-10-07 14:34:34 UTC
Created attachment 145676 [details]
/sys/class/drm/card0/error output
Comment 2 Chris Wilson 2019-10-07 14:42:32 UTC
rcs0 command stream:
  IDLE?: no
  START: 0x00009000
  HEAD:  0x00400820 [0x00000000]
  TAIL:  0x00000820 [0x00000000, 0x00000000]
  CTL:   0x00003001
  MODE:  0x00000000
  HWS:   0xffffe000
  ACTHD: 0x00000000 00400820
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdfffff
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][1]: 0xffffffff
  ROW_INSTDONE[0][2]: 0xffffffff
  BBADDR: 0x0000fffe_ec2fca94
  BB_STATE: 0x00000020
  INSTPS: 0x00008840
  INSTPM: 0x00000000
  FADDR: 0x00000000 00009820
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x0000000b7e32f000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck timestamp: 0ms (4395844992; epoch)
  engine reset count: 0
  ELSP[0]:  pid 801, seqno       15:0011639e!, prio 2, emitted -960ms, start 00009000, head 00000780, tail 00000820
  ELSP[1]:  pid 0, seqno        5:0000178c, prio -4093, emitted -959ms, start 00001000, head 000008c0, tail 00000928
  Active context: [0] hw_id 0, prio 0, guilty 0 active 0

The GPU did not do a context switch at the end of ELSP[0].

It seems you are able to reproduce this fairly easily with your usage. Could you try setting i915.dmc_firmware_path=/dev/null on your kernel/grub command line?
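
As an illustration of how the HEAD and TAIL values in the dump above can be read: a minimal sketch, assuming the ring HEAD register keeps the ring offset in its low bits and a wrap counter in the bits above it. The mask and shift values below are assumptions that do not come from this report, and the helper names are hypothetical.

```python
# Assumed register layout (not stated in this report):
HEAD_ADDR_MASK = 0x001FFFFC  # ring offset bits of the HEAD register
HEAD_WRAP_SHIFT = 21         # wrap counter occupies the bits from 21 up
TAIL_ADDR_MASK = 0x001FFFF8  # ring offset bits of the TAIL register


def decode_head(head):
    """Split a raw HEAD value into ring offset and wrap count."""
    return {
        "offset": head & HEAD_ADDR_MASK,
        "wraps": head >> HEAD_WRAP_SHIFT,
    }


def ring_drained(head, tail):
    """True when the engine's read offset has caught up with the tail."""
    return decode_head(head)["offset"] == (tail & TAIL_ADDR_MASK)
```

With the values from the dump, `decode_head(0x00400820)` gives offset 0x820 and `ring_drained(0x00400820, 0x00000820)` is true. Under the assumed layout, that would be consistent with the engine having reached the end of the request in ELSP[0] without switching to ELSP[1].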
Comment 3 Chris Wilson 2019-10-07 14:47:04 UTC
For the record, what is your last known good kernel version (what version did you upgrade from)?
Comment 4 Stanislav Ochotnicky 2019-10-07 15:02:11 UTC
As far as I can tell - 5.3.2 was OK but I am not 100% sure. I definitely skipped 5.3.3 during my updates so that could go either way. 

It's possible I used my PC mostly remotely during the time 5.3.2 was used so I would not have noticed any GFX issues. Let's say - 5.3.x might be affected.

I now have a system booted with i915.dmc_firmware_path=/dev/null kernel commandline. I'll report back if I have more info (anything specific to look for?)

For now I'll just use it and see if I notice any weirdness...
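
To double-check that the option is present on the running kernel's command line after a reboot, the contents of /proc/cmdline can be inspected. A minimal sketch, assuming a Linux system; the helper names are hypothetical, and the simple whitespace split does not handle quoted values that contain spaces.

```python
def cmdline_params(cmdline):
    """Parse a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces are
    not handled by this simple split.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def dmc_disabled(cmdline):
    """True if i915.dmc_firmware_path=/dev/null is on the command line."""
    return cmdline_params(cmdline).get("i915.dmc_firmware_path") == "/dev/null"
```

On the affected machine this could be called as `dmc_disabled(open("/proc/cmdline").read())`. It only shows that the parameter was passed to the kernel, not what the driver did with it.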
Comment 5 Chris Wilson 2019-10-07 15:07:13 UTC
Disabling the DMC will prevent reaching package C-state 8+; otherwise it should have no impact, so we are on the lookout to see if it hangs again.
Comment 6 Stanislav Ochotnicky 2019-10-07 16:32:28 UTC
FWIW, I've dug a bit more in the journal and around the same time I have these (presumably Chromium) logs:
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:shared_image_manager.cc(120)] SharedImageManager::ProduceGLTexture: Trying to produce a representation from a non-existent mailbox. 3E:FB:28:49:D0:7B:96:F0:6F:34:7A:9B:8C:07:C3:09
ERROR:gles2_cmd_decoder.cc(18508)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoCreateAndTexStorage2DSharedImageINTERNAL: invalid mailbox name
ERROR:gles2_cmd_decoder.cc(18529)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoBeginSharedImageAccessCHROMIUM: bound texture is not a shared image
ERROR:gles2_cmd_decoder.cc(18552)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoEndSharedImageAccessCHROMIUM: bound texture is not a shared image
ERROR:gles2_cmd_decoder.cc(18529)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoBeginSharedImageAccessCHROMIUM: bound texture is not a shared image


The GL ERRORS repeat for a few seconds until:
ERROR:logger.cc(46)] Too many GL errors, not reporting any more for this context. use --disable-gl-error-limit to see all errors

I'll continue running with i915.dmc_firmware_path=/dev/null and see if I can reproduce (so far I haven't been able to).
Comment 7 Stanislav Ochotnicky 2019-10-13 15:52:12 UTC
Created attachment 145727 [details]
/sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null

I have managed to reproduce again - even after upgrading kernel to 5.3.5 and adding i915.dmc_firmware_path=/dev/null kernel command line option.

Attaching new output of /sys/class/drm/card0/error

I haven't yet found an exact reproducer, but will try to dig. My current two leads/ideas are:
 * Related to suspend/resume (i.e. I don't remember seeing hang after fresh boot, only after suspend/resume cycle)
 * Related to IOMMU/2nd video card being assigned to a VM

Both of the above might be wild goose chases at this point though.
Comment 8 jakov.ivkovic 2019-10-14 19:53:34 UTC
Created attachment 145737 [details]
gpu crash dump

Same thing happening to me.

I can confirm that (at least in my case) it doesn't happen only after a suspend/resume cycle. This crash happened shortly after a reboot.
Comment 9 jakov.ivkovic 2019-10-14 19:59:01 UTC
Forgot to mention: it happens to me while running Chromium as well.
Comment 10 Lakshmi 2019-10-15 07:01:54 UTC
(In reply to Chris Wilson from comment #5)
> Disabling the DMC will prevent reaching package C-state 8+; otherwise it
> should have no impact, so we are on the lookout to see if it hangs again.

(In reply to Stanislav Ochotnicky from comment #7)
> Created attachment 145727 [details]
> /sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null
> 
> I have managed to reproduce again - even after upgrading kernel to 5.3.5 and
> adding i915.dmc_firmware_path=/dev/null kernel command line option.
> 
> Attaching new output of /sys/class/drm/card0/error
> 
> I haven't yet found an exact reproducer, but will try to dig. My current two
> leads/ideas are:
>  * Related to suspend/resume (i.e. I don't remember seeing hang after fresh
> boot, only after suspend/resume cycle)
>  * Related to IOMMU/2nd video card being assigned to a VM
> 
> Both of the above might be wild goose chases at this point though.

CC'ing Chris.
Comment 11 Stanislav Ochotnicky 2019-11-04 16:40:59 UTC
I can still reproduce on 5.3.8. I can provide another dump of /sys/class/drm/card0/error if needed.

I can also confirm this has nothing to do with suspend as I can reproduce after fresh restart.

Overall it's not a big deal for me since it ends up just as a short UI freeze. But if I can provide any additional information let me know.

I should be able to start git-bisect if that would help narrow things down.
Comment 12 Stanislav Ochotnicky 2019-11-04 18:30:11 UTC
I just started a VM which has a different graphics card assigned and I experienced another hang with i915. Perhaps this is not necessarily VM/IOMMU related, but that might be one of the triggers?

In any case - I have a separate crash dump for this event.
Comment 13 Lakshmi 2019-11-05 14:15:32 UTC
 
> I should be able to start git-bisect if that would help narrow things down.

Yes, this definitely helps. Can you post the bad commit that caused the issue? Thanks!
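
For readers unfamiliar with bisection: it is a binary search over the commit history between a known-good and a known-bad kernel. A minimal sketch of the idea, assuming a linear list of commits and a hypothetical `is_bad` test; in practice `git bisect` does this bookkeeping, and each test means building and booting a kernel.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad. `is_bad` is a hypothetical
    callback standing in for "build, boot, and try to reproduce the hang".
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Because the hang reported here is intermittent, a "good" verdict is only as reliable as the time spent trying to reproduce on that build, so each step may need a longer test run than usual.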
Comment 14 ilvez 2019-11-27 05:32:29 UTC
Not sure whether I should open a new report. I think I've been hit by the same bug with similar characteristics.

I started to notice short (a second, mostly less) freezes with the 5.3 kernel (I don't know the exact version; since I got it from Debian unstable, it surely wasn't the initial 5.3.0). With the 5.2 kernel everything was fine. I notice freezes only in Firefox and Thunderbird.

I get the same dmesg output (mostly "Resetting rcs0 for hang on rcs0"). I think /sys/class/drm/card0/error got generated when I tried to close a message window in Thunderbird. If you want, I can upload my /sys/class/drm/card0/error.

Not every freeze in Firefox/Thunderbird generates a "Resetting..." message, so I'm not 100% sure that these are actually related. I only connected these things today, so I'll start monitoring.

I haven't compiled a running kernel from source in years, so I can't currently do it. If you point me to good instructions, I can try to git-bisect this.
Comment 15 ilvez 2019-11-29 05:33:45 UTC
Created attachment 146045 [details]
Similar crash in my system

Anyway, I'll add my crash dump. I have looked for a relation between the Firefox/Thunderbird mini-freezes and these errors, but haven't found any correlation.
Comment 16 ilvez 2019-11-29 05:35:21 UTC
Created attachment 146046 [details]
Similar crash in my system
Comment 17 Lakshmi 2019-11-29 07:10:05 UTC
(In reply to ilvez from comment #16)
> Created attachment 146046 [details]
> Similar crash in my system

Can you try to reproduce this issue using drm-tip (https://cgit.freedesktop.org/drm-tip)?
Comment 18 Martin Peres 2019-11-29 19:38:10 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/484.
