Bug 110457

Summary: System fails to resume and hits [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout on Acer Aspire A315-21G
Product: DRI
Reporter: jian-hong
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: critical    
Priority: high    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
dmesg with amdgpu.dc=1 drm.debug=7 in boot command | none
dmesg with amdgpu.dc=1 drm.debug=7 amdgpu.runpm=0 in boot command | none
lspci -nnv on Acer Squirtle_SR | none
dmesg with amdgpu.dc=1 drm.debug=7 in boot command on Acer TravelMate B114-21 | none
lspci -nnv on Acer TravelMate B114-21 | none
journal log on Acer TravelMate B114-21 | none
Thinkpad E585 log file with amdgpu errors | none

Description jian-hong 2019-04-17 05:53:27 UTC
Created attachment 144006 [details]
dmesg with amdgpu.dc=1 drm.debug=7 in boot command

We have an Acer Squirtle_SR laptop equipped with an AMD A9-9420e RADEON R5, 5 COMPUTE CORES 2C+3G and an [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1002:6900]. We tested it with Linux kernel 5.1.0-rc5+.

The kernel includes the patch [1] mentioned in bug 110360, comment 9 [2].

The screen stays black after the system resumes from suspend.

The following error keeps appearing in dmesg:

[  177.401716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=290, emitted seq=294
[  177.401848] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 569 thread Xorg:cs0 pid 571
[  177.401855] [drm] IP block:gfx_v8_0 is hung!
[  177.401932] [drm] GPU recovery disabled.
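
For comparing the timeout lines across the logs in this report, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of such a line. It assumes "signaled seq" is the last fence the GPU completed and "emitted seq" is the last fence the driver submitted, so the difference is the number of jobs still outstanding on the ring. The helper name pending_fences is only illustrative.

```python
import re

# Pattern for the amdgpu timeout line exactly as it appears in the dmesg output above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a timeout line, or None if it does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))


sample = (
    "[  177.401716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=290, emitted seq=294"
)
print(pending_fences(sample))  # ('gfx', 4)
```

Under that assumption, the line above means four submissions were still pending on the gfx ring when the timeout fired.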

01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1002:6900] (rev c3)
	Subsystem: Acer Incorporated [ALI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1025:1217]
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 3000 [size=256]
	Memory at d1400000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at d1440000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] #19
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

[1] https://patchwork.kernel.org/patch/10889269/
[2] https://bugzilla.freedesktop.org/show_bug.cgi?id=110360#c9
Comment 1 jian-hong 2019-04-17 06:05:26 UTC
Created attachment 144007 [details]
dmesg with amdgpu.dc=1 drm.debug=7 amdgpu.runpm=0 in boot command

I also tried adding amdgpu.runpm=0 to the boot command line. However, it still hits the same error.

[   78.078762] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=290, emitted seq=294
[   78.078897] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 572 thread Xorg:cs0 pid 588
[   78.078908] [drm] IP block:gfx_v8_0 is hung!
[   78.079079] [drm] GPU recovery disabled.
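
When reproducing this, it can help to confirm that the parameters actually reached the kernel. The following is a rough Python sketch, not part of the original test setup: it checks a kernel command line string for the three parameters used in this report. The example command line is made up for illustration; on a real system the string would be read from /proc/cmdline. The naive whitespace split does not handle quoted values.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted parameters that are absent or set to a different value."""
    params = parse_cmdline(cmdline)
    return {name: value for name, value in wanted.items() if params.get(name) != value}


# Parameters taken from the attachment descriptions in this report.
WANTED = {"amdgpu.dc": "1", "drm.debug": "7", "amdgpu.runpm": "0"}

# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 quiet amdgpu.dc=1 drm.debug=7"
print(missing_params(example, WANTED))  # {'amdgpu.runpm': '0'}
```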
Comment 2 jian-hong 2019-04-17 06:07:15 UTC
Created attachment 144008 [details]
lspci -nnv on Acer Squirtle_SR
Comment 3 jian-hong 2019-04-18 08:40:20 UTC
Created attachment 144030 [details]
dmesg with amdgpu.dc=1 drm.debug=7 in boot command on Acer TravelMate B114-21

We have another laptop, an Acer TravelMate B114-21, which hits the same issue. It is equipped with an AMD A4-9120C RADEON R4, 5 COMPUTE CORES 2C+3G.

[   60.011965] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=206, emitted seq=208
[   60.012215] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1388 thread gnome-shel:cs0 pid 1409
[   60.012226] [drm] IP block:gfx_v8_0 is hung!
[   60.012320] [drm] GPU recovery disabled.

00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Stoney [Radeon R2/R3/R4/R5 Graphics] [1002:98e4] (rev eb) (prog-if 00 [VGA controller])
	Subsystem: Acer Incorporated [ALI] Stoney [Radeon R2/R3/R4/R5 Graphics] [1025:132a]
	Flags: bus master, fast devsel, latency 0, IRQ 36
	Memory at e8000000 (64-bit, prefetchable) [size=128M]
	Memory at f0000000 (64-bit, prefetchable) [size=8M]
	I/O ports at f000 [size=256]
	Memory at fea00000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Complex Integrated Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270] #19
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
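
To compare the affected GPUs across machines, the vendor:device IDs can be extracted from the `lspci -nn` output. This is a small Python sketch that assumes the bracketed `[vvvv:dddd]` format shown above; the function name pci_ids is illustrative.

```python
import re

# Matches "[vvvv:dddd]" ID pairs. Class codes such as "[0380]" have no colon,
# and names such as "[AMD/ATI]" are not hex pairs, so neither is matched.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(line):
    """Return all (vendor, device) ID pairs found in one lspci -nn line."""
    return PCI_ID_RE.findall(line)


stoney = (
    "00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. "
    "[AMD/ATI] Stoney [Radeon R2/R3/R4/R5 Graphics] [1002:98e4] (rev eb) "
    "(prog-if 00 [VGA controller])"
)
print(pci_ids(stoney))  # [('1002', '98e4')]
```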

I also tried adding amdgpu.runpm=0 to the boot command line, but the issue can still be reproduced.
Comment 4 jian-hong 2019-04-18 08:41:31 UTC
Created attachment 144031 [details]
lspci -nnv on Acer TravelMate B114-21
Comment 5 jian-hong 2019-04-19 07:18:49 UTC
Created attachment 144042 [details]
journal log on Acer TravelMate B114-21

I got more information after waiting longer for the Acer TravelMate B114-21 to resume.

Apr 19 15:06:38 endless kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2841, emitted seq=2845
Apr 19 15:06:38 endless kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 695 thread Xorg:cs0 pid 698
Apr 19 15:06:38 endless kernel: [drm] IP block:gfx_v8_0 is hung!
Apr 19 15:06:38 endless kernel: [drm] GPU recovery disabled.
Apr 19 15:06:40 endless kernel: INFO: task Xorg:695 blocked for more than 604 seconds.
Apr 19 15:06:40 endless kernel:       Tainted: G        W         5.1.0-rc5+ #1
Apr 19 15:06:40 endless kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 19 15:06:40 endless kernel: Xorg            D    0   695    683 0x00400004
Apr 19 15:06:40 endless kernel: Call Trace:
Apr 19 15:06:40 endless kernel:  __schedule+0x2d4/0x840
Apr 19 15:06:40 endless kernel:  schedule+0x2c/0x70
Apr 19 15:06:40 endless kernel:  schedule_timeout+0x258/0x360
Apr 19 15:06:40 endless kernel:  ? amdgpu_atom_execute_table_locked+0x136/0x210 [amdgpu]
Apr 19 15:06:40 endless kernel:  dma_fence_default_wait+0x20a/0x280
Apr 19 15:06:40 endless kernel:  ? dma_fence_release+0xa0/0xa0
Apr 19 15:06:40 endless kernel:  dma_fence_wait_timeout+0xe7/0x110
Apr 19 15:06:40 endless kernel:  amdgpu_fence_wait_empty+0x61/0xc0 [amdgpu]
Apr 19 15:06:40 endless kernel:  amdgpu_pm_compute_clocks+0x70/0x590 [amdgpu]
Apr 19 15:06:40 endless kernel:  dm_pp_apply_display_requirements+0x19a/0x1b0 [amdgpu]
Apr 19 15:06:40 endless kernel:  dce11_pplib_apply_display_requirements+0x1f4/0x210 [amdgpu]
Apr 19 15:06:40 endless kernel:  dce11_update_clocks+0xa0/0x100 [amdgpu]
Apr 19 15:06:40 endless kernel:  dce110_prepare_bandwidth+0x3e/0x50 [amdgpu]
Apr 19 15:06:40 endless kernel:  dc_commit_state+0x22d/0x5a0 [amdgpu]
Apr 19 15:06:40 endless kernel:  ? drm_calc_timestamping_constants+0x106/0x150 [drm]
Apr 19 15:06:40 endless kernel:  amdgpu_dm_atomic_commit_tail+0x1fb/0x1930 [amdgpu]
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x40/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x40/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x40/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x40/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x40/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x40/0x70
Apr 19 15:06:40 endless kernel:  ? __switch_to_xtra+0x3b8/0x5b0
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? ttm_bo_mem_compat+0x28/0x60 [ttm]
Apr 19 15:06:40 endless kernel:  ? ttm_bo_validate+0x3d/0x130 [ttm]
Apr 19 15:06:40 endless kernel:  ? __switch_to+0x48b/0x4f0
Apr 19 15:06:40 endless kernel:  ? __switch_to_asm+0x34/0x70
Apr 19 15:06:40 endless kernel:  ? __schedule+0x2dc/0x840
Apr 19 15:06:40 endless kernel:  ? amdgpu_bo_pin_restricted+0x1a2/0x270 [amdgpu]
Apr 19 15:06:40 endless kernel:  ? _cond_resched+0x19/0x30
Apr 19 15:06:40 endless kernel:  ? wait_for_completion_timeout+0x38/0x140
Apr 19 15:06:40 endless kernel:  ? _cond_resched+0x19/0x30
Apr 19 15:06:40 endless kernel:  ? wait_for_completion_interruptible+0x35/0x1a0
Apr 19 15:06:40 endless kernel:  commit_tail+0x42/0x70 [drm_kms_helper]
Apr 19 15:06:40 endless kernel:  ? commit_tail+0x42/0x70 [drm_kms_helper]
Apr 19 15:06:40 endless kernel:  drm_atomic_helper_commit+0x113/0x120 [drm_kms_helper]
Apr 19 15:06:40 endless kernel:  amdgpu_dm_atomic_commit+0x9b/0xe0 [amdgpu]
Apr 19 15:06:40 endless kernel:  drm_atomic_commit+0x4a/0x50 [drm]
Apr 19 15:06:40 endless kernel:  drm_atomic_helper_set_config+0x87/0x90 [drm_kms_helper]
Apr 19 15:06:40 endless kernel:  drm_mode_setcrtc+0x1bb/0x740 [drm]
Apr 19 15:06:40 endless kernel:  ? drm_is_current_master+0x1f/0x40 [drm]
Apr 19 15:06:40 endless kernel:  ? drm_mode_getcrtc+0x1a0/0x1a0 [drm]
Apr 19 15:06:40 endless kernel:  drm_ioctl_kernel+0xb0/0x100 [drm]
Apr 19 15:06:40 endless kernel:  drm_ioctl+0x233/0x410 [drm]
Apr 19 15:06:40 endless kernel:  ? drm_mode_getcrtc+0x1a0/0x1a0 [drm]
Apr 19 15:06:40 endless kernel:  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
Apr 19 15:06:40 endless kernel:  do_vfs_ioctl+0xa9/0x640
Apr 19 15:06:40 endless kernel:  ? tomoyo_file_ioctl+0x19/0x20
Apr 19 15:06:40 endless kernel:  ksys_ioctl+0x67/0x90
Apr 19 15:06:40 endless kernel:  __x64_sys_ioctl+0x1a/0x20
Apr 19 15:06:40 endless kernel:  do_syscall_64+0x5a/0x110
Apr 19 15:06:40 endless kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 19 15:06:40 endless kernel: RIP: 0033:0x7f36f7126777
Apr 19 15:06:40 endless kernel: Code: Bad RIP value.
Apr 19 15:06:40 endless kernel: RSP: 002b:00007ffeb62a80d8 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
Apr 19 15:06:40 endless kernel: RAX: ffffffffffffffda RBX: 00007ffeb62a8110 RCX: 00007f36f7126777
Apr 19 15:06:40 endless kernel: RDX: 00007ffeb62a8110 RSI: 00000000c06864a2 RDI: 000000000000000d
Apr 19 15:06:40 endless kernel: RBP: 00007ffeb62a8110 R08: 0000000000000000 R09: 00005652f3eb9510
Apr 19 15:06:40 endless kernel: R10: 00007ffeb62a81d0 R11: 0000000000003246 R12: 00000000c06864a2
Apr 19 15:06:40 endless kernel: R13: 000000000000000d R14: 0000000000000000 R15: 00005652f3eb9510
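
The trace above is easier to read once the frames prefixed with "?" are dropped. On x86 those are addresses found on the stack that the unwinder could not confirm as part of the call chain. This is a minimal Python sketch of that filtering, assuming the journal line format shown above; the function name reliable_frames is illustrative.

```python
import re

# One stack frame in a journal line: "kernel:", an optional "? " marker for an
# unreliable frame, then symbol+0xoffset/0xsize.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unreliable>\? )?(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(lines):
    """Return the symbols of frames not marked with '?', in trace order."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            frames.append(m.group("symbol"))
    return frames


# A few lines copied from the journal excerpt above.
excerpt = [
    "Apr 19 15:06:40 endless kernel: Call Trace:",
    "Apr 19 15:06:40 endless kernel:  dma_fence_default_wait+0x20a/0x280",
    "Apr 19 15:06:40 endless kernel:  ? dma_fence_release+0xa0/0xa0",
    "Apr 19 15:06:40 endless kernel:  dma_fence_wait_timeout+0xe7/0x110",
    "Apr 19 15:06:40 endless kernel:  amdgpu_fence_wait_empty+0x61/0xc0 [amdgpu]",
    "Apr 19 15:06:40 endless kernel:  amdgpu_pm_compute_clocks+0x70/0x590 [amdgpu]",
]
for symbol in reliable_frames(excerpt):
    print(symbol)
```

Filtered this way, the trace shows Xorg blocked in amdgpu_fence_wait_empty, reached from amdgpu_pm_compute_clocks during the display commit triggered by drm_mode_setcrtc.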
Comment 6 Yury Zhuravlev 2019-04-24 00:31:43 UTC
Vega56
Ryzen 2700x
Kernel 5.0.3
Mesa latest master git
libdrm latest master git
llvm 8

I have the same problem when I use DXVK for the free version of Assassin's Creed.

[ 3137.670744] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=191619, emitted seq=191621
[ 3137.670765] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACU.exe pid 8085 thread ACU.exe:cs0 pid 8118
[ 3137.670767] amdgpu 0000:1f:00.0: GPU reset begin!
[ 3147.900752] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
Comment 7 Cameron Banfield 2019-05-09 19:34:26 UTC
I am having very similar issues and see similar errors in logs. The most recent error was:

kernel: amdgpu 0000:06:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1301 thread Xorg:cs0 pid 1362)
kernel: amdgpu 0000:06:00.0:   in page starting at address 0x0000800108a18000 from 27
kernel: amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031

The laptop is then unusable and requires a hard reboot.

Linux Mint 19.1
Kernel 5.1.0
AMD Ryzen PRO 2700U with Vega 10 graphics

Trying to load Cities: Skylines is a guaranteed crash.
Comment 8 Matt Coffin 2019-06-04 13:27:48 UTC
This is probably related to bug 102322, yes?
Comment 9 redrield 2019-07-29 00:59:02 UTC
Created attachment 144900 [details]
Thinkpad E585 log file with amdgpu errors

I'm running into an issue that I think is related to this. Attached is a journal file containing the traces from the last boot where it occurred. For some reason, it doesn't happen every time I try to resume from suspend, but when it does I have no choice but to hard reboot. This is a Thinkpad E585; uname -a reports "Linux thonkpad 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux"
Comment 10 Eugene Bright 2019-08-09 16:50:22 UTC
The patch is on its way
https://bugs.freedesktop.org/show_bug.cgi?id=110258#c12
Comment 11 jian-hong 2019-08-13 02:55:41 UTC
(In reply to Eugene Bright from comment #10)
> The patch is on its way
> https://bugs.freedesktop.org/show_bug.cgi?id=110258#c12

I tried the patch on top of Linux stable 5.2.8. It fixed this issue. Thank you so much!
Comment 12 Alex Deucher 2019-08-13 02:57:43 UTC

*** This bug has been marked as a duplicate of bug 110258 ***
Comment 13 darkshvein 2019-09-23 20:56:50 UTC
Hello.
Please explain:
why did everything work fine with my FX-8320 CPU,
but after upgrading to a Ryzen 5 1600 I see these OS freezes and this bug?

Could the PCIe generation be a cause? Planned obsolescence?
Or is it a coincidence with an amdgpu driver update?


part of my log:
[49266.138534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=5660155, emitted seq=5660157
[49266.138578] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Civ6Sub pid 1778 thread Civ6Sub:cs0 pid 1781
[49266.138580] [drm] GPU recovery disabled.
[49275.866518] INFO: task Xorg:sh1:1789 blocked for more than 122 seconds.
[49275.866521]       Tainted: G  R        O      5.2.10 #2

radeon 7970. 
mesa utils(8.4.0-1)
linux 5.2.10
amdgpu Version: 18.1.99+git20190207-1
