Bug 109022 - ring gfx timeout during particular shader generation on yuzu emulator
Summary: ring gfx timeout during particular shader generation on yuzu emulator
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Priority: medium
Severity: major
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-12 03:31 UTC by e88z4
Modified: 2019-09-25 18:37 UTC
CC: 2 users

See Also:


Attachments
freeze-screenshot (686.62 KB, image/jpeg), 2018-12-20 12:22 UTC, jdruel
dmesg (6.71 KB, text/plain), 2018-12-20 12:23 UTC, jdruel
apitrace log (377.21 KB, text/x-log), 2019-06-07 03:08 UTC, e88z4

Description e88z4 2018-12-12 03:31:03 UTC
How to reproduce the issue?
1. Use the latest yuzu emulator canary build.
2. Load Super Mario Odyssey.
3. Create a new game.
4. Wait until Mario jumps for the first time after being woken up by Cappy. 
5. Graphics crash, but the system is still accessible through SSH.

Error message:
[  540.189992] amdgpu 0000:23:00.0: GPU fault detected: 147 0x09088402 for process glretrace pid 2006 thread glretrace:cs0 pid 2007
[  540.190000] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00001121
[  540.190004] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E084002
[  540.190010] amdgpu 0000:23:00.0: VM fault (0x02, vmid 7, pasid 32771) at page 4385, read from 'TC7' (0x54433700) (132)
[  550.275970] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=20845, emitted seq=20848

System specification:
AMD Ryzen 5 2600
ASRock B450M Pro4 with latest BIOS update
16 GB DDR4 RAM
MSI RX580 Gaming X
Debian Sid 
Kernel 4.20-rc6
Mesa 18.3.1

This bug occurs with all combinations of kernels 4.17, 4.18, 4.19, 4.20, and drm-next-4.2.1-wip with Mesa 18.2, 18.3, and 19.0-devel.

I was able to capture an apitrace of the issue. It can be downloaded from my Google Drive:
https://drive.google.com/file/d/1MdR1c2g0pQ8ToTCAH53Jc9BUd3WFzbcg/view?usp=sharing

The bug can be reproduced very consistently by replaying the trace.
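
For reference, a typical apitrace capture and replay sequence looks roughly like this (the executable name and trace file name are illustrative, not the exact commands used here):

# capture a GL trace of the emulator session up to the crash point
apitrace trace --api gl yuzu
# replay the resulting trace; the VM fault shows up during replay
glretrace yuzu.trace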
Comment 1 e88z4 2018-12-12 03:51:09 UTC
A revision of my initial comment: I was able to reproduce the bug once while replaying the trace, and it generated the same error in the dmesg output, but I am currently no longer able to reproduce it by replaying the trace, which is very odd.
Comment 2 e88z4 2018-12-12 04:04:12 UTC
Out of 10 replays of the trace, I was able to reproduce the bug twice.

[  540.189992] amdgpu 0000:23:00.0: GPU fault detected: 147 0x09088402 for process glretrace pid 2006 thread glretrace:cs0 pid 2007
[  540.190000] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00001121
[  540.190004] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E084002
[  540.190010] amdgpu 0000:23:00.0: VM fault (0x02, vmid 7, pasid 32771) at page 4385, read from 'TC7' (0x54433700) (132)
[  550.275970] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=20845, emitted seq=20848

The dmesg error messages are identical when the bug is reproduced through the apitrace replay.
Comment 3 e88z4 2018-12-18 01:59:09 UTC
I did a little more troubleshooting on my apitrace.

Call #36840880 is the culprit of the VM page fault error. The GPU doesn't crash right away, but continuing another 100-200 calls past #36840880 can randomly trigger the ring gfx timeout error.
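
For anyone who wants to look at the same spot, the region around the suspect call can be inspected with apitrace's own tools, roughly as sketched below (call numbers are from my trace; the exact flag spellings should be checked against the local apitrace version):

# print the calls immediately surrounding the suspect one
apitrace dump --calls=36840800-36840900 yuzu.trace
# replay and dump the GL state at the suspect call itself
glretrace -D 36840880 yuzu.trace > call-36840880-state.json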
Comment 4 jdruel 2018-12-20 12:22:52 UTC
Created attachment 142865 [details]
freeze-screenshot
Comment 5 jdruel 2018-12-20 12:23:18 UTC
Created attachment 142866 [details]
dmesg
Comment 6 jdruel 2018-12-20 12:25:05 UTC
Same problem here on an R9 Fury, Mesa 18.3.
Complete freeze on various games in yuzu (easy to reproduce with Super Mario Odyssey). I attached a screenshot plus the dmesg log obtained via SSH.
Comment 7 e88z4 2018-12-20 14:08:17 UTC
I can reproduce the error during the FMV as well. You can bypass the FMV by skipping it before the error occurs (I can't remember how to bring up the skip option on screen). I am wondering whether you also get the same crash when Mario is about to jump for the first time after being woken up by Cappy.
Comment 8 e88z4 2019-01-29 04:06:03 UTC
Hi, 

This bug still occurs with kernel 5.0-rc4 and Mesa 19.0-devel (commit 932ed9c00b99e6ec92146ec9e820f546cf3e6551). There are now additional error logs in dmesg, shown below.

[ 1132.040752] amdgpu 0000:23:00.0: GPU fault detected: 147 0x0a808802 for process yuzu pid 19573 thread yuzu:cs0 pid 19577
[ 1132.040761] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0003F350
[ 1132.040766] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C088002
[ 1132.040773] amdgpu 0000:23:00.0: VM fault (0x02, vmid 6, pasid 32770) at page 258896, read from 'TC6' (0x54433600) (136)
[ 1142.103758] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=50481, emitted seq=50484
[ 1142.103820] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process yuzu pid 19573 thread yuzu:cs0 pid 19577
[ 1142.103826] amdgpu 0000:23:00.0: GPU reset begin!
[ 1149.487464] WARNING: CPU: 9 PID: 19117 at kernel/kthread.c:501 kthread_park+0x20/0x67
[ 1149.487467] Modules linked in: hid_switchcon hidp tda18271 s5h1411 rfcomm fuse cmac bnep binfmt_misc amdgpu mfd_core kvm irqbypass ghash_clmulni_intel btusb btrtl btbcm btintel bluetooth snd_hda_codec_realtek snd_hda_codec_generic snd_usb_audio snd_usbmidi_lib snd_hda_codec_hdmi sha256_generic snd_rawmidi snd_pcsp chash gpu_sched saa7164 xpad snd_hda_intel tveeprom drm_kms_helper joydev evdev aesni_intel syscopyarea dvb_core sysfillrect snd_hda_codec sysimgblt fb_sys_fops aes_x86_64 snd_hwdep ttm crypto_simd snd_hda_core cryptd ff_memless drbg glue_helper snd_pcm drm snd_timer i2c_algo_bit videodev snd ecdh_generic sp5100_tco soundcore k10temp media sg 8250 8250_base serial_core button acpi_cpufreq nct6775 hwmon_vid autofs4 ext4 crc16 mbcache jbd2 sd_mod hid_generic usbhid hid i2c_piix4 ahci xhci_pci crc32c_intel i2c_core libahci xhci_hcd r8169 libata realtek libphy usbcore scsi_mod usb_common
[ 1149.487525] CPU: 9 PID: 19117 Comm: MemoryInfra Not tainted 5.0.0-rc4-htpc+ #3
[ 1149.487526] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P2.00 12/18/2018
[ 1149.487530] RIP: 0010:kthread_park+0x20/0x67
[ 1149.487533] Code: e8 09 ff ff ff 48 89 c7 eb 8f 55 53 e8 fd fe ff ff f6 47 24 04 74 09 0f 0b b8 da ff ff ff eb 4e 48 89 c3 48 8b 00 a8 04 74 09 <0f> 0b b8 f0 ff ff ff eb 3b 48 89 fd f0 80 0b 04 65 48 8b 04 25 40
[ 1149.487534] RSP: 0018:ffffc900030d7bc0 EFLAGS: 00010202
[ 1149.487537] RAX: 0000000000000004 RBX: ffff88840868f680 RCX: 0000000000000000
[ 1149.487538] RDX: ffff8883d0794330 RSI: ffff88827243f330 RDI: ffff88840732ee80
[ 1149.487540] RBP: ffff888407ad7398 R08: 0000000000000001 R09: ffff888288facd80
[ 1149.487541] R10: 0000000000000040 R11: ffffc900030d7bc0 R12: 000000000000000d
[ 1149.487543] R13: 0000000000000006 R14: ffff88840b01d000 R15: ffff88840b01d0f0
[ 1149.487545] FS:  00007f96304a6700(0000) GS:ffff88841ee40000(0000) knlGS:0000000000000000
[ 1149.487547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1149.487548] CR2: 00007f229032b000 CR3: 0000000407b90000 CR4: 00000000003406e0
[ 1149.487549] Call Trace:
[ 1149.487559]  drm_sched_entity_fini+0x4c/0xeb [gpu_sched]
[ 1149.487612]  amdgpu_ctx_mgr_entity_fini+0x80/0xa8 [amdgpu]
[ 1149.487662]  amdgpu_ctx_mgr_fini+0x22/0x8b [amdgpu]
[ 1149.487706]  amdgpu_driver_postclose_kms+0x188/0x242 [amdgpu]
[ 1149.487717]  drm_file_free+0x1e6/0x22d [drm]
[ 1149.487728]  drm_release+0x75/0xdf [drm]
[ 1149.487733]  __fput+0xe9/0x18f
[ 1149.487737]  task_work_run+0x68/0x7c
[ 1149.487741]  do_exit+0x450/0x917
[ 1149.487744]  do_group_exit+0x95/0x95
[ 1149.487748]  get_signal+0x439/0x461
[ 1149.487752]  do_signal+0x31/0x4cc
[ 1149.487757]  prepare_exit_to_usermode+0x8b/0xf5
[ 1149.487762]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1149.487764] RIP: 0033:0x7f964651ffac
[ 1149.487772] Code: Bad RIP value.
[ 1149.487773] RSP: 002b:00007f96304a5640 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 1149.487776] RAX: fffffffffffffe00 RBX: 00007f96304a57b0 RCX: 00007f964651ffac
[ 1149.487777] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f96304a57d8
[ 1149.487778] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fff49705080
[ 1149.487779] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1149.487781] R13: 00007f96304a5788 R14: 0000000000000000 R15: 00007f96304a57d8
[ 1149.487784] ---[ end trace 540c7632a8289097 ]---
[ 1152.333861] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out


If you need more clarification or detail, please let me know.
Comment 9 e88z4 2019-02-02 00:48:10 UTC
I tried enabling NIR using R600_DEBUG=nir and got different behaviour on the gfx timeout:
[Feb 1 19:44] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Feb 1 19:45] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240145] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240170] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.239953] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240061] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240340] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Feb 1 19:46] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240003] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240173] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240074] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.239890] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I get a looping error message like the above, but I can kill the application from SSH. The display is usable again afterwards; no hard reset is required.
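
For reference, the NIR path above was enabled roughly like this (the radeonsi debug variable has been renamed across Mesa releases, so treat this as a sketch):

# force the radeonsi NIR compiler path for a single run
R600_DEBUG=nir yuzu
# newer Mesa releases expose the same debug flags under AMD_DEBUG
AMD_DEBUG=nir yuzu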
Comment 10 saliksaeed@ymail.com (Spammer; Account disabled) 2019-03-06 08:36:40 UTC
*** Bug 109870 has been marked as a duplicate of this bug. ***
Comment 11 glencoesmith 2019-03-25 13:44:17 UTC
Can you reproduce the bug on 18.3.5?
Comment 12 e88z4 2019-03-25 13:48:12 UTC
I have been building Mesa master weekly since November 2018. This bug can be reproduced with the latest Mesa from the git master branch.

Do you want me to try to build 18.3.5 and reproduce it?
Comment 13 glencoesmith 2019-03-25 14:48:28 UTC
(In reply to e88z4 from comment #12)
> I usually build mesa master weekly since November 2018. This bug can be
> reproduced with latest Mesa from git master branch. 
> 
> Do you want me to try to build 18.3.5 and reproduce it?

No, I was just curious if the bug showed up in 18.3.5.
Comment 14 e88z4 2019-05-26 23:19:19 UTC
I tried to reproduce this bug again with the latest Mesa master (659aa3dd651). My kernel is 5.1.3 and yuzu-canary #2318. The behaviour of the softlock has changed slightly: yuzu no longer crashes, but rendering is extremely slow. I was able to kill the application without using SSH. The following errors were produced in dmesg.

[ 5179.131132] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5183.739102] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5189.883024] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5193.979037] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5204.218939] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5214.458837] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5224.698730] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5229.818653] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5234.938622] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5234.938680] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5241.850516] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5245.178534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5251.322426] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5255.418427] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Comment 15 Timothy Arceri 2019-05-27 05:19:48 UTC
Maybe related to 110472
Comment 16 Timothy Arceri 2019-05-27 05:20:11 UTC
(In reply to Timothy Arceri from comment #15)
> Maybe related to 110472

Sorry to bug #110472
Comment 17 glencoesmith 2019-05-27 05:23:24 UTC
(In reply to Timothy Arceri from comment #16)
> (In reply to Timothy Arceri from comment #15)
> > Maybe related to 110472
> 
> Sorry to bug #110472


It definitely is; it occurs with the same drivers and the exact same program, and causes the same crash.
Comment 18 glencoesmith 2019-05-27 05:28:57 UTC
(In reply to e88z4 from comment #14)
> I tried to reproduce this bug again with latest mesa master (659aa3dd651).
> My kernel is 5.1.3 and yuzu-canary #2318. The behaviour of the softlock is
> slightly changed. Yuzu didn't crash anymore but rendering was super slow. I
> was able to kill the application without using SSH. The following error was
> produced from dmesg.
> 
> [ 5179.131132] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting
> for fences timed out.
> [ 5183.739102] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5189.883024] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting
> for fences timed out.
> [ 5193.979037] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5204.218939] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5214.458837] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5224.698730] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5229.818653] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting
> for fences timed out.
> [ 5234.938622] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting
> for fences timed out.
> [ 5234.938680] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5241.850516] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting
> for fences timed out.
> [ 5245.178534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered
> [ 5251.322426] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting
> for fences timed out.
> [ 5255.418427] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> but soft recovered



Yay, progress
Comment 19 glencoesmith 2019-05-27 05:31:13 UTC
https://github.com/yuzu-emu/yuzu/issues/2523

WOOOOOOO!

This is getting a Yuzu side fix!
Comment 20 e88z4 2019-05-27 23:14:14 UTC
Hi Timothy, 

I have tried your suggestion from the other bug report, https://bugs.freedesktop.org/show_bug.cgi?id=110472

You suggested adding the environment variable allow_glsl_extension_directive_midshader=true. This does not resolve the issue.
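
For reference, a sketch of how that option can be applied to a single run (the executable name is illustrative; the option can also be set through a per-application entry in ~/.drirc):

# Mesa driconf option passed as an environment variable for one run
allow_glsl_extension_directive_midshader=true yuzu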

I also noticed that the yuzu developers followed your suggestion from the ticket https://github.com/yuzu-emu/yuzu/issues/2523
I compiled the latest yuzu source with that change merged, but the issue is still not resolved. One thing I noticed is that the "ring gfx timeout" error message is no longer being produced; instead, the error messages below appear continuously until I kill the application.


May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: amdgpu_cs_query_fence_status failed.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: amdgpu_cs_query_fence_status failed.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:25 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:25 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.

I will produce the apitrace shortly and post it to a shared location.
Comment 21 e88z4 2019-05-27 23:41:08 UTC
Hi, 

I uploaded the apitrace to my Google Drive:

https://drive.google.com/file/d/1hg1ep4rJ2Y_g7WmA_HbmNndrHd-x7Etc/view?usp=sharing

When I replayed the trace, I was able to reproduce the error.

My system is 
AMD Ryzen 2600
Radeon RX580
Mesa master 659aa3dd651
Kernel 5.1.3

Let me know if you need anything else.
Comment 22 e88z4 2019-06-07 03:08:51 UTC
Created attachment 144472 [details]
apitrace log

Hi, 

I attached the apitrace log that was written to standard output while replaying the apitrace file from my Google Drive.
Comment 23 e88z4 2019-07-31 20:16:19 UTC
Hi, 

Is there an update on this ticket? The issue can still be easily reproduced with the latest Mesa on the radeonsi driver.

System information
Linux 5.2.2
Mesa-master 
Radeon RX 580 Series (POLARIS10, DRM 3.32.0, 5.2.2-htpc, LLVM 10.0.0)


I have tried different LLVM backend versions (llvm-8, llvm-9, and llvm-10); they all produce the same error. Could this be a bug in the AMD LLVM backend?
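
As a sanity check that the intended Mesa build and LLVM version are actually picked up at runtime, something like the following can be used (glxinfo comes from mesa-utils; the grep pattern is illustrative):

# report the renderer string, which includes the Mesa and LLVM versions in use
glxinfo -B | grep -E "OpenGL (renderer|core profile version) string"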

I have previously provided the apitrace; I hope it can be used to debug this issue.


Below is the latest error output from today; it is the same error as before.

[Jul31 16:10] amdgpu 0000:23:00.0: GPU fault detected: 147 0x0d604401 for process yuzu pid 15061 thread yuzu:cs0 pid 15068
[  +0.000004] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0C03F7AC
[  +0.000001] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E044001
[  +0.000002] amdgpu 0000:23:00.0: VM fault (0x01, vmid 7, pasid 32770) at page 201586604, read from 'TC5' (0x54433500) (68)
[  +5.312442] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  +4.874028] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=24383, emitted seq=24385
[  +0.000066] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process yuzu pid 15061 thread yuzu:cs0 pid 15068
[  +0.000006] amdgpu 0000:23:00.0: GPU reset begin!
[  +0.584424] cp is busy, skip halt cp
[  +0.328786] rlc is busy, skip halt rlc
[  +0.001031] amdgpu 0000:23:00.0: GPU pci config reset
[  +0.463461] amdgpu 0000:23:00.0: GPU reset succeeded, trying to resume
[  +0.002152] [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
[  +0.000016] [drm] VRAM is lost due to GPU reset!
[  +0.087583] [drm] UVD and UVD ENC initialized successfully.
[  +0.099970] [drm] VCE initialized successfully.
[  +0.018504] amdgpu 0000:23:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[  +0.003336] amdgpu 0000:23:00.0: ib ring test failed (-110).
Comment 24 GitLab Migration User 2019-09-25 18:37:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1348.

