Bug 108992 - Regression: Lenovo e585 (ryzen 2500u) freezes during boot with 4.20-rc5/rc6, amdgpu error
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 109200
Depends on:
Blocks:
 
Reported: 2018-12-09 17:42 UTC by chris
Modified: 2019-01-19 17:43 UTC (History)
6 users (show)

See Also:
i915 platform:
i915 features:


Attachments
amdgpu error message (11.46 KB, text/plain)
2018-12-09 17:42 UTC, chris
full kernel log (3.21 MB, text/plain)
2018-12-31 21:06 UTC, Zheng Luo
journalctl -b of lockup from bisected commit (571.99 KB, text/plain)
2019-01-04 05:08 UTC, tones111

Description chris 2018-12-09 17:42:17 UTC
Created attachment 142765 [details]
amdgpu error message

Hi,

I upgraded from mainline kernel 4.19.7 to 4.20-rc5.
Unfortunately, with that kernel the system freezes when it tries to show GDM.

OS: Ubuntu 18.04.1

Kernel:
Linux version 4.20.0-042000rc5-generic (kernel@gloin) (gcc version 8.2.0 (Ubuntu 8.2.0-10ubuntu1)) #201812030721 SMP Mon Dec 3 12:23:24 UTC 2018

Command line: BOOT_IMAGE=/boot/vmlinuz-4.20.0-042000rc5-generic root=UUID=1381a98d-77fd-481f-9cdb-115b30829bd8 ro ivrs_ioapic[32]=00:14.0 ivrs_ioapic[33]=00:00.1 vt.handoff=1

Mesa is at version 18.2.2 (X-Swat ppa)

Firmware files:
ll /lib/firmware/amdgpu/rav*
-rw-r--r-- 1 root root  33280 Nov  6 21:32 /lib/firmware/amdgpu/raven_asd.bin
-rw-r--r-- 1 root root   9344 Nov  6 21:32 /lib/firmware/amdgpu/raven_ce.bin
-rw-r--r-- 1 root root    316 Apr 25  2018 /lib/firmware/amdgpu/raven_gpu_info.bin
-rw-r--r-- 1 root root  17536 Nov  6 21:32 /lib/firmware/amdgpu/raven_me.bin
-rw-r--r-- 1 root root 263808 Nov  6 21:32 /lib/firmware/amdgpu/raven_mec2.bin
-rw-r--r-- 1 root root 263808 Nov  6 21:32 /lib/firmware/amdgpu/raven_mec.bin
-rw-r--r-- 1 root root  21632 Nov  6 21:32 /lib/firmware/amdgpu/raven_pfp.bin
-rw-r--r-- 1 root root  26948 Nov  6 21:32 /lib/firmware/amdgpu/raven_rlc.bin
-rw-r--r-- 1 root root  17408 Nov  6 21:32 /lib/firmware/amdgpu/raven_sdma.bin
-rw-r--r-- 1 root root 341728 Apr 25  2018 /lib/firmware/amdgpu/raven_vcn.bin

christian@christian-ThinkPad-E585:~$ apt-cache show linux-firmware
Package: linux-firmware
Architecture: all
Version: 1.173.2

Error-Log from journalctl:

Dez 09 16:26:20 christian-ThinkPad-E585 set-cpufreq[874]: Setting ondemand scheduler for all CPUs
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: gmc_v9_0_process_interrupt: 28 callbacks suppressed
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: amdgpu 0000:05:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32768, for process gnome-shell pid 1102 thread g
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: amdgpu 0000:05:00.0:   in page starting at address 0x0000800100020000 from 18
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: amdgpu 0000:05:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010013C
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: amdgpu 0000:05:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32768, for process gnome-shell pid 1102 thread g
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: amdgpu 0000:05:00.0:   in page starting at address 0x0000800100020000 from 18
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: amdgpu 0000:05:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010013C
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: [Hardware Error]: Deferred error, no action required.
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: [Hardware Error]: CPU:0 (17:11:0) MC20_STATUS[-|-|MiscV|-|AddrV|Deferred|-|SyndV
Dez 09 16:26:20 christian-ThinkPad-E585 systemd-journald[378]: Missed 68239 kernel messages
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: [Hardware Error]: Deferred error, no action required.
Dez 09 16:26:20 christian-ThinkPad-E585 systemd-journald[378]: Missed 6630 kernel messages
Dez 09 16:26:20 christian-ThinkPad-E585 kernel: [Hardware Error]: Coherent Slave Extended Error Code: 1
Dez 09 16:26:20 christian-ThinkPad-E585 systemd-journald[378]: Missed 7875 kernel messages

I attached a .txt file showing more of the error messages.
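
The "Missed N kernel messages" lines in the excerpt above give a rough idea of how much of the kernel log was lost during the fault storm. As an illustration only (the helper name `count_missed` is made up here and is not part of any existing tool), the dropped-message counts can be totalled like this:

```python
import re

# journald's notice that kernel messages were lost before it could read them.
MISSED_RE = re.compile(r"Missed (\d+) kernel messages")

def count_missed(lines):
    """Sum the counts from every 'Missed N kernel messages' line."""
    total = 0
    for line in lines:
        match = MISSED_RE.search(line)
        if match:
            total += int(match.group(1))
    return total

# The three notices from the journalctl excerpt above.
excerpt = [
    "Dez 09 16:26:20 christian-ThinkPad-E585 systemd-journald[378]: Missed 68239 kernel messages",
    "Dez 09 16:26:20 christian-ThinkPad-E585 systemd-journald[378]: Missed 6630 kernel messages",
    "Dez 09 16:26:20 christian-ThinkPad-E585 systemd-journald[378]: Missed 7875 kernel messages",
]

print(count_missed(excerpt))  # 82744
```

So the short excerpt alone accounts for more than 80,000 dropped kernel messages within the same second.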

I have also seen freezes on 4.19.7 with a similar error message, but only very rarely. With 4.20-rc5 the issue happens every time gdm tries to start, which makes the system unusable.

If you need any other info, please ping me.

Many thanks !
Christian
Comment 1 chris 2018-12-10 14:21:34 UTC
Hi,

tested with the newly released rc6, same issue.

Many thanks !
Christian
Comment 2 Brian Schott 2018-12-10 23:25:13 UTC
I have the same issue with a 2700U in a Dell Inspiron 7375. All of the 4.20 RC versions that I have tried show the same problem. The system is able to boot with a 4.19 kernel.
Comment 3 Brian Schott 2018-12-13 12:12:20 UTC
The issue is still present in kernel 4.20.0-rc6-next-20181213.
Comment 4 Alex Deucher 2018-12-13 15:34:42 UTC
Can you boot the system without amdgpu loaded (e.g., append modprobe.blacklist=amdgpu)?  Or is this a general platform problem?
Comment 5 chris 2018-12-13 17:31:10 UTC
> Can you boot the system without amdgpu loaded (e.g., append
> modprobe.blacklist=amdgpu)?

Yes, with amdgpu blacklisted I am able to boot my system.
Comment 6 Brian Schott 2018-12-14 07:33:45 UTC
To clarify, the system can boot with the amdgpu module, but it will lock up when LightDM/X starts. Booting with the amdgpu module blacklisted works.
Comment 7 chris 2018-12-14 07:43:08 UTC
Yes, same here. The system boots until GDM wants to start, then it freezes with the mentioned amdgpu error. Disabling amdgpu lets the system start up completely, including gdm.
Comment 8 Alex Deucher 2018-12-14 19:32:38 UTC
Can you bisect?
Comment 9 Brian Schott 2018-12-15 04:05:43 UTC
020aa2ec15fc4a5ffdfcab7dc0db648a137abc41 lets me log in before the system freezes.

770af5859d6903049b7f39ed4f4e6612b63fd82d locks up before LightDM can start.

I'll do a bit more testing.
Comment 10 Brian Schott 2018-12-15 04:28:34 UTC
Ignore that previous comment. I'm getting some strange results here and may have marked a commit with an intermittent crash as "good" while bisecting.
Comment 11 Brian Schott 2018-12-16 03:31:12 UTC
"bc537a9cc47eec7f4e32b8164c494ddc35dca8ac is the first bad commit"

Well, that's kind of useless. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=bc537a9cc47eec7f4e32b8164c494ddc35dca8ac

Any suggestions on how to get a better idea of where the break was?
Comment 12 Michel Dänzer 2018-12-17 10:06:01 UTC
Make sure you've tested a commit plenty before declaring it "good".
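
To put a rough number on that advice: if a bad commit only crashes on a fraction p of boots, then n clean boots in a row still happen with probability (1 - p)^n, and the commit gets wrongly marked "good". A small sketch of that arithmetic, assuming boots are independent (the crash rate used here is invented for illustration, not measured on this hardware):

```python
import math

def false_good_probability(crash_rate, boots):
    """Chance that a bad commit survives `boots` independent test boots."""
    return (1.0 - crash_rate) ** boots

def boots_needed(crash_rate, max_false_good):
    """Smallest number of clean boots that pushes the false-good chance
    down to at most `max_false_good`. Requires 0 < crash_rate < 1."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - crash_rate))

# Hypothetical case: the bad commit crashes on half of all boots.
print(false_good_probability(0.5, 3))  # 0.125
print(boots_needed(0.5, 0.05))         # 5
```

With an intermittent crash like that, three clean boots would still mislabel a bad commit about one time in eight, which is enough to send a bisect down the wrong branch.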
Comment 13 Brian Ealdwine 2018-12-21 02:28:30 UTC
FYI, as a workaround, you can use the kernel opt:

 iommu=pt

..at least, on 4.20 rc7, which is the only one I've tried that on, but it should work with others.
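
For anyone trying this: the option goes on the kernel command line (the same line shown in the original report), and after a reboot /proc/cmdline shows what the running kernel was actually given. A minimal sketch for checking that, where `iommu_options` is a made-up helper name and the sample string is the command line from the description with iommu=pt appended:

```python
def iommu_options(cmdline):
    """Return the values of every iommu= option on a kernel command line."""
    prefix = "iommu="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]

# Command line from the description, with the workaround appended.
sample = (
    "BOOT_IMAGE=/boot/vmlinuz-4.20.0-042000rc5-generic "
    "root=UUID=1381a98d-77fd-481f-9cdb-115b30829bd8 ro "
    "ivrs_ioapic[32]=00:14.0 ivrs_ioapic[33]=00:00.1 vt.handoff=1 iommu=pt"
)
print(iommu_options(sample))  # ['pt']

# On a live system, read the real command line instead:
# with open("/proc/cmdline") as f:
#     print(iommu_options(f.read()))
```

An empty list means no iommu= option reached the kernel, so the workaround is not active.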
Comment 14 chris 2018-12-21 06:38:29 UTC
I can confirm that the iommu=pt workaround works, also iommu=soft works to get gdm started and use the laptop. Sadly i have no idea what impact those workarounds have when it comes to performance of the gpu/cpu or battery lifetime ?
Comment 15 chris 2018-12-21 07:31:55 UTC
(In reply to chris from comment #14)
> I can confirm that the iommu=pt workaround works, also iommu=soft works to
> get gdm started and use the laptop. Sadly i have no idea what impact those
> workarounds have when it comes to performance of the gpu/cpu or battery
> lifetime ?

Unfortunately, I had a freeze during desktop use shortly after boot with iommu=pt.
The driver situation for Raven Ridge is really sad at the moment :(
Comment 16 Brian Schott 2018-12-22 07:14:47 UTC
I've tested a next-20181221 kernel with IOMMU_DEFAULT_PASSTHROUGH set, and I'm able to get the system to start properly. I'm still seeing some system lockups when playing games, but it's better than crashing on the login screen.
Comment 17 chris 2018-12-29 11:43:47 UTC
Hi,

the laptop is still freezing when trying to start with kernel 4.20 (release version) using latest amdgpu firmware from kernel firmware git.

Using iommu=soft still solves that issue.

I also tested with a kernel daily build from 26.12 which should include the latest drm changes, and it also shows the same issue.

Is there anything we can provide to help find the root cause?

Many thanks !
Christian
Comment 18 Zheng Luo 2018-12-31 21:05:21 UTC
*** Bug 109200 has been marked as a duplicate of this bug. ***
Comment 19 Zheng Luo 2018-12-31 21:06:54 UTC
Created attachment 142928 [details]
full kernel log
Comment 20 Ian Kidd 2019-01-01 20:14:03 UTC
Seeing same issue with Dell 5575 (AMD 2500u, Vega mobile) on 4.20 Release.  iommu=soft seems to allow boot.

Kernel Log: https://gist.github.com/ikidd/692dea4c63cc7656247071322d066405
Comment 21 Zheng Luo 2019-01-03 00:17:53 UTC
With iommu=soft I still occasionally get a frozen screen, with the following logs:

Jan 02 16:11:18 lzThinkpad gnome-shell[1647]: Failed to flip: Cannot allocate memory
Jan 02 16:11:18 lzThinkpad kernel: amdgpu 0000:05:00.0: 00000000a2e0b642 pin failed
Jan 02 16:11:18 lzThinkpad kernel: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Comment 22 Sergio Perez 2019-01-03 14:16:18 UTC
I would like to add that on my Lenovo E585 iommu=pt works reliably, even for hours of games and web videos. But a few minutes in Wayland produces a frozen screen (without iommu=pt it does not even start).
Comment 23 Alex Deucher 2019-01-03 16:12:52 UTC
Can anyone else try and bisect?
Comment 24 Chí-Thanh Christopher Nguyễn 2019-01-03 16:42:22 UTC
No problem here with amdgpu and iommu enabled, running kernel 4.20.0 on Dell Latitude 5495 (2700U). So BIOS issue maybe?

iommu=pt is however still needed for kfd (bug 107898).
Comment 25 tones111 2019-01-04 05:08:01 UTC
Created attachment 142974 [details]
journalctl -b of lockup from bisected commit

E585 owner here.  Please let me know if I can provide any additional information that would be helpful.  Thanks in advance for your help.

This problem was very consistently reproduced during the bisect.  I've attached a journalctl -b from the first bad commit.  I was able to bisect the problem to...


284dec4317c8e76f45d3ce922f673c80331812f1 is the first bad commit
commit 284dec4317c8e76f45d3ce922f673c80331812f1
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Aug 22 16:44:56 2018 +0200

    drm/amdgpu: enable GTT PD/PT for raven v3

    Should work on Vega10 as well, but with an obvious performance hit.

    Older APUs can be enabled as well, but will probably be more work.

    v2: fix error checking
    v3: use more general check

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Comment 26 chris 2019-01-04 11:04:53 UTC
Hi,

many thanks for that bisect.
I searched for the commit and also found the following report, which seems to be the same issue:

https://bugzilla.kernel.org/show_bug.cgi?id=201727

Hope that helps.

Many thanks !
Christian
Comment 27 chris 2019-01-18 20:28:08 UTC
Still the same issue with kernel 5.0-rc1. Is there any plan for when this issue will be tackled?
Comment 29 tones111 2019-01-19 16:14:28 UTC
I'm able to boot when building from that commit (1c1eba8) and looks like it will land in 4.20.4.

Thanks!
Comment 30 chris 2019-01-19 17:43:10 UTC
Very nice. Just tried 5.0-rc2 and booting works fine now without the iommu workaround !

