Bug 105760 - [4.17-rc1] RIP: smu7_populate_single_firmware_entry.isra.6+0x57/0xc0 [amdgpu] RSP: ffffa17901efb930
Summary: [4.17-rc1] RIP: smu7_populate_single_firmware_entry.isra.6+0x57/0xc0 [amdgpu]...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high critical
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords: regression
Duplicates: 106402 106513
Depends on:
Blocks:
 
Reported: 2018-03-27 08:58 UTC by taijian
Modified: 2018-09-10 03:43 UTC

See Also:
i915 platform:
i915 features:


Attachments
recovered journal of boot attempt (112.15 KB, text/plain)
2018-03-27 08:58 UTC, taijian
recovered journal of boot attempt (106.68 KB, text/plain)
2018-04-17 23:08 UTC, taijian
dmesg after resume (76.13 KB, text/plain)
2018-05-21 21:33 UTC, Thomas Martitz
attachment-2556-0.html (1.66 KB, text/html)
2018-06-25 05:37 UTC, Mathieu.Dutour@gmail.com
dmesg after resume (76.17 KB, text/plain)
2018-06-27 07:50 UTC, Thomas Martitz
workaround (3.79 KB, patch)
2018-07-11 14:31 UTC, Thomas Martitz
possible fix 1/4 (1.73 KB, patch)
2018-07-11 19:21 UTC, Alex Deucher
possible fix 2/4 (1.76 KB, patch)
2018-07-11 19:21 UTC, Alex Deucher
possible fix 3/4 (5.06 KB, patch)
2018-07-11 19:22 UTC, Alex Deucher
possible fix 4/4 (1.54 KB, patch)
2018-07-11 19:22 UTC, Alex Deucher
possible fix (6.94 KB, patch)
2018-07-11 22:01 UTC, Alex Deucher
possible fix (7.01 KB, patch)
2018-07-12 05:41 UTC, Alex Deucher
dmesg with 0001-drm-amdgpu-pp-smu7-cache-smu-firmware-toc.patch (105.95 KB, text/plain)
2018-07-12 08:00 UTC, Thomas Martitz
use gtt for firmware buffers (1.31 KB, patch)
2018-07-12 13:17 UTC, Alex Deucher
workaround without memcpy (3.74 KB, patch)
2018-07-12 13:17 UTC, Thomas Martitz
dmesg with 0001-workaround-v2.patch (79.60 KB, text/plain)
2018-07-12 13:19 UTC, Thomas Martitz
possible fix (1023 bytes, patch)
2018-07-12 13:28 UTC, Alex Deucher
dmesg with 0001-workaround-v2.patch + 0001-drm-amdgpu-add-ATPX-quirk-for-a-polaris-12-laptop.patch (125.50 KB, text/plain)
2018-07-12 19:38 UTC, Thomas Martitz
possible fix (556 bytes, patch)
2018-07-16 17:12 UTC, Alex Deucher
dmesg with force_asic_init.diff + 0001-workaround-v2.patch (80.98 KB, text/plain)
2018-07-17 06:28 UTC, Thomas Martitz
dmesg + Karols hack (79.83 KB, text/plain)
2018-07-26 22:32 UTC, Thomas Martitz
fixed hack.patch (1.44 KB, patch)
2018-07-26 22:33 UTC, Thomas Martitz
acpidump (1.29 MB, text/plain)
2018-08-30 06:57 UTC, Thomas Martitz
lspci (393.61 KB, text/plain)
2018-08-30 06:59 UTC, Thomas Martitz

Description taijian 2018-03-27 08:58:37 UTC
Created attachment 138374 [details]
recovered journal of boot attempt

I am trying out the linux-4.17-drm-next kernel line from https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.17-wip. With the latest build (commit 576e538e5fe6ac103cde6b269c6210985b026689), my system no longer boots to the graphical target and instead hard freezes after loading the initramfs. A recovered journal is attached.
Comment 1 taijian 2018-03-30 16:04:44 UTC
OK, I think I've managed to narrow this one down a bit.

If I build the kernel from commit 09695ad78f1f5f315c7e9c5090f0c7b846a43690, which is also tagged as 'drm-next-4.17', then everything is shiny. However, if I go one step beyond that, to commit 33d009cd889490838c5db9b9339856c9e3d3facc (the rebase of drm-next-4.17-wip onto David Airlie's drm-next branch after he merged AMD's drm-next-4.17 into it), then things go belly up and the kernel does not boot anymore.

Now, what I do not get is how the rebase to a tree that includes stuff that is not amdgpu would bork up the ability of amdgpu to load its firmware?
Comment 2 taijian 2018-04-16 09:06:37 UTC
After upgrading my testing kernel to 4.17-rc1, the problem still persists and the system remains unbootable.
Comment 3 Alex Deucher 2018-04-16 14:48:46 UTC
Is the driver built as a module or built into the kernel?
Comment 4 taijian 2018-04-17 07:58:29 UTC
It is built as a module and then embedded in the initramfs.
Comment 5 taijian 2018-04-17 07:58:56 UTC
Assuming that you mean amdgpu.
Comment 6 taijian 2018-04-17 10:16:17 UTC
If I wanted to try to embed amdgpu in the kernel for testing, how would I even go about doing that? Simply editing my config file from =m to =y does not seem to do anything.
Comment 7 taijian 2018-04-17 23:08:49 UTC
Created attachment 138890 [details]
recovered journal of boot attempt

OK, trying out the latest git code from drm-next-4.18-wip up to and including commit 37d6cbfb550ebde65ec12291ec9ec03f87cd0aff, we seem to be getting a step further in the boot process. Now the initramfs seems to hand over fine to GDM, I can select my user entry and enter my login password. However, the screen then freezes upon trying to start the user session (Xorg, haven't tried Wayland so far). Error messages look very similar to before.
Comment 8 taijian 2018-04-24 09:13:22 UTC
OK, the issue still persists with 4.17rc2. Same as before, I can boot into cli but trying to start X results in a hung system because X cannot access the dGPU. 
For reference, my firmware is current as of 

  Qs linux-firmware
  local/linux-firmware 20180402.8c1e439-1
Comment 9 Alex Deucher 2018-05-14 14:37:34 UTC
*** Bug 106513 has been marked as a duplicate of this bug. ***
Comment 10 taijian 2018-05-21 20:02:11 UTC
This seems to be fixed on the current drm-next-4.18-wip branch.
Comment 11 taijian 2018-05-21 20:11:18 UTC
*** Bug 106402 has been marked as a duplicate of this bug. ***
Comment 12 Thomas Martitz 2018-05-21 21:33:55 UTC
Created attachment 139668 [details]
dmesg after resume

I still get the backtrace on drm-next-4.18-wip, unfortunately.

(note that I have also cherry-picked the last patch from https://bugzilla.kernel.org/show_bug.cgi?id=199693 to make resume work at all)

On a side note, I also see that lspci hangs when running this kernel.

System: Arch Linux
Intel Kaby Lake Refresh (i7-8550U), Intel UHD 620 + Radeon Pro WX 3100.
Comment 13 taijian 2018-06-24 22:30:56 UTC
Have you tried again on any of the 4.18 RCs? I am currently testing 4.18-rc2, and although I have some other bug there, this one seems to be gone for me.
Comment 14 Mathieu.Dutour@gmail.com 2018-06-25 05:37:48 UTC
Created attachment 140312 [details]
attachment-2556-0.html

Thanks, but no time for that.
Reverted to 17.10
Comment 15 Thomas Martitz 2018-06-27 07:50:05 UTC
Still happens for me on 4.18-rc2.
Comment 16 Thomas Martitz 2018-06-27 07:50:52 UTC
Created attachment 140356 [details]
dmesg after resume
Comment 17 taijian 2018-06-28 07:25:18 UTC
OK, over at bug 107045 we agreed I'd start bisecting this starting from 4.16. I'll report back once I find something, but it'll be a while...
Comment 18 Thomas Martitz 2018-06-28 07:51:08 UTC
Uhm, I can reproduce this problem also in 4.14 LTS, which prevented me from bisecting myself.
Comment 19 Thomas Martitz 2018-06-29 19:08:05 UTC
I tried https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.19-wip but the issue remains
Comment 20 taijian 2018-06-29 22:14:07 UTC
Yeah, screw this. 

I tried again, but because there are several different bugs interacting and screwing up the boot process, I really can't seem to be able to figure out which one exactly is borking up which build. 

I've been waiting for more than a year to be able to use my laptop the way it was meant to be, and I'm now ready to declare that I'm never again buying a piece of hardware that hasn't already been confirmed to work with Linux.
Comment 21 Thomas Martitz 2018-07-11 14:31:12 UTC
Created attachment 140560 [details] [review]
workaround

The attached workaround makes resuming generally work on my system. The problem seems to be that the memory that smu_data->header is pointing to changes behind the code's back. (I used printk inside smu7_populate_single_firmware_entry() and saw that the pointer (&toc->entry[0]) passed to it differs wildly from what the caller sees, so I think the structure pointed to by toc gets overwritten.)

I assume this is some mapped memory and some HW component in the GPU writes to it while the CPU is using it, isn't it? If so, the proper fix would be to prevent that but I don't know what's the proper way of doing it in this context.

I hope the experts can take a look into the patch for more insight and a real fix.
Comment 22 Alex Deucher 2018-07-11 19:21:28 UTC
Created attachment 140573 [details] [review]
possible fix 1/4

Does this patch set help?
Comment 23 Alex Deucher 2018-07-11 19:21:50 UTC
Created attachment 140574 [details] [review]
possible fix 2/4
Comment 24 Alex Deucher 2018-07-11 19:22:13 UTC
Created attachment 140575 [details] [review]
possible fix 3/4
Comment 25 Alex Deucher 2018-07-11 19:22:31 UTC
Created attachment 140576 [details] [review]
possible fix 4/4
Comment 26 Thomas Martitz 2018-07-11 20:20:51 UTC
No, unfortunately it doesn't seem to have an effect. I still run into the same oops, and the printk's I added indicate the same problem (the entry pointer passed to smu7_populate_single_firmware_entry() is busted, but fine just before the call).

[   40.676302] PM: suspend exit
[   40.686230] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   40.744451] smu7_request_smu_load_fw 10 0000000095bc3514 10 000000008bdf55bd 0
[   40.744461] smu7_populate_single_firmware_entry 10 10 00000000348c95be 000000008bdf55bd
[   40.744467] smu7_populate_single_firmware_entry 20 10 00000000348c95be 000000008bdf55bd
[   40.744478] BUG: unable to handle kernel paging request at ffffb09e2045efec
[   40.744482] PGD 266d39067 P4D 266d39067 PUD 0 
[   40.744490] Oops: 0002 [#1] PREEMPT SMP PTI
[   40.744497] CPU: 6 PID: 219 Comm: kworker/6:2 Tainted: G     U            4.18.0-rc3-custom+ #63
[   40.744500] Hardware name: HP HP ZBook 14u G5/83B2, BIOS Q78 Ver. 01.00.05 01/25/2018
[   40.744510] Workqueue: pm pm_runtime_work
[   40.744517] RIP: 0010:smu7_populate_single_firmware_entry+0x83/0xda
[   40.744519] Code: 60 00 4d 89 e0 48 89 d9 89 ea 41 89 c5 48 c7 c6 40 fe ec 8f 48 c7 c7 fd c6 17 90 e8 5c b0 ad ff 45 85 ed 75 40 66 8b 44 24 02 <66> 89 2b 48 c7 43 0c 00 00 00 00 66 89 43 02 48 8b 44 24 10 48 89 
[   40.744581] RSP: 0018:ffffb0820175fc38 EFLAGS: 00010246
[   40.744585] RAX: 0000000000000018 RBX: ffffb09e2045efec RCX: 0000000000000001
[   40.744588] RDX: 0000000080000001 RSI: ffffffff901151ce RDI: 00000000ffffffff
[   40.744590] RBP: 000000000000000a R08: ffffffff8f499790 R09: 0000000000000421
[   40.744593] R10: 0000000000000004 R11: ffffffff90ab8f2d R12: ffff8e1f65474c00
[   40.744596] R13: 0000000000000000 R14: ffffb0822045f008 R15: ffff8e1f6564b0a0
[   40.744599] FS:  0000000000000000(0000) GS:ffff8e1f6f580000(0000) knlGS:0000000000000000
[   40.744602] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.744605] CR2: ffffb09e2045efec CR3: 000000025a40a005 CR4: 00000000003606e0
[   40.744608] Call Trace:
[   40.744617]  smu7_request_smu_load_fw+0xd2/0x120
[   40.744624]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   40.744629]  polaris10_start_smu+0x44/0x4d0
[   40.744635]  hwmgr_resume+0x29/0x80
[   40.744641]  amdgpu_device_ip_resume_phase2+0x51/0xb0
[   40.744647]  amdgpu_device_resume+0xb5/0x360
[   40.744653]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   40.744658]  amdgpu_pmops_runtime_resume+0x6b/0xa0
[   40.744663]  pci_pm_runtime_resume+0x78/0xb0
[   40.744669]  __rpm_callback+0x75/0x1b0
[   40.744675]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   40.744679]  rpm_callback+0x1f/0x70
[   40.744684]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   40.744689]  rpm_resume+0x57e/0x7c0
[   40.744695]  pm_runtime_work+0x50/0xa0
[   40.744700]  process_one_work+0x1eb/0x3c0
[   40.744704]  worker_thread+0x2d/0x3d0
[   40.744709]  ? process_one_work+0x3c0/0x3c0
[   40.744713]  kthread+0x112/0x130
[   40.744718]  ? kthread_flush_work_fn+0x10/0x10
[   40.744724]  ret_from_fork+0x35/0x40
[   40.744730] Modules linked in: cmac ccm rfcomm intel_rapl arc4 x86_pkg_temp_thermal intel_powerclamp bnep snd_hda_codec_hdmi coretemp snd_hda_codec_conexant snd_hda_codec_generic joydev mousedev kvm iwlmvm snd_soc_skl nls_iso8859_1 snd_soc_skl_ipc snd_soc_sst_ipc btusb snd_soc_sst_dsp btrtl snd_hda_ext_core btbcm nls_cp437 snd_soc_core hid_multitouch irqbypass vfat crct10dif_pclmul crc32_pclmul fat ghash_clmulni_intel mac80211 iTCO_wdt snd_compress hid_generic btintel pcbc mei_wdt iTCO_vendor_support snd_soc_acpi bluetooth snd_hda_intel i915 hp_wmi sparse_keymap intel_wmi_thunderbolt wmi_bmof snd_hda_codec iwlwifi snd_hwdep snd_hda_core aesni_intel aes_x86_64 crypto_simd crc16 ecdh_generic snd_pcm e1000e cryptd glue_helper intel_cstate cfg80211 snd_timer intel_uncore snd intel_rapl_perf input_leds
[   40.744798]  psmouse led_class uvcvideo videobuf2_vmalloc ptp videobuf2_memops videobuf2_v4l2 idma64 i2c_i801 mei_me pps_core soundcore ucsi_acpi videobuf2_common typec_ucsi rfkill tpm_crb i2c_hid mei videodev typec intel_lpss_pci processor_thermal_device hid intel_lpss intel_soc_dts_iosf intel_pch_thermal wmi intel_gtt evdev media int3403_thermal rtc_cmos int340x_thermal_zone tpm_tis tpm_tis_core tpm mac_hid int3400_thermal battery rng_core acpi_thermal_rel hp_wireless ac sg scsi_mod crypto_user ip_tables x_tables btrfs libcrc32c crc32c_generic xor zstd_decompress zstd_compress xxhash raid6_pq serio_raw atkbd libps2 xhci_pci xhci_hcd crc32c_intel usbcore usb_common i8042 serio
[   40.744861] CR2: ffffb09e2045efec
[   40.744865] ---[ end trace d2eb1a098dac8272 ]---
[   40.744870] RIP: 0010:smu7_populate_single_firmware_entry+0x83/0xda
[   40.744872] Code: 60 00 4d 89 e0 48 89 d9 89 ea 41 89 c5 48 c7 c6 40 fe ec 8f 48 c7 c7 fd c6 17 90 e8 5c b0 ad ff 45 85 ed 75 40 66 8b 44 24 02 <66> 89 2b 48 c7 43 0c 00 00 00 00 66 89 43 02 48 8b 44 24 10 48 89 
[   40.744925] RSP: 0018:ffffb0820175fc38 EFLAGS: 00010246
[   40.744929] RAX: 0000000000000018 RBX: ffffb09e2045efec RCX: 0000000000000001
[   40.744932] RDX: 0000000080000001 RSI: ffffffff901151ce RDI: 00000000ffffffff
[   40.744934] RBP: 000000000000000a R08: ffffffff8f499790 R09: 0000000000000421
[   40.744937] R10: 0000000000000004 R11: ffffffff90ab8f2d R12: ffff8e1f65474c00
[   40.744939] R13: 0000000000000000 R14: ffffb0822045f008 R15: ffff8e1f6564b0a0
[   40.744943] FS:  0000000000000000(0000) GS:ffff8e1f6f580000(0000) knlGS:0000000000000000
[   40.744946] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.744948] CR2: ffffb09e2045efec CR3: 000000025a40a005 CR4: 00000000003606e0
Comment 27 Thomas Martitz 2018-07-11 20:24:40 UTC
I wonder if any of the following warnings I see at boot have anything to do with it:

[    0.905752] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.905796] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.905840] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.905884] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.905929] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.905975] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.906019] amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
[    0.927117] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[    0.927118] amdgpu: [powerplay] Error in phm_get_clock_info 
[    0.927165] [drm] DM_PPLIB: values for Engine clock
[    0.927166] [drm] DM_PPLIB:   21400
[    0.927167] [drm] DM_PPLIB:   37200
[    0.927167] [drm] DM_PPLIB:   55100
[    0.927168] [drm] DM_PPLIB:   73400
[    0.927169] [drm] DM_PPLIB:   92100
[    0.927169] [drm] DM_PPLIB:   98000
[    0.927170] [drm] DM_PPLIB:   101800
[    0.927171] [drm] DM_PPLIB:   104600
[    0.927172] [drm] DM_PPLIB: Validation clocks:
[    0.927172] [drm] DM_PPLIB:    engine_max_clock: 104600
[    0.927173] [drm] DM_PPLIB:    memory_max_clock: 150000
[    0.927174] [drm] DM_PPLIB:    level           : 0
[    0.927176] [drm] DM_PPLIB: values for Memory clock
[    0.927177] [drm] DM_PPLIB:   30000
[    0.927177] [drm] DM_PPLIB:   62500
[    0.927178] [drm] DM_PPLIB:   150000
[    0.927179] [drm] DM_PPLIB: Validation clocks:
[    0.927180] [drm] DM_PPLIB:    engine_max_clock: 104600
[    0.927180] [drm] DM_PPLIB:    memory_max_clock: 150000
[    0.927181] [drm] DM_PPLIB:    level           : 0
[    0.927187] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
[    0.928139] [drm] Display Core initialized with v3.1.52!
[    0.928203] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    0.928204] [drm] Driver supports precise vblank timestamp query.
[    0.959940] [drm] UVD and UVD ENC initialized successfully.
[    1.060904] [drm] VCE initialized successfully.
[    1.065071] [drm] Initialized amdgpu 3.26.0 20150101 for 0000:01:00.0 on minor 0
Comment 28 Alex Deucher 2018-07-11 22:01:35 UTC
Created attachment 140577 [details] [review]
possible fix

Something appears to be corrupting the cpu pointer.  This patch may work around the issue, but ideally we figure out what is corrupting the cpu pointer in the first place.
Comment 29 Thomas Martitz 2018-07-12 04:45:36 UTC
Unfortunately, the last patch doesn't help either. This time I removed all my printk's and applied your patch on top of ~agd5f/linux/drm-next-4.19


[   32.537266] BUG: unable to handle kernel paging request at ffffb77e20080fec
[   32.537270] PGD 266d39067 P4D 266d39067 PUD 0 
[   32.537274] Oops: 0002 [#1] PREEMPT SMP PTI
[   32.537276] CPU: 2 PID: 1042 Comm: kworker/2:4 Tainted: G     U            4.18.0-rc3-custom+ #64
[   32.537277] Hardware name: HP HP ZBook 14u G5/83B2, BIOS Q78 Ver. 01.00.05 01/25/2018
[   32.537282] Workqueue: pm pm_runtime_work
[   32.537286] RIP: 0010:smu7_populate_single_firmware_entry.isra.6+0x49/0xa0
[   32.537287] Code: 48 89 e7 f3 48 ab 89 d0 4c 89 c7 48 89 e2 0f b6 b0 00 fe ec 96 49 8b 00 48 8b 40 20 e8 c0 6c 60 00 85 c0 75 3d 0f b7 44 24 02 <66> 89 2b 48 c7 43 0c 00 00 00 00 66 89 43 02 48 8b 44 24 10 48 89 
[   32.537311] RSP: 0018:ffffb7620519bc28 EFLAGS: 00010246
[   32.537313] RAX: 000000000000008c RBX: ffffb77e20080fec RCX: 0000000000532000
[   32.537314] RDX: ffffffff965e8a16 RSI: 0000000000000004 RDI: ffff90aa63bf1c10
[   32.537315] RBP: 0000000000000003 R08: ffff90aa63bf1c10 R09: ffff90aa634cdf00
[   32.537316] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90aa63462c14
[   32.537318] R13: ffff90aa63b94000 R14: 000000000000047e R15: ffff90aa63462c14
[   32.537319] FS:  0000000000000000(0000) GS:ffff90aa6f480000(0000) knlGS:0000000000000000
[   32.537321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.537322] CR2: ffffb77e20080fec CR3: 00000001d840a004 CR4: 00000000003606e0
[   32.537323] Call Trace:
[   32.537327]  smu7_request_smu_load_fw+0x179/0x430
[   32.537331]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   32.537333]  polaris10_start_smu+0x44/0x4d0
[   32.537336]  hwmgr_resume+0x29/0x80
[   32.537339]  amdgpu_device_ip_resume_phase2+0x51/0xb0
[   32.537342]  amdgpu_device_resume+0xb5/0x360
[   32.537345]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   32.537347]  amdgpu_pmops_runtime_resume+0x6b/0xa0
[   32.537349]  pci_pm_runtime_resume+0x78/0xb0
[   32.537352]  __rpm_callback+0x75/0x1b0
[   32.537354]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   32.537356]  rpm_callback+0x1f/0x70
[   32.537359]  ? vga_switcheroo_fini_domain_pm_ops+0x10/0x10
[   32.537361]  rpm_resume+0x57e/0x7c0
[   32.537363]  pm_runtime_work+0x50/0xa0
[   32.537366]  process_one_work+0x1eb/0x3c0
[   32.537368]  worker_thread+0x2d/0x3d0
[   32.537370]  ? process_one_work+0x3c0/0x3c0
[   32.537372]  kthread+0x112/0x130
[   32.537374]  ? kthread_flush_work_fn+0x10/0x10
[   32.537377]  ret_from_fork+0x35/0x40
[   32.537380] Modules linked in: cmac rfcomm ccm usbhid arc4 snd_hda_codec_hdmi bnep snd_hda_codec_conexant snd_hda_codec_generic joydev mousedev intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_soc_skl nls_iso8859_1 iwlmvm snd_soc_skl_ipc nls_cp437 vfat kvm snd_soc_sst_ipc fat hp_wmi snd_soc_sst_dsp iTCO_wdt hid_multitouch mac80211 hid_generic mei_wdt iTCO_vendor_support intel_wmi_thunderbolt snd_hda_ext_core sparse_keymap irqbypass crct10dif_pclmul wmi_bmof crc32_pclmul ghash_clmulni_intel snd_soc_core pcbc i915 iwlwifi uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_compress snd_soc_acpi videobuf2_common snd_hda_intel videodev aesni_intel btusb snd_hda_codec btrtl btbcm media aes_x86_64 crypto_simd cryptd btintel snd_hwdep glue_helper intel_cstate snd_hda_core bluetooth
[   32.537413]  intel_uncore input_leds intel_rapl_perf led_class snd_pcm psmouse cfg80211 snd_timer mei_me e1000e mei snd crc16 thunderbolt ecdh_generic processor_thermal_device soundcore idma64 intel_soc_dts_iosf ptp tpm_crb rfkill i2c_i801 pps_core i2c_hid intel_gtt intel_lpss_pci intel_lpss ucsi_acpi intel_pch_thermal hid typec_ucsi typec wmi evdev tpm_tis int3403_thermal tpm_tis_core int340x_thermal_zone rtc_cmos tpm int3400_thermal mac_hid ac battery acpi_thermal_rel rng_core hp_wireless sg scsi_mod crypto_user ip_tables x_tables btrfs libcrc32c crc32c_generic xor zstd_decompress zstd_compress xxhash raid6_pq serio_raw atkbd libps2 xhci_pci xhci_hcd crc32c_intel usbcore usb_common i8042 serio
[   32.537446] CR2: ffffb77e20080fec
[   32.537448] ---[ end trace 63bf3db85058595c ]---
[   32.537451] RIP: 0010:smu7_populate_single_firmware_entry.isra.6+0x49/0xa0
[   32.537452] Code: 48 89 e7 f3 48 ab 89 d0 4c 89 c7 48 89 e2 0f b6 b0 00 fe ec 96 49 8b 00 48 8b 40 20 e8 c0 6c 60 00 85 c0 75 3d 0f b7 44 24 02 <66> 89 2b 48 c7 43 0c 00 00 00 00 66 89 43 02 48 8b 44 24 10 48 89 
[   32.537476] RSP: 0018:ffffb7620519bc28 EFLAGS: 00010246
[   32.537477] RAX: 000000000000008c RBX: ffffb77e20080fec RCX: 0000000000532000
[   32.537479] RDX: ffffffff965e8a16 RSI: 0000000000000004 RDI: ffff90aa63bf1c10
[   32.537480] RBP: 0000000000000003 R08: ffff90aa63bf1c10 R09: ffff90aa634cdf00
[   32.537481] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90aa63462c14
[   32.537482] R13: ffff90aa63b94000 R14: 000000000000047e R15: ffff90aa63462c14
[   32.537484] FS:  0000000000000000(0000) GS:ffff90aa6f480000(0000) knlGS:0000000000000000
[   32.537485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.537486] CR2: ffffb77e20080fec CR3: 00000001d840a004 CR4: 00000000003606e0
Comment 30 Alex Deucher 2018-07-12 04:56:36 UTC
Is it the first call to smu7_populate_single_firmware_entry() which fails or one of the later ones?
Comment 31 Thomas Martitz 2018-07-12 05:09:11 UTC
I can't say for sure for your latest patch because I removed the printks, but it has been the first one before (but only after resume, the ones at boot are OK). Should I check again for your latest patch?
Comment 32 Alex Deucher 2018-07-12 05:41:32 UTC
Created attachment 140584 [details] [review]
possible fix

How about this patch?
Comment 33 Alex Deucher 2018-07-12 06:34:05 UTC
(In reply to Thomas Martitz from comment #21)
> 
> I assume this is some mapped memory and some HW component in the GPU writes
> to it while the CPU is using it, isn't it? If so, the proper fix would be to
> prevent that but I don't know what's the proper way of doing it in this
> context.

The CPU is writing to a buffer that the GPU ultimately reads.  Even if the data in the buffer were corrupted by the GPU somehow, the CPU's pointer should still be valid.  Can you add slub_debug=FPZU to the kernel command line in grub and attach your dmesg output?
Comment 34 Thomas Martitz 2018-07-12 08:00:26 UTC
Created attachment 140585 [details]
dmesg with 0001-drm-amdgpu-pp-smu7-cache-smu-firmware-toc.patch

This patch makes resume work, attached is the dmesg output of boot + 3 suspend-resume cycles. Please note the powerplay error messages, followed ultimately by a GPU reset.

Looking at the patch, it seems similar to my workaround, in that the toc is copied in one memcpy_toio (my patch uses plain memcpy; is there a difference here for memory-mapped buffers?) instead of changing the toc in place. And I too see lots of powerplay errors if I apply my workaround.

Huge thanks for taking time to look into this!
Comment 35 Alex Deucher 2018-07-12 13:17:42 UTC
Created attachment 140590 [details] [review]
use gtt for firmware buffers

I still don't understand what's corrupting the cpu pointer.  Ultimately, it looks like the GPU does not power up correctly after powering down so the segfault is largely irrelevant.  Does this patch help? memcpy_toio is required for accessing memory mapped device resources (e.g., vram) on some platforms.  On x86 it makes no difference.
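
A minimal sketch of the approach discussed above, for illustration only and not taken from any attached patch: the table of contents is assembled in ordinary system memory and published to the device buffer with a single bulk copy, so the CPU never reads an index or count back from device memory. The struct layout, stage_entry(), publish_toc(), and MAX_ENTRIES are hypothetical simplifications, and plain memcpy() stands in for the kernel's memcpy_toio() so the example builds in user space.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 4u /* hypothetical bound, kept small for illustration */

/* Hypothetical, simplified stand-ins for the driver's TOC structures. */
struct fw_entry {
    uint16_t id;
    uint16_t version;
    uint32_t data_size_byte;
};

struct fw_toc {
    uint32_t structure_version;
    uint32_t num_entries;
    struct fw_entry entry[MAX_ENTRIES];
};

/* Append one entry to the CPU-side staging copy only.
 * Returns 0 on success, -1 if the table is already full. */
static int stage_entry(struct fw_toc *staging, uint16_t id, uint32_t size)
{
    assert(staging != NULL);
    if (staging->num_entries >= MAX_ENTRIES)
        return -1;

    struct fw_entry *e = &staging->entry[staging->num_entries];
    e->id = id;
    e->version = 1;
    e->data_size_byte = size;
    staging->num_entries++;
    return 0;
}

/* Publish the staged table with one bulk copy. Kernel code writing to
 * device memory would use memcpy_toio() here instead of memcpy(). */
static void publish_toc(void *device_buf, const struct fw_toc *staging)
{
    assert(device_buf != NULL && staging != NULL);
    memcpy(device_buf, staging, sizeof(*staging));
}
```

Because the entry count lives in the staging copy, a device buffer that reads back as all 1s cannot turn into a wild array index. It does not address why the device is unreachable in the first place.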
Comment 36 Thomas Martitz 2018-07-12 13:17:49 UTC
Created attachment 140591 [details] [review]
workaround without memcpy

I made the following patch as an alternative workaround. The printks I added indicate what's going wrong. The smu_data->header pointer does not become busted. Instead, the toc->num_entries member somehow gets set to -1 (perhaps by accident), and since toc->num_entries is used as an index for the toc->entry array, the smu7_populate_single_firmware_entry() function gets passed an invalid pointer.

The workaround uses a temp. variable as the index (which seems to make resume work), but it's still to be found out why toc->num_entries changes to -1. Also, I still get lots of powerplay error messages with this patch. I'll attach dmesg next, below is just the output of the printks I added.

kugel@thomas-nb:linux.git$ dmesg  | grep smu7
[    0.908377] amdgpu: [powerplay] smu7_request_smu_load_fw: 10 ffffa8a060081000 0 1
[    0.908422] amdgpu: [powerplay] smu7_request_smu_load_fw: 20 ffffa8a060081000 0 1
[   30.042293] amdgpu: [powerplay] smu7_request_smu_load_fw: 10 ffffa8a060081000 0 1
[   30.042309] amdgpu: [powerplay] smu7_request_smu_load_fw: 20 ffffa8a060081000 -1 -1
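
The faulting address in comment 26 is consistent with this explanation, as the following arithmetic sketch suggests. It assumes 28-byte TOC entries and that R14 in that dump (ffffb0822045f008) holds the address of toc->entry[0]; both are assumptions inferred from the dump, not confirmed in this thread. Under those assumptions, indexing with 0xFFFFFFFF (the unsigned 32-bit form of -1) lands exactly on the reported CR2/RBX value ffffb09e2045efec.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size of one TOC entry in bytes (not confirmed in this thread). */
#define ENTRY_SIZE 28u

/* Address of entry[index], given the address of entry[0]. The index is
 * widened to 64 bits first, as pointer arithmetic with an unsigned
 * 32-bit index would do on x86-64. */
static uint64_t entry_address(uint64_t entry0, uint32_t index)
{
    return entry0 + (uint64_t)index * ENTRY_SIZE;
}

/* Inverse direction: recover the index from a faulting address. */
static uint32_t index_from_fault(uint64_t entry0, uint64_t fault)
{
    assert(fault >= entry0);
    return (uint32_t)((fault - entry0) / ENTRY_SIZE);
}
```

Calling entry_address(0xffffb0822045f008, 0xFFFFFFFF) gives 0xffffb09e2045efec. A signed index of -1 would instead point 28 bytes below the array, so the large offset suggests the value was treated as unsigned.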
Comment 37 Thomas Martitz 2018-07-12 13:19:33 UTC
Created attachment 140592 [details]
dmesg with 0001-workaround-v2.patch
Comment 38 Alex Deucher 2018-07-12 13:21:31 UTC
(In reply to Thomas Martitz from comment #36)
> Created attachment 140591 [details] [review] [review]
> workaround without memcpy
> 
> I made the following patch as an alternative workaround. The printks I added
> indicate what's going wrong. The smu_data->header pointer does not become
> busted. Instead, the toc->num_entries member somehow gets set to -1 (perhaps
> by accident), and since toc->num_entries is used as an index for the
> toc->entry array, the smu7_populate_single_firmware_entry() function gets
> passed an invalid pointer.
> 
> The workaround uses a temp. variable as the index (which seems to make
> resume work), but it's still to be found out why toc->num_entries changes to
> -1. Also, I still get lots of powerplay error messages with this patch. I'll
> attach dmesg next, below is just the output of the printks I added.

That explains it.  The problem is that the GPU does not power up properly on resume so when you read back from vram to get the index, it returns all 1s since the device is offline.
Comment 39 Alex Deucher 2018-07-12 13:28:40 UTC
Created attachment 140593 [details] [review]
possible fix

Does this patch help fix the root cause?  To anyone else testing this patch, update the pci ids to match those of your chip.
Comment 40 Thomas Martitz 2018-07-12 13:31:31 UTC
Further investigations show that toc->num_entries and toc->structure_version are set to -1 after the first call to smu7_request_smu_load_fw(). Does that make sense?

Since you say the GPU does not properly wake up, can you imagine a workaround? The laptop works with windows (of course...) so I'd think there ought to be a sw workaround.
Comment 41 Alex Deucher 2018-07-12 13:35:17 UTC
(In reply to Thomas Martitz from comment #40)
> Further investigations show that toc->num_entries and toc->structure_version
> are set to -1 after the first call to smu7_request_smu_load_fw(). Does that
> make sense?

If you read back from the BAR resource on an offline pci device it returns all 1s.

> 
> Since you say the GPU does not properly wake up, can you imagine a
> workaround? The laptop works with windows (of course...) so I'd think there
> ought to be a sw workaround.

Does attachment 140593 [details] [review] fix it as a workaround?  I think ultimately it might be a flaw in how Linux handles d3cold on some platforms.
Comment 42 Thomas Martitz 2018-07-12 19:38:06 UTC
Created attachment 140611 [details]
dmesg with 0001-workaround-v2.patch + 0001-drm-amdgpu-add-ATPX-quirk-for-a-polaris-12-laptop.patch

Sorry to say, but this patch makes things actually *worse*.

First, by accident, I added your latest patch on top of my previous workaround v2. This gives working suspend/resume but many more error messages in dmesg; in particular, a WARN() triggers:

[  385.996911] Modules linked in: cmac rfcomm ccm arc4 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic joydev intel_rapl mousedev x86_pkg_temp_thermal intel_powerclamp bnep coretemp iwlmvm snd_soc_skl snd_soc_skl_ipc hid_multitouch snd_soc_sst_ipc mac80211 snd_soc_sst_dsp hid_generic kvm snd_hda_ext_core mei_wdt snd_soc_core nls_iso8859_1 i915 nls_cp437 iwlwifi btusb irqbypass btrtl vfat btbcm crct10dif_pclmul snd_compress btintel iTCO_wdt crc32_pclmul iTCO_vendor_support snd_soc_acpi ghash_clmulni_intel fat bluetooth pcbc intel_wmi_thunderbolt hp_wmi sparse_keymap snd_hda_intel wmi_bmof snd_hda_codec cfg80211 crc16 snd_hwdep ecdh_generic aesni_intel snd_hda_core aes_x86_64 crypto_simd snd_pcm cryptd e1000e snd_timer glue_helper intel_cstate intel_uncore intel_rapl_perf uvcvideo idma64
[  385.996938]  tpm_crb input_leds led_class videobuf2_vmalloc snd psmouse videobuf2_memops mei_me i2c_i801 ptp mei videobuf2_v4l2 pps_core processor_thermal_device ucsi_acpi i2c_hid videobuf2_common typec_ucsi intel_lpss_pci soundcore typec rfkill intel_pch_thermal wmi intel_gtt intel_lpss intel_soc_dts_iosf hid videodev tpm_tis tpm_tis_core int3403_thermal int340x_thermal_zone rtc_cmos media evdev tpm ac int3400_thermal mac_hid battery acpi_thermal_rel rng_core hp_wireless sg scsi_mod crypto_user ip_tables x_tables btrfs libcrc32c crc32c_generic xor zstd_decompress zstd_compress xxhash serio_raw raid6_pq atkbd libps2 xhci_pci xhci_hcd crc32c_intel usbcore usb_common i8042 serio
[  385.996964] CPU: 4 PID: 215 Comm: kworker/4:2 Tainted: G     U  W         4.18.0-rc3-custom+ #73
[  385.996965] Hardware name: HP HP ZBook 14u G5/83B2, BIOS Q78 Ver. 01.00.05 01/25/2018
[  385.996968] Workqueue: pm pm_runtime_work
[  385.996970] RIP: 0010:generic_reg_wait+0xe7/0x160
[  385.996970] Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50 48 c7 c7 f8 29 19 bd e8 c4 24 e5 ff 83 7d 20 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f 
[  385.996989] RSP: 0018:ffffb45681fefbd8 EFLAGS: 00010297
[  385.996990] RAX: 000000000000006b RBX: 000000000000000a RCX: 0000000000000001
[  385.996991] RDX: 0000000080000001 RSI: ffffffffbd1151a6 RDI: 00000000ffffffff
[  385.996991] RBP: ffff97f4e37e3240 R08: ffffffffbc499790 R09: 00000000ffffffff
[  385.996992] R10: 0000000000000004 R11: ffffffffbdab8f2d R12: 0000000000000bb9
[  385.996992] R13: 0000000000004ea4 R14: 0000000000010000 R15: 0000000000000000
[  385.996993] FS:  0000000000000000(0000) GS:ffff97f4ef500000(0000) knlGS:0000000000000000
[  385.996994] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  385.996994] CR2: 00007ff68c6ef000 CR3: 00000001ee40a002 CR4: 00000000003606e0
[  385.996995] Call Trace:
[  385.997000]  dce110_stream_encoder_dp_blank+0x11c/0x180
[  385.997002]  power_down_all_hw_blocks+0x3d/0x1c0
[  385.997003]  dce110_power_down+0xe/0x20
[  385.997005]  dc_set_power_state+0x1b/0x70
[  385.997007]  dm_suspend+0x4a/0x60
[  385.997009]  amdgpu_device_ip_suspend+0xe4/0x170
[  385.997011]  amdgpu_device_suspend+0x251/0x3a0
[  385.997013]  amdgpu_pmops_runtime_suspend+0x44/0xb0
[  385.997015]  pci_pm_runtime_suspend+0x64/0x180
[  385.997017]  ? vga_switcheroo_runtime_resume+0x60/0x60
[  385.997019]  vga_switcheroo_runtime_suspend+0x24/0xb0
[  385.997020]  __rpm_callback+0x75/0x1b0
[  385.997022]  ? __switch_to_asm+0x30/0x60
[  385.997024]  ? vga_switcheroo_runtime_resume+0x60/0x60
[  385.997025]  rpm_callback+0x1f/0x70
[  385.997026]  ? vga_switcheroo_runtime_resume+0x60/0x60
[  385.997028]  rpm_suspend+0x12a/0x610
[  385.997030]  ? finish_task_switch+0x83/0x2e0
[  385.997031]  ? __switch_to_asm+0x24/0x60
[  385.997032]  pm_runtime_work+0x7d/0xa0
[  385.997034]  process_one_work+0x1eb/0x3c0
[  385.997035]  worker_thread+0x2d/0x3d0
[  385.997037]  ? process_one_work+0x3c0/0x3c0
[  385.997038]  kthread+0x112/0x130
[  385.997039]  ? kthread_flush_work_fn+0x10/0x10
[  385.997041]  ret_from_fork+0x35/0x40
[  385.997043] ---[ end trace 04724a7f4f9fccf6 ]---

Then there are new fatal error messages like this (the last line is new with your patch):

[  436.030371] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.030394] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  436.030410] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.030433] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  436.030448] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.030471] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  436.030487] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.145782] amdgpu 0000:01:00.0: GPU pci config reset
[  437.106049] [drm:amdgpu_device_suspend] *ERROR* amdgpu asic reset failed

I'm also quite sure I haven't seen the following before:
[  370.888835] [drm:gfx_v8_0_ring_test_ring] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xFFFFFFFF)
[  370.888839] [drm:amdgpu_device_ip_resume_phase2] *ERROR* resume of IP block <gfx_v8_0> failed -22
[  370.888841] [drm:amdgpu_device_resume] *ERROR* amdgpu_device_ip_resume failed (-22).

Most importantly, my observation that reading toc->num_entries returns -1 is still occurring:

[  368.991914] amdgpu: [powerplay] smu7_request_smu_load_fw: 10 ffffb456a0081000 0 1
[  368.991927] amdgpu: [powerplay] smu7_request_smu_load_fw: 20 ffffb456a0081000 -1 -1


Then, after I found that my workaround was still applied, I tried without it. Unfortunately, with just your patch I can't get past the SDDM login screen. The laptop freezes once the KDE session loads (I'm assuming starting X causes the freeze).
Comment 43 Alex Deucher 2018-07-12 20:33:34 UTC
It's fine to have one of the patches that stop the segfault applied. That's just a symptom of the root cause:
[   54.734549] amdgpu 0000:01:00.0: Refused to change power state, currently in D3
[   54.810069] amdgpu 0000:01:00.0: Refused to change power state, currently in D3
For some reason the GPU doesn't power up correctly.  There's not much the driver can do until we sort out why ACPI is not powering up the GPU correctly.

As a workaround, you can disable runtime pm by appending amdgpu.runpm=0 on the kernel command line in grub.
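
To confirm that the workaround is actually in effect after a reboot, the running kernel's command line can be inspected. The following is a minimal sketch; the helper name `has_kernel_param` is hypothetical, and how the parameter gets added in the first place (typically a `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub` followed by regenerating the GRUB config) varies by distribution and is an assumption here.

```shell
#!/bin/sh
# Report whether an exact parameter appears on a kernel command line.
# Usage: has_kernel_param PARAM [CMDLINE_FILE]
# CMDLINE_FILE defaults to /proc/cmdline (the running kernel's command line).
# The function name is hypothetical; it is not part of any existing tool.
has_kernel_param() {
    # Put each whitespace-separated word on its own line, then look for
    # a whole-line, fixed-string match.
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

if has_kernel_param amdgpu.runpm=0; then
    echo "amdgpu.runpm=0 is set: runtime PM is disabled for amdgpu"
else
    echo "amdgpu.runpm=0 is not set on the running kernel"
fi
```

The same helper works for any other boot parameter, since it only does an exact word match.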
Comment 44 Thomas Martitz 2018-07-12 21:50:40 UTC
Disabling runtime pm probably results in poor battery life, right? This is a laptop with hybrid graphics after all, and the Radeon should be disabled most of the time.

Is there anything I can try? Like checking something in Windows or trying the pro driver? Or making more code changes, e.g. retrying to power up the GPU a couple of times?
Comment 45 Thomas Martitz 2018-07-13 22:00:25 UTC
> Most importantly, my observation that reading toc->num_entries returns -1 is still occuring:

> [  368.991914] amdgpu: [powerplay] smu7_request_smu_load_fw: 10 ffffb456a0081000 0 1
> [  368.991927] amdgpu: [powerplay] smu7_request_smu_load_fw: 20 ffffb456a0081000 -1 -1

These messages come from your patch; they don't happen with just my last workaround.

If I insert a printk after pci_set_power_state() succeeds, pci->current_state indicates that the GPU is powered on, yet reading from the mapped memory still returns -1. How does that make any sense?
Comment 46 Alex Deucher 2018-07-15 14:40:18 UTC
(In reply to Thomas Martitz from comment #45)
> 
> If I insert a printk after the pci_set_power_state() succeeds and pci ->
> current_state indicates that gpu is powered on, yet reading from the mapped
> memory still returns -l. How does that make any sense?

It's still not powered up properly.  If it were, the GPU would init properly (the ring failures and SMU message failures wouldn't be there) and reads from the BARs would return real values rather than all ones.
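
The "all ones" pattern ties together several values seen in the logs: a 32-bit read of 0xFFFFFFFF prints as -1 when treated as a signed integer, which matches the toc->num_entries output and the scratch register value 0xFFFFFFFF. The sketch below only illustrates that arithmetic; the helper names are hypothetical, and reading "ret is 65535" as the 16-bit equivalent (0xFFFF) is an assumption that this thread does not confirm.

```shell
#!/bin/sh
# Interpret a raw 32-bit register value as a signed integer.
# Assumes the shell does arithmetic in at least 64 bits, which holds for
# common shells on 64-bit Linux. Helper names are hypothetical.
to_signed32() {
    raw=$(( $1 & 0xFFFFFFFF ))
    if [ $(( raw >= 0x80000000 )) -eq 1 ]; then
        # High bit set: subtract 2^32 to get the two's-complement value.
        echo $(( raw - 0x100000000 ))
    else
        echo "$raw"
    fi
}

# True when every one of the low 32 bits is set, which is what reads
# from a device that is not responding on the bus typically look like.
is_all_ones32() {
    [ $(( ($1 & 0xFFFFFFFF) == 0xFFFFFFFF )) -eq 1 ]
}

to_signed32 0xFFFFFFFF    # prints -1
echo $(( 0xFFFF ))        # prints 65535

if is_all_ones32 0xFFFFFFFF; then
    echo "all ones: the device is probably not answering reads"
fi
```

A value such as 0xCAFEDEAD does not match this pattern, so a ring test failing with that scratch value points at a device that answers reads but never executed the test write.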
Comment 47 Thomas Martitz 2018-07-16 14:36:04 UTC
So pci_raw_set_power_state() does a pci_read_config_word() and that returns a valid word. Yet, the device appears not to be in a powered-up state later on. How's that possible, and why does it work on Windows?

Can I inspect Windows behavior in some way to get insight?

Since Windows works I'm sure there must be a SW fix (or at least a workaround) available. Perhaps just wait for a bit?
Comment 48 Alex Deucher 2018-07-16 17:07:03 UTC
(In reply to Thomas Martitz from comment #47)
> So pci_raw_set_power_state() does a pci_read_config_word() and that returns
> a valid word. Yet, the device appears to be not in powerd up state later on.
> How's that possible, and why does it work on Windows?
> 
> Can I inspect Windows behavior in some way to get insight?
> 
> Since Windows works I'm sure there must be a SW fix (or at least a
> workaround) available. Perhaps just wait for a bit?

In HG (hybrid graphics) laptops, the d3cold control is handled by the OS rather than the driver (e.g., the driver doesn't call into ACPI to handle d3cold, the pci core does).  The driver just has to support the necessary callbacks to the OS to enter/leave this state when idle.  Each OS interacts slightly differently with ACPI, so if the OEM never validated Linux it's likely there are some slight differences in the sequencing that are causing a problem.  You might try playing with the ACPI interfaces directly on Linux.  There are user mode tools to interact with ACPI.  Blacklist the amdgpu driver and try calling the _PR3 method for the device to power it down/up and then see if the device comes back properly.
Comment 49 Alex Deucher 2018-07-16 17:12:55 UTC
Created attachment 140650 [details] [review]
possible fix

Does this patch help?  Just a hack for testing to see if the scratch registers are stale or corrupt.
Comment 50 Thomas Martitz 2018-07-17 06:28:33 UTC
Created attachment 140660 [details]
dmesg with force_asic_init.diff + 0001-workaround-v2.patch

Doesn't seem to make a difference.


[  255.418659] [drm:gfx_v8_0_ring_test_ring] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[  255.418670] [drm:amdgpu_device_ip_resume_phase2] *ERROR* resume of IP block <gfx_v8_0> failed -22
[  255.418675] [drm:amdgpu_device_resume] *ERROR* amdgpu_device_ip_resume failed (-22).
Comment 51 Thomas Martitz 2018-07-17 06:30:47 UTC
Btw, your suggestion to disable runtime pm (amdgpu.runpm=0) doesn't help as far as system suspend/resume is concerned. I think runtime pm generally works, because I see occasional debug output from the smu7_populate_single_firmware_entry() function even before suspending (that suggests to me that the GPU has been suspended regardless of system suspend).
Comment 52 Thomas Martitz 2018-07-17 06:33:14 UTC
(In reply to Thomas Martitz from comment #51)
> because I see occasional debug outout from
> smu7_populate_single_firmware_entry() function even before suspending

Forgot to add that the debug output before system suspend doesn't indicate errors (toc->num_entries/toc->structure_version is 0/1 as expected)
Comment 53 Thomas Martitz 2018-07-19 14:14:22 UTC
Alright, after some digging I found that the ACPI address of my dGPU is \_SB_.PCI0.RP01.PXSX

Then I used https://github.com/mkottman/acpi_call to execute \_SB_.PCI0.RP01.PXSX._PR3 as you suggested, but it reported an error (NOT_FOUND). The examples in that repo suggest that _PS3 or _OFF might also work. _PS3 gave NOT_FOUND too, but I could execute _OFF without error. In other examples there are more calls necessary though, e.g. NVOP or _DSM, but how can I know what's needed?

To power back on, _ON seems to be working.

So, what do you suggest I should try with these methods exactly?
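
The acpi_call steps described above can be sketched as follows. This assumes the acpi_call module from the linked repository is loaded and exposes /proc/acpi/call, as its README describes; the wrapper name `acpi_call_method` is hypothetical, and the device path is the one reported in this comment, so it will differ on other machines.

```shell
#!/bin/sh
# Invoke an ACPI method through the acpi_call procfs interface and print
# the result string. Real use needs root and a loaded acpi_call module.
# Usage: acpi_call_method METHOD_PATH [CALL_FILE]
# The function name is hypothetical; CALL_FILE defaults to /proc/acpi/call.
acpi_call_method() {
    call_file=${2:-/proc/acpi/call}
    # Use printf rather than echo: some shells' echo interprets
    # backslashes, and ACPI paths start with one.
    printf '%s\n' "$1" > "$call_file" || return 1
    cat "$call_file"
}

# Device path as reported in this thread (machine-specific).
DGPU='\_SB_.PCI0.RP01.PXSX'

# Left commented out because these calls change hardware power state:
# acpi_call_method "${DGPU}._OFF"
# acpi_call_method "${DGPU}._ON"
```

After the _ON call, dumping the device's config space (for example with lspci -s 01:00.0 -x) is one way to see whether it came back; a dump consisting only of ff bytes would suggest that it did not.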
Comment 54 Alex Deucher 2018-07-26 19:33:44 UTC
Can you see if this patch fixes it?
https://gist.github.com/karolherbst/3cde7028a6b885ca42863b6f6320658c
Comment 55 Thomas Martitz 2018-07-26 22:32:31 UTC
Created attachment 140845 [details]
dmesg + Karols hack

No, unfortunately, the GPU is unusable after resume, but the dmesg output is different now.

FWIW, in this case I have booted with pcie_port_pm=off, which I feel improves things on my system (but it's nowhere near a solution), since I think the GPU is behind a PCIe bridge to which the TB3 port is also connected, and I found in the source that bridge pm should be disabled if there are TB3 ports behind it due to hotplug.

Without pcie_port_pm=off the behavior is almost the same, except that dmesg shows lots of powerplay error messages that don't occur in the attached output (again, made with pcie_port_pm=off).

Karol's patch didn't apply cleanly onto drm-next-4.19-wip, so I made some changes; perhaps you could check if it's still equivalent (will attach with the next comment).
Comment 56 Thomas Martitz 2018-07-26 22:33:08 UTC
Created attachment 140846 [details] [review]
fixed hack.patch
Comment 57 taijian 2018-08-28 16:43:10 UTC
Just to reconfirm: This bug is fixed for me as original reporter in the 4.18.y release.
Comment 58 Michel Dänzer 2018-08-28 16:59:53 UTC
Resolving per comment 57. Anyone still having issues with current kernels, please file your own report.
Comment 59 Peter Wu 2018-08-29 21:21:11 UTC
taijian or Thomas, in order to understand the problem better, could you upload your machine information:

sudo acpidump > acpidump.txt
sudo lspci -nnvvvxxxx > lspci-nnvvvxxxx.txt
Comment 60 Thomas Martitz 2018-08-30 06:57:34 UTC
Created attachment 141369 [details]
acpidump
Comment 61 Thomas Martitz 2018-08-30 06:59:18 UTC
Created attachment 141370 [details]
lspci
Comment 62 Thomas Martitz 2018-08-30 07:02:41 UTC
To recap, the original bug (panic with that particular backtrace) is also fixed for me. However the crash was a symptom of the root cause that my system does not properly resume due to eGPU issues.

It may be worth noting that on my system the eGPU seems to be behind the same PCIe bridge as the TB3 port. I think I need at least to disable runtime pm for that bridge but that alone isn't sufficient from my testing.
Comment 63 Thomas Martitz 2018-08-30 07:05:13 UTC
Maybe https://bugzilla.kernel.org/show_bug.cgi?id=156341 is related? I found a related thread on the nouveau ML ("Rewriting Intel PCI bridge prefetch base address bits solves nvidia graphics issues") that seems to talk about similar resume problems.
Comment 64 Peter Wu 2018-08-30 10:04:52 UTC
(In reply to Thomas Martitz from comment #62)
> To recap, the original bug (panic with that particular backtrace) is also
> fixed for me.

In that case, it is probably better to open a new bug report (and refer to this bug for context) for other issues.

> It may be worth noting that on my system the eGPU seems to be behind the
> same PCIe bridge as the TB3 port. I think I need at least to disable runtime
> pm for that bridge but that alone isn't sufficient from my testing.

The dGPU and TB devices appear behind different bridges (lspci -tv):
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
           ...
           +-1c.0-[01]----00.0  Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3100]
           ...
           +-1c.4-[03-3b]----00.0-[04-3b]--+-00.0-[05]----00.0  Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016]
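
The same check can be done without reading the tree by eye, because a device's resolved sysfs path lists every bridge above it. The sketch below extracts the root port from such a path; the function name `root_port` is hypothetical, and the example paths are constructed from the lspci -tv excerpt above rather than copied from the machine.

```shell
#!/bin/sh
# Print the root port a PCI device sits behind, given the device's
# resolved sysfs path, e.g. the output of:
#   readlink -f /sys/bus/pci/devices/0000:01:00.0
# The function name is hypothetical.
root_port() {
    # Drop everything up to and including the host bridge component
    # ("pciDDDD:BB/"), then keep only the first remaining component.
    rest=${1#*/pci[0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]/}
    echo "${rest%%/*}"
}

# dGPU at 01:00.0
root_port /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0
# prints 0000:00:1c.0

# Thunderbolt NHI at 05:00.0
root_port /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/0000:04:00.0/0000:05:00.0
# prints 0000:00:1c.4
```

Different root ports for the two devices mean they do not share a bridge.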

(In reply to Thomas Martitz from comment #63)
> Maybe https://bugzilla.kernel.org/show_bug.cgi?id=156341 is related? I found
> a related thread on the nouveau ML ("Rewriting Intel PCI bridge prefetch
> base address bits solves nvidia graphics issues") that seems to talk about
> similar resume problems.

Maybe related (or maybe not), that's why I was asking for some detail (thanks for providing these!). But please do open a new bug since the original issue appears to be solved.
Comment 65 Thomas Martitz 2018-09-08 11:03:27 UTC
https://patchwork.kernel.org/patch/10583229/ (modified such that the quirk is applied unconditionally) fixes GPU resume on my laptop as well. I think it's got the same PCIe bridge as the ASUS machines mentioned in that post.
Comment 66 Daniel Drake 2018-09-10 03:43:00 UTC
(In reply to Thomas Martitz from comment #65)
> https://patchwork.kernel.org/patch/10583229/ (modified such that the quirk
> is applied unconditionally) fixes GPU resume on my laptop as well. I think
> it's got the same PCIe bridge as the ASUS machines mentioned in that post.

Thanks for testing! That issue is now being tracked at https://bugzilla.kernel.org/show_bug.cgi?id=201069

