Created attachment 141520 [details]
I wanted to try the newly added Raven Ridge support in amdkfd, but initialization fails with:
"kfd: Failed to resume IOMMU for device 1002:15dd" on an AMD Ryzen 5 2500U (Lenovo E485) with 4.19-rc3. The IOMMU itself seems to initialize fine (as I understand it, I can ignore the "AMD-Vi: Unable to write to IOMMU perf counter." message). The full log is attached.
Added Felix to CC
The AMD-Vi messages in the log look OK. I'm seeing the same on my Raven system (Ryzen 5 2400G desktop).
I'm currently running a 4.19-rc1+ kernel from Alex Deucher's drm-next-4.20-wip branch. I haven't tried rc3 from the master branch yet. I'll try it tonight and see if I can reproduce the issue.
I'm not seeing this problem on my Raven system with 4.19-rc3+ ($ git describe
The most likely explanation is that IOMMUv2 is not enabled on your system. That may be a BIOS setting. If your system's BIOS setup doesn't allow you to enable IOMMUv2, then you may be out of luck. I'll attach a patch that adds some extra error messages, which should either confirm that or point to a different source of the problem.
Created attachment 141532 [details] [review]
Add iommu init instrumentation
Output with patch applied:
Sep 12 12:08:20 zen kernel: kfd kfd: Allocated 3969056 bytes on gart
Sep 12 12:08:20 zen kernel: Topology: Add APU node [0x15dd:0x1002]
Sep 12 12:08:20 zen kernel: Failed to attache to group
Sep 12 12:08:20 zen kernel: amd_iommu_init_device failed: -22
Sep 12 12:08:20 zen kernel: kfd kfd: Failed to resume IOMMU for device 1002:15dd
Sep 12 12:08:20 zen kernel: Creating topology SYSFS entries
Sep 12 12:08:20 zen kernel: kfd kfd: device 1002:15dd NOT added due to errors
Full log attached.
Created attachment 141533 [details]
dmesg 4.19-rc3 with iommu init instrumentation
Good timing. We were just given a laptop that has similar problems and found a partial workaround: Try adding "iommu=pt" to your kernel command line. This may at least get you through the KFD initialization, but there are likely more problems down the line.
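To confirm that the parameter actually reached the running kernel after a reboot, one can look for it in /proc/cmdline. The following is a minimal Python sketch, not part of any patch; the helper name `has_kernel_param` and the sample command line are made up for illustration.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token in a kernel command line."""
    # Kernel parameters are whitespace-separated; compare whole tokens so that
    # e.g. "amd_iommu=on" is not mistaken for "iommu=pt".
    return param in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text()


# Illustrative command line (hypothetical values, not taken from the attached logs).
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet iommu=pt"
print(has_kernel_param(sample, "iommu=pt"))
```

On the affected machine, `has_kernel_param(read_cmdline(), "iommu=pt")` would report whether the workaround is in effect for the current boot.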
The problems are due to BIOS bugs. We're looking into more workarounds to ignore or patch incorrect information in the CRAT ACPI table that describes the compute devices for KFD.
KFD initializes without errors using "iommu=pt". I will see whether I can get ROCm running on top of that.
Unfortunately, the BIOS has been terrible so far on the Raven-based Lenovo laptops. I am happy to try any patches or workarounds you have; just let me know.
ROCm 1.9 runs OpenCL on GPU on top of mainline kfd and seems stable. However:
- CPU is not detected as a compute device (rocminfo attached)
- Performance, at least in darktable, is quite low (the "bench.SRW" benchmark in OpenCL on GPU takes more than 3 times longer than on CPU without OpenCL). The problem could be that memory buffers are too small, clinfo reports:
"Max memory allocation 268435456 (256MiB)"
which seems quite small to me (?).
Are these problems a result of incorrect information in CRAT?
Created attachment 141657 [details]
ROCm 1.9 info on 4.19-rc4
rocminfo reports both the CPU and the GPU.
If OpenCL can't use the CPU as a compute device, that's probably a limitation of the OpenCL implementation.
The max memory allocation size is strange. rocminfo reports a single 16GB memory pool attached to the CPU. That's system memory from the CRAT table and looks reasonable. It should be possible to use at least 3/8 of that with the upstream KFD. If clinfo is reporting something different, I'm wondering if it's an OpenCL limitation rather than a ROCm limitation.
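For scale, the numbers above can be compared directly. This is only a back-of-the-envelope sketch in Python, assuming "16GB" means 16 GiB and taking the 3/8 fraction and the clinfo value as quoted in this thread; the variable names are illustrative.

```python
GIB = 2**30
MIB = 2**20

system_memory = 16 * GIB                 # pool size reported by rocminfo (assumed to be 16 GiB)
expected_usable = system_memory * 3 // 8  # "at least 3/8" with upstream KFD
clinfo_max_alloc = 268435456              # "Max memory allocation" reported by clinfo

print(expected_usable // GIB, "GiB expected to be usable")
print(clinfo_max_alloc // MIB, "MiB reported by clinfo")
print(expected_usable // clinfo_max_alloc, "x difference")
```

Note that clinfo's "Max memory allocation" is the limit for a single buffer, so it is not necessarily the same quantity as the total memory available to the device.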
If you're interested in the raw information reported by KFD to user mode, check out /sys/class/kfd/kfd/topology/nodes. On an APU there should be only one node (0). Underneath that you'll find node properties as well as memory properties that may be interesting.
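A small Python sketch for reading those files follows. It assumes each properties file holds one whitespace-separated "name value" pair per line and that memory banks live under mem_banks/<n>/properties with a size_in_bytes entry; the function names are hypothetical, and the layout should be checked against the running kernel.

```python
from pathlib import Path


def parse_properties(text: str) -> dict:
    """Parse 'name value' lines into a dict (assumed file format)."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        # Most values are plain integers; keep anything else as a string.
        props[key] = int(value) if value.isdigit() else value
    return props


def total_memory_bytes(node_dir) -> int:
    """Sum size_in_bytes over all memory banks of one topology node."""
    total = 0
    for prop_file in sorted(Path(node_dir, "mem_banks").glob("*/properties")):
        size = parse_properties(prop_file.read_text()).get("size_in_bytes", 0)
        if isinstance(size, int):
            total += size
    return total
```

For example, `total_memory_bytes("/sys/class/kfd/kfd/topology/nodes/0")` would add up the sizes of all banks on node 0, which can then be compared with what rocminfo and clinfo report.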
Thanks a lot for the info. /sys/class/kfd/kfd/topology/nodes/0/mem_banks/0/properties correctly reports 16GB of RAM.
As the issues seem to come from BIOS/OpenCL (not from kfd) and kfd successfully initializes with "iommu=pt", I will close this bug report as resolved.
I have the same issue on a Dell Latitude 5495 with Linux kernel 4.19.1, and iommu=pt works as a workaround here too.
But as AMD is working around other BIOS bugs (rather than getting them fixed quickly with their business partners), I think this bug report should be left open for now.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/4.