Bug 98505

Summary: [radeon, amdgpu] Regression introduced in 4.8-rc3
Product: DRI Reporter: Nayan Deshmukh <nayan26deshmukh>
Component: DRM/AMDgpuAssignee: Default DRI bug account <dri-devel>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: medium CC: alexdeucher, ckoenig.leichtzumerken, nayan26deshmukh, pavel.ondracka, peter
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Dell_Inc.-Inspiron_5548.tar.gz
none
amdgpu patch that checks whether the new interface can be used for PM
none
dmesg-with-pcie_port_pm=force
none
dmesg-with-amdgpu-blacklisted
none
sudo lspci -s4: -nnvv
none
dmesg_with_debug_options
none
dmesg_with_debug_options
none
dmesg-without_blacklist
none
dmesg_with_blacklist none

Description Nayan Deshmukh 2016-10-30 16:48:38 UTC
Everything was working fine with v4.8-rc2, but with v4.8-rc3 the module loads successfully but when I try to use it via vdpau it results in following errors:-

[  101.129388] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)
[  101.340282] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 2 test failed (scratch(0xC040)=0xCAFEDEAD)
[  101.551074] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 3 test failed (scratch(0xC040)=0xCAFEDEAD)
[  101.761909] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 4 test failed (scratch(0xC040)=0xCAFEDEAD)
[  101.972856] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 5 test failed (scratch(0xC040)=0xCAFEDEAD)
[  102.183708] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 6 test failed (scratch(0xC040)=0xCAFEDEAD)
[  102.394620] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 7 test failed (scratch(0xC040)=0xCAFEDEAD)
[  102.605321] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 8 test failed (scratch(0xC040)=0xCAFEDEAD)
[  102.709055] [drm:sdma_v2_4_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
[  102.709065] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <sdma_v2_4> failed -22
[  102.709074] [drm:amdgpu_resume_kms [amdgpu]] *ERROR* amdgpu_resume failed (-22).
[  126.965536] [TTM] Buffer eviction failed
Comment 1 Nayan Deshmukh 2016-10-30 16:53:33 UTC
After the git bisect, the following commit was leading to error:- 

commit c39b487f195b93235ee76384427467786f7bf29f
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Tue Aug 9 00:20:28 2016 -0400

    Revert "drm/amdgpu: work around lack of upstream ACPI support for D3cold"
    
    This reverts commit c63695cc5e5f685e924e25a8f9555f6e846f1fc6.
    
    Now that d3cold support is upstream, there is no more need for this
    workaround.
    
    bug:
    https://bugs.freedesktop.org/show_bug.cgi?id=97248
Comment 2 Alex Deucher 2016-11-01 13:29:15 UTC
Possibly the same root cause as bug 98398.
Comment 3 Nayan Deshmukh 2016-11-01 13:56:49 UTC
I had a look at Peter Wu's patch that landed today on drm-fixes. Do we also need to have a patch like that which maybe tweaks amdgpu_atpx_handler to make it work?
Comment 4 Peter Wu 2016-11-01 18:26:03 UTC
@Nayan To see whether the problem is similar as nouveau, I need more information about your machine. Could you provide your acpidump?

You can also submit a whole tarball as described at https://bugs.launchpad.net/lpbugreporter/+bug/752542, that would include the acpidump among other things.
Comment 5 Nayan Deshmukh 2016-11-01 20:46:27 UTC
Created attachment 127673 [details]
Dell_Inc.-Inspiron_5548.tar.gz
Comment 6 Nayan Deshmukh 2016-11-01 20:49:28 UTC
After I run any program which uses the AMD gpu the system will hang if I perform any heavy task like starting chrome for example. 

Please let me know if you need any more info or need to test some patches.
Comment 7 Peter Wu 2016-11-01 23:31:40 UTC
Created attachment 127678 [details] [review]
amdgpu patch that checks whether the new interface can be used for PM

PCIe port PM is not enabled because this BIOS is pre-2015: 12/04/2014
The BIOS seems to be able to report support for lots of things (can you post a fuller dmesg that include the supported functions?):

    Scope (\_SB.PCI0.GFX0) {
        Method (ATPX, 2, Serialized) {
            // ...
            If (Arg0 == One) {
                Name (TMP2, Buffer (0x0100) { 0x00 })
                CreateWordField (TMP2, Zero, F1SS)
                CreateDWordField (TMP2, 0x02, F1VM)
                CreateDWordField (TMP2, 0x06, F1FG)
                F1SS = 0x0A
                F1VM = 0x7FC0
                If ((\_SB.PCI0.RP05.PXSX.SGMD & 0x0F) == 0x02) {
                    // ...
                    If (\_SB.PCI0.RP05.PXSX.PXDY == One) {
                        F1FG |= 0x80 /* ATPX_DYNAMIC_PX_SUPPORTED */
                        F1VM |= 0x80
                    }
                    //
                    If (\_SB.PCI0.RP05.PXSX.FDPD == One) {
                        F1FG |= 0x0400 /* ATPX_DYNAMIC_DGPU_POWER_OFF_SUPPORTED */
                        F1VM |= 0x0400
                        If (OSYS >= 0x07DC) {
                            F1FG |= 0x0800 /* ATPX_DGPU_REQ_POWER_FOR_DISPLAYS */
                            F1VM |= 0x0800
                        }
                    }
                    // ...
                    If (OSYS >= 0x07DD) {
                        F1FG |= 0x4000 /* ATPX_MS_HYBRID_GFX_SUPPORTED */ 
                        F1VM |= 0x4000
                    }

amdgpu (and radeon) needs to check whether PCIe port PM is really supported. Possible patch is attached (there should probably be a pci_d3cold_disable call somewhere, see commit 279cf3f23870f7eb8ca071115e06d3d5ca0a2b9e for nouveau).
Comment 8 Nayan Deshmukh 2016-11-02 06:07:15 UTC
Comment on attachment 127678 [details] [review]
amdgpu patch that checks whether the new interface can be used for PM

Review of attachment 127678 [details] [review]:
-----------------------------------------------------------------

::: drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
@@ +206,5 @@
>  	atpx->is_hybrid = false;
>  	if (valid_bits & ATPX_MS_HYBRID_GFX_SUPPORTED) {
>  		printk("ATPX Hybrid Graphics\n");
> +		/* Disable legacy PM methods only when pcie port PM is usable. */
> +		atpx->functions.power_cntl = !amdgpu_atpx_priv->bridge_pm_usable;

With this replaced by 
atpx->functions.power_cntl = !amdgpu_atpx_priv.bridge_pm_usable;

The patch fixes the issue. :)
Comment 9 Nayan Deshmukh 2016-11-02 06:11:16 UTC
I will also try to the master branch of kernel to see if bug 98357 is gone
Comment 10 Alex Deucher 2016-11-02 16:35:01 UTC
(In reply to Peter Wu from comment #7)
> Created attachment 127678 [details] [review] [review]
> amdgpu patch that checks whether the new interface can be used for PM
> 
> PCIe port PM is not enabled because this BIOS is pre-2015: 12/04/2014
> The BIOS seems to be able to report support for lots of things (can you post
> a fuller dmesg that include the supported functions?):

Can we bump the bios white list to 2014?  Windows enabled Hybrid graphics on windows 8.x as well as 10, and a number of preliminary win10 systems have 2014 bios versions.  I'd prefer that to doing all of these hacks in the drivers.
Comment 11 Peter Wu 2016-11-02 16:46:15 UTC
(In reply to Alex Deucher from comment #10)
> (In reply to Peter Wu from comment #7)
> > Created attachment 127678 [details] [review] [review] [review]
> > amdgpu patch that checks whether the new interface can be used for PM
> > 
> > PCIe port PM is not enabled because this BIOS is pre-2015: 12/04/2014
> > The BIOS seems to be able to report support for lots of things (can you post
> > a fuller dmesg that include the supported functions?):
> 
> Can we bump the bios white list to 2014?  Windows enabled Hybrid graphics on
> windows 8.x as well as 10, and a number of preliminary win10 systems have
> 2014 bios versions. I'd prefer that to doing all of these hacks in the drivers.

Lowering the minumum from 2015 to 2014 should work for nouveau:
https://lists.freedesktop.org/archives/nouveau/2016-July/025619.html

You can try to ask Mika, see v4.7-rc2-3-g9d26d3a. Maybe the presence of a power resource (_PR3 or flags.power_resources) can be used as a hint whether to enable port PM.
Comment 12 Alex Deucher 2016-11-02 16:55:27 UTC
(In reply to Peter Wu from comment #11)
> (In reply to Alex Deucher from comment #10)
> > (In reply to Peter Wu from comment #7)
> > > Created attachment 127678 [details] [review] [review] [review] [review]
> > > amdgpu patch that checks whether the new interface can be used for PM
> > > 
> > > PCIe port PM is not enabled because this BIOS is pre-2015: 12/04/2014
> > > The BIOS seems to be able to report support for lots of things (can you post
> > > a fuller dmesg that include the supported functions?):
> > 
> > Can we bump the bios white list to 2014?  Windows enabled Hybrid graphics on
> > windows 8.x as well as 10, and a number of preliminary win10 systems have
> > 2014 bios versions. I'd prefer that to doing all of these hacks in the drivers.
> 
> Lowering the minumum from 2015 to 2014 should work for nouveau:
> https://lists.freedesktop.org/archives/nouveau/2016-July/025619.html
> 

For further clarify on your research, the current hybrid graphics spec requires no connectors on the dGPU and hence no audio devices.  For devices with connectors on the dGPU, they should use the older vendor specific methods and PR# should not be exposed.

> You can try to ask Mika, see v4.7-rc2-3-g9d26d3a. Maybe the presence of a
> power resource (_PR3 or flags.power_resources) can be used as a hint whether
> to enable port PM.

Yes, the presence of _PR3 would be a great way to determine when to enable it.
Comment 13 Alex Deucher 2016-11-02 16:56:31 UTC
(In reply to Alex Deucher from comment #12)
> methods and PR# should not be exposed.

PR3
Comment 14 Peter Wu 2016-11-04 00:08:09 UTC
(In reply to Alex Deucher from comment #12)
[..]
> > Lowering the minumum from 2015 to 2014 should work for nouveau:
> > https://lists.freedesktop.org/archives/nouveau/2016-July/025619.html
> > 
> 
> For further clarify on your research, the current hybrid graphics spec
> requires no connectors on the dGPU and hence no audio devices.  For devices
> with connectors on the dGPU, they should use the older vendor specific
> methods and PR3 should not be exposed.

The list was mentioned because of the BIOS dates and whether _PR3 is used.

The Windows 8 requirements state that the discrete GPU is render-only[0], but the Win10 reqs have no such requirement[1]. (Are these the same specs you had in mind? If not, could you share the right one?) On my new hybrid nvidia laptop, there are actually DP/HDMI ports to the dGPU (HDMI with audio).

 [0]: https://msdn.microsoft.com/en-us/library/windows/hardware/dn265510(v=vs.85).aspx
 [1]: https://msdn.microsoft.com/en-us/windows/hardware/commercialize/design/compatibility/systems#systemfundamentalsgraphicshybridgraphicsmultigpu

> > You can try to ask Mika, see v4.7-rc2-3-g9d26d3a. Maybe the presence of a
> > power resource (_PR3 or flags.power_resources) can be used as a hint whether
> > to enable port PM.
> 
> Yes, the presence of _PR3 would be a great way to determine when to enable
> it.

Have you been working on this or should I have a look?
Comment 15 Alex Deucher 2016-11-04 14:51:49 UTC
(In reply to Peter Wu from comment #14)
> (In reply to Alex Deucher from comment #12)
> [..]
> > > Lowering the minumum from 2015 to 2014 should work for nouveau:
> > > https://lists.freedesktop.org/archives/nouveau/2016-July/025619.html
> > > 
> > 
> > For further clarify on your research, the current hybrid graphics spec
> > requires no connectors on the dGPU and hence no audio devices.  For devices
> > with connectors on the dGPU, they should use the older vendor specific
> > methods and PR3 should not be exposed.
> 
> The list was mentioned because of the BIOS dates and whether _PR3 is used.
> 
> The Windows 8 requirements state that the discrete GPU is render-only[0],
> but the Win10 reqs have no such requirement[1]. (Are these the same specs
> you had in mind? If not, could you share the right one?) On my new hybrid
> nvidia laptop, there are actually DP/HDMI ports to the dGPU (HDMI with
> audio).
> 

I had access to the PR3 spec internally.  I'm not sure if it's available publicly or not.  Support for displays was not included in the information I saw or from talking with our internal hybrid graphics architects.  It's possible the spec has been updated since I last talked to them earlier this year.  I can double check.

Does your new laptop use PR3 or the older nvidia proprietary ACPI method?

>  [0]:
> https://msdn.microsoft.com/en-us/library/windows/hardware/dn265510(v=vs.85).
> aspx
>  [1]:
> https://msdn.microsoft.com/en-us/windows/hardware/commercialize/design/
> compatibility/systems#systemfundamentalsgraphicshybridgraphicsmultigpu
> 
> > > You can try to ask Mika, see v4.7-rc2-3-g9d26d3a. Maybe the presence of a
> > > power resource (_PR3 or flags.power_resources) can be used as a hint whether
> > > to enable port PM.
> > 
> > Yes, the presence of _PR3 would be a great way to determine when to enable
> > it.
> 
> Have you been working on this or should I have a look?

I have not been.  Please have a look.  Thanks!
Comment 16 Peter Wu 2016-11-04 15:04:34 UTC
(In reply to Alex Deucher from comment #15)
> (In reply to Peter Wu from comment #14)
[..]
> > The Windows 8 requirements state that the discrete GPU is render-only[0],
> > but the Win10 reqs have no such requirement[1]. (Are these the same specs
> > you had in mind? If not, could you share the right one?) On my new hybrid
> > nvidia laptop, there are actually DP/HDMI ports to the dGPU (HDMI with
> > audio).
> 
> I had access to the PR3 spec internally.  I'm not sure if it's available
> publicly or not.  Support for displays was not included in the information I
> saw or from talking with our internal hybrid graphics architects.  It's
> possible the spec has been updated since I last talked to them earlier this
> year.  I can double check.
> 
> Does your new laptop use PR3 or the older nvidia proprietary ACPI method?

The laptop (Clevo P651RA) uses PR3 (confirmed via a Windows 10 trace):
https://github.com/Bumblebee-Project/bbswitch/issues/115#issuecomment-218622306

> > Have you been working on this or should I have a look?
> 
> I have not been.  Please have a look.  Thanks!

Ok, do you have any certainty about the earliest BIOS date where _PR3 is used? E.g. if the minimum date is lowered to 2014 without checking for _PR3, would it be likely to miss out some models?
Comment 17 Alex Deucher 2016-11-07 18:42:22 UTC
(In reply to Peter Wu from comment #16)
> Ok, do you have any certainty about the earliest BIOS date where _PR3 is
> used? E.g. if the minimum date is lowered to 2014 without checking for _PR3,
> would it be likely to miss out some models?

It's always used if it's available.  I think 2014 should be ok (should catch most of the early ones), but I'll check with our windows architects.  I suspect it will come down to OEM/ODMs.
Comment 18 Alex Deucher 2016-11-11 17:56:03 UTC
*** Bug 98687 has been marked as a duplicate of this bug. ***
Comment 19 Alex Deucher 2016-11-11 18:14:44 UTC
(In reply to Peter Wu from comment #16)
> Ok, do you have any certainty about the earliest BIOS date where _PR3 is
> used? E.g. if the minimum date is lowered to 2014 without checking for _PR3,
> would it be likely to miss out some models?

Can you add me to any relevant threads?  I'd like to help get this fixed for 4.9.
Comment 20 Peter Wu 2016-11-11 18:17:48 UTC
(In reply to Alex Deucher from comment #19)
> (In reply to Peter Wu from comment #16)
> > Ok, do you have any certainty about the earliest BIOS date where _PR3 is
> > used? E.g. if the minimum date is lowered to 2014 without checking for _PR3,
> > would it be likely to miss out some models?
> 
> Can you add me to any relevant threads?  I'd like to help get this fixed for
> 4.9.

I will, but did you get any feedback from the Windows architects yet? I'd like to get some solid reasons that support lowering the version requirement.
Comment 21 Alex Deucher 2016-11-11 18:49:36 UTC
(In reply to Peter Wu from comment #20)
> 
> I will, but did you get any feedback from the Windows architects yet? I'd
> like to get some solid reasons that support lowering the version requirement.

Still waiting to hear back if they have any opinion on the dates.  On windows it will always use PR3 if it's available.
Comment 22 Alex Deucher 2016-11-11 21:22:51 UTC
Update from our windows team: Hybrid graphics platforms using _PR3 were first available in Oct 2013 (when windows 8.1 was released).
Comment 23 Peter Wu 2016-11-12 23:40:29 UTC
Nayan, as a workaround you may add the pcie_port_pm=force option to your kernel command line, then you should not need the patch. Can you check this?

Alex, I don't think that the minimum date should change in 4.8 (and 4.9?) due to the risk of breakage since it is not just limited to amdgpu/radeon. Another concern is that while the year seems a good heuristic, it does not match the checks that Windows 8 performs (which may or may not be an issue):
https://msdn.microsoft.com/en-us/windows/hardware/drivers/bringup/firmware-requirements-for-d3cold

I'll bring this up with linux-pci developers after the weekend. Should I proceed with proposing workaround amdgpu/radeon patches?
Comment 24 Nayan Deshmukh 2016-11-13 06:41:26 UTC
Created attachment 127944 [details]
dmesg-with-pcie_port_pm=force

Without the patch and with the option pcie_port_pm=force, I get a bunch of errors on resume. I attaching the dmesg.
Comment 25 Peter Wu 2016-11-13 11:24:51 UTC
"Resume" means "runtime resume" as opposed to "resume from system sleep" right (based on the logs).

When this happens, what is the output of "sudo lspci -s4: -nnvv"? (is the "rev" field "ff" or something else? Does the Power Management register report D0 or D3?)

Can you reproduce the problem in this way:

 1. Blacklist amdgpu (and radeon if enabled)
 2. Boot system
 3. Load amdgpu
 4. Wait for runtime PM to kick in (5 seconds)
 5. Wake up (e.g. "lspci")

If so, can you build your kernel with CONFIG_ACPI_DEBUG=y and either boot with acpi.trace_state=enable? If that gives too many messages, you can also try setting the option between step 2 and 3:

 echo enable > /sys/module/acpi/parameters/trace_state

What is the dmesg for this? (At this moment it is not clear for me what the port PM is doing versus what amdgpu is doing, hopefully some logs will clarify this).
Comment 26 Nayan Deshmukh 2016-11-13 16:01:39 UTC
Created attachment 127953 [details]
dmesg-with-amdgpu-blacklisted

I meant runtime resume. I am attaching the dmesg after blacklisting amdgpu and loading it without the debug option. I will recompile the kernel with the debug options and upload it whenever I am free.
Comment 27 Nayan Deshmukh 2016-11-13 16:02:23 UTC
Created attachment 127954 [details]
sudo lspci -s4: -nnvv
Comment 28 Peter Wu 2016-11-14 10:52:43 UTC
Ok, when possible, please try to obtain a dmesg with CONFIG_ACPI_DEBUG=y and the trace_state option as described above.
Comment 29 Alex Deucher 2016-11-14 15:12:43 UTC
(In reply to Peter Wu from comment #23)
> Alex, I don't think that the minimum date should change in 4.8 (and 4.9?)
> due to the risk of breakage since it is not just limited to amdgpu/radeon.
> Another concern is that while the year seems a good heuristic, it does not
> match the checks that Windows 8 performs (which may or may not be an issue):
> https://msdn.microsoft.com/en-us/windows/hardware/drivers/bringup/firmware-
> requirements-for-d3cold

I agree that using the date is probably not a good idea, but it's what we have now.  Adjusting the date seems lower impact than changing the policy for these kernels.

> 
> I'll bring this up with linux-pci developers after the weekend. Should I
> proceed with proposing workaround amdgpu/radeon patches?

Sounds good.  I generally don't like hacking around this in the driver, but I guess it's better than nothing at this point.  Thanks for your help with this.
Comment 30 Nayan Deshmukh 2016-11-17 15:03:34 UTC
Created attachment 128042 [details]
dmesg_with_debug_options

Sorry for the delay, I compiled my kernel with CONFIG_DEBUG_ACPI=y followed the steps that you mentioned.
Comment 31 Peter Wu 2016-11-19 00:22:11 UTC
(In reply to Nayan Deshmukh from comment #30)
> Created attachment 128042 [details]
> dmesg_with_debug_options
> 
> Sorry for the delay, I compiled my kernel with CONFIG_DEBUG_ACPI=y followed
> the steps that you mentioned.

Have you enabled acpi.trace_state=enable (either via cmdline or before loading the module)? It does not appear in the logs.
Comment 32 Nayan Deshmukh 2016-11-19 02:08:01 UTC
Created attachment 128064 [details]
dmesg_with_debug_options
Comment 33 Peter Wu 2016-11-19 17:32:51 UTC
(In reply to Nayan Deshmukh from comment #32)
> Created attachment 128064 [details]
> dmesg_with_debug_options

These logs do not show the "resume failed" messages that appeared in the original logs. Could you try to trigger this?

The logs also shows no \_SB.PCI0.RP05.PC05._ON (or _OFF) method invocations, so pcie_port_pm=off (or force) is likely not going to make a difference here.

Btw, to increase the kernel log size from 256K you could pass log_buf_len=1M to your cmdline. Could you try obtaining a log where the "resume failed" messages are shown?

These messages also look suspicious, any idea what it could be? (Alex?)
[  221.378665] VI should always have 2 performance levels
[  221.380787]  failed to send message 18a ret is 255 
[  221.380794]  failed to send pre message 145 ret is 255 
[  221.621346]  failed to send message 146 ret is 0 
[  221.862589]  failed to send pre message 148 ret is 0 

They also appear in bug 97240 and bug 97548.
Comment 34 Nayan Deshmukh 2016-11-19 19:21:29 UTC
Created attachment 128075 [details]
dmesg-without_blacklist

Sorry for the previous logs they had an extra patch so please ignore them.
I can get the same error messages if the amdgpu is not blacklisted.
Comment 35 Nayan Deshmukh 2016-11-19 19:23:30 UTC
Created attachment 128076 [details]
dmesg_with_blacklist

This is the dmesg with blacklisting amdgpu, it also leads to similar errors but this is a bit different from the previous one.
Comment 36 Nayan Deshmukh 2016-12-05 13:49:16 UTC
The bug is fixed in 4.9-rc7. Thanks for the effort Peter.
Comment 37 Jari Tahvanainen 2016-12-05 19:32:00 UTC
Closing resolved+Fixed. Verified by Reporter.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.