Summary: | [regression] Loss of i915 (Apollo Lake/Broxton) HDMI audio after display sleep v4.14+ | ||
---|---|---|---|
Product: | DRI | Reporter: | Simon Casey <simon501098c> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | imre.deak, intel-gfx-bugs, ville.syrjala |
Version: | XOrg git | Keywords: | bisected, regression |
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | BXT | i915 features: | display/audio, display/DP, display/LSPCON |
Attachments: |
Could you check the commits wrt. commit IDs? The ones you gave don't match. They are correctly: e8a3a2a3d7f7a1 is "drm/i915/bxt, glk: Fix assert on conditions for DC9 enabling" and 42d9366d41a992 is "drm/i915/gen9+: Don't remove secondary power well requests" Which of the above ones did you get as the first bad commit? If the bisect result is that e8a3a2a3d7f7a1 is the first bad commit then it's an incorrect result, since it only changes the condition for a WARN message (which doesn't trigger in any case in your log). I also doubt that 42d9366d41a992 would be a good result. The only difference it'd make is if a secondary power well request was set, in which case this request wouldn't be removed after this commit. But then you'd have the "Power well X forced on ..." debug message, which you don't have in your log. So I don't see how this commit could make any difference either in your case. All I can think of is some timing change, or that you didn't try enough to reproduce the issue with the commit preceding the bisect result. My bisect went as follows: (good/bad referring to the commit above it) aae3dbb4776e7916b6cd442d00159bea27a695c1 4.14-rc1 bad bf1d6b2c76eda86159519bf5c427b1fa8f51f733 bad edc2988c548db05e33b921fed15821010bc74895 bad e5fa05b96b02f45a91a364d2121451a4da05cd84 bad b9c55b6e2cc4369b0688961fa5de0e057f3ec0c4 bad 9f15a4ab19ab33658dbc9fd37be5210e8c1ac622 good cdc1cc7e9ac076c9a10eccca76f86697bcf0f7a7 bad 846c6b26d38e56e5004f1d71d4c13226d2514750 good ca91a2758fcef6635626993557dd51cfbb6dd134 bad 953152253e9cbd4f358d4b4ca56d48072af3846d bad 6602be0e2c6163bd747d490d8875ef0812c11560 bad 42d9366d41a992631abaa15f5a881ae1235a8203 bad e8a3a2a3d7f7a194e0f0ad92c5dd636f908e7601 good 42d9366d41a992631abaa15f5a881ae1235a8203 bad Indeed I got the commit IDs e8a3a2a3d7f7a1 and 42d9366d41a992 the wrong way! I'll check e8a3a2a3d7f7a1 again to see If I can get it to appear. Most of the time it happens within two DMPS offs, but testing this morning to get the full debug dmesg I think I had about 10 error-free sleeps before the I got it to break for the included log... So the issue does seems to be somehow triggered by 42d9366d41a992. I set the machine up rebooting every 5 minutes overnight alternately into 42d9366d41a992 and e8a3a2a3d7f7a1. Once booted the following was run sleep 120 ; xset -display :0 dpms force off ; sleep 30 ; xset -display :0 dpms force on ; sleep 120 ; xset -display :0 dpms force off ; sleep 10 and then dmesg was stored before configuring to boot the alternate kernel and rebooting into it. So two DPMS offs per boot. e8a3a2a3d7f7a1 was booted 59 times and never produced the issue. 42d9366d41a992 was booted 58 times and 27 of these produced the issue. 17 times the issue appeared with the first DPMS off, the remaining 10 happened with the second DPMS off. I'm attaching a debug dmesg from one of the 42d9366d41a992 runs where the issue happened on the 2nd sleep. The original dmesg attachment was from one of the latest drm-tip commits so I though it could be helpful to have the output from what appears to be the first commit showing the issue. Created attachment 137185 [details]
dmesg from commit 42d9366d41a992
Created attachment 137189 [details] Debug patches In reply to Simon Casey from comment #3) > So the issue does seems to be somehow triggered by 42d9366d41a992. Ok, thanks for the effort. I still suspect that some unintended dependency on timing causes this. Looking at azx_rirb_get_response() it's obvious there is some magic involved when accessing the audio codec. So based on that I'd like to ask you to try a few debug patches on top of 42d9366d4 I attached to the bug to narrow down the root cause. First, could you reproduce the bug and get the stacktrace with 0001-sound-hda-Print-stack-dump-during-probe-failure.patch ? One thing that 42d9366d4 changes is that intel_wait_for_register() is used instead of wait_for() to wait for the power well enabled status. This means we'll potentially complete the wait faster than before, as intel_wait_for_register() uses a 2 usec atomic wait before doing a sleeping wait. But that should be fine and nothing should depend on any delays there. To check for such a dependency, could you try if the problem goes away with 0002-drm-i915-Don-t-spin-wait-during-power-well-enabling.patch (without the other patches)? If with the above you can't reproduce the problem, could you try the same with 0002-drm-i915-Don-t-spin-wait-during-power-well-enabling.patch (again without the other patches)? Thanks. (In reply to Imre Deak from comment #5) > ... > If with the above you can't reproduce the problem, could you try the same > with > 0002-drm-i915-Don-t-spin-wait-during-power-well-enabling.patch > (again without the other patches)? Err, I meant here 0003-drm-i915-Wait-after-audio-power-enable.patch I'll certainly give those patches a try and report back. In the meantime I rebuilt both e8a3a2a3d7f7a1 and 42d9366d41a992 with gcc 7.3.0 previously used 6.4.0) and run a few passes of the last test. Essentially it seems 42d9366d41a992 is now less prone to show the issue, around 1/3 of tries were affected). But also e8a3a2a3d7f7a1 has now exhibited it in 3/16 runs. I've also rebuilt 4.13.16 with 7.3.0, and it's survived 16 passes without issue, not that it necessarily means it's free of the issue or not... (In reply to Simon Casey from comment #7) > I'll certainly give those patches a try and report back. > > In the meantime I rebuilt both e8a3a2a3d7f7a1 and 42d9366d41a992 with gcc > 7.3.0 previously used 6.4.0) and run a few passes of the last test. > > Essentially it seems 42d9366d41a992 is now less prone to show the issue, > around 1/3 of tries were affected). But also e8a3a2a3d7f7a1 has now > exhibited it in 3/16 runs. Ah, ok so 42d9366d41a992 is not the culprit. So maybe no use in trying my debug patches. Ville in the meantime told me there is a known problem with the codec probing if the CDCLK rate is too low. So could you instead try if the following fixes it (on top of 42d9366d41a992 let's say and/or drm-tip): diff --git a/drivers/gpu/drm/i915/intel_cdclk.c b/drivers/gpu/drm/i915/intel_cdclk.c index 1241e5891b29..03ebd6589b07 100644 --- a/drivers/gpu/drm/i915/intel_cdclk.c +++ b/drivers/gpu/drm/i915/intel_cdclk.c @@ -1083,10 +1083,8 @@ static int bxt_calc_cdclk(int max_pixclk) return 576000; else if (max_pixclk > 288000) return 384000; - else if (max_pixclk > 144000) - return 288000; else - return 144000; + return 288000; } static int glk_calc_cdclk(int max_pixclk) > I've also rebuilt 4.13.16 with 7.3.0, and it's survived 16 passes without > issue, not that it necessarily means it's free of the issue or not... > Ah, ok so 42d9366d41a992 is not the culprit. So maybe no use in trying my
> debug patches. Ville in the meantime told me there is a known problem with
> the codec probing if the CDCLK rate is too low. So could you instead try if
> the following fixes it (on top of 42d9366d41a992 let's say and/or drm-tip):
>
> diff --git a/drivers/gpu/drm/i915/intel_cdclk.c
> b/drivers/gpu/drm/i915/intel_cdclk.c
> index 1241e5891b29..03ebd6589b07 100644
> --- a/drivers/gpu/drm/i915/intel_cdclk.c
> +++ b/drivers/gpu/drm/i915/intel_cdclk.c
> @@ -1083,10 +1083,8 @@ static int bxt_calc_cdclk(int max_pixclk)
> return 576000;
> else if (max_pixclk > 288000)
> return 384000;
> - else if (max_pixclk > 144000)
> - return 288000;
> else
> - return 144000;
> + return 288000;
> }
>
> static int glk_calc_cdclk(int max_pixclk)
>
>
> > I've also rebuilt 4.13.16 with 7.3.0, and it's survived 16 passes without
> > issue, not that it necessarily means it's free of the issue or not...
Looking at drm-tip, max_pixclk has morphed into min_pixclk in bxt_calc_cdclk(). What do you suggest I change for drm-tip? I'll ping pong that against a patched 42d9366d41a992.
> Looking at drm-tip, max_pixclk has morphed into min_pixclk in
> bxt_calc_cdclk(). What do you suggest I change for drm-tip? I'll ping pong
> that against a patched 42d9366d41a992.
Of course I mean min_cdclk not min_pixclk... I guess ignoring the variable name the if/else otherwise looks identical to 42d9366d41a992 so will fudge it to do the same :)
Created attachment 137196 [details]
Stack trace patch result from 42d9366d41a992
Attached is the dmesg including stack trace from a failure on 42d9366d41a992. It also had the bxt_calc_cdclk() patch applied.
I tried the bxt_calc_cdclk() patch on both 42d9366d41a992 and latest drm-tip but the issue is still present.
Since the issue isn't created by 42d9366d41a992 I haven't tried the 0002 and 0003 patches. May try a new bisect tomorrow and see if that produces up anything better.
Imre, was looking at bspec for other stuff, might bspec id 19620 be related? Created attachment 137246 [details] [review] Save/restore AUD_FREQ_CNTRL (In reply to Jani Nikula from comment #12) > Imre, was looking at bspec for other stuff, might bspec id 19620 be related? Most of the context save/restore described there is done (or should be done) by the audio driver; I haven't checked that part in detail, I hope the audio people can do that. Two things from the save/restore sequence must be done on the display driver side (since the registers belong to it): programming HSW_AUD_CHICKENBIT and saving/restoring SKL_AUD_FREQ_CNTRL . I haven't found anything in the driver about the latter one. Simon could you try if the attached patch makes a difference for you (providing the dmesg in any case)? Created attachment 137248 [details] dmesg from drm-tip 2e76a2952923eb patched with attachment 137246 [details] [review] Here's dmesg as requested. I realised i wasn't running 'make clean' after each bisect stage previously, Almost complete with a new bisect but it seems to be homing in on a dead end again... Only commits left are from 55a0b9d70a6c57 to edeb729f7929d1, none of which look at all related. (In reply to Simon Casey from comment #14) > Created attachment 137248 [details] > dmesg from drm-tip 2e76a2952923eb patched with attachment 137246 [details] [review] > [review] > > Here's dmesg as requested. Thanks, so AUD_FREQ_CNTRL has just the reset value, so (for you at least) the lack of save/restore doesn't matter. > > I realised i wasn't running 'make clean' after each bisect stage previously, > Almost complete with a new bisect but it seems to be homing in on a dead end > again... Only commits left are from 55a0b9d70a6c57 to edeb729f7929d1, none > of which look at all related. First of all. Sorry about spam. This is mass update for our bugs. Sorry if you feel this annoying but with this trying to understand if bug still valid or not. If bug investigation still in progress, please ignore this and I apologize! If you think this is not anymore valid, please comment to the bug that can be closed. If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug. I've just tested 4.16-rc7 and the issue is still present. I continued trying bisecting but didn't get anywhere. The first post 4.13 commit is 69c0067aa3f40d and is fine. The next merge is from drm, 906dde0f355bd9, and shows the issue. Bisecting between the two just doesn't seem possible. The very first commit from the drm merge, fc5e9d63a8db40, shows the issue yet this commit is completely unrelated to i915. However I later realised that jumping to fc5e9d63a8db40 produces a kernel snapshot at the point of that revision, rather than simply 69c0067aa3f40d plus fc5e9d63a8db40. The snapshot identifies as 4.12.0-rc1+, so I don't understand how this older snapshot has the issue, whilst 4.13 (and 4.12) don't have the issue. Any suggestions how I an better locate this bug? Please try current drm-tip branch of https://cgit.freedesktop.org/drm/drm-tip In particular, this could be fixed by commit b651bd2a3ae33f6a98d6ea36ef2518f5211bdc4f Author: Gaurav K Singh <gaurav.k.singh@intel.com> Date: Thu Apr 5 22:12:24 2018 +0530 drm/i915/audio: Fix audio enumeration issue on BXT I added the one-liner from b651bd2a3ae33f6a98d6ea36ef2518f5211bdc4f to both 4.16-rc7 and 4.16 (both of which showed the issue), and not quite as impressive as 25000 passes but booting between the 2 modded kernels with this patch and issuing 936 sleep events over the weekend, all came back successfully with hdmi audio! So, I'm happy to call it cased closed here :) Thanks to you all for looking into this! Thank you for confirming, closing Is this patch merged upstream? Because I'm still having the same issue. Linux bedroom.riaqn.org 4.19.2-arch1-1-ARCH #1 SMP PREEMPT Tue Nov 13 21:16:19 UTC 2018 x86_64 GNU/Linux I also added my dmesg to attachments. Created attachment 142610 [details]
dmesg for 4.19.2-arch1-1-ARCH
Getting the same error in 4.4 on Kaby Lake. Is it related to this issue or is it something else? [ 3.021279] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915_bpo]) [ 3.068792] snd_hda_codec_realtek hdaudioC0D0: autoconfig for ALC892: line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:line [ 3.068795] snd_hda_codec_realtek hdaudioC0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0) [ 3.068796] snd_hda_codec_realtek hdaudioC0D0: hp_outs=1 (0x1b/0x0/0x0/0x0/0x0) [ 3.068797] snd_hda_codec_realtek hdaudioC0D0: mono: mono_out=0x0 [ 3.068798] snd_hda_codec_realtek hdaudioC0D0: inputs: [ 3.068800] snd_hda_codec_realtek hdaudioC0D0: Front Mic=0x19 [ 3.068801] snd_hda_codec_realtek hdaudioC0D0: Rear Mic=0x18 [ 3.068802] snd_hda_codec_realtek hdaudioC0D0: Line=0x1a [ 3.092264] input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12 [ 3.102750] input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13 [ 3.102795] input: HDA Intel PCH Line Out as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14 [ 3.102840] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15 [ 3.102887] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16 [ 3.102926] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17 . . . [ 2720.133324] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x206f0100 [ 2721.141237] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x206f0100 [ 2722.149139] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to single_cmd mode: last cmd=0x206f0100 [ 2722.149270] azx_single_wait_for_response: 69 callbacks suppressed (In reply to Zesen from comment #21) > Is this patch merged upstream? Because I'm still having the same issue. You've got GLK, which should've been fixed by commit b4615730530be85fc45ab4631c2ad6d8e2d0b97d Author: Gaurav K Singh <gaurav.k.singh@intel.com> Date: Tue Apr 17 23:52:18 2018 +0530 drm/i915/audio: Fix audio detection issue on GLK which is in v4.17 and should've been backported to stable. Please file a new bug. Add drm.debug=14 module parameter, and attach dmesg to that bug. (In reply to me from comment #23) > Getting the same error in 4.4 on Kaby Lake. > Is it related to this issue or is it something else? > > [ 3.021279] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops > i915_audio_component_bind_ops [i915_bpo]) You've got an old distro backported version of the driver, not supported by upstream. You'll need at least upstream kernel version v4.5, but preferrably update to a recent kernel. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 137170 [details] Dmesg output from drm-tip (2e76a2952923eba64c4f9baf461613bc42ee997a) kernel booted with drm.debug=0x1e log_buf_len=1M Starting with kernel 4.14.0, when DPMS turns the screen off, there is about a 50% chance that HDMI audio will break. This can be fixed by shutting down Xorg, removing and reloading module snd_hda_intel and then restarting Xorg. When the loss occurs, the following appears in a non-debug demsg a few seconds after the DPMS off command is issued. snd_hda_intel 0000:00:0e.0: azx_get_response timeout, switching to polling mode: last cmd=0x208f8100 snd_hda_intel 0000:00:0e.0: No response from codec, disabling MSI: last cmd=0x208f8100 snd_hda_intel 0000:00:0e.0: azx_get_response timeout, switching to single_cmd mode: last cmd=0x208f8100 snd_hda_codec_hdmi hdaudioC0D2: Unable to sync register 0x2f0d00. -5 Confirmed to happen in 4.14.0, 4.15.0 as well as the latest drm-tip (commit 2e76a2952923eba64c4f9baf461613bc42ee997a). Using git bisect, it seems to have started with commit e8a3a2a3d7f7a194e0f0ad92c5dd636f908e7601 "drm/i915/gen9+: Don't remove secondary power well requests" The previous commit 42d9366d41a992631abaa15f5a881ae1235a8203 does not exhibit this issue. I've booted the same HDD in a Haswell (i5-4570T) system and could not reproduce the issue. So far I only see this issue when running Xorg and Kodi (I don't have a regular desktop on this system but can install one if requested for testing). I've tried plain Xorg with no window manager, then starting either xterm or Google Chrome and haven't managed to reproduce the issue in that situation. Timing seems to have some effect; sleeping after 60 seconds (60 seconds from Xorg starting, or 60 seconds from the last DPMS on command) never exhibited the issue after several tries. 90 seconds or more seems to be needed. In the attached dmesg output, the following was performed (Kodi & Xorg started at boot, around T=8). T=144 xset -display :0 dpms force off This did not cause the issue. T=174 xset -display :0 dpms force on HDMI audio OK T=294 xset -display :0 dpms force off Here this issue appears, with the regular dmesg error (azx_get_response timeout) appearing at T=297. T=349 xset -display :0 dpms force on HDMI audio no longer functional. System details: Asrock J4205-ITX (pentium J4205) Onboard LSPCON converting DP 2.0 to HDMI output. HDMI connection to Denon AVR X3100W # uname -m x86_64 # uname -r 4.15.0+ OS: Gentoo Linux with profile default/linux/amd64/17.0