Bug 104952 - [regression] Loss of i915 (Apollo Lake/Broxton) HDMI audio after display sleep v4.14+
Summary: [regression] Loss of i915 (Apollo Lake/Broxton) HDMI audio after display slee...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords: bisected, regression
Depends on:
Blocks:
 
Reported: 2018-02-05 14:44 UTC by Simon Casey
Modified: 2019-01-11 10:11 UTC (History)
3 users (show)

See Also:
i915 platform: BXT
i915 features: display/audio, display/DP, display/LSPCON


Attachments
Dmesg output from drm-tip (2e76a2952923eba64c4f9baf461613bc42ee997a) kernel booted with drm.debug=0x1e log_buf_len=1M (199.93 KB, text/plain)
2018-02-05 14:44 UTC, Simon Casey
no flags Details
dmesg from commit 42d9366d41a992 (154.76 KB, text/plain)
2018-02-06 11:42 UTC, Simon Casey
no flags Details
Debug patches (10.00 KB, application/x-tar)
2018-02-06 16:08 UTC, Imre Deak
no flags Details
Stack trace patch result from 42d9366d41a992 (156.20 KB, text/plain)
2018-02-06 20:47 UTC, Simon Casey
no flags Details
Save/restore AUD_FREQ_CNTRL (4.16 KB, patch)
2018-02-09 15:19 UTC, Imre Deak
no flags Details | Splinter Review
dmesg from drm-tip 2e76a2952923eb patched with attachment 137246 (184.64 KB, text/plain)
2018-02-09 18:17 UTC, Simon Casey
no flags Details
dmesg for 4.19.2-arch1-1-ARCH (71.05 KB, text/plain)
2018-11-26 02:10 UTC, Zesen
no flags Details

Description Simon Casey 2018-02-05 14:44:35 UTC
Created attachment 137170 [details]
Dmesg output from drm-tip (2e76a2952923eba64c4f9baf461613bc42ee997a) kernel booted with drm.debug=0x1e log_buf_len=1M

Starting with kernel 4.14.0, when DPMS turns the screen off, there is about a 50% chance that HDMI audio will break. This can be fixed by shutting down Xorg, removing and reloading module snd_hda_intel and then restarting Xorg.

When the loss occurs, the following appears in a non-debug demsg a few seconds after the DPMS off command is issued.

snd_hda_intel 0000:00:0e.0: azx_get_response timeout, switching to polling mode: last cmd=0x208f8100
snd_hda_intel 0000:00:0e.0: No response from codec, disabling MSI: last cmd=0x208f8100
snd_hda_intel 0000:00:0e.0: azx_get_response timeout, switching to single_cmd mode: last cmd=0x208f8100
snd_hda_codec_hdmi hdaudioC0D2: Unable to sync register 0x2f0d00. -5


Confirmed to happen in 4.14.0, 4.15.0 as well as the latest drm-tip (commit 2e76a2952923eba64c4f9baf461613bc42ee997a).


Using git bisect, it seems to have started with commit e8a3a2a3d7f7a194e0f0ad92c5dd636f908e7601 "drm/i915/gen9+: Don't remove secondary power well requests"

The previous commit 42d9366d41a992631abaa15f5a881ae1235a8203 does not exhibit this issue.

I've booted the same HDD in a Haswell (i5-4570T) system and could not reproduce the issue.

So far I only see this issue when running Xorg and Kodi (I don't have a regular desktop on this system but can install one if requested for testing). I've tried plain Xorg with no window manager, then starting either xterm or Google Chrome and haven't managed to reproduce the issue in that situation.

Timing seems to have some effect; sleeping after 60 seconds (60 seconds from Xorg starting, or 60 seconds from the last DPMS on command) never exhibited the issue after several tries. 90 seconds or more seems to be needed.

In the attached dmesg output, the following was performed (Kodi & Xorg started at boot, around T=8).

T=144
xset -display :0 dpms force off
This did not cause the issue.

T=174
xset -display :0 dpms force on
HDMI audio OK

T=294
xset -display :0 dpms force off
Here this issue appears, with the regular dmesg error (azx_get_response timeout) appearing at T=297.

T=349
xset -display :0 dpms force on
HDMI audio no longer functional.


System details:

Asrock J4205-ITX (pentium J4205)
Onboard LSPCON converting DP 2.0 to HDMI output. HDMI connection to Denon AVR X3100W

# uname -m
x86_64
# uname -r
4.15.0+

OS: Gentoo Linux with profile default/linux/amd64/17.0
Comment 1 Imre Deak 2018-02-05 15:38:40 UTC
Could you check the commits wrt. commit IDs? The ones you gave don't match. They are correctly:

e8a3a2a3d7f7a1 is "drm/i915/bxt, glk: Fix assert on conditions for DC9 enabling"

and

42d9366d41a992 is "drm/i915/gen9+: Don't remove secondary power well requests"

Which of the above ones did you get as the first bad commit?

If the bisect result is that e8a3a2a3d7f7a1 is the first bad commit then it's an incorrect result, since it only changes the condition for a WARN message (which doesn't trigger in any case in your log).

I also doubt that 42d9366d41a992 would be a good result. The only difference it'd make is if a secondary power well request was set, in which case this request wouldn't be removed after this commit. But then you'd have the

"Power well X forced on ..."

debug message, which you don't have in your log. So I don't see how this commit could make any difference either in your case.

All I can think of is some timing change, or that you didn't try enough to reproduce the issue with the commit preceding the bisect result.
Comment 2 Simon Casey 2018-02-05 15:55:01 UTC
My bisect went as follows: (good/bad referring to the commit above it)

aae3dbb4776e7916b6cd442d00159bea27a695c1 4.14-rc1
bad
bf1d6b2c76eda86159519bf5c427b1fa8f51f733
bad
edc2988c548db05e33b921fed15821010bc74895
bad
e5fa05b96b02f45a91a364d2121451a4da05cd84
bad
b9c55b6e2cc4369b0688961fa5de0e057f3ec0c4
bad
9f15a4ab19ab33658dbc9fd37be5210e8c1ac622
good
cdc1cc7e9ac076c9a10eccca76f86697bcf0f7a7
bad
846c6b26d38e56e5004f1d71d4c13226d2514750
good
ca91a2758fcef6635626993557dd51cfbb6dd134
bad
953152253e9cbd4f358d4b4ca56d48072af3846d
bad
6602be0e2c6163bd747d490d8875ef0812c11560
bad
42d9366d41a992631abaa15f5a881ae1235a8203
bad
e8a3a2a3d7f7a194e0f0ad92c5dd636f908e7601
good
42d9366d41a992631abaa15f5a881ae1235a8203 bad


Indeed I got the commit IDs e8a3a2a3d7f7a1 and 42d9366d41a992 the wrong way! I'll check e8a3a2a3d7f7a1 again to see If I can get it to appear. Most of the time it happens within two DMPS offs, but testing this morning to get the full debug dmesg I think I had about 10 error-free sleeps before the I got it to break for the included log...
Comment 3 Simon Casey 2018-02-06 11:41:54 UTC
So the issue does seems to be somehow triggered by 42d9366d41a992.

I set the machine up rebooting every 5 minutes overnight alternately into 42d9366d41a992 and e8a3a2a3d7f7a1.

Once booted the following was run

sleep 120 ; xset -display :0 dpms force off ; sleep 30 ; xset -display :0 dpms force on ; sleep 120 ; xset -display :0 dpms force off ; sleep 10

and then dmesg was stored before configuring to boot the alternate kernel and rebooting into it.

So two DPMS offs per boot.

e8a3a2a3d7f7a1 was booted 59 times and never produced the issue.

42d9366d41a992 was booted 58 times and 27 of these produced the issue. 17 times the issue appeared with the first DPMS off, the remaining 10 happened with the second DPMS off.

I'm attaching a debug dmesg from one of the 42d9366d41a992 runs where the issue happened on the 2nd sleep. The original dmesg attachment was from one of the latest drm-tip commits so I though it could be helpful to have the output from what appears to be the first commit showing the issue.
Comment 4 Simon Casey 2018-02-06 11:42:48 UTC
Created attachment 137185 [details]
dmesg from commit 42d9366d41a992
Comment 5 Imre Deak 2018-02-06 16:08:00 UTC
Created attachment 137189 [details]
Debug patches

In reply to Simon Casey from comment #3)
> So the issue does seems to be somehow triggered by 42d9366d41a992.

Ok, thanks for the effort.

I still suspect that some unintended dependency on timing causes this. Looking at azx_rirb_get_response() it's obvious there is some magic involved when accessing the audio codec. So based on that I'd like to ask you to try a few debug patches on top of 42d9366d4 I attached to the bug to narrow down the root cause.

First, could you reproduce the bug and get the stacktrace with
0001-sound-hda-Print-stack-dump-during-probe-failure.patch
?

One thing that 42d9366d4 changes is that intel_wait_for_register() is used instead of wait_for() to wait for the power well enabled status. This means we'll potentially complete the wait faster than before, as intel_wait_for_register() uses a 2 usec atomic wait before doing a sleeping wait. But that should be fine and nothing should depend on any delays there. To check for such a dependency, could you try if the problem goes away with
0002-drm-i915-Don-t-spin-wait-during-power-well-enabling.patch
(without the other patches)?

If with the above you can't reproduce the problem, could you try the same with
0002-drm-i915-Don-t-spin-wait-during-power-well-enabling.patch
(again without the other patches)?

Thanks.
Comment 6 Imre Deak 2018-02-06 16:37:25 UTC
(In reply to Imre Deak from comment #5)
> ...
> If with the above you can't reproduce the problem, could you try the same
> with
> 0002-drm-i915-Don-t-spin-wait-during-power-well-enabling.patch
> (again without the other patches)?

Err, I meant here
0003-drm-i915-Wait-after-audio-power-enable.patch
Comment 7 Simon Casey 2018-02-06 18:14:58 UTC
I'll certainly give those patches a try and report back.

In the meantime I rebuilt both e8a3a2a3d7f7a1 and 42d9366d41a992 with gcc 7.3.0 previously used 6.4.0) and run a few passes of the last test.

Essentially it seems 42d9366d41a992 is now less prone to show the issue, around 1/3 of tries were affected). But also e8a3a2a3d7f7a1 has now exhibited it in 3/16 runs.

I've also rebuilt 4.13.16 with 7.3.0, and it's survived 16 passes without issue, not that it necessarily means it's free of the issue or not...
Comment 8 Imre Deak 2018-02-06 18:25:46 UTC
(In reply to Simon Casey from comment #7)
> I'll certainly give those patches a try and report back.
> 
> In the meantime I rebuilt both e8a3a2a3d7f7a1 and 42d9366d41a992 with gcc
> 7.3.0 previously used 6.4.0) and run a few passes of the last test.
> 
> Essentially it seems 42d9366d41a992 is now less prone to show the issue,
> around 1/3 of tries were affected). But also e8a3a2a3d7f7a1 has now
> exhibited it in 3/16 runs.

Ah, ok so 42d9366d41a992 is not the culprit. So maybe no use in trying my debug patches. Ville in the meantime told me there is a known problem with the codec probing if the CDCLK rate is too low. So could you instead try if the following fixes it (on top of 42d9366d41a992 let's say and/or drm-tip):

diff --git a/drivers/gpu/drm/i915/intel_cdclk.c b/drivers/gpu/drm/i915/intel_cdclk.c
index 1241e5891b29..03ebd6589b07 100644
--- a/drivers/gpu/drm/i915/intel_cdclk.c
+++ b/drivers/gpu/drm/i915/intel_cdclk.c
@@ -1083,10 +1083,8 @@ static int bxt_calc_cdclk(int max_pixclk)
 		return 576000;
 	else if (max_pixclk > 288000)
 		return 384000;
-	else if (max_pixclk > 144000)
-		return 288000;
 	else
-		return 144000;
+		return 288000;
 }
 
 static int glk_calc_cdclk(int max_pixclk)


> I've also rebuilt 4.13.16 with 7.3.0, and it's survived 16 passes without
> issue, not that it necessarily means it's free of the issue or not...
Comment 9 Simon Casey 2018-02-06 18:42:38 UTC
> Ah, ok so 42d9366d41a992 is not the culprit. So maybe no use in trying my
> debug patches. Ville in the meantime told me there is a known problem with
> the codec probing if the CDCLK rate is too low. So could you instead try if
> the following fixes it (on top of 42d9366d41a992 let's say and/or drm-tip):
> 
> diff --git a/drivers/gpu/drm/i915/intel_cdclk.c
> b/drivers/gpu/drm/i915/intel_cdclk.c
> index 1241e5891b29..03ebd6589b07 100644
> --- a/drivers/gpu/drm/i915/intel_cdclk.c
> +++ b/drivers/gpu/drm/i915/intel_cdclk.c
> @@ -1083,10 +1083,8 @@ static int bxt_calc_cdclk(int max_pixclk)
>  		return 576000;
>  	else if (max_pixclk > 288000)
>  		return 384000;
> -	else if (max_pixclk > 144000)
> -		return 288000;
>  	else
> -		return 144000;
> +		return 288000;
>  }
>  
>  static int glk_calc_cdclk(int max_pixclk)
> 
> 
> > I've also rebuilt 4.13.16 with 7.3.0, and it's survived 16 passes without
> > issue, not that it necessarily means it's free of the issue or not...

Looking at drm-tip, max_pixclk has morphed into min_pixclk in bxt_calc_cdclk(). What do you suggest I change for drm-tip? I'll ping pong that against a patched 42d9366d41a992.
Comment 10 Simon Casey 2018-02-06 18:48:39 UTC
> Looking at drm-tip, max_pixclk has morphed into min_pixclk in
> bxt_calc_cdclk(). What do you suggest I change for drm-tip? I'll ping pong
> that against a patched 42d9366d41a992.

Of course I mean min_cdclk not min_pixclk... I guess ignoring the variable name the if/else otherwise looks identical to 42d9366d41a992 so will fudge it to do the same :)
Comment 11 Simon Casey 2018-02-06 20:47:11 UTC
Created attachment 137196 [details]
Stack trace patch result from 42d9366d41a992

Attached is the dmesg including stack trace from a failure on 42d9366d41a992. It also had the bxt_calc_cdclk() patch applied.

I tried the bxt_calc_cdclk() patch on both 42d9366d41a992 and latest drm-tip but the issue is still present.

Since the issue isn't created by 42d9366d41a992 I haven't tried the 0002 and 0003 patches. May try a new bisect tomorrow and see if that produces up anything better.
Comment 12 Jani Nikula 2018-02-07 12:38:18 UTC
Imre, was looking at bspec for other stuff, might bspec id 19620 be related?
Comment 13 Imre Deak 2018-02-09 15:19:43 UTC
Created attachment 137246 [details] [review]
Save/restore AUD_FREQ_CNTRL

(In reply to Jani Nikula from comment #12)
> Imre, was looking at bspec for other stuff, might bspec id 19620 be related?

Most of the context save/restore described there is done (or should be done) by the audio driver; I haven't checked that part in detail, I hope the audio people can do that. Two things from the save/restore sequence must be done on the display driver side (since the registers belong to it): programming HSW_AUD_CHICKENBIT and saving/restoring SKL_AUD_FREQ_CNTRL . I haven't found anything in the driver about the latter one.

Simon could you try if the attached patch makes a difference for you (providing the dmesg in any case)?
Comment 14 Simon Casey 2018-02-09 18:17:52 UTC
Created attachment 137248 [details]
dmesg from drm-tip 2e76a2952923eb patched with attachment 137246 [details] [review]

Here's dmesg as requested.

I realised i wasn't running 'make clean' after each bisect stage previously, Almost complete with a new bisect but it seems to be homing in on a dead end again... Only commits left are from 55a0b9d70a6c57 to edeb729f7929d1, none of which look at all related.
Comment 15 Imre Deak 2018-02-09 18:31:56 UTC
(In reply to Simon Casey from comment #14)
> Created attachment 137248 [details]
> dmesg from drm-tip 2e76a2952923eb patched with attachment 137246 [details] [review]
> [review]
> 
> Here's dmesg as requested.

Thanks, so AUD_FREQ_CNTRL has just the reset value, so (for you at least) the lack of save/restore doesn't matter.

> 
> I realised i wasn't running 'make clean' after each bisect stage previously,
> Almost complete with a new bisect but it seems to be homing in on a dead end
> again... Only commits left are from 55a0b9d70a6c57 to edeb729f7929d1, none
> of which look at all related.
Comment 16 Jani Saarinen 2018-03-29 07:09:57 UTC
First of all. Sorry about spam.
This is mass update for our bugs. 

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.
Comment 17 Simon Casey 2018-03-30 18:11:36 UTC
I've just tested 4.16-rc7 and the issue is still present.

I continued trying bisecting but didn't get anywhere.

The first post 4.13 commit is 69c0067aa3f40d and is fine. The next merge is from drm, 906dde0f355bd9, and shows the issue.

Bisecting between the two just doesn't seem possible. The very first commit from the drm merge, fc5e9d63a8db40, shows the issue yet this commit is completely unrelated to i915.

However I later realised that jumping to fc5e9d63a8db40 produces a kernel snapshot at the point of that revision, rather than simply 69c0067aa3f40d plus fc5e9d63a8db40. The snapshot identifies as 4.12.0-rc1+, so I don't understand how this older snapshot has the issue, whilst 4.13 (and 4.12) don't have the issue.

Any suggestions how I an better locate this bug?
Comment 18 Jani Nikula 2018-04-06 09:27:48 UTC
Please try current drm-tip branch of https://cgit.freedesktop.org/drm/drm-tip

In particular, this could be fixed by

commit b651bd2a3ae33f6a98d6ea36ef2518f5211bdc4f
Author: Gaurav K Singh <gaurav.k.singh@intel.com>
Date:   Thu Apr 5 22:12:24 2018 +0530

    drm/i915/audio: Fix audio enumeration issue on BXT
Comment 19 Simon Casey 2018-04-09 17:19:59 UTC
I added the one-liner from b651bd2a3ae33f6a98d6ea36ef2518f5211bdc4f to both 4.16-rc7 and 4.16 (both of which showed the issue), and not quite as impressive as 25000 passes but booting between the 2 modded kernels with this patch and issuing 936 sleep events over the weekend, all came back successfully with hdmi audio!

So, I'm happy to call it cased closed here :)

Thanks to you all for looking into this!
Comment 20 Jani Saarinen 2018-04-09 18:52:12 UTC
Thank you for confirming, closing
Comment 21 Zesen 2018-11-26 02:10:21 UTC
Is this patch merged upstream? Because I'm still having the same issue.

Linux bedroom.riaqn.org 4.19.2-arch1-1-ARCH #1 SMP PREEMPT Tue Nov 13 21:16:19 UTC 2018 x86_64 GNU/Linux


I also added my dmesg to attachments.
Comment 22 Zesen 2018-11-26 02:10:57 UTC
Created attachment 142610 [details]
dmesg for 4.19.2-arch1-1-ARCH
Comment 23 me 2018-12-14 17:27:10 UTC
Getting the same error in 4.4 on Kaby Lake.
Is it related to this issue or is it something else?

[    3.021279] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915_bpo])
[    3.068792] snd_hda_codec_realtek hdaudioC0D0: autoconfig for ALC892: line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:line
[    3.068795] snd_hda_codec_realtek hdaudioC0D0:    speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[    3.068796] snd_hda_codec_realtek hdaudioC0D0:    hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
[    3.068797] snd_hda_codec_realtek hdaudioC0D0:    mono: mono_out=0x0
[    3.068798] snd_hda_codec_realtek hdaudioC0D0:    inputs:
[    3.068800] snd_hda_codec_realtek hdaudioC0D0:      Front Mic=0x19
[    3.068801] snd_hda_codec_realtek hdaudioC0D0:      Rear Mic=0x18
[    3.068802] snd_hda_codec_realtek hdaudioC0D0:      Line=0x1a
[    3.092264] input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[    3.102750] input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[    3.102795] input: HDA Intel PCH Line Out as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
[    3.102840] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15
[    3.102887] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16
[    3.102926] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17

. . .

[ 2720.133324] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x206f0100
[ 2721.141237] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x206f0100
[ 2722.149139] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to single_cmd mode: last cmd=0x206f0100
[ 2722.149270] azx_single_wait_for_response: 69 callbacks suppressed
Comment 24 Jani Nikula 2018-12-28 09:12:30 UTC
(In reply to Zesen from comment #21)
> Is this patch merged upstream? Because I'm still having the same issue.

You've got GLK, which should've been fixed by

commit b4615730530be85fc45ab4631c2ad6d8e2d0b97d
Author: Gaurav K Singh <gaurav.k.singh@intel.com>
Date:   Tue Apr 17 23:52:18 2018 +0530

    drm/i915/audio: Fix audio detection issue on GLK

which is in v4.17 and should've been backported to stable.

Please file a new bug. Add drm.debug=14 module parameter, and attach dmesg to that bug.
Comment 25 Jani Nikula 2018-12-28 09:15:58 UTC
(In reply to me from comment #23)
> Getting the same error in 4.4 on Kaby Lake.
> Is it related to this issue or is it something else?
> 
> [    3.021279] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops
> i915_audio_component_bind_ops [i915_bpo])

You've got an old distro backported version of the driver, not supported by upstream. You'll need at least upstream kernel version v4.5, but preferrably update to a recent kernel.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.