Bug 99578

Summary: [SKL][bisected] Screen regularly returns from power save mode
Product: DRI Reporter: Patrick Steinhardt <ps>
Component: DRM/IntelAssignee: Patrick Steinhardt <ps>
Status: CLOSED INVALID QA Contact: Luis Botello <luis.botello.ortega>
Severity: critical    
Priority: highest CC: intel-gfx-bugs
Version: XOrg gitKeywords: bisected, regression
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
i915 platform: SKL i915 features: firmware/dmc, power/Other
Description Flags
Bisect log
Kernel configuration
dmesg drm-tip
dmesg with i915.enable_dc=0
dmesg with i915.enable_dc=-1
dmesg without i915.enable_dc
Xorg log
i915 runtime pm status with i915.enable_dc=0
i915 runtime pm status with i915.enable_dc=-1
dmesg with i915.enable_dc=0 drm.debug=0xe
dmesg with i915.enable_dc=-1 drm.debug=0xe
dmesg with i915.enable_dc=0 drm.debug=0xe none

Description Patrick Steinhardt 2017-01-28 16:50:43 UTC

Inside of X11, when the screen saver is activated, the monitor turns off correctly after a short amount of time. But as of recently, the monitor will turn back on again in regular intervals (roughly every 90 seconds) and then after a few seconds turn off again.

Some details on my setup: The issue has been surfacing since Linux v4.9. I'll attach my kernel configuration. I'm using xorg-server-1.19.1 with the built-in modesetting driver. My hardware includes both an Nvidia Geforce 750Ti (which is unused and stubbed out via vfio-pci) and an Intel 6600k CPU. There are two monitors connected to the iGPU, one via Display Port and one via HDMI. The issue has been occurring on the monitor connected via Display Port, though I haven't actually checked if the other one behaves similarly. The issue persists if the HDMI-connected screen is disabled.

As said, the issue started to appear in Linux v4.9 and persists at least until v4.9.6. I think I also had it occurring on v4.10-rc1, even though I'm not totally sure here. I've bisected the issue to commit 536ab3ca19ef (drm/i915: Fix botched merge that downgrades CSR versions., 2016-08-15), which updates the Intel firmware version to `i915/skl_dmc_ver1_26`. So it seems related to a bug in the updated firmware.

Reproducing the issue is easy on my computer:

1. Turn the computer on, have the DP display connected
2. `xset s 1` to quickly enable the screen saver
3. Wait a short time span. Usually the bug occurs after roughly 90 seconds where the screen will turn on again, simply saying "No signal"
4. The screen will turn off again shortly after

I'll attach the bisect log, kernel configuration as well as dmesg. dmesg does not say anything interesting though as soon as the issue occurs.

Thanks and regards
Comment 1 Patrick Steinhardt 2017-01-28 16:50:59 UTC
Created attachment 129195 [details]
Bisect log
Comment 2 Patrick Steinhardt 2017-01-28 16:51:30 UTC
Created attachment 129196 [details]
Kernel configuration
Comment 3 Patrick Steinhardt 2017-01-28 16:52:20 UTC
Created attachment 129197 [details]
Comment 4 Patrick Steinhardt 2017-01-28 16:54:37 UTC
I just noticed thet there now has been some interesting information spewed out by dmesg regarding problems with the EDID. For this boot, I've reproduced the issue two times: first around the 120 seconds mark, where there is no info available, and a second time later on. I don't exactly know though if the information logged to dmesg and the second reproduction were actually at the same time, but it's possible.
Comment 5 Patrick Steinhardt 2017-01-28 17:09:45 UTC
I just did a run with the latest drm-tip branch, the issue persists. But actually, there's a warning being triggered with 8c3608fc86d0 (drm-tip: 2017y-01m-27d-19h-56m-27s UTC integration manifest, 2017-01-27).

I'll add the new dmesg log
Comment 6 Patrick Steinhardt 2017-01-28 17:10:11 UTC
Created attachment 129198 [details]
dmesg drm-tip
Comment 7 Tomeu Vizoso 2017-02-06 13:15:36 UTC
(In reply to Patrick Steinhardt from comment #0)
> As said, the issue started to appear in Linux v4.9 and persists at least
> until v4.9.6. I think I also had it occurring on v4.10-rc1, even though I'm
> not totally sure here. I've bisected the issue to commit 536ab3ca19ef
> (drm/i915: Fix botched merge that downgrades CSR versions., 2016-08-15),
> which updates the Intel firmware version to `i915/skl_dmc_ver1_26`. So it
> seems related to a bug in the updated firmware.

Can you see if the problem can be reproduced if skl_dmc_ver1.bin is used instead of skl_dmc_ver1_26.bin?
Comment 8 Patrick Steinhardt 2017-02-06 15:06:41 UTC
No, I cannot actually reproduce the problem when instead including `skl_dmc_ver1.bin`. But actually, it is not being loaded by the kernel in v4.9.8:

[    0.531932] [drm] Initialized
[    0.532336] [drm] Memory usable by graphics device = 4096M
[    0.532338] [drm] VT-d active for gfx access
[    0.537328] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    0.537330] [drm] Driver supports precise vblank timestamp query.
[    0.542846] i915 0000:00:02.0: Direct firmware load for i915/skl_dmc_ver1_26.bin failed with error -2
[    0.542850] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management.
[    0.543002] [drm] Disabling framebuffer compression (FBC) to prevent screen flicker with VT-d enabled
[    0.545718] [drm] GuC firmware load skipped
[    0.606362] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
[    0.606505] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input3
[    0.606526] [drm] Initialized i915 1.6.0 20160919 for 0000:00:02.0 on minor 0

Other than the above, dmesg output has not really changed. So runtime power management is disabled, but the screen does not change back on.
Comment 9 Elio 2017-02-10 22:21:00 UTC
Using Kernel from drm tip : 4.10-rc7 commit fb21519 the problem is not present anymore , even if i wait for 3 minutes. 
Please try with this kernel version and let me know if you still having the problem.
Comment 10 Patrick Steinhardt 2017-02-11 16:20:50 UTC
The issue persists with latest drm-tip 4dbd7c0fbb78 (drm-tip: 2017y-02m-10d-21h-45m-14s UC integration manifest, 2017-02-10) as well as with 4.10-rc7.
Comment 11 Jari Tahvanainen 2017-02-13 10:52:19 UTC
> [    0.542846] i915 0000:00:02.0: Direct firmware load for
> i915/skl_dmc_ver1_26.bin failed with error -2
> [    0.542850] i915 0000:00:02.0: Failed to load DMC firmware
> [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling
> runtime power management.

Patrick, what is the content of /lib/firmware/i915 folder on your machine? Just double checking what firmware you have available in your machine.
Comment 12 Jari Tahvanainen 2017-02-13 11:05:33 UTC
and you assumable used update.sh included in .tar.gz when you installed new firmware - right? Or are you still using 1_23 or even earlier version?
Comment 13 Patrick Steinhardt 2017-02-13 11:21:07 UTC
No, I use `i915/skl_dmc_ver1_26.bin`. I retrieve the file via the linux-firmware package from Gentoo. Most of the other firmware files are filtered, so the following files remain:


As the firmware is provided by the distribution and I compile it directly into the kernel via `CONFIG_EXTRA_FIRMWARE`, I do not run any additional scripts to have them updated (besides re-compiling the kernel on changed firmware, obviously).

Just to avoid any misunderstandings: the error loading `skl_dmc_ver1_26.bin` was only caused by directly trying to load `i915/skl_dmc_ver1.bin` instead of `i915/skl_dmc_ver1_26.bin`, as was proposed by Elio. Otherwise, the firmware file loads correctly.
Comment 14 Rodrigo Vivi 2017-02-17 18:53:18 UTC
Patrick, could you please try latest drm-tip and try it twice:
one booting with i915.enable_dc=0
and another boot with i915.enable_dc=-1

In both cases, could you please also paste the output of:

Further thoughts:
1. Runtime PM is only active if DMC is loaded. So it can be an issue on DMC or an issue on RPM. But with dc_state=0 and firmware getting loaded RPM will probably run but firmware won't do anything. Another approach would be commenting out the intel_display_power_get(dev_priv, POWER_DOMAIN_INIT); from intel_csr_ucode_init.

2. Patrick's user space is totally different because he has those hybrid gpus. So the only way to really reproduce his issues is with same setup.
Btw, Patrick, could you please also attach your /var/log/Xorg.0.log?

Comment 15 Patrick Steinhardt 2017-02-22 08:09:28 UTC
I'm currently unable to test this. I will report back towards the end of this week.
Comment 16 Patrick Steinhardt 2017-02-23 12:07:29 UTC
Just a short reminder, triggered by the different userspace assertion: the second, external GPU is not in use and stubbed out with vfio-pci.

Anyway, finally got around to do these tests. No change in behavior, the bug constistently triggered after around 2 two three minutes for all three configurations (without i915.enable_dc and with i915.enable_dc=0 respectively i915.enable_dc=-1). dmesg shows no significant differences, diff (output is sorted to minimize diff, so this is not in chronological order):

--- /dev/fd/63	2017-02-23 13:06:04.456398971 +0100
+++ /dev/fd/62	2017-02-23 13:06:04.456398971 +0100
@@ -104,8 +104,8 @@
  ACPI: SSDT 0x0000000066B8E598 0006FD (v02 INTEL  xh_rvp08 00000000 INTL 20120913)
  ACPI: SSDT 0x0000000066B8EC98 0052EA (v02 SaSsdt SaSsdt   00003000 INTL 20120913)
  ACPI: SSDT 0x0000000066B93FD0 000E58 (v02 CpuRef CpuSsdt  00003000 INTL 20120913)
- ACPI: SSDT 0xFFFF88047368B800 0005AA (v02 PmRef  ApIst    00003000 INTL 20120913)
- ACPI: SSDT 0xFFFF88047368F000 000705 (v02 PmRef  Cpu0Ist  00003000 INTL 20120913)
+ ACPI: SSDT 0xFFFF8804736BA000 0005AA (v02 PmRef  ApIst    00003000 INTL 20120913)
+ ACPI: SSDT 0xFFFF8804736BD000 000705 (v02 PmRef  Cpu0Ist  00003000 INTL 20120913)
  ACPI: (supports S0 S3 S5)
  ACPI: UEFI 0x0000000066B93F88 000042 (v01                 00000000      00000000)
  ACPI: Using IOAPIC for interrupt routing
@@ -351,7 +351,7 @@
  iommu: Adding device 0000:04:00.0 to group 13
  io scheduler cfq registered (default)
  io scheduler noop registered
- Kernel command line: BOOT_IMAGE=/linux/gentoo/bzImage init=/usr/sbin/epoch-init vfio-pci.ids=10de:1380,10de:0fbc i915.enable_dc=0
+ Kernel command line: BOOT_IMAGE=/linux/gentoo/bzImage init=/usr/sbin/epoch-init vfio-pci.ids=10de:1380,10de:0fbc i915.enable_dc=-1
  Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
  Last level iTLB entries: 4KB 128, 2MB 8, 4MB 8
  libata version 3.00 loaded.
@@ -587,7 +587,7 @@
  rtc_cmos 00:03: alarms up to one month, y3k, 242 bytes nvram, hpet irqs
  rtc_cmos 00:03: RTC can wake from S4
  rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
- rtc_cmos 00:03: setting system clock to 2017-02-23 11:54:15 UTC (1487850855)
+ rtc_cmos 00:03: setting system clock to 2017-02-23 11:58:27 UTC (1487851107)
  scsi 0:0:0:0: CD-ROM            HL-DT-ST BD-RE  BH16NS55  1.02 PQ: 0 ANSI: 5
  scsi 2:0:0:0: Direct-Access     ATA      Samsung SSD 840  CB6Q PQ: 0 ANSI: 5
  scsi 3:0:0:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
@@ -631,7 +631,7 @@
  smpboot: Allowing 4 CPUs, 0 hotplug CPUs
  smpboot: CPU0: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz (family: 0x6, model: 0x5e, stepping: 0x3)
  smpboot: Max logical packages: 1
- smpboot: Total of 4 processors activated (28046.20 BogoMIPS)
+ smpboot: Total of 4 processors activated (28046.15 BogoMIPS)
  smp: Bringing up secondary CPUs ...
  smp: Brought up 1 node, 4 CPUs
  snd_hda_codec_realtek hdaudioC0D0: autoconfig for ALC892: line_outs=3 (0x14/0x15/0x16/0x0/0x0) type:line
@@ -714,7 +714,7 @@
  udevd[122]: specified group 'uucp' unknown
  udevd[122]: specified group 'video' unknown
  udevd[251]: starting version 3.1.5
- udevd: 49 output lines suppressed due to ratelimiting
+ udevd: 48 output lines suppressed due to ratelimiting
  UDP hash table entries: 8192 (order: 6, 262144 bytes)
  UDP-Lite hash table entries: 8192 (order: 6, 262144 bytes)
  Unpacking initramfs...

Anyway, attaching all three dmesg logs
Comment 17 Patrick Steinhardt 2017-02-23 12:07:55 UTC
By the way, this was tested with bf89ec45d082 (drm-tip: 2017y-02m-22d-14h-30m-04s UTC integration manifest, 2017-02-22)
Comment 18 Patrick Steinhardt 2017-02-23 12:08:19 UTC
Created attachment 129864 [details]
dmesg with i915.enable_dc=0
Comment 19 Patrick Steinhardt 2017-02-23 12:08:46 UTC
Created attachment 129865 [details]
dmesg with i915.enable_dc=-1
Comment 20 Patrick Steinhardt 2017-02-23 12:09:02 UTC
Created attachment 129866 [details]
dmesg without i915.enable_dc
Comment 21 Patrick Steinhardt 2017-02-23 12:13:01 UTC
Created attachment 129867 [details]
Xorg log
Comment 22 Patrick Steinhardt 2017-02-23 12:18:15 UTC
Created attachment 129868 [details]
i915 runtime pm status with i915.enable_dc=0
Comment 23 Patrick Steinhardt 2017-02-23 12:18:30 UTC
Created attachment 129869 [details]
i915 runtime pm status with i915.enable_dc=-1
Comment 24 Patrick Steinhardt 2017-02-23 12:26:53 UTC
Commenting out the `intel_display_power_get(dev_priv, POWER_DOMAIN_INIT)` left unable to boot as the screen immediately turned off after the bootloader and did not come back up.
Comment 25 Rodrigo Vivi 2017-02-27 18:26:32 UTC
dmesgs are incomplete. Could you please make sure you boot with drm.debug=0xe and also with log_buf_len=4M

Also I'd like you to try another attempt. move /lib/firmware/i915/skl_dmc_ver1_26.bin out of /lib/firmware/i915 and regenerate your initrd. I'd like to make sure that firmware presence and loading is really impacting this somehow.
Please attach dmesg for both cases. With firmware loading and without firmware loading.

Because with DC states disabled firmware is not suppose to do anything anyways and the issue persist with dc_enable=0...
Also runtime PM is probably not the culprit because the count is high.
Comment 26 dog 2017-03-07 14:42:55 UTC
FW Engineer Pavana says:

I looked at the bug. I don’t think it is related to the DMC FW version. But it is a little hard to tell without performing different experiments.
In the description, it say that there are random wakes. DMC can’t wake display up without any external trigger (Register access). Which makes me think there is something else causing the wake.
Comment 27 Patrick Steinhardt 2017-03-07 17:41:07 UTC
Short notice: I won't be able to respond on this issue until mid of next week
Comment 28 Patrick Steinhardt 2017-03-07 23:37:15 UTC
Regarding the "random wakes": I think the wakeups aren't that random at all, I only perceived them as such initially. I think the initial wakeups happen at in a short timeframe around 2 to 3 minutes after the screensaver kicks in. I could do some measurements on this next week if it would help.
Comment 29 Patrick Steinhardt 2017-03-26 19:50:05 UTC
Created attachment 130472 [details]
dmesg with i915.enable_dc=0 drm.debug=0xe
Comment 30 Patrick Steinhardt 2017-03-26 19:50:44 UTC
Created attachment 130473 [details]
dmesg with i915.enable_dc=-1 drm.debug=0xe
Comment 31 Patrick Steinhardt 2017-03-26 19:51:52 UTC
Got around to get the dmesg logs with drm.debug=0xe. Error occured both times at around 160-180 seconds, as can be seen in the logs. Tested with 295dfed68b4c (drm-tip: 2017y-03m-26d-01h-26m-45s UTC integration manifest, 2017-03-26)
Comment 32 Rodrigo Vivi 2017-03-27 16:41:02 UTC
Hi Patrick,

Did you have a chance do test without dmc firmware in rootfs nor intiramfs?

Your first log for enable_dc=0 is short and without the wake-up part. But if you state that the wake up also happens on that case it is unlikely that this is dmc, but to prove we would need to remove dmc firmware loading completely.
And possibly re-do the bisect without dmc firmware installed on all iterations.

Also on last commit you mention that wakes up happens around here:

[  161.751305] [drm:0xffffffff8a398d76] Suspending device
[  161.753704] [drm:0xffffffff8a398e44] Device suspended
[  171.773089] [drm:0xffffffff8a43c9af] [CONNECTOR:48:HDMI-A-1]
[  171.789177] [drm:0xffffffff8a39836e] Resuming device
[  171.793424] [drm:0xffffffff8a398433] Device resumed

Usually this connector line with no other info are on the path where userspace requested connected connectors information....
(I'm not blaming userspace. If it was something that worked before with same userspace but it is not working now it is a kernel regression bug.)
So I believe the bug are around the connector state... Not around DMC firmware.

This is why I believe another bisect without dmc firmware in all iterations would be so helpful.
Comment 33 Patrick Steinhardt 2017-03-28 06:16:33 UTC
I think I did try it once without the firmware being present at all, but did not actually repeat it. I'll try to do another bisect without firmware soon.

Thanks for your feedback!
Comment 34 Patrick Steinhardt 2017-03-28 17:21:16 UTC
Created attachment 130507 [details]
dmesg with i915.enable_dc=0 drm.debug=0xe

I accidentally uploaded the wrong dmesg log for `enable_dc=0`. Here's the correct one without truncation.
Comment 35 Patrick Steinhardt 2017-03-28 17:40:09 UTC
Another observation: I think the issue _is_ actually connected to the firmware. I just now booted a kernel with firmware loading completely disabled (CONFIG_FW_LOADER=n) and the issue does not appear. With firmware loading the issue is still present on dabd99296104 (drm-tip: 2017y-03m-28d-15h-25m-54s UTC integration manifest, 2017-03-28).
Comment 36 anusha 2017-04-18 18:59:46 UTC
Hi Patrick,

1. I tried the screen saver setting and the monitor stays off for a long time (close to 15 mins). I never get "No signal" message on the external monitor.

2. The CONFIG_FW_LOADER from what i read is useful if we need userspace support to load the modules built out of tree. Which does not matter for DMC. We use the linux interface to load the firmware. 

3. I did use CONFIG_FW_LOADER = Y and CONFIG_FW_LOADER = n. Both the cases DMC loads just fine and the screen saver setting also works fine.
Comment 37 Patrick Steinhardt 2017-04-19 07:19:14 UTC
Well, this at least contradicts my observations. With CONFIG_FW_LOADER=Y, I experience the issue and see that the DMC is loaded. With CONFIG_FW_LOADER=N, there is no trace of the firmware being loaded and the issue is gone.

What's still intriguing is why the issue exists at all on my platform. But to be honest, I'm content with the solution of setting CONFIG_FW_LOADER=N. It decreases the amount of unknown code running in the kernel and fixes the issue for me. So I could close the issue as fixed, but given it is simply a workaround I think that's not the right thing to do.
Comment 38 anusha 2017-05-02 17:53:26 UTC
Hi Patrick,

Few thoughts here:

1. Have you installed GuC firmware apart from DMC too? Want to make sure there no interference from that component.....
2. Also, can you disable C States in BIOS via the CMOS page. Make it enabled to C0 state. And then try reproducing the issue?

If issue persists, then its nothing to do with DMC.
Comment 39 Ricardo 2017-05-09 16:34:01 UTC
Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 40 Ricardo 2017-05-09 16:35:23 UTC
Anusha if you are waiting for Patrick (submitter) input, the procedure is that you change the status to NEEDINFO and assign the bug to Patrick
Comment 41 Ricardo 2017-05-09 16:42:57 UTC
removing tag ReadyForDev since now it needs info from submitter
Comment 42 Jani Saarinen 2017-06-06 09:39:15 UTC
Reporter, please provided needed info. Otherwise bug will be closed in few days.
Comment 43 Jani Saarinen 2017-06-07 10:52:06 UTC
Will be closed on friday if no response.
Comment 44 steven.j.hockemeier 2017-06-13 17:59:23 UTC
Looks like this should be closed now...(no response from Submitter)
Comment 45 Elizabeth 2017-06-14 15:20:27 UTC
Good day, this bug will be closed because there is not response from submitter, if later is added new information, please be so gently to mark as reopen. Thanks.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.