Summary: | [SKL][bisected] Screen regularly returns from power save mode | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Patrick Steinhardt <ps> | ||||||||||||||||||||||||||||
Component: | DRM/Intel | Assignee: | Patrick Steinhardt <ps> | ||||||||||||||||||||||||||||
Status: | CLOSED INVALID | QA Contact: | Luis Botello <luis.botello.ortega> | ||||||||||||||||||||||||||||
Severity: | critical | ||||||||||||||||||||||||||||||
Priority: | highest | CC: | intel-gfx-bugs | ||||||||||||||||||||||||||||
Version: | XOrg git | Keywords: | bisected, regression | ||||||||||||||||||||||||||||
Hardware: | x86-64 (AMD64) | ||||||||||||||||||||||||||||||
OS: | Linux (All) | ||||||||||||||||||||||||||||||
Whiteboard: | |||||||||||||||||||||||||||||||
i915 platform: | SKL | i915 features: | firmware/dmc, power/Other | ||||||||||||||||||||||||||||
Attachments: |
|
Description
Patrick Steinhardt
2017-01-28 16:50:43 UTC
Created attachment 129195 [details]
Bisect log
Created attachment 129196 [details]
Kernel configuration
Created attachment 129197 [details]
Dmesg
I just noticed thet there now has been some interesting information spewed out by dmesg regarding problems with the EDID. For this boot, I've reproduced the issue two times: first around the 120 seconds mark, where there is no info available, and a second time later on. I don't exactly know though if the information logged to dmesg and the second reproduction were actually at the same time, but it's possible. I just did a run with the latest drm-tip branch, the issue persists. But actually, there's a warning being triggered with 8c3608fc86d0 (drm-tip: 2017y-01m-27d-19h-56m-27s UTC integration manifest, 2017-01-27). I'll add the new dmesg log Created attachment 129198 [details]
dmesg drm-tip
(In reply to Patrick Steinhardt from comment #0) > > As said, the issue started to appear in Linux v4.9 and persists at least > until v4.9.6. I think I also had it occurring on v4.10-rc1, even though I'm > not totally sure here. I've bisected the issue to commit 536ab3ca19ef > (drm/i915: Fix botched merge that downgrades CSR versions., 2016-08-15), > which updates the Intel firmware version to `i915/skl_dmc_ver1_26`. So it > seems related to a bug in the updated firmware. Can you see if the problem can be reproduced if skl_dmc_ver1.bin is used instead of skl_dmc_ver1_26.bin? No, I cannot actually reproduce the problem when instead including `skl_dmc_ver1.bin`. But actually, it is not being loaded by the kernel in v4.9.8: ``` [ 0.531932] [drm] Initialized [ 0.532336] [drm] Memory usable by graphics device = 4096M [ 0.532338] [drm] VT-d active for gfx access [ 0.537328] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). [ 0.537330] [drm] Driver supports precise vblank timestamp query. [ 0.542846] i915 0000:00:02.0: Direct firmware load for i915/skl_dmc_ver1_26.bin failed with error -2 [ 0.542850] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management. [ 0.543002] [drm] Disabling framebuffer compression (FBC) to prevent screen flicker with VT-d enabled [ 0.545718] [drm] GuC firmware load skipped [ 0.606362] ACPI: Video Device [GFX0] (multi-head: yes rom: no post: no) [ 0.606505] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input3 [ 0.606526] [drm] Initialized i915 1.6.0 20160919 for 0000:00:02.0 on minor 0 ``` Other than the above, dmesg output has not really changed. So runtime power management is disabled, but the screen does not change back on. Using Kernel from drm tip : 4.10-rc7 commit fb21519 the problem is not present anymore , even if i wait for 3 minutes. Please try with this kernel version and let me know if you still having the problem. The issue persists with latest drm-tip 4dbd7c0fbb78 (drm-tip: 2017y-02m-10d-21h-45m-14s UC integration manifest, 2017-02-10) as well as with 4.10-rc7. > [ 0.542846] i915 0000:00:02.0: Direct firmware load for
> i915/skl_dmc_ver1_26.bin failed with error -2
> [ 0.542850] i915 0000:00:02.0: Failed to load DMC firmware
> [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling
> runtime power management.
Patrick, what is the content of /lib/firmware/i915 folder on your machine? Just double checking what firmware you have available in your machine.
and you assumable used update.sh included in .tar.gz when you installed new firmware - right? Or are you still using 1_23 or even earlier version? No, I use `i915/skl_dmc_ver1_26.bin`. I retrieve the file via the linux-firmware package from Gentoo. Most of the other firmware files are filtered, so the following files remain: ``` /lib/firmware/ /lib/firmware/iwlwifi-7265D-21.ucode /lib/firmware/i915 /lib/firmware/i915/skl_guc_ver6_1.bin /lib/firmware/i915/skl_guc_ver6.bin /lib/firmware/i915/skl_dmc_ver1.bin /lib/firmware/i915/skl_dmc_ver1_26.bin /lib/firmware/i915/skl_guc_ver4.bin ``` As the firmware is provided by the distribution and I compile it directly into the kernel via `CONFIG_EXTRA_FIRMWARE`, I do not run any additional scripts to have them updated (besides re-compiling the kernel on changed firmware, obviously). Just to avoid any misunderstandings: the error loading `skl_dmc_ver1_26.bin` was only caused by directly trying to load `i915/skl_dmc_ver1.bin` instead of `i915/skl_dmc_ver1_26.bin`, as was proposed by Elio. Otherwise, the firmware file loads correctly. Patrick, could you please try latest drm-tip and try it twice: one booting with i915.enable_dc=0 and another boot with i915.enable_dc=-1 In both cases, could you please also paste the output of: /sys/kernel/debug/dri/0/i915_runtime_pm_status Further thoughts: 1. Runtime PM is only active if DMC is loaded. So it can be an issue on DMC or an issue on RPM. But with dc_state=0 and firmware getting loaded RPM will probably run but firmware won't do anything. Another approach would be commenting out the intel_display_power_get(dev_priv, POWER_DOMAIN_INIT); from intel_csr_ucode_init. 2. Patrick's user space is totally different because he has those hybrid gpus. So the only way to really reproduce his issues is with same setup. Btw, Patrick, could you please also attach your /var/log/Xorg.0.log? Thanks, Rodrigo. I'm currently unable to test this. I will report back towards the end of this week. Just a short reminder, triggered by the different userspace assertion: the second, external GPU is not in use and stubbed out with vfio-pci. Anyway, finally got around to do these tests. No change in behavior, the bug constistently triggered after around 2 two three minutes for all three configurations (without i915.enable_dc and with i915.enable_dc=0 respectively i915.enable_dc=-1). dmesg shows no significant differences, diff (output is sorted to minimize diff, so this is not in chronological order): ``` --- /dev/fd/63 2017-02-23 13:06:04.456398971 +0100 +++ /dev/fd/62 2017-02-23 13:06:04.456398971 +0100 @@ -104,8 +104,8 @@ ACPI: SSDT 0x0000000066B8E598 0006FD (v02 INTEL xh_rvp08 00000000 INTL 20120913) ACPI: SSDT 0x0000000066B8EC98 0052EA (v02 SaSsdt SaSsdt 00003000 INTL 20120913) ACPI: SSDT 0x0000000066B93FD0 000E58 (v02 CpuRef CpuSsdt 00003000 INTL 20120913) - ACPI: SSDT 0xFFFF88047368B800 0005AA (v02 PmRef ApIst 00003000 INTL 20120913) - ACPI: SSDT 0xFFFF88047368F000 000705 (v02 PmRef Cpu0Ist 00003000 INTL 20120913) + ACPI: SSDT 0xFFFF8804736BA000 0005AA (v02 PmRef ApIst 00003000 INTL 20120913) + ACPI: SSDT 0xFFFF8804736BD000 000705 (v02 PmRef Cpu0Ist 00003000 INTL 20120913) ACPI: (supports S0 S3 S5) ACPI: UEFI 0x0000000066B93F88 000042 (v01 00000000 00000000) ACPI: Using IOAPIC for interrupt routing @@ -351,7 +351,7 @@ iommu: Adding device 0000:04:00.0 to group 13 io scheduler cfq registered (default) io scheduler noop registered - Kernel command line: BOOT_IMAGE=/linux/gentoo/bzImage init=/usr/sbin/epoch-init vfio-pci.ids=10de:1380,10de:0fbc i915.enable_dc=0 + Kernel command line: BOOT_IMAGE=/linux/gentoo/bzImage init=/usr/sbin/epoch-init vfio-pci.ids=10de:1380,10de:0fbc i915.enable_dc=-1 Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4 Last level iTLB entries: 4KB 128, 2MB 8, 4MB 8 libata version 3.00 loaded. @@ -587,7 +587,7 @@ rtc_cmos 00:03: alarms up to one month, y3k, 242 bytes nvram, hpet irqs rtc_cmos 00:03: RTC can wake from S4 rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0 - rtc_cmos 00:03: setting system clock to 2017-02-23 11:54:15 UTC (1487850855) + rtc_cmos 00:03: setting system clock to 2017-02-23 11:58:27 UTC (1487851107) scsi 0:0:0:0: CD-ROM HL-DT-ST BD-RE BH16NS55 1.02 PQ: 0 ANSI: 5 scsi 2:0:0:0: Direct-Access ATA Samsung SSD 840 CB6Q PQ: 0 ANSI: 5 scsi 3:0:0:0: Direct-Access ATA SAMSUNG HD103UJ 1118 PQ: 0 ANSI: 5 @@ -631,7 +631,7 @@ smpboot: Allowing 4 CPUs, 0 hotplug CPUs smpboot: CPU0: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz (family: 0x6, model: 0x5e, stepping: 0x3) smpboot: Max logical packages: 1 - smpboot: Total of 4 processors activated (28046.20 BogoMIPS) + smpboot: Total of 4 processors activated (28046.15 BogoMIPS) smp: Bringing up secondary CPUs ... smp: Brought up 1 node, 4 CPUs snd_hda_codec_realtek hdaudioC0D0: autoconfig for ALC892: line_outs=3 (0x14/0x15/0x16/0x0/0x0) type:line @@ -714,7 +714,7 @@ udevd[122]: specified group 'uucp' unknown udevd[122]: specified group 'video' unknown udevd[251]: starting version 3.1.5 - udevd: 49 output lines suppressed due to ratelimiting + udevd: 48 output lines suppressed due to ratelimiting UDP hash table entries: 8192 (order: 6, 262144 bytes) UDP-Lite hash table entries: 8192 (order: 6, 262144 bytes) Unpacking initramfs... ``` Anyway, attaching all three dmesg logs By the way, this was tested with bf89ec45d082 (drm-tip: 2017y-02m-22d-14h-30m-04s UTC integration manifest, 2017-02-22) Created attachment 129864 [details]
dmesg with i915.enable_dc=0
Created attachment 129865 [details]
dmesg with i915.enable_dc=-1
Created attachment 129866 [details]
dmesg without i915.enable_dc
Created attachment 129867 [details]
Xorg log
Created attachment 129868 [details]
i915 runtime pm status with i915.enable_dc=0
Created attachment 129869 [details]
i915 runtime pm status with i915.enable_dc=-1
Commenting out the `intel_display_power_get(dev_priv, POWER_DOMAIN_INIT)` left unable to boot as the screen immediately turned off after the bootloader and did not come back up. dmesgs are incomplete. Could you please make sure you boot with drm.debug=0xe and also with log_buf_len=4M Also I'd like you to try another attempt. move /lib/firmware/i915/skl_dmc_ver1_26.bin out of /lib/firmware/i915 and regenerate your initrd. I'd like to make sure that firmware presence and loading is really impacting this somehow. Please attach dmesg for both cases. With firmware loading and without firmware loading. Because with DC states disabled firmware is not suppose to do anything anyways and the issue persist with dc_enable=0... Also runtime PM is probably not the culprit because the count is high. FW Engineer Pavana says: I looked at the bug. I don’t think it is related to the DMC FW version. But it is a little hard to tell without performing different experiments. In the description, it say that there are random wakes. DMC can’t wake display up without any external trigger (Register access). Which makes me think there is something else causing the wake. Short notice: I won't be able to respond on this issue until mid of next week Regarding the "random wakes": I think the wakeups aren't that random at all, I only perceived them as such initially. I think the initial wakeups happen at in a short timeframe around 2 to 3 minutes after the screensaver kicks in. I could do some measurements on this next week if it would help. Created attachment 130472 [details]
dmesg with i915.enable_dc=0 drm.debug=0xe
Created attachment 130473 [details]
dmesg with i915.enable_dc=-1 drm.debug=0xe
Got around to get the dmesg logs with drm.debug=0xe. Error occured both times at around 160-180 seconds, as can be seen in the logs. Tested with 295dfed68b4c (drm-tip: 2017y-03m-26d-01h-26m-45s UTC integration manifest, 2017-03-26) Hi Patrick, Did you have a chance do test without dmc firmware in rootfs nor intiramfs? Your first log for enable_dc=0 is short and without the wake-up part. But if you state that the wake up also happens on that case it is unlikely that this is dmc, but to prove we would need to remove dmc firmware loading completely. And possibly re-do the bisect without dmc firmware installed on all iterations. Also on last commit you mention that wakes up happens around here: [ 161.751305] [drm:0xffffffff8a398d76] Suspending device [ 161.753704] [drm:0xffffffff8a398e44] Device suspended [ 171.773089] [drm:0xffffffff8a43c9af] [CONNECTOR:48:HDMI-A-1] [ 171.789177] [drm:0xffffffff8a39836e] Resuming device [ 171.793424] [drm:0xffffffff8a398433] Device resumed Usually this connector line with no other info are on the path where userspace requested connected connectors information.... (I'm not blaming userspace. If it was something that worked before with same userspace but it is not working now it is a kernel regression bug.) So I believe the bug are around the connector state... Not around DMC firmware. This is why I believe another bisect without dmc firmware in all iterations would be so helpful. I think I did try it once without the firmware being present at all, but did not actually repeat it. I'll try to do another bisect without firmware soon. Thanks for your feedback! Created attachment 130507 [details]
dmesg with i915.enable_dc=0 drm.debug=0xe
I accidentally uploaded the wrong dmesg log for `enable_dc=0`. Here's the correct one without truncation.
Another observation: I think the issue _is_ actually connected to the firmware. I just now booted a kernel with firmware loading completely disabled (CONFIG_FW_LOADER=n) and the issue does not appear. With firmware loading the issue is still present on dabd99296104 (drm-tip: 2017y-03m-28d-15h-25m-54s UTC integration manifest, 2017-03-28). Hi Patrick, 1. I tried the screen saver setting and the monitor stays off for a long time (close to 15 mins). I never get "No signal" message on the external monitor. 2. The CONFIG_FW_LOADER from what i read is useful if we need userspace support to load the modules built out of tree. Which does not matter for DMC. We use the linux interface to load the firmware. 3. I did use CONFIG_FW_LOADER = Y and CONFIG_FW_LOADER = n. Both the cases DMC loads just fine and the screen saver setting also works fine. Well, this at least contradicts my observations. With CONFIG_FW_LOADER=Y, I experience the issue and see that the DMC is loaded. With CONFIG_FW_LOADER=N, there is no trace of the firmware being loaded and the issue is gone. What's still intriguing is why the issue exists at all on my platform. But to be honest, I'm content with the solution of setting CONFIG_FW_LOADER=N. It decreases the amount of unknown code running in the kernel and fixes the issue for me. So I could close the issue as fixed, but given it is simply a workaround I think that's not the right thing to do. Hi Patrick, Few thoughts here: 1. Have you installed GuC firmware apart from DMC too? Want to make sure there no interference from that component..... 2. Also, can you disable C States in BIOS via the CMOS page. Make it enabled to C0 state. And then try reproducing the issue? If issue persists, then its nothing to do with DMC. Adding tag into "Whiteboard" field - ReadyForDev The bug still active *Status is correct *Platform is included *Feature is included *Priority and Severity correctly set *Logs included Anusha if you are waiting for Patrick (submitter) input, the procedure is that you change the status to NEEDINFO and assign the bug to Patrick removing tag ReadyForDev since now it needs info from submitter Reporter, please provided needed info. Otherwise bug will be closed in few days. Will be closed on friday if no response. Looks like this should be closed now...(no response from Submitter) Good day, this bug will be closed because there is not response from submitter, if later is added new information, please be so gently to mark as reopen. Thanks. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.