Recent 5.2 kernels show a black screen after logging in on an ICL machine. I tracked the regression, using drm-tip packages, to sometime between the 5-26 and 5-28 builds. All kernels boot on my KBL.
I'm not sure how to bisect the commits from here and find the one that regressed the kernel on ICL. If there is a straightforward process to do this, please send me instructions.
HW: ICL D2
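On the bisect question: `git bisect` automates a binary search over the commits between a known-good and a known-bad build. The sketch below only illustrates that search idea. The commit names and the `is_bad` check are hypothetical stand-ins for "build this commit, boot it on the ICL machine, and see whether the display stays black".

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is also bad (the same
    assumption a bisect relies on).
    """
    if not commits:
        raise ValueError("need at least one commit")
    lo, hi = 0, len(commits) - 1  # the first bad commit is always in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Hypothetical history: c0..c7, where the regression landed in c5.
history = [f"c{i}" for i in range(8)]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)     # record how many builds we had to test
    return int(commit[1:]) >= 5


culprit = first_bad_commit(history, is_bad)
```

With a real tree the same loop is driven by `git bisect start`, `git bisect bad`, and `git bisect good <rev>`, marking each tested build good or bad until git names the first bad commit.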
Instead of the ppa, can you also try to reproduce the error using the latest drm-tip (https://cgit.freedesktop.org/drm-tip) with the kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists, attach the full dmesg from boot?
Also report which BIOS version you have. You can see from our CI (https://intel-gfx-ci.01.org/tree/drm-tip/?hosts=icl) what we have, e.g. on icl-u2: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6188/fi-icl-u2/boot0.log => DMI: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019.
Also note the CI page of u3: https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u3.html. There was one set of patches that prevented ICL from booting at all, which got fixed in later builds. This bad build also falls within your timeline: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6159/git-log-oneline.log. Maybe try a later ppa too. It seems 6159 (a very bad build) was the last build from the 28th, which could explain it, as 6160 is already from the 29th: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6160/git-log-oneline.log
This was reproduced with a local build of drm-tip on 6/3.
I asked Rodrigo how to interpret the gfx ci web page to see if we could identify the regression in the results, and we couldn't figure it out.
Looking at the link you provided, I'm still unsure of how to figure out what indicates the regression. Is there already a bug written up about this?
Please boot with drm.debug=0xe passed to the kernel cmdline and attach the resulting dmesg once you've hit the black screen. Also pass eg. log_buf_len=4M in case the log gets truncated. That should hopefully tell us if this is a display issue.
Also still, please report your BIOS version.
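For reference, drm.debug is a bitmask of debug categories. The sketch below decodes a mask into category names. The bit values are the DRM_UT_* categories as I recall them from include/drm/drm_print.h in kernels of this era, so treat the table as an assumption and check it against your own tree.

```python
# Assumed DRM_UT_* bit assignments (verify against include/drm/drm_print.h).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask: int) -> list:
    """Return the names of the debug categories enabled in a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(decode_drm_debug(0xE))
print(decode_drm_debug(0x1E))
```

Under that table, 0xe enables the driver, KMS, and PRIME messages, and 0x1e additionally enables the atomic messages.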
Mark, if you look at https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u2.html
and the column CI_DRM_6159, you see that it is completely empty. We noticed this because none of the ICLs booted (e.g. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6159/fi-icl-u2/), including the shards. No bug was filed (or maybe Martin knows of one), but one patch got reverted, and there were some extra hiccups after the revert (https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6163/git-log-oneline.log). See those selftests (dmesg warnings). Anyway, there should not be any issues, as the system boots.
Jani, thanks for the pointers. I'm confident that this is a separate issue. The system boots fine, but the display is blank. The issue has persisted on i915's CI builds for weeks.
We are currently bisecting down to the commit.
Author: Ville Syrjälä <firstname.lastname@example.org>
drm/i915: Make sure we have enough memory bandwidth on ICL
ICL has so many planes that it can easily exceed the maximum
effective memory bandwidth of the system. We must therefore check
that we don't exceed that limit.
The algorithm is very magic number heavy and lacks sufficient
explanation for now. We also have no sane way to query the
memory clock and timings, so we must rely on a combination of
raw readout from the memory controller and hardcoded assumptions.
The memory controller values obviously change as the system
jumps between the different SAGV points, so we try to stabilize
it first by disabling SAGV for the duration of the readout.
The utilized bandwidth is tracked via a device wide atomic
private object. That is actually not robust because we can't
afford to enforce strict global ordering between the pipes.
Thus I think I'll need to change this to simply chop up the
available bandwidth between all the active pipes. Each pipe
can then do whatever it wants as long as it doesn't exceed
its budget. That scheme will also require that we assume that
any number of planes could be active at any time.
TODO: make it robust and deal with all the open questions
v2: Sleep longer after disabling SAGV
v3: Poll for the dclk to get raised (seen it take 250ms)
If the system has 2133MT/s memory then we pointlessly
wait one full second
v4: Use the new pcode interface to get the qgv points rather
than using hardcoded numbers
v5: Move the pcode stuff into intel_bw.c (Matt)
Do the NV12/P010 as per spec for now (Matt)
v6: Ignore bandwidth limits if the pcode query fails
Signed-off-by: Ville Syrjälä <email@example.com>
Reviewed-by: Matt Roper <firstname.lastname@example.org>
Acked-by: Clint Taylor <Clinton.A.Taylor@intel.com>
I tried to set the kernel cmdline to include "set drm.debug=0xe". Not sure if I did it correctly. Please tell me if I messed this up. Attached dmesg.log.
File /var/log/Xorg.0.log was not created.
Created attachment 144464 [details]
dmesg of failing case (black display)
(In reply to fjdegroo from comment #10)
> Created attachment 144464 [details]
> dmesg of failing case (black display)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.1.0-rc5c457d9+ root=UUID=06f9ea84-363e-408d-ad4b-048233161a7a ro quiet splash vt.handoff=1
drm.debug=0xe is not there.
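A quick way to confirm whether a parameter actually reached the kernel is to parse the "Command line:" line of the dmesg. This is a minimal sketch that assumes the line format quoted above and plain space-separated parameters (quoted values containing spaces are not handled). The function name `cmdline_params` is only illustrative.

```python
import re


def cmdline_params(dmesg_text: str) -> dict:
    """Parse the kernel 'Command line:' entry of a dmesg log.

    Returns a dict mapping each parameter to its value, or to None for
    bare flags such as 'quiet'. Returns an empty dict if no such line exists.
    """
    match = re.search(r"Command line: (.*)", dmesg_text)
    if match is None:
        return {}
    params = {}
    for token in match.group(1).split():
        key, sep, value = token.partition("=")  # split on the first '=' only
        params[key] = value if sep else None
    return params


log = (
    "[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.1.0-rc5c457d9+ "
    "root=UUID=06f9ea84-363e-408d-ad4b-048233161a7a ro quiet splash vt.handoff=1\n"
)
params = cmdline_params(log)
print("drm.debug" in params)
```

On a running system the same string is available in /proc/cmdline, so the check can be done before collecting the full log.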
Created attachment 144465 [details]
dmesg of failing case, try #2
There is no way we would revert that change; it was a major fix for your fifo underruns. What displays are we talking about here? eDP and external ones?
I have ICL booting nicely with eDP (4K), DP and HDMI simultaneously with the latest drm-tip. I still see you have some issues on your system.
One difference is also that you use the ppa repo and not pure drm-tip...
[ 4.878097] [drm:intel_bw_init_hw [i915]] QGV 0: DCLK=224 tRP=34 tRDPRE=14 tRAS=79 tRCD=34 tRC=113
[ 4.878188] [drm:intel_bw_init_hw [i915]] BW0 / QGV 0: num_planes=4 deratedbw=7612
[ 4.878273] [drm:intel_bw_init_hw [i915]] BW1 / QGV 0: num_planes=2 deratedbw=12465
[ 4.878356] [drm:intel_bw_init_hw [i915]] BW2 / QGV 0: num_planes=1 deratedbw=17032
Looks like it only exposes a single QGV point, which I apparently failed to consider. I do remember thinking about that but apparently it slipped my mind.
BTW do you have SAGV disabled in the BIOS or something? Just wondering how we got to this state...
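The number of QGV points a machine exposes can be read off the debug log mechanically. The sketch below counts them from the intel_bw_init_hw lines. The regular expression is based only on the line format quoted in this report, and the function name `qgv_points` is illustrative.

```python
import re

# Matches lines like "QGV 0: DCLK=224 tRP=34 ..." but not the
# "BW0 / QGV 0: num_planes=..." lines, which lack the DCLK field.
QGV_LINE = re.compile(r"QGV (\d+): DCLK=(\d+)")


def qgv_points(dmesg_text: str) -> dict:
    """Map each QGV point index to the DCLK value printed for it."""
    return {int(idx): int(dclk) for idx, dclk in QGV_LINE.findall(dmesg_text)}


log = """\
[ 4.878097] [drm:intel_bw_init_hw [i915]] QGV 0: DCLK=224 tRP=34 tRDPRE=14 tRAS=79 tRCD=34 tRC=113
[ 4.878188] [drm:intel_bw_init_hw [i915]] BW0 / QGV 0: num_planes=4 deratedbw=7612
[ 4.878273] [drm:intel_bw_init_hw [i915]] BW1 / QGV 0: num_planes=2 deratedbw=12465
[ 4.878356] [drm:intel_bw_init_hw [i915]] BW2 / QGV 0: num_planes=1 deratedbw=17032
"""
points = qgv_points(log)
print(len(points))
```

For the excerpt above this finds a single point, index 0, which matches the observation that only one QGV point is exposed.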
Can you reset to BIOS defaults and comment on whether that helps? Otherwise I'll let Ville fix some issues ;).
Regarding the bios, we are a power and performance group and so regularly fix the IA/GT/Ring/DRAM frequencies to reduce run-to-run variance. This is necessary for the performance issues that we routinely chase. Part of this is fixing the SAGV to High in the bios.
I reset the SAGV to Enable in the bios and this display issue went away. I can now see the desktop after login.
Re-enabling SAGV in the bios is a good workaround to get us unblocked. But long term we will need a solution for getting display when SAGV is fixed.
There is already a patch from Ville on the mailing list that fixes this issue.
Author: Ville Syrjälä <email@example.com>
Date: Thu Jun 6 15:42:10 2019 +0300
drm/i915: Deal with machines that expose less than three QGV points