Bug 94337

Summary: Linux 4.5 regression: FIFO underruns on Skylake
Product: DRI
Reporter: Andy Lutomirski <luto>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: ashley, gary.c.wang, giuliani.v, intel-gfx-bugs, manfred.kitzbichler, matthew.d.roper, przanoni, q3aiml
Version: unspecified
Keywords: regression
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: SKL
i915 features: display/eDP, display/watermark

Description Andy Lutomirski 2016-02-29 16:31:02 UTC
See previous discussion here: https://lists.freedesktop.org/archives/intel-gfx/2016-February/087710.html

On my Skylake laptop (Dell XPS 13 9350), I see relatively frequent FIFO underruns.  This only happens after suspend/resume -- I haven't seen them before the first suspend after a reboot.

My best guess is that this is the same issue as bug 93945 but that the fix for that bug didn't work on Skylake.  However, I *do* see underruns while the cursor is visible, but I don't seem to see them unless I've stopped interacting with the machine for a second or two.

The problem is still present in 4.5-rc6.  I'm using Fedora 23.

I can't dump VBIOS because "i915 0000:00:02.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff".  I'll keep poking PCI maintainers about that.  intel_reg_dumper output is in the mailing list thread.
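
For readers unfamiliar with the message: it compares the first 16 bits of the ROM image with 0xaa55. Below is a minimal Python sketch of the same comparison applied to an already dumped image. The function names are hypothetical and this is not the kernel's code; a result of 0xffff is consistent with the read returning all ones rather than ROM contents.

```python
import struct

# A PCI expansion ROM image starts with the bytes 0x55, 0xAA,
# which read as the little-endian 16-bit value 0xAA55.
PCI_ROM_SIGNATURE = 0xAA55


def rom_signature(header: bytes) -> int:
    """Return the 16-bit little-endian value at offset 0 of a ROM image."""
    if len(header) < 2:
        raise ValueError("need at least two bytes of ROM data")
    (value,) = struct.unpack_from("<H", header, 0)
    return value


def rom_signature_ok(header: bytes) -> bool:
    """True if the image begins with the expected 0xAA55 signature."""
    return rom_signature(header) == PCI_ROM_SIGNATURE


if __name__ == "__main__":
    print(hex(rom_signature(b"\x55\xaa\x00\x00")))  # well-formed header
    print(hex(rom_signature(b"\xff\xff\xff\xff")))  # all-ones read
```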
Comment 1 Andy Lutomirski 2016-03-07 02:12:10 UTC
This was not fixed by "drm/i915/skl: Fix power domain suspend sequence".
Comment 2 Andy Lutomirski 2016-03-12 00:36:35 UTC
This is also not fixed in drm-intel-nightly 2016y-03m-11d-13h-31m-03s.
Comment 3 Andy Lutomirski 2016-03-13 23:44:31 UTC
There is some highly questionable code in here.

In skl_pipe_wm_get_hw_state:
	temp = hw->plane_trans[pipe][PLANE_CURSOR];
	skl_pipe_wm_active_state(temp, active, true, true, i, 0);
By "i", do you mean PLANE_CURSOR?  This bug probably doesn't matter, because if is_cursor is set, then i is ignored.  But I'm wondering why there's an is_cursor parameter at all, given that the code appears to be identical in both cases.


If PLANE_CURSOR is intended to be just like the other planes, why not either make it plane 0 or have a for_each_plane or similar that iterates over all plane indices including PLANE_CURSOR?
Comment 4 Andy Lutomirski 2016-03-14 05:19:39 UTC
Just for completeness: the bug is present in 4.5 final.

I've definitely seen FIFO underruns while the first two WM levels show the cursor being on (PLANE_WM_EN set).  I've never seen PLANE_WM_EN set beyond the second level (LP0 and LP1 if I understand the code correctly).

I tend to have a gnome-terminal running, and gnome-terminal loves toggling the cursor state, but I certainly don't need to have gnome-terminal in the foreground to see this issue.
Comment 5 Andy Lutomirski 2016-03-29 02:29:01 UTC
I have not reproduced this problem in 4.6-rc1, so it might be fixed.

If I had to guess, I'd say this was:

commit bf22045250fafbe733277e13300eaa240ba2104d
Author: Matt Roper <matthew.d.roper@intel.com>
Date:   Tue Jan 19 11:43:04 2016 -0800

    Revert "drm/i915: Add two-stage ILK-style watermark programming (v10)"

which is the fix for bug 93640.

I think there is something wrong with your release process.  v4.5 has:

commit e2e407dc093f530b771ee8bf8fe1be41e3cea8b3
Author:     Matt Roper <matthew.d.roper@intel.com>
AuthorDate: Mon Feb 8 11:05:28 2016 -0800
Commit:     Jani Nikula <jani.nikula@intel.com>
CommitDate: Tue Feb 9 11:24:39 2016 +0200

    drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)

v4.6-rc1 also has:

commit b2435692dbb709d4c8ff3b2f2815c9b8423b72bb
Author:     Matt Roper <matthew.d.roper@intel.com>
AuthorDate: Tue Feb 2 22:06:51 2016 -0800
Commit:     Matt Roper <matthew.d.roper@intel.com>
CommitDate: Wed Feb 3 05:59:03 2016 -0800

    drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)

What gives?  (I haven't confirmed that the latter is the change that fixes this.)
Comment 6 Michele Lacchia 2016-06-24 09:26:49 UTC
This is not fixed as of Linux 4.6.2-1-ARCH. Sometimes the bug causes my system to completely freeze and I have to reboot with the power button. In journal I see:

Jun 24 10:51:01 miki-laptop kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

but the worse one is:

Jun 24 10:30:34 miki-laptop kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
Jun 24 10:39:25 miki-laptop kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Jun 24 10:39:25 miki-laptop kernel: IP: [<          (null)>]           (null)
Jun 24 10:39:25 miki-laptop kernel: PGD 84d3e067 PUD 8523c067 PMD 0 
Jun 24 10:39:25 miki-laptop kernel: Oops: 0010 [#1] PREEMPT SMP 
Jun 24 10:39:25 miki-laptop kernel: Modules linked in: fuse sha256_ssse3 sha256_generic hmac drbg ansi_cprng ctr ccm uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev
Jun 24 10:39:25 miki-laptop kernel:  glue_helper ablk_helper snd input_leds cryptd cfg80211 led_class serio_raw pcspkr soundcore i2c_i801 hci_uart shpchp btbcm i2c_hid thermal wmi btqca hid elan_i2c 
Jun 24 10:39:25 miki-laptop kernel:  drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_agp intel_gtt
Jun 24 10:39:25 miki-laptop kernel: CPU: 0 PID: 765 Comm: Xorg Tainted: G     U     O    4.6.2-1-ARCH #1
Jun 24 10:39:25 miki-laptop kernel: Hardware name: ASUSTeK COMPUTER INC. UX305UA/UX305UA, BIOS UX305UA.201 10/12/2015
Jun 24 10:39:25 miki-laptop kernel: task: ffff8802692b0f40 ti: ffff880084d34000 task.ti: ffff880084d34000
Jun 24 10:39:25 miki-laptop kernel: RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
Jun 24 10:39:25 miki-laptop kernel: RSP: 0018:ffff880084d37af0  EFLAGS: 00010286
Jun 24 10:39:25 miki-laptop kernel: RAX: ffff880084d37bb8 RBX: ffff88026a1b5c00 RCX: 000000000001fd36
Jun 24 10:39:25 miki-laptop kernel: RDX: 000000000001fd36 RSI: ffff8802685220f8 RDI: ffff88026a1b5f00
Jun 24 10:39:25 miki-laptop kernel: RBP: ffff880084d37b78 R08: ffff88026a1b5f00 R09: ffff88026a1b5f00
Jun 24 10:39:25 miki-laptop kernel: R10: ffff88020a43ed00 R11: 0000000000000000 R12: 0000000000000001
Jun 24 10:39:25 miki-laptop kernel: R13: ffff880268523368 R14: ffff8802685220f8 R15: 0000000000000000
Jun 24 10:39:25 miki-laptop kernel: FS:  00007fb7671b8940(0000) GS:ffff880273c00000(0000) knlGS:0000000000000000
Jun 24 10:39:25 miki-laptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 10:39:25 miki-laptop kernel: CR2: 0000000000000000 CR3: 000000007f387000 CR4: 00000000003406f0
Jun 24 10:39:25 miki-laptop kernel: Stack:
Jun 24 10:39:25 miki-laptop kernel:  ffffffffa0122da0 ffff880268520000 ffff8802685220f8 0001fd36000400d8
Jun 24 10:39:25 miki-laptop kernel:  ffff880084d37bb8 ffff8801cb6cf3c0 ffff88026a1b5c00 ffff880230579cc0
Jun 24 10:39:25 miki-laptop kernel:  ffff880084d37b40 ffffffffa0125ffd ffff880084d37b80 00000000afcbbfb4
Jun 24 10:39:25 miki-laptop kernel: Call Trace:
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0122da0>] ? i915_gem_object_sync+0x1b0/0x340 [i915]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0125ffd>] ? i915_gem_object_pin+0x2d/0x30 [i915]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0135abd>] intel_execlists_submission+0x1cd/0x440 [i915]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0114a20>] i915_gem_do_execbuffer.isra.14+0xaf0/0x1450 [i915]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffff812e6ae9>] ? idr_get_empty_slot+0x189/0x370
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffff812e6d53>] ? idr_alloc+0x83/0x100
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0018079>] ? drm_gem_handle_create_tail+0xc9/0x1a0 [drm]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa01160d4>] i915_gem_execbuffer2+0xd4/0x250 [i915]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0018aa2>] drm_ioctl+0x152/0x540 [drm]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffffa0116000>] ? i915_gem_execbuffer+0x330/0x330 [i915]
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffff81209bc3>] do_vfs_ioctl+0xa3/0x5d0
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffff814a1091>] ? __sys_recvmsg+0x51/0x90
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffff8120a169>] SyS_ioctl+0x79/0x90
Jun 24 10:39:25 miki-laptop kernel:  [<ffffffff815c7272>] entry_SYSCALL_64_fastpath+0x1a/0xa4
Jun 24 10:39:25 miki-laptop kernel: Code:  Bad RIP value.
Jun 24 10:39:25 miki-laptop kernel: RIP  [<          (null)>]           (null)
Jun 24 10:39:25 miki-laptop kernel:  RSP <ffff880084d37af0>
Jun 24 10:39:25 miki-laptop kernel: CR2: 0000000000000000
Jun 24 10:39:25 miki-laptop kernel: ---[ end trace 946c0a8763286b97 ]---
Jun 24 10:39:25 miki-laptop org.a11y.atspi.Registry[5911]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server :0
Jun 24 10:39:25 miki-laptop org.a11y.atspi.Registry[5911]:       after 2113 requests (2113 known processed) with 0 events remaining.
-- Reboot --
Comment 7 Chris Wilson 2016-06-24 09:35:45 UTC
(In reply to Michele Lacchia from comment #6)
> This is not fixed as of Linux 4.6.2-1-ARCH. Sometimes the bug causes my
> system to completely freeze and I have to reboot with the power button. In
> journal I see:
> 
> Jun 24 10:51:01 miki-laptop kernel: [drm:intel_cpu_fifo_underrun_irq_handler
> [i915]] *ERROR* CPU pipe A FIFO underrun
> 
> but the worse one is:
...

Which is a completely separate and much more critical bug than the underrun. Please do file a separate bug report for it.
Comment 8 Drunkard Zhang 2016-09-04 12:14:27 UTC
I'm using an MSI GS60 with the latest mainline kernel and am still hitting this bug:

Sep 03 12:06:25 mylap kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

Extra info in case you need it:

# uname -a
Linux mylap 4.8.0-rc3+ #85 SMP PREEMPT Thu Aug 25 17:20:53 CST 2016 x86_64 Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz GenuineIntel GNU/Linux

# lspci
00:00.0 Host bridge: Intel Corporation Skylake Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA Controller [AHCI mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #3 (rev f1)
00:1c.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #4 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 20)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5249 PCI Express Card Reader (rev 01)
04:00.0 Ethernet controller: Qualcomm Atheros Killer E2400 Gigabit Ethernet Controller (rev 10)
3e:00.0 Non-Volatile memory controller: Toshiba America Info Systems Device 010f (rev 01)
Comment 9 Paulo Zanoni 2016-10-03 17:29:42 UTC
Hi

Over the course of the last month we submitted a significant number of fixes that could have fixed this bug. Can you please try to reproduce this bug on a recent drm-intel-nightly kernel?

Thanks,
Paulo
Comment 10 Andre Fredette 2016-10-08 15:37:30 UTC
I'm having similar issues with a Skylake-based Lenovo T460s.  I'm using Fedora 24 with the latest updates:

$ uname -a
Linux gaston 4.7.5-200.fc24.x86_64 #1 SMP Mon Sep 26 21:25:47 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I see the comment above about bug fixes over the past month, but this is a fairly recent kernel, so I'm posting in case this is a good data point.

I see many errors like the following:

Oct 07 22:35:13 gaston kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO underrun
Oct 07 22:35:13 gaston kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
Oct 07 12:25:25 gaston kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

I've been seeing one or the other of my external monitors black out for about a second, and then come back.  On a few occasions, the system has completely frozen during one of these blackouts.

It also frequently freezes when left unattended, and always when the screens are "blanked".  I've been able to prevent the system from freezing by changing the Power->"Blank screen" setting to "Never".

When the system "freezes", the only way I've been able to recover is to hold down the power button.

The issues only seem to happen when running with external monitors - I've never had the system freeze problem when I haven't had external monitors.  For example, I ran for a week recently without external monitors and without any system freezes.

I did a full fresh install of Fedora about a week ago to see if that would help.  I ran for a few days without issues, but then it started again.

I have no idea whether these error logs have anything to do with my system freeze problems.
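
To see which pipes are affected and how often, the underrun lines quoted above can be tallied per pipe. The following is a minimal Python sketch; the function name and the regular expression are illustrative assumptions based only on the message format shown in this report, and the input could equally come from a saved journalctl or dmesg dump.

```python
import re
from collections import Counter

# Matches the i915 message format quoted in this report, for both
# journal-style and dmesg-style line prefixes.
UNDERRUN_RE = re.compile(r"CPU pipe ([A-Z]) FIFO underrun")


def count_underruns(lines):
    """Count FIFO underrun messages per pipe letter."""
    counts = Counter()
    for line in lines:
        match = UNDERRUN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Oct 07 22:35:13 gaston kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO underrun",
    "Oct 07 22:35:13 gaston kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun",
    "Oct 07 12:25:25 gaston kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun",
]

if __name__ == "__main__":
    for pipe, n in sorted(count_underruns(sample).items()):
        print(f"pipe {pipe}: {n}")
```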
Comment 11 yann 2016-10-11 06:50:37 UTC
Please re-test with Paulo's patch to apply memory workarounds for skylake: https://patchwork.freedesktop.org/series/13548/
Comment 12 Michael Vorburger 2016-11-03 11:13:38 UTC
FYI, I think this is somehow related to bug 91883. I have been hitting this regularly for weeks, and the error from this bug and the one from that bug always appear together, while one of two externally connected screens "flickers". This is on a Skylake-based Lenovo T460s under Fedora 24 with the latest updates; today that is 4.8.4-200.fc24.x86_64 (same as Andre Fredette; we're both at Red Hat, and the T460s is our standard, widely rolled out model...)

[60954.177636] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
[60961.293531] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO underrun
[61713.069574] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe B (start=50476 end=50477) time 152 us, min 1073, max 1079, scanline start 1072, end 1083

[62135.779839] CPU2: Core temperature above threshold, cpu clock throttled (total events = 409148)
[62135.779859] CPU2: Package temperature above threshold, cpu clock throttled (total events = 547662)

[62135.779868] mce_notify_irq: 1 callbacks suppressed
[62135.779869] mce: [Hardware Error]: Machine check events logged
[62135.782856] CPU2: Core temperature/speed normal

[64171.939147] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe B (start=198008 end=198009) time 160 us, min 1073, max 1079, scanline start 1070, end 1081
[64615.439710] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe B (start=224618 end=224619) time 158 us, min 1073, max 1079, scanline start 1072, end 1083
[65694.324329] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe B (start=289351 end=289352) time 178 us, min 1073, max 1079, scanline start 1068, end 1080
[66737.025209] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
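
The "Atomic update failure" lines above carry several numeric fields that are easier to compare once extracted. A minimal Python sketch follows; the field names in the result are hypothetical labels taken from the message text, and no interpretation of the values beyond what the message itself prints is implied.

```python
import re

# Field layout copied from the log lines in this comment.
ATOMIC_RE = re.compile(
    r"Atomic update failure on pipe (?P<pipe>[A-Z]) "
    r"\(start=(?P<frame_start>\d+) end=(?P<frame_end>\d+)\) "
    r"time (?P<time_us>\d+) us, min (?P<min>\d+), max (?P<max>\d+), "
    r"scanline start (?P<scan_start>\d+), end (?P<scan_end>\d+)"
)


def parse_atomic_failure(line):
    """Return the fields of one failure line as a dict, or None if no match."""
    match = ATOMIC_RE.search(line)
    if match is None:
        return None
    # Every field except the pipe letter is numeric.
    return {
        key: (value if key == "pipe" else int(value))
        for key, value in match.groupdict().items()
    }


example = (
    "[61713.069574] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update "
    "failure on pipe B (start=50476 end=50477) time 152 us, min 1073, "
    "max 1079, scanline start 1072, end 1083"
)

if __name__ == "__main__":
    print(parse_atomic_failure(example))
```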
Comment 13 Jani Saarinen 2016-12-09 11:11:39 UTC
Is this issue still seen with latest kernel?
Comment 14 Andy Lutomirski 2016-12-12 20:46:32 UTC
I haven't seen it for a while on the latest kernel.
Comment 15 yann 2016-12-13 06:55:45 UTC
(In reply to Andy Lutomirski from comment #14)
> I haven't seen it for a while on the latest kernel.

Thanks Andy for your feedback. Closing as fixed then.
