Bug 92335

Summary: [HSW Regression] Null pointer dereference in intel_mmio_flip_work_func
Product: DRI
Reporter: Andreas Reis <andreas.reis>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: medium
CC: gary.c.wang, intel-gfx-bugs
Version: DRI git
Keywords: regression
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: HSW
i915 features: display/Other
Attachments (description, flags):
journalctl -b-1, none
Check for unpin_work under the spinlock, none
journalctl with drm.debug=3, none

Description Andreas Reis 2015-10-07 12:35:02 UTC
Created attachment 118732
journalctl -b-1

HSW 4770: instant freezes for about a week now with drm-intel-nightly, so far while browsing normally with Chromium (full HW acceleration). No indication of what specifically causes them.

journalctl shows an earlier WARN_ON_ONCE(!ppgtt) as well, but only since yesterday, so it is probably unrelated.
Comment 1 Andreas Reis 2015-10-09 21:09:04 UTC
Also got it on a 4200U without Chromium running.

The freeze is as total as it is instant; most of the time journalctl can't write the trace to disk.
Comment 2 Chris Wilson 2015-10-10 09:38:03 UTC
Created attachment 118793
Check for unpin_work under the spinlock

Looks like the relevant information is in drm.debug=1, so try capturing the error dmesg with, say, drm.debug=3.
Comment 3 Chris Wilson 2015-10-10 09:47:01 UTC
It would also be great if it was bisectable :)
Comment 4 Andreas Reis 2015-10-10 09:52:17 UTC
So shall I check with your patch applied, or as is?

I'll do a bisect once this is reproducible, otherwise it'll never finish.
Comment 5 Chris Wilson 2015-10-10 10:03:09 UTC
I think the patch should mask the issue - if I have understood the basic mechanics of the oops. If you have the time to bisect (thanks in advance!) do so without the patch as that should make it easier to trigger.
Comment 6 Andreas Reis 2015-10-10 10:35:01 UTC
Created attachment 118795
journalctl with drm.debug=3

Here's one journalctl output with drm.debug=3, kernel as it was before. Not sure if it's any good as the actual crash didn't make it to disk again.

Happened while running intel_gpu_top, with compton in the background, and frantically resizing the window of a 4K (H.264) video in Chromium. Like this: https://i.imgur.com/rKJ4zCj.jpg

(The corrupted parts in the screenshot have been there for months, they're visible only when resizing a video, change as the window size changes, and blink about once or twice per second.)
Comment 7 Chris Wilson 2015-10-10 10:39:57 UTC
Don't run intel_gpu_top; it will hard hang your machine (eventually).
Comment 8 Chris Wilson 2015-10-10 10:45:04 UTC
For what it's worth, I didn't see the telltale I was looking for in the drm.debug=3 dmesg (but I also presume that it was the hard lockup from intel_gpu_top).
Comment 9 Andreas Reis 2015-10-10 10:52:17 UTC
Seems like that was intel_gpu_top then; resizing alone doesn't appear to cause a freeze. Back to random chance.
Comment 10 Andreas Reis 2015-10-10 12:51:34 UTC
Tried running four 4K videos in parallel with stress-ng -c 12 loitering in the bg, still wouldn't trigger it.

So… until I happen on how to reproduce it, I'll stop running with drm.debug=3. Using that for hours just doesn't sound all that healthy for my SSD.
Comment 11 Chris Wilson 2015-10-10 15:41:47 UTC
Ok, run with

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 71d7298648e0..850b11351c03 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -11400,7 +11400,7 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
                 * the hardware completed the operation behind our backs.
                 */
                if (__intel_pageflip_stall_check(dev, crtc)) {
-                       DRM_DEBUG_DRIVER("flip queue: previous flip completed, continuing\n");
+                       DRM_ERROR("flip queue: previous flip completed, continuing\n");
                        page_flip_completed(intel_crtc);
                } else {
                        DRM_DEBUG_DRIVER("flip queue: crtc already busy\n");

and let's see if that crops up just before the fatal oops.
Comment 12 Andreas Reis 2015-10-11 14:46:30 UTC
Haven't encountered it for about two days now (assuming the last two cases were indeed from intel_gpu_top); maybe it was fixed. I'll look out for it for another week, then I guess this can be closed.
Comment 13 Andreas Reis 2015-10-20 12:03:53 UTC
Not encountered anymore, so apparently fixed en passant.
Comment 14 Jari Tahvanainen 2016-12-13 08:56:00 UTC
Closing resolved+worksforme set by reporter after one year of no comments.
