92335 – [HSW Regression] Null pointer deference in intel_mmio_flip_work_func

Summary: [HSW Regression] Null pointer deference in intel_mmio_flip_work_func

Status:	CLOSED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:	regression

Depends on:
Blocks:

Reported:	2015-10-07 12:35 UTC by Andreas Reis
Modified:	2016-12-13 08:56 UTC
CC List:	2 users

See Also:
i915 platform:	HSW
i915 features:	display/Other

Attachments
journalctl -b-1 (89.51 KB, text/plain) 2015-10-07 12:35 UTC, Andreas Reis
Check for unpin_work under the spinlock (5.60 KB, patch) 2015-10-10 09:38 UTC, Chris Wilson
journalctl with drm.debug=3 (1.09 MB, application/x-xz) 2015-10-10 10:35 UTC, Andreas Reis

Description Andreas Reis 2015-10-07 12:35:02 UTC

Created attachment 118732
journalctl -b-1

HSW 4770: instant freezes for about a week now with drm-intel-nightly, so far during ordinary browsing with Chromium (full HW acceleration). No indication of what specifically causes it.

journalctl also shows an earlier WARN_ON_ONCE(!ppgtt), but only since yesterday, so it is probably unrelated.

Comment 1 Andreas Reis 2015-10-09 21:09:04 UTC

Also got it on a 4200U without Chromium running.

The freeze is as total as it is instant; most of the time journalctl can't write the trace to disk.

Comment 2 Chris Wilson 2015-10-10 09:38:03 UTC

Created attachment 118793
Check for unpin_work under the spinlock

Looks like the relevant information is in drm.debug=1, so try capturing the error dmesg with, say, drm.debug=3.

Comment 3 Chris Wilson 2015-10-10 09:47:01 UTC

It would also be great if it was bisectable :)

Comment 4 Andreas Reis 2015-10-10 09:52:17 UTC

So shall I check with your patch applied, or as is?

I'll do a bisect once this is reproducible, otherwise it'll never finish.

Comment 5 Chris Wilson 2015-10-10 10:03:09 UTC

I think the patch should mask the issue, if I have understood the basic mechanics of the oops. If you have the time to bisect (thanks in advance!), do so without the patch, as that should make it easier to trigger.
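
The attached patch is not quoted in this thread, so the following is only a rough userspace sketch of the general pattern its title ("Check for unpin_work under the spinlock") suggests: test a pointer that another path may clear, and use it, while holding the same lock. Every name here (`flip_work`, `crtc_state`, `toy_lock`, `process_flip`) is hypothetical and not taken from the i915 driver, and a C11 `atomic_flag` stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the i915 driver. */
struct flip_work {
    int pending;
};

struct crtc_state {
    atomic_flag lock;          /* toy spinlock guarding 'work' */
    struct flip_work *work;    /* may be set to NULL by a completion path */
};

static void toy_lock(atomic_flag *l)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
    }
}

static void toy_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Returns true if work was found and processed, false if it had already
 * been cleared. The NULL test and the dereference happen under one lock
 * hold, so the pointer cannot be cleared between the two.
 */
static bool process_flip(struct crtc_state *crtc)
{
    bool handled = false;

    toy_lock(&crtc->lock);
    if (crtc->work != NULL) {
        crtc->work->pending = 0;
        handled = true;
    }
    toy_unlock(&crtc->lock);

    return handled;
}
```

A check like this avoids the NULL dereference but does not explain why the pointer was cleared early, which fits the remark above that the patch would mask the issue rather than fix it.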

Comment 6 Andreas Reis 2015-10-10 10:35:01 UTC

Created attachment 118795
journalctl with drm.debug=3

Here's one journalctl output with drm.debug=3, kernel as it was before. Not sure if it's any good as the actual crash didn't make it to disk again.

Happened while running intel_gpu_top with compton in the bg and resizing the window of a 4K (H.264) video in Chromium in a crazed fashion. Like this: https://i.imgur.com/rKJ4zCj.jpg

(The corrupted parts in the screenshot have been there for months, they're visible only when resizing a video, change as the window size changes, and blink about once or twice per second.)

Comment 7 Chris Wilson 2015-10-10 10:39:57 UTC

Don't run intel_gpu_top; it will hard hang your machine (eventually).

Comment 8 Chris Wilson 2015-10-10 10:45:04 UTC

For what it's worth, I didn't see the telltale I was looking for in the drm.debug=3 dmesg (but I also presume that it was the hard lockup from intel_gpu_top).

Comment 9 Andreas Reis 2015-10-10 10:52:17 UTC

Seems like that was intel_gpu_top then; resizing alone doesn't appear to freeze. Back to random chance.

Comment 10 Andreas Reis 2015-10-10 12:51:34 UTC

Tried running four 4K videos in parallel with stress-ng -c 12 loitering in the bg, still wouldn't trigger it.

So… until I happen on how to reproduce it, I'll stop running with drm.debug=3. Using that for hours just doesn't sound all that healthy for my SSD.

Comment 11 Chris Wilson 2015-10-10 15:41:47 UTC

Ok, run with

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 71d7298648e0..850b11351c03 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -11400,7 +11400,7 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
                 * the hardware completed the operation behind our backs.
                 */
                if (__intel_pageflip_stall_check(dev, crtc)) {
-                       DRM_DEBUG_DRIVER("flip queue: previous flip completed, continuing\n");
+                       DRM_ERROR("flip queue: previous flip completed, continuing\n");
                        page_flip_completed(intel_crtc);
                } else {
                        DRM_DEBUG_DRIVER("flip queue: crtc already busy\n");

and let's see if that crops up just before the fatal oops.
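
For context on why the diff swaps the macro: DRM_DEBUG_DRIVER output is only logged when the driver bit is set in the drm.debug bitmask, whereas DRM_ERROR output is logged regardless, so the promoted message shows up without having to run with drm.debug=3 for hours. A minimal sketch of that gating follows; the two DRM_UT_* values match the kernel's DRM headers of that period, while `msg_kind` and `message_is_logged` are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Debug category bits as used by the drm.debug module parameter. */
#define DRM_UT_CORE   0x01
#define DRM_UT_DRIVER 0x02

/* Hypothetical classification of the two macros touched by the diff. */
enum msg_kind {
    MSG_ERROR,         /* models DRM_ERROR */
    MSG_DEBUG_DRIVER   /* models DRM_DEBUG_DRIVER */
};

/* Hypothetical helper: would a message of this kind reach the log? */
static bool message_is_logged(enum msg_kind kind, unsigned int drm_debug)
{
    if (kind == MSG_ERROR)
        return true;                          /* not gated by drm.debug */
    return (drm_debug & DRM_UT_DRIVER) != 0;  /* needs the driver bit */
}
```

With drm.debug=0 (the default) only the error variant is logged; with drm.debug=3 both the core and driver bits are set, so the debug variant is logged as well.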

Comment 12 Andreas Reis 2015-10-11 14:46:30 UTC

Haven't encountered it for about two days now (assuming the last two cases were indeed from intel_gpu_top); maybe it was fixed. I'll look out for it for another week, then I guess this can be closed.

Comment 13 Andreas Reis 2015-10-20 12:03:53 UTC

Not encountered anymore, so apparently fixed en passant.

Comment 14 Jari Tahvanainen 2016-12-13 08:56:00 UTC

Closing resolved+worksforme set by reporter after one year of no comments.