Bug 106642 - X server crashes in i965 on desktop startup when DRI3 v1.2 / modifier support is enabled
Summary: X server crashes in i965 on desktop startup when DRI3 v1.2 / modifier support...
Status: VERIFIED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965 (show other bugs)
Version: git
Hardware: Other All
: highest blocker
Assignee: Daniel Stone
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: patch, regression
Depends on:
Blocks:
 
Reported: 2018-05-24 11:17 UTC by Eero Tamminen
Modified: 2018-06-08 14:08 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Gdb backtrace of the X server crash (4.91 KB, text/plain)
2018-05-24 11:17 UTC, Eero Tamminen
Details
Patch to use RGBA non-sRGB formats for compatibility checks (1.95 KB, patch)
2018-06-05 23:44 UTC, Jason Ekstrand
Details | Splinter Review
Gdb backtrace of the X server crash, with Mesa patch (13.67 KB, text/plain)
2018-06-06 08:25 UTC, Eero Tamminen
Details
dri: add missing 16bits formats mapping (1.05 KB, patch)
2018-06-06 11:59 UTC, Lionel Landwerlin
Details | Splinter Review
dri: add missing 16bits formats mapping (1.34 KB, patch)
2018-06-06 14:59 UTC, Lionel Landwerlin
Details | Splinter Review

Description Eero Tamminen 2018-05-24 11:17:54 UTC
Created attachment 139737 [details]
Gdb backtrace of the X server crash

Setup:
* drm-tip kernel
* Mesa git from last night
* X server v1.20 or newer, with DRI3 v1.2 / modifiers enabled in X configuration:
------------------------
Section "ServerFlags"
    Option "Debug" "dmabuf_capable"
EndSection
------------------------
* Rest is Ubuntu 16.04


Use-case:
* Log in to X session that uses GL/GLES compositor (Unity, Gnome...)

Expected result:
* Desktop appears

Actual result:
* X crashes in i965 driver, on all devices we have

Crash disappears if one disables the X configuration "dmabuf_capable" setting that enables DRI3 v1.2 / modifiers support in X.


Crash is here:
Thread 1 "X" received signal SIGSEGV, Segmentation fault.
0x00007f19d6d487ca in isl_format_supports_ccs_e (devinfo=devinfo@entry=0x2286ab8, format=ISL_FORMAT_UNSUPPORTED)
    at isl/isl_format.c:558

See attached Gdb backtrace for more info.


Looking at the indicated code, the bug is obvious:

* isl.h declares:
  ISL_FORMAT_UNSUPPORTED = UINT16_MAX

* isl_format_supports_ccs_e() function uses that as array index:
-------------------------------
isl_format_supports_ccs_e(const struct gen_device_info *devinfo,
                          enum isl_format format)
{
   if (!format_info[format].exists)
      return false;
...
-------------------------------

* But array size is much smaller:
  $ grep "  SF" ./src/intel/isl/isl_format.c | wc -l
  258


Crash started to happen between:
* 2018-05-22 17:02:28 UTC: 92f01fc5f9: i965: Emit VF cache invalidates for 48-bit addressing bugs with softpin.
* 2018-05-23 17:19:04 UTC: b73b340c37: i965: add {X,A}BGR2101010 to 'intel_image_formats'

Either addition of the new formats, or Jason's blitter support removal commits started to trigger usage of ISL_FORMAT_UNSUPPORTED.
Comment 1 Eero Tamminen 2018-05-24 13:06:30 UTC
(In reply to Eero Tamminen from comment #0)
> -------------------------------
> isl_format_supports_ccs_e(const struct gen_device_info *devinfo,
>                           enum isl_format format)
> {
>    if (!format_info[format].exists)
>       return false;
> -------------------------------

This:
-    if (!format_info[format].exists)
+    if (format == ISL_FORMAT_UNSUPPORTED || !format_info[format].exists)


Gets rid of the X server crashes both on BDW & SKL, so I assume it fixes the issue also for the other machines.

However, I wonder whether also other isl_format_supports_* functions should have guard for ISL_FORMAT_UNSUPPORTED...
Comment 2 Daniel Stone 2018-05-24 22:08:14 UTC
Jason - I'm happy to prepare the patch when back next week but don't know what's best. Should we check bounds in the function, or check in callers and assert in the function, or ... ?
Comment 3 Eero Tamminen 2018-05-25 08:07:33 UTC
Couldn't temporary one-liner fix for the desktop killer / startup crash be pushed right away?

Proper handling for ISL_FORMAT_UNSUPPORTED can be commited after it's more clear where / how that should be fixed.
Comment 4 Sergii Romantsov 2018-05-25 13:09:55 UTC
As alternative temporal solution (without patching) can be tried to set 'INTEL_DEBUG=norbc' (but can't check if it helps).
Comment 5 Eero Tamminen 2018-05-25 16:25:33 UTC
(In reply to Sergii Romantsov from comment #4)
> As alternative temporal solution (without patching) can be tried to set
> 'INTEL_DEBUG=norbc' (but can't check if it helps).

If one wants to do manual testing, best is just to patch Mesa, or instead of disabling all RBC (texture uploads, offscreen render targets...), one should just disable it for onscreen render targets from X server config ("dmabuf_capable" option).

For automated testing these aren't options though.  Hacking around upstream bugs in testing automation is just plain wrong.  More visible the issues are in automation, the quicker they should be fixed / worked around *in upstream*.

And if automation doesn't catch desktop killer bugs, automation should be fixed.
Comment 6 Eero Tamminen 2018-06-05 10:15:38 UTC
(In reply to Eero Tamminen from comment #3)
> Couldn't temporary one-liner fix for the desktop killer / startup crash be
> pushed right away?
> 
> Proper handling for ISL_FORMAT_UNSUPPORTED can be commited after it's more
> clear where / how that should be fixed.

Ping.

It's nearly two weeks since this desktop killer X server crash issue was introduced to Mesa and I provided one-liner workaround for it.

(None of our tests that enable modifier support for X or Wayland+Xwayland work before there's a fix for this.)
Comment 7 Daniel Stone 2018-06-05 10:34:48 UTC
(In reply to Daniel Stone from comment #2)
> Jason - I'm happy to prepare the patch when back next week but don't know
> what's best. Should we check bounds in the function, or check in callers and
> assert in the function, or ... ?

Jason, any thoughts?
Comment 8 Jason Ekstrand 2018-06-05 13:08:38 UTC
Ugh... Sorry for letting this one slip.  I'll look at it today and see if I can come up with a proper fix.
Comment 9 Jason Ekstrand 2018-06-05 23:44:29 UTC
Created attachment 140039 [details] [review]
Patch to use RGBA non-sRGB formats for compatibility checks

I've attached a patch which I think should fix it.  Unfortunately, I've been trying all day and have yet to be able to reproduce it on a Sky Lake laptop.  Please test and let me know if that fixes the bug.
Comment 10 Eero Tamminen 2018-06-06 08:25:23 UTC
Created attachment 140047 [details]
Gdb backtrace of the X server crash, with Mesa patch

(In reply to Jason Ekstrand from comment #9)
> Created attachment 140039 [details] [review] [review]
> Patch to use RGBA non-sRGB formats for compatibility checks
> 
> I've attached a patch which I think should fix it.

Thanks!


> Please test and let me know if that fixes the bug.

Unfortunately it didn't, X still crashes.

Attached is backtrace with it. Mesa still gets format=ISL_FORMAT_UNSUPPORTED.


> Unfortunately, I've been trying all day and have yet to be able to
> reproduce it on a Sky Lake laptop.

Were you able to verify that you got modifier support for onscreen buffers?

Note: I haven't tested this with X server v1.20 release, only with git versions of X server.  However, I've tested X git versions both from before 1.20 (when X had everything required for modifiers) and after it, with the latest Mesa and there's crash with both.

Currently I'm using:

* kernel git://anongit.freedesktop.org/drm-tip at 2309ca0c3ab113e1e760045e230576e0ab4a88e2 2018-06-04 17:22:43

* xserver git://anongit.freedesktop.org/git/xorg/xserver at 4d5950ce14676f970d9de97380929a93948b98f2 2018-05-28 15:08:34 glamor: Propagate glamor_fds_from_pixmap error in glamor_fd_from_pixmap

* mesa git://anongit.freedesktop.org/git/mesa/mesa at 66c61797ada12e0e2b396affcc2dc495b6cc04ed 2018-06-05 09:39:04 autotools: add missing android file to package

* libdrm git://anongit.freedesktop.org/git/mesa/drm at c1f2d9b900e30119bcf6f88c0d11a0dd620fd060 2018-05-25 16:12:00 amdgpu: Destroy fd_hash table when the last device is removed.

* wayland-protocols git://anongit.freedesktop.org/wayland/wayland-protocols at c5f0f1a739aa1502d38915f1f17716b68227c300 2018-05-03 15:45:15 configure.ac: Bump version to 1.14

* libwayland git://anongit.freedesktop.org/wayland/wayland at 0e6ac72288f884ba0321c285a65578492c706534 2018-04-20 18:19:13 tests: Add free-without-remove test

* xcbproto git://anongit.freedesktop.org/xcb/proto at 55c7baa5e992a1c661b235acf84eb710cdb61bcc 2018-03-14 17:18:29 screensaver: Use CARD32 encoding for ScreenSaverSuspend 'suspend'

libxcb git://anongit.freedesktop.org/xcb/libxcb at 7e0f166579672d71efd819c81f0c932b0acd542c 2018-02-28 17:24:53 Release libxcb 1.13

xorgproto git://anongit.freedesktop.org/git/xorg/proto/xorgproto at 30a2013800296b354449706ebb01b51bc52cd388 2018-03-19 19:25:13 randrproto: Fix missing #undef RRLease

* libevdev git://anongit.freedesktop.org/git/libevdev master 4dd93f0108142a51261adddec77744fdda4e6348 2016-06-15 06:11:41 libevdev 1.5.2

* xf86-input-evdev git://anongit.freedesktop.org/git/xorg/driver/xf86-input-evdev master 7b1267f7f18c478d3dc34a7668eaefa402815891 2016-06-01 23:42:05 Support XINPUT ABI version 23 (threaded input)

* xf86-input-synaptics git://anongit.freedesktop.org/git/xorg/driver/xf86-input-synaptics master 59e5db025307404fbfbc82f2fb3fe91d6a3005d7 2016-05-23 01:35:27 conf: rename to 70-synaptics.conf

Rest is from Ubuntu 16.04.
Comment 11 Lionel Landwerlin 2018-06-06 11:58:42 UTC
(In reply to Eero Tamminen from comment #10)
> Created attachment 140047 [details]
> Gdb backtrace of the X server crash, with Mesa patch
> 

Thanks a lot for the backtrace.
This dri_format=4109 is __DRI_IMAGE_FORMAT_R16.
Now this is a mapping missing in the common dri part of mesa.
Attaching another patch.
Comment 12 Lionel Landwerlin 2018-06-06 11:59:02 UTC
Created attachment 140053 [details] [review]
dri: add missing 16bits formats mapping
Comment 13 Eero Tamminen 2018-06-06 14:47:33 UTC
(In reply to Lionel Landwerlin from comment #11)
> (In reply to Eero Tamminen from comment #10)
> > Created attachment 140047 [details]
> > Gdb backtrace of the X server crash, with Mesa patch
> 
> Thanks a lot for the backtrace.
> This dri_format=4109 is __DRI_IMAGE_FORMAT_R16.
>
> Now this is a mapping missing in the common dri part of mesa.
> Attaching another patch.

Still crashing.

Without Jason's patch: dri_format=4112
With Jason's patch:  dri_format=4110

I guess more DRI formats are needed?
Comment 14 Lionel Landwerlin 2018-06-06 14:59:52 UTC
Created attachment 140055 [details] [review]
dri: add missing 16bits formats mapping

Here is a v2 for the new dri_format (G16R16 this time).
I've gone through all the dri image formats, so hopefully we're not missing anything anymore.
Comment 15 Jason Ekstrand 2018-06-06 15:09:56 UTC
Somehow someone is asking is about modifiers support on a format we don't advertise as being supported.  This seems sketchy in so many ways...
Comment 16 Daniel Stone 2018-06-06 15:14:49 UTC
Not unless the original backtrace is somehow wrong:

#4  0x00007f19d4e6ff67 in eglQueryDmaBufModifiersEXT (dpy=0x23a76f0, format=808665688, max_modifiers=0, modifiers=0x0, external_only=0x0, num_modifiers=0x7ffd17dbfc94)

>>> hex(808665688)
'0x30334258'
>>> chr(0x58)
'X'
>>> chr(0x42)
'B'
>>> chr(0x33)
'3'
>>> chr(0x30)
'0'
Comment 17 Lionel Landwerlin 2018-06-06 15:19:34 UTC
(In reply to Jason Ekstrand from comment #15)
> Somehow someone is asking is about modifiers support on a format we don't
> advertise as being supported.  This seems sketchy in so many ways...

The ones I added into dri_utils.c are listed in intel_image_formats[].
Comment 18 Daniel Stone 2018-06-06 15:20:53 UTC
(In reply to Daniel Stone from comment #16)
> Not unless the original backtrace is somehow wrong:
> 
> #4  0x00007f19d4e6ff67 in eglQueryDmaBufModifiersEXT (dpy=0x23a76f0,
> format=808665688, max_modifiers=0, modifiers=0x0, external_only=0x0,
> num_modifiers=0x7ffd17dbfc94)

I missed the second one, which is in fact 'R16 '. That seems super broken. I have no idea why a client would be trying to create X11/DRI3 surfaces for those formats ... ?

No, wait, I understand. In the server, we cache the list of formats returned to us by the EGL implementation, and get the list of modifiers for every supported format.

Since the extension makes no distinction between formats used as sampler sources and as render targets, this means we'll end up trying to pull the list of supported modifiers for R8, R8G8, R16, R16G16, etc.

So this in fact seems completely correct from the X server.
Comment 19 Eero Tamminen 2018-06-06 16:17:52 UTC
(In reply to Lionel Landwerlin from comment #14)
> Created attachment 140055 [details] [review] [review]
> dri: add missing 16bits formats mapping
> 
> Here is a v2 for the new dri_format (G16R16 this time).
> I've gone through all the dri image formats, so hopefully we're not missing
> anything anymore.

A quick try of this patch alone -> still get the same crash?

---------------------------------------------------
Thread 1 "X" received signal SIGSEGV, Segmentation fault.
0x00007f4666cd5c2a in isl_format_supports_ccs_e (devinfo=devinfo@entry=0x1aa72e8, 
    format=ISL_FORMAT_UNSUPPORTED) at isl/isl_format.c:558
558	isl/isl_format.c: No such file or directory.
(gdb) bt
#0  0x00007f4666cd5c2a in isl_format_supports_ccs_e (devinfo=devinfo@entry=0x1aa72e8, 
    format=ISL_FORMAT_UNSUPPORTED) at isl/isl_format.c:558
#1  0x00007f4666ba1fac in modifier_is_supported (devinfo=devinfo@entry=0x1aa72e8, 
    fmt=fmt@entry=0x7f4666ed6f98 <intel_image_formats+216>, dri_format=4112, dri_format@entry=0, 
    modifier=modifier@entry=72057594037927940) at intel_screen.c:345
#2  0x00007f4666ba212a in intel_query_dma_buf_modifiers (_screen=<optimized out>, 
    fourcc=<optimized out>, max=0, modifiers=0x0, external_only=0x0, count=count@entry=0x7fff9c783a64)
    at intel_screen.c:1304
#3  0x00007f4664e05050 in dri2_query_dma_buf_modifiers (drv=<optimized out>, disp=<optimized out>, 
    format=<optimized out>, max=<optimized out>, modifiers=<optimized out>, 
    external_only=<optimized out>, count=0x7fff9c783a64) at drivers/dri2/egl_dri2.c:2394
#4  0x00007f4664df9f67 in eglQueryDmaBufModifiersEXT (dpy=0x1bc90a0, format=808665688, max_modifiers=0, 
    modifiers=0x0, external_only=0x0, num_modifiers=0x7fff9c783a64) at main/eglapi.c:2519
#5  0x00007f4667c21396 in glamor_get_modifiers (screen=<optimized out>, format=808665688, 
    num_modifiers=0x7fff9c783b04, modifiers=0x7fff9c783b10) at ../../../glamor/glamor_egl.c:647
#6  0x0000000000564416 in cache_formats_and_modifiers (screen=<optimized out>) at dri3_screen.c:196
#7  dri3_get_supported_modifiers (screen=<optimized out>, drawable=0x1da5f80, depth=<optimized out>, 
    bpp=<optimized out>, num_intersect_modifiers=num_intersect_modifiers@entry=0x7fff9c783b70, 
    intersect_modifiers=intersect_modifiers@entry=0x7fff9c783b80, num_screen_modifiers=0x7fff9c783b74, 
    screen_modifiers=0x7fff9c783b88) at dri3_screen.c:234
#8  0x0000000000563700 in proc_dri3_get_supported_modifiers (client=0x203e750) at dri3_request.c:373
#9  0x000000000043e6cb in Dispatch () at dispatch.c:478
---------------------------------------------------


Btw. With Weston/Xwayland this is even easier to trigger.  It's enough to run any X program (like xrandr or xhost), something that doesn't even use Mesa itself, and both Xwayland and Weston come crashing down.

I was unable to get backtrace out of Xwayland (with gdb server, Gdb wasn't able to get sync for some reason), but the same one-liner fix in comment 1 gets rid of that crash too.
Comment 20 Daniel Stone 2018-06-06 16:22:30 UTC
> #4  0x00007f4664df9f67 in eglQueryDmaBufModifiersEXT (dpy=0x1bc90a0, format=808665688, max_modifiers=0, modifiers=0x0, external_only=0x0, num_modifiers=0x7fff9c783a64) at main/eglapi.c:2519

This is XB30 aka DRM_FORMAT_XBGR2101010.
Comment 21 Lionel Landwerlin 2018-06-06 16:58:16 UTC
(In reply to Eero Tamminen from comment #19)
> (In reply to Lionel Landwerlin from comment #14)
> > Created attachment 140055 [details] [review] [review] [review]
> > dri: add missing 16bits formats mapping
> > 
> > Here is a v2 for the new dri_format (G16R16 this time).
> > I've gone through all the dri image formats, so hopefully we're not missing
> > anything anymore.
> 
> A quick try of this patch alone -> still get the same crash?
> 

You'll need Jason' patch unfortunately.
Comment 22 Lionel Landwerlin 2018-06-06 17:10:08 UTC
(In reply to Eero Tamminen from comment #19)
> (In reply to Lionel Landwerlin from comment #14)
> > Created attachment 140055 [details] [review] [review] [review]
> > dri: add missing 16bits formats mapping
> > 
> > Here is a v2 for the new dri_format (G16R16 this time).
> > I've gone through all the dri image formats, so hopefully we're not missing
> > anything anymore.
> 
> A quick try of this patch alone -> still get the same crash?
> 
> ---------------------------------------------------
> Thread 1 "X" received signal SIGSEGV, Segmentation fault.
> 0x00007f4666cd5c2a in isl_format_supports_ccs_e
> (devinfo=devinfo@entry=0x1aa72e8, 
>     format=ISL_FORMAT_UNSUPPORTED) at isl/isl_format.c:558
> 558	isl/isl_format.c: No such file or directory.
> (gdb) bt
> #0  0x00007f4666cd5c2a in isl_format_supports_ccs_e
> (devinfo=devinfo@entry=0x1aa72e8, 
>     format=ISL_FORMAT_UNSUPPORTED) at isl/isl_format.c:558
> #1  0x00007f4666ba1fac in modifier_is_supported
> (devinfo=devinfo@entry=0x1aa72e8, 
>     fmt=fmt@entry=0x7f4666ed6f98 <intel_image_formats+216>, dri_format=4112,
> dri_format@entry=0, 
>     modifier=modifier@entry=72057594037927940) at intel_screen.c:345

I'm really disturbed by the value of dri_format for this function now.
In the commit you pointed at for mesa, the caller should be this :

https://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/intel_screen.c?id=66c61797ada12e0e2b396affcc2dc495b6cc04ed#n1304

So dri_format should be 0, but the backtrace implies it's not...
Comment 23 Jason Ekstrand 2018-06-06 17:55:28 UTC
I think I finally managed to reproduce the issue.  This series should fix it:

https://patchwork.freedesktop.org/series/44365/

Lionel is writing a piglit test that emulates the X server's usage of the two queries to ensure that this won't happen again.
Comment 24 Jason Ekstrand 2018-06-06 17:56:18 UTC
(In reply to Lionel Landwerlin from comment #22)
> So dri_format should be 0, but the backtrace implies it's not...

That's because I did a bad thing in that function and it over-writes its own parameter.
Comment 26 Eero Tamminen 2018-06-07 08:07:35 UTC
(In reply to Jason Ekstrand from comment #23)
> I think I finally managed to reproduce the issue.  This series should fix it:
> 
> https://patchwork.freedesktop.org/series/44365/

Tested with and without Lionel's patch, on BXT, SKL & BDW.  Desktop boots up fine, and few tests I run on BXT work also fine.  I.e. I can verify it fixes the crash

(I don't build Mesa with debug, so I assume asserts aren't built in, in which case it's obvious from the patches that no crashes will be triggered as there's now guard for array values in addition to asserts.)

Tested-by: Eero Tamminen <eero.t.tamminen@intel.com>


> Lionel is writing a piglit test that emulates the X server's usage of the
> two queries to ensure that this won't happen again.

Excellent! :-)


PS. Background objects in GfxBench v5-GOLD2 AztecRuins seems to flicker a bit, but I think that's some unrelated issue.
Comment 27 Jason Ekstrand 2018-06-07 18:25:58 UTC
This should now be fixed by the following commits in master:

commit 0e7f3febf7e739c075a139ae641d65a0618752f3
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Jun 5 11:42:37 2018 -0700

    i965/screen: Use RGBA non-sRGB formats for images
    
    Not all of the MESA_FORMAT and ISL_FORMAT helpers we use can properly
    handle RGBX formats.  Also, we don't want to make decisions based on
    those in the first place because we can't render to RGBA and we use the
    non-sRGB version to determine whether or not to allow CCS_E.
    
    Cc: mesa-stable@lists.freedesktop.org
    Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>

commit a26693493570a9d0f0fba1be617e01ee7bfff4db
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Wed Jun 6 10:24:01 2018 -0700

    i965/screen: Return false for unsupported formats in query_modifiers
    
    Cc: mesa-stable@lists.freedesktop.org
    Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Comment 28 Eero Tamminen 2018-06-08 14:08:03 UTC
Verified.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.