Bug 32534

Summary: [arrandale/sandybridge] i965 receiving stale flink name from DRI2
Product: xorg Reporter: Fabian Henze <flyser42>
Component: Server/Ext/DRI Assignee: Xorg Project Team <xorg-team>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: major    
Priority: medium CC: arekm, arequipeno, frapell, keithp, michal, post+fdo, przanoni, rasasi78
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Bug Depends on: 35452    
Bug Blocks:    
Attachments:
Description Flags
backtrace of Xorg. generated using gdb "backtrace full"
none
the related Xorg.0.log
none
gdb backtrace
none
Full backtrace from crash in _swrast_write_rgba_span
none
Backtrace of intel_region_alloc_for_handle memory allocation failure
none
Spreadsheet showing lifecycle of problematic GEM object
none
Backtrace when last handle to GEM object is being closed
none

Description Fabian Henze 2010-12-20 13:15:49 UTC
Created attachment 41315 [details]
backtrace of Xorg. generated using gdb "backtrace full"

Hi,
on my Thinkpad T510 with an Intel Core i7, X crashes nearly every time after closing fullscreen Flash windows (e.g. when switching fullscreen on YouTube on and off). This happens only if compositing (using KWin) is enabled.
I am using Gentoo with Linux 2.6.37-rc6 (iirc 2.6.36 was affected too), Mesa 7.9.0 (iirc 7.8.1 was affected too) and xf86-video-intel-2.13.0.

Let me know if you need further information. I can also follow instructions on IRC and test patches.
Comment 1 Fabian Henze 2010-12-20 13:16:57 UTC
Created attachment 41316 [details]
the related Xorg.0.log
Comment 2 Chris Wilson 2010-12-20 13:37:59 UTC
Looks like it is deep in mesa. Can you please run:

$ addr2line -e /usr/lib64/dri/i965_dri.so 0x7818f 0x6271b 0x5579a 0x131cfe 0x12d531 0x12f792 0xeeb23 0xeec18

or attach gdb and grab a bt?
Comment 3 Fabian Henze 2010-12-20 13:43:44 UTC
I attached a backtrace. btw: I just tested nouveau on the same notebook (it's a model with nvidia optimus) and it does not crash.

regards
Comment 4 Chris Wilson 2010-12-20 13:48:06 UTC
(In reply to comment #3)
> I attached a backtrace. btw: I just tested nouveau on the same notebook (it's a
> model with nvidia optimus) and it does not crash.

So you did. I'm going senile. Thanks.
Comment 5 Chris Wilson 2010-12-21 03:10:56 UTC
The immediate bug would be fixed by:

diff --git a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c b/src/mesa/drivers
index 76fc94d..9714cac 100644
--- a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
@@ -589,9 +589,11 @@ prepare_wm_surfaces(struct brw_context *brw)
       for (i = 0; i < ctx->DrawBuffer->_NumColorDrawBuffers; i++) {
         struct gl_renderbuffer *rb = ctx->DrawBuffer->_ColorDrawBuffers[i];
         struct intel_renderbuffer *irb = intel_renderbuffer(rb);
-        struct intel_region *region = irb ? irb->region : NULL;
 
-        brw_add_validated_bo(brw, region->buffer);
+         if (!irb || !irb->region)
+            continue;
+
+        brw_add_validated_bo(brw, irb->region->buffer);
         nr_surfaces = SURF_INDEX_DRAW(i) + 1;
       }
    }

But it doesn't explain how irb or irb->region was NULL there and whether that should have been handled much earlier.
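
The pattern in that diff can be shown in isolation. The following is a minimal sketch, not the Mesa code: the three structs and `count_usable_draw_buffers` are hypothetical stand-ins that keep only the fields needed to show the NULL check happening before `region->buffer` is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Mesa types; only the fields used here. */
struct buffer_object { int id; };
struct region { struct buffer_object *buffer; };
struct renderbuffer { struct region *region; };

/* Counts the draw buffers that have a usable buffer object, skipping any
 * slot whose renderbuffer or region is NULL (the guard added by the diff). */
int count_usable_draw_buffers(struct renderbuffer **rbs, int n)
{
    int usable = 0;
    int i;

    assert(rbs != NULL);
    assert(n >= 0);

    for (i = 0; i < n; i++) {
        struct renderbuffer *rb = rbs[i];

        /* The pre-patch code checked rb but then dereferenced a
         * possibly-NULL region; check both before going further. */
        if (!rb || !rb->region)
            continue;

        if (rb->region->buffer)
            usable++;
    }
    return usable;
}
```

With one valid renderbuffer, one renderbuffer whose region is NULL, and one NULL slot, the function reports one usable buffer instead of crashing on the second entry.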
Comment 6 Fabian Henze 2010-12-21 04:45:30 UTC
(In reply to comment #5)
> The immediate bug would be fixed by:

That seems to have fixed the bug for me (thanks for the quick response :)), but it still causes visual corruption and/or stale screen content that goes away when I disable my compositing manager. But it's still much better than a crashing X server :-)

> But it doesn't explain how irb or irb->region was NULL there and whether that
> should have been handled much earlier.

Maybe the screen corruption would go away if the real source of the problem were fixed? What can I do to help you debug this problem?

Is it possible to attach gdb to X and generate meaningful backtraces without user interaction? I am trying to get a backtrace for a different hard-to-reproduce crash that happens about once a week, but I don't want to have a second machine running all the time to debug X on my production machine ...
Comment 7 Fabian Henze 2010-12-28 11:28:02 UTC
Any news on this? Fullscreen Flash video is not really usable.
Do you need any more debugging work done by me, or can you reproduce the bug yourself?
Comment 8 Fabian Henze 2011-01-11 05:44:58 UTC
Still present in mesa 7.10
Comment 9 Chris Wilson 2011-01-11 15:18:02 UTC
*** Bug 33007 has been marked as a duplicate of this bug. ***
Comment 10 Arkadiusz Miskiewicz 2011-01-15 11:46:40 UTC
Created attachment 42084 [details]
gdb backtrace

Bug is also in Mesa master as of today (non-Gallium version). Triggered by going fullscreen and back with Opera and Flash playing some video. GM45, Linux 2.6.37, xorg 1.9.3, ddx from git master.

Program received signal SIGSEGV, Segmentation fault.
prepare_wm_surfaces (brw=0x1b61b00) at brw_wm_surface_state.c:528
528              brw_add_validated_bo(brw, region->buffer);
<gdb> bt
#0  prepare_wm_surfaces (brw=0x1b61b00) at brw_wm_surface_state.c:528
#1  0x00007f32b15f7f76 in brw_validate_state (brw=0x1b61b00) at brw_state_upload.c:397
#2  0x00007f32b15e7615 in brw_try_draw_prims (ctx=0x1b61b00, arrays=0x1b963f8, prim=0x1b94b14, nr_prims=1, ib=0x0, index_bounds_valid=<value optimized out>, min_index=0,
    max_index=3) at brw_draw.c:362
#3  brw_draw_prims (ctx=0x1b61b00, arrays=0x1b963f8, prim=0x1b94b14, nr_prims=1, ib=0x0, index_bounds_valid=<value optimized out>, min_index=0, max_index=3) at brw_draw.c:447
#4  0x00007f32b16c3312 in vbo_exec_vtx_flush (exec=<value optimized out>, unmap=<value optimized out>) at vbo/vbo_exec_draw.c:382
#5  0x00007f32b16c105c in vbo_exec_FlushVertices_internal (ctx=<value optimized out>, unmap=<value optimized out>) at vbo/vbo_exec_api.c:912
#6  0x00007f32b16c122a in vbo_exec_FlushVertices (ctx=<value optimized out>, flags=1) at vbo/vbo_exec_api.c:946
#7  0x00007f32b179333e in _mesa_PopAttrib () at main/attrib.c:859
#8  0x00007f32cbb2b2b5 in KWin::PaintClipper::Iterator::~Iterator() () from /usr/lib64/libkwineffects.so.1
#9  0x00007f32cbb36868 in KWin::renderGLGeometry(QRegion const&, int, float const*, float const*, float const*, int, int) () from /usr/lib64/libkwineffects.so.1
[...]
Comment 11 Chris Wilson 2011-02-20 04:28:20 UTC
*** Bug 33422 has been marked as a duplicate of this bug. ***
Comment 12 Chris Wilson 2011-02-21 05:20:26 UTC
Something is still very wrong to hit this path at all, but this should prevent the crash:

commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 18 17:51:10 2011 +0000

    i965: Fallback on encountering a NULL render buffer
    
    Following a GPU hang, or other error, the render target is not likely to
    have an allocated BO and so we must fallback to avoid using it.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32534
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 13 Paulo Zanoni 2011-02-21 06:28:23 UTC
Just compiled today's Mesa git master.

Etracer is really unplayable now: I see a lot of "garbage" on the screen. It looks like the polygons are being drawn in the wrong places. For example, Tux's eyes are not attached to his head: they are floating in front of his head (which makes it look as if he were wearing sunglasses). If you need, I could record some kind of video to show you.

After a few minutes playing, X segfaults:

Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xb6f8dc91 in _swrast_write_rgba_span (ctx=0x947ea68, span=0xbfe019bc) at swrast/s_span.c:1275
#2  0xb6fa7b69 in general_triangle (ctx=0x947ea68, v0=0xb6229208, v1=0xb62293f0, v2=0xb62295d8) at swrast/s_tritemp.h:819
#3  0xb6f820be in _swrast_Triangle (ctx=0x947ea68, v0=0xb6229208, v1=0xb62293f0, v2=0xb62295d8) at swrast/s_context.c:709
#4  0xb6fb2870 in triangle_rgba (ctx=0x947ea68, e0=1, e1=2, e2=3) at swrast_setup/ss_tritmp.h:176
#5  0xb6f4ddde in _tnl_render_quads_verts (ctx=0x947ea68, start=0, count=4, flags=55) at tnl/t_vb_rendertmp.h:383
#6  0xb6f4f4e1 in run_render (ctx=0x947ea68, stage=0x908dc58) at tnl/t_vb_render.c:321
#7  0xb6f42f82 in _tnl_run_pipeline (ctx=<value optimized out>) at tnl/t_pipeline.c:153
#8  0xb6f43a49 in _tnl_draw_prims (ctx=0x947ea68, arrays=0x9145d10, prim=0x9144664, nr_prims=1, ib=0x0, min_index=0, max_index=3) at tnl/t_draw.c:524
#9  0xb6e46afa in brw_draw_prims (ctx=0x947ea68, arrays=0x9145d10, prim=0x9144664, nr_prims=1, ib=0x0, index_bounds_valid=1 '\001', min_index=0, max_index=3) at brw_draw.c:458
#10 0xb6f3a21d in vbo_exec_vtx_flush (exec=0x91444f0, unmap=1 '\001') at vbo/vbo_exec_draw.c:383
#11 0xb6f312b9 in vbo_exec_FlushVertices_internal (ctx=0x39d, unmap=8 '\b') at vbo/vbo_exec_api.c:912
#12 0xb6f31358 in vbo_exec_FlushVertices (ctx=0x39d, flags=1) at vbo/vbo_exec_api.c:946
#13 0xb701f4c1 in _mesa_PopAttrib () at main/attrib.c:859
#14 0xb72ae0de in __glXDisp_PopAttrib (pc=0xb63c4168 "\004") at indirect_dispatch.c:1443
#15 0xb72d6d29 in __glXDisp_Render (cl=0x92c1f88, pc=0xb63c4164 "\004") at glxcmds.c:1847
#16 0xb72db870 in __glXDispatch (client=0x92c1eb0) at glxext.c:600
#17 0x08070fff in Dispatch () at dispatch.c:432
#18 0x080625ba in main (argc=8, argv=0xbfe02c54, envp=0xbfe02c78) at main.c:291


After the crash I rebooted, and:

[pzanoni@mandriva ~]$ DISPLAY=:0 glxinfo | grep render
direct rendering: Yes
OpenGL renderer string: Mesa DRI Intel(R) Sandybridge Mobile GEM 20100330 DEVELOPMENT x86/MMX/SSE2

I didn't update kernel, X, libdrm or ddx.
Comment 14 Chris Wilson 2011-02-24 15:16:18 UTC
A few things that occur looking at that bt:

* Look for some message indicating the root cause of the error that causes swrast.

* Apply the indirect glx opcode cache and reply with your tested-by.

* Fix the swrast bugs.

* Fix your system configuration and stop using indirect rendering on the local display.

I'd recommend doing the latter if nothing else.
Comment 15 Paulo Zanoni 2011-02-25 13:12:45 UTC
(In reply to comment #14)
> A few things that occur looking at that bt:
> 
> * Look for some message indicating the root cause of the error that causes
> swrast.

Didn't see, at least in Xorg.0.log or dmesg. I'll try to look elsewhere.

> 
> * Apply the indirect glx opcode cache and reply with your tested-by.
 
Do you mean the patch with this name:
"glx: Cache indirect opcode->index conversion" ?

I just tested. I still get the same segfault on swrast after running it.

> 
> * Fix the swrast bugs.

:)

> 
> * Fix your system configuration and stop using indirect rendering on the local
> display.

Should I ask upstream KDE to disable desktop effects by default on Intel machines? :P

By the way, I'm also seeing bugs on direct rendering (like the armagetron one).

> 
> I'd recommend doing the latter if nothing else.

Thanks for your help,
Paulo
Comment 16 Paulo Zanoni 2011-02-25 13:13:51 UTC
(In reply to comment #13)
> Etracer is really unplayable now: I see a lot of "garbage" on the screen. It
> looks like the polygons are being drawn in the wrong places. For example, Tux's
> eyes are not attached to his head: they are floating in front of his head
> (which makes it look as if he were wearing sunglasses). If you need, I could
> record some kind of video to show you.

I just tested today's Mesa git-master and I don't see this behavior anymore. The graphics are fine again, but the segfault still happens.
Comment 17 Chris Wilson 2011-03-14 09:59:02 UTC
*** Bug 35260 has been marked as a duplicate of this bug. ***
Comment 18 Ian Pilcher 2011-03-27 17:11:38 UTC
Created attachment 44926 [details]
Full backtrace from crash in _swrast_write_rgba_span

I am also getting the crash in _swrast_write_rgba_span, after applying the
patch in commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2.  In my case, the
crash occurs when I unlock an OpenGL screensaver.  (I am running KDE on
Fedora 15 Alpha.)

One (possibly) interesting thing is that the crash does not always occur.
If the KDE unlock dialog is displayed correctly, then I know that I will
be able to unlock the screensaver without a crash.  If only the password
entry field is displayed (and the "Switch User...", "Unlock", and "Cancel"
buttons *if* I mouse over them), then I know that X will crash when I
enter my password and press enter.  Thus far, the correlation has been 100%.
Comment 19 Ian Pilcher 2011-04-17 19:55:11 UTC
I spent a significant amount of time digging into this today, and I've been
able to figure out the following sequence of events:

* Starting point is GLMatrix screensaver running on KDE 4.6.2 (Fedora 15
  x86_64, Core i7 2600 "HD 2000" GPU).  At this point everything appears
  to be working fine.

* Hit a key, move the mouse, etc. to bring up the screensaver unlock dialog.
  If the dialog is rendered properly at this point, then the crash will not
  occur.  Everything from here on is the incorrectly rendered case.

* The screensaver unlock dialog is not rendered correctly.  Most or all of
  it is invisible (black on black).  Various portions may appear as one
  "mouses over" or tabs to them.

* Type the password and press Enter.

* This is where I am able to catch the first sign of failure in the Mesa
  code (although the rendering problems indicate that something has already
  gone wrong, at least at the KDE level).

  drm_intel_bo_gem_create_from_name returns NULL to
  intel_region_alloc_for_handle.  This NULL gets propagated up to
  intel_update_renderbuffers, which sets the region of the renderbuffer
  to NULL.

* When prepare_wm_surfaces tries to use this renderbuffer, it encounters
  the NULL region.  This used to cause an immediate segfault, but it now
  detects the NULL region, sets brw->intel.Fallback to GL_TRUE, and bails.

* brw_draw_prims detects that brw_try_draw_prims failed, so it falls back
  to the software rasterizer, calling _swsetup_Wakeup and _tnl_draw_prims
  in turn.

* Eventually, it gets to _swrast_write_rgba_span, which tries to call the
  renderbuffer's PutRow function.  Of course, the renderbuffer is an
  intel_renderbuffer, so its PutRow function is NULL, which causes the
  segfault we're seeing now.

Based on the last point, it seems like the software fallback that was
introduced in commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2 is
fundamentally broken.  It clearly isn't possible to simply pass an
intel_renderbuffer to the software rasterizer.

I really feel that I've done as much digging on this as someone unfamiliar
with the codebase can be reasonably expected to do.  My wife agrees, BTW.
;-)  It would be *really* nice if someone familiar with how all of this is
supposed to work could take a look at this.
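
The last step of that sequence can be illustrated with a small sketch. This is a simplified model under stated assumptions, not Mesa's swrast: `struct renderbuffer`, `soft_put_row` and `write_span` are hypothetical names, and the real `_swrast_write_rgba_span` does not necessarily perform this check. It only shows why calling a NULL row-writing function pointer ends up at address 0, and what a guarded call could look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified renderbuffer. A software renderbuffer
 * fills in put_row; in this model a hardware one leaves it NULL. */
struct renderbuffer {
    void (*put_row)(struct renderbuffer *rb, int count,
                    const unsigned char *rgba);
    int rows_written;
};

/* Stand-in for a software row writer: it just records that it ran. */
void soft_put_row(struct renderbuffer *rb, int count,
                  const unsigned char *rgba)
{
    (void)count;
    (void)rgba;
    rb->rows_written++;
}

/* Returns 0 on success, -1 if the renderbuffer has no software writer. */
int write_span(struct renderbuffer *rb, int count, const unsigned char *rgba)
{
    assert(rb != NULL);

    /* Without this check the call below would jump to address 0, which
     * is what a backtrace frame of "0x00000000 in ?? ()" looks like. */
    if (rb->put_row == NULL)
        return -1;

    rb->put_row(rb, count, rgba);
    return 0;
}
```

In this model a renderbuffer with `put_row` set accepts the span, while one with a NULL `put_row` is rejected with an error instead of crashing the process.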
Comment 20 Ian Pilcher 2011-04-17 20:01:57 UTC
Created attachment 45750 [details]
Backtrace of intel_region_alloc_for_handle memory allocation failure
Comment 21 Ian Pilcher 2011-04-18 10:04:46 UTC
A bit more information.  The failure in drm_intel_bo_gem_create_from_name
occurs when drmIoctl is called with DRM_IOCTL_GEM_OPEN.  It is returning a
"No such file or directory" error.
Comment 22 Ian Pilcher 2011-04-18 13:28:10 UTC
Created attachment 45782 [details]
Spreadsheet showing lifecycle of problematic GEM object

I modified drmIoctl to log GEM object lifecycle-related calls to syslog.  The
attached spreadsheet shows the log from a crash.  (I used a spreadsheet,
because it allowed me to hide 1,300+ calls that aren't related to the
problematic object, without actually deleting those lines; I might have
missed something.)  The interesting lines are:

  DRM_IOCTL_I915_GEM_CREATE(size: 14680064) succeeded -- handle: 8e
  DRM_IOCTL_GEM_FLINK(handle: 8e) succeeded -- name: 3
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: f6, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: f6) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 221, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: 221) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 431, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: 431) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 43a, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: 43a) succeeded
  DRM_IOCTL_GEM_CLOSE(handle: 8e) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) failed: No such file or directory

So there appear to be at least two things happening here:

  1.  Based on the fact that unlocking works sometimes, the root cause is
      almost certainly a race condition in KDE.  However ...

  2.  There's very little prospect of that race condition ever being
      fixed (or even acknowledged) as long as Mesa is swallowing these
      errors and creating unusable renderbuffers.

I propose that, at the very least, a failure in intel_region_alloc_for_handle
(and probably intel_region_alloc as well) needs to cause an error to be returned
to the application.  I will attempt to create a patch that does this, but it
would be *really* helpful if someone with more knowledge of the internals of
Mesa, GLX, etc. would step in and help out here.
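
The log lines above suggest a simple lifecycle rule: the flink name resolves only while at least one handle to the object is still open. The sketch below is a toy model of that rule under that assumption, not kernel or libdrm code; `struct gem_object` and the `gem_*` functions are hypothetical and only mirror the ioctl names that appear in the log.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of a single GEM object: one global flink name plus a count
 * of open handles. Assumption: the name is released with the last handle. */
struct gem_object {
    int name;          /* 0 means "no flink name published" */
    int handle_count;
};

/* Models GEM_CREATE: the creator receives the first handle. */
void gem_create(struct gem_object *obj)
{
    obj->name = 0;
    obj->handle_count = 1;
}

/* Models GEM_FLINK: publishes a global name for a live object. */
int gem_flink(struct gem_object *obj, int name)
{
    assert(name > 0);
    if (obj->handle_count == 0)
        return -ENOENT;
    obj->name = name;
    return name;
}

/* Models GEM_OPEN: returns 0 and takes a handle, or -ENOENT if stale. */
int gem_open(struct gem_object *obj, int name)
{
    if (obj->name == 0 || obj->name != name)
        return -ENOENT;
    obj->handle_count++;
    return 0;
}

/* Models GEM_CLOSE: dropping the last handle also drops the name. */
void gem_close(struct gem_object *obj)
{
    assert(obj->handle_count > 0);
    if (--obj->handle_count == 0)
        obj->name = 0;
}
```

Replaying the logged sequence (create, flink as name 3, open and close from a second user, then close the creator's handle) makes the final `gem_open` return `-ENOENT`, which corresponds to the "No such file or directory" error.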
Comment 23 Ian Pilcher 2011-04-19 12:50:04 UTC
Created attachment 45827 [details]
Backtrace when last handle to GEM object is being closed

I have been able to determine that the last handle to the GEM object is
being closed during a call to CloseDownClient.

#0  drmIoctl (fd=8, request=1074291721, arg=0x7fff65bb88b0) at xf86drm.c:225
#1  0x00007f9dede84176 in drm_intel_gem_bo_free (bo=0x19d7780) at intel_bufmgr_gem.c:884
#2  0x00007f9dede8504c in drm_intel_gem_bo_unreference (bo=0x19d7780) at intel_bufmgr_gem.c:995
#3  drm_intel_gem_bo_unreference (bo=0x19d7780) at intel_bufmgr_gem.c:982
#4  0x00007f9dee09ed64 in intel_set_pixmap_bo (pixmap=0x44f3cb0, bo=0x0) at intel_uxa.c:638
#5  0x00007f9dee09ff34 in intel_uxa_destroy_pixmap (pixmap=0x44f3cb0) at intel_uxa.c:1105
#6  0x000000000052faa6 in damageDestroyPixmap (pPixmap=0x44f3cb0) at damage.c:1696
#7  0x00007f9def2068da in XvDestroyPixmap (pPix=0x44f3cb0) at xvmain.c:389
#8  0x00000000004f2916 in ShmDestroyPixmap (pPixmap=0x44f3cb0) at shm.c:276
#9  0x00007f9dee0b8652 in I830DRI2DestroyBuffer (drawable=0x1f1a200, buffer=0x445a830) at intel_dri.c:390
#10 0x00007f9dee2f3fe1 in DRI2DrawableGone (p=0x445a290, id=1092616195) at dri2.c:303
#11 0x0000000000459ebf in FreeClientResources (client=0x44b91d0) at resource.c:854
#12 0x0000000000430e13 in CloseDownClient (client=0x44b91d0) at dispatch.c:3461
#13 0x0000000000429331 in Dispatch () at dispatch.c:416
#14 0x0000000000421620 in main (argc=9, argv=0x7fff65bb8d08, envp=0x7fff65bb8d58) at main.c:287
Comment 24 Ian Pilcher 2011-04-23 18:56:17 UTC
Initial testing of the patch at
http://lists.x.org/archives/xorg-devel/2011-March/020716.html is looking good
for solving the KDE/OpenGL screensaver unlock crash.

The issue of "swallowing" GEM errors and creating render buffers with NULL
regions and function pointers still exists.
Comment 25 Franco Pellegrini 2011-05-09 04:26:41 UTC
Subscribing to this bug, since I'm hitting it when unlocking KDE's screensaver under Debian Testing (KDE 4.6.2).

Can someone let me know in which version of which package this should be fixed?

thanks a lot.

Franco
Comment 26 Eric Anholt 2012-10-03 19:50:18 UTC
OK, so this is not the bug I thought it was:  I was assuming this was an instance of "there was a GPU hang during the screensaver, and then when we come back the 2d driver is in wedged mode and everything breaks".  But from comment #23 it looks like there's actually some sort of race with the server handing us a bad buffer name.  This may be fixed by the DRI2.n plan we've had (which would ensure that the buffer stays live).

There's not much we can do when the X Server gives us a bad buffer name -- we're supposed to draw to the X Server's buffer.
Comment 27 Adam Jackson 2018-06-13 16:55:59 UTC
As far as I know we haven't seen this in years. Please reopen if this is still an issue.
