Created attachment 41315
backtrace of Xorg, generated using gdb "backtrace full"
Hi,
on my ThinkPad T510 with an Intel Core i7, X crashes nearly every time after closing fullscreen Flash windows (e.g. when toggling fullscreen on and off on YouTube). This happens only if compositing (using KWin) is enabled. I am using Gentoo with Linux 2.6.37-rc6 (iirc 2.6.36 was affected too), Mesa 7.9.0 (iirc 7.8.1 was affected too) and xf86-video-intel-2.13.0. Let me know if you need further information. I can also follow instructions on IRC and test patches.
Created attachment 41316
the related Xorg.0.log
Looks like it is deep in mesa. Can you please run:
$ addr2line -e /usr/lib64/dri/i965_dri.so 0x7818f 0x6271b 0x5579a 0x131cfe 0x12d531 0x12f792 0xeeb23 0xeec18
or attach gdb and grab a bt?
I attached a backtrace. btw: I just tested nouveau on the same notebook (it's a model with nvidia optimus) and it does not crash. regards
(In reply to comment #3)
> I attached a backtrace. btw: I just tested nouveau on the same notebook (it's a
> model with nvidia optimus) and it does not crash.
So you did. I'm going senile. Thanks.
The immediate bug would be fixed by:
diff --git a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c b/src/mesa/drivers
index 76fc94d..9714cac 100644
--- a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
@@ -589,9 +589,11 @@ prepare_wm_surfaces(struct brw_context *brw)
    for (i = 0; i < ctx->DrawBuffer->_NumColorDrawBuffers; i++) {
       struct gl_renderbuffer *rb = ctx->DrawBuffer->_ColorDrawBuffers[i];
       struct intel_renderbuffer *irb = intel_renderbuffer(rb);
-      struct intel_region *region = irb ? irb->region : NULL;
 
-      brw_add_validated_bo(brw, region->buffer);
+      if (!irb || !irb->region)
+         continue;
+
+      brw_add_validated_bo(brw, irb->region->buffer);
       nr_surfaces = SURF_INDEX_DRAW(i) + 1;
    }
 }
But it doesn't explain how irb or irb->region was NULL there and whether that should have been handled much earlier.
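To make the effect of the patch concrete, here is a minimal standalone model of the patched loop. It assumes heavily simplified stand-ins for struct intel_renderbuffer and struct intel_region; the names prepare_draw_surfaces and add_validated_bo, and the one-to-one SURF_INDEX_DRAW mapping, are hypothetical and not Mesa's actual code. It shows that NULL entries are skipped rather than dereferenced, while nr_surfaces still tracks the highest valid draw buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the Mesa/i965 types;
 * only the fields the patched loop touches are modelled. */
struct bo { int id; };
struct intel_region { struct bo *buffer; };
struct intel_renderbuffer { struct intel_region *region; };

/* Assumption: draw buffer i maps to surface index i. */
#define SURF_INDEX_DRAW(i) (i)

/* Counts validated BOs; stands in for brw_add_validated_bo(). */
static int validated_count;

static void add_validated_bo(struct bo *bo)
{
   assert(bo != NULL);
   validated_count++;
}

/* Mirrors the patched loop: skip a draw buffer whose renderbuffer or
 * region is NULL instead of dereferencing it. Returns nr_surfaces. */
static int prepare_draw_surfaces(struct intel_renderbuffer **irbs, int n)
{
   int nr_surfaces = 0;

   for (int i = 0; i < n; i++) {
      struct intel_renderbuffer *irb = irbs[i];

      if (!irb || !irb->region)
         continue;

      add_validated_bo(irb->region->buffer);
      nr_surfaces = SURF_INDEX_DRAW(i) + 1;
   }
   return nr_surfaces;
}
```

With draw buffers {valid, missing region, valid, NULL}, the loop validates two BOs and returns 3; the skipped buffers are silently ignored, which is exactly why the patch hides the crash without explaining the NULL.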
(In reply to comment #5)
> The immediate bug would be fixed by:
That seems to have fixed the crash for me (thanks for the quick response :)), but there is still visual corruption and/or screen content that is not updated, which goes away when I disable my compositing manager. But it's still much better than a crashing X server :-)
> But it doesn't explain how irb or irb->region was NULL there and whether that
> should have been handled much earlier.
Maybe the screen corruption would go away if the real source of the problem was fixed? What can I do to help you debug this problem? Is it possible to attach gdb to X and generate meaningful backtraces without user interaction? I am trying to get a backtrace for a different hard-to-reproduce crash that happens about once a week, but I don't want to keep a second machine running all the time to debug X on my production machine ...
Any news on this? Fullscreen Flash video is not really usable. Do you need any more debugging work done by me, or can you reproduce the bug yourself?
Still present in mesa 7.10
*** Bug 33007 has been marked as a duplicate of this bug. ***
Created attachment 42084
gdb backtrace
Bug is also in Mesa master as of today (non-Gallium version). Triggered by going fullscreen and back with Opera and Flash playing some video. GM45, Linux 2.6.37, xorg 1.9.3, ddx from git master.
Program received signal SIGSEGV, Segmentation fault.
prepare_wm_surfaces (brw=0x1b61b00) at brw_wm_surface_state.c:528
528	 brw_add_validated_bo(brw, region->buffer);
<gdb> bt
#0  prepare_wm_surfaces (brw=0x1b61b00) at brw_wm_surface_state.c:528
#1  0x00007f32b15f7f76 in brw_validate_state (brw=0x1b61b00) at brw_state_upload.c:397
#2  0x00007f32b15e7615 in brw_try_draw_prims (ctx=0x1b61b00, arrays=0x1b963f8, prim=0x1b94b14, nr_prims=1, ib=0x0, index_bounds_valid=<value optimized out>, min_index=0, max_index=3) at brw_draw.c:362
#3  brw_draw_prims (ctx=0x1b61b00, arrays=0x1b963f8, prim=0x1b94b14, nr_prims=1, ib=0x0, index_bounds_valid=<value optimized out>, min_index=0, max_index=3) at brw_draw.c:447
#4  0x00007f32b16c3312 in vbo_exec_vtx_flush (exec=<value optimized out>, unmap=<value optimized out>) at vbo/vbo_exec_draw.c:382
#5  0x00007f32b16c105c in vbo_exec_FlushVertices_internal (ctx=<value optimized out>, unmap=<value optimized out>) at vbo/vbo_exec_api.c:912
#6  0x00007f32b16c122a in vbo_exec_FlushVertices (ctx=<value optimized out>, flags=1) at vbo/vbo_exec_api.c:946
#7  0x00007f32b179333e in _mesa_PopAttrib () at main/attrib.c:859
#8  0x00007f32cbb2b2b5 in KWin::PaintClipper::Iterator::~Iterator() () from /usr/lib64/libkwineffects.so.1
#9  0x00007f32cbb36868 in KWin::renderGLGeometry(QRegion const&, int, float const*, float const*, float const*, int, int) () from /usr/lib64/libkwineffects.so.1
[...]
*** Bug 33422 has been marked as a duplicate of this bug. ***
Something is still very wrong to hit this path at all, but this should prevent the crash:
commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 18 17:51:10 2011 +0000

    i965: Fallback on encountering a NULL render buffer

    Following a GPU hang, or other error, the render target is not likely to
    have an allocated BO and so we must fallback to avoid using it.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32534
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
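A minimal sketch of the control flow the commit message describes, assuming simplified hypothetical types (struct context, try_draw_hw and draw are illustrative names, not the real i965 code): the hardware path refuses a render target that has no region and raises a fallback flag, and the caller then takes a software path instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified types; not the real brw_context layout. */
struct region { int handle; };
struct renderbuffer { struct region *region; };
struct context {
   struct renderbuffer *draw;  /* current colour draw buffer */
   bool fallback;              /* plays the role of brw->intel.Fallback */
   int hw_draws, sw_draws;     /* counters make the chosen path observable */
};

/* Hardware path: refuse to draw when the render target has no region. */
static bool try_draw_hw(struct context *ctx)
{
   if (ctx->draw == NULL || ctx->draw->region == NULL) {
      ctx->fallback = true;
      return false;
   }
   ctx->hw_draws++;
   return true;
}

/* Draw entry point: if the hardware path bails, use the software path. */
static void draw(struct context *ctx)
{
   assert(ctx != NULL);
   ctx->fallback = false;
   if (!try_draw_hw(ctx))
      ctx->sw_draws++;  /* stand-in for the swrast/tnl fallback */
}
```

This only moves the drawing elsewhere: it is safe only if the software path can actually write to the same render target.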
Just compiled today's Mesa git master. Etracer is really unplayable now: I see a lot of "garbage" on the screen. It looks like the polygons are being drawn in the wrong places. For example, Tux's eyes are not attached to his head: they are floating in front of his head (which makes it look as if he were wearing sunglasses). If you need it, I could record a video to show you. After a few minutes of playing, X segfaults:
Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xb6f8dc91 in _swrast_write_rgba_span (ctx=0x947ea68, span=0xbfe019bc) at swrast/s_span.c:1275
#2  0xb6fa7b69 in general_triangle (ctx=0x947ea68, v0=0xb6229208, v1=0xb62293f0, v2=0xb62295d8) at swrast/s_tritemp.h:819
#3  0xb6f820be in _swrast_Triangle (ctx=0x947ea68, v0=0xb6229208, v1=0xb62293f0, v2=0xb62295d8) at swrast/s_context.c:709
#4  0xb6fb2870 in triangle_rgba (ctx=0x947ea68, e0=1, e1=2, e2=3) at swrast_setup/ss_tritmp.h:176
#5  0xb6f4ddde in _tnl_render_quads_verts (ctx=0x947ea68, start=0, count=4, flags=55) at tnl/t_vb_rendertmp.h:383
#6  0xb6f4f4e1 in run_render (ctx=0x947ea68, stage=0x908dc58) at tnl/t_vb_render.c:321
#7  0xb6f42f82 in _tnl_run_pipeline (ctx=<value optimized out>) at tnl/t_pipeline.c:153
#8  0xb6f43a49 in _tnl_draw_prims (ctx=0x947ea68, arrays=0x9145d10, prim=0x9144664, nr_prims=1, ib=0x0, min_index=0, max_index=3) at tnl/t_draw.c:524
#9  0xb6e46afa in brw_draw_prims (ctx=0x947ea68, arrays=0x9145d10, prim=0x9144664, nr_prims=1, ib=0x0, index_bounds_valid=1 '\001', min_index=0, max_index=3) at brw_draw.c:458
#10 0xb6f3a21d in vbo_exec_vtx_flush (exec=0x91444f0, unmap=1 '\001') at vbo/vbo_exec_draw.c:383
#11 0xb6f312b9 in vbo_exec_FlushVertices_internal (ctx=0x39d, unmap=8 '\b') at vbo/vbo_exec_api.c:912
#12 0xb6f31358 in vbo_exec_FlushVertices (ctx=0x39d, flags=1) at vbo/vbo_exec_api.c:946
#13 0xb701f4c1 in _mesa_PopAttrib () at main/attrib.c:859
#14 0xb72ae0de in __glXDisp_PopAttrib (pc=0xb63c4168 "\004") at indirect_dispatch.c:1443
#15 0xb72d6d29 in __glXDisp_Render (cl=0x92c1f88, pc=0xb63c4164 "\004") at glxcmds.c:1847
#16 0xb72db870 in __glXDispatch (client=0x92c1eb0) at glxext.c:600
#17 0x08070fff in Dispatch () at dispatch.c:432
#18 0x080625ba in main (argc=8, argv=0xbfe02c54, envp=0xbfe02c78) at main.c:291
After the crash I rebooted, and:
[pzanoni@mandriva ~]$ DISPLAY=:0 glxinfo | grep render
direct rendering: Yes
OpenGL renderer string: Mesa DRI Intel(R) Sandybridge Mobile GEM 20100330 DEVELOPMENT x86/MMX/SSE2
I didn't update the kernel, X, libdrm or the ddx.
A few things that occur looking at that bt:
* Look for some message indicating the root cause of the error that causes swrast.
* Apply the indirect glx opcode cache and reply with your tested-by.
* Fix the swrast bugs.
* Fix your system configuration and stop using indirect rendering on the local display.
I'd recommend doing the latter if nothing else.
(In reply to comment #14)
> A few things that occur looking at that bt:
>
> * Look for some message indicating the root cause of the error that causes
> swrast.
I didn't see one, at least not in Xorg.0.log or dmesg. I'll try to look elsewhere.
>
> * Apply the indirect glx opcode cache and reply with your tested-by.
Do you mean the patch named "glx: Cache indirect opcode->index conversion"? I just tested it, and I still get the same segfault in swrast with it applied.
>
> * Fix the swrast bugs.
:)
>
> * Fix your system configuration and stop using indirect rendering on the local
> display.
Should I ask upstream KDE to disable desktop effects by default on Intel machines? :P By the way, I'm also seeing bugs with direct rendering (like the armagetron one).
>
> I'd recommend doing the latter if nothing else.
Thanks for your help,
Paulo
(In reply to comment #13)
> Etracer is really unplayable now: I see a lot of "garbage" on the screen. It
> looks like the polygons are being drawn in the wrong places. For example, Tux's
> eyes are not attached to his head: they are floating in front of his head
> (which makes it look like if he was wearing sunglasses). If you need, I could
> record some kind of video to show you.
I just tested today's Mesa git master and I don't see this behavior anymore. The graphics are fine again, but the segfault still happens.
*** Bug 35260 has been marked as a duplicate of this bug. ***
Created attachment 44926
Full backtrace from crash in _swrast_write_rgba_span
I am also getting the crash in _swrast_write_rgba_span after applying the patch in commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2. In my case, the crash occurs when I unlock an OpenGL screensaver. (I am running KDE on Fedora 15 Alpha.)
One (possibly) interesting thing is that the crash does not always occur. If the KDE unlock dialog is displayed correctly, then I know that I will be able to unlock the screensaver without a crash. If only the password entry field is displayed (plus the "Switch User...", "Unlock", and "Cancel" buttons *if* I mouse over them), then I know that X will crash when I enter my password and press Enter. Thus far, the correlation has been 100%.
I spent a significant amount of time digging into this today, and I've been able to figure out the following sequence of events:
* The starting point is the GLMatrix screensaver running on KDE 4.6.2 (Fedora 15 x86_64, Core i7 2600 "HD 2000" GPU). At this point everything appears to be working fine.
* Hit a key, move the mouse, etc. to bring up the screensaver unlock dialog. If the dialog is rendered properly at this point, then the crash will not occur. Everything from here on is the incorrectly rendered case.
* The screensaver unlock dialog is not rendered correctly. Most or all of it is invisible (black on black). Various portions may appear if one mouses over or tabs to them.
* Type the password and press Enter.
* This is where I am able to catch the first sign of failure in the Mesa code (although the rendering problems indicate that something has already gone wrong, at least at the KDE level). drm_intel_bo_gem_create_from_name returns NULL to intel_region_alloc_for_handle. This NULL gets propagated up to intel_update_renderbuffers, which sets the region of the renderbuffer to NULL.
* When prepare_wm_surfaces tries to use this renderbuffer, it encounters the NULL region. This used to cause an immediate segfault, but it now detects the NULL region, sets brw->intel.Fallback to GL_TRUE, and bails.
* brw_draw_prims detects that brw_try_draw_prims failed, so it falls back to the software rasterizer, calling _swsetup_Wakeup and _tnl_draw_prims in turn.
* Eventually, it gets to _swrast_write_rgba_span, which tries to call the renderbuffer's PutRow function. Of course, the renderbuffer is an intel_renderbuffer, so its PutRow function is NULL, which causes the segfault we're seeing now.
Based on the last point, it seems like the software fallback that was introduced in commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2 is fundamentally broken. It clearly isn't possible to simply pass an intel_renderbuffer to the software rasterizer.
I really feel that I've done as much digging on this as someone unfamiliar with the codebase can be reasonably expected to do. My wife agrees, BTW. ;-) It would be *really* nice if someone familiar with how all of this is supposed to work could take a look at this.
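The last step of that sequence can be illustrated with a small hypothetical model; struct sw_renderbuffer, put_row_mem and write_span are made-up names, and only the role of PutRow comes from the analysis above. The software rasterizer writes spans through a function pointer, and a renderbuffer that never had one installed leaves it NULL. An unguarded call through that pointer is undefined behaviour and typically jumps to address 0, which matches frame "#0 0x00000000 in ?? ()" in the earlier swrast backtrace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sw_renderbuffer;

/* Span-writing hook, playing the role of a renderbuffer's PutRow. */
typedef void (*put_row_fn)(struct sw_renderbuffer *rb, int x, int y,
                           int count, const unsigned *rgba);

struct sw_renderbuffer {
   put_row_fn PutRow;      /* left NULL by a buffer only the GPU path uses */
   unsigned pixels[4][4];  /* tiny backing store for the in-memory case */
};

/* In-memory implementation of the hook. */
static void put_row_mem(struct sw_renderbuffer *rb, int x, int y,
                        int count, const unsigned *rgba)
{
   for (int i = 0; i < count; i++)
      rb->pixels[y][x + i] = rgba[i];
}

/* Writes one span; refuses instead of calling through a NULL hook. */
static bool write_span(struct sw_renderbuffer *rb, int x, int y,
                       int count, const unsigned *rgba)
{
   if (rb->PutRow == NULL)
      return false;  /* an unguarded call here would branch to address 0 */
   assert(y >= 0 && y < 4 && x >= 0 && x + count <= 4);
   rb->PutRow(rb, x, y, count, rgba);
   return true;
}
```

The guard only turns the crash into dropped rendering; it does not make the fallback useful, which is consistent with the conclusion that an intel_renderbuffer cannot simply be handed to swrast.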
Created attachment 45750
Backtrace of intel_region_alloc_for_handle memory allocation failure
A bit more information. The failure in drm_intel_bo_gem_create_from_name occurs when drmIoctl is called with DRM_IOCTL_GEM_OPEN. It is returning a "No such file or directory" error.
Created attachment 45782
Spreadsheet showing lifecycle of problematic GEM object
I modified drmIoctl to log GEM object lifecycle-related calls to syslog. The attached spreadsheet shows the log from a crash. (I used a spreadsheet because it allowed me to hide 1,300+ calls that aren't related to the problematic object without actually deleting those lines; I might have missed something.) The interesting lines are:
DRM_IOCTL_I915_GEM_CREATE(size: 14680064) succeeded -- handle: 8e
DRM_IOCTL_GEM_FLINK(handle: 8e) succeeded -- name: 3
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: f6, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: f6) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 221, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: 221) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 431, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: 431) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 43a, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: 43a) succeeded
DRM_IOCTL_GEM_CLOSE(handle: 8e) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) failed: No such file or directory
So there appear to be at least two things happening here:
1. Based on the fact that unlocking works sometimes, the root cause is almost certainly a race condition in KDE. However ...
2. There's very little prospect of that race condition ever being fixed (or even acknowledged) as long as Mesa is swallowing these errors and creating unusable renderbuffers.
I propose that, at the very least, a failure in intel_region_alloc_for_handle (and probably intel_region_alloc as well) needs to cause an error to be returned to the application. I will attempt to create a patch that does this, but it would be *really* helpful if someone with more knowledge of the internals of Mesa, GLX, etc. would step in and help out here.
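The logged sequence can be replayed with a toy model of GEM naming, assuming (as the log suggests) that a flink name stays valid only while at least one handle still references the object. This is a simplified illustration, not the kernel's implementation; gem_create, gem_flink, gem_open and gem_close are hypothetical helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of one GEM object; not the kernel implementation. */
struct gem_object {
   bool alive;
   int handle_refs;  /* open handles referencing the object */
   int flink_name;   /* 0 = no global name assigned */
};

static void gem_create(struct gem_object *obj)
{
   obj->alive = true;
   obj->handle_refs = 1;  /* the creator's handle */
   obj->flink_name = 0;
}

static int gem_flink(struct gem_object *obj, int name)
{
   if (!obj->alive)
      return -ENOENT;
   obj->flink_name = name;
   return 0;
}

/* Opening by name works only while some handle keeps the object alive;
 * each successful open adds a new handle. */
static int gem_open(struct gem_object *obj, int name)
{
   if (!obj->alive || obj->flink_name != name)
      return -ENOENT;
   obj->handle_refs++;
   return 0;
}

static void gem_close(struct gem_object *obj)
{
   assert(obj->alive && obj->handle_refs > 0);
   if (--obj->handle_refs == 0) {
      obj->alive = false;   /* last handle gone: object and name freed */
      obj->flink_name = 0;
   }
}
```

Replaying the log (create, flink as name 3, four open/close pairs, then closing the creator's handle 8e) makes the final open by name fail with -ENOENT, the errno whose message is "No such file or directory".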
Created attachment 45827
Backtrace when last handle to GEM object is being closed
I have been able to determine that the last handle to the GEM object is being closed during a call to CloseDownClient.
#0  drmIoctl (fd=8, request=1074291721, arg=0x7fff65bb88b0) at xf86drm.c:225
#1  0x00007f9dede84176 in drm_intel_gem_bo_free (bo=0x19d7780) at intel_bufmgr_gem.c:884
#2  0x00007f9dede8504c in drm_intel_gem_bo_unreference (bo=0x19d7780) at intel_bufmgr_gem.c:995
#3  drm_intel_gem_bo_unreference (bo=0x19d7780) at intel_bufmgr_gem.c:982
#4  0x00007f9dee09ed64 in intel_set_pixmap_bo (pixmap=0x44f3cb0, bo=0x0) at intel_uxa.c:638
#5  0x00007f9dee09ff34 in intel_uxa_destroy_pixmap (pixmap=0x44f3cb0) at intel_uxa.c:1105
#6  0x000000000052faa6 in damageDestroyPixmap (pPixmap=0x44f3cb0) at damage.c:1696
#7  0x00007f9def2068da in XvDestroyPixmap (pPix=0x44f3cb0) at xvmain.c:389
#8  0x00000000004f2916 in ShmDestroyPixmap (pPixmap=0x44f3cb0) at shm.c:276
#9  0x00007f9dee0b8652 in I830DRI2DestroyBuffer (drawable=0x1f1a200, buffer=0x445a830) at intel_dri.c:390
#10 0x00007f9dee2f3fe1 in DRI2DrawableGone (p=0x445a290, id=1092616195) at dri2.c:303
#11 0x0000000000459ebf in FreeClientResources (client=0x44b91d0) at resource.c:854
#12 0x0000000000430e13 in CloseDownClient (client=0x44b91d0) at dispatch.c:3461
#13 0x0000000000429331 in Dispatch () at dispatch.c:416
#14 0x0000000000421620 in main (argc=9, argv=0x7fff65bb8d08, envp=0x7fff65bb8d58) at main.c:287
Initial testing of the patch at http://lists.x.org/archives/xorg-devel/2011-March/020716.html looks good for solving the KDE/OpenGL screensaver unlock crash. The issue of "swallowing" GEM errors and creating renderbuffers with NULL regions and function pointers still exists.
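One way to avoid creating renderbuffers with NULL regions, in the spirit of the proposal earlier in this thread, is to have the import step report failure and let the caller decide how to surface it. A minimal sketch, assuming a hypothetical in-memory lookup table in place of drm_intel_bo_gem_create_from_name; region_from_name and update_renderbuffer are illustrative names, not Mesa functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified types; not Mesa's intel_region/renderbuffer. */
struct region { unsigned name; };
struct renderbuffer { struct region *region; };

/* Stand-in for an import-by-name that can fail (e.g. GEM_OPEN -> ENOENT). */
static struct region *region_from_name(struct region *table, size_t n,
                                       unsigned name)
{
   for (size_t i = 0; i < n; i++)
      if (table[i].name == name)
         return &table[i];
   return NULL;
}

/* Returns false and leaves rb unchanged when the import fails, so the
 * caller can report an error instead of drawing into a NULL region later. */
static bool update_renderbuffer(struct renderbuffer *rb,
                                struct region *table, size_t n,
                                unsigned name)
{
   struct region *r;

   assert(rb != NULL);
   r = region_from_name(table, n, name);
   if (r == NULL)
      return false;
   rb->region = r;
   return true;
}
```

Which GL error, if any, should reach the application is left open here; the thread does not settle that question.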
I'm subscribing to this bug, since I'm hitting it when unlocking KDE's screensaver under Debian Testing (KDE 4.6.2). Can someone let me know in which version of which package this should be fixed? Thanks a lot. Franco
OK, so this is not the bug I thought it was: I was assuming this was an instance of the "there was a GPU hang during the screensaver, and then when we come back the 2d driver is in wedged mode and everything breaks". But it looks like from comment #23 that there's actually some sort of race with the server handing us a bad buffer name. This may be fixed by the DRI2.n plan we've had (which would ensure that the buffer stays live) There's not much we can do when the X Server gives us a bad buffer name -- we're supposed to draw to the X Server's buffer.
As far as I know we haven't seen this in years. Please reopen if this is still an issue.