http://bugzilla.gnome.org/show_bug.cgi?id=581526 describes a scenario where XCreateWindow appears to reuse an XID while the DestroyNotify for the previous owner of that XID is still sitting in the event queue. This causes GDK to get confused, and things go downhill from there.
It seems unreasonable to demand that clients peek the queue for pending destroy notifies whenever they want to create a window, in particular since this problem does not occur without the resource-reusing extension.
I can't see a reasonable way for either Xlib or the Xserver to guarantee that
XIDs in clients' event queues are unique.
The X server has handed off the DestroyNotify event, so it thinks it has
finished with the event.
Xlib could ensure not to allocate an XID referenced in its own event queue
(for known event types), but it wouldn't know what other clients might have a
reference to a candidate XID sitting in their event queues.
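To make the Xlib-side idea concrete, here is a minimal sketch in Python rather than a patch against Xlib. The `Event` type and `pick_xid` helper are hypothetical names invented for illustration. It only consults the local queue, so it has exactly the limitation described above: it cannot see events queued in other clients.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # e.g. "DestroyNotify"
    window: int  # XID the event refers to


def pick_xid(free_xids, pending_events):
    """Return the first free XID not named by any locally queued event."""
    referenced = {ev.window for ev in pending_events}
    for xid in free_xids:
        if xid not in referenced:
            return xid
    # Every candidate is still referenced; the caller would need more XIDs.
    return None


# A DestroyNotify for 0x400001 is still unread, so that XID is skipped.
queue = deque([Event("DestroyNotify", 0x400001)])
chosen = pick_xid([0x400001, 0x400002], queue)
```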
If the server were to keep XIDs of destroyed windows allocated until clients
have processed events on that window, it would need to know when the events in
Xlib's queue have been processed. I can't see how the Xserver can know this
(without some change in protocol).
The other way of looking at this is that the events are a history of what has
happened and need to be interpreted in the context of when they happened.
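One way a client could read events "in context" is to compare each event's serial (X events carry the sequence number of the last request the server had processed) with the serial of the request that created the current owner of the XID. The sketch below is a Python simulation under that assumption. `WindowTable` and its methods are hypothetical names, not GDK or Xlib API, and nothing in this report establishes that GDK could adopt the approach as-is.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str
    window: int
    serial: int  # last request the server had processed when it sent this


class WindowTable:
    """Client-side XID table that remembers when each window was created."""

    def __init__(self):
        self.created_at = {}  # XID -> serial of the creating request

    def on_create(self, xid, request_serial):
        self.created_at[xid] = request_serial

    def on_event(self, ev):
        """Return True if the event was applied, False if it was stale."""
        born = self.created_at.get(ev.window)
        if born is None or ev.serial < born:
            # Unknown window, or the event predates the current owner of
            # this XID, so it describes the previous owner.
            return False
        if ev.kind == "DestroyNotify":
            del self.created_at[ev.window]
        return True


table = WindowTable()
table.on_create(0x400001, request_serial=120)
# Stale notification left over from the previous owner of the XID.
stale = table.on_event(Event("DestroyNotify", 0x400001, serial=117))
# Genuine notification for the window created at serial 120.
fresh = table.on_event(Event("DestroyNotify", 0x400001, serial=125))
```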
This bug causes serious problems for some of us. In my case (after bug 20254 was fixed), I believe it is the reason most of my Firefox sessions terminate, sometimes after producing the disembodied windows mentioned in the first comment at https://bugzilla.gnome.org/show_bug.cgi?id=581526.
So, even if it's not possible to completely prevent an XID from being reused before the events that reference it have been processed, perhaps reuse could be made so unlikely that it won't happen in reasonable circumstances? I am thinking of the way process IDs work: each one is higher than the previous one assigned until the counter hits its limit and wraps around, and any unallocated XIDs that old would hopefully no longer be sitting in queues.
I tried to take a look at the code but quickly came to the conclusion that this isn't something I personally could just jump into. So I don't know if it's a feasible suggestion or not. If not, perhaps there could be some similar workaround to delay a given ID's reuse until it's simply unlikely to be a problem.
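The PID-style scheme suggested above can be sketched as follows. This is a Python illustration of the idea only, not the server's or Xlib's actual allocator. The class name, the `base` and `count` parameters, and the tiny range are all assumptions chosen to keep the example short.

```python
class MonotonicXidAllocator:
    """Hand out XIDs in increasing order, wrapping only at the range end."""

    def __init__(self, base, count):
        self.base = base
        self.count = count
        self.next_offset = 0
        self.in_use = set()

    def allocate(self):
        # Scan forward from the last position, skipping IDs still in use.
        for _ in range(self.count):
            xid = self.base + self.next_offset
            self.next_offset = (self.next_offset + 1) % self.count
            if xid not in self.in_use:
                self.in_use.add(xid)
                return xid
        raise RuntimeError("XID range exhausted")

    def release(self, xid):
        self.in_use.discard(xid)


alloc = MonotonicXidAllocator(base=0x400000, count=4)
first = alloc.allocate()
alloc.release(first)
# The freed XID is not handed straight back; the counter moves on.
second = alloc.allocate()
```

A released XID only comes back after the counter has walked the rest of the range, so the delay before reuse grows with the size of the range.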
Improving the algorithm that provides the XID range so that it returns a larger range where possible would make this less likely (though it could still happen, just less often, in reasonable circumstances).
Keeping a buffer of a certain number of recently released XIDs is another possibility.
Or perhaps calculating the range in advance, so that the range used is a range of XIDs that were available (but not advertised) at the time of a previous range request.
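As a rough illustration of the "buffer of recently released XIDs" option, the sketch below holds released XIDs in a FIFO and reuses the oldest one only once more than `hold` of them are waiting. It is written in Python for brevity. `QuarantineAllocator` and `hold` are hypothetical names, and the supply of fresh XIDs is treated as unbounded, which a real allocator could not assume.

```python
from collections import deque


class QuarantineAllocator:
    """Delay reuse of an XID until `hold` newer releases have queued up."""

    def __init__(self, base, hold=3):
        self.next_fresh = base
        self.hold = hold
        self.quarantine = deque()  # oldest released XID on the left

    def allocate(self):
        # Reuse only when the buffer holds more than `hold` released XIDs.
        if len(self.quarantine) > self.hold:
            return self.quarantine.popleft()
        xid = self.next_fresh
        self.next_fresh += 1
        return xid

    def release(self, xid):
        self.quarantine.append(xid)


alloc = QuarantineAllocator(base=0x400000, hold=2)
a = alloc.allocate()
alloc.release(a)
# Only one XID is quarantined, so a fresh one is handed out instead.
b = alloc.allocate()
```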
Reducing the frequency of the problem would provide relief. In my (possibly naive) opinion it is the wrong approach: the design flaw needs to be fixed. Perhaps that requires an API change.
This seems to be biting me too, on the order of once every 15 minutes (closing a firefox tab has by my estimate a 10% chance of crashing the firefox process). Meanwhile, .xsession-errors is flooded with messages from GDK warning of XID collisions.
I run most of the Xorg stack from git and interestingly enough, this behavior started a few weeks ago. I haven't had a chance to try bisecting yet, but as soon as I get a chance I'll drop a note.
Created attachment 29852 [details]
Firefox backtrace with RenderBadPicture
It seems that Google Maps serves as an excellent reproduction case for the Firefox crash. Opening Google Maps in a tab and closing it will almost always result in a RenderBadPicture within 3 attempts. Attached is a backtrace from doing just that. Is it possible that this backtrace is caused by aggressive XID reuse?
Comment on attachment 29852 [details]
Firefox backtrace with RenderBadPicture
(In reply to comment #6)
> Is it possible that this backtrace is caused by aggressive XID reuse?
I wouldn't have expected RenderBadPicture from this bug. If you can get a stack when running Firefox with --sync, then it would be best to file a bug at https://bugzilla.mozilla.org/ under Core -> Widget: Gtk
The RenderBadPicture may be caused by running cairo master. If you are, try downgrading to 1.8.8.
(In reply to comment #8)
> The RenderBadPicture may be caused by running cairo master. If you are, try
> downgrading to 1.8.8.
Yep, indeed I am running cairo from master. I just reverted and the usual reproduction cases seem to be stable. This is evidently a known issue? Has a bug been opened for it? Can I do anything to help? Thanks a ton for your comment. I've been passively scratching my head over this for weeks now.
I don't know if a bug has been filed, but I do know that it has been talked about on the #cairo IRC channel, and that at least Chris Wilson is aware of it.
I'm sure they'd appreciate a bisection, although that's a bit painful to do because the bug isn't 100% reproducible.
(In reply to comment #10)
> I don't know if a bug has been filed, but I do know that it has been talked
> about on the #cairo IRC channel, and that at least Chris Wilson is aware of it.
Yeah, Chris and I talked briefly on #intel-gfx.
> I'm sure they'd appreciate a bisecting, although that's a bit painful to do
> because the bug isn't 100% reproducible.
I actually tried but it looks like the bug predates 1.8.8. Arg!
Note that if you install 1.8.8 on top of a 1.9 installation, you'll need to delete the existing libcairo.so, or it won't take effect.
(In reply to comment #12)
> Note that if you install 1.8.8 on top of an 1.9 installation, you'll need to
> delete the existing libcairo.so, or it won't take effect.
Yep, restarted my Xorg session in between tests which I thought should be sufficient. Moreover, I'm fairly certain the newly installed libraries did take effect after the restart as a scaling bug seen in firefox in 1.8.8 reared its head again. So anyways, I'm fairly confident that I did in fact establish that the bug predates 1.8.8, although it strikes me as odd that it's not seen by more people.
My analysis into this bug indicates that the RenderBadPicture results from a delayed cairo_surface_destroy() after firefox has called XDestroyWindow() on the *parent* Window. In this situation firefox should be calling cairo_surface_finish(), or cairo_surface_destroy() and disposing of the cairo_surface_t, on the destroyed hierarchy.
So the RenderBadPicture is a separate bug (and not ours! ;-) from the XID reuse.
Well, I haven't looked into this bug, but for me, it is definitely the case that it happens with cairo master and not with 1.8.8.
Created attachment 30180 [details]
xtrace of a typical crash
Note that cairo calls RenderFreePicture (4ebda) immediately upon the cairo_surface_finish() [which presumably is actually triggered by the final cairo_surface_destroy() and is not being manually called], but the drawable was destroyed much earlier (the DestroyNotify arrives at 47608). Note also that the drawable is never explicitly destroyed but is reaped along with its parent (475f7).
The full trace is available at http://people.freedesktop.org/~ickle/ff.crash.log
I saw a way to reproduce this bug in Firefox at:
I can confirm it gets reliably triggered with cairo 1.9.4 but not with cairo 1.8.8
Created attachment 31379 [details]
firefox crash and gdb of corpse
I still get crashes from Firefox every few days.
Before each crash, I see one or more messages like this:
(firefox:5290): Gdk-WARNING **: XID collision, trouble ahead
The actual crash is usually a SEGV. I think that it is a null pointer dereference but I cannot be sure because GDB is unreliable with optimized code. (I have an example where gdb prints 0 for a pointer variable but when I look at the assembly code I see that that variable is not represented at that point in the code.)
I don't think my problem has anything to do with cairo because I don't find RenderBadPicture in any of the tracebacks. Am I being naive? Should I look for something else? I'm using an up-to-date Fedora 11 on x86-64; cairo-1.8.8-1.fc11.x86_64; no flash plugin.
I think that the Cairo problems are a different bug and should have a different bugzilla entry.
The original posting in this bugzilla entry describes a bug that I still think is real. I imagine that this is the bug that is afflicting me.
I'm attaching a very long typescript of a firefox session that failed and a gdb of the resulting core file. Perhaps someone could tell from this if what I've said in this comment is wrong.
You may want to view the following video here:
I created this video to clearly demonstrate at least one trigger for the XID
Collision message. I believe there are at least two triggers and that both
triggers are related to Adobe Flash 10.
You can see from the video that you should have reproducible real-life test
cases for this problem.
I run a Gentoo installation.
For those familiar with Gentoo, at the end of the video, I run:
emerge -epv mozilla-firefox | less
I have saved the output of these to text files if anyone is interested. Just
The reason is that the emerge -epv mozilla-firefox command will display every
package and dependency required for mozilla-firefox. For the record, prior to
creating the video, I actually did re-compile every package in this list
(emerge -e mozilla-firefox) in order to ensure a clean run.
In the video, the left part of the screen is a konsole terminal window. The
right part of the screen is firefox. I start firefox with the command "firefox
-sync" in the terminal window.
I have FF set up to start with a number of tabs. As I change focus from tab to
tab, watch the terminal window. There are two tabs where changing focus causes
XID Collision messages to appear. It is particularly obvious that the error
messages are generated during flash activity. Note especially the generation of
messages as the flash window controls autohide and then re-appear. It's not
clear to me in the second tab (The Daily Show) what kind of flash control is
causing the messages. However, that site never seems to stop loading flash
objects. Or rather, my patience runs out before the flash downloads can
My reading of other people's problems suggests that x86 (i386) based systems
don't have this problem but please regard this as an unconfirmed data point.
In this thread in the Gentoo forums, I am 'dufeu':
The video is best viewed in HD on a screen 1384x768 or larger (full screen mode).
Thank you all for your time and patience!
BTW - I did understand the discussion of asynchronous ID assignment and release. However, while the problem seems to be properly identified, I'm not sure that the exact trigger for invoking the problem has been properly identified. I hope the video will be helpful. Unless I (as an end user) have completely misunderstood what I see, it seems clear that the actual trigger is probably Flash 10.
Disclaimer: I am only an end user. I am not a programmer.
This seems more like a server issue. I think it could easily be possible for
the server to guarantee that XIDs are not reused within a certain time period
since it issued a DestroyNotify. That won't guarantee that clients are happy,
but it can certainly help. We just need to store a timestamp of the time the
XID was destroyed and if the head of the recycle queue is too recent, we
allocate a new XID rather than recycling.
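A minimal sketch of that timestamp check, written in Python instead of the server's C for brevity. `TimedRecycler`, `min_age`, and the explicit `now` arguments are hypothetical names used to keep the example deterministic. This is not code from the X server, and the supply of fresh XIDs is again assumed to be unbounded.

```python
from collections import deque


class TimedRecycler:
    """Recycle an XID only after it has been destroyed for `min_age` seconds."""

    def __init__(self, base, min_age):
        self.next_fresh = base
        self.min_age = min_age
        self.recycle = deque()  # (xid, destroyed_at), oldest first

    def release(self, xid, now):
        # Record when the DestroyNotify for this XID was issued.
        self.recycle.append((xid, now))

    def allocate(self, now):
        # Recycle the head of the queue only if it is old enough.
        if self.recycle and now - self.recycle[0][1] >= self.min_age:
            return self.recycle.popleft()[0]
        xid = self.next_fresh
        self.next_fresh += 1
        return xid


r = TimedRecycler(base=0x400000, min_age=5.0)
w = r.allocate(now=0.0)
r.release(w, now=1.0)
early = r.allocate(now=2.0)  # head destroyed 1 s ago: too recent
late = r.allocate(now=7.0)   # head destroyed 6 s ago: old enough
```

Because releases are appended in time order, only the head of the queue needs checking: if the oldest entry is too recent, every other entry is too.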
Tracking for 1.12, but I'd consider this for 1.11.x if the change is simple.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/xserver/issues/380.