Bug 20692

Summary: Xorg spinning in intel-driver
Product: xorg Reporter: Clemens Eisserer <linuxhippy>
Component: Driver/intel Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium Keywords: NEEDINFO
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                                         Flags
my Xorg.0.log                                       none
stacktraces                                         none
kernel stacktraces                                  none
more xorg-stacktraces                               none
sysprof profile of spinning xorg                    none
kernel stacktraces of 2.6.31rc5                     none
userspace stack traces of 2.8                       none
sysprof log with 2.9 + 2.6.31.1                     none
pstack-output from spinning Xorg, 2.9 + 2.6.31.1    none
sysprof profile (user+kernelspace)                  none
Kill -EIO from tcflush                              none

Description Clemens Eisserer 2009-03-16 10:24:30 UTC
Created attachment 23922 [details]
my Xorg.0.log

I recently encountered a case where Xorg spun at 100% CPU, but the system was still usable. This was 2.6.99 + Xorg-1.6.0_2 (from rawhide).

I collected a few stacktraces, no idea how useful they are.
Comment 1 Clemens Eisserer 2009-03-16 10:25:30 UTC
Created attachment 23923 [details]
stacktraces
Comment 2 Jesse Barnes 2009-03-30 17:30:34 UTC
Looks like the driver is just waiting on the GPU in those traces, can you reliably cause any problems here?  Or do you just see occasional spikes (which could be normal)?
Comment 3 Clemens Eisserer 2009-03-31 09:11:35 UTC
Just experienced this again, this time with XAA (was running UXA last time).

When I get those stack traces, Xorg consumes 100% CPU on one core, although no graphics work is going on. Looking at "top", Xorg in this case is the only process consuming a lot of cycles, so I can't imagine this is just a misbehaving X client feeding the server with tons of input.
Also the stack-traces indicate that there's nothing else going on in the server.

Just restarting X doesn't help here, I have to reboot to make it work again.
Comment 4 Clemens Eisserer 2009-03-31 09:15:05 UTC
sorry, just saw that both were on UXA - the request for XAA seems to have been ignored.
Comment 5 Jesse Barnes 2009-03-31 09:20:22 UTC
Were the stack traces the same?
Comment 6 Clemens Eisserer 2009-03-31 09:54:07 UTC
yes, they were the same.
Unfortunately I did not find a way to trigger the problem - it always happened to me by accident.
Comment 7 Jesse Barnes 2009-03-31 10:00:00 UTC
Another thing to capture would be the kernel stack trace from the process.  You should be able to get it by using sysrq-t and capturing dmesg (or echo t > /proc/sysrq-trigger).
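The /proc/sysrq-trigger route mentioned above can also be scripted. The following is only a minimal sketch: the function name request_task_dump is made up for illustration, it must run as root, and it assumes the kernel was built with magic SysRq support. The resulting traces end up in the kernel log (dmesg), not in the function's return value.

```python
from pathlib import Path


def request_task_dump(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log the stack of every task (same as sysrq-t).

    `trigger` is a parameter only so the function can be exercised
    against an ordinary file; on a real system leave the default.
    """
    # Equivalent to: echo t > /proc/sysrq-trigger
    Path(trigger).write_text("t")
```

After calling request_task_dump(), the output of dmesg can be saved and attached to the report.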
Comment 8 Clemens Eisserer 2009-03-31 10:15:53 UTC
ok, when it happens again I'll try to get some kernel stack traces.
Comment 9 Clemens Eisserer 2009-03-31 15:19:23 UTC
Created attachment 24414 [details]
kernel stacktraces
Comment 10 Clemens Eisserer 2009-03-31 15:19:45 UTC
Created attachment 24415 [details]
more xorg-stacktraces
Comment 11 Clemens Eisserer 2009-03-31 15:20:56 UTC
VT switching also doesn't seem to work in this situation, thanks for the sysrq-trigger tip :)
Comment 12 Clemens Eisserer 2009-03-31 15:28:26 UTC
restarting the X-server doesn't seem to help - both issues persist.
VT switching is still broken, and the x-server process keeps spinning.
Comment 13 Jesse Barnes 2009-05-04 14:57:10 UTC
I still don't have any ideas on this one.  Would it be possible for you to get sysprof output during the CPU spike?
Comment 14 Clemens Eisserer 2009-05-04 15:54:22 UTC
Thanks for looking at that issue.

I can trigger the spinning by switching between runlevel 3 and runlevel 5 a few times. I don't even need to log in with kdm.

Another thing I noted: When I kill X (Ctrl+Alt+Backspace) I can't type on the VT console anymore. Each keypress results in multiple garbage characters.

top says X consumes about 10% of one CPU in userspace, and 90% in kernel-space.
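For anyone who wants to confirm the user/kernel split without top, the same counters are exposed in /proc/<pid>/stat (utime and stime, fields 14 and 15 in proc(5)). This is a rough sketch with a hypothetical helper name, cpu_split. Note that it reports the split accumulated since the process started, so to see the current behaviour you would sample twice and subtract.

```python
def cpu_split(stat_line):
    """Return (user_fraction, kernel_fraction) from one /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split after the last ')' instead of on whitespace alone.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so utime/stime (fields 14/15)
    # sit at indexes 11 and 12.
    utime, stime = int(fields[11]), int(fields[12])
    total = utime + stime
    if total == 0:
        return 0.0, 0.0
    return utime / total, stime / total
```

Usage would be something like cpu_split(open("/proc/1234/stat").read()), where 1234 stands in for the Xorg pid.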

OProfile gave me:
samples  %        image name               app name                 symbol name
34       21.7949  vmlinux                  vmlinux                  read_hpet
9         5.7692  vmlinux                  vmlinux                  acpi_idle_do_entry
6         3.8462  vmlinux                  vmlinux                  acpi_idle_enter_bm
5         3.2051  vmlinux                  vmlinux                  system_call
4         2.5641  vmlinux                  vmlinux                  __ticket_spin_unlock
4         2.5641  vmlinux                  vmlinux                  acpi_os_read_port
4         2.5641  vmlinux                  vmlinux                  do_select
4         2.5641  vmlinux                  vmlinux                  native_flush_tlb_single
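As a quick sanity check on the table above, the rows can be summed to see how much of the profile they cover. This is just an illustrative sketch using the numbers pasted in this comment; the names OPROFILE_ROWS and parse_rows are made up here. The implied total assumes the percentage column is relative to all samples in the profile.

```python
OPROFILE_ROWS = """\
34       21.7949  vmlinux  vmlinux  read_hpet
9         5.7692  vmlinux  vmlinux  acpi_idle_do_entry
6         3.8462  vmlinux  vmlinux  acpi_idle_enter_bm
5         3.2051  vmlinux  vmlinux  system_call
4         2.5641  vmlinux  vmlinux  __ticket_spin_unlock
4         2.5641  vmlinux  vmlinux  acpi_os_read_port
4         2.5641  vmlinux  vmlinux  do_select
4         2.5641  vmlinux  vmlinux  native_flush_tlb_single
"""


def parse_rows(text):
    """Turn 'samples percent image app symbol' lines into tuples."""
    rows = []
    for line in text.splitlines():
        samples, percent, _image, _app, symbol = line.split()
        rows.append((symbol, int(samples), float(percent)))
    return rows


rows = parse_rows(OPROFILE_ROWS)
top_symbol, top_samples, top_percent = max(rows, key=lambda r: r[1])
listed_samples = sum(r[1] for r in rows)
# Back out the total sample count from the top row's percentage.
implied_total = round(top_samples * 100 / top_percent)

print(top_symbol, listed_samples, implied_total)
```

The profile is small, and the eight listed symbols account for fewer than half of the implied samples. read_hpet, which as far as I know is the kernel's HPET clocksource read, is the only entry that stands out.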
Comment 15 Clemens Eisserer 2009-05-04 15:56:15 UTC
Created attachment 25442 [details]
sysprof profile of spinning xorg
Comment 16 Jesse Barnes 2009-05-04 16:28:07 UTC
> --- Comment #15 from Clemens Eisserer <linuxhippy@gmail.com>
> 2009-05-04 15:56:15 PST --- Created an attachment (id=25442)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=25442)
> sysprof profile of spinning xorg

Any chance you could install debug symbols for your kernel & X server
so we can see more details here?  At first blush it appears something
is stuck calling gettimeofday in a tight loop.
Comment 17 Clemens Eisserer 2009-05-05 00:38:19 UTC
I had/have debug symbols installed for both, Xorg and the kernel - the symbol names do show up in the profile.
Comment 18 Clemens Eisserer 2009-05-05 01:05:03 UTC
do you know how I can tell sysprof where to look for vmlinux + kernel-module debug info? 
I only found that information for oprofile, but I don't understand its call graphs.

Thanks, Clemens
Comment 19 Jesse Barnes 2009-05-05 09:37:26 UTC
No, sorry.  It usually "just works" for me...
Comment 20 Jesse Barnes 2009-07-08 13:56:01 UTC
Have you seen this recently Clemens?  Or do you think it's fixed?
Comment 21 Clemens Eisserer 2009-08-18 08:19:37 UTC
Haven't seen this for a very long time, but today experienced it with 2.6.31rc5 + intel-2.8 (that stuff deployed with fedora rawhide 11.91, 20080818).

Attached are user/kernel-space stacktraces.
Comment 22 Clemens Eisserer 2009-08-18 08:21:18 UTC
Created attachment 28747 [details]
kernel stacktraces of 2.6.31rc5
Comment 23 Clemens Eisserer 2009-08-18 08:21:49 UTC
Created attachment 28748 [details]
userspace stack traces of 2.8
Comment 24 Chris Wilson 2009-09-02 08:59:50 UTC
This sounds like (but I didn't spot any corroborating evidence in your profile due to lack of kernel symbols) the fence thrashing bug fixed by:

commit a09ba7faf75fa4b21980d81de8e5f3d5c0785ccf
Author: Eric Anholt <eric@anholt.net>
Date:   Sat Aug 29 12:49:51 2009 -0700

    drm/i915: Fix CPU-spinning hangs related to fence usage by using an LRU.
    
    The lack of a proper LRU was partially worked around by taking the fence
    from the object containing the oldest seqno.  But if there are multiple
    objects inactive, then they don't have seqnos and the first fence reg
    among them would be chosen.  If you were trying to copy data between two
    mappings, this could result in each page fault stealing the fence from
    the other argument, and your application hanging.
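The thrashing described in that commit message can be illustrated with a toy model. This is not the kernel's code: the register count, the object names and the function count_fence_steals are all assumptions made for the sketch. Two objects, A and B, are touched alternately (as in a copy between two mappings) while every fence register is held by some idle object. The "first" policy always evicts the first register, and the "lru" policy evicts the least recently used one.

```python
def count_fence_steals(policy, num_regs=4, accesses=("A", "B") * 5):
    """Count how often an access has to steal a fence register."""
    # Every register starts out owned by some idle object.
    owners = [f"idle{i}" for i in range(num_regs)]
    lru = list(range(num_regs))  # least recently used register first
    steals = 0
    for obj in accesses:
        if obj in owners:
            reg = owners.index(obj)  # already fenced, no fault needed
        else:
            # "first": always take register 0; "lru": take the oldest one.
            reg = 0 if policy == "first" else lru[0]
            owners[reg] = obj
            steals += 1
        # Mark the register as most recently used.
        lru.remove(reg)
        lru.append(reg)
    return steals


print(count_fence_steals("first"), count_fence_steals("lru"))
```

In this model the "first" policy steals a register on every single access, because A and B keep evicting each other from register 0, while the LRU policy settles after each object has faulted once.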
Comment 25 Jesse Barnes 2009-10-05 10:46:31 UTC
Closing due to lack of activity, probably fixed though.
Comment 26 Clemens Eisserer 2009-10-05 11:14:13 UTC
haven't seen this for a long time, guess it's fixed :)
Comment 27 Jesse Barnes 2009-10-05 11:16:52 UTC
Cool, see we fix bugs!  We just don't always correlate the fixes back to bug reports. :p
Comment 28 Clemens Eisserer 2009-10-14 05:53:02 UTC
I've just seen it again - on kernel-2.6.31.1 + intel-2.9.0 when I switched from runlevel 5->3->5.

As in the reports before, Xorg spins using one core but everything else seems to work. I noticed it because my laptop lost battery-charge quite fast and the fan started to blow.
Comment 29 Jesse Barnes 2009-10-14 08:56:42 UTC
Can you capture a sysprof of the new spinning?  Is it the same as before?
Comment 30 Clemens Eisserer 2009-10-14 12:39:49 UTC
Created attachment 30412 [details]
sysprof log with 2.9 + 2.6.31.1
Comment 31 Clemens Eisserer 2009-10-14 12:40:45 UTC
Created attachment 30413 [details]
pstack-output from spinning Xorg, 2.9 + 2.6.31.1
Comment 32 Jesse Barnes 2009-11-20 13:09:55 UTC
Sorry missed the update.  Can you get sysprof output with kernel symbols?  Usually your distro has a kernel debug package that will provide them.  Most of the time is definitely in the kernel in your sysprof output, but we can't tell where.
Comment 33 Jesse Barnes 2010-02-11 09:50:41 UTC
Timeout.  Hope this isn't still occurring.  Maybe Chris can take a look if it is.
Comment 34 Clemens Eisserer 2010-02-26 15:58:00 UTC
Please reopen, just happened again with kernel-2.6.32.8-58.fc12.i686 and intel-2.9.1

All I had to do was:
- Boot into runlevel 5
- Log into KDE
- Logout
- "init 3" in VT
- "init 5" in VT

Xorg was spinning again.

Thanks to the kernel's new profiling framework I am now able to provide a user+kernel profile, as attached.
Comment 35 Clemens Eisserer 2010-02-26 15:58:43 UTC
Created attachment 33600 [details]
sysprof profile (user+kernelspace)
Comment 36 Jesse Barnes 2010-02-26 23:52:34 UTC
Woo driver funkiness.
Comment 37 Jesse Barnes 2010-02-26 23:53:50 UTC
Any ideas Chris?
Comment 38 Chris Wilson 2010-02-27 00:27:18 UTC
It looks like the X server has been sent into a spin around select(). Possibly an invalid fd in one of its fdsets? strace would confirm that it is continuously calling select, and test the hypothesis that select is returning an error. I will try to reproduce this locally later, though it might be machine dependent - at this point there is nothing to indicate the cause of the spin.
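To illustrate the hypothesis (not the X server's actual dispatch loop), here is a small sketch showing that select() on a descriptor that has already been closed fails immediately with EBADF instead of waiting for the timeout. A loop that ignores that error and simply calls select() again would therefore never block and would spin at full speed. The function name select_error_on_closed_fd is made up for the example.

```python
import errno
import os
import select


def select_error_on_closed_fd():
    """Return the errno that select() reports for a stale descriptor."""
    r, w = os.pipe()
    os.close(w)
    os.close(r)  # r is now an invalid fd number
    try:
        # With a valid fd this would block for up to 5 seconds.
        select.select([r], [], [], 5.0)
    except OSError as exc:
        return exc.errno
    return None


print(select_error_on_closed_fd() == errno.EBADF)
```

In an strace of the spinning server this would show up as a rapid stream of select calls each returning -1 with the same error.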
Comment 39 Clemens Eisserer 2010-06-03 10:35:05 UTC
still spins with kernel-2.6.33.5 + intel-2.11.
I'll try to strace the server some time soon.
Comment 40 Chris Wilson 2010-07-18 05:44:30 UTC
Any luck Clemens? This issue is just so bizarre that I am curious to know what the cause is.
Comment 41 Chris Wilson 2010-08-20 05:13:06 UTC
This is a likely culprit in conjunction with a spin in select:

commit c882f6a22a862c1664c375e05e5e6fc4bdb04edb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 18 10:21:22 2010 +0100

    Move registration of vsync fd from pre-init to screen-init
    
    Marty Jack reported an issue he found where the page-flipping handler
    was being lost on server reset. This results in the swap completion
    notification being lost, with the sporadic hang of full screen
    applications like Compiz, flash and even glxgears!
    
    Fixes:
    
      Bug 29584 - Server in compute loop
      https://bugs.freedesktop.org/show_bug.cgi?id=29584
    
    There are also several possibly related bugs with similar symptoms, i.e.
    OpenGL applications hanging on missed swap notifications.
    
    Reported-by: Marty Jack <martyj19@comcast.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Keith Packard <keithp@keithp.com>
Comment 42 Chris Wilson 2010-09-05 16:28:48 UTC
Created attachment 38462 [details] [review]
Kill -EIO from tcflush

And this patch from Adam Jackson seems more relevant.
Comment 43 Chris Wilson 2010-09-19 13:27:30 UTC
The potential EIO spin on vt-switch fits the bug description and profiles, so presuming fixed.
