Bug 18075

Summary:

[UXA] XPutImage performance regression

Product:

xorg

Reporter:

Clemens Eisserer <linuxhippy>

Component:

Driver/intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

high

CC:

xhejtman

Version:

git

Hardware:

Other

OS:

All

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
Xorg log from the self-compiled server	none
the ugly benchmark ;)	none
profile while benchmark was executed	none
JXRenderMark source	none
JXRenderMark source	none
Uses a temporary pixmap as composition mask	none
Experimental put-image acceleration.	none

Description Clemens Eisserer 2008-10-15 09:37:07 UTC

Created attachment 19666 [details]
Xorg log from the self-compiled server

With the intel-git pulled after commit 6707371176147340fabc9ab6f1e3d6d5ac980662 as well as the version delivered with Fedora10-Rawhide (some 2.5 prerelease) I see a large performance regression for a real-world workload, down to 50% of original throughput.

I wrote a test-case which tries to mimic the behaviour of the real-world app to some degree, and those were my results:
intel-2.1.1/xserver-1.3/2.6.25:      80ms (Fedora 8 xorg distribution)
intel-2.5pre/xserver-1.5.2/2.6.27:  150ms (Fedora 10 rawhide distribution)
intel-git/xserver-git/2.6.27:       200ms (self compiled) 

Attached are the benchmark, a sysprof profile of the self-compiled Xorg, and the Xorg log.
The benchmark itself does not handle ms-overflow, so don't be surprised if you sometimes see negative results.
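
Negative timings like the ones mentioned above are the usual symptom of a millisecond counter wrapping between the start and end readings. A minimal sketch of a wrap-safe difference follows. It assumes the counter wraps at a known period, which the attached benchmark does not confirm; `WRAP_MS` and `elapsed_ms` are hypothetical names chosen for illustration.

```python
WRAP_MS = 1000  # hypothetical wrap period of the millisecond counter


def elapsed_ms(start, end, wrap=WRAP_MS):
    """Elapsed time between two readings of a counter that wraps at `wrap`.

    Only valid while the real interval is shorter than one wrap period.
    """
    return (end - start) % wrap


naive = 120 - 950             # counter wrapped, so the plain difference is negative
fixed = elapsed_ms(950, 120)  # modular difference stays non-negative
print(naive, fixed)
```

Taking the difference modulo the wrap period only helps for intervals shorter than that period; longer runs need a timer that does not wrap.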

My System:
Fedora 10 rawhide
Intel-945GM
Core2Duo-T7200, 3GB DDR2-533

Comment 1 Clemens Eisserer 2008-10-15 09:39:38 UTC

Created attachment 19667 [details]
the ugly benchmark ;)

Comment 2 Clemens Eisserer 2008-10-15 09:40:55 UTC

Created attachment 19668 [details]
profile while benchmark was executed

Comment 3 Lukas Hejtmanek 2008-10-15 09:48:52 UTC

could you roll back the intel driver before this commit and give it a try?
c4565a9811487402d899d0933cc63e27ffe1ff08
(it does not matter, if you include this one or not).

Comment 4 Lukas Hejtmanek 2008-10-16 07:06:00 UTC

(In reply to comment #3)
> could you roll back the intel driver before this commit and give it a try?
> c4565a9811487402d899d0933cc63e27ffe1ff08
> (it does not matter, if you include this one or not).
> 

I tested the commit above and it is OK; the test shows 90ms, which seems to
be sane.

I strongly suspect that commit
5c9a62a29f62a9ecce37fae98cb01f8217eaba15
is causing the slowdown.

Comment 5 Clemens Eisserer 2008-10-16 07:22:41 UTC

Lukas: As far as I know the commit you mention is i965 specific, however the slowdown also happens on a 945GM (i915-class).

Comment 6 Lukas Hejtmanek 2008-10-16 11:27:43 UTC

(In reply to comment #5)
> Lukas: As far as I know the commit you mention is i965 specific, however the
> slowdown also happens on a 945GM (i915-class).
> 

Could you play bisect then?

Comment 7 Clemens Eisserer 2008-10-28 16:10:08 UTC

Sorry for not bisecting, I am super-busy right now at university.

However, Carl mentioned he is working on performance improvements, so I'll wait until 2.6.28 is released (or at least a preview live CD) and I can test performance on GEM.
In my opinion non-GEM mode will soon be obsolete, so I really hope that in GEM mode the driver will even beat its old results once all the new stuff stabilizes.
I now also have an i830-class machine to test on ... so I won't miss any generation ;)

Comment 8 Eric Anholt 2009-03-20 11:53:53 UTC

27% of CPU in that profile is spent in the EXA offscreen management failure.  This should be gone with UXA, and doesn't appear to be there on my system.

However, instead the shmputimage is hitting a really slow path with UXA.  But I don't get why you're creating SHM images (a dubious optimization even when done right) when you're not even pushing the data from your client to the server.  The test app is gratuitously preventing hardware acceleration here.

Comment 9 Clemens Eisserer 2009-03-22 15:37:13 UTC

Hi Eric,

Thanks for looking into this report.

The original benchmarking code has been obsoleted by JXRenderMark, which is now part of the Phoronix Test Suite (attached).
It focuses on testing how well several features that the new XRender-Java2D backend relies on perform on various hardware/driver combinations.
Although it's maybe quite broken too, I guess it's still better than what's here ;)

Some paths are not settled yet, but the following paths will be used. The "put composition" part in particular will be quite important for sizes of about 20x20px - 32x32px.
All the antialiased rendering will go through this path, with Java generating <32x32px A8 tiles in software, uploading them and using them as a mask.

For the 15x15 put composition case EXA achieves 37% of XAA, whereas UXA achieves 4.3% of XAA.


XAA (no offscreen pixmaps) / xorg-xserver-1.3.0 / intel-2.2.1:
  294664.411765 Ops/s; rects (!); 15x15
  98847.048301 Ops/s; rects (!); 75x75
  21097.500000 Ops/s; rects (!); 250x250
  123611.913357 Ops/s; rects composition (!); 15x15
  16042.500000 Ops/s; rects composition (!); 75x75
  2048.285199 Ops/s; rects composition (!); 250x250
  168811.567164 Ops/s; put composition (!); 15x15
  15068.807339 Ops/s; put composition (!); 75x75
  1699.748744 Ops/s; put composition (!); 250x250

EXA / xorg-xserver-1.3.0 / intel-2.2.1:
  102111.180905 Ops/s; rects (!); 15x15
  27104.304636 Ops/s; rects (!); 75x75
  8182.500000 Ops/s; rects (!); 250x250
  54326.815642 Ops/s; rects composition (!); 15x15
  19972.989950 Ops/s; rects composition (!); 75x75
  3225.689882 Ops/s; rects composition (!); 250x250
  62233.493810 Ops/s; put composition (!); 15x15
  16747.201493 Ops/s; put composition (!); 75x75
  2618.421053 Ops/s; put composition (!); 250x250

UXA / xorg-xserver-1.6.0 / intel-2.6.902
  55325.000000 Ops/s; rects (!); 15x15              
  15646.039604 Ops/s; rects (!); 75x75              
  5363.874346 Ops/s; rects (!); 250x250
  44448.148148 Ops/s; rects composition (!); 15x15
  14300.625000 Ops/s; rects composition (!); 75x75
  3650.884495 Ops/s; rects composition (!); 250x250
  7267.766497 Ops/s; put composition (!); 15x15
  5319.587629 Ops/s; put composition (!); 75x75
  1430.136986 Ops/s; put composition (!); 250x250
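
The percentages quoted above for the 15x15 case correspond to the "put composition" rows. A short check, using only numbers from the three tables (the helper name `percent_of` is just for illustration):

```python
# "put composition" 15x15 throughput in Ops/s, copied from the tables above
xaa = 168811.567164
exa = 62233.493810
uxa = 7267.766497


def percent_of(value, baseline):
    """Throughput of `value` expressed as a percentage of `baseline`."""
    return 100.0 * value / baseline


print(f"EXA: {percent_of(exa, xaa):.1f}% of XAA")
print(f"UXA: {percent_of(uxa, xaa):.1f}% of XAA")
```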

Comment 10 Clemens Eisserer 2009-03-22 15:39:56 UTC

Created attachment 24134 [details]
JXRenderMark source

Please give it a try again. Individual benchmarks can be run too, e.g.

> ./JXRenderMark 3 20
> 6924.586777 Ops/s; put composition (!); 20x20

Please see "-help" for more details.

Comment 11 Lukas Hejtmanek 2009-03-22 15:44:47 UTC

(In reply to comment #9)

> UXA / xorg-xserver-1.6.0 / intel-2.6.902
>   55325.000000 Ops/s; rects (!); 15x15              
>   15646.039604 Ops/s; rects (!); 75x75              
>   5363.874346 Ops/s; rects (!); 250x250
>   44448.148148 Ops/s; rects composition (!); 15x15
>   14300.625000 Ops/s; rects composition (!); 75x75
>   3650.884495 Ops/s; rects composition (!); 250x250
>   7267.766497 Ops/s; put composition (!); 15x15
>   5319.587629 Ops/s; put composition (!); 75x75
>   1430.136986 Ops/s; put composition (!); 250x250
> 

I'm working on speeding up compositing. Hopefully, I will get something usable soon :)

Comment 12 Michel Dänzer 2009-03-24 03:42:51 UTC

(In reply to comment #10)
> Created an attachment (id=24134) [details]
> JXRenderMark source

I can't build this because glyphs.h is missing. Where's that supposed to come from?

Comment 13 Clemens Eisserer 2009-03-24 04:18:09 UTC

Created attachment 24186 [details]
JXRenderMark source

Oops, sorry I forgot about that.

Comment 14 Clemens Eisserer 2009-03-27 04:59:38 UTC

By the way, these are the results I get on my NVidia GeForce 6600, with the proprietary driver:

301778.21 Ops/s; rects; 15x15
174022.65 Ops/s; rects; 75x75
38125 Ops/s; rects; 250x250
71292.44 Ops/s; rects composition; 15x15
22127.79 Ops/s; rects composition; 75x75
2282.61 Ops/s; rects composition; 250x250
136777.12 Ops/s; put composition; 15x15
32324.78 Ops/s; put composition; 75x75
3364.83 Ops/s; put composition; 250x250

However, this was on a slower CPU (AMD Sempron, 256 KB L2 cache, 1.8 GHz).
Even for the small sizes it is on par with XAA, and it does very well for larger sizes.

Comment 15 Lukas Hejtmanek 2009-04-02 02:21:27 UTC

I got these numbers on NVidia 8800GT with binary drivers:

339863.984674 Ops/s; rects (!); 15x15
44306.109726 Ops/s; rects (!); 75x75
5805.882353 Ops/s; rects (!); 250x250
95083.125000 Ops/s; rects composition (!); 15x15
78007.056452 Ops/s; rects composition (!); 75x75
43479.375000 Ops/s; rects composition (!); 250x250
95956.967213 Ops/s; put composition (!); 15x15
28594.637224 Ops/s; put composition (!); 75x75
8936.224490 Ops/s; put composition (!); 250x250
...
68766.471838 Ops/s; Transformed Blit Linear (!); 15x15
4372.791519 Ops/s; Transformed Blit Linear (!); 75x75
374.060150 Ops/s; Transformed Blit Linear (!); 250x250
96.153846 Ops/s; Transformed Blit Billinear sharp edges (!); 15x15
3.906250 Ops/s; Transformed Blit Billinear sharp edges (!); 75x75


It seems odd that I got only 4 fps with a 75x75 picture. Is the test sane?

Comment 16 Clemens Eisserer 2009-04-19 07:01:59 UTC

I am not sure about "Transformed Blit Billinear sharp edges"; my only complaint is about the poor performance of the "put composition" test.

We plan to use it a lot, with XPutImage calls of roughly 500 bytes to 1 KB, and even with the latest rawhide updates (xorg-1.6.1 + kernel-2.6.29.1 + intel-2.6.902_03) it's still way slower than EXA or NVidia:

31172.110553 Ops/s; put composition (!); 15x15
7116.952790 Ops/s; put composition (!); 75x75
1588.302752 Ops/s; put composition (!); 250x250

Comment 17 Clemens Eisserer 2009-04-27 13:30:56 UTC

With NoAccel I get:
152815.831987 Ops/s; put composition (!); 20x20
147570.075758 Ops/s; put composition (!); 20x20

So about 5x faster than UXA. Of course that's not fair ;)

Comment 18 Clemens Eisserer 2009-05-04 13:48:07 UTC

Created attachment 25428 [details]
Uses a temporary pixmap as composition mask

Opcode 3 is PutImage reusing the mask pixmap.
Opcode 18 generates a new pixmap each iteration.

To compare both, simply run e.g.: 
./jx 3 24 18 24

Comment 19 Clemens Eisserer 2009-05-05 01:44:13 UTC

On EXA, using a temporary pixmap reduces performance by about 20k ops/s:

[ce@localhost putperf]$ ./jx 3 20 3 20 18 20 18 20
62471.938776 Ops/s; put composition (!); 20x20
40884.087237 Ops/s; put composition - temp mask; 20x20

Comment 20 Clemens Eisserer 2009-05-20 22:49:48 UTC

A new version of the JXRenderMark is available at: http://78.31.67.79:8080/jxrender/RenderMark.html

It features more precise timing as well as some adjustments to the Blit tests.

Comment 21 Hans-Christian Jansen 2009-07-21 04:26:47 UTC

Following the discussion on Intel-gfx I benchmarked my laptop (P4 2.6 GHz, GeForce2 Go (low-cost mobile GPU) + proprietary legacy drivers, 5 years old):

114542.024410 Ops/s; put composition (!); 15x15
17224.712189 Ops/s; put composition (!); 75x75
2849.247486 Ops/s; put composition (!); 250x250

This 5-year-old machine is about 3-4x faster than the Core2Duo machine with the Intel IGP that the submitter benchmarked on.

Comment 22 Clemens Eisserer 2009-08-14 06:55:59 UTC

Note that low XPutImage performance also hurts KDE4's user experience quite a lot.
I've just profiled a few slow-behaving KDE applications, and it boils down to nearly 30% of CPU cycles being spent inside kernel code, triggered mostly by calls to XPutImage (to upload the pre-rendered SVG content).

And no, those *silly* trapezoids would not be a solution to those performance problems.

Comment 23 Chris Wilson 2009-11-29 03:05:51 UTC

Created attachment 31545 [details] [review]
Experimental put-image acceleration.

I'm experimenting with this patch, which aims to minimise the number of CPU stalls whilst waiting for dirty pixmaps. This seems a sane thing to do per se, but most of the CPU time is spent doing busy-work in the kernel -- it's especially painful if you enable mutex/spinlock debugging.

Clemens, I'd be interested to know if this has any effect for your workloads, thanks.

Comment 24 Chris Wilson 2009-11-29 04:57:35 UTC

Ok, I found a reasonable set of benchmarks that benefit from this behaviour - the basic RENDER path in cairo does all computation of masks and sources on the CPU and uploads via [Shm]PutImage for composition on the GPU by the xserver.

The xcb backend is the regular RENDER accelerated path, where xcb-render-0.0 is the fallback + composite path.

old: no-put-image
new: put-image
Speedups
========
xcb-render-0.0-rgba                    poppler-0    70628.98 (70717.42 0.06%) -> 17578.64 (17598.08 0.34%):  4.02x speedup
███
xcb-render-0.0-rgba         gnome-terminal-vim-0    137326.07 (137439.39 0.04%) -> 47035.30 (47059.89 0.18%):  2.92x speedup
█▉
xcb-render-0.0-rgba       firefox-planet-gnome-0    133030.59 (133187.32 0.06%) -> 70809.69 (71246.89 0.34%):  1.88x speedup
▉
xcb-render-0.0-rgba          firefox-talos-gfx-0    137900.76 (139008.97 0.40%) -> 78817.67 (80564.31 1.03%):  1.75x speedup
▊
xcb-render-0.0-rgba                  evolution-0    107534.53 (107633.77 0.05%) -> 76159.85 (77595.22 0.97%):  1.41x speedup
▍
xcb-render-0.0-rgba         swfdec-giant-steps-0    11385.34 (11390.89 0.02%) -> 8737.90 (8795.25 0.59%):  1.30x speedup
▎
xcb-render-0.0-rgba             swfdec-youtube-0    12869.06 (13022.56 0.59%) -> 10599.66 (10658.71 1.14%):  1.21x speedup
▎
  xcb-rgba                  evolution-0    41035.17 (41109.37 0.09%) -> 34674.08 (35267.29 1.38%):  1.18x speedup
▏
xcb-render-0.0-rgba       gnome-system-monitor-0    17733.01 (17741.65 0.02%) -> 15564.74 (15615.23 0.21%):  1.14x speedup
▏
  xcb-rgba       firefox-planet-gnome-0    85445.04 (85867.32 0.25%) -> 75448.13 (76586.55 0.86%):  1.13x speedup
▏
  xcb-rgba                       gvim-0    71743.82 (71754.51 0.01%) -> 65753.64 (65756.36 0.07%):  1.09x speedup
▏
xcb-render-0.0-rgba          firefox-talos-svg-0    111644.58 (112783.48 0.51%) -> 104088.25 (104866.37 0.54%):  1.07x speedup
▏
  xcb-rgba       gnome-system-monitor-0    8024.22 (8059.03 0.22%) -> 7491.09 (7509.25 0.20%):  1.07x speedup
▏

As can be seen, in workloads dominated by rendering to lots of intermediate surfaces this accelerated put_image restores earlier performance lost due to the switch to reusing "gpu-hot" buffers. However, the issue of whether we are doing excess work in the kernel is still open.
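
The speedup column in the listing above can be reproduced from each line as old timing divided by new timing. The sketch below assumes that the leading number of each "old -> new" pair is the timing the reported factor is computed from, which matches the quoted figures; `PAIR` and `speedup` are names made up for this example.

```python
import re

# Matches "<old> (<x> <y>%) -> <new> (<x> <y>%)" and captures <old> and <new>.
PAIR = re.compile(r"([\d.]+) \([\d.]+ [\d.]+%\) -> ([\d.]+) \([\d.]+ [\d.]+%\)")


def speedup(line):
    """Return old/new for one line of the listing (lower timings are better)."""
    old, new = map(float, PAIR.search(line).groups())
    return old / new


poppler = ("xcb-render-0.0-rgba                    poppler-0    "
           "70628.98 (70717.42 0.06%) -> 17578.64 (17598.08 0.34%):  4.02x speedup")
vim = ("xcb-render-0.0-rgba         gnome-terminal-vim-0    "
       "137326.07 (137439.39 0.04%) -> 47035.30 (47059.89 0.18%):  2.92x speedup")

for line in (poppler, vim):
    print(f"{speedup(line):.2f}x")
```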

Comment 25 Chris Wilson 2009-11-29 17:00:44 UTC

commit 19d8c0cf50e98909c533ebfce3a0dd3f72b755c1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Nov 29 21:16:49 2009 +0000

    uxa: PutImage acceleration
    
    Avoid waiting on dirty buffer object by streaming the upload to a fresh,
    non-GPU hot buffer and blitting to the destination.
    
    This should help to redress the regression reported in bug 18075:
    
      [UXA] XPutImage performance regression
      https://bugs.freedesktop.org/show_bug.cgi?id=18075
    
    Using the particular synthetic benchmark in question on a g45:
    
    Before:
       9542.910448 Ops/s; put composition (!); 15x15
       5623.271889 Ops/s; put composition (!); 75x75
       1685.520362 Ops/s; put composition (!); 250x250
    
    After:
      40173.865300 Ops/s; put composition (!); 15x15
      28670.280612 Ops/s; put composition (!); 75x75
       4794.368601 Ops/s; put composition (!); 250x250
    
    which while not stellar performance is at least an improvement. As
    anticipated this has little impact on the non-fallback RENDER paths, for
    instance the current cairo-xlib backend is unaffected by this change.
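
The idea in the commit message (upload into a fresh buffer and blit, instead of waiting for a buffer the GPU is still using) can be illustrated with a toy model. This is only a sketch of the technique, not the driver code: `Buffer`, `put_image_direct`, `put_image_staged` and the `busy` flag are hypothetical stand-ins, and the "blit" is simulated with a plain copy.

```python
class Buffer:
    """Toy stand-in for a GPU buffer object (hypothetical, not the driver's type)."""

    def __init__(self, size, busy=False):
        self.data = bytearray(size)
        self.busy = busy  # True while the simulated GPU still has work queued on it


def put_image_direct(dst, pixels, stats):
    """Write from the CPU straight into dst, stalling if the GPU still uses it."""
    if dst.busy:
        stats["cpu_stalls"] += 1  # a real driver would block here until dst is idle
        dst.busy = False
    dst.data[:len(pixels)] = pixels


def put_image_staged(dst, pixels, stats):
    """Write into a fresh buffer, then queue a (simulated) GPU blit into dst."""
    staging = Buffer(len(pixels))  # fresh, so never busy: no CPU wait needed
    staging.data[:] = pixels
    # Simulated blit: the GPU would order it after its earlier work on dst,
    # so the CPU does not have to wait for dst to become idle.
    dst.data[:len(pixels)] = staging.data
    stats["blits"] += 1
    dst.busy = True  # dst now has the blit pending


pixels = bytes(range(16))
direct_stats = {"cpu_stalls": 0, "blits": 0}
staged_stats = {"cpu_stalls": 0, "blits": 0}

direct_dst = Buffer(16, busy=True)  # destination still in use by the GPU
staged_dst = Buffer(16, busy=True)
put_image_direct(direct_dst, pixels, direct_stats)
put_image_staged(staged_dst, pixels, staged_stats)
print(direct_stats, staged_stats)
```

Both paths end with the same pixels in the destination; the staged path trades one extra copy for not stalling the CPU, which is the trade-off the before/after numbers above reflect.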

Comment 26 Clemens Eisserer 2009-11-30 10:56:04 UTC

Wonderful :)
Thanks a lot for working on this, I'll re-run some real world tests as soon as I have a little bit of spare time.

I guess this will also improve Qt's performance; at least Qt <= 4.5 uses client-side gradients. Gradient-based themes suffered a lot because of this :-/

Comment 27 Clemens Eisserer 2010-02-28 14:24:47 UTC

On my 945GM I now get 50k iterations, and Qt themes using client-side gradients also feel smoother when resizing.

Thanks a lot!

Can be closed I guess :)

Comment 28 Chris Wilson 2010-03-01 01:07:06 UTC

Thanks Clemens, though hopefully this won't be the last improvement to image upload speed.

Comment 29 Clemens Eisserer 2012-01-20 14:29:33 UTC

SNA beats XAA on the same hardware even for small operations, often with a few times the throughput achieved with XAA/EXA/UXA.
Thanks a lot for keeping up work on the performance side.

intel-2.17-git / SNA:
468383.245106 Ops/s; rects (!); 15x15
117185.000155 Ops/s; rects (!); 75x75
34506.887657 Ops/s; rects (!); 250x250
181565.516196 Ops/s; rects composition (!); 15x15
77165.508893 Ops/s; rects composition (!); 75x75
28857.551604 Ops/s; rects composition (!); 250x250
235465.519634 Ops/s; put composition (!); 15x15
36163.067633 Ops/s; put composition (!); 75x75
5751.605615 Ops/s; put composition (!); 250x250

I have to admit I am a bit sad my notebook is now almost EOL ;)
