Summary: | [UXA] XPutImage performance regression | |
---|---|---|---
Product: | xorg | Reporter: | Clemens Eisserer <linuxhippy>
Component: | Driver/intel | Assignee: | Chris Wilson <chris>
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team>
Severity: | normal | |
Priority: | high | CC: | xhejtman
Version: | git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
i915 platform: | | i915 features: |
Attachments: | | |

Created attachment 19667 [details]
the ugly benchmark ;)
Created attachment 19668 [details]
profile while benchmark was executed
Could you roll back the intel driver to before this commit and give it a try?
c4565a9811487402d899d0933cc63e27ffe1ff08
(It does not matter whether you include this one or not.)

(In reply to comment #3)
> could you roll back the intel driver before this commit and give it a try?
> c4565a9811487402d899d0933cc63e27ffe1ff08
> (it does not matter, if you include this one or not).

I tested the commit above and it is OK, the test shows 90ms, which seems to be sane. I highly suspect commit 5c9a62a29f62a9ecce37fae98cb01f8217eaba15 of causing the slowdown.

Lukas: As far as I know the commit you mention is i965 specific, however the slowdown also happens on a 945GM (i915-class).

(In reply to comment #5)
> Lukas: As far as I know the commit you mention is i965 specific, however the
> slowdown also happens on a 945GM (i915-class).

Could you play bisect then?

Sorry for not bisecting, I am super-busy right now at university. However Carl mentioned he is working on performance improvements, so I'll wait until 2.6.28 is released (or at least a preview live CD) and I can test performance on GEM. In my opinion non-GEM mode will soon be obsolete, so I really hope that in GEM mode the driver will even beat its old results once all the new stuff stabilizes. I now also have an i830-class machine to test on ... so I won't miss any generation ;)

27% of CPU in that profile is spent in the EXA offscreen management failure. This should be gone with UXA, and doesn't appear to be there on my system. However, instead the shmputimage is hitting a really slow path with UXA. But I don't get why you're creating SHM images (a dubious optimization even when done right) when you're not even pushing the data from your client to the server. The test app is gratuitously preventing hardware acceleration here.

Hi Eric, thanks for looking into this report. The original benchmarking code has been obsoleted by JXRenderMark, which is now part of the phoronix test suite (attached).
It focuses on testing how several features that the new XRender-Java2D backend relies on perform on various hardware/driver combinations. Although it's maybe quite broken too, I guess it's still better than what's here ;)

Some paths are not settled, however the following paths will be used. The "put composition" part in particular will be quite important for sizes of about 20x20px - 32x32px. All the antialiased rendering will go through this path, with Java generating <32x32px A8 tiles in software, uploading them and using them as a mask. For the 15x15 case EXA achieves 37% of XAA, whereas UXA achieves 4.3% of XAA.

XAA (no offscreen pixmaps) / xorg-xserver-1.3.0 / intel-2.2.1:
294664.411765 Ops/s; rects (!); 15x15
98847.048301 Ops/s; rects (!); 75x75
21097.500000 Ops/s; rects (!); 250x250
123611.913357 Ops/s; rects composition (!); 15x15
16042.500000 Ops/s; rects composition (!); 75x75
2048.285199 Ops/s; rects composition (!); 250x250
168811.567164 Ops/s; put composition (!); 15x15
15068.807339 Ops/s; put composition (!); 75x75
1699.748744 Ops/s; put composition (!); 250x250

EXA / xorg-xserver-1.3.0 / intel-2.2.1:
102111.180905 Ops/s; rects (!); 15x15
27104.304636 Ops/s; rects (!); 75x75
8182.500000 Ops/s; rects (!); 250x250
54326.815642 Ops/s; rects composition (!); 15x15
19972.989950 Ops/s; rects composition (!); 75x75
3225.689882 Ops/s; rects composition (!); 250x250
62233.493810 Ops/s; put composition (!); 15x15
16747.201493 Ops/s; put composition (!); 75x75
2618.421053 Ops/s; put composition (!); 250x250

UXA / xorg-xserver-1.6.0 / intel-2.6.902
55325.000000 Ops/s; rects (!); 15x15
15646.039604 Ops/s; rects (!); 75x75
5363.874346 Ops/s; rects (!); 250x250
44448.148148 Ops/s; rects composition (!); 15x15
14300.625000 Ops/s; rects composition (!); 75x75
3650.884495 Ops/s; rects composition (!); 250x250
7267.766497 Ops/s; put composition (!); 15x15
5319.587629 Ops/s; put composition (!); 75x75
1430.136986 Ops/s; put composition (!); 250x250

Created attachment 24134 [details]
JXRenderMark source

Please give it a try again. Individual benchmarks can be run too, e.g.
> ./JXRenderMark 3 20
> 6924.586777 Ops/s; put composition (!); 20x20
Please see "-help" for more details.

(In reply to comment #9)
> UXA / xorg-xserver-1.6.0 / intel-2.6.902
> 55325.000000 Ops/s; rects (!); 15x15
> 15646.039604 Ops/s; rects (!); 75x75
> 5363.874346 Ops/s; rects (!); 250x250
> 44448.148148 Ops/s; rects composition (!); 15x15
> 14300.625000 Ops/s; rects composition (!); 75x75
> 3650.884495 Ops/s; rects composition (!); 250x250
> 7267.766497 Ops/s; put composition (!); 15x15
> 5319.587629 Ops/s; put composition (!); 75x75
> 1430.136986 Ops/s; put composition (!); 250x250

I'm working on speedup of compositing. Hopefully, I will get something usable soon :)

(In reply to comment #10)
> Created an attachment (id=24134) [details]
> JXRenderMark source

I can't build this because glyphs.h is missing. Where's that supposed to come from?

Created attachment 24186 [details]
JXRenderMark source
Oops, sorry I forgot about that.
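
As a quick sanity check of the percentages quoted for the 15x15 "put composition" case (EXA at 37% of XAA, UXA at 4.3% of XAA), the ratios can be recomputed from the posted rates. This is only a small arithmetic sketch; the dictionary and the helper name `percent_of` are made up for illustration and are not part of JXRenderMark.

```python
# Rates (Ops/s) for "put composition" at 15x15, copied from the result
# sets posted earlier in this thread for each acceleration architecture.
put_15x15 = {
    "XAA": 168811.567164,
    "EXA": 62233.493810,
    "UXA": 7267.766497,
}


def percent_of(baseline: float, value: float) -> float:
    """Return value as a percentage of baseline (hypothetical helper)."""
    return 100.0 * value / baseline


for backend in ("EXA", "UXA"):
    pct = percent_of(put_15x15["XAA"], put_15x15[backend])
    print(f"{backend}: {pct:.1f}% of XAA")
```

The results round to roughly 37% and 4.3%, matching the figures given with the benchmark numbers.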
By the way, these are the results I get on my NVidia GeForce 6600, with the proprietary driver:

301778,21 Ops/s; rects; 15x15
174022,65 Ops/s; rects; 75x75
38125 Ops/s; rects; 250x250
71292,44 Ops/s; rectscomposition; 15x15
22127,79 Ops/s; rectscomposition; 75x75
2282,61 Ops/s; rectscomposition; 250x250
136777,12 Ops/s; putcomposition; 15x15
32324,78 Ops/s; putcomposition; 75x75
3364,83 Ops/s; putcomposition; 250x250

However this was on a slower CPU (AMD Sempron, 256kb L2 cache, 1.8ghz). Even for the small sizes it's on par with XAA, and it does very well for larger sizes.

I got these numbers on NVidia 8800GT with binary drivers:

339863.984674 Ops/s; rects (!); 15x15
44306.109726 Ops/s; rects (!); 75x75
5805.882353 Ops/s; rects (!); 250x250
95083.125000 Ops/s; rects composition (!); 15x15
78007.056452 Ops/s; rects composition (!); 75x75
43479.375000 Ops/s; rects composition (!); 250x250
95956.967213 Ops/s; put composition (!); 15x15
28594.637224 Ops/s; put composition (!); 75x75
8936.224490 Ops/s; put composition (!); 250x250
...
68766.471838 Ops/s; Transformed Blit Linear (!); 15x15
4372.791519 Ops/s; Transformed Blit Linear (!); 75x75
374.060150 Ops/s; Transformed Blit Linear (!); 250x250
96.153846 Ops/s; Transformed Blit Billinear sharp edges (!); 15x15
3.906250 Ops/s; Transformed Blit Billinear sharp edges (!); 75x75

It seems odd that I got only 4 fps with a 75x75 picture. Is the test sane?

I am not sure about "Transformed Blit Billinear sharp edges"; my only complaint is about the poor performance of the "put composition" test. We plan to use it a lot, with XPutImage calls of 500 bytes to 1 kB, and even with the latest rawhide updates (xorg-1.6.1 + kernel-2.6.29.1 + intel-2.6.902_03) it's still way slower than EXA or NVidia:

31172.110553 Ops/s; put composition (!); 15x15
7116.952790 Ops/s; put composition (!); 75x75
1588.302752 Ops/s; put composition (!); 250x250

With NoAccel I get:

152815.831987 Ops/s; put composition (!); 20x20
147570.075758 Ops/s; put composition (!); 20x20

So about 5x faster than UXA. Of course that's not fair ;)

Created attachment 25428 [details]
Uses a temporary pixmap as composition mask
OpCode 3 is PutImage with reusing the mask pixmap
Opcode 18 generates a new pixmap each iteration.
To compare both, simply run e.g.:
./jx 3 24 18 24
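
To compare two runs numerically rather than by eye, the benchmark output can be parsed. The sketch below assumes each result is on its own line in the form `<rate> Ops/s; <test name>; <width>x<height>`, as in the results posted in this thread; the `Result` type and `parse_line` function are hypothetical helpers, not part of the benchmark. It also accepts a comma as decimal separator, since one posted result set uses that format.

```python
import re
from typing import NamedTuple


class Result(NamedTuple):
    ops_per_sec: float
    name: str
    size: str


# <rate> Ops/s; <test name>; <W>x<H>
LINE_RE = re.compile(
    r"^\s*([0-9]+(?:[.,][0-9]+)?)\s*Ops/s;\s*(.+?)\s*;\s*(\d+x\d+)\s*$"
)


def parse_line(line: str) -> Result:
    """Parse one benchmark result line (format assumed from this thread)."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognised result line: {line!r}")
    # Normalise a locale-style decimal comma before converting.
    rate = float(m.group(1).replace(",", "."))
    return Result(rate, m.group(2), m.group(3))


if __name__ == "__main__":
    reuse = parse_line("62471.938776 Ops/s; put composition (!); 20x20")
    temp = parse_line("40884.087237 Ops/s; put composition - temp mask; 20x20")
    print(f"difference: {reuse.ops_per_sec - temp.ops_per_sec:.0f} Ops/s")
```

The two sample lines are the EXA results for opcode 3 and opcode 18 reported in this thread.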
On EXA, using a temporary pixmap reduces performance by about 20k ops/s:

[ce@localhost putperf]$ ./jx 3 20 3 20 18 20 18 20
62471.938776 Ops/s; put composition (!); 20x20
40884.087237 Ops/s; put composition - temp mask; 20x20

A new version of JXRenderMark is available at: http://78.31.67.79:8080/jxrender/RenderMark.html
It features more precise timing as well as some adaptations to the Blit tests.

Following the discussion on Intel-gfx I benchmarked my laptop (P4-2.6ghz, Geforce2Go (low cost mobile gpu) + proprietary legacy drivers, 5 years old):

114542.024410 Ops/s; put composition (!); 15x15
17224.712189 Ops/s; put composition (!); 75x75
2849.247486 Ops/s; put composition (!); 250x250

This 5yo machine is about 3-4x faster than the Core2Duo machine with Intel-IGP the submitter benchmarked on.

Note that low XPutImage performance also hurts KDE4's user experience quite a lot. I've just profiled a few slow-behaving KDE applications, and it boils down to nearly 30% of CPU cycles being spent inside kernel code, triggered mostly by calls to XPutImage (to upload the pre-rendered SVG content). And no, those *silly* trapezoids would not be a solution to those performance problems.

Created attachment 31545 [details] [review]
Experimental put-image acceleration.

I'm experimenting with this patch, which aims to minimise the number of CPU stalls whilst waiting for dirty pixmaps. This seems a sane thing to do per se, but most of the CPU time is spent doing busy-work in the kernel -- it's especially painful if you enable mutex/spinlock debugging. Clemens, I'd be interested to know if this has any effect for your workloads, thanks.

Ok, I found a reasonable set of benchmarks that benefit from this behaviour - the basic RENDER path in cairo does all computation of masks and sources on the CPU and uploads via [Shm]PutImage for composition on the GPU by the xserver. The xcb backend is the regular RENDER accelerated path, whereas xcb-render-0.0 is the fallback + composite path.
old: no-put-image
new: put-image

Speedups
========
xcb-render-0.0-rgba poppler-0 70628.98 (70717.42 0.06%) -> 17578.64 (17598.08 0.34%): 4.02x speedup ███
xcb-render-0.0-rgba gnome-terminal-vim-0 137326.07 (137439.39 0.04%) -> 47035.30 (47059.89 0.18%): 2.92x speedup █▉
xcb-render-0.0-rgba firefox-planet-gnome-0 133030.59 (133187.32 0.06%) -> 70809.69 (71246.89 0.34%): 1.88x speedup ▉
xcb-render-0.0-rgba firefox-talos-gfx-0 137900.76 (139008.97 0.40%) -> 78817.67 (80564.31 1.03%): 1.75x speedup ▊
xcb-render-0.0-rgba evolution-0 107534.53 (107633.77 0.05%) -> 76159.85 (77595.22 0.97%): 1.41x speedup ▍
xcb-render-0.0-rgba swfdec-giant-steps-0 11385.34 (11390.89 0.02%) -> 8737.90 (8795.25 0.59%): 1.30x speedup ▎
xcb-render-0.0-rgba swfdec-youtube-0 12869.06 (13022.56 0.59%) -> 10599.66 (10658.71 1.14%): 1.21x speedup ▎
xcb-rgba evolution-0 41035.17 (41109.37 0.09%) -> 34674.08 (35267.29 1.38%): 1.18x speedup ▏
xcb-render-0.0-rgba gnome-system-monitor-0 17733.01 (17741.65 0.02%) -> 15564.74 (15615.23 0.21%): 1.14x speedup ▏
xcb-rgba firefox-planet-gnome-0 85445.04 (85867.32 0.25%) -> 75448.13 (76586.55 0.86%): 1.13x speedup ▏
xcb-rgba gvim-0 71743.82 (71754.51 0.01%) -> 65753.64 (65756.36 0.07%): 1.09x speedup ▏
xcb-render-0.0-rgba firefox-talos-svg-0 111644.58 (112783.48 0.51%) -> 104088.25 (104866.37 0.54%): 1.07x speedup ▏
xcb-rgba gnome-system-monitor-0 8024.22 (8059.03 0.22%) -> 7491.09 (7509.25 0.20%): 1.07x speedup ▏

As can be seen, in workloads dominated by rendering to lots of intermediate surfaces, this accelerated put_image restores earlier performance lost due to the switch to reusing "gpu-hot" buffers. However the issue of whether we are doing excess work in the kernel is still open.

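
For reference, the "speedup" figure on each row is consistent with dividing the first old value by the first new value, which suggests the numbers are elapsed times (lower is better); that interpretation is an assumption here, not something stated in the output. A minimal check on three of the rows, with a hypothetical `speedup` helper:

```python
# (label, old value, new value) copied from three of the rows above.
rows = [
    ("xcb-render-0.0-rgba poppler-0", 70628.98, 17578.64),
    ("xcb-render-0.0-rgba gnome-terminal-vim-0", 137326.07, 47035.30),
    ("xcb-rgba gvim-0", 71743.82, 65753.64),
]


def speedup(old: float, new: float) -> float:
    """Ratio old/new; greater than 1 means the new run took less time."""
    return old / new


for label, old, new in rows:
    print(f"{label}: {speedup(old, new):.2f}x")
```

The three ratios round to 4.02x, 2.92x and 1.09x, the same as the reported values.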
commit 19d8c0cf50e98909c533ebfce3a0dd3f72b755c1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Nov 29 21:16:49 2009 +0000

uxa: PutImage acceleration

Avoid waiting on dirty buffer object by streaming the upload to a fresh, non-GPU hot buffer and blitting to the destination.

This should help to redress the regression reported in bug 18075:
[UXA] XPutImage performance regression
https://bugs.freedesktop.org/show_bug.cgi?id=18075

Using the particular synthetic benchmark in question on a g45:

Before:
9542.910448 Ops/s; put composition (!); 15x15
5623.271889 Ops/s; put composition (!); 75x75
1685.520362 Ops/s; put composition (!); 250x250

After:
40173.865300 Ops/s; put composition (!); 15x15
28670.280612 Ops/s; put composition (!); 75x75
4794.368601 Ops/s; put composition (!); 250x250

which while not stellar performance is at least an improvement. As anticipated this has little impact on the non-fallback RENDER paths, for instance the current cairo-xlib backend is unaffected by this change.

Wonderful :) Thanks a lot for working on this, I'll re-run some real world tests as soon as I have a little bit of spare time. I guess this will also improve Qt's performance, since at least Qt <= 4.5 uses client-side gradients. Gradient-based themes suffered a lot because of this :-/

On my 945GM I now get 50k iterations, and Qt themes using client-side gradients also feel smoother when resizing. Thanks a lot! Can be closed I guess :)

Thanks Clemens, though hopefully this won't be the last improvement to image upload speed.

SNA beats XAA on the same hardware even for small operations, often with a few times the throughput achieved with XAA/EXA/UXA. Thanks a lot for keeping up work on the performance side.

intel-2.17-git / SNA:
468383.245106 Ops/s; rects (!); 15x15
117185.000155 Ops/s; rects (!); 75x75
34506.887657 Ops/s; rects (!); 250x250
181565.516196 Ops/s; rects composition (!); 15x15
77165.508893 Ops/s; rects composition (!); 75x75
28857.551604 Ops/s; rects composition (!); 250x250
235465.519634 Ops/s; put composition (!); 15x15
36163.067633 Ops/s; put composition (!); 75x75
5751.605615 Ops/s; put composition (!); 250x250

I have to admit I am a bit sad my notebook is now almost EOL ;)
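
The idea in the PutImage acceleration commit quoted in this thread (upload into a fresh buffer and blit, instead of waiting for a buffer the GPU is still using) can be illustrated with a toy timeline model. This is not driver code: the function, its parameters and all cost values are invented for illustration, in arbitrary time units, and say nothing about real hardware timings.

```python
def simulate(n_uploads: int, staged: bool,
             cpu_copy: float = 1.0, gpu_op: float = 3.0,
             gpu_blit: float = 0.5) -> tuple[float, float]:
    """Toy model: return (CPU time spent stalled, time the GPU finishes).

    All costs are made-up units chosen only to show the effect.
    """
    cpu = 0.0               # CPU clock
    gpu_idle_at = 0.0       # when the GPU has drained its queue
    dest_busy_until = 0.0   # when the GPU last touches the destination
    stalled = 0.0
    for _ in range(n_uploads):
        if staged:
            # Write into a fresh buffer the GPU has never used: no wait.
            cpu += cpu_copy
            # The GPU then blits staging -> destination before compositing.
            gpu_work = gpu_blit + gpu_op
        else:
            # Write directly into the destination: the CPU must wait
            # until the GPU is done with it.
            if dest_busy_until > cpu:
                stalled += dest_busy_until - cpu
                cpu = dest_busy_until
            cpu += cpu_copy
            gpu_work = gpu_op
        start = max(cpu, gpu_idle_at)
        gpu_idle_at = start + gpu_work
        dest_busy_until = gpu_idle_at
    return stalled, gpu_idle_at


if __name__ == "__main__":
    print("direct:", simulate(3, staged=False))
    print("staged:", simulate(3, staged=True))
```

In this model the direct path stalls the CPU on every upload after the first, while the staged path never stalls, at the price of an extra blit per upload on the GPU.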
Created attachment 19666 [details]
Xorg log from the self-compiled server

With the intel-git pulled after commit 6707371176147340fabc9ab6f1e3d6d5ac980662, as well as the version delivered with Fedora10-Rawhide (some 2.5 prerelease), I see a large performance regression for a real-world workload, down to 50% of original throughput.

I wrote a test-case which tries to mimic the behaviour of the real-world app to some degree, and these were my results:

intel-2.1.1/xserver-1.3/2.6.25: 80ms (Fedora 8 xorg distribution)
intel-2.5pre/xserver-1.5.2/2.6.27: 150ms (Fedora 10 rawhide distribution)
intel-git/xserver-git/2.6.27: 200ms (self compiled)

Attached you will find the benchmark, a sysprof profile of the self-compiled Xorg, and the Xorg.log. The benchmark itself does not handle ms-overflow, so don't be surprised if you sometimes see negative results.

My System:
Fedora 10 rawhide
Intel-945GM
Core2Duo-T7200, 3GB DDR2-533
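
To relate the three timings to the "down to 50% of original throughput" figure: if the benchmark does a fixed amount of work per run, throughput is inversely proportional to the run time. The sketch below makes that assumption explicit; `relative_throughput` is a hypothetical helper, and the timings are the ones listed in this report.

```python
# Run times in milliseconds, as reported in this bug description.
timings_ms = {
    "intel-2.1.1/xserver-1.3/2.6.25": 80,
    "intel-2.5pre/xserver-1.5.2/2.6.27": 150,
    "intel-git/xserver-git/2.6.27": 200,
}


def relative_throughput(baseline_ms: float, ms: float) -> float:
    """Throughput as a percentage of the baseline, assuming fixed work per run."""
    return 100.0 * baseline_ms / ms


baseline = timings_ms["intel-2.1.1/xserver-1.3/2.6.25"]
for stack, ms in timings_ms.items():
    print(f"{stack}: {relative_throughput(baseline, ms):.0f}% of baseline")
```

Under that assumption the Fedora 10 rawhide stack is at roughly half of the baseline throughput, and the self-compiled git stack is lower still.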