Bug 97035

Summary: compilation of WebGL demo iamnop.com/particles shader slowed 5x, run-time perf dropped to 1/10th
Product: Mesa Reporter: nataraj.deshpande
Component: Drivers/DRI/i965Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: RESOLVED MOVED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: eero.t.tamminen
Version: unspecifiedKeywords: bisected, regression
Hardware: All   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=91857
https://bugs.freedesktop.org/show_bug.cgi?id=70613
Whiteboard:
i915 platform: i915 features:
Attachments: ui logs collected from chromebook with --disable-gpu-watchdog
Other spilling shader
Spilling slowdown shader

Description nataraj.deshpande 2016-07-21 23:27:45 UTC
Created attachment 125249 [details]
ui logs collected from chromebook with --disable-gpu-watchdog

Steps to Reproduce
===============
go to http://www.iamnop.com/particles/ in chrome browser

Expected Result
=============
website should load within 10 seconds with good performance (~30FPS)

Actual Result
==============
Website renders with 0-1FPS. On Skylake/Broadwell it takes ~40 seconds to start rendering. On Brasswell it takes over 2 mins to start rendering.

This looks like performance regression in mesa in later versions(12.0 etc). 
With mesa 11.30.0 issue is not seen on skylake/broadwell while performance is little better on braswell(cyan) taking ~20 seconds. 
With mesa ToT issue is always seen.

Tried on above platforms on chromebooks with chromeos and arch linux.
Chrome trace indicate
BDW with mesa 12.1.0: DoLinkProgram(load stage) wall duration=33.8 seconds
BDW with mesa 11.30: DoLinkProgram(load stage) wall duration=6.8 seconds (Good working)

Backporting to mesa 11.3 version on the same target device issue is not seen on big core.
Comment 1 Eero Tamminen 2016-08-29 15:49:01 UTC
Created attachment 126101 [details]
Other spilling shader
Comment 2 Eero Tamminen 2016-08-29 15:50:03 UTC
Created attachment 126102 [details]
Spilling slowdown shader
Comment 3 Eero Tamminen 2016-08-29 15:51:45 UTC
Tested this on SKL GT2 with today's Mesa.

All the other shapes selectable in the WebGL demo load instantly and run well over 60FPS.  However, selecting the default "galaxy" shape takes ~20s and runs at ~2FPS.

Problem is couple of shaders that get compiled, Mesa spills with them and as result they're huge.  I've attached cleaned up versions of the shaders (indented then & replaced hex hash names with something shorter).

OGL_PS_0016.glsl is the main culprit, Mesa produces 800KB shader assembly file with ~8000 instructions for it.  According to 'perf', compile time goes to:
-------------------------------
# Cost  Object        Symbol
27.14%  i965_dri.so   ra_allocate                                              
19.94%  i965_dri.so   fs_visitor::virtual_grf_interferes                       
13.90%  i965_dri.so   ra_add_node_adjacency                                    
10.21%  i965_dri.so   fs_visitor::assign_regs                                  
 4.11%  i965_dri.so   decrement_q.isra.2                                       
 3.83%  i965_dri.so   instruction_scheduler::add_dep                           
 2.97%  i965_dri.so   ra_add_node_interference                                 
 1.71%  libc-2.23.so  _int_malloc
-------------------------------

While OGL_PS_0031.glsl also spills and is in instruction count nearly half of the previous shader, it compiles and runs >10x faster.  Its 'perf' profile looks completely different:
-------------------------------
# Cost  Object        Symbol
17.66%  i965_dri.so   (anon)::ir_copy_propagation_visitor::visit
17.12%  i965_dri.so   (anon)::ir_copy_propagation_elements_visitor::kill
12.46%  i965_dri.so   (anon)::ir_copy_propagation_visitor::kill
12.14%  i965_dri.so   (anon)::ir_copy_propagation_elements_visitor::handle_rvalue
 4.47%  i965_dri.so   ra_allocate
 3.12%  i965_dri.so   ir_expression::accept
 2.66%  i965_dri.so   fs_visitor::virtual_grf_interferes
 1.61%  i965_dri.so   ra_add_node_adjacency
 1.39%  i965_dri.so   fs_visitor::assign_regs
-------------------------------
Comment 4 Eero Tamminen 2016-08-29 16:00:45 UTC
The shaders are very similar despite the huge compilation time difference.  In the much faster compiled shader, the large amount of expanded functions is called conditionally based on sampler result (in the slow one that call is unconditional) and it has few lines less before that.
Comment 5 Eero Tamminen 2016-09-01 09:48:41 UTC
(In reply to Eero Tamminen from comment #3)
> Tested this on SKL GT2 with today's Mesa.
> 
> All the other shapes selectable in the WebGL demo load instantly and run
> well over 60FPS.  However, selecting the default "galaxy" shape takes ~20s
> and runs at ~2FPS.

With Mesa 11.2, compilation takes only few seconds and "galaxy" runs at full speed.

When comparing at the generated assembly, Mesa 11.2 spills a bit:
SIMD8 shader: 4432 instructions. 0 loops. 35394 cycles. 81:179 spills:fills. Promoted 18 constants. Compacted 70912 to 48880 bytes (31%)
 
But the latest Mesa spills a lot more:
SIMD8 shader: 7985 instructions. 0 loops. 57210 cycles. 359:993 spills:fills. Promoted 19 constants. Compacted 127760 to 86032 bytes (33%)

I'll try whether I can locate the regression in this with eZBench:
  https://cgit.freedesktop.org/ezbench/tree/README


Note: When looking at the SIMD8 shader code generated by the Windows driver, it doesn't spill at all.
Comment 6 Matt Turner 2016-09-01 18:59:22 UTC
Eero, could you capture the shaders from this demo and put them into shader-db?
Comment 7 Matt Turner 2016-09-02 07:43:54 UTC
(In reply to Matt Turner from comment #6)
> Eero, could you capture the shaders from this demo and put them into
> shader-db?

Ken pointed out to me that we may not want a shader that takes 20 seconds to compile in shader-db. :)

Also, I guess I can reproduce the problem with what you have already attached. Thanks for providing those!
Comment 8 Eero Tamminen 2016-09-05 07:51:52 UTC
Had some troubles at first with ezBench, but finally it bisected this.

As expected, the same commit caused 5x increase in compilation speed and even larger drop in performance.  This was the commit:
-----------------------------
commit 5886cd79a0ed10b887840f6afb980a5a6a47e26e
Author:     Kenneth Graunke <kenneth@whitecape.org>
AuthorDate: Wed Apr 6 22:34:39 2016 -0700
Commit:     Kenneth Graunke <kenneth@whitecape.org>
CommitDate: Mon Apr 11 18:43:55 2016 -0700

    nir: Do basic constant reassociation.
    
    Many shaders contain expression trees of the form:
    
        const_1 * (value * const_2)
    
    Reorganizing these to
    
        (const_1 * const_2) * value
    
    will allow constant folding to combine the constants.  Sometimes, these
    constants are 2 and 0.5, so we can remove a multiply altogether.  Other
    times, it can create more immediate constants, which can actually hurt.
    
    Finding a good balance here is tricky.  While much more could be done,
    this simple patch seems to have a lot of positive benefit while having
    a low downside.
-----------------------------

According to Pekka, he's seen performance drops with above commit also with other test(s).

For the almost identical, but slightly smaller shader attached here, there was no perf regression.

(Matt's later "Drop -fno-strict-aliasing" commit improved compilation speed slightly.)
Comment 9 Eero Tamminen 2016-09-05 15:41:13 UTC
I checked our current benchmarks set with commit before and after this change, but that didn't show anything significant, so the problem seems to be fairly isolated.  I'll ask Pekka for more info where else he saw it have a negative effect.
Comment 10 Pekka Jylhä-Ollila 2016-09-07 10:51:29 UTC
(In reply to Eero Tamminen from comment #9)
> I checked our current benchmarks set with commit before and after this
> change, but that didn't show anything significant, so the problem seems to
> be fairly isolated.  I'll ask Pekka for more info where else he saw it have
> a negative effect.

GpuTest Volplosion performance decreased ~3.5% and FurMark decreased ~2%
(tested on HSW). I didn't notice any other changes in performance.
Comment 11 Eero Tamminen 2016-09-12 09:18:54 UTC
(In reply to Pekka Jylhä-Ollila from comment #10)
> GpuTest Volplosion performance decreased ~3.5% and FurMark decreased ~2%
> (tested on HSW). I didn't notice any other changes in performance.

I did testing on SKL and those two tests didn't decrease there (everything was within normal variance), so I guess for those two tests the regression is GEN specific.
Comment 12 Eero Tamminen 2017-01-09 12:10:39 UTC
Tested latest Mesa.  On my SKL GT2 the slow shader still takes ~20s to compile and produces following Mesa statistics:
SIMD8 shader: 7985 instructions. 0 loops. 57210 cycles. 359:993 spills:fills. Promoted 19 constants. Compacted 127760 to 86032 bytes (33%)
Comment 13 Eero Tamminen 2017-05-17 11:32:08 UTC
(In reply to Eero Tamminen from comment #12)
> Tested latest Mesa.  On my SKL GT2 the slow shader still takes ~20s to
> compile and produces following Mesa statistics:
> SIMD8 shader: 7985 instructions. 0 loops. 57210 cycles. 359:993
> spills:fills. Promoted 19 constants. Compacted 127760 to 86032 bytes (33%)

Latest Mesa (i.e. 5 months later) uses 17-18ms to compile on that machine, and gets following statistics:
SIMD8 shader: 8127 instructions. 0 loops. 57436 cycles. 369:1019 spills:fills. Promoted 18 constants. Compacted 130032 to 87472 bytes (33%)

I.e. there's been a minor improvement in compile time, but regression from v11.2 is still valid and spilling & instruction counts have gotten slightly worse. "perf" output looks same as earlier.
Comment 14 Tapani Pälli 2017-09-26 12:31:02 UTC
Testing with current Mesa (06bfb2d28f) on HSW desktop system, page loads in ~10 secs and runs 60fps meeting the expected result. I'm proposing to resolve this as FIXED unless still a problem on chromebooks?
Comment 15 Eero Tamminen 2017-09-26 16:07:56 UTC
(In reply to Tapani Pälli from comment #14)
> Testing with current Mesa (06bfb2d28f) on HSW desktop system,

This bug was originally filed against BSW/SKL.


> page loads in
> ~10 secs and runs 60fps meeting the expected result. I'm proposing to
> resolve this as FIXED unless still a problem on chromebooks?

Tested Mesa tip (git-379b24a40d) on SKL GT4e.  Issue is still there, page freezes for ~10 seconds and "galaxy" shape renders at 1/60th speed of the other shapes.

(I verified that Chromium is using right Mesa version from "chrome://gpu".)
Comment 16 Tapani Pälli 2017-09-27 03:31:59 UTC
(In reply to Eero Tamminen from comment #15)
> (In reply to Tapani Pälli from comment #14)
> > Testing with current Mesa (06bfb2d28f) on HSW desktop system,
> 
> This bug was originally filed against BSW/SKL.
> 
> 
> > page loads in
> > ~10 secs and runs 60fps meeting the expected result. I'm proposing to
> > resolve this as FIXED unless still a problem on chromebooks?
> 
> Tested Mesa tip (git-379b24a40d) on SKL GT4e.  Issue is still there, page
> freezes for ~10 seconds and "galaxy" shape renders at 1/60th speed of the
> other shapes.
> 
> (I verified that Chromium is using right Mesa version from "chrome://gpu".)

Yep, I tried now on KBL and it also reproduces this. Did not notice this is gen specific issue.
Comment 17 Tapani Pälli 2017-09-29 05:02:11 UTC
(In reply to Tapani Pälli from comment #16)
> (In reply to Eero Tamminen from comment #15)
> > (In reply to Tapani Pälli from comment #14)
> > > Testing with current Mesa (06bfb2d28f) on HSW desktop system,
> > 
> > This bug was originally filed against BSW/SKL.
> > 
> > 
> > > page loads in
> > > ~10 secs and runs 60fps meeting the expected result. I'm proposing to
> > > resolve this as FIXED unless still a problem on chromebooks?
> > 
> > Tested Mesa tip (git-379b24a40d) on SKL GT4e.  Issue is still there, page
> > freezes for ~10 seconds and "galaxy" shape renders at 1/60th speed of the
> > other shapes.
> > 
> > (I verified that Chromium is using right Mesa version from "chrome://gpu".)
> 
> Yep, I tried now on KBL and it also reproduces this. Did not notice this is
> gen specific issue.

Which actually means that this is not just 'generic' spilling issue, this is something specific to how new gens operate. In the shader the return clause of Fun_E() is what causes trouble. It does probably cause many variables to be 'live' but even just removing couple of those dot() calls is enough for this to run 60fps on KBL machine.
Comment 18 Mark Janes 2018-01-10 20:04:18 UTC
Seems like this shader should be in shader-db, if the compilation regression can be fixed so it only takes 20ms.
Comment 19 Matt Turner 2018-01-10 21:32:10 UTC
I don't have any plans to look into this.
Comment 20 Eero Tamminen 2018-01-12 13:18:44 UTC
Issue is still there, Curro's register liveness improvements last month didn't help this one significantly.
Comment 21 Tapani Pälli 2019-05-28 08:19:55 UTC
Browsing through old bugs ... this demo is still super-slow on KBL (1fps), on HSW I get ~40-50fps.
Comment 22 GitLab Migration User 2019-09-25 18:57:19 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1529.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.