Message ID | CABXYE2VY59NeXPSbxUZ4sj0+G+u4LcxNg0GgoxH_-=gnQZWZiQ@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Thu, Nov 10, 2016 at 6:25 AM, Jim Wilson <jim.wilson@linaro.org> wrote: > This fixes a bug in code adding edges to the dependence graph. Values > for this_dir can be -1 (backward edge), 0 (no edge), 1 (forward edge), > and 2 (both backward and forward edges). There can be multiple > dependencies checked, creating multiple edges that have to be merged > together. this_dir contains the current edge. dir contains the > previous edges. The code fails to handle the case where this_dir is 2, > in which case we can return immediately. This is a minor optimization > to improve compile time. The code handles the case where dir is > non-zero and this_dir is zero by returning 2, which is incorrect. dir > should be unmodified in this case. We can return 2 only if both dir > and this_dir are non-zero and unequal, i.e. one is -1 and the other is > 1. This problem creates extra unnecessary edges, which can prevent > loops from being distributed. The patch fixes both problems. > > This passed an x86_64 gcc bootstrap with -ftree-loop-distribution > added to BOOT_CFLAGS and a testsuite regression check. Curiously, I > see that I get different results for the C/C++ ubsan tests every time > I run them, but this has nothing to do with my patch, as it happens > with or without my patch. I haven't tried debugging this yet. Might > be related to my recent upgrade to Ubuntu 16.04 LTS. Otherwise, there > are no regressions. > > On SPEC CPU2006, on aarch64, I see 5879 loops distributed without the > patch, and 5906 loops distributed with the patch. So 27 extra loops > are distributed which is about 0.5% more loop distributions. There is > no measurable performance gain from the bug fix on the CPU2006 run > time though I plan to spend some more time looking at this code to see > if I can find other improvements. The biggest "lack" of loop distribution is the ability to undo CSE so for for (;;) { a[i] = a[i] + 1; b[i] = a[i]; } CSE makes us see for (;;) { tem = a[i]; tem2 = tem + 1; a[i] = tem; b[i] = tem; } and loop distribution cannot re-materialize tem3 from a[i] thus most of the time it ends up pulling redundant computations into each partition (sometimes that can reduce memory bandwith if one less stream is used but sometimes not, like in the above case). Then of course the cost model is purely modeled for STREAM (reduce the number of memory streams). So loop distribution is expected to pessimize code for the CSE case in case you are not memory bound and improve things if you are memory bound. > OK? Ok. Thanks for the improvement! Richard. > Jim
On Thu, Nov 10, 2016 at 2:53 AM, Richard Biener <richard.guenther@gmail.com> wrote: > The biggest "lack" of loop distribution is the ability to undo CSE so for I hadn't noticed this problem yet. I will have to take a look. > Then of course the cost model is purely modeled for STREAM (reduce the number > of memory streams). So loop distribution is expected to pessimize code for > the CSE case in case you are not memory bound and improve things if you > are memory bound. I noticed this problem. I think loop distribution should be callable from inside the vectorizer or vice versa. if a loop can't be vectorized, but distributing the loop allows the sub loops to be vectorized, then we should go ahead and ditsribute, even if that increases the number of memory streams slightly, as the gain from vectorizing should be greater than the loss from the additional memory streams. We could have a cost model that tries to compute the gain/loss here and make a better decision of when to distribute to increase vectorization at the expense of the number of memory streams. This looks like a major project though, and may be more work than I have time for. Jim
On Fri, Nov 11, 2016 at 7:55 PM, Jim Wilson <jim.wilson@linaro.org> wrote: > On Thu, Nov 10, 2016 at 2:53 AM, Richard Biener > <richard.guenther@gmail.com> wrote: >> The biggest "lack" of loop distribution is the ability to undo CSE so for > > I hadn't noticed this problem yet. I will have to take a look. > >> Then of course the cost model is purely modeled for STREAM (reduce the number >> of memory streams). So loop distribution is expected to pessimize code for >> the CSE case in case you are not memory bound and improve things if you >> are memory bound. > > I noticed this problem. I think loop distribution should be callable > from inside the vectorizer or vice versa. if a loop can't be > vectorized, but distributing the loop allows the sub loops to be > vectorized, then we should go ahead and ditsribute, even if that > increases the number of memory streams slightly, as the gain from > vectorizing should be greater than the loss from the additional memory > streams. We could have a cost model that tries to compute the > gain/loss here and make a better decision of when to distribute to > increase vectorization at the expense of the number of memory streams. > This looks like a major project though, and may be more work than I > have time for. Yes. That's true for most enabling transforms (an easier one that comes to my mind is final value replacement, which, when required from the vectorizer could use a different cost model). Richard. > Jim
2016-11-09 Jim Wilson <jim.wilson@linaro.org> * tree-loop-distribution.c (pg_add_dependence_edges): Return 2 if this_dir is 2. Check for this_dir non-zero before dir != this_dir check. Index: gcc/tree-loop-distribution.c =================================================================== --- gcc/tree-loop-distribution.c (revision 242025) +++ gcc/tree-loop-distribution.c (working copy) @@ -1408,9 +1408,11 @@ pg_add_dependence_edges (struct graph *rdg, vec<lo else this_dir = 0; free_dependence_relation (ddr); - if (dir == 0) + if (this_dir == 2) + return 2; + else if (dir == 0) dir = this_dir; - else if (dir != this_dir) + else if (this_dir != 0 && dir != this_dir) return 2; /* Shuffle "back" dr1. */ dr1 = saved_dr1;