Message ID | 1452513219-25168-1-git-send-email-james.greenhalgh@arm.com |
---|---|
State | New |
Headers | show |
On Mon, Jan 11, 2016 at 04:57:56PM -0600, Evandro Menezes wrote: > On 01/11/2016 05:53 AM, James Greenhalgh wrote: > >I'd like to switch the logic around in aarch64.c such that > >-mlow-precision-recip-sqrt causes us to always emit the low-precision > >software expansion for reciprocal square root. I have two reasons to do > >this; first is consistency across -mcpu targets, second is enabling more > >-mcpu targets to use the flag for peak tuning. > > > >I don't much like that the precision we use for -mlow-precision-recip-sqrt > >differs between cores (and possibly compiler revisions). Yes, we're > >under -ffast-math but I take this flag to mean the user explicitly wants the > >low-precision expansion, and we should not diverge from that based on an > >internal decision as to what is optimal for performance in the > >high-precision case. I'd prefer to keep things as predictable as possible, > >and here that means always emitting the low-precision expansion when asked. > > > >Judging by the comments in the thread proposing the reciprocal square > >root optimisation, this will benefit all cores currently supported by GCC. > >To be clear, we would still not expand in the high-precision case for any > >cores which do not explicitly ask for it. Currently that is Cortex-A57 > >and xgene, though I will be proposing a patch to remove Cortex-A57 from > >that list shortly. > > > >Which gives my second motivation for this patch. -mlow-precision-recip-sqrt > >is intended as a tuning flag for situations where performance is more > >important than precision, but the current logic requires setting an > >internal flag which also changes the performance characteristics where > >high-precision is needed. This conflates two decisions the target might > >want to make, and reduces the applicability of an option targets might > >want to enable for performance. In particular, I'd still like to see > >-mlow-precision-recip-sqrt continue to emit the cheaper, low-precision > >sequence for floats under Cortex-A57. > > > >Based on that reasoning, this patch makes the appropriate change to the > >logic. I've checked with the current -mcpu values to ensure that behaviour > >without -mlow-precision-recip-sqrt does not change, and that behaviour > >with -mlow-precision-recip-sqrt is to emit the low precision sequences. > > > >I've also put this through bootstrap and test on aarch64-none-linux-gnu > >with no issues. > > > >OK? > > Yes, it LGTM. Thanks. > I appreciate the idea of uniformity whne an option is specified, > which led me to think if it wouldn't be a good ide to add an option > that would have the effect of focring the emission of the reciprocal > square root, effectively forcing the flag > AARCH64_EXTRA_TUNE_RECIP_SQRT on, regardless of the tuning flags for > the given core. I think that this flag would be particularly useful > when specifying flags for specific functions, irrespective of the > core. > > Thoughts? Currently you can do this using the (mostly unsupported) -moverride mechanism as -moverride=tune=recip_sqrt from the command line. I'm not sure how reliable using this from __attribute__((target("override=tune=recip_sqrt"))) would be, I wrote a small testcase that didn't work as intended, but whether that is a bug or a design decision I'm not yet sure. I think the logic for parsing the target attribute is set up to reapply the command-line override string over whichever tuning options you apply through the attribute, rather than to allow you to apply a per-function override. As to whether we'd want to expose this as a fully supported, user-visible setting, I'd rather not. Our claim is that for the higher-precision sequences the results are close enough that we can consider this like reassociation width or other core-specific tuning parameters that we don't expose. What I'm hoping to avoid is a proliferation of supported options which are not in anybody's regular testing matrix. This one would not be so bad as it is automatically enabled by some cores. For now I'd rather not add the option. Thanks, James
On Tue, Jan 12, 2016 at 05:53:21AM +0000, Kumar, Venkataramanan wrote: > Hi James, > > > -----Original Message----- > > From: James Greenhalgh [mailto:james.greenhalgh@arm.com] > > Sent: Monday, January 11, 2016 5:24 PM > > To: gcc-patches@gcc.gnu.org > > Cc: nd@arm.com; marcus.shawcroft@arm.com; > > richard.earnshaw@arm.com; Kumar, Venkataramanan; > > philipp.tomsich@theobroma-systems.com; pinskia@gmail.com; > > Kyrylo.Tkachov@arm.com; e.menezes@samsung.com > > Subject: [Patch AArch64] Use software sqrt expansion always for -mlow- > > precision-recip-sqrt > > > > > > Hi, > > > > I'd like to switch the logic around in aarch64.c such that -mlow-precision- > > recip-sqrt causes us to always emit the low-precision software expansion for > > reciprocal square root. I have two reasons to do this; first is consistency > > across -mcpu targets, second is enabling more -mcpu targets to use the flag > > for peak tuning. > > > > I don't much like that the precision we use for -mlow-precision-recip-sqrt > > differs between cores (and possibly compiler revisions). Yes, we're under - > > ffast-math but I take this flag to mean the user explicitly wants the low- > > precision expansion, and we should not diverge from that based on an > > internal decision as to what is optimal for performance in the high-precision > > case. I'd prefer to keep things as predictable as possible, and here that > > means always emitting the low-precision expansion when asked. > > > > Judging by the comments in the thread proposing the reciprocal square root > > optimisation, this will benefit all cores currently supported by GCC. > > To be clear, we would still not expand in the high-precision case for any cores > > which do not explicitly ask for it. Currently that is Cortex-A57 and xgene, > > though I will be proposing a patch to remove Cortex-A57 from that list > > shortly. > > > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt > > is intended as a tuning flag for situations where performance is more > > important than precision, but the current logic requires setting an internal > > flag which also changes the performance characteristics where high-precision > > is needed. This conflates two decisions the target might want to make, and > > reduces the applicability of an option targets might want to enable for > > performance. In particular, I'd still like to see -mlow-precision-recip-sqrt > > continue to emit the cheaper, low-precision sequence for floats under > > Cortex-A57. > > > > Based on that reasoning, this patch makes the appropriate change to the > > logic. I've checked with the current -mcpu values to ensure that behaviour > > without -mlow-precision-recip-sqrt does not change, and that behaviour > > with -mlow-precision-recip-sqrt is to emit the low precision sequences. > > > > I've also put this through bootstrap and test on aarch64-none-linux-gnu with > > no issues. > > > > OK? > > > > Thanks, > > James > > > > Yes I like enabling this optimization for all cpus target via > -mlow-precision-recip-sqrt . > > If my understanding is correct for cortex-a57 we now need to use only > -mlow-precision-recip-sqrt to emit software sqrt expansion? > > In the below code > ---snip--- > void > aarch64_emit_swrsqrt (rtx dst, rtx src) > { > ............ > ............ > int iterations = double_mode ? 3 : 2; > > if (flag_mrecip_low_precision_sqrt) > iterations--; > ---snip--- > > Now cortex-a57 case we will always do 2 and 1 steps for double and float > and 3 and 2 will never be used. Should we make it 2 and 1 as default? Or > any target still needs to use 3 and 2. The code here should handle two cases: 1) Normal -Ofast case -> Some targets use the estimate expansion with 3 iterations for double, 2 for float. Other targets use the hardware fsqrt/fdiv instructions. 2) -mlow-precision-recip-sqrt -> All targets use the estimate expansion with 2 iterations for double, 1 for float. -mlow-precision-recip-sqrt is a specialisation to be used only when the programmer knows the lower precision is acceptable. It should not be on by default... > Ps: I remember reducing iterations benefited gromacs but caused some VE in > other FP benchmarks. ... For exactly this reason :-) Thanks, James
On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote: > > Hi, > > I'd like to switch the logic around in aarch64.c such that > -mlow-precision-recip-sqrt causes us to always emit the low-precision > software expansion for reciprocal square root. I have two reasons to do > this; first is consistency across -mcpu targets, second is enabling more > -mcpu targets to use the flag for peak tuning. > > I don't much like that the precision we use for -mlow-precision-recip-sqrt > differs between cores (and possibly compiler revisions). Yes, we're > under -ffast-math but I take this flag to mean the user explicitly wants the > low-precision expansion, and we should not diverge from that based on an > internal decision as to what is optimal for performance in the > high-precision case. I'd prefer to keep things as predictable as possible, > and here that means always emitting the low-precision expansion when asked. > > Judging by the comments in the thread proposing the reciprocal square > root optimisation, this will benefit all cores currently supported by GCC. > To be clear, we would still not expand in the high-precision case for any > cores which do not explicitly ask for it. Currently that is Cortex-A57 > and xgene, though I will be proposing a patch to remove Cortex-A57 from > that list shortly. > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt > is intended as a tuning flag for situations where performance is more > important than precision, but the current logic requires setting an > internal flag which also changes the performance characteristics where > high-precision is needed. This conflates two decisions the target might > want to make, and reduces the applicability of an option targets might > want to enable for performance. In particular, I'd still like to see > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision > sequence for floats under Cortex-A57. > > Based on that reasoning, this patch makes the appropriate change to the > logic. I've checked with the current -mcpu values to ensure that behaviour > without -mlow-precision-recip-sqrt does not change, and that behaviour > with -mlow-precision-recip-sqrt is to emit the low precision sequences. > > I've also put this through bootstrap and test on aarch64-none-linux-gnu > with no issues. > > OK? *Ping* Thanks, James > 2015-12-10 James Greenhalgh <james.greenhalgh@arm.com> > > * config/aarch64/aarch64.c (use_rsqrt_p): Always use software > reciprocal sqrt for -mlow-precision-recip-sqrt. > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 9142ac0..1d5d898 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void) > { > return (!flag_trapping_math > && flag_unsafe_math_optimizations > - && (aarch64_tune_params.extra_tuning_flags > - & AARCH64_EXTRA_TUNE_RECIP_SQRT)); > + && ((aarch64_tune_params.extra_tuning_flags > + & AARCH64_EXTRA_TUNE_RECIP_SQRT) > + || flag_mrecip_low_precision_sqrt)); > } > > /* Function to decide when to use
On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote: > On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote: > > > > Hi, > > > > I'd like to switch the logic around in aarch64.c such that > > -mlow-precision-recip-sqrt causes us to always emit the low-precision > > software expansion for reciprocal square root. I have two reasons to do > > this; first is consistency across -mcpu targets, second is enabling more > > -mcpu targets to use the flag for peak tuning. > > > > I don't much like that the precision we use for -mlow-precision-recip-sqrt > > differs between cores (and possibly compiler revisions). Yes, we're > > under -ffast-math but I take this flag to mean the user explicitly wants the > > low-precision expansion, and we should not diverge from that based on an > > internal decision as to what is optimal for performance in the > > high-precision case. I'd prefer to keep things as predictable as possible, > > and here that means always emitting the low-precision expansion when asked. > > > > Judging by the comments in the thread proposing the reciprocal square > > root optimisation, this will benefit all cores currently supported by GCC. > > To be clear, we would still not expand in the high-precision case for any > > cores which do not explicitly ask for it. Currently that is Cortex-A57 > > and xgene, though I will be proposing a patch to remove Cortex-A57 from > > that list shortly. > > > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt > > is intended as a tuning flag for situations where performance is more > > important than precision, but the current logic requires setting an > > internal flag which also changes the performance characteristics where > > high-precision is needed. This conflates two decisions the target might > > want to make, and reduces the applicability of an option targets might > > want to enable for performance. In particular, I'd still like to see > > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision > > sequence for floats under Cortex-A57. > > > > Based on that reasoning, this patch makes the appropriate change to the > > logic. I've checked with the current -mcpu values to ensure that behaviour > > without -mlow-precision-recip-sqrt does not change, and that behaviour > > with -mlow-precision-recip-sqrt is to emit the low precision sequences. > > > > I've also put this through bootstrap and test on aarch64-none-linux-gnu > > with no issues. > > > > OK? > > *Ping* *Pingx2* Thanks, James > > Thanks, > James > > > 2015-12-10 James Greenhalgh <james.greenhalgh@arm.com> > > > > * config/aarch64/aarch64.c (use_rsqrt_p): Always use software > > reciprocal sqrt for -mlow-precision-recip-sqrt. > > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > index 9142ac0..1d5d898 100644 > > --- a/gcc/config/aarch64/aarch64.c > > +++ b/gcc/config/aarch64/aarch64.c > > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void) > > { > > return (!flag_trapping_math > > && flag_unsafe_math_optimizations > > - && (aarch64_tune_params.extra_tuning_flags > > - & AARCH64_EXTRA_TUNE_RECIP_SQRT)); > > + && ((aarch64_tune_params.extra_tuning_flags > > + & AARCH64_EXTRA_TUNE_RECIP_SQRT) > > + || flag_mrecip_low_precision_sqrt)); > > } > > > > /* Function to decide when to use >
On Mon, Feb 01, 2016 at 01:59:34PM +0000, James Greenhalgh wrote: > On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote: > > On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote: > > > > > > Hi, > > > > > > I'd like to switch the logic around in aarch64.c such that > > > -mlow-precision-recip-sqrt causes us to always emit the low-precision > > > software expansion for reciprocal square root. I have two reasons to do > > > this; first is consistency across -mcpu targets, second is enabling more > > > -mcpu targets to use the flag for peak tuning. > > > > > > I don't much like that the precision we use for -mlow-precision-recip-sqrt > > > differs between cores (and possibly compiler revisions). Yes, we're > > > under -ffast-math but I take this flag to mean the user explicitly wants the > > > low-precision expansion, and we should not diverge from that based on an > > > internal decision as to what is optimal for performance in the > > > high-precision case. I'd prefer to keep things as predictable as possible, > > > and here that means always emitting the low-precision expansion when asked. > > > > > > Judging by the comments in the thread proposing the reciprocal square > > > root optimisation, this will benefit all cores currently supported by GCC. > > > To be clear, we would still not expand in the high-precision case for any > > > cores which do not explicitly ask for it. Currently that is Cortex-A57 > > > and xgene, though I will be proposing a patch to remove Cortex-A57 from > > > that list shortly. > > > > > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt > > > is intended as a tuning flag for situations where performance is more > > > important than precision, but the current logic requires setting an > > > internal flag which also changes the performance characteristics where > > > high-precision is needed. This conflates two decisions the target might > > > want to make, and reduces the applicability of an option targets might > > > want to enable for performance. In particular, I'd still like to see > > > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision > > > sequence for floats under Cortex-A57. > > > > > > Based on that reasoning, this patch makes the appropriate change to the > > > logic. I've checked with the current -mcpu values to ensure that behaviour > > > without -mlow-precision-recip-sqrt does not change, and that behaviour > > > with -mlow-precision-recip-sqrt is to emit the low precision sequences. > > > > > > I've also put this through bootstrap and test on aarch64-none-linux-gnu > > > with no issues. > > > > > > OK? > > > > *Ping* > > *Pingx2* *Ping^3* Thanks, James > > > 2015-12-10 James Greenhalgh <james.greenhalgh@arm.com> > > > > > > * config/aarch64/aarch64.c (use_rsqrt_p): Always use software > > > reciprocal sqrt for -mlow-precision-recip-sqrt. > > > > > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > > index 9142ac0..1d5d898 100644 > > > --- a/gcc/config/aarch64/aarch64.c > > > +++ b/gcc/config/aarch64/aarch64.c > > > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void) > > > { > > > return (!flag_trapping_math > > > && flag_unsafe_math_optimizations > > > - && (aarch64_tune_params.extra_tuning_flags > > > - & AARCH64_EXTRA_TUNE_RECIP_SQRT)); > > > + && ((aarch64_tune_params.extra_tuning_flags > > > + & AARCH64_EXTRA_TUNE_RECIP_SQRT) > > > + || flag_mrecip_low_precision_sqrt)); > > > } > > > > > > /* Function to decide when to use > > >
On Mon, Feb 08, 2016 at 10:57:44AM +0000, James Greenhalgh wrote: > On Mon, Feb 01, 2016 at 01:59:34PM +0000, James Greenhalgh wrote: > > On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote: > > > On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote: > > > > > > > > Hi, > > > > > > > > I'd like to switch the logic around in aarch64.c such that > > > > -mlow-precision-recip-sqrt causes us to always emit the low-precision > > > > software expansion for reciprocal square root. I have two reasons to do > > > > this; first is consistency across -mcpu targets, second is enabling more > > > > -mcpu targets to use the flag for peak tuning. > > > > > > > > I don't much like that the precision we use for -mlow-precision-recip-sqrt > > > > differs between cores (and possibly compiler revisions). Yes, we're > > > > under -ffast-math but I take this flag to mean the user explicitly wants the > > > > low-precision expansion, and we should not diverge from that based on an > > > > internal decision as to what is optimal for performance in the > > > > high-precision case. I'd prefer to keep things as predictable as possible, > > > > and here that means always emitting the low-precision expansion when asked. > > > > > > > > Judging by the comments in the thread proposing the reciprocal square > > > > root optimisation, this will benefit all cores currently supported by GCC. > > > > To be clear, we would still not expand in the high-precision case for any > > > > cores which do not explicitly ask for it. Currently that is Cortex-A57 > > > > and xgene, though I will be proposing a patch to remove Cortex-A57 from > > > > that list shortly. > > > > > > > > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt > > > > is intended as a tuning flag for situations where performance is more > > > > important than precision, but the current logic requires setting an > > > > internal flag which also changes the performance characteristics where > > > > high-precision is needed. This conflates two decisions the target might > > > > want to make, and reduces the applicability of an option targets might > > > > want to enable for performance. In particular, I'd still like to see > > > > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision > > > > sequence for floats under Cortex-A57. > > > > > > > > Based on that reasoning, this patch makes the appropriate change to the > > > > logic. I've checked with the current -mcpu values to ensure that behaviour > > > > without -mlow-precision-recip-sqrt does not change, and that behaviour > > > > with -mlow-precision-recip-sqrt is to emit the low precision sequences. > > > > > > > > I've also put this through bootstrap and test on aarch64-none-linux-gnu > > > > with no issues. > > > > > > > > OK? > > > > > > *Ping* > > > > *Pingx2* > > *Ping^3* *ping^4* Thanks, James > > > > 2015-12-10 James Greenhalgh <james.greenhalgh@arm.com> > > > > > > > > * config/aarch64/aarch64.c (use_rsqrt_p): Always use software > > > > reciprocal sqrt for -mlow-precision-recip-sqrt. > > > > > > > > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > > > index 9142ac0..1d5d898 100644 > > > > --- a/gcc/config/aarch64/aarch64.c > > > > +++ b/gcc/config/aarch64/aarch64.c > > > > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void) > > > > { > > > > return (!flag_trapping_math > > > > && flag_unsafe_math_optimizations > > > > - && (aarch64_tune_params.extra_tuning_flags > > > > - & AARCH64_EXTRA_TUNE_RECIP_SQRT)); > > > > + && ((aarch64_tune_params.extra_tuning_flags > > > > + & AARCH64_EXTRA_TUNE_RECIP_SQRT) > > > > + || flag_mrecip_low_precision_sqrt)); > > > > } > > > > > > > > /* Function to decide when to use > > > > > >
On 11 January 2016 at 11:53, James Greenhalgh <james.greenhalgh@arm.com> wrote: > > --- > 2015-12-10 James Greenhalgh <james.greenhalgh@arm.com> > > * config/aarch64/aarch64.c (use_rsqrt_p): Always use software > reciprocal sqrt for -mlow-precision-recip-sqrt. > OK /Marcus
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 9142ac0..1d5d898 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -7485,8 +7485,9 @@ use_rsqrt_p (void) { return (!flag_trapping_math && flag_unsafe_math_optimizations - && (aarch64_tune_params.extra_tuning_flags - & AARCH64_EXTRA_TUNE_RECIP_SQRT)); + && ((aarch64_tune_params.extra_tuning_flags + & AARCH64_EXTRA_TUNE_RECIP_SQRT) + || flag_mrecip_low_precision_sqrt)); } /* Function to decide when to use