diff mbox

[combine] PR rtl-optimization/68381: Only restrict pure simplification in mult-extend subst case, allow other substitutions

Message ID 564DA3D1.10905@arm.com
State New
Headers show

Commit Message

Kyrylo Tkachov Nov. 19, 2015, 10:26 a.m. UTC
Hi all,

In this PR we end up removing a widening multiplication. I suspect it's some path in combine
that doesn't handle the case where we return clobber of 0 properly. This started with my recent
combine patch r230326.
Changing the subst hunk from that patch to just return x when handling a no-op substitution
fixes the testcase for me and doesn't regress the widening-mult cases that r230326 improved.
In fact, with this patch I'm seeing improved codegen for aarch64 with some widening multiplications
combined with constant operands and ending up in bitfield move/insert instructions.

Bootstrapped and tested on aarch64, arm, x86_64.

Ok for trunk?

Thanks,
Kyrill

2015-11-19  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR rtl-optimization/68381
     * combine.c (subst): Do not return clobber of zero in widening mult
     case.  Just return x unchanged if it is a no-op substitution.

2015-11-19  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR rtl-optimization/68381
     * gcc.c-torture/execute/pr68381.c: New test.

Comments

Kyrylo Tkachov Nov. 19, 2015, 1:38 p.m. UTC | #1
On 19/11/15 10:57, Segher Boessenkool wrote:
> Hi Kyrill,

>

> On Thu, Nov 19, 2015 at 10:26:25AM +0000, Kyrill Tkachov wrote:

>> In this PR we end up removing a widening multiplication. I suspect it's

>> some path in combine

>> that doesn't handle the case where we return clobber of 0 properly.

> That is troublesome.  Could you look deeper?


Yes.
So the bad case is when we're in subst and returning a CLOBBER of zero
and 'from' is (reg/v:SI 80 [ x ]) and 'to' is (zero_extend:SI (reg:HI 0 x0 [ x ])).
The call to subst comes from try_combine at line 3403:

    if (added_sets_1)
     {
       rtx t = i1pat;
       if (i0_feeds_i1_n)
         t = subst (t, i0dest, i0src_copy ? i0src_copy : i0src, 0, 0, 0);

       XVECEXP (newpat, 0, --total_sets) = t;
     }

It uses t after calling subst on it without checking that it didn't return a clobber.
If I change that snippet to check for CLOBBER:
   if (added_sets_1)
     {
       rtx t = i1pat;
       if (i0_feeds_i1_n)
         t = subst (t, i0dest, i0src_copy ? i0src_copy : i0src, 0, 0, 0);

       if (GET_CODE (t) == CLOBBER)
         {
           undo_all ();
           return 0;
         }
       XVECEXP (newpat, 0, --total_sets) = t;
     }

The testcase gets fixed.
But shouldn't the XVECEXP (newpat, 0, --total_sets) = t; line create an uncrecognizable rtx
that would then get rejected by combine or something?

If we don't check for clobber there and perform the "XVECEXP = ..."
the resulting newpat looks like:
(parallel [
         (set (reg:CC 66 cc)
             (compare:CC (const_int 0 [0])
                 (const_int 0 [0])))
         (nil)
         (clobber:DI (const_int 0 [0]))
     ])

ah, so the clobber is put in a parallel with another insn and is thus accepted by recog?
So, should we check 't' after subst for clobbers as above? Or should this be fixed in
some other place?


>> This

>> started with my recent

>> combine patch r230326.

>> Changing the subst hunk from that patch to just return x when handling a

>> no-op substitution

>> fixes the testcase for me and doesn't regress the widening-mult cases that

>> r230326 improved.

>> In fact, with this patch I'm seeing improved codegen for aarch64 with some

>> widening multiplications

>> combined with constant operands and ending up in bitfield move/insert

>> instructions.

>>

>> Bootstrapped and tested on aarch64, arm, x86_64.

>>

>> Ok for trunk?

> I'll have a look what it does for code quality on other targets.


Thanks. In light of the above I think this patch happens to avoid
the issue highlighted above but we should fix the above code separately?

Kyrill

>

> Segher

>
Kyrylo Tkachov Nov. 19, 2015, 3 p.m. UTC | #2
On 19/11/15 14:41, Segher Boessenkool wrote:
> On Thu, Nov 19, 2015 at 01:38:53PM +0000, Kyrill Tkachov wrote:

>>> That is troublesome.  Could you look deeper?

>> Yes.

> Thanks.

>

>> So the bad case is when we're in subst and returning a CLOBBER of zero

>> and 'from' is (reg/v:SI 80 [ x ]) and 'to' is (zero_extend:SI (reg:HI 0 x0

>> [ x ])).

>> The call to subst comes from try_combine at line 3403:

>>

>>     if (added_sets_1)

>>      {

>>        rtx t = i1pat;

>>        if (i0_feeds_i1_n)

>>          t = subst (t, i0dest, i0src_copy ? i0src_copy : i0src, 0, 0, 0);

>>

>>        XVECEXP (newpat, 0, --total_sets) = t;

>>      }

>>

>> It uses t after calling subst on it without checking that it didn't return

>> a clobber.

>> If I change that snippet to check for CLOBBER:

>>    if (added_sets_1)

>>      {

>>        rtx t = i1pat;

>>        if (i0_feeds_i1_n)

>>          t = subst (t, i0dest, i0src_copy ? i0src_copy : i0src, 0, 0, 0);

>>

>>        if (GET_CODE (t) == CLOBBER)

>>          {

>>            undo_all ();

>>            return 0;

>>          }

>>        XVECEXP (newpat, 0, --total_sets) = t;

>>      }

>>

>> The testcase gets fixed.

>> But shouldn't the XVECEXP (newpat, 0, --total_sets) = t; line create an

>> uncrecognizable rtx

>> that would then get rejected by combine or something?

> Yes.  recog_for_combine_1 checks for a PARALLEL with such a CLOBBER

> right at the start; and of course having the clobber elsewhere will

> just not match.

>

>> If we don't check for clobber there and perform the "XVECEXP = ..."

>> the resulting newpat looks like:

>> (parallel [

>>          (set (reg:CC 66 cc)

>>              (compare:CC (const_int 0 [0])

>>                  (const_int 0 [0])))

>>          (nil)

>>          (clobber:DI (const_int 0 [0]))

>>      ])

>>

>> ah, so the clobber is put in a parallel with another insn and is thus

>> accepted by recog?

> No, recog_for_combine should refuse it because of that third arm.

> The second arm (the nil) looks very wrong, where is that coming

> from?  That isn't valid RTL.


Well, it came from a bit earlier before the subst call (around line 3390):
   /* If the actions of the earlier insns must be kept
      in addition to substituting them into the latest one,
      we must make a new PARALLEL for the latest insn
      to hold additional the SETs.  */
<snip>
       rtx old = newpat;
       total_sets = 1 + extra_sets;
       newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
       XVECEXP (newpat, 0, 0) = old;


extra_sets is 2, so we have a parallel with 3 slots so after the subst call
we put the clobber into --total_sets, that is slot 2:
       XVECEXP (newpat, 0, --total_sets) = t;
A bit further below we fill slot 1 with another rtx, so all 3 parts of the PARALLEL
are filled.
I'll look further into why recog_for_combine doesn't kill the whole thing.

Kyrill



>

>> So, should we check 't' after subst for clobbers as above? Or should this

>> be fixed in

>> some other place?

> There is a bug somewhere, so that will need fixing.  Workarounds are

> last resort, and even then we really need to know what is going on.

>

>> Thanks. In light of the above I think this patch happens to avoid

>> the issue highlighted above but we should fix the above code separately?

> Yes, if your patch creates better code we want it (and fix the regression),

> but you exposed a separate bug as well :-)

>

>

> Segher

>
Kyrylo Tkachov Nov. 19, 2015, 3:20 p.m. UTC | #3
On 19/11/15 15:00, Kyrill Tkachov wrote:
>

> On 19/11/15 14:41, Segher Boessenkool wrote:

>> On Thu, Nov 19, 2015 at 01:38:53PM +0000, Kyrill Tkachov wrote:

>>>> That is troublesome.  Could you look deeper?

>>> Yes.

>> Thanks.

>>

>>> So the bad case is when we're in subst and returning a CLOBBER of zero

>>> and 'from' is (reg/v:SI 80 [ x ]) and 'to' is (zero_extend:SI (reg:HI 0 x0

>>> [ x ])).

>>> The call to subst comes from try_combine at line 3403:

>>>

>>>     if (added_sets_1)

>>>      {

>>>        rtx t = i1pat;

>>>        if (i0_feeds_i1_n)

>>>          t = subst (t, i0dest, i0src_copy ? i0src_copy : i0src, 0, 0, 0);

>>>

>>>        XVECEXP (newpat, 0, --total_sets) = t;

>>>      }

>>>

>>> It uses t after calling subst on it without checking that it didn't return

>>> a clobber.

>>> If I change that snippet to check for CLOBBER:

>>>    if (added_sets_1)

>>>      {

>>>        rtx t = i1pat;

>>>        if (i0_feeds_i1_n)

>>>          t = subst (t, i0dest, i0src_copy ? i0src_copy : i0src, 0, 0, 0);

>>>

>>>        if (GET_CODE (t) == CLOBBER)

>>>          {

>>>            undo_all ();

>>>            return 0;

>>>          }

>>>        XVECEXP (newpat, 0, --total_sets) = t;

>>>      }

>>>

>>> The testcase gets fixed.

>>> But shouldn't the XVECEXP (newpat, 0, --total_sets) = t; line create an

>>> uncrecognizable rtx

>>> that would then get rejected by combine or something?

>> Yes.  recog_for_combine_1 checks for a PARALLEL with such a CLOBBER

>> right at the start; and of course having the clobber elsewhere will

>> just not match.

>>

>>> If we don't check for clobber there and perform the "XVECEXP = ..."

>>> the resulting newpat looks like:

>>> (parallel [

>>>          (set (reg:CC 66 cc)

>>>              (compare:CC (const_int 0 [0])

>>>                  (const_int 0 [0])))

>>>          (nil)

>>>          (clobber:DI (const_int 0 [0]))

>>>      ])

>>>

>>> ah, so the clobber is put in a parallel with another insn and is thus

>>> accepted by recog?

>> No, recog_for_combine should refuse it because of that third arm.

>> The second arm (the nil) looks very wrong, where is that coming

>> from?  That isn't valid RTL.

>

> Well, it came from a bit earlier before the subst call (around line 3390):

>   /* If the actions of the earlier insns must be kept

>      in addition to substituting them into the latest one,

>      we must make a new PARALLEL for the latest insn

>      to hold additional the SETs.  */

> <snip>

>       rtx old = newpat;

>       total_sets = 1 + extra_sets;

>       newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));

>       XVECEXP (newpat, 0, 0) = old;

>

>

> extra_sets is 2, so we have a parallel with 3 slots so after the subst call

> we put the clobber into --total_sets, that is slot 2:

>       XVECEXP (newpat, 0, --total_sets) = t;

> A bit further below we fill slot 1 with another rtx, so all 3 parts of the PARALLEL

> are filled.

> I'll look further into why recog_for_combine doesn't kill the whole thing.

>


Hmmm, so the answer to that is a bit further down the validate_replacement: path.
It's the code after the big comment:
   /* See if this is a PARALLEL of two SETs where one SET's destination is
      a register that is unused and this isn't marked as an instruction that
      might trap in an EH region.  In that case, we just need the other SET.
      We prefer this over the PARALLEL.

      This can occur when simplifying a divmod insn.  We *must* test for this
      case here because the code below that splits two independent SETs doesn't
      handle this case correctly when it updates the register status.

      It's pointless doing this if we originally had two sets, one from
      i3, and one from i2.  Combining then splitting the parallel results
      in the original i2 again plus an invalid insn (which we delete).
      The net effect is only to move instructions around, which makes
      debug info less accurate.  */

The code extracts all the valid sets inside the PARALLEL and calls recog_for_combine on them
individually, ignoring the clobber. Should we tell it to reject any. So recog_for_combine doesn't
complain because it never sees the clobbers and all the other checks in try_combine are gated
on "insn_code < 0" so they never apply.

Where should we add the extra checks for CLOBBER? Or do you reckon we should add another call to
recog_for_combine somewhere there?

Thanks,
Kyrill

> Kyrill

>

>

>

>>

>>> So, should we check 't' after subst for clobbers as above? Or should this

>>> be fixed in

>>> some other place?

>> There is a bug somewhere, so that will need fixing.  Workarounds are

>> last resort, and even then we really need to know what is going on.

>>

>>> Thanks. In light of the above I think this patch happens to avoid

>>> the issue highlighted above but we should fix the above code separately?

>> Yes, if your patch creates better code we want it (and fix the regression),

>> but you exposed a separate bug as well :-)

>>

>>

>> Segher

>>

>
Kyrylo Tkachov Nov. 26, 2015, 9:50 a.m. UTC | #4
On 24/11/15 00:15, Segher Boessenkool wrote:
> On Thu, Nov 19, 2015 at 03:20:22PM +0000, Kyrill Tkachov wrote:

>> Hmmm, so the answer to that is a bit further down the validate_replacement:

>> path.

>> It's the code after the big comment:

>>    /* See if this is a PARALLEL of two SETs where one SET's destination is

>>       a register that is unused and this isn't marked as an instruction that

>>       might trap in an EH region.  In that case, we just need the other SET.

>>       We prefer this over the PARALLEL.

>>

>>       This can occur when simplifying a divmod insn.  We *must* test for this

>>       case here because the code below that splits two independent SETs

>>       doesn't

>>       handle this case correctly when it updates the register status.

>>

>>       It's pointless doing this if we originally had two sets, one from

>>       i3, and one from i2.  Combining then splitting the parallel results

>>       in the original i2 again plus an invalid insn (which we delete).

>>       The net effect is only to move instructions around, which makes

>>       debug info less accurate.  */

>>

>> The code extracts all the valid sets inside the PARALLEL and calls

>> recog_for_combine on them

>> individually, ignoring the clobber.

> Before I made this use is_parallel_of_n_reg_sets the code used to test

> if it is a parallel of two sets, and no clobbers allowed.  So it would

> never allow a clobber of zero.  But now it does.  I'll fix this in

> is_parallel_of_n_reg_sets.

>

> Thanks for finding the problem!


Thanks for fixing the wrong-code issue.
As I mentioned on IRC, this patch improves codegen on aarch64 as well.
I've re-checked SPEC2006 and it seems to improve codegen around multiply-extend-accumulate
instructions. For example the sequence:
     mov    w4, 64
     mov    x1, 24
     smaddl    x1, w9, w4, x1 // multiply-sign-extend-accumulate
     add    x1, x3, x1

becomes something like this:
     mov    w3, 64
     smaddl    x1, w9, w3, x0
     add    x1, x1, 24  // constant 24 propagated into the add

Another was transforming the muliply-extend into something cheaper:
     mov    x0, 40
     mov    w22, 32
     umaddl    x22, w21, w22, x0 // multiply-zero-extend-accumulate

changed becomes:
     ubfiz    x22, x21, 5, 32 // ASHIFT+extend
     add    x22, x22, 40

which should be always beneficial.
 From what I can see we don't lose any of the multiply-extend-accumulate
  opportunities that we gained from the original combine patch.

So can we take this patch in as well?

Thanks,
Kyrill

> Segher

>
diff mbox

Patch

commit 58ec17169da2bb864d2bcaeb34a33e8c5a93348f
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Nov 17 15:33:58 2015 +0000

    [combine] PR rtl-optimization/68381: Only restrict pure simplification in mult-extend subst case, allow other substitutions

diff --git a/gcc/combine.c b/gcc/combine.c
index 26b14bf..3791abf 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -5286,7 +5286,7 @@  subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 	      || GET_CODE (SET_DEST (x)) == PC))
 	fmt = "ie";
 
-      /* Substituting into the operands of a widening MULT is not likely
+      /* Trying to simplify the operands of a widening MULT is not likely
 	 to create RTL matching a machine insn.  */
       if (code == MULT
 	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
@@ -5294,13 +5294,10 @@  subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
 	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND)
 	  && REG_P (XEXP (XEXP (x, 0), 0))
-	  && REG_P (XEXP (XEXP (x, 1), 0)))
-	{
-	  if (from == to)
-	    return x;
-	  else
-	    return gen_rtx_CLOBBER (GET_MODE (x), const0_rtx);
-	}
+	  && REG_P (XEXP (XEXP (x, 1), 0))
+	  && from == to)
+	return x;
+
 
       /* Get the mode of operand 0 in case X is now a SIGN_EXTEND of a
 	 constant.  */
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr68381.c b/gcc/testsuite/gcc.c-torture/execute/pr68381.c
new file mode 100644
index 0000000..33012a3
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr68381.c
@@ -0,0 +1,22 @@ 
+/* { dg-options "-fexpensive-optimizations -fno-tree-bit-ccp" } */
+
+__attribute__ ((noinline, noclone))
+int
+foo (unsigned short x, unsigned short y)
+{
+  int r;
+  if (__builtin_mul_overflow (x, y, &r))
+    __builtin_abort ();
+  return r;
+}
+
+int
+main (void)
+{
+  int x = 1;
+  int y = 2;
+  if (foo (x, y) != x * y)
+    __builtin_abort ();
+  return 0;
+}
+