| Field | Value |
|---|---|
| Message ID | 56376FFF.3070008@arm.com |
| State | New |
Hi Jeff,

On 02/11/15 22:31, Jeff Law wrote:
> On 11/02/2015 07:15 AM, Kyrill Tkachov wrote:
>> Hi all,
>>
>> This patch attempts to restrict combine from transforming ZERO_EXTEND
>> and SIGN_EXTEND operations into and-bitmask
>> and weird SUBREG expressions when they appear inside MULT expressions.
>> This is because a MULT rtx containing these
>> extend operations is usually a well understood widening multiply operation.
>> However, if the costs for simple zero or sign-extend moves happen to
>> line up in a particular way,
>> expand_compound_operation will end up mangling a perfectly innocent
>> extend+mult+add rtx like:
>> (set (reg/f:DI 393)
>>      (plus:DI (mult:DI (sign_extend:DI (reg/v:SI 425 [ selected ]))
>>                        (sign_extend:DI (reg:SI 606)))
>>               (reg:DI 600)))
>>
>> into:
>> (set (reg/f:DI 393)
>>      (plus:DI (mult:DI (and:DI (subreg:DI (reg:SI 606) 0)
>>                                (const_int 202 [0xca]))
>>                        (sign_extend:DI (reg/v:SI 425 [ selected ])))
>>               (reg:DI 600)))
> Going to leave the review side of this for Segher.
>
> If you decide to go forward, there's a section in md.texi WRT
> canonicalization of these RTL codes that probably would need updating.
> Just search for "canonicalization" section and read down a ways.
>

You mean document a canonical form for these operations as
(mult (extend op1) (extend op2)) ?

>
> Jeff
>
>
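For context, the kind of source that produces this extend+mult+add RTL is an ordinary widening multiply-accumulate. A minimal C illustration (mine, not taken from the thread) that aarch64 would normally be expected to compile to a single smaddl at -O2:

<code>
/* Illustration only: acc + (long long) a * b is the C-level form of the
   (plus:DI (mult:DI (sign_extend:DI ...) (sign_extend:DI ...)) ...) RTL
   shown above, which aarch64 matches as smaddl when the extends survive.  */
long long
widening_madd (int a, int b, long long acc)
{
  return acc + (long long) a * b;
}
</code>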
On 04/11/15 11:37, Kyrill Tkachov wrote:
>
> On 02/11/15 22:31, Jeff Law wrote:
>> On 11/02/2015 07:15 AM, Kyrill Tkachov wrote:
>>> Hi all,
>>>
>>> This patch attempts to restrict combine from transforming ZERO_EXTEND
>>> and SIGN_EXTEND operations into and-bitmask
>>> and weird SUBREG expressions when they appear inside MULT expressions.
>>> This is because a MULT rtx containing these
>>> extend operations is usually a well understood widening multiply operation.
>>> However, if the costs for simple zero or sign-extend moves happen to
>>> line up in a particular way,
>>> expand_compound_operation will end up mangling a perfectly innocent
>>> extend+mult+add rtx like:
>>> (set (reg/f:DI 393)
>>>      (plus:DI (mult:DI (sign_extend:DI (reg/v:SI 425 [ selected ]))
>>>                        (sign_extend:DI (reg:SI 606)))
>>>               (reg:DI 600)))
>>>
>>> into:
>>> (set (reg/f:DI 393)
>>>      (plus:DI (mult:DI (and:DI (subreg:DI (reg:SI 606) 0)
>>>                                (const_int 202 [0xca]))
>>>                        (sign_extend:DI (reg/v:SI 425 [ selected ])))
>>>               (reg:DI 600)))
>> Going to leave the review side of this for Segher.
>>
>> If you decide to go forward, there's a section in md.texi WRT
>> canonicalization of these RTL codes that probably would need updating.
>> Just search for "canonicalization" section and read down a ways.
>>
>
> You mean document a canonical form for these operations as
> (mult (extend op1) (extend op2)) ?
>

Looking at the issue more deeply I think the concrete problem is the code
in expand_compound_operation:

<code>
  /* If we reach here, we want to return a pair of shifts.  The inner
     shift is a left shift of BITSIZE - POS - LEN bits.  The outer
     shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
     logical depending on the value of UNSIGNEDP.

     If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts will be
     converted into an AND of a shift.

     We must check for the case where the left shift would have a negative
     count.  This can happen in a case like (x >> 31) & 255 on machines
     that can't shift by a constant.  On those machines, we would first
     combine the shift with the AND to produce a variable-position
     extraction.  Then the constant of 31 would be substituted in
     to produce such a position.  */

  modewidth = GET_MODE_PRECISION (GET_MODE (x));
  if (modewidth >= pos + len)
    {
      machine_mode mode = GET_MODE (x);
      tem = gen_lowpart (mode, XEXP (x, 0));
      if (!tem || GET_CODE (tem) == CLOBBER)
        return x;
      tem = simplify_shift_const (NULL_RTX, ASHIFT, mode,
                                  tem, modewidth - pos - len);
      tem = simplify_shift_const (NULL_RTX, unsignedp ? LSHIFTRT : ASHIFTRT,
                                  mode, tem, modewidth - len);
    }
  else if (unsignedp && len < HOST_BITS_PER_WIDE_INT)
    tem = simplify_and_const_int (NULL_RTX, GET_MODE (x),
                                  simplify_shift_const (NULL_RTX, LSHIFTRT,
                                                        GET_MODE (x),
                                                        XEXP (x, 0), pos),
                                  ((unsigned HOST_WIDE_INT) 1 << len) - 1);
  else
    /* Any other cases we can't handle.  */
    return x;
</code>

This code ends up unconditionally transforming
(zero_extend:DI (reg:SI 125))
into:
(and:DI (subreg:DI (reg:SI 125) 0)
        (const_int 1 [0x1]))

which then gets substituted as an operand of the mult and nothing matches
after that.

Archaeology shows this code has been there since at least 1992.
I guess my question is why do we do this unconditionally?
Should we be checking whether the zero_extend form is cheaper than whatever
simplify_shift_const returns?

I don't think what expand_compound_operation is trying to do here is
canonicalisation...

Thanks,
Kyrill

>>
>> Jeff
>>
>>
>
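A side note on the two forms involved here: the pair of shifts and the and-bitmask rewrite both compute the same unsigned bit-field extract of LEN bits at position POS. A stand-alone C sketch of that arithmetic (my own illustration, not GCC code), for a 32-bit value and assuming 0 < LEN < 32 and POS + LEN <= 32:

<code>
#include <assert.h>
#include <stdint.h>

/* The "pair of shifts" form: left shift by 32 - POS - LEN, then a
   logical right shift by 32 - LEN.  */
static uint32_t
extract_shift_pair (uint32_t x, unsigned pos, unsigned len)
{
  return (x << (32 - pos - len)) >> (32 - len);
}

/* The and-bitmask form that ends up replacing the zero_extend:
   right shift by POS, then AND with a LEN-bit mask.  */
static uint32_t
extract_and_mask (uint32_t x, unsigned pos, unsigned len)
{
  return (x >> pos) & ((UINT32_C (1) << len) - 1);
}

int
main (void)
{
  /* Both extract bits [8, 16) of 0x12345678, i.e. 0x56.  */
  assert (extract_shift_pair (0x12345678, 8, 8) == 0x56);
  assert (extract_and_mask (0x12345678, 8, 8) == 0x56);
  return 0;
}
</code>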
Hi Segher,

On 04/11/15 23:50, Segher Boessenkool wrote:
> Hi Kyrill,
>
> On Mon, Nov 02, 2015 at 02:15:27PM +0000, Kyrill Tkachov wrote:
>> This patch attempts to restrict combine from transforming ZERO_EXTEND and
>> SIGN_EXTEND operations into and-bitmask
>> and weird SUBREG expressions when they appear inside MULT expressions. This
>> is because a MULT rtx containing these
>> extend operations is usually a well understood widening multiply operation.
> Right.  I have often wanted for combine to try plain substitution as well
> as substitution and simplification: simplification (which uses nonzero_bits
> etc.) often makes a mess of even moderately complex instructions.
>
> But since we do not have that yet, and your 24% number is nicely impressive,
> let's try to do no simplification for widening mults, as in your patch.
>
>> With this patch on aarch64 I saw increased generation of smaddl and umaddl
>> instructions where
>> before the combination of smull and add instructions were rejected because
>> the extend operands
>> were being transformed into these weird equivalent expressions.
>>
>> Overall, I see 24% more [su]maddl instructions being generated for SPEC2006
>> and with no performance regressions.
>>
>> The testcase in the patch is the most minimal one I could get that
>> demonstrates the issue I'm trying to solve.
>>
>> Does this approach look ok?
> In broad lines it does.  Could you try this patch instead?  Not tested
> etc. (other than building an aarch64 cross and your test case); I'll do
> that tomorrow if you like the result.

Thanks, that looks less intrusive. I did try it out on arm and aarch64.
It does work on the aarch64 testcase. However, there's also a correctness
regression; I'll try to explain inline...

>
>
> Segher
>
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index c3db2e0..3bf7ffb 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -5284,6 +5284,15 @@ subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
>            || GET_CODE (SET_DEST (x)) == PC))
>      fmt = "ie";
>
> +  /* Substituting into the operands of a widening MULT is not likely
> +     to create RTL matching a machine insn.  */
> +  if (code == MULT
> +      && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
> +          || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
> +      && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
> +          || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
> +    return x;

I think we should also add:
      && REG_P (XEXP (XEXP (x, 0), 0))
      && REG_P (XEXP (XEXP (x, 1), 0))

to the condition. Otherwise I've seen regressions in the arm testsuite,
the gcc.target/arm/smlatb-1.s test in particular that tries to match
the pattern:

(define_insn "*maddhisi4tb"
  [(set (match_operand:SI 0 "s_register_operand" "=r")
        (plus:SI (mult:SI (ashiftrt:SI
                           (match_operand:SI 1 "s_register_operand" "r")
                           (const_int 16))
                          (sign_extend:SI
                           (match_operand:HI 2 "s_register_operand" "r")))
                 (match_operand:SI 3 "s_register_operand" "r")))]

There we have a sign_extend of a shift that we want to convert to the form
that the pattern expects. So adding the checks for REG_P fixes that for me.

For the correctness issue I saw on aarch64 the shortest case I could reduce is:

short int a[16], b[16];
void
f5 (void)
{
  a[0] = b[0] / 14;
}

We synthesise the division and lose one of the intermediate immediates.
The good codegen is:
f5:
        adrp    x1, b
        mov     w0, 9363            // <--- This gets lost!
        movk    w0, 0x9249, lsl 16  // <--- This gets lost!
        adrp    x2, a
        ldrsh   w1, [x1, #:lo12:b]
        smull   x0, w1, w0
        lsr     x0, x0, 32
        add     w0, w1, w0
        asr     w0, w0, 3
        sub     w0, w0, w1, asr 31
        strh    w0, [x2, #:lo12:a]
        ret

The bad one with this patch is:
        adrp    x1, b
        adrp    x2, a
        ldrsh   w1, [x1, #:lo12:b]
        smull   x0, w1, w0
        lsr     x0, x0, 32
        add     w0, w1, w0
        asr     w0, w0, 3
        sub     w0, w0, w1, asr 31
        strh    w0, [x2, #:lo12:a]
        ret

The problematic hunk there is the subst hunk.
In the expression 'x':
(mult:DI (sign_extend:DI (reg:SI 80 [ b ]))
         (sign_extend:DI (reg:SI 82)))

it tries to substitute 'from':
(reg:SI 82)
with 'to':
(const_int -1840700269 [0xffffffff92492493])

Since we now return just 'x', combine thinks that the substitution succeeded
and eliminates the immediate move.

Is there a way that subst can signal some kind of "failed to substitute"
result? If not, I got it to work by using that check to set the in_dest
variable in the subsequent recursive call to subst, in a similar way to my
original patch, but I was hoping to avoid overloading the meaning of in_dest.

Thanks,
Kyrill

> +
>    /* Get the mode of operand 0 in case X is now a SIGN_EXTEND of a
>       constant.  */
>    if (fmt[0] == 'e')
> @@ -8455,6 +8464,15 @@ force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
>        /* ... fall through ... */
>
>      case MULT:
> +      /* Substituting into the operands of a widening MULT is not likely to
> +         create RTL matching a machine insn.  */
> +      if (code == MULT
> +          && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
> +              || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
> +          && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
> +              || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
> +        return gen_lowpart_or_truncate (mode, x);
> +
>        /* For PLUS, MINUS and MULT, we need any bits less significant than the
>           most significant bit in MASK since carries from those bits will
>           affect the bits we are interested in.  */
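Putting Segher's subst guard together with the REG_P checks suggested above, the combined condition would read roughly as follows (a sketch of the suggestion as stated in the thread, not a tested patch):

<code>
  /* Substituting into the operands of a widening MULT is not likely
     to create RTL matching a machine insn.  Only back off when both
     extends wrap plain registers, so that e.g. a sign_extend of a
     shift (as in arm's *maddhisi4tb pattern) is still simplified.  */
  if (code == MULT
      && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
          || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
      && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
          || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND)
      && REG_P (XEXP (XEXP (x, 0), 0))
      && REG_P (XEXP (XEXP (x, 1), 0)))
    return x;
</code>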
commit 2005243d896cbeb3d22421a63f080699ab357610
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Fri Oct 30 13:56:10 2015 +0000

    [combine] Don't transform sign and zero extends inside mults

diff --git a/gcc/combine.c b/gcc/combine.c
index fa1a93f..be0f5ae 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -5369,6 +5369,7 @@ subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 	      n_occurrences++;
 	    }
 	  else
+	    {
 	    /* If we are in a SET_DEST, suppress most cases unless we
 	       have gone inside a MEM, in which case we want to
 	       simplify the address.  We assume here that things that
@@ -5376,15 +5377,33 @@ subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 	       parts in the first expression.  This is true for SUBREG,
 	       STRICT_LOW_PART, and ZERO_EXTRACT, which are the only
 	       things aside from REG and MEM that should appear in a
-	       SET_DEST.  */
-	    new_rtx = subst (XEXP (x, i), from, to,
+	       SET_DEST.  Also, restrict transformations on SIGN_EXTENDs
+	       and ZERO_EXTENDs if they appear inside multiplications if
+	       the target supports widening multiplies.  This is to avoid
+	       mangling such expressions beyond recognition.  */
+	      bool restrict_extend_p = false;
+	      rtx_code inner_code = GET_CODE (XEXP (x, i));
+
+	      if (code == MULT
+		  && (inner_code == SIGN_EXTEND
+		      || inner_code == ZERO_EXTEND)
+		  && widening_optab_handler (inner_code == ZERO_EXTEND
+					     ? umul_widen_optab
+					     : smul_widen_optab,
+					     GET_MODE (x),
+					     GET_MODE (XEXP (XEXP (x, i), 0)))
+		     != CODE_FOR_nothing)
+		restrict_extend_p = true;
+
+	      new_rtx = subst (XEXP (x, i), from, to,
 			     (((in_dest
 				&& (code == SUBREG || code == STRICT_LOW_PART
 				    || code == ZERO_EXTRACT))
 			       || code == SET)
-			      && i == 0),
+			      && i == 0) || restrict_extend_p,
 			     code == IF_THEN_ELSE && i == 0,
 			     unique_copy);
+	    }
 
 	    /* If we found that we will have to reject this combination,
 	       indicate that by returning the CLOBBER ourselves, rather than
@@ -8509,8 +8528,30 @@ force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
 
       /* For most binary operations, just propagate into the operation and
 	 change the mode if we have an operation of that mode.  */
-      op0 = force_to_mode (XEXP (x, 0), mode, mask, next_select);
-      op1 = force_to_mode (XEXP (x, 1), mode, mask, next_select);
+      op0 = XEXP (x, 0);
+      op1 = XEXP (x, 1);
+
+      /* If we have a widening multiply avoid messing with the
+	 ZERO/SIGN_EXTEND operands so that we can still match the
+	 appropriate smul/umul patterns.  */
+      if (GET_CODE (x) == MULT
+	  && GET_CODE (op0) == GET_CODE (op1)
+	  && (GET_CODE (op0) == SIGN_EXTEND
+	      || GET_CODE (op0) == ZERO_EXTEND)
+	  && GET_MODE (op0) == GET_MODE (op1)
+	  && widening_optab_handler (GET_CODE (op0) == ZERO_EXTEND
+				     ? umul_widen_optab
+				     : smul_widen_optab,
+				     GET_MODE (x),
+				     GET_MODE (XEXP (op0, 0)))
+	     != CODE_FOR_nothing)
+	;
+      else
+	{
+	  op0 = force_to_mode (op0, mode, mask, next_select);
+	  op1 = force_to_mode (op1, mode, mask, next_select);
+	}
+
       /* If we ended up truncating both operands, truncate the result of the
 	 operation instead.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c b/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c
new file mode 100644
index 0000000..703c94d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a53" } */
+
+enum reg_class
+{
+  NO_REGS,
+  AD_REGS,
+  ALL_REGS, LIM_REG_CLASSES
+};
+
+extern enum reg_class
+  reg_class_subclasses[((int) LIM_REG_CLASSES)][((int) LIM_REG_CLASSES)];
+
+void
+init_reg_sets_1 ()
+{
+  unsigned int i, j;
+  {
+    for (j = i + 1; j < ((int) LIM_REG_CLASSES); j++)
+      {
+        enum reg_class *p;
+        p = &reg_class_subclasses[j][0];
+        while (*p != LIM_REG_CLASSES)
+          p++;
+      }
+  }
+}
+
+/* { dg-final { scan-assembler-not "umull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
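The testcase checks that no stand-alone umull survives in the generated code. For reference, the canonical C shape that aarch64's umaddl pattern matches is a plain unsigned widening multiply-accumulate; a minimal illustration of my own, not part of the patch:

<code>
/* Illustration only: with the zero_extends preserved inside the MULT,
   combine can match this directly as a single umaddl instead of a
   umull followed by an add.  */
unsigned long long
umaddl_example (unsigned int a, unsigned int b, unsigned long long acc)
{
  return acc + (unsigned long long) a * b;
}
</code>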