Message ID | 66275bc9-7d97-b990-4c86-2de1f4a6a2fa@arm.com |
---|---|
State | New |
Headers | show |
Hi! On Mon, Jun 19, 2017 at 02:46:59PM +0100, Richard Earnshaw (lists) wrote: > Many parallel set insns are of the form of a single set that also sets > the condition code flags. In this case the cost of such an insn is > normally the cost of the part that doesn't set the flags, since updating > the condition flags is simply a side effect. > > At present all such insns are treated as having unknown cost (ie 0) and > combine assumes that such insns are infinitely more expensive than any > other insn sequence with a non-zero cost. That's not what combine does: it optimistically assumes any combination with unknown costs is an improvement. > This patch addresses this problem by allowing insn_rtx_cost to ignore > the condition setting part of a PARALLEL iff there is exactly one > comparison set and one non-comparison set. If the only set operation is > a comparison we still use that as the basis of the insn cost. I'll test this on a zillion archs, see what the effect is. Have you considered costing general parallels as well? Segher
On 19/06/17 15:08, Segher Boessenkool wrote: > Hi! > > On Mon, Jun 19, 2017 at 02:46:59PM +0100, Richard Earnshaw (lists) wrote: >> Many parallel set insns are of the form of a single set that also sets >> the condition code flags. In this case the cost of such an insn is >> normally the cost of the part that doesn't set the flags, since updating >> the condition flags is simply a side effect. >> >> At present all such insns are treated as having unknown cost (ie 0) and >> combine assumes that such insns are infinitely more expensive than any >> other insn sequence with a non-zero cost. > > That's not what combine does: it optimistically assumes any combination > with unknown costs is an improvement. So try this testcase on ARM. unsigned long x, y, z; int b; void test() { b = __builtin_sub_overflow (y,z, &x); } Without the patch, combine rips apart a compare and subtract insn because it sees it as having cost zero and substitutes it with separate compare and subtract insns. ie before: ldr r3, .L5 ldr r2, .L5+4 ldr r3, [r3] ldr r2, [r2] cmp r3, r2 <===== movcs r0, #0 movcc r0, #1 ldr ip, .L5+8 ldr r1, .L5+12 sub r3, r3, r2 <===== str r3, [ip] str r0, [r1] bx lr after: ldr r3, .L5 ldr r2, .L5+4 ldr r3, [r3] ldr r2, [r2] subs r3, r3, r2 <==== movcc r1, #1 movcs r1, #0 ldr r0, .L5+8 ldr r2, .L5+12 str r3, [r0] str r1, [r2] bx lr The combine log before the patch shows: allowing combination of insns 10 and 51 original costs 0 + 8 = 0 replacement costs 4 + 12 = 16 So it is clearly deciding that the original costs are greater than the replacement costs. > >> This patch addresses this problem by allowing insn_rtx_cost to ignore >> the condition setting part of a PARALLEL iff there is exactly one >> comparison set and one non-comparison set. If the only set operation is >> a comparison we still use that as the basis of the insn cost. > > I'll test this on a zillion archs, see what the effect is. > > Have you considered costing general parallels as well? > > I thought about it but concluded that there's no generically correct answer. It might be the max of all the individual sets or it might be the sum, or it might be somewhere in between. For example on ARM the load/store multiple operations are expressed as parallels, but their cost will depend on how many loads/stores happen in parallel within the hardware. I think we'd need a new back-end hook to handle the other cases sensibly. R. > Segher >
On 19/06/17 15:08, Segher Boessenkool wrote: > Hi! > > On Mon, Jun 19, 2017 at 02:46:59PM +0100, Richard Earnshaw (lists) wrote: >> Many parallel set insns are of the form of a single set that also sets >> the condition code flags. In this case the cost of such an insn is >> normally the cost of the part that doesn't set the flags, since updating >> the condition flags is simply a side effect. >> >> At present all such insns are treated as having unknown cost (ie 0) and >> combine assumes that such insns are infinitely more expensive than any >> other insn sequence with a non-zero cost. > > That's not what combine does: it optimistically assumes any combination > with unknown costs is an improvement. Actually the logic is int reject = old_cost > 0 && new_cost > old_cost; So reject will never be true if old cost is zero. R. > >> This patch addresses this problem by allowing insn_rtx_cost to ignore >> the condition setting part of a PARALLEL iff there is exactly one >> comparison set and one non-comparison set. If the only set operation is >> a comparison we still use that as the basis of the insn cost. > > I'll test this on a zillion archs, see what the effect is. > > Have you considered costing general parallels as well? > > > Segher >
On Mon, Jun 19, 2017 at 03:28:20PM +0100, Richard Earnshaw (lists) wrote: > > That's not what combine does: it optimistically assumes any combination > > with unknown costs is an improvement. > > So try this testcase on ARM. > > unsigned long x, y, z; > int b; > void test() > { > b = __builtin_sub_overflow (y,z, &x); > } > > > Without the patch, combine rips apart a compare and subtract insn > because it sees it as having cost zero and substitutes it with separate > compare and subtract insns. > The combine log before the patch shows: > > allowing combination of insns 10 and 51 > original costs 0 + 8 = 0 > replacement costs 4 + 12 = 16 Yes, this is a good example of a case where your patch helps. Thanks. > So it is clearly deciding that the original costs are greater than the > replacement costs. No: it allows any combination with unknown cost (either old or new cost). See combine_validate_cost. > >> This patch addresses this problem by allowing insn_rtx_cost to ignore > >> the condition setting part of a PARALLEL iff there is exactly one > >> comparison set and one non-comparison set. If the only set operation is > >> a comparison we still use that as the basis of the insn cost. > > > > I'll test this on a zillion archs, see what the effect is. > > > > Have you considered costing general parallels as well? > > I thought about it but concluded that there's no generically correct > answer. It might be the max of all the individual sets or it might be > the sum, or it might be somewhere in between. For example on ARM the > load/store multiple operations are expressed as parallels, but their > cost will depend on how many loads/stores happen in parallel within the > hardware. > > I think we'd need a new back-end hook to handle the other cases sensibly. And in general make insn_rtx_cost do something more sane than just looking at a set_src_cost, yeah. The problem is changing any of this without regressing some targets. Of course we are in stage 1 now ;-) Segher
On Mon, Jun 19, 2017 at 03:45:23PM +0100, Richard Earnshaw (lists) wrote: > >> At present all such insns are treated as having unknown cost (ie 0) and > >> combine assumes that such insns are infinitely more expensive than any > >> other insn sequence with a non-zero cost. > > > > That's not what combine does: it optimistically assumes any combination > > with unknown costs is an improvement. > > Actually the logic is > > int reject = old_cost > 0 && new_cost > old_cost; > > So reject will never be true if old cost is zero. Yes, exactly; and neither if new_cost is zero. If any cost is unknown combine just hopes for the best. Segher
On 19/06/17 16:09, Segher Boessenkool wrote: > On Mon, Jun 19, 2017 at 03:45:23PM +0100, Richard Earnshaw (lists) wrote: >>>> At present all such insns are treated as having unknown cost (ie 0) and >>>> combine assumes that such insns are infinitely more expensive than any >>>> other insn sequence with a non-zero cost. >>> >>> That's not what combine does: it optimistically assumes any combination >>> with unknown costs is an improvement. >> >> Actually the logic is >> >> int reject = old_cost > 0 && new_cost > old_cost; >> >> So reject will never be true if old cost is zero. > > Yes, exactly; and neither if new_cost is zero. If any cost is unknown > combine just hopes for the best. > > > Segher > Yeah, and I'm not suggesting we change the logic there (sorry if the description was misleading). Instead I'm proposing that we handle more cases for parallels to not return zero. R.
On Mon, Jun 19, 2017 at 05:01:10PM +0100, Richard Earnshaw (lists) wrote: > Yeah, and I'm not suggesting we change the logic there (sorry if the > description was misleading). Instead I'm proposing that we handle more > cases for parallels to not return zero. Right. My test run is half way through, will have results later -- your change looks good to me, but it is always surprising whether better costs help or not, or even *hurt* good code generation (things are just too tightly tuned to the current behaviour, so some things may need retuning). Segher
On Mon, Jun 19, 2017 at 12:40:53PM -0500, Segher Boessenkool wrote: > On Mon, Jun 19, 2017 at 05:01:10PM +0100, Richard Earnshaw (lists) wrote: > > Yeah, and I'm not suggesting we change the logic there (sorry if the > > description was misleading). Instead I'm proposing that we handle more > > cases for parallels to not return zero. > > Right. My test run is half way through, will have results later -- > your change looks good to me, but it is always surprising whether > better costs help or not, or even *hurt* good code generation (things > are just too tightly tuned to the current behaviour, so some things > may need retuning). Everything built successfully (31 targets); --enable-checking=yes,rtl,tree so it took a while, sorry. The targets with any differences (table shows code size): old patched arm 11545709 11545797 powerpc 8442762 8442746 x86_64 10627428 10627363 Arm has very many differences, the others do not. For powerpc (which is 32-bit, 64-bit showed no differences) most of the difference is scheduling deciding to do things a bit differently, and most of it in places where we have not-so-good costs anyway. For arm the effects often cascade to bb-reorder making different decisions. Anyway, all differences are small, it is not likely to hurt anything. I support the patch, if that helps -- but I cannot approve it. Segher
On 19/06/17 14:46, Richard Earnshaw (lists) wrote: > Many parallel set insns are of the form of a single set that also sets > the condition code flags. In this case the cost of such an insn is > normally the cost of the part that doesn't set the flags, since updating > the condition flags is simply a side effect. > > At present all such insns are treated as having unknown cost (ie 0) and > combine assumes that such insns are infinitely more expensive than any > other insn sequence with a non-zero cost. > > This patch addresses this problem by allowing insn_rtx_cost to ignore > the condition setting part of a PARALLEL iff there is exactly one > comparison set and one non-comparison set. If the only set operation is > a comparison we still use that as the basis of the insn cost. > > * rtlanal.c (insn_rtx_cost): If a parallel contains exactly one > comparison set and one other set, use the cost of the > non-comparison set. > > Bootstrapped on aarch64-none-linuxgnu > > OK? > Ping? R. > R. > > > insn-costs.patch > > > diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c > index d9f57c3..5cae793 100644 > --- a/gcc/rtlanal.c > +++ b/gcc/rtlanal.c > @@ -5260,23 +5260,41 @@ insn_rtx_cost (rtx pat, bool speed) > int i, cost; > rtx set; > > - /* Extract the single set rtx from the instruction pattern. > - We can't use single_set since we only have the pattern. */ > + /* Extract the single set rtx from the instruction pattern. We > + can't use single_set since we only have the pattern. We also > + consider PARALLELs of a normal set and and a single comparison. > + In that case we use the cost of the non-comparison SET operation, > + which is most-likely to be the real cost of this operation. */ > if (GET_CODE (pat) == SET) > set = pat; > else if (GET_CODE (pat) == PARALLEL) > { > set = NULL_RTX; > + rtx comparison = NULL_RTX; > + > for (i = 0; i < XVECLEN (pat, 0); i++) > { > rtx x = XVECEXP (pat, 0, i); > if (GET_CODE (x) == SET) > { > - if (set) > - return 0; > - set = x; > + if (GET_CODE (SET_SRC (x)) == COMPARE) > + { > + if (comparison) > + return 0; > + comparison = x; > + } > + else > + { > + if (set) > + return 0; > + set = x; > + } > } > } > + > + if (!set && comparison) > + set = comparison; > + > if (!set) > return 0; > } >
On 06/30/2017 03:03 AM, Richard Earnshaw (lists) wrote: > On 19/06/17 14:46, Richard Earnshaw (lists) wrote: >> Many parallel set insns are of the form of a single set that also sets >> the condition code flags. In this case the cost of such an insn is >> normally the cost of the part that doesn't set the flags, since updating >> the condition flags is simply a side effect. >> >> At present all such insns are treated as having unknown cost (ie 0) and >> combine assumes that such insns are infinitely more expensive than any >> other insn sequence with a non-zero cost. >> >> This patch addresses this problem by allowing insn_rtx_cost to ignore >> the condition setting part of a PARALLEL iff there is exactly one >> comparison set and one non-comparison set. If the only set operation is >> a comparison we still use that as the basis of the insn cost. >> >> * rtlanal.c (insn_rtx_cost): If a parallel contains exactly one >> comparison set and one other set, use the cost of the >> non-comparison set. >> >> Bootstrapped on aarch64-none-linuxgnu >> >> OK? >> > > Ping? Needs a ChangeLog, with that, OK. Ideally we'd have something which verifies we're getting the cost right, but that's probably nontrivial to do in a generic manner. jeff
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c index d9f57c3..5cae793 100644 --- a/gcc/rtlanal.c +++ b/gcc/rtlanal.c @@ -5260,23 +5260,41 @@ insn_rtx_cost (rtx pat, bool speed) int i, cost; rtx set; - /* Extract the single set rtx from the instruction pattern. - We can't use single_set since we only have the pattern. */ + /* Extract the single set rtx from the instruction pattern. We + can't use single_set since we only have the pattern. We also + consider PARALLELs of a normal set and and a single comparison. + In that case we use the cost of the non-comparison SET operation, + which is most-likely to be the real cost of this operation. */ if (GET_CODE (pat) == SET) set = pat; else if (GET_CODE (pat) == PARALLEL) { set = NULL_RTX; + rtx comparison = NULL_RTX; + for (i = 0; i < XVECLEN (pat, 0); i++) { rtx x = XVECEXP (pat, 0, i); if (GET_CODE (x) == SET) { - if (set) - return 0; - set = x; + if (GET_CODE (SET_SRC (x)) == COMPARE) + { + if (comparison) + return 0; + comparison = x; + } + else + { + if (set) + return 0; + set = x; + } } } + + if (!set && comparison) + set = comparison; + if (!set) return 0; }