Message ID | 5620F47B.9010107@arm.com
---|---
State | Accepted
Commit | abc5231831d8356f563e89ab3f2e93bd98eaac57
On 16 October 2015 at 13:58, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> Hi all,
>
> We already support load/store-pair operations on the D-registers when they
> contain an FP value, but the peepholes/sched-fusion machinery that
> do all the hard work currently ignore 64-bit vector modes.
>
> This patch adds support for fusing loads/stores of 64-bit vector operands
> into ldp and stp instructions.
> I've seen this trigger a few times in SPEC2006. Not too many times, but the
> times it did trigger the code seemed objectively better
> i.e. long sequences of ldr and str instructions essentially halved in size.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
> * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):

We have several different flavours of fusion in the backend, this one
is specifically load/stores, perhaps making that clear in the name of
this predicate will avoid confusion further down the line?

> New function.
> (fusion_load_store): Use it.
> * config/aarch64/aarch64-ldpstp.md: Add new peephole2s for
> ldp and stp in VD modes.
> * config/aarch64/aarch64-simd.md (load_pair<mode>, VD): New pattern.
> (store_pair<mode>, VD): Likewise.
>
> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
> * gcc.target/aarch64/stp_vec_64_1.c: New test.
> * gcc.target/aarch64/ldp_vec_64_1.c: New test.

Otherwise OK /Marcus
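To make the quoted description concrete, here is a minimal sketch of the kind of sequence the fusion targets: two 64-bit vector accesses at consecutive offsets from the same base. It is illustrative only -- the function name is invented and the before/after assembly in the comment is approximate, not taken from the patch or from SPEC2006.

typedef int int32x2_t __attribute__ ((__vector_size__ (8)));

void
copy_pair (int32x2_t *dst, int32x2_t *src)
{
  /* Two adjacent 64-bit vector loads and stores off the same base.
     Roughly, before the patch:        and with the patch:
         ldr  d0, [x1]                     ldp  d0, d1, [x1]
         ldr  d1, [x1, 8]                  stp  d0, d1, [x0]
         str  d0, [x0]
         str  d1, [x0, 8]  */
  dst[0] = src[0];
  dst[1] = src[1];
}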
Hi Marcus,

On 20/10/15 17:05, Marcus Shawcroft wrote:
> On 16 October 2015 at 13:58, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>> Hi all,
>>
>> We already support load/store-pair operations on the D-registers when they
>> contain an FP value, but the peepholes/sched-fusion machinery that
>> do all the hard work currently ignore 64-bit vector modes.
>>
>> This patch adds support for fusing loads/stores of 64-bit vector operands
>> into ldp and stp instructions.
>> I've seen this trigger a few times in SPEC2006. Not too many times, but the
>> times it did trigger the code seemed objectively better
>> i.e. long sequences of ldr and str instructions essentially halved in size.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Ok for trunk?
>>
>> Thanks,
>> Kyrill
>>
>> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>> * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
> We have several different flavours of fusion in the backend, this one
> is specifically load/stores, perhaps making that clear in the name of
> this predicate will avoid confusion further down the line?

Thanks for the review,

This particular type of fusion is called sched_fusion in various
places in the compiler and its implementation in aarch64 is only for
load/store merging (indeed, the only usage of sched_fusion currently
is to merge loads/stores in arm and aarch64).

So, I think that sched_fusion in the name already conveys the information
that it's the ldp/stp one rather than macro fusion. In fact, there is a
macro fusion of ADRP and an LDR instruction,
so having sched_fusion in the name is actually a better differentiator than
mentioning loads/stores as both types of fusion deal with loads in some way.

Is it ok to keep the name as is?

Thanks,
Kyrill

>
>> New function.
>> (fusion_load_store): Use it.
>> * config/aarch64/aarch64-ldpstp.md: Add new peephole2s for
>> ldp and stp in VD modes.
>> * config/aarch64/aarch64-simd.md (load_pair<mode>, VD): New pattern.
>> (store_pair<mode>, VD): Likewise.
>>
>> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>> * gcc.target/aarch64/stp_vec_64_1.c: New test.
>> * gcc.target/aarch64/ldp_vec_64_1.c: New test.
> Otherwise OK /Marcus
>
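To illustrate the distinction being drawn here, a rough sketch follows; the instruction sequences are illustrative only, not taken from the thread or from actual compiler output.

/* Macro fusion: the scheduler keeps certain instruction pairs adjacent so
   the core can fuse them at issue time; both instructions remain in the
   output, e.g.
       adrp  x0, sym
       ldr   x1, [x0, #:lo12:sym]

   sched_fusion: the scheduler moves mergeable loads/stores next to each
   other so that the ldp/stp peephole2 patterns can rewrite the pair as a
   single instruction, e.g.
       ldr   d0, [x1]
       ldr   d1, [x1, 8]      becomes      ldp  d0, d1, [x1]  */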
On 20 October 2015 at 17:26, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> Hi Marcus,
>
> On 20/10/15 17:05, Marcus Shawcroft wrote:
>>
>> On 16 October 2015 at 13:58, Kyrill Tkachov <kyrylo.tkachov@arm.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> We already support load/store-pair operations on the D-registers when
>>> they
>>> contain an FP value, but the peepholes/sched-fusion machinery that
>>> do all the hard work currently ignore 64-bit vector modes.
>>>
>>> This patch adds support for fusing loads/stores of 64-bit vector operands
>>> into ldp and stp instructions.
>>> I've seen this trigger a few times in SPEC2006. Not too many times, but
>>> the
>>> times it did trigger the code seemed objectively better
>>> i.e. long sequences of ldr and str instructions essentially halved in
>>> size.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2015-10-16  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>> * config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
>>
>> We have several different flavours of fusion in the backend, this one
>> is specifically load/stores, perhaps making that clear in the name of
>> this predicate will avoid confusion further down the line?
>
> Thanks for the review,
>
> This particular type of fusion is called sched_fusion in various
> places in the compiler and its implementation in aarch64 is only for
> load/store merging (indeed, the only usage of sched_fusion currently
> is to merge loads/stores in arm and aarch64).
>
> So, I think that sched_fusion in the name already conveys the information
> that it's the ldp/stp one rather than macro fusion. In fact, there is a
> macro fusion of ADRP and an LDR instruction,
> so having sched_fusion in the name is actually a better differentiator than
> mentioning loads/stores as both types of fusion deal with loads in some way.
>
> Is it ok to keep the name as is?

Thanks for the justification, patch is OK to commit /Marcus
commit b5f4a5b87a7315fb8a4d88da3e4c4afc52d16052
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Oct 6 12:08:24 2015 +0100

    [AArch64] Add support for 64-bit vector-mode ldp/stp

diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 8d6d882..458829c 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -98,6 +98,47 @@ (define_peephole2
     }
 })
 
+(define_peephole2
+  [(set (match_operand:VD 0 "register_operand" "")
+        (match_operand:VD 1 "aarch64_mem_pair_operand" ""))
+   (set (match_operand:VD 2 "register_operand" "")
+        (match_operand:VD 3 "memory_operand" ""))]
+  "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+              (set (match_dup 2) (match_dup 3))])]
+{
+  rtx base, offset_1, offset_2;
+
+  extract_base_offset_in_addr (operands[1], &base, &offset_1);
+  extract_base_offset_in_addr (operands[3], &base, &offset_2);
+  if (INTVAL (offset_1) > INTVAL (offset_2))
+    {
+      std::swap (operands[0], operands[2]);
+      std::swap (operands[1], operands[3]);
+    }
+})
+
+(define_peephole2
+  [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "")
+        (match_operand:VD 1 "register_operand" ""))
+   (set (match_operand:VD 2 "memory_operand" "")
+        (match_operand:VD 3 "register_operand" ""))]
+  "TARGET_SIMD && aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+              (set (match_dup 2) (match_dup 3))])]
+{
+  rtx base, offset_1, offset_2;
+
+  extract_base_offset_in_addr (operands[0], &base, &offset_1);
+  extract_base_offset_in_addr (operands[2], &base, &offset_2);
+  if (INTVAL (offset_1) > INTVAL (offset_2))
+    {
+      std::swap (operands[0], operands[2]);
+      std::swap (operands[1], operands[3]);
+    }
+})
+
+
 ;; Handle sign/zero extended consecutive load/store.
 
 (define_peephole2
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 6a2ab61..bf051c3 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -153,6 +153,34 @@ (define_insn "*aarch64_simd_mov<mode>"
   (set_attr "length" "4,4,4,8,8,8,4")]
 )
 
+(define_insn "load_pair<mode>"
+  [(set (match_operand:VD 0 "register_operand" "=w")
+        (match_operand:VD 1 "aarch64_mem_pair_operand" "Ump"))
+   (set (match_operand:VD 2 "register_operand" "=w")
+        (match_operand:VD 3 "memory_operand" "m"))]
+  "TARGET_SIMD
+   && rtx_equal_p (XEXP (operands[3], 0),
+                   plus_constant (Pmode,
+                                  XEXP (operands[1], 0),
+                                  GET_MODE_SIZE (<MODE>mode)))"
+  "ldp\\t%d0, %d2, %1"
+  [(set_attr "type" "neon_ldp")]
+)
+
+(define_insn "store_pair<mode>"
+  [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "=Ump")
+        (match_operand:VD 1 "register_operand" "w"))
+   (set (match_operand:VD 2 "memory_operand" "=m")
+        (match_operand:VD 3 "register_operand" "w"))]
+  "TARGET_SIMD
+   && rtx_equal_p (XEXP (operands[2], 0),
+                   plus_constant (Pmode,
+                                  XEXP (operands[0], 0),
+                                  GET_MODE_SIZE (<MODE>mode)))"
+  "stp\\t%d1, %d3, %0"
+  [(set_attr "type" "neon_stp")]
+)
+
 (define_split
   [(set (match_operand:VQ 0 "register_operand" "")
         (match_operand:VQ 1 "register_operand" ""))]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index d7d05b8..7682417 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3491,6 +3491,18 @@ offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
           && offset % GET_MODE_SIZE (mode) == 0);
 }
 
+/* Return true if MODE is one of the modes for which we
+   support LDP/STP operations.  */
+
+static bool
+aarch64_mode_valid_for_sched_fusion_p (machine_mode mode)
+{
+  return mode == SImode || mode == DImode
+         || mode == SFmode || mode == DFmode
+         || (aarch64_vector_mode_supported_p (mode)
+             && GET_MODE_SIZE (mode) == 8);
+}
+
 /* Return true if X is a valid address for machine mode MODE.  If it is,
    fill in INFO appropriately.  STRICT_P is true if REG_OK_STRICT is in
    effect.  OUTER_CODE is PARALLEL for a load/store pair.  */
@@ -12863,8 +12875,9 @@ fusion_load_store (rtx_insn *insn, rtx *base, rtx *offset)
   src = SET_SRC (x);
   dest = SET_DEST (x);
 
-  if (GET_MODE (dest) != SImode && GET_MODE (dest) != DImode
-      && GET_MODE (dest) != SFmode && GET_MODE (dest) != DFmode)
+  machine_mode dest_mode = GET_MODE (dest);
+
+  if (!aarch64_mode_valid_for_sched_fusion_p (dest_mode))
     return SCHED_FUSION_NONE;
 
   if (GET_CODE (src) == SIGN_EXTEND)
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_vec_64_1.c b/gcc/testsuite/gcc.target/aarch64/ldp_vec_64_1.c
new file mode 100644
index 0000000..62213f3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_vec_64_1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast" } */
+
+typedef int int32x2_t __attribute__ ((__vector_size__ ((8))));
+
+void
+foo (int32x2_t *foo, int32x2_t *bar)
+{
+  int i = 0;
+  int32x2_t val = { 3, 2 };
+
+  for (i = 0; i < 1024; i+=2)
+    foo[i] = bar[i] + bar[i + 1];
+}
+
+/* { dg-final { scan-assembler "ldp\td\[0-9\]+, d\[0-9\]" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_64_1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_64_1.c
new file mode 100644
index 0000000..11e757a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_64_1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast" } */
+
+
+typedef int int32x2_t __attribute__ ((__vector_size__ ((8))));
+
+void
+bar (int32x2_t *foo)
+{
+  int i = 0;
+  int32x2_t val = { 3, 2 };
+
+  for (i = 0; i < 256; i+=2)
+    {
+      foo[i] = val;
+      foo[i+1] = val;
+    }
+}
+
+/* { dg-final { scan-assembler "stp\td\[0-9\]+, d\[0-9\]" } } */
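A closing note on the new tests: the dg-final directives only scan the generated assembly, so each test passes if the -Ofast output for its loop contains a paired D-register access. The register numbers and addressing below are made up for illustration, not actual compiler output.

/* Output matching stp_vec_64_1.c's scan-assembler pattern would contain
   something like:
       stp  d0, d1, [x0]
   and ldp_vec_64_1.c's pattern something like:
       ldp  d0, d1, [x1]  */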