[v2,00/10] tcg vector improvements

Message ID	20190104223116.14037-1-richard.henderson@linaro.org
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 209.51.188.17 as permitted sender) client-ip=209.51.188.17; From: Richard Henderson <richard.henderson@linaro.org> To: qemu-devel@nongnu.org Date: Sat, 5 Jan 2019 08:31:06 +1000 Message-Id: <20190104223116.14037-1-richard.henderson@linaro.org> Subject: [Qemu-devel] [PATCH v2 00/10] tcg vector improvements Precedence: list Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patch=linaro.org@nongnu.org>
Series	tcg vector improvements \| expand [v2,00/10] tcg vector improvements [v2,01/10] tcg: Add logical simplifications during gvec expand [v2,02/10] tcg: Add gvec expanders for nand, nor, eqv [v2,03/10] tcg: Add write_aofs to GVecGen4 [v2,04/10] tcg: Add opcodes for vector saturated arithmetic [v2,05/10] tcg: Add opcodes for vector minmax arithmetic [v2,06/10] tcg/i386: Split subroutines out of tcg_expand_vec_op [v2,07/10] tcg/i386: Implement vector saturating arithmetic [v2,08/10] tcg/i386: Implement vector minmax arithmetic [v2,09/10] tcg/aarch64: Implement vector saturating arithmetic [v2,10/10] tcg/aarch64: Implement vector minmax arithmetic

Richard Henderson Jan. 4, 2019, 10:31 p.m. UTC

I've split this out from the target/ppc patch set in which
it was developed.


r~


Richard Henderson (10):
  tcg: Add logical simplifications during gvec expand
  tcg: Add gvec expanders for nand, nor, eqv
  tcg: Add write_aofs to GVecGen4
  tcg: Add opcodes for vector saturated arithmetic
  tcg: Add opcodes for vector minmax arithmetic
  tcg/i386: Split subroutines out of tcg_expand_vec_op
  tcg/i386: Implement vector saturating arithmetic
  tcg/i386: Implement vector minmax arithmetic
  tcg/aarch64: Implement vector saturating arithmetic
  tcg/aarch64: Implement vector minmax arithmetic

 accel/tcg/tcg-runtime.h      |  23 ++
 tcg/aarch64/tcg-target.h     |   2 +
 tcg/i386/tcg-target.h        |   2 +
 tcg/tcg-op-gvec.h            |  18 ++
 tcg/tcg-op.h                 |  11 +
 tcg/tcg-opc.h                |   8 +
 tcg/tcg.h                    |   2 +
 accel/tcg/tcg-runtime-gvec.c | 257 ++++++++++++++++
 tcg/aarch64/tcg-target.inc.c |  48 +++
 tcg/i386/tcg-target.inc.c    | 580 +++++++++++++++++++++--------------
 tcg/tcg-op-gvec.c            | 305 ++++++++++++++++--
 tcg/tcg-op-vec.c             |  75 ++++-
 tcg/tcg.c                    |  10 +
 13 files changed, 1078 insertions(+), 263 deletions(-)

-- 
2.17.2

Mark Cave-Ayland Jan. 7, 2019, 1:11 p.m. UTC | #1

On 04/01/2019 22:31, Richard Henderson wrote:

> I've split this out from the target/ppc patch set in which

> it was developed.

> 

> 

> r~

> 

> 

> Richard Henderson (10):

>   tcg: Add logical simplifications during gvec expand

>   tcg: Add gvec expanders for nand, nor, eqv

>   tcg: Add write_aofs to GVecGen4

>   tcg: Add opcodes for vector saturated arithmetic

>   tcg: Add opcodes for vector minmax arithmetic

>   tcg/i386: Split subroutines out of tcg_expand_vec_op

>   tcg/i386: Implement vector saturating arithmetic

>   tcg/i386: Implement vector minmax arithmetic

>   tcg/aarch64: Implement vector saturating arithmetic

>   tcg/aarch64: Implement vector minmax arithmetic

> 

>  accel/tcg/tcg-runtime.h      |  23 ++

>  tcg/aarch64/tcg-target.h     |   2 +

>  tcg/i386/tcg-target.h        |   2 +

>  tcg/tcg-op-gvec.h            |  18 ++

>  tcg/tcg-op.h                 |  11 +

>  tcg/tcg-opc.h                |   8 +

>  tcg/tcg.h                    |   2 +

>  accel/tcg/tcg-runtime-gvec.c | 257 ++++++++++++++++

>  tcg/aarch64/tcg-target.inc.c |  48 +++

>  tcg/i386/tcg-target.inc.c    | 580 +++++++++++++++++++++--------------

>  tcg/tcg-op-gvec.c            | 305 ++++++++++++++++--

>  tcg/tcg-op-vec.c             |  75 ++++-

>  tcg/tcg.c                    |  10 +

>  13 files changed, 1078 insertions(+), 263 deletions(-)


Not sure that I'm particularly qualified to give this an R-B, however should there be
a corresponding update to tcg/README for the new instructions?

One other thing is that Howard sent me off-list a backtrace from trying my combined
branch at https://github.com/mcayland/qemu/tree/ppc-altivec-v5.5-rth booting MacOS
10.5 in the guest and ended up hitting an assert in an --enable-debug build:

Thread 5 (Thread 0x7fffe3fff700 (LWP 10627)):
#0  0x00007ffff698d53f in raise () at /lib64/libc.so.6
#1  0x00007ffff6977895 in abort () at /lib64/libc.so.6
#2  0x00007ffff6977769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007ffff69859f6 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x000055555584bb67 in do_op3 (vece=2, r=0x1848, a=0x18b8,
b=0x18f0, opc=INDEX_op_ssadd_vec) at
/home/hsp/src/qemu-altivec-55/tcg/tcg-op-vec.c:407
        rt = 0x7fffdc002368
        at = 0x7fffdc0023d8
        bt = 0x7fffdc002410
        ri = 140736884384616
        ai = 140736884384728
        bi = 140736884384784
        type = TCG_TYPE_V128
        can = 0
        __PRETTY_FUNCTION__ = "do_op3"
#5  0x000055555584bbfa in tcg_gen_ssadd_vec (vece=2, r=0x1848,
a=0x18b8, b=0x18f0) at
/home/hsp/src/qemu-altivec-55/tcg/tcg-op-vec.c:419
#6  0x0000555555991887 in gen_vaddsws_vec (vece=2, t=0x1848,
sat=0x1880, a=0x18b8, b=0x18f0) at
/home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597
        x = 0x1928
#7  0x0000555555852e53 in expand_4_vec (vece=2, dofs=197872,
aofs=198288, bofs=197776, cofs=197792, oprsz=16, tysz=16,
type=TCG_TYPE_V128, write_aofs=true, fni=0x55555599182a
<gen_vaddsws_vec>) at
/home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:903
        t0 = 0x1848
        t1 = 0x1880
        t2 = 0x18b8
        t3 = 0x18f0
        i = 0
#8  0x0000555555853cc4 in tcg_gen_gvec_4 (dofs=197872, aofs=198288,
bofs=197776, cofs=197792, oprsz=16, maxsz=16, g=0x5555562d33c0 <g>) at
/home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:1211
        type = TCG_TYPE_V128
        some = 21845
        __PRETTY_FUNCTION__ = "tcg_gen_gvec_4"
        __func__ = "tcg_gen_gvec_4"
#9  0x0000555555991987 in gen_vaddsws (ctx=0x7fffe3ffe5f0) at
/home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597
        g = {fni8 = 0x0, fni4 = 0x0, fniv = 0x55555599182a
<gen_vaddsws_vec>, fno = 0x5555559601a1 <gen_helper_vaddsws>, opc =
INDEX_op_add_vec, data = 0, vece = 2 '\002', prefer_i64 = false,
write_aofs = true}


Certainly according to patch 7 of the series only 8-bit and 16-bit accesses are
supported on i386 hosts, but shouldn't we be falling back to the previous
implementations rather than hitting an assert()?


ATB,

Mark.

Richard Henderson Jan. 23, 2019, 5:09 a.m. UTC | #2

On 1/7/19 5:11 AM, Mark Cave-Ayland wrote:
> #7  0x0000555555852e53 in expand_4_vec (vece=2, dofs=197872,

> aofs=198288, bofs=197776, cofs=197792, oprsz=16, tysz=16,

> type=TCG_TYPE_V128, write_aofs=true, fni=0x55555599182a

> <gen_vaddsws_vec>) at

> /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:903

>         t0 = 0x1848

>         t1 = 0x1880

>         t2 = 0x18b8

>         t3 = 0x18f0

>         i = 0

> #8  0x0000555555853cc4 in tcg_gen_gvec_4 (dofs=197872, aofs=198288,

> bofs=197776, cofs=197792, oprsz=16, maxsz=16, g=0x5555562d33c0 <g>) at

> /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:1211

>         type = TCG_TYPE_V128

>         some = 21845

>         __PRETTY_FUNCTION__ = "tcg_gen_gvec_4"

>         __func__ = "tcg_gen_gvec_4"

> #9  0x0000555555991987 in gen_vaddsws (ctx=0x7fffe3ffe5f0) at

> /home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597

>         g = {fni8 = 0x0, fni4 = 0x0, fniv = 0x55555599182a

> <gen_vaddsws_vec>, fno = 0x5555559601a1 <gen_helper_vaddsws>, opc =

> INDEX_op_add_vec, data = 0, vece = 2 '\002', prefer_i64 = false,

> write_aofs = true}

> 

> 

> Certainly according to patch 7 of the series only 8-bit and 16-bit accesses are

> supported on i386 hosts, but shouldn't we be falling back to the previous

> implementations rather than hitting an assert()?


In here:

#define GEN_VXFORM_SAT(NAME, VECE, NORM, SAT, OPC2, OPC3)               \
static void glue(glue(gen_, NAME), _vec)(unsigned vece, TCGv_vec t,     \
                                         TCGv_vec sat, TCGv_vec a,      \
                                         TCGv_vec b)                    \
{                                                                       \
    TCGv_vec x = tcg_temp_new_vec_matching(t);                          \
    glue(glue(tcg_gen_, NORM), _vec)(VECE, x, a, b);                    \
    glue(glue(tcg_gen_, SAT), _vec)(VECE, t, a, b);                     \
    tcg_gen_cmp_vec(TCG_COND_NE, VECE, x, x, t);                        \
    tcg_gen_or_vec(VECE, sat, sat, x);                                  \
    tcg_temp_free_vec(x);                                               \
}                                                                       \
static void glue(gen_, NAME)(DisasContext *ctx)                         \
{                                                                       \
    static const GVecGen4 g = {                                         \
        .fniv = glue(glue(gen_, NAME), _vec),                           \
        .fno = glue(gen_helper_, NAME),                                 \
        .opc = glue(glue(INDEX_op_, NORM), _vec),                       \

s/NORM/SAT/, so that we query whether the saturated opcode is supported.  The
normal arithmetic, cmp, and or opcodes are mandatory; we don't need to do
anything with those.


r~

Mark Cave-Ayland Feb. 5, 2019, 9:29 p.m. UTC | #3

On 23/01/2019 05:09, Richard Henderson wrote:

> On 1/7/19 5:11 AM, Mark Cave-Ayland wrote:

>> #7  0x0000555555852e53 in expand_4_vec (vece=2, dofs=197872,

>> aofs=198288, bofs=197776, cofs=197792, oprsz=16, tysz=16,

>> type=TCG_TYPE_V128, write_aofs=true, fni=0x55555599182a

>> <gen_vaddsws_vec>) at

>> /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:903

>>         t0 = 0x1848

>>         t1 = 0x1880

>>         t2 = 0x18b8

>>         t3 = 0x18f0

>>         i = 0

>> #8  0x0000555555853cc4 in tcg_gen_gvec_4 (dofs=197872, aofs=198288,

>> bofs=197776, cofs=197792, oprsz=16, maxsz=16, g=0x5555562d33c0 <g>) at

>> /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:1211

>>         type = TCG_TYPE_V128

>>         some = 21845

>>         __PRETTY_FUNCTION__ = "tcg_gen_gvec_4"

>>         __func__ = "tcg_gen_gvec_4"

>> #9  0x0000555555991987 in gen_vaddsws (ctx=0x7fffe3ffe5f0) at

>> /home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597

>>         g = {fni8 = 0x0, fni4 = 0x0, fniv = 0x55555599182a

>> <gen_vaddsws_vec>, fno = 0x5555559601a1 <gen_helper_vaddsws>, opc =

>> INDEX_op_add_vec, data = 0, vece = 2 '\002', prefer_i64 = false,

>> write_aofs = true}

>>

>>

>> Certainly according to patch 7 of the series only 8-bit and 16-bit accesses are

>> supported on i386 hosts, but shouldn't we be falling back to the previous

>> implementations rather than hitting an assert()?

> 

> In here:

> 

> #define GEN_VXFORM_SAT(NAME, VECE, NORM, SAT, OPC2, OPC3)               \

> static void glue(glue(gen_, NAME), _vec)(unsigned vece, TCGv_vec t,     \

>                                          TCGv_vec sat, TCGv_vec a,      \

>                                          TCGv_vec b)                    \

> {                                                                       \

>     TCGv_vec x = tcg_temp_new_vec_matching(t);                          \

>     glue(glue(tcg_gen_, NORM), _vec)(VECE, x, a, b);                    \

>     glue(glue(tcg_gen_, SAT), _vec)(VECE, t, a, b);                     \

>     tcg_gen_cmp_vec(TCG_COND_NE, VECE, x, x, t);                        \

>     tcg_gen_or_vec(VECE, sat, sat, x);                                  \

>     tcg_temp_free_vec(x);                                               \

> }                                                                       \

> static void glue(gen_, NAME)(DisasContext *ctx)                         \

> {                                                                       \

>     static const GVecGen4 g = {                                         \

>         .fniv = glue(glue(gen_, NAME), _vec),                           \

>         .fno = glue(gen_helper_, NAME),                                 \

>         .opc = glue(glue(INDEX_op_, NORM), _vec),                       \

> 

> s/NORM/SAT/, so that we query whether the saturated opcode is supported.  The

> normal arithmetic, cmp, and or opcodes are mandatory; we don't need to do

> anything with those.


Now that this and the other pre-requisite patches have been merged into master, I've
rebased the outstanding PPC parts of your "tcg, target/ppc vector improvements" on
master including the above fix and pushed the result to
https://github.com/mcayland/qemu/commits/ppc-altivec-v6.

The good news is that the graphics corruption I originally noticed caused by the
patch introducing the saturating add/sub vector ops has now gone, and with my
little-endian vsplt fix included then both OS X and MacOS 9 appear to run without any
obvious issues on an x86 host, and certainly feel smoother compared to before.

The only minor question I had with the patchset in its current form is whether to use
the new VsrD() macro for vscr_sat, or whether we don't really care enough?


ATB,

Mark.

Richard Henderson Feb. 6, 2019, 3:37 a.m. UTC | #4

On 2/5/19 9:29 PM, Mark Cave-Ayland wrote:
> The only minor question I had with the patchset in its current form is whether to use

> the new VsrD() macro for vscr_sat, or whether we don't really care enough?

Given the comment

  /* Which bit we set is completely arbitrary, but clear the rest.  */

I don't think VsrD is helpful.

In "target/ppc: Split out VSCR_SAT to a vector field":

      ppc_vsr_t vsr[64] QEMU_ALIGNED(16);
+     /* Non-zero if and only if VSCR_SAT should be set.  */
+     ppc_vsr_t vscr_sat;

Better to add the QEMU_ALIGNED(16) to vscr_sat as well.  Yes, it will already
happen to be aligned by placement, but this is also a bit documentation.

In "target/ppc: convert vadd*s and vsub*s to vector operations":

          if (sat) {                                                      \
-             set_vscr_sat(env);                                          \
+             vscr_sat->u32[0] = 1;                                       \
          }                                                               \

Changed in error?

r~

Mark Cave-Ayland Feb. 6, 2019, 7:15 a.m. UTC | #5

On 06/02/2019 03:37, Richard Henderson wrote:

> On 2/5/19 9:29 PM, Mark Cave-Ayland wrote:

>> The only minor question I had with the patchset in its current form is whether to use

>> the new VsrD() macro for vscr_sat, or whether we don't really care enough?

> 

> Given the comment

> 

>   /* Which bit we set is completely arbitrary, but clear the rest.  */

> 

> I don't think VsrD is helpful.


Okay, I can leave that for now.

> In "target/ppc: Split out VSCR_SAT to a vector field":

> 

>       ppc_vsr_t vsr[64] QEMU_ALIGNED(16);

> +     /* Non-zero if and only if VSCR_SAT should be set.  */

> +     ppc_vsr_t vscr_sat;

> 

> Better to add the QEMU_ALIGNED(16) to vscr_sat as well.  Yes, it will already

> happen to be aligned by placement, but this is also a bit documentation.


I've added this to my latest branch.

> In "target/ppc: convert vadd*s and vsub*s to vector operations":

> 

>           if (sat) {                                                      \

> -             set_vscr_sat(env);                                          \

> +             vscr_sat->u32[0] = 1;                                       \

>           }                                                               \

> 

> Changed in error?


It looks like this was in the original patch, presumably because GEN_VXFORM_SAT
doesn't include the env parameter which was present in GEN_VXFORM_ENV. Should the env
parameter be added to GEN_VXFORM_SAT?

Howard has also pointed out that he's still spotted some corruption in his tests, so
I will do a bit more investigation and report back.


ATB,

Mark.

BALATON Zoltan Feb. 6, 2019, 3:44 p.m. UTC | #6

On Wed, 6 Feb 2019, Mark Cave-Ayland wrote:
> Howard has also pointed out that he's still spotted some corruption in his tests, so

> I will do a bit more investigation and report back.

I wonder if instead of "visually testing" shouldn't it be verified by some 
more exact tests? Could any of these be useful:

https://gcc.gnu.org/ml/gcc-patches/2002-11/msg01391.html
https://gcc.gnu.org/git/?p=gcc.git;a=tree;f=gcc/testsuite/gcc.dg/vmx;h=4765c35dabb4e3891c98084af077c2549fb45dfc;hb=HEAD
https://github.com/llvm-mirror/test-suite/tree/master/SingleSource/UnitTests/Vector/Altivec

or there could be other similar ones around that already have test cases 
for these instructions that could provide some more confidence in the 
correctness of the implementation.

Regards,
BALATON Zoltan

Mark Cave-Ayland Feb. 7, 2019, 10:31 p.m. UTC | #7

On 05/02/2019 21:29, Mark Cave-Ayland wrote:

> Now that this and the other pre-requisite patches have been merged into master, I've

> rebased the outstanding PPC parts of your "tcg, target/ppc vector improvements" on

> master including the above fix and pushed the result to

> https://github.com/mcayland/qemu/commits/ppc-altivec-v6.

> 

> The good news is that the graphics corruption I originally noticed caused by the

> patch introducing the saturating add/sub vector ops has now gone, and with my

> little-endian vsplt fix included then both OS X and MacOS 9 appear to run without any

> obvious issues on an x86 host, and certainly feel smoother compared to before.


I started to follow up Howard's report that he could still see graphical corruption
with these patches, and I was surprised to notice that it had reappeared locally once
again :/

After working out that this was because the relevant instructions weren't being
generated for all Mac machines, I was able to figure it out after a few hours of
digging. Patch to follow shortly.


ATB,

Mark.

[v2,00/10] tcg vector improvements

Message

Comments