[RFC,0/9] TCG Vector types and example conversion

Message ID 20170817180404.29334-1-alex.bennee@linaro.org

Message

Alex Bennée Aug. 17, 2017, 6:03 p.m. UTC
Hi,

With upcoming work on SVE I've been looking at the way we implement
vector registers in QEMU's TCG. The current orthodoxy is to decompose
the vector into a series of TCG registers, often calling a helper
function for the calculation of each element. The result of each
helper call is then stored back into the vector representation
afterwards. There are occasional outliers like simd_tbl, which
accesses elements directly from a passed CPUFooState env pointer, but
these are rare. A hedged sketch of that per-element pattern follows.
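For concreteness, this is roughly what the per-element orthodoxy looks
like in a translator (the names read_element_i32, write_element_i32 and
gen_helper_neon_op are hypothetical placeholders, not the actual QEMU
functions):

    /* Per-element orthodoxy, sketched: one helper call per element. */
    for (int pass = 0; pass < elements; pass++) {
        TCGv_i32 tmp = tcg_temp_new_i32();
        read_element_i32(tmp, rn, pass);    /* element -> TCG register */
        gen_helper_neon_op(tmp, tmp);       /* hypothetical per-element helper */
        write_element_i32(tmp, rd, pass);   /* result -> vector storage */
        tcg_temp_free_i32(tmp);
    }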

This series introduces the concept of a TCGv_vec type. This is a
pointer to the start of the in-memory representation of an arbitrarily
long vector register. It is passed to a helper function along with a
normal TCG register containing the actual vector length and any
additional information the helper needs to do the operation. The hope*
is that this saves on the churn of having the TCG do things element by
element and allows the compiler to use native vector operations to
streamline the helpers.
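To illustrate the shape of such a helper, here is a minimal sketch (the
names and the descriptor encoding are assumptions made up for the
example, not the actual code in this series):

    /* Hypothetical whole-vector helper: vd/vn point at the start of the
     * in-memory vector registers; desc packs the element count (and any
     * extra per-op information) into a normal 32-bit TCG argument. */
    void HELPER(vec_add_i32)(void *vd, void *vn, uint32_t desc)
    {
        uint32_t *d = vd, *n = vn;
        int elements = desc & 0xff;   /* low bits: number of elements */

        for (int i = 0; i < elements; i++) {
            d[i] += n[i];             /* a loop the compiler can vectorise */
        }
    }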

There are some downsides to this approach. The first is that you have
to be careful about register aliasing: if you are doing a same-reg to
same-reg operation you need to make a copy of the vector so you don't
trample your input data as you go. The second is that it involves
changing some of the assumptions TCG makes about helper arguments.
I've managed to keep all the changes within the core TCG code for now,
but so far it has only been tested for the tcg_call path, which is the
only place TCGv_vecs should turn up. It is possible to do the same
thing without touching the TCG code generation by using TCGv_ptrs and
manually emitting tcg_addi ops to pass the correct address. Richard
has been exploring this approach with his series. The downside of that
is you lose the ability to have named global vector registers, which
make reading the TCG dumps a little easier.
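As a sketch of the aliasing point above (again with hypothetical names,
assuming the same descriptor convention as the earlier example), a
widening helper whose outputs are wider than its inputs must snapshot
the source before writing:

    /* Widening helper where vd may alias vn: copy the input first so
     * the wider results don't trample elements we have yet to read. */
    void HELPER(vec_widen_s16)(void *vd, void *vn, uint32_t desc)
    {
        int elements = desc & 0xff;
        int16_t tmp[64];              /* big enough for this sketch */
        int32_t *d = vd;

        memcpy(tmp, vn, elements * sizeof(int16_t));
        for (int i = 0; i < elements; i++) {
            d[i] = tmp[i];            /* sign-extends s16 -> s32 */
        }
    }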

I've only patched one helper in this series which implements the
indexed smull. This is because it appears in the profiles for my test
case which was using an arm64 ffmpeg to transcode:

  ./ffmpeg.arm64 -i big_buck_bunny_480p_surround-fix.avi \
    -threads 1 -qscale:v 3 -f null -

* hope. On an earlier revision (which included sqshrn conversions) I
  had measured a minor saving, but this disappeared once I measured
  the final code. However, the profile is fairly dominated by
  softfloat.

master:
     8.05%  qemu-aarch64  qemu-aarch64             [.] roundAndPackFloat32
     7.28%  qemu-aarch64  qemu-aarch64             [.] float32_mul
     6.56%  qemu-aarch64  qemu-aarch64             [.] helper_lookup_tb_ptr
     5.31%  qemu-aarch64  qemu-aarch64             [.] float32_muladd
     4.09%  qemu-aarch64  qemu-aarch64             [.] helper_neon_mull_s16
     4.00%  qemu-aarch64  qemu-aarch64             [.] addFloat32Sigs
     3.86%  qemu-aarch64  qemu-aarch64             [.] subFloat32Sigs
     2.26%  qemu-aarch64  qemu-aarch64             [.] helper_simd_tbl
     2.00%  qemu-aarch64  qemu-aarch64             [.] float32_add
     1.81%  qemu-aarch64  qemu-aarch64             [.] helper_neon_unarrow_sat8
     1.64%  qemu-aarch64  qemu-aarch64             [.] float32_sub
     1.43%  qemu-aarch64  qemu-aarch64             [.] helper_neon_subl_u32
     0.98%  qemu-aarch64  qemu-aarch64             [.] helper_neon_widen_u8

tcg-native-vectors-rfc:
     7.93%  qemu-aarch64  qemu-aarch64             [.] roundAndPackFloat32             
     7.54%  qemu-aarch64  qemu-aarch64             [.] float32_mul                     
     6.29%  qemu-aarch64  qemu-aarch64             [.] helper_lookup_tb_ptr
     5.39%  qemu-aarch64  qemu-aarch64             [.] float32_muladd
     3.92%  qemu-aarch64  qemu-aarch64             [.] addFloat32Sigs
     3.86%  qemu-aarch64  qemu-aarch64             [.] subFloat32Sigs
     3.62%  qemu-aarch64  qemu-aarch64             [.] helper_advsimd_smull_idx_s32
     2.19%  qemu-aarch64  qemu-aarch64             [.] helper_simd_tbl
     2.09%  qemu-aarch64  qemu-aarch64             [.] helper_neon_mull_s16
     1.99%  qemu-aarch64  qemu-aarch64             [.] float32_add
     1.79%  qemu-aarch64  qemu-aarch64             [.] helper_neon_unarrow_sat8
     1.62%  qemu-aarch64  qemu-aarch64             [.] float32_sub
     1.43%  qemu-aarch64  qemu-aarch64             [.] helper_neon_subl_u32
     1.00%  qemu-aarch64  qemu-aarch64             [.] helper_neon_widen_u8
     0.98%  qemu-aarch64  qemu-aarch64             [.] helper_neon_addl_u32

At the moment the default compiler settings don't actually vectorise
the helper. I could get it to vectorise once I added some alignment
guarantees, but the casting I did broke the instruction emulation, so
I haven't included that patch in this series.

Given the results, why continue investigating this? Well, for one
thing vector sizes are growing: SVE vectors are up to 2048 bits long.
Those longer vectors should offer more scope for the host compiler to
generate efficient code in the helper. Also, vector operations tend to
be quite complex, and handling them in C code instead of TCG ops may
be preferable from a code maintainability point of view. Finally, this
noddy little experiment has at least shown it doesn't worsen
performance. It would be nice if I could find a benchmark that made
heavy use of non-floating-point SIMD instructions to better measure
the effect of marshalling elements vs vectorised helpers. If anyone
has any suggestions I'm all ears ;-)

Anyway, questions, comments?

Alex Bennée (9):
  tcg/README: listify the TCG types.
  tcg: introduce the concepts of a TCGv_vec register type
  tcg: generate ptrs to vector registers
  helper-head: add support for vec type
  arm/cpu.h: align VFP registers
  target/arm/translate-a64: regnames -> x_regnames
  target/arm/translate-a64: register global vectors
  target/arm/helpers: introduce ADVSIMD flags
  target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]

 include/exec/helper-head.h        |  5 ++
 target/arm/advsimd_helper_flags.h | 50 ++++++++++++++++++++
 target/arm/cpu.h                  |  4 +-
 target/arm/helper-a64.c           | 18 ++++++++
 target/arm/helper-a64.h           |  2 +
 target/arm/translate-a64.c        | 97 +++++++++++++++++++++++++++++++++++++--
 tcg/README                        | 10 ++--
 tcg/tcg.c                         | 26 ++++++++++-
 tcg/tcg.h                         | 20 ++++++++
 9 files changed, 222 insertions(+), 10 deletions(-)
 create mode 100644 target/arm/advsimd_helper_flags.h

-- 
2.13.0

Comments

no-reply@patchew.org Aug. 17, 2017, 6:32 p.m. UTC | #1
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20170817180404.29334-1-alex.bennee@linaro.org
Subject: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
bab61192a4 target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]
c5c733a1f9 target/arm/helpers: introduce ADVSIMD flags
e27691f07a target/arm/translate-a64: register global vectors
6a55e60454 target/arm/translate-a64: regnames -> x_regnames
e7a0e2466b arm/cpu.h: align VFP registers
efa94c04ce helper-head: add support for vec type
d8c96ebdd2 tcg: generate ptrs to vector registers
26ec07c3d7 tcg: introduce the concepts of a TCGv_vec register type
9ec3b4754d tcg/README: listify the TCG types.

=== OUTPUT BEGIN ===
Checking PATCH 1/9: tcg/README: listify the TCG types....
Checking PATCH 2/9: tcg: introduce the concepts of a TCGv_vec register type...
Checking PATCH 3/9: tcg: generate ptrs to vector registers...
Checking PATCH 4/9: helper-head: add support for vec type...
Checking PATCH 5/9: arm/cpu.h: align VFP registers...
Checking PATCH 6/9: target/arm/translate-a64: regnames -> x_regnames...
Checking PATCH 7/9: target/arm/translate-a64: register global vectors...
Checking PATCH 8/9: target/arm/helpers: introduce ADVSIMD flags...
WARNING: line over 80 characters
#50: FILE: target/arm/advsimd_helper_flags.h:30:
+ *   ADVSIMD_ALL_ELT  - the total count of elements (e.g. clear all-opr elements)

total: 0 errors, 1 warnings, 65 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 9/9: target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]...
WARNING: line over 80 characters
#65: FILE: target/arm/translate-a64.c:10469:
+typedef void AdvSIMDGenTwoPlusOneVectorFn(TCGv_vec, TCGv_vec, TCGv_i32, TCGv_i32);

ERROR: line over 90 characters
#71: FILE: target/arm/translate-a64.c:10475:
+static bool handle_vec_simd_mul_addsub(DisasContext *s, uint32_t insn, int opcode, int size, bool is_q, bool u, int rn, int rm, int rd)

ERROR: that open brace { should be on the previous line
#86: FILE: target/arm/translate-a64.c:10490:
+            if (!u)
+            {

WARNING: line over 80 characters
#95: FILE: target/arm/translate-a64.c:10499:
+                                      ADVSIMD_OPR_ELT_SHIFT, ADVSIMD_OPR_ELT_BITS, 4);

ERROR: line over 90 characters
#99: FILE: target/arm/translate-a64.c:10503:
+                                          ADVSIMD_DOFF_ELT_SHIFT, ADVSIMD_DOFF_ELT_BITS, 4);

WARNING: line over 80 characters
#141: FILE: target/arm/translate-a64.c:10590:
+        if (handle_vec_simd_mul_addsub(s, insn, opcode, size, is_q, u, rn, rm, rd)) {

total: 3 errors, 3 warnings, 109 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
Kirill Batuzov Aug. 18, 2017, 11:33 a.m. UTC | #2
On Thu, 17 Aug 2017, Alex Bennée wrote:

> [...]
> It would be nice if I could find a benchmark that made
> heavy use of non-floating-point SIMD instructions to better measure
> the effect of marshalling elements vs vectorised helpers. If anyone
> has any suggestions I'm all ears ;-)

While doing my own vector register series I was using:

1. A handwritten example (it's for ARM32 NEON, not aarch64):

    .cpu cortex-a8
    .fpu neon
    .text
    .global test
test:
    vld1.32     d0, [r0]!       @ load first 128-bit operand into q0
    vld1.32     d1, [r0]
    vld1.32     d2, [r1]!       @ load second 128-bit operand into q1
    vld1.32     d3, [r1]
    mov         r0, #0xb0000000 @ iteration count
loop:
    vadd.i32    q0, q0, q1      @ four dependent vector adds per iteration
    vadd.i32    q0, q0, q1
    vadd.i32    q0, q0, q1
    vadd.i32    q0, q0, q1
    subs        r0, r0, #1
    bne         loop
    vpadd.i32   d0, d0, d1      @ pairwise-add down to a single lane
    vpadd.i32   d0, d0, d1
    vmov.i32    r0, d0[0]       @ return the scalar result
    bx          lr

It can be adapted for aarch64 without much trouble. This example shows
what potential speed-up you can expect, as it is nearly perfect for
the optimization in question.

2. The x264 video encoder. It has a lot of handwritten vector
assembler for different architectures, including aarch64. You can
probably access it as libx264 from within ffmpeg, if library support
was compiled in.

> 

> Anyway questions, comments?

> 


From my own experimentations some time ago:

(1) translating vector instructions to vector instructions in TCG is
faster than

(2) translating vector instructions to series of scalar instructions
in TCG, which is faster than*

(3) translating vector instructions to single helper calls, which is
faster than*

(4) translating vector instructions to helper calls for each vector
element.

(*) (2) and (3) may change their respective places in case of some
complicated instructions.

ARM (at least ARM32, I have not checked aarch64 in this regard) uses
the last, the slowest scheme. As far as I understand, you want to
change it to the third approach. This approach is used in SSE
emulation; maybe you can use a similar structure of helpers?

I still hope to finish my own series implementing the first approach.
I apologize for the long delay since the last update and hope to send
the next version some time next week. I do not think our series
contradict each other: you are trying to optimize the existing
general-purpose case while I'm trying to optimize the case where both
host and guest support vector instructions. Since I'm experimenting on
ARM32, we won't have many merge conflicts either.

-- 
Kirill
Richard Henderson Aug. 18, 2017, 1:44 p.m. UTC | #3
On 08/18/2017 04:33 AM, Kirill Batuzov wrote:
> From my own experimentations some time ago:
>
> (1) translating vector instructions to vector instructions in TCG is
> faster than
>
> (2) translating vector instructions to series of scalar instructions
> in TCG, which is faster than*
>
> (3) translating vector instructions to single helper calls, which is
> faster than*
>
> (4) translating vector instructions to helper calls for each vector
> element.
>
> (*) (2) and (3) may change their respective places in case of some
> complicated instructions.


This was my gut feeling as well.  With the caveat that for the ARM SVE case of
2048-bit registers we cannot afford to expand inline due to generated code size.

> ARM (at least ARM32, I have not checked aarch64 in this regard) uses
> the last, the slowest scheme. As far as I understand, you want to
> change it to the third approach. This approach is used in SSE
> emulation; maybe you can use a similar structure of helpers?
>
> I still hope to finish my own series implementing the first approach.
> I apologize for the long delay since the last update and hope to send
> the next version some time next week. I do not think our series
> contradict each other: you are trying to optimize the existing
> general-purpose case while I'm trying to optimize the case where both
> host and guest support vector instructions. Since I'm experimenting on
> ARM32, we won't have many merge conflicts either.

I posted my own, different, take on vectorization yesterday as well.

  http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html

The primary difference between my version and your version is that I do not
allow target/cpu/translate*.c to create vector types.  All of the host vector
expansion is done within tcg/*.c.

We also would like to settle on a common style for out-of-line helpers, if that
is possible.  One thing *not* to take from our current SSE emulation is that we
do not yet support AVX, AVX2, or AVX512 extensions.  So the current
construction of helpers within target/i386/ doesn't really take into account
all that should be required.

The thing that's common between AVX512 and SVE is that we have multiple vector
lengths, and that elements beyond the operation length are zeroed.  Both Alex
and I have packed operation length + full vector length into a descriptor given
to the helper.  (Alex allows for some other bits too; I'm not sure about that.)
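For illustration, a sketch of that kind of descriptor packing (the
field widths, units and function names are assumptions made up for the
example, not the layout either series actually uses):

    #include <stdint.h>

    /* Sizes are encoded in 8-byte units, so even a 2048-bit SVE
     * register (256 bytes = 32 units) fits in 8 bits, leaving the
     * upper bits of desc free for extra flags. */
    static inline uint32_t make_vec_desc(uint32_t oprsz, uint32_t maxsz)
    {
        return (oprsz / 8) | ((maxsz / 8) << 8);
    }

    static inline uint32_t vec_desc_oprsz(uint32_t desc)
    {
        return (desc & 0xff) * 8;         /* bytes the operation covers */
    }

    static inline uint32_t vec_desc_maxsz(uint32_t desc)
    {
        return ((desc >> 8) & 0xff) * 8;  /* zero from oprsz up to here */
    }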


r~
Kirill Batuzov Aug. 22, 2017, 9:04 a.m. UTC | #4
On Fri, 18 Aug 2017, Richard Henderson wrote:

> [...]
>
> I posted my own, different, take on vectorization yesterday as well.
>
>   http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html
>
> The primary difference between my version and your version is that I do not
> allow target/cpu/translate*.c to create vector types.  All of the host vector
> expansion is done within tcg/*.c.

I took a look at your approach. The only problem with it is that in
its current implementation it does not allow vector variables to be
kept in registers between consecutive guest instructions. But this can
be changed. To do it we need to make copy propagation work with memory
locations as well, and dead code elimination able to remove excess
stores to memory. While in the general case these can be troublesome,
if we limit the analysis to addresses of the form [env + Const] it
becomes relatively easy. I've done a similar thing in my series to
track interference between memory operations and vector global
variables. In the case of your series this affects only performance,
so it does not need to be in the initial series and can be added
later as a separate patch. I can take care of this once the initial
series is pulled to master.
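A minimal sketch of the interference check this describes (the type
and field names are made up for the example):

    #include <stdbool.h>
    #include <stdint.h>

    /* A memory access at a known constant offset from env. */
    typedef struct {
        intptr_t offset;   /* byte offset from env */
        intptr_t size;     /* number of bytes accessed */
    } EnvAccess;

    /* Two [env + Const] accesses interfere iff their byte ranges
     * overlap; accesses not of that form must be treated
     * conservatively. */
    static bool env_accesses_interfere(EnvAccess a, EnvAccess b)
    {
        return a.offset < b.offset + b.size &&
               b.offset < a.offset + a.size;
    }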

Overall I like your approach the most out of the three:
 - it handles the different representations of guest vectors with host
   vectors seamlessly (unlike my approach, where I still do not know
   how to get this right),
 - it provides better performance than Alex's (and the same as mine
   once we add a bit of alias analysis),
 - it moves in the direction of representing guest vectors not as
   globals but as a pair (offset, size) in a special address space
   (this approach was successfully used in Valgrind and it handles
   intersecting registers much better than what we have now; we are
   moving in this direction anyway).

-- 
Kirill