Message-ID: 20170817180404.29334-1-alex.bennee@linaro.org
Series: TCG Vector types and example conversion
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20170817180404.29334-1-alex.bennee@linaro.org
Subject: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion

=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0
git config --local diff.renamelimit 0
git config --local diff.renames True
commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done
exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
bab61192a4 target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]
c5c733a1f9 target/arm/helpers: introduce ADVSIMD flags
e27691f07a target/arm/translate-a64: register global vectors
6a55e60454 target/arm/translate-a64: regnames -> x_regnames
e7a0e2466b arm/cpu.h: align VFP registers
efa94c04ce helper-head: add support for vec type
d8c96ebdd2 tcg: generate ptrs to vector registers
26ec07c3d7 tcg: introduce the concepts of a TCGv_vec register type
9ec3b4754d tcg/README: listify the TCG types.

=== OUTPUT BEGIN ===
Checking PATCH 1/9: tcg/README: listify the TCG types....
Checking PATCH 2/9: tcg: introduce the concepts of a TCGv_vec register type...
Checking PATCH 3/9: tcg: generate ptrs to vector registers...
Checking PATCH 4/9: helper-head: add support for vec type...
Checking PATCH 5/9: arm/cpu.h: align VFP registers...
Checking PATCH 6/9: target/arm/translate-a64: regnames -> x_regnames...
Checking PATCH 7/9: target/arm/translate-a64: register global vectors...
Checking PATCH 8/9: target/arm/helpers: introduce ADVSIMD flags...
WARNING: line over 80 characters
#50: FILE: target/arm/advsimd_helper_flags.h:30:
+ * ADVSIMD_ALL_ELT - the total count of elements (e.g. clear all-opr elements)

total: 0 errors, 1 warnings, 65 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 9/9: target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]...
WARNING: line over 80 characters
#65: FILE: target/arm/translate-a64.c:10469:
+typedef void AdvSIMDGenTwoPlusOneVectorFn(TCGv_vec, TCGv_vec, TCGv_i32, TCGv_i32);

ERROR: line over 90 characters
#71: FILE: target/arm/translate-a64.c:10475:
+static bool handle_vec_simd_mul_addsub(DisasContext *s, uint32_t insn, int opcode, int size, bool is_q, bool u, int rn, int rm, int rd)

ERROR: that open brace { should be on the previous line
#86: FILE: target/arm/translate-a64.c:10490:
+    if (!u)
+    {

WARNING: line over 80 characters
#95: FILE: target/arm/translate-a64.c:10499:
+                       ADVSIMD_OPR_ELT_SHIFT, ADVSIMD_OPR_ELT_BITS, 4);

ERROR: line over 90 characters
#99: FILE: target/arm/translate-a64.c:10503:
+                       ADVSIMD_DOFF_ELT_SHIFT, ADVSIMD_DOFF_ELT_BITS, 4);

WARNING: line over 80 characters
#141: FILE: target/arm/translate-a64.c:10590:
+        if (handle_vec_simd_mul_addsub(s, insn, opcode, size, is_q, u, rn, rm, rd)) {

total: 3 errors, 3 warnings, 109 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1

---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
On Thu, 17 Aug 2017, Alex Bennée wrote:

> Hi,
>
> With upcoming work on SVE I've been looking at the way we implement
> vector registers in QEMU's TCG. The current orthodoxy is to decompose
> the vector into a series of TCG registers, often calling a helper
> function for the calculation of each element. The result of the helper
> is then stored back in the vector representation afterwards. There are
> occasional outliers like simd_tbl which access elements directly from a
> passed CPUFooState env pointer, but these are rare.
>
> This series introduces the concept of a TCGv_vec type. This is a
> pointer to the start of the in-memory representation of an arbitrarily
> long vector register. It is passed to a helper function along with a
> normal TCG register containing information about the actual vector
> length and any additional information the helper needs to do the
> operation. The hope* is that this saves on the churn of having the TCG
> do things element by element and allows the compiler to use native
> vector operations to streamline the helpers.
>
> There are some downsides to this approach. The first is that you have
> to be careful about register aliasing. If you are doing a same-reg to
> same-reg operation you need to make a copy of the vector so you don't
> trample your input data as you go. The second is that it involves
> changing some of the assumptions the TCG makes about things. I've
> managed to keep all the changes within the core TCG code for now, but
> so far it has only been tested for the tcg_call path, which is the only
> place where TCGv_vecs should turn up. It is possible to do the same
> thing without touching the TCG code generation by using TCGv_ptrs and
> manually emitting tcg_addi ops to pass the correct address. Richard has
> been exploring this approach with his series. The downside of that is
> you do miss the ability to have named global vector registers, which
> makes reading the TCG dumps a little easier.
>
> I've only patched one helper in this series, which implements the
> indexed smull. This is because it appears in the profiles for my test
> case, which was using an arm64 ffmpeg to transcode:
>
>   ./ffmpeg.arm64 -i big_buck_bunny_480p_surround-fix.avi \
>       -threads 1 -qscale:v 3 -f null -
>
> * hope. On an earlier revision (which included sqshrn conversions) I
> had measured a minor saving, but this had disappeared once I measured
> the final code. However the profile is fairly dominated by softfloat.
>
> master:
>   8.05%  qemu-aarch64  qemu-aarch64  [.] roundAndPackFloat32
>   7.28%  qemu-aarch64  qemu-aarch64  [.] float32_mul
>   6.56%  qemu-aarch64  qemu-aarch64  [.] helper_lookup_tb_ptr
>   5.31%  qemu-aarch64  qemu-aarch64  [.] float32_muladd
>   4.09%  qemu-aarch64  qemu-aarch64  [.] helper_neon_mull_s16
>   4.00%  qemu-aarch64  qemu-aarch64  [.] addFloat32Sigs
>   3.86%  qemu-aarch64  qemu-aarch64  [.] subFloat32Sigs
>   2.26%  qemu-aarch64  qemu-aarch64  [.] helper_simd_tbl
>   2.00%  qemu-aarch64  qemu-aarch64  [.] float32_add
>   1.81%  qemu-aarch64  qemu-aarch64  [.] helper_neon_unarrow_sat8
>   1.64%  qemu-aarch64  qemu-aarch64  [.] float32_sub
>   1.43%  qemu-aarch64  qemu-aarch64  [.] helper_neon_subl_u32
>   0.98%  qemu-aarch64  qemu-aarch64  [.] helper_neon_widen_u8
>
> tcg-native-vectors-rfc:
>   7.93%  qemu-aarch64  qemu-aarch64  [.] roundAndPackFloat32
>   7.54%  qemu-aarch64  qemu-aarch64  [.] float32_mul
>   6.29%  qemu-aarch64  qemu-aarch64  [.] helper_lookup_tb_ptr
>   5.39%  qemu-aarch64  qemu-aarch64  [.] float32_muladd
>   3.92%  qemu-aarch64  qemu-aarch64  [.] addFloat32Sigs
>   3.86%  qemu-aarch64  qemu-aarch64  [.] subFloat32Sigs
>   3.62%  qemu-aarch64  qemu-aarch64  [.] helper_advsimd_smull_idx_s32
>   2.19%  qemu-aarch64  qemu-aarch64  [.] helper_simd_tbl
>   2.09%  qemu-aarch64  qemu-aarch64  [.] helper_neon_mull_s16
>   1.99%  qemu-aarch64  qemu-aarch64  [.] float32_add
>   1.79%  qemu-aarch64  qemu-aarch64  [.] helper_neon_unarrow_sat8
>   1.62%  qemu-aarch64  qemu-aarch64  [.] float32_sub
>   1.43%  qemu-aarch64  qemu-aarch64  [.] helper_neon_subl_u32
>   1.00%  qemu-aarch64  qemu-aarch64  [.] helper_neon_widen_u8
>   0.98%  qemu-aarch64  qemu-aarch64  [.] helper_neon_addl_u32
>
> At the moment the default compiler settings don't actually vectorise
> the helper. I could get it to once I added some alignment guarantees,
> but the casting I did broke the instruction emulation so I haven't
> included that patch in this series.
>
> Given the results, why continue investigating this? Well for one thing
> vector sizes are growing: SVE vectors are up to 2048 bits long. Those
> longer vectors should offer more scope for the host compiler to
> generate efficient code in the helper. Also vector operations tend to
> be quite complex operations, and being able to handle this in C code
> instead of TCGOps might be preferable from a code maintainability point
> of view. Finally this noddy little experiment has at least shown it
> doesn't worsen performance. It would be nice if I could find a
> benchmark that made heavy use of non-floating point SIMD instructions
> to better measure the effect of marshalling elements vs vectorised
> helpers. If anyone has any suggestions I'm all ears ;-)

While doing my own vector register series I was using:

1. A handwritten example (it's for ARM32 NEON, not aarch64):

        .cpu cortex-a8
        .fpu neon
        .text
        .global test
   test:
        vld1.32 d0, [r0]!
        vld1.32 d1, [r0]
        vld1.32 d2, [r1]!
        vld1.32 d3, [r1]
        mov r0, #0xb0000000
   loop:
        vadd.i32 q0, q0, q1
        vadd.i32 q0, q0, q1
        vadd.i32 q0, q0, q1
        vadd.i32 q0, q0, q1
        subs r0, r0, #1
        bne loop
        vpadd.i32 d0, d0, d1
        vpadd.i32 d0, d0, d1
        vmov.i32 r0, d0[0]
        bx lr

   It can be adapted for aarch64 without much trouble. This example shows
   what potential speed-up you can expect, as it is nearly perfect for
   the optimization in question.

2. The x264 video encoder. It has a lot of handwritten vector assembler
   for different architectures, including aarch64. You can probably
   access it as libx264 from within ffmpeg, if this library support was
   compiled in.

>
> Anyway questions, comments?
From my own experimentations some time ago:

(1) translating vector instructions to vector instructions in TCG is
    faster than

(2) translating vector instructions to series of scalar instructions in
    TCG, which is faster than*

(3) translating vector instructions to single helper calls, which is
    faster than*

(4) translating vector instructions to helper calls for each vector
    element.

(*) (2) and (3) may change their respective places in case of some
complicated instructions.

ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
last, the slowest, scheme. As far as I understand, you want to change it
to the third approach. This approach is used in SSE emulation; maybe you
can use a similar structure of helpers?

I still hope to finish my own series implementing the first approach. I
apologize for the long delay since the last update and hope to send the
next version somewhere next week. I do not think our series contradict
each other: you are trying to optimize the existing general-purpose case
while I'm trying to optimize the case where both host and guest support
vector instructions. Since I'm experimenting on ARM32, we won't have
many merge conflicts either.

--
Kirill
On 08/18/2017 04:33 AM, Kirill Batuzov wrote:
> From my own experimentations some times ago,
>
> (1) translating vector instructions to vector instructions in TCG is faster
> than
>
> (2) translating vector instructions to series of scalar instructions in TCG,
> which is faster than*
>
> (3) translating vector instructions to single helper calls, which is faster
> than*
>
> (4) translating vector instructions to helper calls for each vector element.
>
> (*) (2) and (3) may change their respective places in case of some
> complicated instructions.

This was my gut feeling as well. With the caveat that for the ARM SVE case of
2048-bit registers we cannot afford to expand inline due to generated code
size.

> ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
> last, the slowest scheme. As far as I understand, you are want to change
> it to the third approach. This approach is used in SSE emulation, may be
> you can use similar structure of helpers?
>
> I still hope to finish my own series about implementation of the first
> approach. I apologize for the long delay since last update and hope to
> send next version somewhere next week. I do not think our series
> contradict each other: you are trying to optimize existing general
> purpose case while I'm trying to optimize case where both host and guest
> support vector instructions. Since I'm experimenting on ARM32, we'll not
> have much merge conflicts either.

I posted my own, different, take on vectorization yesterday as well.

  http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html

The primary difference between my version and your version is that I do not
allow target/cpu/translate*.c to create vector types. All of the host vector
expansion is done within tcg/*.c.

We also would like to settle on a common style for out-of-line helpers, if
that is possible.

One thing *not* to take from our current SSE emulation is that we do not yet
support AVX, AVX2, or AVX512 extensions. So the current construction of
helpers within target/i386/ doesn't really take into account all that should
be required.

The thing that's common between AVX512 and SVE is that we have multiple
vector lengths, and that elements beyond the operation length are zeroed.
Both Alex and I have packed operation length + full vector length into a
descriptor given to the helper. (Alex allows for some other bits too; I'm
not sure about that.)


r~
On Fri, 18 Aug 2017, Richard Henderson wrote:

> On 08/18/2017 04:33 AM, Kirill Batuzov wrote:
> > From my own experimentations some times ago,
> >
> > (1) translating vector instructions to vector instructions in TCG is
> > faster than
> >
> > (2) translating vector instructions to series of scalar instructions in
> > TCG, which is faster than*
> >
> > (3) translating vector instructions to single helper calls, which is
> > faster than*
> >
> > (4) translating vector instructions to helper calls for each vector
> > element.
> >
> > (*) (2) and (3) may change their respective places in case of some
> > complicated instructions.
>
> This was my gut feeling as well. With the caveat that for the ARM SVE case
> of 2048-bit registers we cannot afford to expand inline due to generated
> code size.
>
> > ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
> > last, the slowest scheme. As far as I understand, you are want to change
> > it to the third approach. This approach is used in SSE emulation, may be
> > you can use similar structure of helpers?
> >
> > I still hope to finish my own series about implementation of the first
> > approach. I apologize for the long delay since last update and hope to
> > send next version somewhere next week. I do not think our series
> > contradict each other: you are trying to optimize existing general
> > purpose case while I'm trying to optimize case where both host and guest
> > support vector instructions. Since I'm experimenting on ARM32, we'll not
> > have much merge conflicts either.
>
> I posted my own, different, take on vectorization yesterday as well.
>
>   http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html
>
> The primary difference between my version and your version is that I do not
> allow target/cpu/translate*.c to create vector types. All of the host
> vector expansion is done within tcg/*.c.

I took a look at your approach.
The only problem with it is that in the current implementation it does
not allow keeping vector variables in registers between consecutive
guest instructions. But this can be changed. To do it we need to make
copy propagation work with memory locations as well, and make dead code
elimination able to remove excess stores to memory. While in the general
case these can be troublesome, if we limit the analysis to addresses of
the form [env + Const] it becomes relatively easy. I've done a similar
thing in my series to track interference between memory operations and
vector global variables. In the case of your series this affects only
performance, so it does not need to be added in the initial series and
can be added later as a separate patch. I can take care of this once the
initial series is pulled to master.

Overall I like your approach the most out of the three:

 - it handles different representations of guest vectors with host
   vectors seamlessly (unlike my approach, where I still do not know how
   to make it right),

 - it provides better performance than Alex's (and the same as mine once
   we add a bit of alias analysis),

 - it moves in the direction of representing guest vectors not as
   globals but as a pair (offset, size) in a special address space (this
   approach was successfully used in Valgrind, and it handles
   intersecting registers much better than what we have now; we are
   moving in this direction anyway).

--
Kirill