Message ID | 20190104223116.14037-1-richard.henderson@linaro.org |
---|---|
Headers | show |
Series | tcg vector improvements | expand |
On 04/01/2019 22:31, Richard Henderson wrote: > I've split this out from the target/ppc patch set in which > it was developed. > > > r~ > > > Richard Henderson (10): > tcg: Add logical simplifications during gvec expand > tcg: Add gvec expanders for nand, nor, eqv > tcg: Add write_aofs to GVecGen4 > tcg: Add opcodes for vector saturated arithmetic > tcg: Add opcodes for vector minmax arithmetic > tcg/i386: Split subroutines out of tcg_expand_vec_op > tcg/i386: Implement vector saturating arithmetic > tcg/i386: Implement vector minmax arithmetic > tcg/aarch64: Implement vector saturating arithmetic > tcg/aarch64: Implement vector minmax arithmetic > > accel/tcg/tcg-runtime.h | 23 ++ > tcg/aarch64/tcg-target.h | 2 + > tcg/i386/tcg-target.h | 2 + > tcg/tcg-op-gvec.h | 18 ++ > tcg/tcg-op.h | 11 + > tcg/tcg-opc.h | 8 + > tcg/tcg.h | 2 + > accel/tcg/tcg-runtime-gvec.c | 257 ++++++++++++++++ > tcg/aarch64/tcg-target.inc.c | 48 +++ > tcg/i386/tcg-target.inc.c | 580 +++++++++++++++++++++-------------- > tcg/tcg-op-gvec.c | 305 ++++++++++++++++-- > tcg/tcg-op-vec.c | 75 ++++- > tcg/tcg.c | 10 + > 13 files changed, 1078 insertions(+), 263 deletions(-) Not sure that I'm particularly qualified to give this an R-B, however should there be a corresponding update to tcg/README for the new instructions? One other thing is that Howard sent me off-list a backtrace from trying my combined branch at https://github.com/mcayland/qemu/tree/ppc-altivec-v5.5-rth booting MacOS 10.5 in the guest and ended up hitting an assert in an --enable-debug build: Thread 5 (Thread 0x7fffe3fff700 (LWP 10627)): #0 0x00007ffff698d53f in raise () at /lib64/libc.so.6 #1 0x00007ffff6977895 in abort () at /lib64/libc.so.6 #2 0x00007ffff6977769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6 #3 0x00007ffff69859f6 in .annobin_assert.c_end () at /lib64/libc.so.6 #4 0x000055555584bb67 in do_op3 (vece=2, r=0x1848, a=0x18b8, b=0x18f0, opc=INDEX_op_ssadd_vec) at /home/hsp/src/qemu-altivec-55/tcg/tcg-op-vec.c:407 rt = 0x7fffdc002368 at = 0x7fffdc0023d8 bt = 0x7fffdc002410 ri = 140736884384616 ai = 140736884384728 bi = 140736884384784 type = TCG_TYPE_V128 can = 0 __PRETTY_FUNCTION__ = "do_op3" #5 0x000055555584bbfa in tcg_gen_ssadd_vec (vece=2, r=0x1848, a=0x18b8, b=0x18f0) at /home/hsp/src/qemu-altivec-55/tcg/tcg-op-vec.c:419 #6 0x0000555555991887 in gen_vaddsws_vec (vece=2, t=0x1848, sat=0x1880, a=0x18b8, b=0x18f0) at /home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597 x = 0x1928 #7 0x0000555555852e53 in expand_4_vec (vece=2, dofs=197872, aofs=198288, bofs=197776, cofs=197792, oprsz=16, tysz=16, type=TCG_TYPE_V128, write_aofs=true, fni=0x55555599182a <gen_vaddsws_vec>) at /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:903 t0 = 0x1848 t1 = 0x1880 t2 = 0x18b8 t3 = 0x18f0 i = 0 #8 0x0000555555853cc4 in tcg_gen_gvec_4 (dofs=197872, aofs=198288, bofs=197776, cofs=197792, oprsz=16, maxsz=16, g=0x5555562d33c0 <g>) at /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:1211 type = TCG_TYPE_V128 some = 21845 __PRETTY_FUNCTION__ = "tcg_gen_gvec_4" __func__ = "tcg_gen_gvec_4" #9 0x0000555555991987 in gen_vaddsws (ctx=0x7fffe3ffe5f0) at /home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597 g = {fni8 = 0x0, fni4 = 0x0, fniv = 0x55555599182a <gen_vaddsws_vec>, fno = 0x5555559601a1 <gen_helper_vaddsws>, opc = INDEX_op_add_vec, data = 0, vece = 2 '\002', prefer_i64 = false, write_aofs = true} Certainly according to patch 7 of the series only 8-bit and 16-bit accesses are supported on i386 hosts, but shouldn't we be falling back to the previous implementations rather than hitting an assert()? ATB, Mark.
On 1/7/19 5:11 AM, Mark Cave-Ayland wrote: > #7 0x0000555555852e53 in expand_4_vec (vece=2, dofs=197872, > aofs=198288, bofs=197776, cofs=197792, oprsz=16, tysz=16, > type=TCG_TYPE_V128, write_aofs=true, fni=0x55555599182a > <gen_vaddsws_vec>) at > /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:903 > t0 = 0x1848 > t1 = 0x1880 > t2 = 0x18b8 > t3 = 0x18f0 > i = 0 > #8 0x0000555555853cc4 in tcg_gen_gvec_4 (dofs=197872, aofs=198288, > bofs=197776, cofs=197792, oprsz=16, maxsz=16, g=0x5555562d33c0 <g>) at > /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:1211 > type = TCG_TYPE_V128 > some = 21845 > __PRETTY_FUNCTION__ = "tcg_gen_gvec_4" > __func__ = "tcg_gen_gvec_4" > #9 0x0000555555991987 in gen_vaddsws (ctx=0x7fffe3ffe5f0) at > /home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597 > g = {fni8 = 0x0, fni4 = 0x0, fniv = 0x55555599182a > <gen_vaddsws_vec>, fno = 0x5555559601a1 <gen_helper_vaddsws>, opc = > INDEX_op_add_vec, data = 0, vece = 2 '\002', prefer_i64 = false, > write_aofs = true} > > > Certainly according to patch 7 of the series only 8-bit and 16-bit accesses are > supported on i386 hosts, but shouldn't we be falling back to the previous > implementations rather than hitting an assert()? In here: #define GEN_VXFORM_SAT(NAME, VECE, NORM, SAT, OPC2, OPC3) \ static void glue(glue(gen_, NAME), _vec)(unsigned vece, TCGv_vec t, \ TCGv_vec sat, TCGv_vec a, \ TCGv_vec b) \ { \ TCGv_vec x = tcg_temp_new_vec_matching(t); \ glue(glue(tcg_gen_, NORM), _vec)(VECE, x, a, b); \ glue(glue(tcg_gen_, SAT), _vec)(VECE, t, a, b); \ tcg_gen_cmp_vec(TCG_COND_NE, VECE, x, x, t); \ tcg_gen_or_vec(VECE, sat, sat, x); \ tcg_temp_free_vec(x); \ } \ static void glue(gen_, NAME)(DisasContext *ctx) \ { \ static const GVecGen4 g = { \ .fniv = glue(glue(gen_, NAME), _vec), \ .fno = glue(gen_helper_, NAME), \ .opc = glue(glue(INDEX_op_, NORM), _vec), \ s/NORM/SAT/, so that we query whether the saturated opcode is supported. The normal arithmetic, cmp, and or opcodes are mandatory; we don't need to do anything with those. r~
On 23/01/2019 05:09, Richard Henderson wrote: > On 1/7/19 5:11 AM, Mark Cave-Ayland wrote: >> #7 0x0000555555852e53 in expand_4_vec (vece=2, dofs=197872, >> aofs=198288, bofs=197776, cofs=197792, oprsz=16, tysz=16, >> type=TCG_TYPE_V128, write_aofs=true, fni=0x55555599182a >> <gen_vaddsws_vec>) at >> /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:903 >> t0 = 0x1848 >> t1 = 0x1880 >> t2 = 0x18b8 >> t3 = 0x18f0 >> i = 0 >> #8 0x0000555555853cc4 in tcg_gen_gvec_4 (dofs=197872, aofs=198288, >> bofs=197776, cofs=197792, oprsz=16, maxsz=16, g=0x5555562d33c0 <g>) at >> /home/hsp/src/qemu-altivec-55/tcg/tcg-op-gvec.c:1211 >> type = TCG_TYPE_V128 >> some = 21845 >> __PRETTY_FUNCTION__ = "tcg_gen_gvec_4" >> __func__ = "tcg_gen_gvec_4" >> #9 0x0000555555991987 in gen_vaddsws (ctx=0x7fffe3ffe5f0) at >> /home/hsp/src/qemu-altivec-55/target/ppc/translate/vmx-impl.inc.c:597 >> g = {fni8 = 0x0, fni4 = 0x0, fniv = 0x55555599182a >> <gen_vaddsws_vec>, fno = 0x5555559601a1 <gen_helper_vaddsws>, opc = >> INDEX_op_add_vec, data = 0, vece = 2 '\002', prefer_i64 = false, >> write_aofs = true} >> >> >> Certainly according to patch 7 of the series only 8-bit and 16-bit accesses are >> supported on i386 hosts, but shouldn't we be falling back to the previous >> implementations rather than hitting an assert()? > > In here: > > #define GEN_VXFORM_SAT(NAME, VECE, NORM, SAT, OPC2, OPC3) \ > static void glue(glue(gen_, NAME), _vec)(unsigned vece, TCGv_vec t, \ > TCGv_vec sat, TCGv_vec a, \ > TCGv_vec b) \ > { \ > TCGv_vec x = tcg_temp_new_vec_matching(t); \ > glue(glue(tcg_gen_, NORM), _vec)(VECE, x, a, b); \ > glue(glue(tcg_gen_, SAT), _vec)(VECE, t, a, b); \ > tcg_gen_cmp_vec(TCG_COND_NE, VECE, x, x, t); \ > tcg_gen_or_vec(VECE, sat, sat, x); \ > tcg_temp_free_vec(x); \ > } \ > static void glue(gen_, NAME)(DisasContext *ctx) \ > { \ > static const GVecGen4 g = { \ > .fniv = glue(glue(gen_, NAME), _vec), \ > .fno = glue(gen_helper_, NAME), \ > .opc = glue(glue(INDEX_op_, NORM), _vec), \ > > s/NORM/SAT/, so that we query whether the saturated opcode is supported. The > normal arithmetic, cmp, and or opcodes are mandatory; we don't need to do > anything with those. Now that this and the other pre-requisite patches have been merged into master, I've rebased the outstanding PPC parts of your "tcg, target/ppc vector improvements" on master including the above fix and pushed the result to https://github.com/mcayland/qemu/commits/ppc-altivec-v6. The good news is that the graphics corruption I originally noticed caused by the patch introducing the saturating add/sub vector ops has now gone, and with my little-endian vsplt fix included then both OS X and MacOS 9 appear to run without any obvious issues on an x86 host, and certainly feel smoother compared to before. The only minor question I had with the patchset in its current form is whether to use the new VsrD() macro for vscr_sat, or whether we don't really care enough? ATB, Mark.
On 2/5/19 9:29 PM, Mark Cave-Ayland wrote: > The only minor question I had with the patchset in its current form is whether to use > the new VsrD() macro for vscr_sat, or whether we don't really care enough? Given the comment /* Which bit we set is completely arbitrary, but clear the rest. */ I don't think VsrD is helpful. In "target/ppc: Split out VSCR_SAT to a vector field": ppc_vsr_t vsr[64] QEMU_ALIGNED(16); + /* Non-zero if and only if VSCR_SAT should be set. */ + ppc_vsr_t vscr_sat; Better to add the QEMU_ALIGNED(16) to vscr_sat as well. Yes, it will already happen to be aligned by placement, but this is also a bit documentation. In "target/ppc: convert vadd*s and vsub*s to vector operations": if (sat) { \ - set_vscr_sat(env); \ + vscr_sat->u32[0] = 1; \ } \ Changed in error? r~
On 06/02/2019 03:37, Richard Henderson wrote: > On 2/5/19 9:29 PM, Mark Cave-Ayland wrote: >> The only minor question I had with the patchset in its current form is whether to use >> the new VsrD() macro for vscr_sat, or whether we don't really care enough? > > Given the comment > > /* Which bit we set is completely arbitrary, but clear the rest. */ > > I don't think VsrD is helpful. Okay, I can leave that for now. > In "target/ppc: Split out VSCR_SAT to a vector field": > > ppc_vsr_t vsr[64] QEMU_ALIGNED(16); > + /* Non-zero if and only if VSCR_SAT should be set. */ > + ppc_vsr_t vscr_sat; > > Better to add the QEMU_ALIGNED(16) to vscr_sat as well. Yes, it will already > happen to be aligned by placement, but this is also a bit documentation. I've added this to my latest branch. > In "target/ppc: convert vadd*s and vsub*s to vector operations": > > if (sat) { \ > - set_vscr_sat(env); \ > + vscr_sat->u32[0] = 1; \ > } \ > > Changed in error? It looks like this was in the original patch, presumably because GEN_VXFORM_SAT doesn't include the env parameter which was present in GEN_VXFORM_ENV. Should the env parameter be added to GEN_VXFORM_SAT? Howard has also pointed out that he's still spotted some corruption in his tests, so I will do a bit more investigation and report back. ATB, Mark.
On Wed, 6 Feb 2019, Mark Cave-Ayland wrote: > Howard has also pointed out that he's still spotted some corruption in his tests, so > I will do a bit more investigation and report back. I wonder if instead of "visually testing" shouldn't it be verified by some more exact tests? Could any of these be useful: https://gcc.gnu.org/ml/gcc-patches/2002-11/msg01391.html https://gcc.gnu.org/git/?p=gcc.git;a=tree;f=gcc/testsuite/gcc.dg/vmx;h=4765c35dabb4e3891c98084af077c2549fb45dfc;hb=HEAD https://github.com/llvm-mirror/test-suite/tree/master/SingleSource/UnitTests/Vector/Altivec or there could be other similar ones around that already have test cases for these instructions that could provide some more confidence in the correctness of the implementation. Regards, BALATON Zoltan
On 05/02/2019 21:29, Mark Cave-Ayland wrote: > Now that this and the other pre-requisite patches have been merged into master, I've > rebased the outstanding PPC parts of your "tcg, target/ppc vector improvements" on > master including the above fix and pushed the result to > https://github.com/mcayland/qemu/commits/ppc-altivec-v6. > > The good news is that the graphics corruption I originally noticed caused by the > patch introducing the saturating add/sub vector ops has now gone, and with my > little-endian vsplt fix included then both OS X and MacOS 9 appear to run without any > obvious issues on an x86 host, and certainly feel smoother compared to before. I started to follow up Howard's report that he could still see graphical corruption with these patches, and I was surprised to notice that it had reappeared locally once again :/ After working out that this was because the relevant instructions weren't being generated for all Mac machines, I was able to figure it out after a few hours of digging. Patch to follow shortly. ATB, Mark.