Message ID: 20181112214503.22941-1-richard.henderson@linaro.org
Series:     tcg: Move softmmu out-of-line
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20181112214503.22941-1-richard.henderson@linaro.org
Subject: [Qemu-devel] [PATCH for-4.0 00/17] tcg: Move softmmu out-of-line

=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done
exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20181113083104.2692-1-aik@ozlabs.ru -> patchew/20181113083104.2692-1-aik@ozlabs.ru
Switched to a new branch 'test'
e69b196b86 tcg/arm: Use TCG_TARGET_NEED_LDST_OOL_LABELS
6fd30d145e tcg/arm: Force qemu_ld/st arguments into fixed registers
bb1d7b7a2a tcg/arm: Reduce the number of temps for tcg_out_tlb_read
58ad5e1707 tcg/arm: Add constraints for R0-R5
0ae2a6fe31 tcg/arm: Parameterize the temps for tcg_out_tlb_read
d9cc700ca9 tcg/aarch64: Use TCG_TARGET_NEED_LDST_OOL_LABELS
81d088cf09 tcg/aarch64: Use B not BL for tcg_out_goto_long
9b4254898d tcg/aarch64: Parameterize the temp for tcg_out_goto_long
88f66038f2 tcg/aarch64: Parameterize the temps for tcg_out_tlb_read
70870da04b tcg/aarch64: Add constraints for x0, x1, x2
c13bc30fe2 tcg/i386: Use TCG_TARGET_NEED_LDST_OOL_LABELS
f89e8cc968 tcg: Add TCG_TARGET_NEED_LDST_OOL_LABELS
61aa1913f6 tcg: Return success from patch_reloc
299cd2f2dc tcg/i386: Force qemu_ld/st arguments into fixed registers
29999f5f40 tcg/i386: Change TCG_REG_L[01] to not overlap function arguments
8c6efd075f tcg/i386: Return a base register from tcg_out_tlb_load
8c82ba4a0e tcg/i386: Add constraints for r8 and r9

=== OUTPUT BEGIN ===
Checking PATCH 1/17: tcg/i386: Add constraints for r8 and r9...
Checking PATCH 2/17: tcg/i386: Return a base register from tcg_out_tlb_load...
Checking PATCH 3/17: tcg/i386: Change TCG_REG_L[01] to not overlap function arguments...
Checking PATCH 4/17: tcg/i386: Force qemu_ld/st arguments into fixed registers...
Checking PATCH 5/17: tcg: Return success from patch_reloc...
Checking PATCH 6/17: tcg: Add TCG_TARGET_NEED_LDST_OOL_LABELS...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#13:
new file mode 100644

total: 0 errors, 1 warnings, 148 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 7/17: tcg/i386: Use TCG_TARGET_NEED_LDST_OOL_LABELS...
Checking PATCH 8/17: tcg/aarch64: Add constraints for x0, x1, x2...
Checking PATCH 9/17: tcg/aarch64: Parameterize the temps for tcg_out_tlb_read...
Checking PATCH 10/17: tcg/aarch64: Parameterize the temp for tcg_out_goto_long...
Checking PATCH 11/17: tcg/aarch64: Use B not BL for tcg_out_goto_long...
Checking PATCH 12/17: tcg/aarch64: Use TCG_TARGET_NEED_LDST_OOL_LABELS...
Checking PATCH 13/17: tcg/arm: Parameterize the temps for tcg_out_tlb_read...
Checking PATCH 14/17: tcg/arm: Add constraints for R0-R5...
Checking PATCH 15/17: tcg/arm: Reduce the number of temps for tcg_out_tlb_read...
Checking PATCH 16/17: tcg/arm: Force qemu_ld/st arguments into fixed registers...
Checking PATCH 17/17: tcg/arm: Use TCG_TARGET_NEED_LDST_OOL_LABELS...
ERROR: externs should be avoided in .c files
#168: FILE: tcg/arm/tcg-target.inc.c:1485:
+    TCGReg addrlo __attribute__((unused));

ERROR: externs should be avoided in .c files
#169: FILE: tcg/arm/tcg-target.inc.c:1486:
+    TCGReg addrhi __attribute__((unused));

ERROR: externs should be avoided in .c files
#170: FILE: tcg/arm/tcg-target.inc.c:1487:
+    TCGReg datalo __attribute__((unused));

ERROR: externs should be avoided in .c files
#171: FILE: tcg/arm/tcg-target.inc.c:1488:
+    TCGReg datahi __attribute__((unused));

ERROR: externs should be avoided in .c files
#233: FILE: tcg/arm/tcg-target.inc.c:1603:
+    TCGReg addrlo __attribute__((unused));

ERROR: externs should be avoided in .c files
#234: FILE: tcg/arm/tcg-target.inc.c:1604:
+    TCGReg addrhi __attribute__((unused));

ERROR: externs should be avoided in .c files
#235: FILE: tcg/arm/tcg-target.inc.c:1605:
+    TCGReg datalo __attribute__((unused));

ERROR: externs should be avoided in .c files
#236: FILE: tcg/arm/tcg-target.inc.c:1606:
+    TCGReg datahi __attribute__((unused));

total: 8 errors, 0 warnings, 373 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1

---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
On Mon, Nov 12, 2018 at 22:44:46 +0100, Richard Henderson wrote:
> Based on an idea forwarded by Emilio, which suggests a 5-6%
> speed gain is possible.  I have not spent too much time
> measuring this, as the code size gains are significant.

Nice!

> I believe that I posted an x86_64-only patch some time ago,
> but this now includes i386, aarch64 and arm32.  In late
> testing I do see some failures on i386, for sparc guest.  I'll
> follow up on that later.

The following might be related: I'm seeing segfaults with -smp 8
and beyond when doing bootup+shutdown of an aarch64 guest on an
x86-64 host.  -smp 1 is stable AFAICT.

The first commit that shows these crashes is
"tcg/i386: Use TCG_TARGET_NEED_LDST_OOL_LABELS", that is f7ec49a51c8
in your tcg-tlb-x86 github branch.

Thanks,
		Emilio
On 11/14/18 2:00 AM, Emilio G. Cota wrote:
> The following might be related: I'm seeing segfaults with -smp 8
> and beyond when doing bootup+shutdown of an aarch64 guest on
> an x86-64 host.

I'm not seeing that.  Anything else special on the command-line?
Are the segv in the code_gen_buffer or elsewhere?

r~
On Thu, Nov 15, 2018 at 12:32:00 +0100, Richard Henderson wrote:
> On 11/14/18 2:00 AM, Emilio G. Cota wrote:
> > The following might be related: I'm seeing segfaults with -smp 8
> > and beyond when doing bootup+shutdown of an aarch64 guest on
> > an x86-64 host.
>
> I'm not seeing that.  Anything else special on the command-line?
> Are the segv in the code_gen_buffer or elsewhere?

I just spent some time on this. I've noticed two issues:

- All TCG contexts end up using the same hash table, since we only
  allocate one table in tcg_context_init. This leads to memory
  corruption. This fixes it (confirmed that there aren't races with
  helgrind):

--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -763,6 +763,14 @@ void tcg_register_thread(void)
     err = tcg_region_initial_alloc__locked(tcg_ctx);
     g_assert(!err);
     qemu_mutex_unlock(&region.lock);
+
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    /* if n == 0, keep the hash table we allocated in tcg_context_init */
+    if (n) {
+        /* Both key and value are raw pointers. */
+        s->ldst_ool_thunks = g_hash_table_new(NULL, NULL);
+    }
+#endif
 }
 #endif /* !CONFIG_USER_ONLY */

- Segfault in code_gen_buffer. This one I don't have a fix for, but
  it's *much* easier to reproduce when -tb-size is very small, e.g.
  "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
  So at first I thought the code cache flushing was the problem, but
  I don't see how that could be, at least from a TCGContext viewpoint
  -- I agree that clearing the hash table in tcg_region_assign is a
  good place to do so.

Thanks,
		Emilio
On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> - All TCG contexts end up using the same hash table, since
>   we only allocate one table in tcg_context_init. This leads
>   to memory corruption.

Bah.  I forgot we only call tcg_context_init once.  Thanks for the fix.

In the meantime I've fixed the i386 problem.  A bit of silliness
wrt 64-bit stores.

r~
On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> - Segfault in code_gen_buffer. This one I don't have a fix for,
>   but it's *much* easier to reproduce when -tb-size is very small,
>   e.g. "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
>   So at first I thought the code cache flushing was the problem,
>   but I don't see how that could be, at least from a TCGContext
>   viewpoint -- I agree that clearing the hash table in
>   tcg_region_assign is a good place to do so.

Ho hum.

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 639f0b2728..115ea186e5 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1831,10 +1831,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
     if (unlikely(existing_tb != tb)) {
-        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
-
-        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
-        atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
         return existing_tb;
     }
     tcg_tb_insert(tb);

We can't easily undo the hash table insert, and for a relatively rare
occurrence it's not worth the effort.

r~
On Thu, Nov 15, 2018 at 23:04:50 +0100, Richard Henderson wrote:
> On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> > - Segfault in code_gen_buffer. This one I don't have a fix for,
> >   but it's *much* easier to reproduce when -tb-size is very small,
> >   e.g. "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
> >   So at first I thought the code cache flushing was the problem,
> >   but I don't see how that could be, at least from a TCGContext
> >   viewpoint -- I agree that clearing the hash table in
> >   tcg_region_assign is a good place to do so.
>
> Ho hum.
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 639f0b2728..115ea186e5 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1831,10 +1831,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      existing_tb = tb_link_page(tb, phys_pc, phys_page2);
>      /* if the TB already exists, discard what we just translated */
>      if (unlikely(existing_tb != tb)) {
> -        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> -
> -        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> -        atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
>          return existing_tb;
>      }
>      tcg_tb_insert(tb);
>
> We can't easily undo the hash table insert, and for a relatively rare
> occurrence it's not worth the effort.

Nice catch! Everything works now =D

In the bootup+shutdown aarch64 test with -smp 12, we end up discarding
~2500 TBs -- that's ~439K of space for code that we do not waste; note
that I'm assuming 180 host bytes per TB, which is the average reported
by "info jit".

We can still discard most of these by incrementing a counter every time
we insert a new element into the OOL table, and checking this counter
before/after tcg_gen_code. (Note that checking g_hash_table_size
before/after is not enough, because we might have replaced an existing
item in the table.) Then, we discard a TB iff no OOL thunk was
generated for it. (Diff below.)
This allows us to discard most TBs; in the example above, we end up
*not* discarding only ~70 TBs, that is we end up keeping only
70/2500 = 2.8% of the TBs that we'd discard without OOL.

Performance-wise it doesn't make a difference for -smp 1:

Host: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (5 runs):

- Before (3.1.0-rc1):

    14351.436177      task-clock (msec)        #    0.998 CPUs utilized            ( +-  0.24% )
   49,963,260,126     cycles                   #    3.481 GHz                      ( +-  0.22% )  (83.32%)
   26,047,650,654     stalled-cycles-frontend  #   52.13% frontend cycles idle     ( +-  0.29% )  (83.34%)
   19,717,480,482     stalled-cycles-backend   #   39.46% backend cycles idle      ( +-  0.27% )  (66.67%)
   59,278,011,067     instructions             #    1.19  insns per cycle
                                               #    0.44  stalled cycles per insn  ( +-  0.17% )  (83.34%)
   10,632,601,608     branches                 #  740.874 M/sec                    ( +-  0.17% )  (83.34%)
      236,153,469     branch-misses            #    2.22% of all branches          ( +-  0.16% )  (83.35%)

     14.382847823 seconds time elapsed                                             ( +-  0.25% )

- After this series (with the fixes we've discussed):

    13256.198927      task-clock (msec)        #    0.998 CPUs utilized            ( +-  0.04% )
   46,146,457,353     cycles                   #    3.481 GHz                      ( +-  0.08% )  (83.34%)
   22,632,342,565     stalled-cycles-frontend  #   49.04% frontend cycles idle     ( +-  0.12% )  (83.35%)
   16,534,690,741     stalled-cycles-backend   #   35.83% backend cycles idle      ( +-  0.15% )  (66.67%)
   58,047,832,548     instructions             #    1.26  insns per cycle
                                               #    0.39  stalled cycles per insn  ( +-  0.18% )  (83.34%)
   11,031,634,880     branches                 #  832.187 M/sec                    ( +-  0.12% )  (83.33%)
      210,593,929     branch-misses            #    1.91% of all branches          ( +-  0.30% )  (83.33%)

     13.285023783 seconds time elapsed                                             ( +-  0.05% )

- After the fixup below:

    13240.889734      task-clock (msec)        #    0.998 CPUs utilized            ( +-  0.19% )
   46,074,292,775     cycles                   #    3.480 GHz                      ( +-  0.12% )  (83.35%)
   22,670,132,770     stalled-cycles-frontend  #   49.20% frontend cycles idle     ( +-  0.17% )  (83.35%)
   16,598,822,504     stalled-cycles-backend   #   36.03% backend cycles idle      ( +-  0.26% )  (66.66%)
   57,796,083,344     instructions             #    1.25  insns per cycle
                                               #    0.39  stalled cycles per insn  ( +-  0.16% )  (83.34%)
   11,002,340,174     branches                 #  830.937 M/sec                    ( +-  0.11% )  (83.35%)
      211,023,549     branch-misses            #    1.92% of all branches          ( +-  0.22% )  (83.32%)

     13.264499034 seconds time elapsed                                             ( +-  0.19% )

I'll generate now some more perf numbers that we could include in the
commit logs.

Thanks,
		Emilio

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 115ea18..15f7d4e 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1678,6 +1678,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size;
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    size_t n_ool_thunks;
+#endif
 #ifdef CONFIG_PROFILER
     TCGProfile *prof = &tcg_ctx->prof;
     int64_t ti;
@@ -1744,6 +1747,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     ti = profile_getclock();
 #endif
 
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    n_ool_thunks = tcg_ctx->n_ool_thunks;
+#endif
+
     /* ??? Overflow could be handled better here.  In particular, we
        don't need to re-do gen_intermediate_code, nor should we re-do
        the tcg optimization currently hidden inside tcg_gen_code.  All
@@ -1831,6 +1838,18 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
     if (unlikely(existing_tb != tb)) {
+        bool discard = true;
+
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+        /* only discard the TB if we didn't generate an OOL thunk */
+        discard = tcg_ctx->n_ool_thunks == n_ool_thunks;
+#endif
+        if (discard) {
+            uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+            orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+            atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
+        }
         return existing_tb;
     }
     tcg_tb_insert(tb);
diff --git a/tcg/tcg-ldst-ool.inc.c b/tcg/tcg-ldst-ool.inc.c
index 8fb6550..61da060 100644
--- a/tcg/tcg-ldst-ool.inc.c
+++ b/tcg/tcg-ldst-ool.inc.c
@@ -69,6 +69,7 @@ static bool tcg_out_ldst_ool_finalize(TCGContext *s)
 
         /* Remember the thunk for next time.  */
         g_hash_table_replace(s->ldst_ool_thunks, key, dest);
+        s->n_ool_thunks++;
 
         /* The new thunk must be in range.  */
         ok = patch_reloc(lb->label, lb->reloc, (intptr_t)dest, lb->addend);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1255d2a..d4f07a6 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -709,6 +709,7 @@ struct TCGContext {
 #ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
     QSIMPLEQ_HEAD(ldst_labels, TCGLabelQemuLdstOol) ldst_ool_labels;
     GHashTable *ldst_ool_thunks;
+    size_t n_ool_thunks;
 #endif
 #ifdef TCG_TARGET_NEED_POOL_LABELS
     struct TCGLabelPoolData *pool_labels;
On Thu, Nov 15, 2018 at 20:13:38 -0500, Emilio G. Cota wrote:
> I'll generate now some more perf numbers that we could include in the
> commit logs.

SPEC numbers are a net perf decrease, unfortunately:

  [Flattened ASCII gnuplot chart: "Softmmu speedup for SPEC06int (test
   set)", before vs. after, across the SPEC06int benchmarks (including
   401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 462.libquantum,
   464.h264ref, 471.omnetpp, 483.xalancbmk) and their geomean; most
   results fall between roughly 0.85x and 1.0x, with the geomean
   below 1.0.]

png: https://imgur.com/aO39gyP

Turns out that the additional instructions are the problem, despite the
much lower icache miss rate. For instance, here are some numbers for
h264ref running on the not-so-recent Xeon E5-2643 (i.e. Sandy Bridge):

- Before:

 1,137,737,512,668      instructions             #    2.02  insns per cycle
   563,574,505,040      cycles
     5,663,616,681      L1-icache-load-misses

     164.091239774 seconds time elapsed

- After:

 1,216,600,582,476      instructions             #    2.06  insns per cycle
   591,888,969,223      cycles
     3,082,426,508      L1-icache-load-misses

     172.232292897 seconds time elapsed

It's possible that newer machines with larger reorder buffers will be
able to take better advantage of the higher instruction locality, hiding
the latency of having to execute more instructions. I'll test on Skylake
tomorrow.

Thanks,
		E.
On 11/16/18 6:10 AM, Emilio G. Cota wrote:
> It's possible that newer machines with larger reorder buffers
> will be able to take better advantage of the higher instruction
> locality, hiding the latency of having to execute more instructions.
> I'll test on Skylake tomorrow.

I've noticed that the code we generate for calls has twice as many
instructions as really needed for setting up the arguments.  I have a
plan to fix that, which hopefully will solve this problem.

r~
On 11/16/18 2:13 AM, Emilio G. Cota wrote:
> This allows us to discard most TBs; in the example above,
> we end up *not* discarding only ~70 TBs, that is we end up keeping
> only 70/2500 = 2.8% of the TBs that we'd discard without OOL.

Thanks.

When I apply this I think I'll rename "n_ool_thunks" to "ool_generation".
We don't care about the absolute count, but any change.

r~
On Fri, Nov 16, 2018 at 09:07:50 +0100, Richard Henderson wrote:
> On 11/16/18 6:10 AM, Emilio G. Cota wrote:
> > It's possible that newer machines with larger reorder buffers
> > will be able to take better advantage of the higher instruction
> > locality, hiding the latency of having to execute more instructions.
> > I'll test on Skylake tomorrow.
>
> I've noticed that the code we generate for calls has twice as many
> instructions as really needed for setting up the arguments.  I have a
> plan to fix that, which hopefully will solve this problem.

Ah, that's great. I'll do more tests and a full review when those
changes come out, then.

Thanks,
		E.
On Fri, Nov 16, 2018 at 09:10:32 +0100, Richard Henderson wrote:
> On 11/16/18 2:13 AM, Emilio G. Cota wrote:
> > This allows us to discard most TBs; in the example above,
> > we end up *not* discarding only ~70 TBs, that is we end up keeping
> > only 70/2500 = 2.8% of the TBs that we'd discard without OOL.
>
> Thanks.
>
> When I apply this I think I'll rename "n_ool_thunks" to "ool_generation".
> We don't care about the absolute count, but any change.

Agreed, that's much better.

Thanks,
		E.