Message ID: 20181112214503.22941-1-richard.henderson@linaro.org
Series:     tcg: Move softmmu out-of-line
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20181112214503.22941-1-richard.henderson@linaro.org
Subject: [Qemu-devel] [PATCH for-4.0 00/17] tcg: Move softmmu out-of-line

=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done
exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20181113083104.2692-1-aik@ozlabs.ru -> patchew/20181113083104.2692-1-aik@ozlabs.ru
Switched to a new branch 'test'
e69b196b86 tcg/arm: Use TCG_TARGET_NEED_LDST_OOL_LABELS
6fd30d145e tcg/arm: Force qemu_ld/st arguments into fixed registers
bb1d7b7a2a tcg/arm: Reduce the number of temps for tcg_out_tlb_read
58ad5e1707 tcg/arm: Add constraints for R0-R5
0ae2a6fe31 tcg/arm: Parameterize the temps for tcg_out_tlb_read
d9cc700ca9 tcg/aarch64: Use TCG_TARGET_NEED_LDST_OOL_LABELS
81d088cf09 tcg/aarch64: Use B not BL for tcg_out_goto_long
9b4254898d tcg/aarch64: Parameterize the temp for tcg_out_goto_long
88f66038f2 tcg/aarch64: Parameterize the temps for tcg_out_tlb_read
70870da04b tcg/aarch64: Add constraints for x0, x1, x2
c13bc30fe2 tcg/i386: Use TCG_TARGET_NEED_LDST_OOL_LABELS
f89e8cc968 tcg: Add TCG_TARGET_NEED_LDST_OOL_LABELS
61aa1913f6 tcg: Return success from patch_reloc
299cd2f2dc tcg/i386: Force qemu_ld/st arguments into fixed registers
29999f5f40 tcg/i386: Change TCG_REG_L[01] to not overlap function arguments
8c6efd075f tcg/i386: Return a base register from tcg_out_tlb_load
8c82ba4a0e tcg/i386: Add constraints for r8 and r9

=== OUTPUT BEGIN ===
Checking PATCH 1/17: tcg/i386: Add constraints for r8 and r9...
Checking PATCH 2/17: tcg/i386: Return a base register from tcg_out_tlb_load...
Checking PATCH 3/17: tcg/i386: Change TCG_REG_L[01] to not overlap function arguments...
Checking PATCH 4/17: tcg/i386: Force qemu_ld/st arguments into fixed registers...
Checking PATCH 5/17: tcg: Return success from patch_reloc...
Checking PATCH 6/17: tcg: Add TCG_TARGET_NEED_LDST_OOL_LABELS...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#13:
new file mode 100644

total: 0 errors, 1 warnings, 148 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 7/17: tcg/i386: Use TCG_TARGET_NEED_LDST_OOL_LABELS...
Checking PATCH 8/17: tcg/aarch64: Add constraints for x0, x1, x2...
Checking PATCH 9/17: tcg/aarch64: Parameterize the temps for tcg_out_tlb_read...
Checking PATCH 10/17: tcg/aarch64: Parameterize the temp for tcg_out_goto_long...
Checking PATCH 11/17: tcg/aarch64: Use B not BL for tcg_out_goto_long...
Checking PATCH 12/17: tcg/aarch64: Use TCG_TARGET_NEED_LDST_OOL_LABELS...
Checking PATCH 13/17: tcg/arm: Parameterize the temps for tcg_out_tlb_read...
Checking PATCH 14/17: tcg/arm: Add constraints for R0-R5...
Checking PATCH 15/17: tcg/arm: Reduce the number of temps for tcg_out_tlb_read...
Checking PATCH 16/17: tcg/arm: Force qemu_ld/st arguments into fixed registers...
Checking PATCH 17/17: tcg/arm: Use TCG_TARGET_NEED_LDST_OOL_LABELS...
ERROR: externs should be avoided in .c files
#168: FILE: tcg/arm/tcg-target.inc.c:1485:
+    TCGReg addrlo __attribute__((unused));

ERROR: externs should be avoided in .c files
#169: FILE: tcg/arm/tcg-target.inc.c:1486:
+    TCGReg addrhi __attribute__((unused));

ERROR: externs should be avoided in .c files
#170: FILE: tcg/arm/tcg-target.inc.c:1487:
+    TCGReg datalo __attribute__((unused));

ERROR: externs should be avoided in .c files
#171: FILE: tcg/arm/tcg-target.inc.c:1488:
+    TCGReg datahi __attribute__((unused));

ERROR: externs should be avoided in .c files
#233: FILE: tcg/arm/tcg-target.inc.c:1603:
+    TCGReg addrlo __attribute__((unused));

ERROR: externs should be avoided in .c files
#234: FILE: tcg/arm/tcg-target.inc.c:1604:
+    TCGReg addrhi __attribute__((unused));

ERROR: externs should be avoided in .c files
#235: FILE: tcg/arm/tcg-target.inc.c:1605:
+    TCGReg datalo __attribute__((unused));

ERROR: externs should be avoided in .c files
#236: FILE: tcg/arm/tcg-target.inc.c:1606:
+    TCGReg datahi __attribute__((unused));

total: 8 errors, 0 warnings, 373 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1

---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
On Mon, Nov 12, 2018 at 22:44:46 +0100, Richard Henderson wrote:
> Based on an idea forwarded by Emilio, which suggests a 5-6%
> speed gain is possible.  I have not spent too much time
> measuring this, as the code size gains are significant.

Nice!

> I believe that I posted an x86_64-only patch some time ago,
> but this now includes i386, aarch64 and arm32.  In late
> testing I do see some failures on i386, for sparc guest.  I'll
> follow up on that later.

The following might be related: I'm seeing segfaults with -smp 8
and beyond when doing bootup+shutdown of an aarch64 guest on an
x86-64 host.  -smp 1 is stable AFAICT.

The first commit that shows these crashes is
"tcg/i386: Use TCG_TARGET_NEED_LDST_OOL_LABELS", that is f7ec49a51c8
in your tcg-tlb-x86 github branch.

Thanks,
		Emilio
On 11/14/18 2:00 AM, Emilio G. Cota wrote:
> The following might be related: I'm seeing segfaults with -smp 8
> and beyond when doing bootup+shutdown of an aarch64 guest on
> an x86-64 host.

I'm not seeing that.  Anything else special on the command-line?
Are the segv in the code_gen_buffer or elsewhere?

r~
On Thu, Nov 15, 2018 at 12:32:00 +0100, Richard Henderson wrote:
> On 11/14/18 2:00 AM, Emilio G. Cota wrote:
> > The following might be related: I'm seeing segfaults with -smp 8
> > and beyond when doing bootup+shutdown of an aarch64 guest on
> > an x86-64 host.
>
> I'm not seeing that.  Anything else special on the command-line?
> Are the segv in the code_gen_buffer or elsewhere?

I just spent some time on this. I've noticed two issues:

- All TCG contexts end up using the same hash table, since we only
  allocate one table in tcg_context_init. This leads to memory
  corruption. This fixes it (confirmed that there aren't races with
  helgrind):

--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -763,6 +763,14 @@ void tcg_register_thread(void)
     err = tcg_region_initial_alloc__locked(tcg_ctx);
     g_assert(!err);
     qemu_mutex_unlock(&region.lock);
+
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    /* if n == 0, keep the hash table we allocated in tcg_context_init */
+    if (n) {
+        /* Both key and value are raw pointers. */
+        s->ldst_ool_thunks = g_hash_table_new(NULL, NULL);
+    }
+#endif
 }
 #endif /* !CONFIG_USER_ONLY */

- Segfault in code_gen_buffer. This one I don't have a fix for, but
  it's *much* easier to reproduce when -tb-size is very small, e.g.
  "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
  So at first I thought the code cache flushing was the problem, but
  I don't see how that could be, at least from a TCGContext viewpoint
  -- I agree that clearing the hash table in tcg_region_assign is a
  good place to do so.

Thanks,
		Emilio
On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> - All TCG contexts end up using the same hash table, since
>   we only allocate one table in tcg_context_init. This leads
>   to memory corruption.

Bah.  I forgot we only call tcg_context_init once.  Thanks for the fix.

In the meantime I've fixed the i386 problem.  A bit of silliness
wrt 64-bit stores.

r~
On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> - Segfault in code_gen_buffer. This one I don't have a fix for,
>   but it's *much* easier to reproduce when -tb-size is very small,
>   e.g. "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
>   So at first I thought the code cache flushing was the problem,
>   but I don't see how that could be, at least from a TCGContext
>   viewpoint -- I agree that clearing the hash table in
>   tcg_region_assign is a good place to do so.

Ho hum.

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 639f0b2728..115ea186e5 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1831,10 +1831,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
     if (unlikely(existing_tb != tb)) {
-        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
-
-        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
-        atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
         return existing_tb;
     }
     tcg_tb_insert(tb);

We can't easily undo the hash table insert, and for a relatively rare
occurrence it's not worth the effort.

r~
On Thu, Nov 15, 2018 at 23:04:50 +0100, Richard Henderson wrote:
> On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> > - Segfault in code_gen_buffer. This one I don't have a fix for,
> >   but it's *much* easier to reproduce when -tb-size is very small,
> >   e.g. "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
> >   So at first I thought the code cache flushing was the problem,
> >   but I don't see how that could be, at least from a TCGContext
> >   viewpoint -- I agree that clearing the hash table in
> >   tcg_region_assign is a good place to do so.
>
> Ho hum.
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 639f0b2728..115ea186e5 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1831,10 +1831,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      existing_tb = tb_link_page(tb, phys_pc, phys_page2);
>      /* if the TB already exists, discard what we just translated */
>      if (unlikely(existing_tb != tb)) {
> -        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> -
> -        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> -        atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
>          return existing_tb;
>      }
>      tcg_tb_insert(tb);
>
> We can't easily undo the hash table insert, and for a relatively rare
> occurrence it's not worth the effort.

Nice catch! Everything works now =D

In the bootup+shutdown aarch64 test with -smp 12, we end up discarding
~2500 TBs -- that's ~439K of space for code that we do not waste; note
that I'm assuming 180 host bytes per TB, which is the average reported
by "info jit".

We can still discard most of these by incrementing a counter every time
we insert a new element into the OOL table, and checking this counter
before/after tcg_gen_code. (Note that checking g_hash_table_size
before/after is not enough, because we might have replaced an existing
item in the table.) Then, we discard a TB iff no OOL thunk was
generated for it. (Diff below.)
This allows us to discard most TBs; in the example above, we end up
*not* discarding only ~70 TBs, that is we end up keeping only
70/2500 = 2.8% of the TBs that we'd discard without OOL.

Performance-wise it doesn't make a difference for -smp 1:

Host: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (5 runs):

- Before (3.1.0-rc1):

    14351.436177      task-clock (msec)        #    0.998 CPUs utilized            ( +-  0.24% )
   49,963,260,126     cycles                   #    3.481 GHz                      ( +-  0.22% )  (83.32%)
   26,047,650,654     stalled-cycles-frontend  #   52.13% frontend cycles idle     ( +-  0.29% )  (83.34%)
   19,717,480,482     stalled-cycles-backend   #   39.46% backend cycles idle      ( +-  0.27% )  (66.67%)
   59,278,011,067     instructions             #    1.19  insns per cycle
                                               #    0.44  stalled cycles per insn  ( +-  0.17% )  (83.34%)
   10,632,601,608     branches                 #  740.874 M/sec                    ( +-  0.17% )  (83.34%)
      236,153,469     branch-misses            #    2.22% of all branches          ( +-  0.16% )  (83.35%)

     14.382847823 seconds time elapsed                                             ( +-  0.25% )

- After this series (with the fixes we've discussed):

    13256.198927      task-clock (msec)        #    0.998 CPUs utilized            ( +-  0.04% )
   46,146,457,353     cycles                   #    3.481 GHz                      ( +-  0.08% )  (83.34%)
   22,632,342,565     stalled-cycles-frontend  #   49.04% frontend cycles idle     ( +-  0.12% )  (83.35%)
   16,534,690,741     stalled-cycles-backend   #   35.83% backend cycles idle      ( +-  0.15% )  (66.67%)
   58,047,832,548     instructions             #    1.26  insns per cycle
                                               #    0.39  stalled cycles per insn  ( +-  0.18% )  (83.34%)
   11,031,634,880     branches                 #  832.187 M/sec                    ( +-  0.12% )  (83.33%)
      210,593,929     branch-misses            #    1.91% of all branches          ( +-  0.30% )  (83.33%)

     13.285023783 seconds time elapsed                                             ( +-  0.05% )

- After the fixup below:

    13240.889734      task-clock (msec)        #    0.998 CPUs utilized            ( +-  0.19% )
   46,074,292,775     cycles                   #    3.480 GHz                      ( +-  0.12% )  (83.35%)
   22,670,132,770     stalled-cycles-frontend  #   49.20% frontend cycles idle     ( +-  0.17% )  (83.35%)
   16,598,822,504     stalled-cycles-backend   #   36.03% backend cycles idle      ( +-  0.26% )  (66.66%)
   57,796,083,344     instructions             #    1.25  insns per cycle
                                               #    0.39  stalled cycles per insn  ( +-  0.16% )  (83.34%)
   11,002,340,174     branches                 #  830.937 M/sec                    ( +-  0.11% )  (83.35%)
      211,023,549     branch-misses            #    1.92% of all branches          ( +-  0.22% )  (83.32%)

     13.264499034 seconds time elapsed                                             ( +-  0.19% )

I'll generate now some more perf numbers that we could include in the
commit logs.

Thanks,
		Emilio

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 115ea18..15f7d4e 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1678,6 +1678,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size;
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    size_t n_ool_thunks;
+#endif
 #ifdef CONFIG_PROFILER
     TCGProfile *prof = &tcg_ctx->prof;
     int64_t ti;
@@ -1744,6 +1747,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     ti = profile_getclock();
 #endif
 
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    n_ool_thunks = tcg_ctx->n_ool_thunks;
+#endif
+
     /* ??? Overflow could be handled better here.  In particular, we
        don't need to re-do gen_intermediate_code, nor should we re-do
        the tcg optimization currently hidden inside tcg_gen_code.  All
@@ -1831,6 +1838,18 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
     if (unlikely(existing_tb != tb)) {
+        bool discard = true;
+
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+        /* only discard the TB if we didn't generate an OOL thunk */
+        discard = tcg_ctx->n_ool_thunks == n_ool_thunks;
+#endif
+        if (discard) {
+            uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+            orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+            atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
+        }
         return existing_tb;
     }
     tcg_tb_insert(tb);
diff --git a/tcg/tcg-ldst-ool.inc.c b/tcg/tcg-ldst-ool.inc.c
index 8fb6550..61da060 100644
--- a/tcg/tcg-ldst-ool.inc.c
+++ b/tcg/tcg-ldst-ool.inc.c
@@ -69,6 +69,7 @@ static bool tcg_out_ldst_ool_finalize(TCGContext *s)
 
         /* Remember the thunk for next time.  */
         g_hash_table_replace(s->ldst_ool_thunks, key, dest);
+        s->n_ool_thunks++;
 
         /* The new thunk must be in range.  */
         ok = patch_reloc(lb->label, lb->reloc, (intptr_t)dest, lb->addend);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1255d2a..d4f07a6 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -709,6 +709,7 @@ struct TCGContext {
 #ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
     QSIMPLEQ_HEAD(ldst_labels, TCGLabelQemuLdstOol) ldst_ool_labels;
     GHashTable *ldst_ool_thunks;
+    size_t n_ool_thunks;
 #endif
 #ifdef TCG_TARGET_NEED_POOL_LABELS
     struct TCGLabelPoolData *pool_labels;
On Thu, Nov 15, 2018 at 20:13:38 -0500, Emilio G. Cota wrote:
> I'll generate now some more perf numbers that we could include in the
> commit logs.

SPEC numbers are a net perf decrease, unfortunately:

  [Flattened ASCII gnuplot chart: "Softmmu speedup for SPEC06int (test
   set)", before vs. after, across the SPEC06int benchmarks (including
   401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 462.libquantum,
   464.h264ref, 471.omnetpp, 483.xalancbmk) and their geomean; most
   results fall between roughly 0.85x and 1.0x, with the geomean
   below 1.0.]

png: https://imgur.com/aO39gyP

Turns out that the additional instructions are the problem, despite the
much lower icache miss rate. For instance, here are some numbers for
h264ref running on the not-so-recent Xeon E5-2643 (i.e. Sandy Bridge):

- Before:

 1,137,737,512,668      instructions             #    2.02  insns per cycle
   563,574,505,040      cycles
     5,663,616,681      L1-icache-load-misses

     164.091239774 seconds time elapsed

- After:

 1,216,600,582,476      instructions             #    2.06  insns per cycle
   591,888,969,223      cycles
     3,082,426,508      L1-icache-load-misses

     172.232292897 seconds time elapsed

It's possible that newer machines with larger reorder buffers will be
able to take better advantage of the higher instruction locality, hiding
the latency of having to execute more instructions. I'll test on Skylake
tomorrow.

Thanks,
		E.
On 11/16/18 6:10 AM, Emilio G. Cota wrote:
> It's possible that newer machines with larger reorder buffers
> will be able to take better advantage of the higher instruction
> locality, hiding the latency of having to execute more instructions.
> I'll test on Skylake tomorrow.

I've noticed that the code we generate for calls has twice as many
instructions as really needed for setting up the arguments.  I have a
plan to fix that, which hopefully will solve this problem.

r~
On 11/16/18 2:13 AM, Emilio G. Cota wrote:
> This allows us to discard most TBs; in the example above,
> we end up *not* discarding only ~70 TBs, that is we end up keeping
> only 70/2500 = 2.8% of the TBs that we'd discard without OOL.

Thanks.

When I apply this I think I'll rename "n_ool_thunks" to "ool_generation".
We don't care about the absolute count, but any change.

r~
On Fri, Nov 16, 2018 at 09:07:50 +0100, Richard Henderson wrote:
> On 11/16/18 6:10 AM, Emilio G. Cota wrote:
> > It's possible that newer machines with larger reorder buffers
> > will be able to take better advantage of the higher instruction
> > locality, hiding the latency of having to execute more instructions.
> > I'll test on Skylake tomorrow.
>
> I've noticed that the code we generate for calls has twice as many
> instructions as really needed for setting up the arguments.  I have a
> plan to fix that, which hopefully will solve this problem.

Ah, that's great. I'll do more tests and a full review when those
changes come out, then.

Thanks,
		E.
On Fri, Nov 16, 2018 at 09:10:32 +0100, Richard Henderson wrote:
> On 11/16/18 2:13 AM, Emilio G. Cota wrote:
> > This allows us to discard most TBs; in the example above,
> > we end up *not* discarding only ~70 TBs, that is we end up keeping
> > only 70/2500 = 2.8% of the TBs that we'd discard without OOL.
>
> Thanks.
>
> When I apply this I think I'll rename "n_ool_thunks" to "ool_generation".
> We don't care about the absolute count, but any change.

Agreed, that's much better.

Thanks,
		E.