
[for-10.0,v2,00/54] accel/tcg: Convert victim tlb to IntervalTree

Message ID 20241114160131.48616-1-richard.henderson@linaro.org
Series accel/tcg: Convert victim tlb to IntervalTree

Message

Richard Henderson Nov. 14, 2024, 4 p.m. UTC
v1: 20241009150855.804605-1-richard.henderson@linaro.org

The initial idea was: how much can we do with an intelligent data
structure for the same cost as a linear search through an array?
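For readers new to this corner of cputlb: the victim tlb being replaced is a
small fixed-size array that is scanned linearly on a fast-path miss, while this
series keeps the evicted mappings in a per-mmuidx interval tree keyed on the
guest page range. Below is a rough sketch of the two lookup styles, written
against the existing util/interval-tree API (interval_tree_iter_first etc.);
the CPUTLBEntryTree layout and the desc->iroot field name are illustrative
here, not necessarily the exact ones used in the patches.

    /* Old scheme: linear scan over CPU_VTLB_SIZE (8) victim entries. */
    for (vidx = 0; vidx < CPU_VTLB_SIZE; ++vidx) {
        CPUTLBEntry *vtlb = &desc->vtable[vidx];
        if (tlb_hit_page(tlb_read_idx(vtlb, access_type), page)) {
            /* Hit: swap the victim entry back into the fast-path table. */
        }
    }

    /* New scheme: O(log n) stabbing query on an interval tree of pages. */
    typedef struct CPUTLBEntryTree {
        IntervalTreeNode itree;   /* itree.start/itree.last cover the page */
        CPUTLBEntry copy;         /* data used to refill the fast-path entry */
        CPUTLBEntryFull full;
    } CPUTLBEntryTree;

    IntervalTreeNode *node = interval_tree_iter_first(&desc->iroot, addr, addr);
    if (node) {
        CPUTLBEntryTree *t = container_of(node, CPUTLBEntryTree, itree);
        /* Hit: refill the fast-path CPUTLBEntry from t->copy and t->full. */
    }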


r~


Richard Henderson (54):
  util/interval-tree: Introduce interval_tree_free_nodes
  accel/tcg: Split out tlbfast_flush_locked
  accel/tcg: Split out tlbfast_{index,entry}
  accel/tcg: Split out tlbfast_flush_range_locked
  accel/tcg: Fix flags usage in mmu_lookup1, atomic_mmu_lookup
  accel/tcg: Assert non-zero length in tlb_flush_range_by_mmuidx*
  accel/tcg: Assert bits in range in tlb_flush_range_by_mmuidx*
  accel/tcg: Flush entire tlb when a masked range wraps
  accel/tcg: Add IntervalTreeRoot to CPUTLBDesc
  accel/tcg: Populate IntervalTree in tlb_set_page_full
  accel/tcg: Remove IntervalTree entry in tlb_flush_page_locked
  accel/tcg: Remove IntervalTree entries in tlb_flush_range_locked
  accel/tcg: Process IntervalTree entries in tlb_reset_dirty
  accel/tcg: Process IntervalTree entries in tlb_set_dirty
  accel/tcg: Use tlb_hit_page in victim_tlb_hit
  accel/tcg: Pass full addr to victim_tlb_hit
  accel/tcg: Replace victim_tlb_hit with tlbtree_hit
  accel/tcg: Remove the victim tlb
  accel/tcg: Remove tlb_n_used_entries_inc
  include/exec/tlb-common: Move CPUTLBEntryFull from hw/core/cpu.h
  accel/tcg: Delay plugin adjustment in probe_access_internal
  accel/tcg: Call cpu_ld*_code_mmu from cpu_ld*_code
  accel/tcg: Check original prot bits for read in atomic_mmu_lookup
  accel/tcg: Preserve tlb flags in tlb_set_compare
  accel/tcg: Return CPUTLBEntryFull not pointer in probe_access_full_mmu
  accel/tcg: Return CPUTLBEntryFull not pointer in probe_access_full
  accel/tcg: Return CPUTLBEntryFull not pointer in probe_access_internal
  accel/tcg: Introduce tlb_lookup
  accel/tcg: Partially unify MMULookupPageData and TLBLookupOutput
  accel/tcg: Merge mmu_lookup1 into mmu_lookup
  accel/tcg: Always use IntervalTree for code lookups
  accel/tcg: Link CPUTLBEntry to CPUTLBEntryTree
  accel/tcg: Remove CPUTLBDesc.fulltlb
  target/alpha: Convert to TCGCPUOps.tlb_fill_align
  target/avr: Convert to TCGCPUOps.tlb_fill_align
  target/i386: Convert to TCGCPUOps.tlb_fill_align
  target/loongarch: Convert to TCGCPUOps.tlb_fill_align
  target/m68k: Convert to TCGCPUOps.tlb_fill_align
  target/m68k: Do not call tlb_set_page in helper_ptest
  target/microblaze: Convert to TCGCPUOps.tlb_fill_align
  target/mips: Convert to TCGCPUOps.tlb_fill_align
  target/openrisc: Convert to TCGCPUOps.tlb_fill_align
  target/ppc: Convert to TCGCPUOps.tlb_fill_align
  target/riscv: Convert to TCGCPUOps.tlb_fill_align
  target/rx: Convert to TCGCPUOps.tlb_fill_align
  target/s390x: Convert to TCGCPUOps.tlb_fill_align
  target/sh4: Convert to TCGCPUOps.tlb_fill_align
  target/sparc: Convert to TCGCPUOps.tlb_fill_align
  target/tricore: Convert to TCGCPUOps.tlb_fill_align
  target/xtensa: Convert to TCGCPUOps.tlb_fill_align
  accel/tcg: Drop TCGCPUOps.tlb_fill
  accel/tcg: Unexport tlb_set_page*
  accel/tcg: Merge tlb_fill_align into callers
  accel/tcg: Return CPUTLBEntryTree from tlb_set_page_full

 include/exec/cpu-all.h               |   3 +
 include/exec/exec-all.h              |  65 +-
 include/exec/tlb-common.h            |  68 +-
 include/hw/core/cpu.h                |  75 +-
 include/hw/core/tcg-cpu-ops.h        |  10 -
 include/qemu/interval-tree.h         |  11 +
 target/alpha/cpu.h                   |   6 +-
 target/avr/cpu.h                     |   7 +-
 target/i386/tcg/helper-tcg.h         |   6 +-
 target/loongarch/internals.h         |   7 +-
 target/m68k/cpu.h                    |   7 +-
 target/microblaze/cpu.h              |   7 +-
 target/mips/tcg/tcg-internal.h       |   6 +-
 target/openrisc/cpu.h                |   8 +-
 target/ppc/internal.h                |   7 +-
 target/riscv/cpu.h                   |   8 +-
 target/s390x/s390x-internal.h        |   7 +-
 target/sh4/cpu.h                     |   8 +-
 target/sparc/cpu.h                   |   8 +-
 target/tricore/cpu.h                 |   7 +-
 target/xtensa/cpu.h                  |   8 +-
 accel/tcg/cputlb.c                   | 994 +++++++++++++--------------
 target/alpha/cpu.c                   |   2 +-
 target/alpha/helper.c                |  23 +-
 target/arm/ptw.c                     |  10 +-
 target/arm/tcg/helper-a64.c          |   4 +-
 target/arm/tcg/mte_helper.c          |  15 +-
 target/arm/tcg/sve_helper.c          |   6 +-
 target/avr/cpu.c                     |   2 +-
 target/avr/helper.c                  |  19 +-
 target/i386/tcg/sysemu/excp_helper.c |  36 +-
 target/i386/tcg/tcg-cpu.c            |   2 +-
 target/loongarch/cpu.c               |   2 +-
 target/loongarch/tcg/tlb_helper.c    |  17 +-
 target/m68k/cpu.c                    |   2 +-
 target/m68k/helper.c                 |  32 +-
 target/microblaze/cpu.c              |   2 +-
 target/microblaze/helper.c           |  33 +-
 target/mips/cpu.c                    |   2 +-
 target/mips/tcg/sysemu/tlb_helper.c  |  29 +-
 target/openrisc/cpu.c                |   2 +-
 target/openrisc/mmu.c                |  39 +-
 target/ppc/cpu_init.c                |   2 +-
 target/ppc/mmu_helper.c              |  21 +-
 target/riscv/cpu_helper.c            |  22 +-
 target/riscv/tcg/tcg-cpu.c           |   2 +-
 target/rx/cpu.c                      |  19 +-
 target/s390x/cpu.c                   |   4 +-
 target/s390x/tcg/excp_helper.c       |  23 +-
 target/sh4/cpu.c                     |   2 +-
 target/sh4/helper.c                  |  24 +-
 target/sparc/cpu.c                   |   2 +-
 target/sparc/mmu_helper.c            |  44 +-
 target/tricore/cpu.c                 |   2 +-
 target/tricore/helper.c              |  19 +-
 target/xtensa/cpu.c                  |   2 +-
 target/xtensa/helper.c               |  28 +-
 util/interval-tree.c                 |  20 +
 util/selfmap.c                       |  13 +-
 59 files changed, 938 insertions(+), 923 deletions(-)

Comments

Pierrick Bouvier Nov. 14, 2024, 7:56 p.m. UTC | #1
On 11/14/24 08:00, Richard Henderson wrote:
> v1: 20241009150855.804605-1-richard.henderson@linaro.org
> 
> The initial idea was: how much can we do with an intelligent data
> structure for the same cost as a linear search through an array?
> 
> 
> r~
> 
> [...]
> 

I tested this change by booting a Debian x86_64 image; it works as expected.

I noticed that this change does not come for free (64s before, 82s after - 
roughly a 1.3x slowdown). Is that acceptable?

Pierrick
Richard Henderson Nov. 14, 2024, 8:58 p.m. UTC | #2
On 11/14/24 11:56, Pierrick Bouvier wrote:
> I tested this change by booting a debian x86_64 image, it works as expected.
> 
> I noticed that this change does not come for free (64s before, 82s after - 1.3x). Is that 
> acceptable?
Well, no.  But I didn't notice any change during boot tests.  I used hyperfine over 'make 
check-functional'.

I would only expect benefits to be seen in longer-lived VMs, since a boot test 
doesn't run applications long enough for TLB entries to accumulate.  I have not 
attempted to create a reproducible test for that so far.


r~
Pierrick Bouvier Nov. 14, 2024, 9:05 p.m. UTC | #3
On 11/14/24 12:58, Richard Henderson wrote:
> On 11/14/24 11:56, Pierrick Bouvier wrote:
>> I tested this change by booting a debian x86_64 image, it works as expected.
>>
>> I noticed that this change does not come for free (64s before, 82s after - 1.3x). Is that
>> acceptable?
> Well, no.  But I didn't notice any change during boot tests.  I used hyperfine over 'make
> check-functional'.
> 
> I would only expect benefits to be seen during longer lived vm's, since a boot test
> doesn't run applications long enough to see tlb entries accumulate.  I have not attempted
> to create a reproducible test for that so far.
> 
> 

I didn't use check-functional either.
I used a vanilla Debian bookworm install with a modified /etc/rc.local 
calling poweroff, and ran it 3 times with and without the change, with 
turbo disabled on my CPU.

> r~
Alex Bennée Nov. 15, 2024, 11:43 a.m. UTC | #4
Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:

> On 11/14/24 12:58, Richard Henderson wrote:
>> On 11/14/24 11:56, Pierrick Bouvier wrote:
>>> I tested this change by booting a debian x86_64 image, it works as expected.
>>>
>>> I noticed that this change does not come for free (64s before, 82s after - 1.3x). Is that
>>> acceptable?
>> Well, no.  But I didn't notice any change during boot tests.  I used hyperfine over 'make
>> check-functional'.
>> I would only expect benefits to be seen during longer lived vm's,
>> since a boot test
>> doesn't run applications long enough to see tlb entries accumulate.  I have not attempted
>> to create a reproducible test for that so far.
>> 
>
> I didn't use check-functional neither.
> I used a vanilla debian bookworm install, with a modified
> /etc/rc.local calling poweroff, and ran 3 times with/without change
> with turbo disabled on my cpu.

If you want to really stress the VM handling you should use stress-ng to
exercise page faulting and recovery. Wrap it up in a systemd unit for a
reproducible test:

  cat /etc/systemd/system/benchmark-stress-ng.service 
  # A benchmark target
  #
  # This shuts down once the boot has completed

  [Unit]
  Description=Default
  Requires=basic.target
  After=basic.target
  AllowIsolate=yes

  [Service]
  Type=oneshot
  ExecStart=stress-ng --perf --iomix 4 --vm 2 --timeout 10s
  ExecStartPost=/sbin/poweroff

  [Install]
  WantedBy=multi-user.target

and then call with something like:

  -append "root=/dev/sda2 console=ttyAMA0 systemd.unit=benchmark-stress-ng.service"

>
>> r~
Pierrick Bouvier Nov. 15, 2024, 5:44 p.m. UTC | #5
On 11/15/24 03:43, Alex Bennée wrote:
> Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:
> 
>> On 11/14/24 12:58, Richard Henderson wrote:
>>> On 11/14/24 11:56, Pierrick Bouvier wrote:
>>>> I tested this change by booting a debian x86_64 image, it works as expected.
>>>>
>>>> I noticed that this change does not come for free (64s before, 82s after - 1.3x). Is that
>>>> acceptable?
>>> Well, no.  But I didn't notice any change during boot tests.  I used hyperfine over 'make
>>> check-functional'.
>>> I would only expect benefits to be seen during longer lived vm's,
>>> since a boot test
>>> doesn't run applications long enough to see tlb entries accumulate.  I have not attempted
>>> to create a reproducible test for that so far.
>>>
>>
>> I didn't use check-functional neither.
>> I used a vanilla debian bookworm install, with a modified
>> /etc/rc.local calling poweroff, and ran 3 times with/without change
>> with turbo disabled on my cpu.
> 
> If you want to really stress the VM handling you should use stress-ng to
> exercise page faulting and recovery. Wrap it up in a systemd unit for a
> reproducible test:
> 
> [...]
> 

Thanks for the advice.

>>
>>> r~
>