[RFC,0/5] Experimenting with tb-lookup tweaks

Message ID: 20210224165811.11567-1-alex.bennee@linaro.org
From: Alex Bennée <alex.bennee@linaro.org>
To: richard.henderson@linaro.org
Cc: cota@braap.org, Alex Bennée <alex.bennee@linaro.org>, qemu-devel@nongnu.org
Subject: [RFC PATCH 0/5] Experimenting with tb-lookup tweaks
Date: Wed, 24 Feb 2021 16:58:06 +0000
Series: Experimenting with tb-lookup tweaks

Message

Alex Bennée Feb. 24, 2021, 4:58 p.m. UTC

Hi Richard,

Well I spun up some of the ideas we talked about to see if there was
anything to be squeezed out of the function. In the end the results
seem to be a washout with my pigz benchmark:

 qemu-system-aarch64 -cpu cortex-a57 \
   -machine type=virt,virtualization=on,gic-version=3 \
   -serial mon:stdio \
   -netdev user,id=unet,hostfwd=tcp::2222-:22 \
   -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
   -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
   -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
   -device scsi-hd,drive=hd,id=virt-scsi-hd \
   -smp 4 -m 4096 \
   -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
   -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
   -display none -snapshot

| Command | Mean [s]       | Min [s] | Max [s] | Relative |
|---------+----------------+---------+---------+----------|
| Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
| After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |
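As a rough reading of the table, the two means can be compared against the run-to-run spread. The sketch below is illustrative only: the struct and function names are hypothetical, and the numbers are copied from the two rows above.

```c
#include <stdio.h>

/* Summary statistics copied from the benchmark table (seconds). */
struct bench_result {
    const char *name;
    double mean;
    double stddev;
};

/* Ratio of two means: a value near 1.0 means no measurable change. */
static double relative_mean(struct bench_result base,
                            struct bench_result other)
{
    return other.mean / base.mean;
}

/* Difference of the means in units of the baseline's standard deviation. */
static double delta_in_stddevs(struct bench_result base,
                               struct bench_result other)
{
    return (other.mean - base.mean) / base.stddev;
}

/* Print the comparison for the two rows of the table. */
static void report(void)
{
    struct bench_result before = { "Before", 46.597, 2.482 };
    struct bench_result after  = { "After",  46.867, 2.242 };

    printf("%s vs %s: relative %.3f, delta %.2f stddev\n",
           after.name, before.name,
           relative_mean(before, after),
           delta_in_stddevs(before, after));
}
```

The ratio comes out at about 1.006 and the difference at about a tenth of one standard deviation, which is why the two rows are indistinguishable at this sample size.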

Maybe the code cleanup itself makes it worthwhile. WDYT?

Alex Bennée (5):
  accel/tcg: rename tb_lookup__cpu_state and hoist state extraction
  accel/tcg: move CF_CLUSTER calculation to curr_cflags
  accel/tcg: drop the use of CF_HASH_MASK and rename params
  include/exec: lightly re-arrange TranslationBlock
  include/exec/tb-lookup: try and reduce branch prediction issues

 include/exec/exec-all.h   | 20 +++++++++++---------
 include/exec/tb-lookup.h  | 34 +++++++++++++++++-----------------
 accel/tcg/cpu-exec.c      | 31 ++++++++++++++++++-------------
 accel/tcg/tcg-runtime.c   |  6 ++++--
 accel/tcg/translate-all.c | 14 ++++++++------
 softmmu/physmem.c         |  2 +-
 6 files changed, 59 insertions(+), 48 deletions(-)

-- 
2.20.1

Comments

Richard Henderson Feb. 25, 2021, 12:28 a.m. UTC | #1

On 2/24/21 8:58 AM, Alex Bennée wrote:
> Hi Richard,
>
> Well I spun up some of the ideas we talked about to see if there was
> anything to be squeezed out of the function. In the end the results
> seem to be a washout with my pigz benchmark:
>
>  qemu-system-aarch64 -cpu cortex-a57 \
>    -machine type=virt,virtualization=on,gic-version=3 \
>    -serial mon:stdio \
>    -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>    -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
>    -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
>    -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
>    -device scsi-hd,drive=hd,id=virt-scsi-hd \
>    -smp 4 -m 4096 \
>    -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
>    -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
>    -display none -snapshot
>
> | Command | Mean [s]       | Min [s] | Max [s] | Relative |
> |---------+----------------+---------+---------+----------|
> | Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
> | After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |

Well that's disappointing.

> Maybe the code cleanup itself makes it worthwhile. WDYT?

I think there's little doubt that the first 3 patches are a good code cleanup.

Patch 4 I think is still beneficial, simply so that we can add that "Above
fields" comment.

Patch 5 would only be worthwhile if we could measure any positive difference,
which it seems we cannot.

I have a follow-up patch to remove the parallel_cpus global variable which I
will post in a moment.  While it removes a handful of insns from this
fast-path, I doubt it helps.  But getting rid of a global is probably always
positive, no?

I was glancing through the lookup function for alpha instead of aarch64, and saw:

 21e:   33 43 18                xor    0x18(%rbx),%eax
 221:   4c 31 e1                xor    %r12,%rcx
 224:   44 31 ea                xor    %r13d,%edx
 227:   09 c2                   or     %eax,%edx
 229:   48 0b 4b 08             or     0x8(%rbx),%rcx

and thought -- hang on, how come we're just ORing, not XORing, here?  Of course
it's the cs_base field, which alpha has set to zero.  The compiler has
simplified bits |= 0 ^ tb->cs_base.
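To make the simplification concrete, here is a minimal sketch of the XOR/OR accumulation pattern. It is not the code from the series: SketchTB and both function names are hypothetical stand-ins, and the field widths are assumed for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the fields compared on lookup. */
typedef struct {
    uint64_t pc;
    uint64_t cs_base;
    uint32_t flags;
} SketchTB;

/*
 * Branch-free match: XOR yields zero for equal fields and OR accumulates
 * any difference, so a single test decides the whole comparison.
 */
static inline bool sketch_tb_match(const SketchTB *tb, uint64_t pc,
                                   uint64_t cs_base, uint32_t flags)
{
    uint64_t bits = tb->pc ^ pc;

    bits |= tb->cs_base ^ cs_base;
    bits |= tb->flags ^ flags;
    return bits == 0;
}

/*
 * Same match for a target whose cs_base is always zero: x ^ 0 == x, so
 * once the constant is visible to the compiler the XOR can fold away,
 * leaving only the OR of tb->cs_base.
 */
static inline bool sketch_tb_match_zero_cs_base(const SketchTB *tb,
                                                uint64_t pc, uint32_t flags)
{
    return sketch_tb_match(tb, pc, 0, flags);
}
```

The second function is the shape that matches the disassembly: the load from the block is still needed, but no XOR is.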

Which got me thinking: what if we had a per-cpu

typedef struct {
    target_ulong pc;
    ...
} TranslationBlockID;

static inline bool arch_tbid_cmp(TranslationBlockID x,
                                 TranslationBlockID y)
{
    return x.pc == y.pc && ...;
}

We could potentially reduce this to memcmp(&x, &y, sizeof(x)) == 0.

First, this would allow cs_base to be eliminated where it is not used.  Second,
this would allow cs_base to be renamed for the non-x86 targets for which it is
being abused.  Third, it would allow tb->flags to be either (a) elided or (b)
extended by the target as needed.

This final point is directed at ARM, of course, where we've overflowed the
uint32_t that is tb->flags.  We could now extend that to 64 bits.

Obviously, some tweaks to tb_hash_func would be required as well, but that's
manageable.
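A minimal sketch of how such an ID might look for a target that has no cs_base and wants 64-bit flags. Everything here is an assumption for illustration: SketchTBID, the helper names, and the hash are hypothetical, and the hash is a placeholder rather than QEMU's tb_hash_func.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical per-target ID: no cs_base, and flags widened to 64 bits.
 * Two 8-byte members leave no padding, which the byte-wise compare needs.
 */
typedef struct {
    uint64_t pc;
    uint64_t flags;
} SketchTBID;

_Static_assert(sizeof(SketchTBID) == 2 * sizeof(uint64_t),
               "padding would make memcmp() unreliable");

/* Field-by-field comparison, in the spirit of the arch_tbid_cmp() outline. */
static inline bool sketch_tbid_eq(SketchTBID x, SketchTBID y)
{
    return x.pc == y.pc && x.flags == y.flags;
}

/* Byte-wise comparison; equivalent only because the struct has no padding. */
static inline bool sketch_tbid_eq_bytes(const SketchTBID *x,
                                        const SketchTBID *y)
{
    return memcmp(x, y, sizeof(*x)) == 0;
}

/* Placeholder hash that mixes both fields, including the high flag bits. */
static inline uint32_t sketch_tbid_hash(SketchTBID id)
{
    uint64_t h = id.pc * 0x9E3779B97F4A7C15ull;

    h ^= id.flags + (h >> 32);
    return (uint32_t)(h ^ (h >> 32));
}
```

The static assertion is the important design point: a target whose ID contains padding would have to zero-initialise the whole struct or stick to the field-by-field form.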

What do you think about this last idea?

r~

Alex Bennée Feb. 25, 2021, 10:15 a.m. UTC | #2

Richard Henderson <richard.henderson@linaro.org> writes:

> On 2/24/21 8:58 AM, Alex Bennée wrote:
>> Hi Richard,
>>
>> Well I spun up some of the ideas we talked about to see if there was
>> anything to be squeezed out of the function. In the end the results
>> seem to be a washout with my pigz benchmark:
>>
>>  qemu-system-aarch64 -cpu cortex-a57 \
>>    -machine type=virt,virtualization=on,gic-version=3 \
>>    -serial mon:stdio \
>>    -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>>    -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
>>    -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
>>    -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
>>    -device scsi-hd,drive=hd,id=virt-scsi-hd \
>>    -smp 4 -m 4096 \
>>    -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
>>    -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
>>    -display none -snapshot
>>
>> | Command | Mean [s]       | Min [s] | Max [s] | Relative |
>> |---------+----------------+---------+---------+----------|
>> | Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
>> | After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |
>
> Well that's disappointing.
>
>> Maybe the code cleanup itself makes it worthwhile. WDYT?
>
> I think there's little doubt that the first 3 patches are a good code cleanup.
>
> Patch 4 I think is still beneficial, simply so that we can add that "Above
> fields" comment.
>
> Patch 5 would only be worthwhile if we could measure any positive difference,
> which it seems we cannot.
>
> I have a follow-up patch to remove the parallel_cpus global variable which I
> will post in a moment.  While it removes a handful of insns from this
> fast-path, I doubt it helps.  But getting rid of a global is probably always
> positive, no?
>
> I was glancing through the lookup function for alpha instead of aarch64, and saw:
>
>  21e:   33 43 18                xor    0x18(%rbx),%eax
>  221:   4c 31 e1                xor    %r12,%rcx
>  224:   44 31 ea                xor    %r13d,%edx
>  227:   09 c2                   or     %eax,%edx
>  229:   48 0b 4b 08             or     0x8(%rbx),%rcx
>
> and thought -- hang on, how come we're just ORing, not XORing, here?  Of course
> it's the cs_base field, which alpha has set to zero.  The compiler has
> simplified bits |= 0 ^ tb->cs_base.
>
> Which got me thinking: what if we had a per-cpu
>
> typedef struct {
>     target_ulong pc;
>     ...
> } TranslationBlockID;
>
> static inline bool arch_tbid_cmp(TranslationBlockID x,
>                                  TranslationBlockID y)
> {
>     return x.pc == y.pc && ...;
> }
>
> We could potentially reduce this to memcmp(&x, &y, sizeof(x)) == 0.
>
> First, this would allow cs_base to be eliminated where it is not used.  Second,
> this would allow cs_base to be renamed for the non-x86 targets for which it is
> being abused.  Third, it would allow tb->flags to be either (a) elided or (b)
> extended by the target as needed.
>
> This final point is directed at ARM, of course, where we've overflowed the
> uint32_t that is tb->flags.  We could now extend that to 64 bits.
>
> Obviously, some tweaks to tb_hash_func would be required as well, but that's
> manageable.
>
> What do you think about this last idea?


Sounds like a good idea for clean-up, especially for getting rid of
cs_base and extending tb->flags when needed. One concern: where do we go
when we get to heterogeneous emulation? Will the CPUs share the same
translation area, like the current cpu->cluster_index stuff, or will that
only be for similar but not quite the same architectures? Maybe I'm
thinking too far ahead...

>
> r~



-- 
Alex Bennée

no-reply@patchew.org Feb. 25, 2021, 3:45 p.m. UTC | #3

Patchew URL: https://patchew.org/QEMU/20210224165811.11567-1-alex.bennee@linaro.org/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20210224165811.11567-1-alex.bennee@linaro.org
Subject: [RFC PATCH  0/5] Experimenting with tb-lookup tweaks

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20210218201528.127099-1-eblake@redhat.com -> patchew/20210218201528.127099-1-eblake@redhat.com
 - [tag update]      patchew/20210224055401.492407-1-jasowang@redhat.com -> patchew/20210224055401.492407-1-jasowang@redhat.com
 * [new tag]         patchew/20210224165811.11567-1-alex.bennee@linaro.org -> patchew/20210224165811.11567-1-alex.bennee@linaro.org
 * [new tag]         patchew/20210224165837.21983-1-vgoyal@redhat.com -> patchew/20210224165837.21983-1-vgoyal@redhat.com
 - [tag update]      patchew/20210225032335.64245-1-aik@ozlabs.ru -> patchew/20210225032335.64245-1-aik@ozlabs.ru
 * [new tag]         patchew/20210225054756.35962-1-linuxmaker@163.com -> patchew/20210225054756.35962-1-linuxmaker@163.com
 - [tag update]      patchew/20210225131316.631940-1-pbonzini@redhat.com -> patchew/20210225131316.631940-1-pbonzini@redhat.com
Switched to a new branch 'test'
0be54b4 include/exec/tb-lookup: try and reduce branch prediction issues
c6233de include/exec: lightly re-arrange TranslationBlock
18534bf accel/tcg: drop the use of CF_HASH_MASK and rename params
3a30caf accel/tcg: move CF_CLUSTER calculation to curr_cflags
a135bea accel/tcg: rename tb_lookup__cpu_state and hoist state extraction

=== OUTPUT BEGIN ===
1/5 Checking commit a135bea36366 (accel/tcg: rename tb_lookup__cpu_state and hoist state extraction)
ERROR: "foo * bar" should be "foo *bar"
#84: FILE: include/exec/tb-lookup.h:20:
+static inline TranslationBlock * tb_lookup(CPUState *cpu,

WARNING: line over 80 characters
#85: FILE: include/exec/tb-lookup.h:21:
+                                           target_ulong pc, target_ulong cs_base,

total: 1 errors, 1 warnings, 80 lines checked

Patch 1/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/5 Checking commit 3a30caf5f47d (accel/tcg: move CF_CLUSTER calculation to curr_cflags)
3/5 Checking commit 18534bff0f1f (accel/tcg: drop the use of CF_HASH_MASK and rename params)
4/5 Checking commit c6233de83263 (include/exec: lightly re-arrange TranslationBlock)
WARNING: Block comments use a leading /* on a separate line
#35: FILE: include/exec/exec-all.h:465:
+    uint16_t size;      /* size of target code for this block (1 <=

WARNING: Block comments use * on subsequent lines
#36: FILE: include/exec/exec-all.h:466:
+    uint16_t size;      /* size of target code for this block (1 <=
+                           size <= TARGET_PAGE_SIZE) */

WARNING: Block comments use a trailing */ on a separate line
#36: FILE: include/exec/exec-all.h:466:
+                           size <= TARGET_PAGE_SIZE) */

total: 0 errors, 3 warnings, 20 lines checked

Patch 4/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
5/5 Checking commit 0be54b4ee146 (include/exec/tb-lookup: try and reduce branch prediction issues)
=== OUTPUT END ===
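
For reference, a small sketch of the style the messages above ask for: the asterisk attached to the name in pointer declarations, lines kept under 80 columns, and block comments with the opening and closing markers on their own lines. The type and function are hypothetical stand-ins, not the code from the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in type so the declarations compile on their own. */
typedef struct SketchBlock {
    /*
     * size of target code for this block
     * (1 <= size <= page size of the target)
     */
    uint16_t size;
} SketchBlock;

/*
 * "foo *bar" spacing: the asterisk sits next to the function name, and
 * the parameter list is wrapped to stay within 80 columns.
 */
static inline SketchBlock *sketch_lookup(SketchBlock *table,
                                         unsigned int index,
                                         unsigned int count)
{
    return index < count ? &table[index] : NULL;
}
```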

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20210224165811.11567-1-alex.bennee@linaro.org/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com