Message ID | 20210224165811.11567-1-alex.bennee@linaro.org |
---|---|
Series | Experimenting with tb-lookup tweaks |
On 2/24/21 8:58 AM, Alex Bennée wrote:
> Hi Richard,
>
> Well I spun up some of the ideas we talked about to see if there was
> anything to be squeezed out of the function. In the end the results
> seem to be a washout with my pigz benchmark:
>
>   qemu-system-aarch64 -cpu cortex-a57 \
>     -machine type=virt,virtualization=on,gic-version=3 \
>     -serial mon:stdio \
>     -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>     -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
>     -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
>     -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
>     -device scsi-hd,drive=hd,id=virt-scsi-hd \
>     -smp 4 -m 4096 \
>     -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
>     -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
>     -display none -snapshot
>
>   | Command | Mean [s]       | Min [s] | Max [s] | Relative |
>   |---------+----------------+---------+---------+----------|
>   | Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
>   | After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |

Well that's disappointing.

> Maybe the code cleanup itself makes it worthwhile. WDYT?

I think there's little doubt that the first 3 patches are a good code cleanup.

Patch 4 I think is still beneficial, simply so that we can add that "Above
fields" comment.

Patch 5 would only be worthwhile if we could measure any positive difference,
which it seems we cannot.

I have a follow-up patch to remove the parallel_cpus global variable which I
will post in a moment. While it removes a handful of insns from this
fast-path, I doubt it helps. But getting rid of a global is probably always
positive, no?

I was glancing through the lookup function for alpha, instead of aarch64, and saw:

 21e:   33 43 18                xor    0x18(%rbx),%eax
 221:   4c 31 e1                xor    %r12,%rcx
 224:   44 31 ea                xor    %r13d,%edx
 227:   09 c2                   or     %eax,%edx
 229:   48 0b 4b 08             or     0x8(%rbx),%rcx

and thought -- hang on, how come we're just ORing, not XORing, here? Of course
it's the cs_base field, which alpha has set to zero. The compiler has
simplified bits |= 0 ^ tb->cs_base.

Which got me thinking: what if we had a per-cpu

    typedef struct {
        target_ulong pc;
        ...
    } TranslationBlockID;

    static inline bool arch_tbid_cmp(TranslationBlockID x,
                                     TranslationBlockID y)
    {
        return x.pc == y.pc && ...;
    }

We could potentially reduce this to memcmp(&x, &y).

First, this would allow cs_base to be eliminated where it is not used. Second,
this would allow cs_base to be renamed for the non-x86 targets for which it is
being abused. Third, it would allow tb->flags to be either (a) elided or (b)
extended by the target as needed.

This final point is directed at ARM, of course, where we've overflowed the
uint32_t that is tb->flags. We could now extend that to 64 bits.

Obviously, some tweaks to tb_hash_func would be required as well, but that's
manageable.

What do you think about this last?

r~
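As a rough illustration of the proposal above (not code from the series), a per-target ID and its comparison might look something like the sketch below. Everything beyond the `TranslationBlockID` and `arch_tbid_cmp` names is an assumption made for the sake of a self-contained example: the `uint64_t` stand-in for `target_ulong`, the choice of fields, and the helpers `arch_tbid_cmp_xor`, `arch_tbid_memcmp` and `tbid_demo`. Note that a real `memcmp` call also takes a length, and that comparing structs bytewise is only valid when the struct has no padding bytes (or they are always zeroed), which the static assertion checks for this particular layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for QEMU's target_ulong; assumed to be 64-bit in this sketch. */
typedef uint64_t target_ulong;

/* Hypothetical per-target ID: a target with no use for cs_base omits it. */
typedef struct {
    target_ulong pc;
    uint64_t flags;   /* assumed widened from uint32_t, as floated for ARM */
} TranslationBlockID;

/* memcmp is only safe if the struct has no padding bytes. */
_Static_assert(sizeof(TranslationBlockID) ==
               sizeof(target_ulong) + sizeof(uint64_t),
               "TranslationBlockID must not contain padding");

/* Field-by-field comparison, in the shape given in the email. */
static inline bool arch_tbid_cmp(TranslationBlockID x, TranslationBlockID y)
{
    return x.pc == y.pc && x.flags == y.flags;
}

/* Branch-free variant: XOR each field pair, OR the results, test once. */
static inline bool arch_tbid_cmp_xor(TranslationBlockID x,
                                     TranslationBlockID y)
{
    uint64_t bits = (x.pc ^ y.pc) | (x.flags ^ y.flags);
    return bits == 0;
}

/* Bytewise variant; relies on the no-padding assertion above. */
static inline bool arch_tbid_memcmp(const TranslationBlockID *x,
                                    const TranslationBlockID *y)
{
    return memcmp(x, y, sizeof(*x)) == 0;
}

/* Returns true when all three variants agree on equal and unequal IDs. */
bool tbid_demo(void)
{
    TranslationBlockID a = { .pc = 0x1000, .flags = 1 };
    TranslationBlockID b = a;
    TranslationBlockID c = { .pc = 0x1000, .flags = 2 };

    return arch_tbid_cmp(a, b) && arch_tbid_cmp_xor(a, b) &&
           arch_tbid_memcmp(&a, &b) &&
           !arch_tbid_cmp(a, c) && !arch_tbid_cmp_xor(a, c) &&
           !arch_tbid_memcmp(&a, &c);
}
```

Because the struct is defined per target, a field that a target never uses drops out of the comparison entirely instead of being XORed against a constant zero.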
Richard Henderson <richard.henderson@linaro.org> writes:

> On 2/24/21 8:58 AM, Alex Bennée wrote:
>> Hi Richard,
>>
>> Well I spun up some of the ideas we talked about to see if there was
>> anything to be squeezed out of the function. In the end the results
>> seem to be a washout with my pigz benchmark:
>>
>>   qemu-system-aarch64 -cpu cortex-a57 \
>>     -machine type=virt,virtualization=on,gic-version=3 \
>>     -serial mon:stdio \
>>     -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>>     -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
>>     -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
>>     -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
>>     -device scsi-hd,drive=hd,id=virt-scsi-hd \
>>     -smp 4 -m 4096 \
>>     -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
>>     -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
>>     -display none -snapshot
>>
>>   | Command | Mean [s]       | Min [s] | Max [s] | Relative |
>>   |---------+----------------+---------+---------+----------|
>>   | Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
>>   | After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |
>
> Well that's disappointing.
>
>> Maybe the code cleanup itself makes it worthwhile. WDYT?
>
> I think there's little doubt that the first 3 patches are a good code cleanup.
>
> Patch 4 I think is still beneficial, simply so that we can add that "Above
> fields" comment.
>
> Patch 5 would only be worthwhile if we could measure any positive difference,
> which it seems we cannot.
>
> I have a follow-up patch to remove the parallel_cpus global variable which I
> will post in a moment. While it removes a handful of insns from this
> fast-path, I doubt it helps. But getting rid of a global is probably always
> positive, no?
>
> I was glancing through the lookup function for alpha, instead of aarch64, and saw:
>
>  21e:   33 43 18                xor    0x18(%rbx),%eax
>  221:   4c 31 e1                xor    %r12,%rcx
>  224:   44 31 ea                xor    %r13d,%edx
>  227:   09 c2                   or     %eax,%edx
>  229:   48 0b 4b 08             or     0x8(%rbx),%rcx
>
> and thought -- hang on, how come we're just ORing, not XORing, here? Of course
> it's the cs_base field, which alpha has set to zero. The compiler has
> simplified bits |= 0 ^ tb->cs_base.
>
> Which got me thinking: what if we had a per-cpu
>
>     typedef struct {
>         target_ulong pc;
>         ...
>     } TranslationBlockID;
>
>     static inline bool arch_tbid_cmp(TranslationBlockID x,
>                                      TranslationBlockID y)
>     {
>         return x.pc == y.pc && ...;
>     }
>
> We could potentially reduce this to memcmp(&x, &y).
>
> First, this would allow cs_base to be eliminated where it is not used. Second,
> this would allow cs_base to be renamed for the non-x86 targets for which it is
> being abused. Third, it would allow tb->flags to be either (a) elided or (b)
> extended by the target as needed.
>
> This final point is directed at ARM, of course, where we've overflowed the
> uint32_t that is tb->flags. We could now extend that to 64 bits.
>
> Obviously, some tweaks to tb_hash_func would be required as well, but that's
> manageable.
>
> What do you think about this last?

Sounds like a good idea for clean-up, especially to get rid of
cs_base/extend tbflags when needed.

One concern would be where do we go when we get to heterogeneous
emulation? Will they share the same translation area like the current
cpu->cluster_index stuff or will that only be for similar but not quite
the same architectures?

Maybe I'm thinking too far ahead...

>
>
> r~

-- 
Alex Bennée
Patchew URL: https://patchew.org/QEMU/20210224165811.11567-1-alex.bennee@linaro.org/

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20210224165811.11567-1-alex.bennee@linaro.org
Subject: [RFC PATCH 0/5] Experimenting with tb-lookup tweaks

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20210218201528.127099-1-eblake@redhat.com -> patchew/20210218201528.127099-1-eblake@redhat.com
 - [tag update]      patchew/20210224055401.492407-1-jasowang@redhat.com -> patchew/20210224055401.492407-1-jasowang@redhat.com
 * [new tag]         patchew/20210224165811.11567-1-alex.bennee@linaro.org -> patchew/20210224165811.11567-1-alex.bennee@linaro.org
 * [new tag]         patchew/20210224165837.21983-1-vgoyal@redhat.com -> patchew/20210224165837.21983-1-vgoyal@redhat.com
 - [tag update]      patchew/20210225032335.64245-1-aik@ozlabs.ru -> patchew/20210225032335.64245-1-aik@ozlabs.ru
 * [new tag]         patchew/20210225054756.35962-1-linuxmaker@163.com -> patchew/20210225054756.35962-1-linuxmaker@163.com
 - [tag update]      patchew/20210225131316.631940-1-pbonzini@redhat.com -> patchew/20210225131316.631940-1-pbonzini@redhat.com
Switched to a new branch 'test'
0be54b4 include/exec/tb-lookup: try and reduce branch prediction issues
c6233de include/exec: lightly re-arrange TranslationBlock
18534bf accel/tcg: drop the use of CF_HASH_MASK and rename params
3a30caf accel/tcg: move CF_CLUSTER calculation to curr_cflags
a135bea accel/tcg: rename tb_lookup__cpu_state and hoist state extraction

=== OUTPUT BEGIN ===
1/5 Checking commit a135bea36366 (accel/tcg: rename tb_lookup__cpu_state and hoist state extraction)
ERROR: "foo * bar" should be "foo *bar"
#84: FILE: include/exec/tb-lookup.h:20:
+static inline TranslationBlock * tb_lookup(CPUState *cpu,

WARNING: line over 80 characters
#85: FILE: include/exec/tb-lookup.h:21:
+                                           target_ulong pc, target_ulong cs_base,

total: 1 errors, 1 warnings, 80 lines checked

Patch 1/5 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/5 Checking commit 3a30caf5f47d (accel/tcg: move CF_CLUSTER calculation to curr_cflags)
3/5 Checking commit 18534bff0f1f (accel/tcg: drop the use of CF_HASH_MASK and rename params)
4/5 Checking commit c6233de83263 (include/exec: lightly re-arrange TranslationBlock)
WARNING: Block comments use a leading /* on a separate line
#35: FILE: include/exec/exec-all.h:465:
+    uint16_t size; /* size of target code for this block (1 <=

WARNING: Block comments use * on subsequent lines
#36: FILE: include/exec/exec-all.h:466:
+    uint16_t size; /* size of target code for this block (1 <=
+                      size <= TARGET_PAGE_SIZE) */

WARNING: Block comments use a trailing */ on a separate line
#36: FILE: include/exec/exec-all.h:466:
+                      size <= TARGET_PAGE_SIZE) */

total: 0 errors, 3 warnings, 20 lines checked

Patch 4/5 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/5 Checking commit 0be54b4ee146 (include/exec/tb-lookup: try and reduce branch prediction issues)
=== OUTPUT END ===

Test command exited with code: 1

The full log is available at
http://patchew.org/logs/20210224165811.11567-1-alex.bennee@linaro.org/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
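For reference, the style complaints in the report above could probably be addressed along the following lines. This is only a sketch and not taken from the series: the typedefs are stand-ins so the fragment compiles on its own, the struct carries only a few illustrative fields, and `tb_lookup_sketch` is a hypothetical name with a shortened parameter list and a placeholder body, since the report shows only the first two lines of the real declaration.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so this fragment compiles alone; not the real QEMU types. */
typedef uint64_t target_ulong;
typedef struct CPUState CPUState;

typedef struct TranslationBlock {
    target_ulong pc;
    target_ulong cs_base;

    /*
     * size of target code for this block
     * (1 <= size <= TARGET_PAGE_SIZE)
     */
    uint16_t size;
} TranslationBlock;

/*
 * Pointer star attached to the name ("foo *bar"), and one parameter
 * per line so that no line exceeds 80 columns.
 */
static inline TranslationBlock *tb_lookup_sketch(CPUState *cpu,
                                                 target_ulong pc,
                                                 target_ulong cs_base)
{
    (void)cpu;
    (void)pc;
    (void)cs_base;
    return NULL; /* placeholder: no lookup is performed in this sketch */
}
```

The multi-line comment opens with `/*` and closes with `*/` on lines of their own, with a leading `*` on each line in between, which is the shape the three block-comment warnings ask for.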