
[00/16] make system memory API available for common code

Message ID 20250310045842.2650784-1-pierrick.bouvier@linaro.org

Message

Pierrick Bouvier March 10, 2025, 4:58 a.m. UTC
The main goal of this series is to be able to call any memory ld/st function
from code that is *not* target dependent. As a positive side effect, we can
turn related system compilation units into common code.

The first 6 patches remove the memory API's dependency on CPU headers and on
target-specific code. This could be a series on its own, but it's useful to
turn the system memory compilation units into common code right away, both to
make sure the change can't regress and to prove it achieves the desired result.

The next patches remove more dependencies on CPU headers (exec-all,
memory-internal, ram_addr).
Then we expose a needed function from KVM, add some Xen stubs, and can
finally turn the compilation units into common code.
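To make the kind of transformation concrete, here is a minimal sketch (illustrative only, not the actual patch content; the macro and function names are hypothetical): a load helper whose byte order used to be chosen by a target-specific #ifdef instead consults a runtime flag.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not actual QEMU code. The old pattern selects the
 * byte order at compile time, e.g.:
 *
 *     #ifdef TARGET_BIG_ENDIAN
 *     #define ldl_p(p) ldl_be_p(p)
 *     #else
 *     #define ldl_p(p) ldl_le_p(p)
 *     #endif
 *
 * which makes every compilation unit using it target specific. The
 * common-code pattern replaces the #ifdef with a runtime query:
 */

bool target_big_endian; /* would be set once, when the target is known */

uint32_t ldl_p_common(const uint8_t *p)
{
    if (target_big_endian) {
        return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
             | (uint32_t)p[2] << 8  | (uint32_t)p[3];
    }
    return (uint32_t)p[3] << 24 | (uint32_t)p[2] << 16
         | (uint32_t)p[1] << 8  | (uint32_t)p[0];
}
```

The same object file can then be linked into a binary serving any target; the price is one well-predicted branch per access.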

Every commit was tested to build correctly for all targets (on Windows, Linux
and macOS), and the series was fully tested by running all the tests we have
(Linux, x86_64 host).

Pierrick Bouvier (16):
  exec/memory_ldst: extract memory_ldst declarations from cpu-all.h
  exec/memory_ldst_phys: extract memory_ldst_phys declarations from
    cpu-all.h
  include: move target_words_bigendian() from tswap to bswap
  exec/memory.h: make devend_memop target agnostic
  qemu/bswap: implement {ld,st}.*_p as functions
  exec/cpu-all.h: we can now remove ld/st macros
  codebase: prepare to remove cpu.h from exec/exec-all.h
  exec/exec-all: remove dependency on cpu.h
  exec/memory-internal: remove dependency on cpu.h
  exec/ram_addr: remove dependency on cpu.h
  system/kvm: make kvm_flush_coalesced_mmio_buffer() accessible for
    common code
  exec/ram_addr: call xen_hvm_modified_memory only if xen is enabled
  hw/xen: add stubs for various functions
  system/physmem: compilation unit is now common to all targets
  system/memory: make compilation unit common
  system/ioport: make compilation unit common

 include/exec/cpu-all.h              | 52 ------------------
 include/exec/exec-all.h             |  1 -
 include/exec/memory-internal.h      |  2 -
 include/exec/memory.h               | 48 ++++++++++++++---
 include/exec/ram_addr.h             | 11 ++--
 include/exec/tswap.h                | 11 ----
 include/qemu/bswap.h                | 82 +++++++++++++++++++++++++++++
 include/system/kvm.h                |  6 +--
 include/tcg/tcg-op.h                |  1 +
 target/ppc/helper_regs.h            |  2 +
 include/exec/memory_ldst.h.inc      | 13 ++---
 include/exec/memory_ldst_phys.h.inc |  5 +-
 hw/ppc/spapr_nested.c               |  1 +
 hw/sh4/sh7750.c                     |  1 +
 hw/xen/xen_stubs.c                  | 56 ++++++++++++++++++++
 page-vary-target.c                  |  3 +-
 system/ioport.c                     |  1 -
 system/memory.c                     | 22 +++++---
 target/riscv/bitmanip_helper.c      |  1 +
 hw/xen/meson.build                  |  3 ++
 system/meson.build                  |  6 +--
 21 files changed, 225 insertions(+), 103 deletions(-)
 create mode 100644 hw/xen/xen_stubs.c

Comments

BALATON Zoltan March 10, 2025, 1:23 p.m. UTC | #1
On Sun, 9 Mar 2025, Pierrick Bouvier wrote:
> The main goal of this series is to be able to call any memory ld/st function
> from code that is *not* target dependent.

Why is that needed?

> As a positive side effect, we can
> turn related system compilation units into common code.

Are there any negative side effects? In particular have you done any 
performance benchmarking to see if this causes a measurable slow down? 
Such as with the STREAM benchmark:
https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure

Maybe it would be good to have some performance tests similar to 
functional tests that could be run like the CI tests to detect such 
performance changes. People report that QEMU is getting slower and slower 
with each release. Maybe it could be a GSoC project to make such tests but 
maybe we're too late for that.

Regards,
BALATON Zoltan
Pierrick Bouvier March 10, 2025, 4:28 p.m. UTC | #2
Hi Zoltan,

On 3/10/25 06:23, BALATON Zoltan wrote:
> On Sun, 9 Mar 2025, Pierrick Bouvier wrote:
>> The main goal of this series is to be able to call any memory ld/st function
>> from code that is *not* target dependent.
> 
> Why is that needed?
> 

This series belongs to the "single binary" topic, where we are trying to 
build a single QEMU binary with all architectures embedded.

To achieve that, we need every compilation unit to be compiled only once, 
so that we can link a binary without any symbol conflicts.

A consequence of that is that target-specific code (in the sense of code 
relying on target-specific macros) needs to be converted to common code 
that checks the properties of the running target at runtime. We are 
tackling various places in the QEMU codebase at the same time, which can 
be confusing for community members.

This series takes care of the system memory related functions and the 
associated compilation units in system/.
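As a minimal sketch of that runtime-check idea (hypothetical types and values, not the actual QEMU implementation; only the function name target_words_bigendian() is borrowed from the series), common code can consult a descriptor of the target selected at startup instead of per-target macros:

```c
#include <stdbool.h>

/* Hypothetical illustration, not actual QEMU code: properties that used
 * to be compile-time macros become fields of a per-target descriptor. */
typedef struct TargetInfo {
    const char *name;
    bool big_endian;
    int page_bits;
} TargetInfo;

static const TargetInfo target_table[] = {
    { "aarch64", false, 12 },
    { "ppc",     true,  12 },
};

/* Selected once at startup; common code only ever reads it. */
const TargetInfo *current_target = &target_table[0];

/* This function can now live in a compilation unit built only once,
 * shared by every target: */
bool target_words_bigendian(void)
{
    return current_target->big_endian;
}
```

The per-target #ifdef disappears from the common code, at the cost of an indirection through the descriptor.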

>> As a positive side effect, we can
>> turn related system compilation units into common code.
> 
> Are there any negative side effects? In particular have you done any
> performance benchmarking to see if this causes a measurable slow down?
> Such as with the STREAM benchmark:
> https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure
> 
> Maybe it would be good to have some performance tests similar to
> functional tests that could be run like the CI tests to detect such
> performance changes. People report that QEMU is getting slower and slower
> with each release. Maybe it could be a GSoC project to make such tests but
> maybe we're too late for that.
> 

I agree with you, and it's something we have mentioned during our 
"internal" conversations. Testing performance with the existing functional 
tests would already be a good first step. However, given the poor 
reliability of our CI runners, I think it's a bit doomed.

Ideally, every QEMU release cycle should have a performance measurement 
window to detect potential sources of regressions.

To answer your specific question, I am first trying to get a review of 
the approach taken. We can always optimize in the next version of the 
series, if we identify that introducing a branch for every memory-related 
function call is a big deal.

In all cases, transforming code that relies on compile-time 
optimization/dead-code elimination through defines into runtime checks 
will *always* have an impact, even though it should be minimal in most 
cases. But the maintenance and compilation-time benefits, as well as the 
perspectives it opens (single binary, heterogeneous emulation, using 
QEMU as a library), are worth it IMHO.

> Regards,
> BALATON Zoltan

Regards,
Pierrick
Pierrick Bouvier March 10, 2025, 4:56 p.m. UTC | #3
As a side note, we recently did some work around performance analysis 
(for aarch64), as you can see here [1]. In the end, QEMU performance 
depends (roughly in this order) on:
1. quality of code generated by TCG
2. helper code to implement instructions
3. mmu emulation

Other state-of-the-art translators (fex, box64) are faster mainly by 
enhancing 1, and by relying on various tricks to avoid translating some 
library calls. But those translators are host/target specific, and their 
ratio of instructions generated (vs. target instructions read) is much 
lower than QEMU's. In the experiment described in the blog post, I 
observed that for qemu-system-aarch64 we have an average expansion 
factor of around 18 (1 guest insn translates to 18 host ones).

For users seeing performance decreases, beyond the QEMU code changes, 
adding new target instructions may add new helpers, which may be called by 
the software stack people use, so they can sometimes observe slower 
behaviour.

There are probably some other low hanging fruits for other target 
architectures.

[1] https://www.linaro.org/blog/qemu-a-tale-of-performance-analysis/
BALATON Zoltan March 10, 2025, 7:40 p.m. UTC | #4
On Mon, 10 Mar 2025, Pierrick Bouvier wrote:
> On 3/10/25 09:28, Pierrick Bouvier wrote:
>> Hi Zoltan,
>> 
>> On 3/10/25 06:23, BALATON Zoltan wrote:
>>> On Sun, 9 Mar 2025, Pierrick Bouvier wrote:
>>>> The main goal of this series is to be able to call any memory ld/st 
>>>> function
>>>> from code that is *not* target dependent.
>>> 
>>> Why is that needed?
>>> 
>> 
>> this series belongs to the "single binary" topic, where we are trying to
>> build a single QEMU binary with all architectures embedded.

Yes, I get it now; I just forgot, as this wasn't mentioned, so the goal 
wasn't obvious.

>> To achieve that, we need to have every single compilation unit compiled
>> only once, to be able to link a binary without any symbol conflict.
>> 
>> A consequence of that is target-specific code (in terms of code relying
>> on target-specific macros) needs to be converted to common code,
>> checking at runtime properties of the target we run. We are tackling
>> various places in QEMU codebase at the same time, which can be confusing
>> for the community members.

Mentioning the single binary in related series may help remind readers 
of the context.

>> This series takes care of system memory related functions and associated
>> compilation units in system/.
>> 
>>>> As a positive side effect, we can
>>>> turn related system compilation units into common code.
>>> 
>>> Are there any negative side effects? In particular have you done any
>>> performance benchmarking to see if this causes a measurable slow down?
>>> Such as with the STREAM benchmark:
>>> https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure
>>> 
>>> Maybe it would be good to have some performance tests similar to
>>> functional tests that could be run like the CI tests to detect such
>>> performance changes. People report that QEMU is getting slower and slower
>>> with each release. Maybe it could be a GSoC project to make such tests but
>>> maybe we're too late for that.
>>> 
>> 
>> I agree with you, and it's something we have mentioned during our
>> "internal" conversations. Testing performance with existing functional
>> tests would already be a first good step. However, given the poor
>> reliability we have on our CI runners, I think it's a bit doomed.
>> 
>> Ideally, every QEMU release cycle should have a performance measurement
>> window to detect potential sources of regressions.

Maybe instead of aiming for full CI-like performance testing, something 
simpler would do: a few tests that each exercise one aspect, like STREAM 
for memory access, copying a file from network and/or disk for I/O, and 
mp3 encoding with lame, which is supposed to test floating point and SIMD. 
It could be made a bootable image that just runs the test and reports a 
number (I did that before for qemu-system-ppc when we wanted to test an 
issue where it ran slower on some hosts). Such tests could be run by 
somebody making changes, so they could call them before and after their 
patch to quickly check if there's anything to improve. This may be less 
thorough than full performance testing but would still give some insight, 
and it is better than not testing anything for performance.

I'm bringing this topic up to try to keep awareness on this, so QEMU can 
remain true to its name. (Although I'm not sure whether the Q in the name 
originally stood for the time it took to write or for its performance, 
it's hopefully still a goal to keep it fast.)

>> To answer to your specific question, I am trying first to get a review
>> on the approach taken. We can always optimize in next series version, in
>> case we identify it's a big deal to introduce a branch for every memory
>> related function call.

I'm not sure we can always optimise after the fact, so sometimes it can be 
necessary to take performance into consideration while designing changes.

>> In all cases, transforming code relying on compile time
>> optimization/dead code elimination through defines to runtime checks
>> will *always* have an impact,

Yes, that's why it would be good to know how much impact that is.

>> even though it should be minimal in most of cases.

Hopefully but how do we know if we don't even test for it?

>> But the maintenance and compilation time benefits, as well as
>> the perspectives it opens (single binary, heterogeneous emulation, use
>> QEMU as a library) are worth it IMHO.

I'm not so sure about that. Heterogeneous emulation sounds interesting, but 
is it needed most of the time? Using QEMU as a library also may not be 
common, and it is limited by licensing. The single binary would simplify 
packaging, but then the binary may get huge, so it's slower to load and may 
take more resources to run and more time to compile; and if somebody only 
needs one architecture, why include all of the others, wait for them to 
compile, and use up a lot of disk space? So, in other words, while these 
are interesting and good goals, could they be achieved while keeping the 
current way of building a single-arch binary, as opposed to a single binary 
with multiple archs, and without throwing out the optimisations a 
single-arch binary can use? Which one is better may depend on the use case, 
so if possible it would be better to allow both: keep what we have and add 
the multi-arch binary on top, rather than replacing the current way 
completely.

>>> Regards,
>>> BALATON Zoltan
>> 
>> Regards,
>> Pierrick
>> 
>
> As a side note, we recently did some work around performance analysis (for 
> aarch64), as you can see here [1]. In the end, QEMU performance depends

Thank you, very interesting read.

> (roughly in this order) on:
> 1. quality of code generated by TCG
> 2. helper code to implement instructions
> 3. mmu emulation
>
> Other state-of-the-art translators (fex, box64) are faster mainly by 
> enhancing 1, and by relying on various tricks to avoid translating some 
> library calls. But those translators are host/target specific, and their 
> ratio of instructions generated (vs. target instructions read) is much 
> lower than QEMU's. In the experiment described in the blog post, I observed 
> that for qemu-system-aarch64 we have an average expansion factor of around 
> 18 (1 guest insn translates to 18 host ones).
>
> For users seeing performance decreases, beyond the QEMU code changes, adding 
> new target instructions may add new helpers, which may be called by the 
> software stack people use, so they can sometimes observe slower behaviour.

I'm mostly interested in emulating PPC for older and obscure OSes running 
on older hardware, so new instructions aren't a problem there. MMU 
emulation, helpers and TCG code generation mostly dominate there, and on 
PPC particularly the lack of hard-float usage. Apart from that, maybe some 
device emulations, but that's a different topic. This is already slow, so 
any overhead introduced at the lowest levels just adds to that, and 
target-specific optimisation may only get back what's lost elsewhere.

Regards,
BALATON Zoltan

> There are probably some other low hanging fruits for other target 
> architectures.
>
> [1] https://www.linaro.org/blog/qemu-a-tale-of-performance-analysis/
>
>
BALATON Zoltan March 10, 2025, 9:02 p.m. UTC | #5
On Mon, 10 Mar 2025, Pierrick Bouvier wrote:
> On 3/10/25 12:40, BALATON Zoltan wrote:
>> On Mon, 10 Mar 2025, Pierrick Bouvier wrote:
>>> On 3/10/25 09:28, Pierrick Bouvier wrote:
>>>> Hi Zoltan,
>>>> 
>>>> On 3/10/25 06:23, BALATON Zoltan wrote:
>>>>> On Sun, 9 Mar 2025, Pierrick Bouvier wrote:
>>>>>> The main goal of this series is to be able to call any memory ld/st
>>>>>> function
>>>>>> from code that is *not* target dependent.
>>>>> 
>>>>> Why is that needed?
>>>>> 
>>>> 
>>>> this series belongs to the "single binary" topic, where we are trying to
>>>> build a single QEMU binary with all architectures embedded.
>> 
>> Yes I get it now, I just forgot as this wasn't mentioned so the goal
>> wasn't obvious.
>> 
>
> The more I work on this topic, the more I realize we are missing a clear and 
> concise document (a wiki page, or anything that can be edited easily, not 
> email) explaining this to other developers, which we could share as a link 
> and enhance based on the questions asked.

Maybe you can start collecting an FAQ on a wiki page so you don't have to 
answer the same questions multiple times. I think most people are aware of 
this, though; they just may not associate a series with it if it's not 
mentioned in the description.

>>>> To achieve that, we need to have every single compilation unit compiled
>>>> only once, to be able to link a binary without any symbol conflict.
>>>> 
>>>> A consequence of that is target-specific code (in terms of code relying
>>>> on target-specific macros) needs to be converted to common code,
>>>> checking at runtime properties of the target we run. We are tackling
>>>> various places in QEMU codebase at the same time, which can be confusing
>>>> for the community members.
>> 
>> Mentioning this single binary in related series may help reminding readers
>> about the context.
>> 
>
> I'll make sure to mention this "name" in the title for next series, thanks!
>
>>>> This series takes care of system memory related functions and associated
>>>> compilation units in system/.
>>>> 
>>>>>> As a positive side effect, we can
>>>>>> turn related system compilation units into common code.
>>>>> 
>>>>> Are there any negative side effects? In particular have you done any
>>>>> performance benchmarking to see if this causes a measurable slow down?
>>>>> Such as with the STREAM benchmark:
>>>>> https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure
>>>>> 
>>>>> Maybe it would be good to have some performance tests similar to
>>>>> functional tests that could be run like the CI tests to detect such
>>>>> performance changes. People report that QEMU is getting slower and 
>>>>> slower
>>>>> with each release. Maybe it could be a GSoC project to make such tests 
>>>>> but
>>>>> maybe we're too late for that.
>>>>> 
>>>> 
>>>> I agree with you, and it's something we have mentioned during our
>>>> "internal" conversations. Testing performance with existing functional
>>>> tests would already be a first good step. However, given the poor
>>>> reliability we have on our CI runners, I think it's a bit doomed.
>>>> 
>>>> Ideally, every QEMU release cycle should have a performance measurement
>>>> window to detect potential sources of regressions.
>> 
>> Maybe instead of aiming for full CI like performance testing something
>> simpler like a few tests that exercise some aspects each, like STREAM that
>> tests memory access, copying a file from network and/or disk that tests
>> I/O and mp3 encode with lame for example that's supposed to test floating
>> point and SIMD might be simpler to do. It could be made a bootable image
>> that just runs the test and reports a number (I did that before for
>> qemu-system-ppc when we wanted to test an issue that on some hosts it ran
>> slower). Such test could be run by somebody making changes so they could
>> call these before and after their patch to quickly check if there's
>> anything to improve. This may be less thorough than full performance
>> testing but still give some insight and better than not testing anything
>> for performance.
>> 
>> I'm bringing this topic up to try to keep awareness on this so QEMU can
>> remain true to its name. (Although I'm not sure if originally the Q in the
>> name stood for the time it took to write or its performance but it's
>> hopefully still a goal to keep it fast.)
>> 
>
> You do well to remind us of that, but as always, the problem is that "run by 
> somebody" is not an enforceable process.
>
>>>> To answer to your specific question, I am trying first to get a review
>>>> on the approach taken. We can always optimize in next series version, in
>>>> case we identify it's a big deal to introduce a branch for every memory
>>>> related function call.
>> 
>> I'm not sure we can always optimise after the fact so sometimes it can be
>> necessary to take performance in consideration while designing changes.
>> 
>
> In the context of the single-binary series, we mostly introduce a few 
> branches in various spots, to do a runtime check.
> As Richard mentioned on this series, we can keep target code exactly as it 
> is.
>
>>>> In all cases, transforming code relying on compile time
>>>> optimization/dead code elimination through defines to runtime checks
>>>> will *always* have an impact,
>> 
>> Yes, that's why it would be good to know how much impact that is.
>> 
>>>> even though it should be minimal in most of cases.
>> 
>> Hopefully but how do we know if we don't even test for it?
>> 
>
> In the case of this series, I usually do a local test booting (automatically) 
> an x86_64 Debian stable VM, which powers itself off as part of its init.
>
> With and without this series, the variation is below the average one I have 
> between two runs (<1 sec, for a total of 40 seconds), so the impact is 
> literally invisible.

That's good to hear. Some unavoidable overhead is OK; I just hope we can 
avoid what is avoidable, and try to do something about anything that would 
have a noticeable performance penalty. If you're already aware of that and 
are doing it, then that's all I wanted to say, nothing new.

>>>> But the maintenance and compilation time benefits, as well as
>>>> the perspectives it opens (single binary, heterogeneous emulation, use
>>>> QEMU as a library) are worth it IMHO.
>> 
>> I'm not so sure about that. Heterogeneous emulation sounds interesting but
>> is it needed most of the time? Using QEMU as a library also may not be
>> common and limited by licensing. The single binary would simplify packages
>> but then this binary may get huge so it's slower to load, may take more
>> resources to run and more time to compile and if somebody only needs one
>> architecture why do I want to include all of the others and wait for it to
>> compile using up a lot of space on my disk? So in other words, while these
>> are interesting and good goals could it be achieved with keeping the
>> current way of building single ARCH binary as opposed to single binary
>> with multiple archs and not throwing out the optimisations a single arch
>> binary can use? Which one is better may depend on the use case so if
>> possible it would be better to allow both keeping what we have and adding
>> multi arch binary on top not replacing the current way completely.
>> 
>
> Thanks, it's definitely interesting to hear the concerns on this, so we can 
> address them and find the best and minimal solution to achieve the desired 
> goal.
>
> I'll answer point by point.
>
> QEMU as a library: that's what Unicorn is 
> (https://www.unicorn-engine.org/docs/beyond_qemu.html), which is used by a 
> lot of researchers. Talking frequently with some of them, they would be happy 
> to have such a library directly with upstream QEMU, so it can benefit from 
> all the enhancements done to TCG. It's mostly a use case for security 
> researchers/engineers, but definitely a valid one. Just look at the list of 
> QEMU downstream forks focused on that. Combining this with plugins would be 
> amazing, and only grow our list of users.
>
> For the heterogeneous scenario, yes it's not the most common case. But we 
> *must*, in terms of QEMU binary, be able to have a single binary first. By 
> that, I mean the need is to be able to link a binary with several archs 
> present, without any symbol conflicts.

OK, the Unicorn engine explains it: what is needed is multiple targets in a 
single library (which maybe is the real goal rather than a single binary, 
and that only needs the targets, not all the devices). By the way, I think 
multi-arch is what they really mean on that beyond_qemu.html page above, 
under Thread-safety.

> The other possible approach is to rename many functions throughout the QEMU 
> codebase by adding a target_ prefix everywhere, which would be ugly and 
> endless. That's why we are currently using the pragmatic "remove duplicated 
> compilation units" approach. As well, we can do a lot of header cleanups on 
> the way (removing useless dependencies), which is good for everyone.
>
> For compilation times, it will only speed things up, because if you build 
> only specific targets, non-needed files won't be compiled/linked. For a 
> multi-target setup, it's only a speed-up (with all targets, it would be a 
> drop from 9000+ CUs to around 4000+). Less disk space as well, most notably 
> in debug builds.
> As well, having files compiled only once allows code indexing tools (clangd, 
> for instance) to be used reliably, instead of picking a random CU setting 
> based on one target.
> Finally, having a single binary would make it easy to use LTO (or at least 
> distros would use it easily), and get the same or better performance than 
> what we have today.
>
> The "current" way, with several binaries, can be kept forever if people

As I said I think that would be needed as there are valid use cases for 
both.

> want. But it's not feasible to keep headers and CUs compatible for both 
> modes. It would be a lot of code duplication, and that is really not 
> desirable IMHO. So we need to do these system-wide changes and convince the 
> community that it's good progress for everyone.

It would be nice to keep optimisations where possible, and it seems that 
might sometimes be possible, so please take that into consideration as 
well, not just the one goal.

> Kudos to Philippe who has been doing this long and tedious work for several 
> years now, and I hope that with some fresh eyes/blood, it can be completed 
> soon.

Absolutely, and I did not mean to say not to do it; I just added another 
viewpoint for consideration.

Regards,
BALATON Zoltan