Message ID: 20230330224348.1006691-1-davidai@google.com
Series: Improve VM DVFS and task placement behavior
On Fri, Mar 31, 2023 at 01:49:48AM +0100, Matthew Wilcox wrote:
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > Hi,
> >
> > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > of the talk is that workloads running in a guest VM get terrible task
> > placement and DVFS behavior when compared to running the same workload in
>
> DVFS? Some new filesystem, perhaps?

Dynamic Voltage and Frequency Scaling (DVFS) -- it's a well known term in
cpufreq/cpuidle/schedutil land.

> > the host. Effectively, no EAS for threads inside VMs. This would make power
>
> EAS?

Energy Aware Scheduling (EAS) is mostly a kernel/sched thing that has an
impact on cpufreq, and my recollection is that it was discussed at
conferences long before kernel/sched had any EAS awareness.

I don't have the full series in my inbox and didn't dig further, but patch 1
at least is providing additional information to schedutil, which impacts CPU
frequency selection to varying degrees. The full impact would depend on
which cpufreq driver is in use and on the specific hardware, so even if the
series benefits one set of hardware, it's not necessarily a guaranteed win.

> Two unfamiliar and undefined acronyms in your opening paragraph.
> You're not making me want to read the rest of your opus.

It depends on the audience, and mm/ is not the audience. VM in the title
refers to Virtual Machine, not Virtual Memory, although I confess I
originally read it as mm/ and wondered how mm/ affects DVFS, to the extent
it triggered a "wtf happened in mm/ recently that I completely missed?".
This series is mostly of concern to scheduler, cpufreq or KVM folk,
depending on your perspective.

For example, on the KVM side, I'd immediately wonder whether the hypercall
overhead exceeds any benefit from better task placement, although the cover
letter suggests the answer is "no". However, it didn't comment (or I didn't
read carefully enough) on whether MMIO overhead or alternative communication
methods have a constant cost across different hardware or, much more likely,
depend on hardware that could potentially opt in. Various cpufreq hardware
has very different costs for measuring or altering CPU frequency, even
between generations of chips from the same vendor.

While the data shows performance improvements, it doesn't indicate how close
to bare metal the improvement gets. Even if it's 50% faster within a VM, how
much slower than bare metal is it? In terms of data presentation, it might
be better to assign bare metal a score of 1 as the best possible score and
show the VM performance as a relative ratio (e.g. 1.00 for bare metal, 0.50
for a VM with a vanilla kernel, 0.75 with improved task placement).

It would also be preferable to have x86-64 data, as the hazards the series
deals with also affect x86-64, which has the additional challenge that
cpufreq is often managed by the hardware; it should be demonstrated that the
series "does no harm" on x86-64 for recent-generation Intel and AMD chips if
possible. The lack of that data doesn't kill the series, as a large
improvement is still very interesting even if it's imperfect and possibly
specific to arm64.
If this *was* my area, or I happened to be paying close attention to it at
the time, I would likely favour using hypercalls only at the start because
they can be used universally, and suggest adding alternative communication
methods later using the same metric: "is an alternative method of
Guest<->Host communication worse, neutral or better at getting close to bare
metal performance?" I'd also push for the ratio tables, as it's easier to
see at a glance how close to bare metal performance the series achieves.
Finally, I would look for x86-64 data just in case the series causes harm
due to hypercall overhead on chips that manage frequency in firmware.

So while I haven't read the series and only patches 2+6 reached my inbox, I
understand the point in principle. The scheduler on wakeup paths for bare
metal also tries to favour recently used CPUs and avoid spurious CPU
migration, even though that is only tangentially related to EAS. For
example, a recently used CPU may still be polling
(drivers/cpuidle/poll_state.c:poll_idle) or at least not have entered a deep
C-state, so the wakeup penalty is lower. So whatever criticism the series
deserves, it's not for using obscure terms that no one in kernel/sched/,
drivers/cpuidle or drivers/cpufreq would recognise.
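[As a rough illustration of the relative-ratio presentation suggested above
(not something from the series itself), the Roblox FPS figures quoted later
in the thread could be reported as below. Lacking a bare-metal number, the
sketch normalizes against the best score in the set, whereas the reference
would ideally be a bare-metal run.]

#include <stdio.h>

struct result {
	const char *config;
	double fps;	/* Roblox FPS figures quoted in this thread */
};

int main(void)
{
	static const struct result res[] = {
		{ "Baseline VM", 18.25 },
		{ "MMIO",        24.06 },
		{ "Hypercall",   28.66 },
	};
	/*
	 * Ideally this would be the bare-metal score; it isn't reported in
	 * the thread, so normalize against the best VM result instead.
	 */
	const double reference = 28.66;

	for (size_t i = 0; i < sizeof(res) / sizeof(res[0]); i++)
		printf("%-12s %5.2f  (%.2f of reference)\n",
		       res[i].config, res[i].fps, res[i].fps / reference);
	return 0;
}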
Folks, On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: <snip> > PCMark > Higher is better > +-------------------+----------+------------+--------+-------+--------+ > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > +-------------------+----------+------------+--------+-------+--------+ > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > +-------------------+----------+------------+--------+-------+--------+ > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > +-------------------+----------+------------+--------+-------+--------+ > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > +-------------------+----------+------------+--------+-------+--------+ > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > +-------------------+----------+------------+--------+-------+--------+ > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > +-------------------+----------+------------+--------+-------+--------+ > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > +-------------------+----------+------------+--------+-------+--------+ > > PCMark Performance/mAh > Higher is better > +-----------+----------+-----------+--------+------+--------+ > | | Baseline | Hypercall | %delta | MMIO | %delta | > +-----------+----------+-----------+--------+------+--------+ > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > +-----------+----------+-----------+--------+------+--------+ > > Roblox > Higher is better > +-----+----------+------------+--------+-------+--------+ > | | Baseline | Hypercall | %delta | MMIO | %delta | > +-----+----------+------------+--------+-------+--------+ > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > +-----+----------+------------+--------+-------+--------+ > > Roblox Frames/mAh > Higher is better > +------------+----------+------------+--------+--------+--------+ > | | Baseline | Hypercall | %delta | MMIO | %delta | > +------------+----------+------------+--------+--------+--------+ > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > +------------+----------+------------+--------+--------+--------+ </snip> > Next steps: > =========== > We are continuing to look into communication mechanisms other than > hypercalls that are just as/more efficient and avoid switching into the VMM > userspace. Any inputs in this regard are greatly appreciated. We're highly unlikely to entertain such an interface in KVM. The entire feature is dependent on pinning vCPUs to physical cores, for which userspace is in the driver's seat. That is a well established and documented policy which can be seen in the way we handle heterogeneous systems and vPMU. Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces that I would not expect to benefit the typical user of KVM. Based on the data above, it would appear that the userspace implementation is in the same neighborhood as a KVM-based implementation, which only further weakens the case for moving this into the kernel. I certainly can appreciate the motivation for the series, but this feature should be in userspace as some form of a virtual device.
On Tue, 04 Apr 2023 20:43:40 +0100, Oliver Upton <oliver.upton@linux.dev> wrote: > > Folks, > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > <snip> > > > PCMark > > Higher is better > > +-------------------+----------+------------+--------+-------+--------+ > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > > +-------------------+----------+------------+--------+-------+--------+ > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > > +-------------------+----------+------------+--------+-------+--------+ > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > > +-------------------+----------+------------+--------+-------+--------+ > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > > +-------------------+----------+------------+--------+-------+--------+ > > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > > +-------------------+----------+------------+--------+-------+--------+ > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > > +-------------------+----------+------------+--------+-------+--------+ > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > > +-------------------+----------+------------+--------+-------+--------+ > > > > PCMark Performance/mAh > > Higher is better > > +-----------+----------+-----------+--------+------+--------+ > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > +-----------+----------+-----------+--------+------+--------+ > > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > > +-----------+----------+-----------+--------+------+--------+ > > > > Roblox > > Higher is better > > +-----+----------+------------+--------+-------+--------+ > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > +-----+----------+------------+--------+-------+--------+ > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > > +-----+----------+------------+--------+-------+--------+ > > > > Roblox Frames/mAh > > Higher is better > > +------------+----------+------------+--------+--------+--------+ > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > +------------+----------+------------+--------+--------+--------+ > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > > +------------+----------+------------+--------+--------+--------+ > > </snip> > > > Next steps: > > =========== > > We are continuing to look into communication mechanisms other than > > hypercalls that are just as/more efficient and avoid switching into the VMM > > userspace. Any inputs in this regard are greatly appreciated. > > We're highly unlikely to entertain such an interface in KVM. > > The entire feature is dependent on pinning vCPUs to physical cores, for which > userspace is in the driver's seat. That is a well established and documented > policy which can be seen in the way we handle heterogeneous systems and > vPMU. > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces > that I would not expect to benefit the typical user of KVM. > > Based on the data above, it would appear that the userspace implementation is > in the same neighborhood as a KVM-based implementation, which only further > weakens the case for moving this into the kernel. > > I certainly can appreciate the motivation for the series, but this feature > should be in userspace as some form of a virtual device. +1 on all of the above. The one thing I'd like to understand that the comment seems to imply that there is a significant difference in overhead between a hypercall and an MMIO. 
In my experience, both are pretty similar in cost for a handling location
(both in userspace or both in the kernel). MMIO handling is a tiny bit more
expensive due to a guaranteed TLB miss followed by a walk of the in-kernel
device ranges, but that's all. It should hardly register.

And if you really want some super-low latency, low overhead signalling,
maybe an exception is the wrong tool for the job. Shared memory
communication could be more appropriate.

Thanks,

M.
On Tuesday 04 Apr 2023 at 21:49:10 (+0100), Marc Zyngier wrote: > On Tue, 04 Apr 2023 20:43:40 +0100, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Folks, > > > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > > > <snip> > > > > > PCMark > > > Higher is better > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > > > > PCMark Performance/mAh > > > Higher is better > > > +-----------+----------+-----------+--------+------+--------+ > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > +-----------+----------+-----------+--------+------+--------+ > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > > > +-----------+----------+-----------+--------+------+--------+ > > > > > > Roblox > > > Higher is better > > > +-----+----------+------------+--------+-------+--------+ > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > +-----+----------+------------+--------+-------+--------+ > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > > > +-----+----------+------------+--------+-------+--------+ > > > > > > Roblox Frames/mAh > > > Higher is better > > > +------------+----------+------------+--------+--------+--------+ > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > +------------+----------+------------+--------+--------+--------+ > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > > > +------------+----------+------------+--------+--------+--------+ > > > > </snip> > > > > > Next steps: > > > =========== > > > We are continuing to look into communication mechanisms other than > > > hypercalls that are just as/more efficient and avoid switching into the VMM > > > userspace. Any inputs in this regard are greatly appreciated. > > > > We're highly unlikely to entertain such an interface in KVM. > > > > The entire feature is dependent on pinning vCPUs to physical cores, for which > > userspace is in the driver's seat. That is a well established and documented > > policy which can be seen in the way we handle heterogeneous systems and > > vPMU. > > > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces > > that I would not expect to benefit the typical user of KVM. > > > > Based on the data above, it would appear that the userspace implementation is > > in the same neighborhood as a KVM-based implementation, which only further > > weakens the case for moving this into the kernel. > > > > I certainly can appreciate the motivation for the series, but this feature > > should be in userspace as some form of a virtual device. 
> > +1 on all of the above. And I concur with all the above as well. Putting this in the kernel is not an obvious fit at all as that requires a number of assumptions about the VMM. As Oliver pointed out, the guest topology, and how it maps to the host topology (vcpu pinning etc) is very much a VMM policy decision and will be particularly important to handle guest frequency requests correctly. In addition to that, the VMM's software architecture may have an impact. Crosvm for example does device emulation in separate processes for security reasons, so it is likely that adjusting the scheduling parameters ('util_guest', uclamp, or else) only for the vCPU thread that issues frequency requests will be sub-optimal for performance, we may want to adjust those parameters for all the tasks that are on the critical path. And at an even higher level, assuming in the kernel a certain mapping of vCPU threads to host threads feels kinda wrong, this too is a host userspace policy decision I believe. Not that anybody in their right mind would want to do this, but I _think_ it would technically be feasible to serialize the execution of multiple vCPUs on the same host thread, at which point the util_guest thingy becomes entirely bogus. (I obviously don't want to conflate this use-case, it's just an example that shows the proposed abstraction in the series is not a perfect fit for the KVM userspace delegation model.) So +1 from me to move this as a virtual device of some kind. And if the extra cost of exiting all the way back to userspace is prohibitive (is it btw?), then we can try to work on that. Maybe something a la vhost can be done to optimize, I'll have a think. > The one thing I'd like to understand that the comment seems to imply > that there is a significant difference in overhead between a hypercall > and an MMIO. In my experience, both are pretty similar in cost for a > handling location (both in userspace or both in the kernel). MMIO > handling is a tiny bit more expensive due to a guaranteed TLB miss > followed by a walk of the in-kernel device ranges, but that's all. It > should hardly register. > > And if you really want some super-low latency, low overhead > signalling, maybe an exception is the wrong tool for the job. Shared > memory communication could be more appropriate. I presume some kind of signalling mechanism will be necessary to synchronously update host scheduling parameters in response to guest frequency requests, but if the volume of data requires it then a shared buffer + doorbell type of approach should do. Thinking about it, using SCMI over virtio would implement exactly that. Linux-as-a-guest already supports it IIRC, so possibly the problem being addressed in this series could be 'simply' solved using an SCMI backend in the VMM... Thanks, Quentin
On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > Hi, > > This patch series is a continuation of the talk Saravana gave at LPC 2022 > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist > of the talk is that workloads running in a guest VM get terrible task > placement and DVFS behavior when compared to running the same workload in > the host. Effectively, no EAS for threads inside VMs. This would make power > and performance terrible just by running the workload in a VM even if we > assume there is zero virtualization overhead. > > We have been iterating over different options for communicating between > guest and host, ways of applying the information coming from the > guest/host, etc to figure out the best performance and power improvements > we could get. > > The patch series in its current state is NOT meant for landing in the > upstream kernel. We are sending this patch series to share the current > progress and data we have so far. The patch series is meant to be easy to > cherry-pick and test on various devices to see what performance and power > benefits this might give for others. > > With this series, a workload running in a VM gets the same task placement > and DVFS treatment as it would when running in the host. > > As expected, we see significant performance improvement and better > performance/power ratio. If anyone else wants to try this out for your VM > workloads and report findings, that'd be very much appreciated. > > The idea is to improve VM CPUfreq/sched behavior by: > - Having guest kernel to do accurate load tracking by taking host CPU > arch/type and frequency into account. > - Sharing vCPU run queue utilization information with the host so that the > host can do proper frequency scaling and task placement on the host side. So, not having actually been send many of the patches I've no idea what you've done... Please, eradicate this ridiculous idea of sending random people a random subset of a patch series. Either send all of it or none, this is a bloody nuisance. Having said that; my biggest worry is that you're making scheduler internals into an ABI. I would hate for this paravirt interface to tie us down.
On Wed, 5 Apr 2023 at 09:48, Quentin Perret <qperret@google.com> wrote: > > On Tuesday 04 Apr 2023 at 21:49:10 (+0100), Marc Zyngier wrote: > > On Tue, 04 Apr 2023 20:43:40 +0100, > > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > Folks, > > > > > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > > > > > <snip> > > > > > > > PCMark > > > > Higher is better > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > > > > > PCMark Performance/mAh > > > > Higher is better > > > > +-----------+----------+-----------+--------+------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-----------+----------+-----------+--------+------+--------+ > > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > > > > +-----------+----------+-----------+--------+------+--------+ > > > > > > > > Roblox > > > > Higher is better > > > > +-----+----------+------------+--------+-------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-----+----------+------------+--------+-------+--------+ > > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > > > > +-----+----------+------------+--------+-------+--------+ > > > > > > > > Roblox Frames/mAh > > > > Higher is better > > > > +------------+----------+------------+--------+--------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +------------+----------+------------+--------+--------+--------+ > > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > > > > +------------+----------+------------+--------+--------+--------+ > > > > > > </snip> > > > > > > > Next steps: > > > > =========== > > > > We are continuing to look into communication mechanisms other than > > > > hypercalls that are just as/more efficient and avoid switching into the VMM > > > > userspace. Any inputs in this regard are greatly appreciated. > > > > > > We're highly unlikely to entertain such an interface in KVM. > > > > > > The entire feature is dependent on pinning vCPUs to physical cores, for which > > > userspace is in the driver's seat. That is a well established and documented > > > policy which can be seen in the way we handle heterogeneous systems and > > > vPMU. > > > > > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces > > > that I would not expect to benefit the typical user of KVM. 
> > > > > > Based on the data above, it would appear that the userspace implementation is > > > in the same neighborhood as a KVM-based implementation, which only further > > > weakens the case for moving this into the kernel. > > > > > > I certainly can appreciate the motivation for the series, but this feature > > > should be in userspace as some form of a virtual device. > > > > +1 on all of the above. > > And I concur with all the above as well. Putting this in the kernel is > not an obvious fit at all as that requires a number of assumptions about > the VMM. > > As Oliver pointed out, the guest topology, and how it maps to the host > topology (vcpu pinning etc) is very much a VMM policy decision and will > be particularly important to handle guest frequency requests correctly. > > In addition to that, the VMM's software architecture may have an impact. > Crosvm for example does device emulation in separate processes for > security reasons, so it is likely that adjusting the scheduling > parameters ('util_guest', uclamp, or else) only for the vCPU thread that > issues frequency requests will be sub-optimal for performance, we may > want to adjust those parameters for all the tasks that are on the > critical path. > > And at an even higher level, assuming in the kernel a certain mapping of > vCPU threads to host threads feels kinda wrong, this too is a host > userspace policy decision I believe. Not that anybody in their right > mind would want to do this, but I _think_ it would technically be > feasible to serialize the execution of multiple vCPUs on the same host > thread, at which point the util_guest thingy becomes entirely bogus. (I > obviously don't want to conflate this use-case, it's just an example > that shows the proposed abstraction in the series is not a perfect fit > for the KVM userspace delegation model.) > > So +1 from me to move this as a virtual device of some kind. And if the > extra cost of exiting all the way back to userspace is prohibitive (is > it btw?), then we can try to work on that. Maybe something a la vhost > can be done to optimize, I'll have a think. > > > The one thing I'd like to understand that the comment seems to imply > > that there is a significant difference in overhead between a hypercall > > and an MMIO. In my experience, both are pretty similar in cost for a > > handling location (both in userspace or both in the kernel). MMIO > > handling is a tiny bit more expensive due to a guaranteed TLB miss > > followed by a walk of the in-kernel device ranges, but that's all. It > > should hardly register. > > > > And if you really want some super-low latency, low overhead > > signalling, maybe an exception is the wrong tool for the job. Shared > > memory communication could be more appropriate. > > I presume some kind of signalling mechanism will be necessary to > synchronously update host scheduling parameters in response to guest > frequency requests, but if the volume of data requires it then a shared > buffer + doorbell type of approach should do. > > Thinking about it, using SCMI over virtio would implement exactly that. > Linux-as-a-guest already supports it IIRC, so possibly the problem > being addressed in this series could be 'simply' solved using an SCMI > backend in the VMM... 
This is what was suggested at LPC: use virtio-scmi and an SCMI performance
domain in the guest for the cpufreq driver, with a vhost-user SCMI backend
in user space. This vhost userspace backend then updates the uclamp min of
the vCPU thread, or uses another method if this one is not good enough.

>
> Thanks,
> Quentin
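[As a minimal sketch of the "backend updates the uclamp min of the vCPU
thread" step suggested above: the existing sched_setattr(2) interface
already allows exactly that from userspace. How the vhost-user SCMI backend
learns the vCPU tid and the requested level is assumed here, not shown.]

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* Mirrors struct sched_attr from include/uapi/linux/sched/types.h */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* From include/uapi/linux/sched.h */
#define SCHED_FLAG_KEEP_ALL        0x18	/* keep policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MIN  0x20

/* Set uclamp.min (0..1024) on one vCPU thread without touching its policy. */
static int vcpu_set_uclamp_min(pid_t vcpu_tid, uint32_t util_min)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN;
	attr.sched_util_min = util_min;

	return syscall(SYS_sched_setattr, vcpu_tid, &attr, 0);
}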
On Tue, Apr 4, 2023 at 1:49 PM Marc Zyngier <maz@kernel.org> wrote: > > On Tue, 04 Apr 2023 20:43:40 +0100, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Folks, > > > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > > > <snip> > > > > > PCMark > > > Higher is better > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > > > +-------------------+----------+------------+--------+-------+--------+ > > > > > > PCMark Performance/mAh > > > Higher is better > > > +-----------+----------+-----------+--------+------+--------+ > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > +-----------+----------+-----------+--------+------+--------+ > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > > > +-----------+----------+-----------+--------+------+--------+ > > > > > > Roblox > > > Higher is better > > > +-----+----------+------------+--------+-------+--------+ > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > +-----+----------+------------+--------+-------+--------+ > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > > > +-----+----------+------------+--------+-------+--------+ > > > > > > Roblox Frames/mAh > > > Higher is better > > > +------------+----------+------------+--------+--------+--------+ > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > +------------+----------+------------+--------+--------+--------+ > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > > > +------------+----------+------------+--------+--------+--------+ > > > > </snip> > > > > > Next steps: > > > =========== > > > We are continuing to look into communication mechanisms other than > > > hypercalls that are just as/more efficient and avoid switching into the VMM > > > userspace. Any inputs in this regard are greatly appreciated. Hi Oliver and Marc, Replying to both of you in this one email. > > > > We're highly unlikely to entertain such an interface in KVM. > > > > The entire feature is dependent on pinning vCPUs to physical cores, for which > > userspace is in the driver's seat. That is a well established and documented > > policy which can be seen in the way we handle heterogeneous systems and > > vPMU. > > > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces > > that I would not expect to benefit the typical user of KVM. > > > > Based on the data above, it would appear that the userspace implementation is > > in the same neighborhood as a KVM-based implementation, which only further > > weakens the case for moving this into the kernel. 
Oliver,

Sorry if the tables/data aren't presented in an intuitive way, but MMIO vs
hypercall is definitely not in the same neighborhood. The hypercall method
often gives close to 2x the improvement that the MMIO method gives. For
example:

- Roblox FPS: MMIO improves it by 32% vs hypercall improves it by 57%.
- Frames/mAh: MMIO improves it by 13% vs hypercall improves it by 26%.
- PC Mark Data manipulation: MMIO makes it worse by 8% vs hypercall
  improves it by 5%.

Hypercall does better for other cases too, just not by as much. For example:

- PC Mark Photo editing: Going from MMIO to hypercall gives a 10% improvement.

These are all pretty non-trivial, at least in the mobile world. Heck, whole
teams would spend months for a 2% improvement in battery :)

> > I certainly can appreciate the motivation for the series, but this feature
> > should be in userspace as some form of a virtual device.
>
> +1 on all of the above.

Marc and Oliver,

We are not tied to hypercalls. We want to do the right thing here, but MMIO
going all the way to userspace definitely doesn't cut it as is. This is
where we need some guidance. See more below.

> The one thing I'd like to understand that the comment seems to imply
> that there is a significant difference in overhead between a hypercall
> and an MMIO. In my experience, both are pretty similar in cost for a
> handling location (both in userspace or both in the kernel).

I think the main difference really is that in our hypercall vs MMIO
comparison the hypercall is handled in the kernel, whereas MMIO goes all the
way to userspace. I agree with you that the difference probably won't be
significant if both of them go to the same "depth" in the privilege levels.

> MMIO
> handling is a tiny bit more expensive due to a guaranteed TLB miss
> followed by a walk of the in-kernel device ranges, but that's all. It
> should hardly register.
>
> And if you really want some super-low latency, low overhead
> signalling, maybe an exception is the wrong tool for the job. Shared
> memory communication could be more appropriate.

Yeah, that's one of our next steps. Ideally, we want to use shared memory
for the host to guest information flow: a 32-bit value representing the
current frequency that the host can update whenever the host CPU frequency
changes and that the guest can read whenever it needs it.

For guest to host information flow, we'll need a kick from guest to host
because we need to take action on the host side when threads migrate between
vCPUs and cause a significant change in vCPU util. Again, it can be just
shared memory and some kick. This is what we are currently trying to figure
out how to do. If there are APIs to do this, can you point us to those
please? We'd also want the shared memory to be accessible by the VMM (so,
shared between guest kernel, host kernel and VMM).

Are the above next steps sane? Or is that a no-go? The main thing we want to
cut out is the need to switch to userspace for every single interaction
because, as is, it leaves a lot on the table.

Also, thanks for all the feedback. Glad to receive it.

-Saravana
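[Purely to illustrate the layout being described above, a shared region
could be as small as the struct below; none of the names or fields are an
existing KVM ABI, and the guest-to-host doorbell/kick is assumed to exist
separately.]

#include <stdint.h>

/* Hypothetical guest/host/VMM shared page for the scheme sketched above. */
struct vcpu_dvfs_shmem {
	/*
	 * Host -> guest: current frequency of the physical CPU this vCPU
	 * runs on, in kHz. Updated by the host on every frequency change,
	 * read by the guest whenever it scales its load tracking.
	 */
	uint32_t host_cur_freq_khz;

	/*
	 * Guest -> host: requested performance level as a fraction of the
	 * highest performance point (0..1024). Written by the guest; the
	 * host/VMM reads it after the guest rings a doorbell (MMIO write,
	 * hypercall, or similar) when the value changes significantly.
	 */
	uint32_t guest_perf_request;
};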
On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team <kernel-team@android.com> wrote: > > On Tuesday 04 Apr 2023 at 21:49:10 (+0100), Marc Zyngier wrote: > > On Tue, 04 Apr 2023 20:43:40 +0100, > > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > Folks, > > > > > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > > > > > <snip> > > > > > > > PCMark > > > > Higher is better > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > > > > > PCMark Performance/mAh > > > > Higher is better > > > > +-----------+----------+-----------+--------+------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-----------+----------+-----------+--------+------+--------+ > > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > > > > +-----------+----------+-----------+--------+------+--------+ > > > > > > > > Roblox > > > > Higher is better > > > > +-----+----------+------------+--------+-------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-----+----------+------------+--------+-------+--------+ > > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > > > > +-----+----------+------------+--------+-------+--------+ > > > > > > > > Roblox Frames/mAh > > > > Higher is better > > > > +------------+----------+------------+--------+--------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +------------+----------+------------+--------+--------+--------+ > > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > > > > +------------+----------+------------+--------+--------+--------+ > > > > > > </snip> > > > > > > > Next steps: > > > > =========== > > > > We are continuing to look into communication mechanisms other than > > > > hypercalls that are just as/more efficient and avoid switching into the VMM > > > > userspace. Any inputs in this regard are greatly appreciated. > > > > > > We're highly unlikely to entertain such an interface in KVM. > > > > > > The entire feature is dependent on pinning vCPUs to physical cores, for which > > > userspace is in the driver's seat. That is a well established and documented > > > policy which can be seen in the way we handle heterogeneous systems and > > > vPMU. > > > > > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces > > > that I would not expect to benefit the typical user of KVM. 
> > > > > > Based on the data above, it would appear that the userspace implementation is > > > in the same neighborhood as a KVM-based implementation, which only further > > > weakens the case for moving this into the kernel. > > > > > > I certainly can appreciate the motivation for the series, but this feature > > > should be in userspace as some form of a virtual device. > > > > +1 on all of the above. > > And I concur with all the above as well. Putting this in the kernel is > not an obvious fit at all as that requires a number of assumptions about > the VMM. > > As Oliver pointed out, the guest topology, and how it maps to the host > topology (vcpu pinning etc) is very much a VMM policy decision and will > be particularly important to handle guest frequency requests correctly. > > In addition to that, the VMM's software architecture may have an impact. > Crosvm for example does device emulation in separate processes for > security reasons, so it is likely that adjusting the scheduling > parameters ('util_guest', uclamp, or else) only for the vCPU thread that > issues frequency requests will be sub-optimal for performance, we may > want to adjust those parameters for all the tasks that are on the > critical path. > > And at an even higher level, assuming in the kernel a certain mapping of > vCPU threads to host threads feels kinda wrong, this too is a host > userspace policy decision I believe. Not that anybody in their right > mind would want to do this, but I _think_ it would technically be > feasible to serialize the execution of multiple vCPUs on the same host > thread, at which point the util_guest thingy becomes entirely bogus. (I > obviously don't want to conflate this use-case, it's just an example > that shows the proposed abstraction in the series is not a perfect fit > for the KVM userspace delegation model.) See my reply to Oliver and Marc. To me it looks like we are converging towards having shared memory between guest, host kernel and VMM and that should address all our concerns. The guest will see a MMIO device, writing to it will trigger the host kernel to do the basic "set util_guest/uclamp for the vCPU thread that corresponds to the vCPU" and then the VMM can do more on top as/if needed (because it has access to the shared memory too). Does that make sense? Even in the extreme example, the stuff the kernel would do would still be helpful, but not sufficient. You can aggregate the util_guest/uclamp and do whatever from the VMM. Technically in the extreme example, you don't need any of this. The normal util tracking of the vCPU thread on the host side would be sufficient. Actually any time we have only 1 vCPU host thread per VM, we shouldn't be using anything in this patch series and not instantiate the guest device at all. > So +1 from me to move this as a virtual device of some kind. And if the > extra cost of exiting all the way back to userspace is prohibitive (is > it btw?), I think the "13% increase in battery consumption for games" makes it pretty clear that going to userspace is prohibitive. And that's just one example. > then we can try to work on that. Maybe something a la vhost > can be done to optimize, I'll have a think. > > > The one thing I'd like to understand that the comment seems to imply > > that there is a significant difference in overhead between a hypercall > > and an MMIO. In my experience, both are pretty similar in cost for a > > handling location (both in userspace or both in the kernel). 
MMIO > > handling is a tiny bit more expensive due to a guaranteed TLB miss > > followed by a walk of the in-kernel device ranges, but that's all. It > > should hardly register. > > > > And if you really want some super-low latency, low overhead > > signalling, maybe an exception is the wrong tool for the job. Shared > > memory communication could be more appropriate. > > I presume some kind of signalling mechanism will be necessary to > synchronously update host scheduling parameters in response to guest > frequency requests, but if the volume of data requires it then a shared > buffer + doorbell type of approach should do. Part of the communication doesn't need synchronous handling by the host. So, what I said above. > Thinking about it, using SCMI over virtio would implement exactly that. > Linux-as-a-guest already supports it IIRC, so possibly the problem > being addressed in this series could be 'simply' solved using an SCMI > backend in the VMM... This will be worse than all the options we've tried so far because it has the userspace overhead AND uclamp overhead. -Saravana
On Wed, Apr 5, 2023 at 1:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > Hi, > > > > This patch series is a continuation of the talk Saravana gave at LPC 2022 > > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist > > of the talk is that workloads running in a guest VM get terrible task > > placement and DVFS behavior when compared to running the same workload in > > the host. Effectively, no EAS for threads inside VMs. This would make power > > and performance terrible just by running the workload in a VM even if we > > assume there is zero virtualization overhead. > > > > We have been iterating over different options for communicating between > > guest and host, ways of applying the information coming from the > > guest/host, etc to figure out the best performance and power improvements > > we could get. > > > > The patch series in its current state is NOT meant for landing in the > > upstream kernel. We are sending this patch series to share the current > > progress and data we have so far. The patch series is meant to be easy to > > cherry-pick and test on various devices to see what performance and power > > benefits this might give for others. > > > > With this series, a workload running in a VM gets the same task placement > > and DVFS treatment as it would when running in the host. > > > > As expected, we see significant performance improvement and better > > performance/power ratio. If anyone else wants to try this out for your VM > > workloads and report findings, that'd be very much appreciated. > > > > The idea is to improve VM CPUfreq/sched behavior by: > > - Having guest kernel to do accurate load tracking by taking host CPU > > arch/type and frequency into account. > > - Sharing vCPU run queue utilization information with the host so that the > > host can do proper frequency scaling and task placement on the host side. > > So, not having actually been send many of the patches I've no idea what > you've done... Please, eradicate this ridiculous idea of sending random > people a random subset of a patch series. Either send all of it or none, > this is a bloody nuisance. Sorry, that was our intention, but had a scripting error. It's been fixed. I have a script to use with git send-email's --to-cmd and --cc-cmd option. It uses get_maintainers.pl to figure out who to email, but it gets trickier for a patch series that spans maintainer trees. v2 and later will have everyone get all the patches. > Having said that; my biggest worry is that you're making scheduler > internals into an ABI. I would hate for this paravirt interface to tie > us down. The only 2 pieces of information shared between host/guest are: 1. Host CPU frequency -- this isn't really scheduler internals and will map nicely to a virtual cpufreq driver. 2. A vCPU util value between 0 - 1024 where 1024 corresponds to the highest performance point across all CPUs (taking freq, arch, etc into consideration). Yes, this currently matches how the run queue util is tracked, but we can document the interface as "percentage of max performance capability", but representing it as 0 - 1024 instead of 0-100. That way, even if the scheduler changes how it tracks util in the future, we can still keep this interface between guest/host and map it appropriately on the host end. In either case, we could even have a Windows guest where they might track vCPU utilization differently and still have this work with the Linux host with this interface. 
Does that sound reasonable to you? Another option is to convert (2) into a "CPU frequency" request (but without latching it to values in the CPUfreq table) but it'll add some unnecessary math (with division) on the guest and host end. But I'd rather keep it as 0-1024 unless you really want this 2nd option. -Saravana
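[One plausible reading of the arithmetic behind such a 0-1024 "percentage of
max performance capability" value -- a sketch only, not the code in the
series: scale the raw run-queue utilization by the relative capacity of the
CPU the vCPU runs on and by its current-to-maximum frequency ratio, then
clamp.]

#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024U

/*
 * Illustrative only: derive a 0..1024 "fraction of max performance" value
 * from a raw utilization (0..1024 relative to the current CPU at its
 * current frequency), the relative capacity of that CPU (0..1024, with
 * 1024 = biggest core at max frequency) and its current/max frequency.
 */
static inline uint32_t vcpu_perf_request(uint32_t raw_util, uint32_t cpu_cap,
					 uint32_t cur_freq, uint32_t max_freq)
{
	uint64_t val = raw_util;

	val = (val * cpu_cap) / SCHED_CAPACITY_SCALE;	/* arch/capacity invariance */
	val = (val * cur_freq) / max_freq;		/* frequency invariance */

	return val > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : (uint32_t)val;
}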
On Wed, Apr 05, 2023 at 02:08:43PM -0700, Saravana Kannan wrote: > Sorry, that was our intention, but had a scripting error. It's been fixed. > > I have a script to use with git send-email's --to-cmd and --cc-cmd > option. It uses get_maintainers.pl to figure out who to email, but it > gets trickier for a patch series that spans maintainer trees. What I do is I simply run get_maintainers.pl against the full series diff and CC everybody the same. Then again, I don't use git-send-email, so I've no idea how to use that.
On Wed, Apr 05, 2023 at 02:08:43PM -0700, Saravana Kannan wrote:
> The only 2 pieces of information shared between host/guest are:
>
> 1. Host CPU frequency -- this isn't really scheduler internals and
> will map nicely to a virtual cpufreq driver.
>
> 2. A vCPU util value between 0 - 1024 where 1024 corresponds to the
> highest performance point across all CPUs (taking freq, arch, etc into
> consideration). Yes, this currently matches how the run queue util is
> tracked, but we can document the interface as "percentage of max
> performance capability", but representing it as 0 - 1024 instead of
> 0-100. That way, even if the scheduler changes how it tracks util in
> the future, we can still keep this interface between guest/host and
> map it appropriately on the host end.
>
> In either case, we could even have a Windows guest where they might
> track vCPU utilization differently and still have this work with the
> Linux host with this interface.
>
> Does that sound reasonable to you?

Yeah, I suppose that's manageable.

Something that wasn't initially clear to me; all this hard assumes a 1:1
vCPU:CPU relation, right? Which isn't typical in virt land.
On Wed, 05 Apr 2023 22:00:59 +0100, Saravana Kannan <saravanak@google.com> wrote: > > On Tue, Apr 4, 2023 at 1:49 PM Marc Zyngier <maz@kernel.org> wrote: > > > > On Tue, 04 Apr 2023 20:43:40 +0100, > > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > Folks, > > > > > > On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote: > > > > > > <snip> > > > > > > > PCMark > > > > Higher is better > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Test Case (score) | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Weighted Total | 6136 | 7274 | +19% | 6867 | +12% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Web Browsing | 5558 | 6273 | +13% | 6035 | +9% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Video Editing | 4921 | 5221 | +6% | 5167 | +5% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Writing | 6864 | 8825 | +29% | 8529 | +24% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Photo Editing | 7983 | 11593 | +45% | 10812 | +35% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > | Data Manipulation | 5814 | 6081 | +5% | 5327 | -8% | > > > > +-------------------+----------+------------+--------+-------+--------+ > > > > > > > > PCMark Performance/mAh > > > > Higher is better > > > > +-----------+----------+-----------+--------+------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-----------+----------+-----------+--------+------+--------+ > > > > | Score/mAh | 79 | 88 | +11% | 83 | +7% | > > > > +-----------+----------+-----------+--------+------+--------+ > > > > > > > > Roblox > > > > Higher is better > > > > +-----+----------+------------+--------+-------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +-----+----------+------------+--------+-------+--------+ > > > > | FPS | 18.25 | 28.66 | +57% | 24.06 | +32% | > > > > +-----+----------+------------+--------+-------+--------+ > > > > > > > > Roblox Frames/mAh > > > > Higher is better > > > > +------------+----------+------------+--------+--------+--------+ > > > > | | Baseline | Hypercall | %delta | MMIO | %delta | > > > > +------------+----------+------------+--------+--------+--------+ > > > > | Frames/mAh | 91.25 | 114.64 | +26% | 103.11 | +13% | > > > > +------------+----------+------------+--------+--------+--------+ > > > > > > </snip> > > > > > > > Next steps: > > > > =========== > > > > We are continuing to look into communication mechanisms other than > > > > hypercalls that are just as/more efficient and avoid switching into the VMM > > > > userspace. Any inputs in this regard are greatly appreciated. > > Hi Oliver and Marc, > > Replying to both of you in this one email. > > > > > > > We're highly unlikely to entertain such an interface in KVM. > > > > > > The entire feature is dependent on pinning vCPUs to physical cores, for which > > > userspace is in the driver's seat. That is a well established and documented > > > policy which can be seen in the way we handle heterogeneous systems and > > > vPMU. > > > > > > Additionally, this bloats the KVM PV ABI with highly VMM-dependent interfaces > > > that I would not expect to benefit the typical user of KVM. 
> > > > > > Based on the data above, it would appear that the userspace implementation is > > > in the same neighborhood as a KVM-based implementation, which only further > > > weakens the case for moving this into the kernel. > > Oliver, > > Sorry if the tables/data aren't presented in an intuitive way, but > MMIO vs hypercall is definitely not in the same neighborhood. The > hypercall method often gives close to 2x the improvement that the MMIO > method gives. For example: > > - Roblox FPS: MMIO improves it by 32% vs hypercall improves it by 57%. > - Frames/mAh: MMIO improves it by 13% vs hypercall improves it by 26%. > - PC Mark Data manipulation: MMIO makes it worse by 8% vs hypercall > improves it by 5% > > Hypercall does better for other cases too, just not as good. For example, > - PC Mark Photo editing: Going from MMIO to hypercall gives a 10% improvement. > > These are all pretty non-trivial, at least in the mobile world. Heck, > whole teams would spend months for 2% improvement in battery :) > > > > > > > I certainly can appreciate the motivation for the series, but this feature > > > should be in userspace as some form of a virtual device. > > > > +1 on all of the above. > > Marc and Oliver, > > We are not tied to hypercalls. We want to do the right thing here, but > MMIO going all the way to userspace definitely doesn't cut it as is. > This is where we need some guidance. See more below. I don't buy this assertion at all. An MMIO in userspace is already much better than nothing. One of my many objection to the whole series is that it is built as a massively invasive thing that has too many fingers in too many pies, with unsustainable assumptions such as 1:1 mapping between CPU and vCPUs. I'd rather you build something simple first (pure userspace using MMIOs), work out where the bottlenecks are, and work with us to add what is needed to get to something sensible, and only that. I'm not willing to sacrifice maintainability for maximum performance (the whole thing reminds me of the in-kernel http server...). > > > The one thing I'd like to understand that the comment seems to imply > > that there is a significant difference in overhead between a hypercall > > and an MMIO. In my experience, both are pretty similar in cost for a > > handling location (both in userspace or both in the kernel). > > I think the main difference really is that in our hypercall vs MMIO > comparison the hypercall is handled in the kernel vs MMIO goes all the > way to userspace. I agree with you that the difference probably won't > be significant if both of them go to the same "depth" in the privilege > levels. > > > MMIO > > handling is a tiny bit more expensive due to a guaranteed TLB miss > > followed by a walk of the in-kernel device ranges, but that's all. It > > should hardly register. > > > > And if you really want some super-low latency, low overhead > > signalling, maybe an exception is the wrong tool for the job. Shared > > memory communication could be more appropriate. > > Yeah, that's one of our next steps. Ideally, we want to use shared > memory for the host to guest information flow. It's a 32-bit value > representing the current frequency that the host can update whenever > the host CPU frequency changes and the guest can read whenever it > needs it. Why should the guest care? Why can't the guest ask for an arbitrary capacity, and get what it gets? You give no information as to *why* you are doing what you are doing... 
> > For guest to host information flow, we'll need a kick from guest to > host because we need to take action on the host side when threads > migrate between vCPUs and cause a significant change in vCPU util. > Again it can be just a shared memory and some kick. This is what we > are currently trying to figure out how to do. That kick would have to go to userspace. There is no way I'm willing to introduce scheduling primitives inside KVM (the ones we have are ridiculously bad anyway), and I very much want to avoid extra PV gunk. > If there are APIs to do this, can you point us to those please? We'd > also want the shared memory to be accessible by the VMM (so, shared > between guest kernel, host kernel and VMM). By default, *ALL* the memory is shared. Isn't that wonderful? > > Are the above next steps sane? Or is that a no-go? The main thing we > want to cut out is the need for having to switch to userspace for > every single interaction because, as is, it leaves a lot on the table. Well, for a start, you could disclose how often you hit this DVFS "device", and when are the critical state changes that must happen immediately vs those that can simply be posted without having to take immediate effect. This sort of information would be much more interesting than a bunch of benchmarks I know nothing about. Thanks, M.
On Wednesday 05 Apr 2023 at 14:07:18 (-0700), Saravana Kannan wrote: > On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team > > And I concur with all the above as well. Putting this in the kernel is > > not an obvious fit at all as that requires a number of assumptions about > > the VMM. > > > > As Oliver pointed out, the guest topology, and how it maps to the host > > topology (vcpu pinning etc) is very much a VMM policy decision and will > > be particularly important to handle guest frequency requests correctly. > > > > In addition to that, the VMM's software architecture may have an impact. > > Crosvm for example does device emulation in separate processes for > > security reasons, so it is likely that adjusting the scheduling > > parameters ('util_guest', uclamp, or else) only for the vCPU thread that > > issues frequency requests will be sub-optimal for performance, we may > > want to adjust those parameters for all the tasks that are on the > > critical path. > > > > And at an even higher level, assuming in the kernel a certain mapping of > > vCPU threads to host threads feels kinda wrong, this too is a host > > userspace policy decision I believe. Not that anybody in their right > > mind would want to do this, but I _think_ it would technically be > > feasible to serialize the execution of multiple vCPUs on the same host > > thread, at which point the util_guest thingy becomes entirely bogus. (I > > obviously don't want to conflate this use-case, it's just an example > > that shows the proposed abstraction in the series is not a perfect fit > > for the KVM userspace delegation model.) > > See my reply to Oliver and Marc. To me it looks like we are converging > towards having shared memory between guest, host kernel and VMM and > that should address all our concerns. Hmm, that is not at all my understanding of what has been the most important part of the feedback so far: this whole thing belongs to userspace. > The guest will see a MMIO device, writing to it will trigger the host > kernel to do the basic "set util_guest/uclamp for the vCPU thread that > corresponds to the vCPU" and then the VMM can do more on top as/if > needed (because it has access to the shared memory too). Does that > make sense? Not really no. I've given examples of why this doesn't make sense for the kernel to do this, which still seems to be the case with what you're suggesting here. > Even in the extreme example, the stuff the kernel would do would still > be helpful, but not sufficient. You can aggregate the > util_guest/uclamp and do whatever from the VMM. > Technically in the extreme example, you don't need any of this. The > normal util tracking of the vCPU thread on the host side would be > sufficient. > > Actually any time we have only 1 vCPU host thread per VM, we shouldn't > be using anything in this patch series and not instantiate the guest > device at all. > > So +1 from me to move this as a virtual device of some kind. And if the > > extra cost of exiting all the way back to userspace is prohibitive (is > > it btw?), > > I think the "13% increase in battery consumption for games" makes it > pretty clear that going to userspace is prohibitive. And that's just > one example. I beg to differ. We need to understand where these 13% come from in more details. Is it really the actual cost of the userspace exit? Or is it just that from userspace the only knob you can play with is uclamp and that didn't reach the expected level of performance? 
If that is the userspace exit, then we can work to optimize that -- it's a fairly common problem in the virt world, nothing special here. And if the issue is the lack of expressiveness in uclamp, then that too is something we should work on, but clearly giving vCPU threads more 'power' than normal host threads is a bit of a red flag IMO. vCPU threads must be constrained in the same way that userspace threads are, because they _are_ userspace threads. > > then we can try to work on that. Maybe something a la vhost > > can be done to optimize, I'll have a think. > > > > > The one thing I'd like to understand that the comment seems to imply > > > that there is a significant difference in overhead between a hypercall > > > and an MMIO. In my experience, both are pretty similar in cost for a > > > handling location (both in userspace or both in the kernel). MMIO > > > handling is a tiny bit more expensive due to a guaranteed TLB miss > > > followed by a walk of the in-kernel device ranges, but that's all. It > > > should hardly register. > > > > > > And if you really want some super-low latency, low overhead > > > signalling, maybe an exception is the wrong tool for the job. Shared > > > memory communication could be more appropriate. > > > > I presume some kind of signalling mechanism will be necessary to > > synchronously update host scheduling parameters in response to guest > > frequency requests, but if the volume of data requires it then a shared > > buffer + doorbell type of approach should do. > > Part of the communication doesn't need synchronous handling by the > host. So, what I said above. I've also replied to another message about the scale invariance issue, and I'm not convinced the frequency based interface proposed here really makes sense. An AMU-like interface is very likely to be superior. > > Thinking about it, using SCMI over virtio would implement exactly that. > > Linux-as-a-guest already supports it IIRC, so possibly the problem > > being addressed in this series could be 'simply' solved using an SCMI > > backend in the VMM... > > This will be worse than all the options we've tried so far because it > has the userspace overhead AND uclamp overhead. But it doesn't violate the whole KVM userspace delegation model, so we should start from there and then optimize further if need be. Thanks, Quentin
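For reference, "uclamp as the only knob from userspace" amounts to something like the sketch below: the VMM boosts (or relaxes) the minimum utilization clamp of a vCPU thread with sched_setattr(2). The struct and flag values mirror the uapi headers because glibc provides no wrapper; treat this as an illustration of the mechanism being debated, not as code from the series.

/*
 * Minimal sketch: a VMM adjusting uclamp.min for one vCPU thread.
 * struct sched_attr and the SCHED_FLAG_* values are copied from
 * include/uapi/linux/sched*.h since there is no glibc wrapper.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_POLICY      0x08
#define SCHED_FLAG_KEEP_PARAMS      0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN   0x20

/* Raise the vCPU thread's minimum utilization clamp; 'util' is 0..1024. */
static int vcpu_set_uclamp_min(pid_t vcpu_tid, uint32_t util)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    /* Keep the existing policy/params, only change the clamp. */
    attr.sched_flags = SCHED_FLAG_KEEP_POLICY |
                       SCHED_FLAG_KEEP_PARAMS |
                       SCHED_FLAG_UTIL_CLAMP_MIN;
    attr.sched_util_min = util;

    return syscall(SYS_sched_setattr, vcpu_tid, &attr, 0);
}

Because this only touches uclamp, the vCPU thread stays constrained exactly like any other userspace thread; whether it can match the performance of the in-kernel util_guest approach is the open question above.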
On Thu, Apr 6, 2023 at 5:52 AM Quentin Perret <qperret@google.com> wrote: > > On Wednesday 05 Apr 2023 at 14:07:18 (-0700), Saravana Kannan wrote: > > On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team > > > And I concur with all the above as well. Putting this in the kernel is > > > not an obvious fit at all as that requires a number of assumptions about > > > the VMM. > > > > > > As Oliver pointed out, the guest topology, and how it maps to the host > > > topology (vcpu pinning etc) is very much a VMM policy decision and will > > > be particularly important to handle guest frequency requests correctly. > > > > > > In addition to that, the VMM's software architecture may have an impact. > > > Crosvm for example does device emulation in separate processes for > > > security reasons, so it is likely that adjusting the scheduling > > > parameters ('util_guest', uclamp, or else) only for the vCPU thread that > > > issues frequency requests will be sub-optimal for performance, we may > > > want to adjust those parameters for all the tasks that are on the > > > critical path. > > > > > > And at an even higher level, assuming in the kernel a certain mapping of > > > vCPU threads to host threads feels kinda wrong, this too is a host > > > userspace policy decision I believe. Not that anybody in their right > > > mind would want to do this, but I _think_ it would technically be > > > feasible to serialize the execution of multiple vCPUs on the same host > > > thread, at which point the util_guest thingy becomes entirely bogus. (I > > > obviously don't want to conflate this use-case, it's just an example > > > that shows the proposed abstraction in the series is not a perfect fit > > > for the KVM userspace delegation model.) > > > > See my reply to Oliver and Marc. To me it looks like we are converging > > towards having shared memory between guest, host kernel and VMM and > > that should address all our concerns. > > Hmm, that is not at all my understanding of what has been the most > important part of the feedback so far: this whole thing belongs to > userspace. > > > The guest will see a MMIO device, writing to it will trigger the host > > kernel to do the basic "set util_guest/uclamp for the vCPU thread that > > corresponds to the vCPU" and then the VMM can do more on top as/if > > needed (because it has access to the shared memory too). Does that > > make sense? > > Not really no. I've given examples of why this doesn't make sense for > the kernel to do this, which still seems to be the case with what you're > suggesting here. > > > Even in the extreme example, the stuff the kernel would do would still > > be helpful, but not sufficient. You can aggregate the > > util_guest/uclamp and do whatever from the VMM. > > Technically in the extreme example, you don't need any of this. The > > normal util tracking of the vCPU thread on the host side would be > > sufficient. > > > > Actually any time we have only 1 vCPU host thread per VM, we shouldn't > > be using anything in this patch series and not instantiate the guest > > device at all. > > > > So +1 from me to move this as a virtual device of some kind. And if the > > > extra cost of exiting all the way back to userspace is prohibitive (is > > > it btw?), > > > > I think the "13% increase in battery consumption for games" makes it > > pretty clear that going to userspace is prohibitive. And that's just > > one example. > Hi Quentin, Appreciate the feedback, > I beg to differ. 
We need to understand where this 13% comes from in more > detail. Is it really the actual cost of the userspace exit? Or is it > just that from userspace the only knob you can play with is uclamp and > that didn't reach the expected level of performance? To clarify, the MMIO numbers shown in the cover letter were collected while updating the vCPU task's util_guest rather than uclamp_min. In that configuration, userspace (the VMM) handles the mmio_exit from the guest and makes an ioctl on the host kernel to update util_guest for the vCPU task. > > If that is the userspace exit, then we can work to optimize that -- it's > a fairly common problem in the virt world, nothing special here. > Ok, we're open to suggestions on how to better optimize here. > And if the issue is the lack of expressiveness in uclamp, then that too > is something we should work on, but clearly giving vCPU threads more > 'power' than normal host threads is a bit of a red flag IMO. vCPU > threads must be constrained in the same way that userspace threads are, > because they _are_ userspace threads. > > > > then we can try to work on that. Maybe something a la vhost > > > can be done to optimize, I'll have a think. > > > > > > > The one thing I'd like to understand that the comment seems to imply > > > > that there is a significant difference in overhead between a hypercall > > > > and an MMIO. In my experience, both are pretty similar in cost for a > > > > handling location (both in userspace or both in the kernel). MMIO > > > > handling is a tiny bit more expensive due to a guaranteed TLB miss > > > > followed by a walk of the in-kernel device ranges, but that's all. It > > > > should hardly register. > > > > > > > > And if you really want some super-low latency, low overhead > > > > signalling, maybe an exception is the wrong tool for the job. Shared > > > > memory communication could be more appropriate. > > > > > > I presume some kind of signalling mechanism will be necessary to > > > synchronously update host scheduling parameters in response to guest > > > frequency requests, but if the volume of data requires it then a shared > > > buffer + doorbell type of approach should do. > > > > Part of the communication doesn't need synchronous handling by the > > host. So, what I said above. > > I've also replied to another message about the scale invariance issue, > and I'm not convinced the frequency based interface proposed here really > makes sense. An AMU-like interface is very likely to be superior. > Some sort of AMU-based interface was discussed offline with Saravana, but I'm not sure how to best implement that. If you have any pointers to get started, that would be helpful. > > > Thinking about it, using SCMI over virtio would implement exactly that. > > > Linux-as-a-guest already supports it IIRC, so possibly the problem > > > being addressed in this series could be 'simply' solved using an SCMI > > > backend in the VMM... > > > > This will be worse than all the options we've tried so far because it > > has the userspace overhead AND uclamp overhead. > > But it doesn't violate the whole KVM userspace delegation model, so we > should start from there and then optimize further if need be. Do you have any references we could use to get started with SCMI? (e.g. SCMI backend support in CrosVM). For RFC v3, I'll post a CPUfreq driver implementation that only uses MMIO and requires no host kernel modifications (i.e. only using uclamp as a knob to tune the host), along with performance numbers, and then work on optimizing from there. Thanks, David > > Thanks, > Quentin
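For clarity, the userspace round trip being measured above looks roughly like the following in the VMM's vCPU run loop: a guest write to a (hypothetical) DVFS MMIO window comes back to the VMM as KVM_EXIT_MMIO, and the VMM turns the written value into a per-vCPU scheduling hint, i.e. the uclamp or util_guest update mentioned above. Only the KVM_EXIT_MMIO plumbing is standard KVM API; the addresses and the hint helper are made up for illustration.

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define DVFS_MMIO_BASE  0x0d000000ULL   /* hypothetical guest physical address */
#define DVFS_MMIO_SIZE  0x1000ULL

/* e.g. the sched_setattr()/uclamp helper sketched earlier. */
extern void apply_vcpu_perf_hint(int vcpu_index, uint32_t value);

static int vcpu_run_loop(int vcpu_fd, int vcpu_index, struct kvm_run *run)
{
    uint32_t val;

    /* 'run' is the vCPU's mmap()ed kvm_run area. */
    for (;;) {
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
            return -1;

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:
            /* Ignore anything outside the hypothetical DVFS window. */
            if (!run->mmio.is_write ||
                run->mmio.phys_addr < DVFS_MMIO_BASE ||
                run->mmio.phys_addr >= DVFS_MMIO_BASE + DVFS_MMIO_SIZE)
                break;
            val = 0;
            memcpy(&val, run->mmio.data,
                   run->mmio.len < sizeof(val) ? run->mmio.len : sizeof(val));
            /* Guest asked for a performance level; apply it on the host. */
            apply_vcpu_perf_hint(vcpu_index, val);
            break;
        default:
            /* Other exit reasons are handled elsewhere in the VMM. */
            break;
        }
    }
}

Each such write costs a full guest exit plus a return to userspace, which is the overhead the 13% figure is being asked to break down.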
>> This patch series is a continuation of the talk Saravana gave at LPC 2022 >> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist >> of the talk is that workloads running in a guest VM get terrible task >> placement and DVFS behavior when compared to running the same workload in >> the host. Effectively, no EAS for threads inside VMs. This would make power >> and performance terrible just by running the workload in a VM even if we >> assume there is zero virtualization overhead. >> >> We have been iterating over different options for communicating between >> guest and host, ways of applying the information coming from the >> guest/host, etc to figure out the best performance and power improvements >> we could get. >> >> The patch series in its current state is NOT meant for landing in the >> upstream kernel. We are sending this patch series to share the current >> progress and data we have so far. The patch series is meant to be easy to >> cherry-pick and test on various devices to see what performance and power >> benefits this might give for others. >> >> With this series, a workload running in a VM gets the same task placement >> and DVFS treatment as it would when running in the host. >> >> As expected, we see significant performance improvement and better >> performance/power ratio. If anyone else wants to try this out for your VM >> workloads and report findings, that'd be very much appreciated. >> >> The idea is to improve VM CPUfreq/sched behavior by: >> - Having guest kernel to do accurate load tracking by taking host CPU >> arch/type and frequency into account. >> - Sharing vCPU run queue utilization information with the host so that the >> host can do proper frequency scaling and task placement on the host side. >> > > [...] > >> >> Next steps: >> =========== >> We are continuing to look into communication mechanisms other than >> hypercalls that are just as/more efficient and avoid switching into the VMM >> userspace. Any inputs in this regard are greatly appreciated. >> > > I am trying to understand why virtio based cpufrq does not work here? > The VMM on host can process requests from guest VM like freq table, > current frequency and setting the min_freq. I believe Virtio backend > has mechanisms for acceleration (vhost) so that user space is not > involved for every frequency request from the guest. > > It has advantages of (1) Hypervisor agnostic (virtio basically) > (2) scheduler does not need additional input, the aggregated min_freq > requests from all guest should be sufficient. Also want to add (3): a virtio-based solution would definitely be better from a performance POV as it would avoid the expensive vmexits we have with hypercalls. Thanks, Pankaj
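There is no standard virtio cpufreq device today, so the following is a purely hypothetical request layout to illustrate the suggestion above: the guest cpufreq driver would queue requests like these on a virtqueue, and the backend (the VMM, or a vhost worker if userspace is to be bypassed) would apply them on the host, for example as per-vCPU frequency floors.

/*
 * Purely hypothetical virtio request layout (no such device type exists in
 * the virtio spec today). The guest cpufreq driver places these on a
 * virtqueue; the backend applies them, e.g. as a per-vCPU frequency floor
 * or uclamp.min hint on the host.
 */
#include <stdint.h>

enum vcpufreq_req_type {
    VCPUFREQ_GET_FREQ_TABLE = 1,  /* backend fills in the available OPPs */
    VCPUFREQ_GET_CUR_FREQ   = 2,  /* backend reports the current frequency */
    VCPUFREQ_SET_MIN_FREQ   = 3,  /* guest requests a frequency floor */
};

struct vcpufreq_req {
    uint32_t type;      /* enum vcpufreq_req_type */
    uint32_t vcpu;      /* target vCPU index */
    uint32_t freq_khz;  /* payload for SET_MIN_FREQ */
    uint32_t status;    /* written by the backend on completion */
};

Whether this actually beats the MMIO or hypercall paths depends on whether the backend can service VCPUFREQ_SET_MIN_FREQ without waking the VMM for every request, which is the same synchronous-vs-posted split raised earlier in the thread.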
On Thu, Apr 27, 2023 at 11:52:29AM +0200, Gupta, Pankaj wrote: > > > > This patch series is a continuation of the talk Saravana gave at LPC 2022 > > > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist > > > of the talk is that workloads running in a guest VM get terrible task > > > placement and DVFS behavior when compared to running the same workload in > > > the host. Effectively, no EAS for threads inside VMs. This would make power > > > and performance terrible just by running the workload in a VM even if we > > > assume there is zero virtualization overhead. > > > > > > We have been iterating over different options for communicating between > > > guest and host, ways of applying the information coming from the > > > guest/host, etc to figure out the best performance and power improvements > > > we could get. > > > > > > The patch series in its current state is NOT meant for landing in the > > > upstream kernel. We are sending this patch series to share the current > > > progress and data we have so far. The patch series is meant to be easy to > > > cherry-pick and test on various devices to see what performance and power > > > benefits this might give for others. > > > > > > With this series, a workload running in a VM gets the same task placement > > > and DVFS treatment as it would when running in the host. > > > > > > As expected, we see significant performance improvement and better > > > performance/power ratio. If anyone else wants to try this out for your VM > > > workloads and report findings, that'd be very much appreciated. > > > > > > The idea is to improve VM CPUfreq/sched behavior by: > > > - Having guest kernel to do accurate load tracking by taking host CPU > > > arch/type and frequency into account. > > > - Sharing vCPU run queue utilization information with the host so that the > > > host can do proper frequency scaling and task placement on the host side. > > > > > > > [...] > > > > > > > > Next steps: > > > =========== > > > We are continuing to look into communication mechanisms other than > > > hypercalls that are just as/more efficient and avoid switching into the VMM > > > userspace. Any inputs in this regard are greatly appreciated. > > > > > > > I am trying to understand why virtio based cpufrq does not work here? > > The VMM on host can process requests from guest VM like freq table, > > current frequency and setting the min_freq. I believe Virtio backend > > has mechanisms for acceleration (vhost) so that user space is not > > involved for every frequency request from the guest. > > > > It has advantages of (1) Hypervisor agnostic (virtio basically) > > (2) scheduler does not need additional input, the aggregated min_freq > > requests from all guest should be sufficient. > > Also want to add, 3) virtio based solution would definitely be better from > performance POV as would avoid expense vmexits which we have with > hypercalls. > > I just went through the whole discussion, it seems David mentioned he would re-write this series with virtio frontend and VMM in user space taking care of the requests. will wait for that series to land. Thanks, Pavan