diff mbox series

[v4,1/2] dt-bindings: cpufreq: add virtual cpufreq device

Message ID 20231111014933.1934562-2-davidai@google.com
State New
Headers show
Series Improve VM CPUfreq and task placement behavior | expand

Commit Message

David Dai Nov. 11, 2023, 1:49 a.m. UTC
Adding bindings to represent a virtual cpufreq device.

Virtual machines may expose MMIO regions for a virtual cpufreq device
for guests to read frequency information or to request frequency
selection. The virtual cpufreq device has an individual controller for
each frequency domain.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
 1 file changed, 99 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml

Comments

Viresh Kumar Nov. 15, 2023, 6:27 a.m. UTC | #1
On 10-11-23, 17:49, David Dai wrote:
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq

Not sure why we need to mention QEMU here.. Why limit this to just QEMU ?
Marc Zyngier Nov. 15, 2023, 8:49 a.m. UTC | #2
On Sat, 11 Nov 2023 01:49:29 +0000,
David Dai <davidai@google.com> wrote:
> 
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.

I would really refrain form having absolute frequencies here. A
virtual machine can be migrated, and there are *zero* guarantees that
the target system has the same clock range as the source.

This really should be a relative number, much like the capacity. That,
at least, can be migrated across systems.

Thanks,

	M.
Rob Herring (Arm) Nov. 16, 2023, 4:22 p.m. UTC | #3
On Wed, Nov 15, 2023 at 11:57:41AM +0530, Viresh Kumar wrote:
> On 10-11-23, 17:49, David Dai wrote:
> > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> > +properties:
> > +  compatible:
> > +    const: qemu,virtual-cpufreq
> 
> Not sure why we need to mention QEMU here.. Why limit this to just QEMU ?

Because that is the implementation. And also to discourage anyone from 
using this on their hardware or thinking it is generic and the way to 
extend it to another implementation is adding properties.

Rob
Saravana Kannan Dec. 7, 2023, 10:44 p.m. UTC | #4
On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 11 Nov 2023 01:49:29 +0000,
> David Dai <davidai@google.com> wrote:
> >
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > for guests to read frequency information or to request frequency
> > selection. The virtual cpufreq device has an individual controller for
> > each frequency domain.
>
> I would really refrain form having absolute frequencies here. A
> virtual machine can be migrated, and there are *zero* guarantees that
> the target system has the same clock range as the source.
>
> This really should be a relative number, much like the capacity. That,
> at least, can be migrated across systems.

There's nothing in this patch that mandates absolute frequency.
In true KVM philosophy, we leave it to the VMM to decide.

-Saravana
Marc Zyngier Dec. 8, 2023, 8:52 a.m. UTC | #5
On Thu, 07 Dec 2023 22:44:36 +0000,
Saravana Kannan <saravanak@google.com> wrote:
> 
> On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Sat, 11 Nov 2023 01:49:29 +0000,
> > David Dai <davidai@google.com> wrote:
> > >
> > > Adding bindings to represent a virtual cpufreq device.
> > >
> > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > for guests to read frequency information or to request frequency
> > > selection. The virtual cpufreq device has an individual controller for
> > > each frequency domain.
> >
> > I would really refrain form having absolute frequencies here. A
> > virtual machine can be migrated, and there are *zero* guarantees that
> > the target system has the same clock range as the source.
> >
> > This really should be a relative number, much like the capacity. That,
> > at least, can be migrated across systems.
> 
> There's nothing in this patch that mandates absolute frequency.
> In true KVM philosophy, we leave it to the VMM to decide.

This has nothing to do with KVM. It would apply to any execution
environment, including QEMU in TCG mode.

To quote the original patch:

+    description:
+      Address and size of region containing frequency controls for each of the
+      frequency domains. Regions for each frequency domain is placed
+      contiugously and contain registers for controlling DVFS(Dynamic Frequency
+      and Voltage) characteristics. The size of the region is proportional to
+      total number of frequency domains.

What part of that indicates that *relative* frequencies are
acceptable? The example explicitly uses the opp-v2 binding, which
clearly is about absolute frequency.

To reiterate: absolute frequencies are not the right tool for the job,
and they should explicitly be described as relative in the spec. Not
left as a "whatev'" option for the execution environment to interpret.

	M.
Sudeep Holla Dec. 8, 2023, 12:45 p.m. UTC | #6
On Fri, Nov 10, 2023 at 05:49:29PM -0800, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.
> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>  .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
>  1 file changed, 99 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> 
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> new file mode 100644
> index 000000000000..16606cf1fd1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> @@ -0,0 +1,99 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Virtual CPUFreq
> +
> +maintainers:
> +  - David Dai <davidai@google.com>
> +  - Saravana Kannan <saravanak@google.com>
> +
> +description:
> +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> +  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
> +  is associated with a frequency domain which can be shared with other vCPUs.
> +  Each frequency domain has its own set of registers for frequency controls.
> +

Are these register controls described somewhere ? The reason I ask is we
should be able to have single implementation of this virtual cpufreq
irrespective of the firmware used(DT vs ACPI) IMO.

> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq
> +
> +  reg:
> +    maxItems: 1
> +    description:
> +      Address and size of region containing frequency controls for each of the
> +      frequency domains. Regions for each frequency domain is placed
> +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> +      and Voltage) characteristics. The size of the region is proportional to
> +      total number of frequency domains.
> +
> +required:
> +  - compatible
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    // This example shows a two CPU configuration with a frequency domain
> +    // for each CPU.
> +    cpus {
> +      #address-cells = <1>;
> +      #size-cells = <0>;
> +
> +      cpu@0 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table0>;
> +      };
> +
> +      cpu@1 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table1>;
> +      };
> +    };
> +
> +    opp_table0: opp-table-0 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1098000000 {
> +        opp-hz = /bits/ 64 <1098000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1197000000 {
> +        opp-hz = /bits/ 64 <1197000000>;
> +        opp-level = <2>;
> +      };
> +    };
> +
> +    opp_table1: opp-table-1 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1106000000 {
> +        opp-hz = /bits/ 64 <1106000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1277000000 {
> +        opp-hz = /bits/ 64 <1277000000>;
> +        opp-level = <2>;
> +      };
> +    };
>

I think using OPP with absolute frequencies seems not appropriate here.
Why can't these fetched from the registers and have some abstract values
instead ?

> +    soc {
> +      #address-cells = <1>;
> +      #size-cells = <1>;
> +
> +      cpufreq@1040000 {
> +        compatible = "qemu,virtual-cpufreq";
> +        reg = <0x1040000 0x10>;

So just 16bytes for 2 CPU system ? How does the register layout look like ?
I assume just 4 x 32bit registers: 2 for reading and 2 for setting the
frequencies ?

--
Regards,
Sudeep
Saravana Kannan Jan. 12, 2024, 10:02 p.m. UTC | #7
Sorry for the delay in response. Was very busy for a while and then
holidays started.

On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 07 Dec 2023 22:44:36 +0000,
> Saravana Kannan <saravanak@google.com> wrote:
> >
> > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Sat, 11 Nov 2023 01:49:29 +0000,
> > > David Dai <davidai@google.com> wrote:
> > > >
> > > > Adding bindings to represent a virtual cpufreq device.
> > > >
> > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > for guests to read frequency information or to request frequency
> > > > selection. The virtual cpufreq device has an individual controller for
> > > > each frequency domain.
> > >
> > > I would really refrain form having absolute frequencies here. A
> > > virtual machine can be migrated, and there are *zero* guarantees that
> > > the target system has the same clock range as the source.
> > >
> > > This really should be a relative number, much like the capacity. That,
> > > at least, can be migrated across systems.
> >
> > There's nothing in this patch that mandates absolute frequency.
> > In true KVM philosophy, we leave it to the VMM to decide.
>
> This has nothing to do with KVM. It would apply to any execution
> environment, including QEMU in TCG mode.
>
> To quote the original patch:
>
> +    description:
> +      Address and size of region containing frequency controls for each of the
> +      frequency domains. Regions for each frequency domain is placed
> +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> +      and Voltage) characteristics. The size of the region is proportional to
> +      total number of frequency domains.
>
> What part of that indicates that *relative* frequencies are
> acceptable? The example explicitly uses the opp-v2 binding, which
> clearly is about absolute frequency.

We can update the doc to make that clearer and update the example too.

> To reiterate: absolute frequencies are not the right tool for the job,
> and they should explicitly be described as relative in the spec. Not
> left as a "whatev'" option for the execution environment to interpret.

I think it depends on the use case. If there's no plan to migrate the
VM across different devices, there's no need to do the unnecessary
normalization back and forth.

And if we can translate between pCPU frequency and a normalized
frequency, we can do the same for whatever made up frequencies too. In
fact, we plan to do exactly that in our internal use cases for this.
There's nothing here that prevents the VMM from doing that.

Also, if there are hardware virtualized performance counters (AMU,
CPPC, etc) that are used for frequency normalization, then we have to
use the real frequencies in those devices otherwise the "current
frequency" can be 2 GHz while the max normalized frequency is 1024
KHz. That'll mess up load tracking.

Thanks,
Saravana
Saravana Kannan Jan. 12, 2024, 10:15 p.m. UTC | #8
On Fri, Dec 8, 2023 at 4:47 AM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> On Fri, Nov 10, 2023 at 05:49:29PM -0800, David Dai wrote:
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > for guests to read frequency information or to request frequency
> > selection. The virtual cpufreq device has an individual controller for
> > each frequency domain.
> >
> > Co-developed-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: David Dai <davidai@google.com>
> > ---
> >  .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
> >  1 file changed, 99 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > new file mode 100644
> > index 000000000000..16606cf1fd1a
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > @@ -0,0 +1,99 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Virtual CPUFreq
> > +
> > +maintainers:
> > +  - David Dai <davidai@google.com>
> > +  - Saravana Kannan <saravanak@google.com>
> > +
> > +description:
> > +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> > +  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
> > +  is associated with a frequency domain which can be shared with other vCPUs.
> > +  Each frequency domain has its own set of registers for frequency controls.
> > +
>
> Are these register controls described somewhere ? The reason I ask is we
> should be able to have single implementation of this virtual cpufreq
> irrespective of the firmware used(DT vs ACPI) IMO.

Agree that we want the same driver to work for DT and ACPI. This doc
was written to be similar to other DT binding docs that don't describe
the registers in the DT binding. The registers are pretty straight
forward (can tell from the code too). One register to "set" the
frequency and another to "get" the current frequency. We don't have
any ACPI expertise/hardware to test this on or care about it right
now. But David looked at some ACPI drivers and we think it should be
trivial to add ACPI support to this. Just a different set of probe
functions to register and populate the CPUfreq table.

>
> > +properties:
> > +  compatible:
> > +    const: qemu,virtual-cpufreq
> > +
> > +  reg:
> > +    maxItems: 1
> > +    description:
> > +      Address and size of region containing frequency controls for each of the
> > +      frequency domains. Regions for each frequency domain is placed
> > +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> > +      and Voltage) characteristics. The size of the region is proportional to
> > +      total number of frequency domains.
> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > +  - |
> > +    // This example shows a two CPU configuration with a frequency domain
> > +    // for each CPU.
> > +    cpus {
> > +      #address-cells = <1>;
> > +      #size-cells = <0>;
> > +
> > +      cpu@0 {
> > +        compatible = "arm,armv8";
> > +        device_type = "cpu";
> > +        reg = <0x0>;
> > +        operating-points-v2 = <&opp_table0>;
> > +      };
> > +
> > +      cpu@1 {
> > +        compatible = "arm,armv8";
> > +        device_type = "cpu";
> > +        reg = <0x0>;
> > +        operating-points-v2 = <&opp_table1>;
> > +      };
> > +    };
> > +
> > +    opp_table0: opp-table-0 {
> > +      compatible = "operating-points-v2";
> > +      opp-shared;
> > +
> > +      opp1098000000 {
> > +        opp-hz = /bits/ 64 <1098000000>;
> > +        opp-level = <1>;
> > +      };
> > +
> > +      opp1197000000 {
> > +        opp-hz = /bits/ 64 <1197000000>;
> > +        opp-level = <2>;
> > +      };
> > +    };
> > +
> > +    opp_table1: opp-table-1 {
> > +      compatible = "operating-points-v2";
> > +      opp-shared;
> > +
> > +      opp1106000000 {
> > +        opp-hz = /bits/ 64 <1106000000>;
> > +        opp-level = <1>;
> > +      };
> > +
> > +      opp1277000000 {
> > +        opp-hz = /bits/ 64 <1277000000>;
> > +        opp-level = <2>;
> > +      };
> > +    };
> >
>
> I think using OPP with absolute frequencies seems not appropriate here.
> Why can't these fetched from the registers and have some abstract values
> instead ?

Whether the frequencies are real or you want to cap it to 1024 and
normalize it, you still need to populate the CPUfreq table. And we
didn't want to reinvent the wheel and want to use existing means of
representing the table in as cross-architecture as possible -- so, DT
and ACPI should cover them all. For example, if we want to say CPU0
and 1 for a single CPUfreq policy, that's all already doable in DT. We
don't want to reinvent new schemes/register interfaces for that.

>
> > +    soc {
> > +      #address-cells = <1>;
> > +      #size-cells = <1>;
> > +
> > +      cpufreq@1040000 {
> > +        compatible = "qemu,virtual-cpufreq";
> > +        reg = <0x1040000 0x10>;
>
> So just 16bytes for 2 CPU system ? How does the register layout look like ?
> I assume just 4 x 32bit registers: 2 for reading and 2 for setting the
> frequencies ?

Yup. 2 registers per CPU frequency domain or policy.

Thanks,
Saravana
Marc Zyngier Jan. 13, 2024, 9:37 a.m. UTC | #9
On Fri, 12 Jan 2024 22:02:39 +0000,
Saravana Kannan <saravanak@google.com> wrote:
> 
> Sorry for the delay in response. Was very busy for a while and then
> holidays started.
> 
> On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Thu, 07 Dec 2023 22:44:36 +0000,
> > Saravana Kannan <saravanak@google.com> wrote:
> > >
> > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Sat, 11 Nov 2023 01:49:29 +0000,
> > > > David Dai <davidai@google.com> wrote:
> > > > >
> > > > > Adding bindings to represent a virtual cpufreq device.
> > > > >
> > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > > for guests to read frequency information or to request frequency
> > > > > selection. The virtual cpufreq device has an individual controller for
> > > > > each frequency domain.
> > > >
> > > > I would really refrain form having absolute frequencies here. A
> > > > virtual machine can be migrated, and there are *zero* guarantees that
> > > > the target system has the same clock range as the source.
> > > >
> > > > This really should be a relative number, much like the capacity. That,
> > > > at least, can be migrated across systems.
> > >
> > > There's nothing in this patch that mandates absolute frequency.
> > > In true KVM philosophy, we leave it to the VMM to decide.
> >
> > This has nothing to do with KVM. It would apply to any execution
> > environment, including QEMU in TCG mode.
> >
> > To quote the original patch:
> >
> > +    description:
> > +      Address and size of region containing frequency controls for each of the
> > +      frequency domains. Regions for each frequency domain is placed
> > +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> > +      and Voltage) characteristics. The size of the region is proportional to
> > +      total number of frequency domains.
> >
> > What part of that indicates that *relative* frequencies are
> > acceptable? The example explicitly uses the opp-v2 binding, which
> > clearly is about absolute frequency.
> 
> We can update the doc to make that clearer and update the example too.
> 
> > To reiterate: absolute frequencies are not the right tool for the job,
> > and they should explicitly be described as relative in the spec. Not
> > left as a "whatev'" option for the execution environment to interpret.
> 
> I think it depends on the use case. If there's no plan to migrate the
> VM across different devices, there's no need to do the unnecessary
> normalization back and forth.

VM migration is a given, specially when QEMU is involved. Designing
something that doesn't support it is a bug, plain and simple.

> And if we can translate between pCPU frequency and a normalized
> frequency, we can do the same for whatever made up frequencies too. In
> fact, we plan to do exactly that in our internal use cases for this.
> There's nothing here that prevents the VMM from doing that.
>
> Also, if there are hardware virtualized performance counters (AMU,
> CPPC, etc) that are used for frequency normalization, then we have to
> use the real frequencies in those devices otherwise the "current
> frequency" can be 2 GHz while the max normalized frequency is 1024
> KHz. That'll mess up load tracking.

And that's exactly why this shouldn't be a *frequency*, but a
performance scale or some other unit-less coefficient. Just like the
big-little capacity.

	M.
Hongyan Xia Jan. 15, 2024, 4:28 p.m. UTC | #10
On 11/11/2023 01:49, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.
> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>   .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
>   1 file changed, 99 insertions(+)
>   create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> 
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> new file mode 100644
> index 000000000000..16606cf1fd1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> @@ -0,0 +1,99 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Virtual CPUFreq
> +
> +maintainers:
> +  - David Dai <davidai@google.com>
> +  - Saravana Kannan <saravanak@google.com>
> +
> +description:
> +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> +  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
> +  is associated with a frequency domain which can be shared with other vCPUs.
> +  Each frequency domain has its own set of registers for frequency controls.
> +
> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq
> +
> +  reg:
> +    maxItems: 1
> +    description:
> +      Address and size of region containing frequency controls for each of the
> +      frequency domains. Regions for each frequency domain is placed
> +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> +      and Voltage) characteristics. The size of the region is proportional to
> +      total number of frequency domains.
> +
> +required:
> +  - compatible
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    // This example shows a two CPU configuration with a frequency domain
> +    // for each CPU.
> +    cpus {
> +      #address-cells = <1>;
> +      #size-cells = <0>;
> +
> +      cpu@0 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table0>;
> +      };
> +
> +      cpu@1 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table1>;
> +      };
> +    };
> +
> +    opp_table0: opp-table-0 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1098000000 {
> +        opp-hz = /bits/ 64 <1098000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1197000000 {
> +        opp-hz = /bits/ 64 <1197000000>;
> +        opp-level = <2>;
> +      };
> +    };
> +
> +    opp_table1: opp-table-1 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1106000000 {
> +        opp-hz = /bits/ 64 <1106000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1277000000 {
> +        opp-hz = /bits/ 64 <1277000000>;
> +        opp-level = <2>;
> +      };
> +    };

NIT: If my understanding is correct, it might be worth re-iterating that 
these OPPs should mirror the host frequency domain this vCPU is pinned to.

Also, since VM migration has been mentioned elsewhere in this thread, am 
I right in saying that you can't change these OPPs after registration? 
So, even if one wants to migrate, one has to migrate to an SoC with the 
same frequency domains anyway, otherwise the OPPs in the VM are entirely 
bogus?

> +
> +    soc {
> +      #address-cells = <1>;
> +      #size-cells = <1>;
> +
> +      cpufreq@1040000 {
> +        compatible = "qemu,virtual-cpufreq";
> +        reg = <0x1040000 0x10>;
> +      };
> +    };
Saravana Kannan Jan. 16, 2024, 11:47 p.m. UTC | #11
On Sat, Jan 13, 2024 at 1:37 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 12 Jan 2024 22:02:39 +0000,
> Saravana Kannan <saravanak@google.com> wrote:
> >
> > Sorry for the delay in response. Was very busy for a while and then
> > holidays started.
> >
> > On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Thu, 07 Dec 2023 22:44:36 +0000,
> > > Saravana Kannan <saravanak@google.com> wrote:
> > > >
> > > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> > > > >
> > > > > On Sat, 11 Nov 2023 01:49:29 +0000,
> > > > > David Dai <davidai@google.com> wrote:
> > > > > >
> > > > > > Adding bindings to represent a virtual cpufreq device.
> > > > > >
> > > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > > > for guests to read frequency information or to request frequency
> > > > > > selection. The virtual cpufreq device has an individual controller for
> > > > > > each frequency domain.
> > > > >
> > > > > I would really refrain form having absolute frequencies here. A
> > > > > virtual machine can be migrated, and there are *zero* guarantees that
> > > > > the target system has the same clock range as the source.
> > > > >
> > > > > This really should be a relative number, much like the capacity. That,
> > > > > at least, can be migrated across systems.
> > > >
> > > > There's nothing in this patch that mandates absolute frequency.
> > > > In true KVM philosophy, we leave it to the VMM to decide.
> > >
> > > This has nothing to do with KVM. It would apply to any execution
> > > environment, including QEMU in TCG mode.
> > >
> > > To quote the original patch:
> > >
> > > +    description:
> > > +      Address and size of region containing frequency controls for each of the
> > > +      frequency domains. Regions for each frequency domain is placed
> > > +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> > > +      and Voltage) characteristics. The size of the region is proportional to
> > > +      total number of frequency domains.
> > >
> > > What part of that indicates that *relative* frequencies are
> > > acceptable? The example explicitly uses the opp-v2 binding, which
> > > clearly is about absolute frequency.
> >
> > We can update the doc to make that clearer and update the example too.
> >
> > > To reiterate: absolute frequencies are not the right tool for the job,
> > > and they should explicitly be described as relative in the spec. Not
> > > left as a "whatev'" option for the execution environment to interpret.
> >
> > I think it depends on the use case. If there's no plan to migrate the
> > VM across different devices, there's no need to do the unnecessary
> > normalization back and forth.
>
> VM migration is a given, specially when QEMU is involved. Designing
> something that doesn't support it is a bug, plain and simple.

I'm not saying this patch series doesn't allow migrating. I'm just
pointing out that normalization might not always be worth it for a VMM
to do.

We'll update the example and documentation to used normalize values.
CPUfreq itself used KHz for the tables and we can't really change that
when we are emulating CPU freq scaling. Same goes for OPP table DT
property names.

But the values we use can be 0 to 1024 Hz and be normalized.

> > And if we can translate between pCPU frequency and a normalized
> > frequency, we can do the same for whatever made up frequencies too. In
> > fact, we plan to do exactly that in our internal use cases for this.
> > There's nothing here that prevents the VMM from doing that.
> >
> > Also, if there are hardware virtualized performance counters (AMU,
> > CPPC, etc) that are used for frequency normalization, then we have to
> > use the real frequencies in those devices otherwise the "current
> > frequency" can be 2 GHz while the max normalized frequency is 1024
> > KHz. That'll mess up load tracking.
>
> And that's exactly why this shouldn't be a *frequency*, but a
> performance scale or some other unit-less coefficient. Just like the
> big-little capacity.

Sorry I wasn't cleared in my explanation. Let me explain it better.
When performance counters are available, the scheduler uses them to
compute the current average CPU frequency over a scheduler tick. It
looks at the performance counters to figure out how many CPU cycles
have gone by and how much time has gone by and does (delta cycles /
delta seconds) to compute current CPU freq in Hz. So, when a HW
virtualized performance counter is available, and the scheduler inside
the VM uses it to compute the average CPU frequency, the resulting
frequency is going to be the real physical CPU frequency. And the CPU
frequency value the scheduler used to compute the PELT load will be
completely different from the normalized values and the load tracking
will be completely wrong.

Using their performance counters inside the VM reduces a lot of MMIO
access exits to the host or memory accesses. So it makes sense a VM
might want to use that. I was just trying to say that in that
situation, a VMM might have a good reason to just use the real
frequencies inside the VM too.

In any case, we can update the doc to use normalized values/examples
and call out this caveat.

Thanks,
Saravana
diff mbox series

Patch

diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
new file mode 100644
index 000000000000..16606cf1fd1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
@@ -0,0 +1,99 @@ 
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Virtual CPUFreq
+
+maintainers:
+  - David Dai <davidai@google.com>
+  - Saravana Kannan <saravanak@google.com>
+
+description:
+  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
+  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
+  is associated with a frequency domain which can be shared with other vCPUs.
+  Each frequency domain has its own set of registers for frequency controls.
+
+properties:
+  compatible:
+    const: qemu,virtual-cpufreq
+
+  reg:
+    maxItems: 1
+    description:
+      Address and size of region containing frequency controls for each of the
+      frequency domains. Regions for each frequency domain is placed
+      contiugously and contain registers for controlling DVFS(Dynamic Frequency
+      and Voltage) characteristics. The size of the region is proportional to
+      total number of frequency domains.
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    // This example shows a two CPU configuration with a frequency domain
+    // for each CPU.
+    cpus {
+      #address-cells = <1>;
+      #size-cells = <0>;
+
+      cpu@0 {
+        compatible = "arm,armv8";
+        device_type = "cpu";
+        reg = <0x0>;
+        operating-points-v2 = <&opp_table0>;
+      };
+
+      cpu@1 {
+        compatible = "arm,armv8";
+        device_type = "cpu";
+        reg = <0x0>;
+        operating-points-v2 = <&opp_table1>;
+      };
+    };
+
+    opp_table0: opp-table-0 {
+      compatible = "operating-points-v2";
+      opp-shared;
+
+      opp1098000000 {
+        opp-hz = /bits/ 64 <1098000000>;
+        opp-level = <1>;
+      };
+
+      opp1197000000 {
+        opp-hz = /bits/ 64 <1197000000>;
+        opp-level = <2>;
+      };
+    };
+
+    opp_table1: opp-table-1 {
+      compatible = "operating-points-v2";
+      opp-shared;
+
+      opp1106000000 {
+        opp-hz = /bits/ 64 <1106000000>;
+        opp-level = <1>;
+      };
+
+      opp1277000000 {
+        opp-hz = /bits/ 64 <1277000000>;
+        opp-level = <2>;
+      };
+    };
+
+    soc {
+      #address-cells = <1>;
+      #size-cells = <1>;
+
+      cpufreq@1040000 {
+        compatible = "qemu,virtual-cpufreq";
+        reg = <0x1040000 0x10>;
+      };
+    };