Message ID: 1448288921-30307-3-git-send-email-juri.lelli@arm.com
State: New
On 14/12/15 16:59, Mark Brown wrote:
> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> > On 11/12/15 17:49, Mark Brown wrote:
>
> > > The purpose of the capacity values is to influence the scheduler
> > > behaviour and hence performance. Without a concrete definition they're
> > > just magic numbers which have meaning only in terms of their effect on
> > > the performance of the system. That is a sufficiently complex outcome
> > > to ensure that there will be an element of taste in what the desired
> > > outcomes are. Sounds like tuneables to me.
>
> > Capacity values are meant to describe asymmetry (if any) of the system
> > CPUs to the scheduler. The scheduler can then use this additional bit of
> > information to try to make better scheduling decisions. Yes, having these
> > values available will end up giving you better performance, but I guess
> > this applies to any information we provide to the kernel (and scheduler);
> > the less dumb a subsystem is, the better we can make it work.
>
> This information is a magic number; there's never going to be a right
> answer. If it needs changing, it's not that the kernel is modeling a
> concrete thing like the relative performance of the A53 and A57 poorly
> or whatever, it's just that the relative values of number A and number B
> are not what the system integrator desires.
>
> > > If you are saying people should use other, more sensible, ways of
> > > specifying the final values that actually get used in production then
> > > why take the defaults from direct numbers in DT in the first place? If
> > > you are saying that people should tune and then put the values in here
> > > then that's problematic for the reasons I outlined.
>
> > IMHO, people should come up with default values that describe
> > heterogeneity in their system. Then use other ways to tune the system at
> > run time (depending on the workload maybe).
> My argument is that they should be describing the heterogeneity of their
> system by describing concrete properties of their system rather than by
> providing magic numbers.
>
> > As said, I understand your concerns; but what I still don't get is
> > where CPU capacity values are so different from, say, idle states'
> > min-residency-us. AFAIK there is a per-SoC benchmarking phase required
> > to come up with those values as well; you have to pick some benchmark
> > that stresses worst-case entry/exit while measuring energy, then make
> > calculations that tell you when it is wise to enter a particular idle
> > state. Ideally we should derive min residency from specs, but I'm not
> > sure that's how it works in practice.
>
> Those at least have a concrete physical value that it is possible to
> measure in a describable way that is unlikely to change based on the
> internals of the kernel. It would be kind of nice to have the broken-down
> numbers for entry time, exit time and power burn in suspend, but it's not
> clear it's worth the bother. It's also one of these things where we don't
> have any real proxies that get us anywhere in the ballpark of where we
> want to be.

I'm proposing to add a new value because I couldn't find any proxies in
the current bindings that bring us any closer to what we need. If I
failed in looking for them, and they actually exist, I'll personally be
more than happy to just rely on them instead of adding more stuff :-).

Interestingly, to me it sounds like we could actually use your first
paragraph above almost as it is to describe how to come up with capacity
values. In the documentation I put the following:

"One simple way to estimate CPU capacities is to iteratively run a
well-known CPU user space benchmark (e.g., sysbench, dhrystone, etc.) on
each CPU at maximum frequency and then normalize values w.r.t. the best
performing CPU."

I don't see why this should change if we decide that the scheduler has
to change in the future.
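[Editor's note: the estimation procedure quoted from the documentation above (run a benchmark on each CPU at maximum frequency, normalize against the best performer) can be sketched as follows. The per-CPU scores and the 1024 scale are illustrative assumptions, not part of the binding.]

```python
def normalize_capacities(scores, scale=1024):
    """Normalize per-CPU benchmark scores to capacity values.

    scores: mapping of cpu id -> benchmark score (higher is better),
    measured with each CPU running at its maximum frequency.
    Returns capacities relative to the best performing CPU.
    """
    best = max(scores.values())
    return {cpu: round(score * scale / best) for cpu, score in scores.items()}

# Hypothetical big.LITTLE scores: two A57s roughly 2.3x two A53s.
scores = {0: 4600, 1: 4600, 2: 2010, 3: 2010}
print(normalize_capacities(scores))
# -> {0: 1024, 1: 1024, 2: 447, 3: 447}
```

With these made-up scores the little cores come out at 447, which is the kind of value the binding's Example 1 uses.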
Also, looking again at section 2 of the idle-states bindings docs, we
have a nice and accurate description of what min-residency is, but not
much info about how we can actually measure it. Maybe expanding the docs
section regarding CPU capacity could help?

> > > It also seems a bit strange to expect people to do some tuning in one
> > > place initially and then additional tuning somewhere else later; from
> > > a user point of view I'd expect to always do my tuning in the same
> > > place.
>
> > I think that runtime tuning needs are much more complex and finer
> > grained than what you can achieve by playing with CPU capacities.
> > And I agree with you, users should only play with these other methods
> > I'm referring to; they should not mess around with platform description
> > bits. They should provide information about runtime needs, then the
> > scheduler (in this case) will do its best to give them acceptable
> > performance using improved knowledge about the platform.
>
> So then why isn't it adequate to just have things like the core types in
> there and work from there? Are we really expecting the tuning to be so
> much better, on the scale that we're expecting this to be accurate, that
> it's worth just jumping straight to magic numbers?

I take your point here that having fine-grained values might not really
give us appreciable differences (that is also why I proposed
capacity-scale in the first instance), but I'm not sure I'm getting what
you are proposing here. Today, and for arm only, we have a static table
representing CPUs' "efficiency":

/*
 * Table of relative efficiency of each processors
 * The efficiency value must fit in 20bit and the final
 * cpu_scale value must be in the range
 *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
 * in order to return at most 1 when DIV_ROUND_CLOSEST
 * is used to compute the capacity of a CPU.
 * Processors that are not defined in the table,
 * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
 */
static const struct cpu_efficiency table_efficiency[] = {
	{"arm,cortex-a15", 3891},
	{"arm,cortex-a7",  2048},
	{NULL, },
};

When the clock-frequency property is defined in DT, we try to find a
match for the compatible string in the table above and then use the
associated number to compute the capacity. Are you proposing to have
something like this for arm64 as well?

BTW, the only info I could find about those numbers is from this thread:

http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html

Vincent, do we have more precise information about these numbers
somewhere else? If I understand how that table was created, how do we
think we will extend it in the future to allow newer core types (say we
replicate this solution for arm64)? It seems that we have to change it,
rescaling values, each time we have a new core on the market. How can we
come up with relative numbers, in the future, comparing newer cores to
old ones (that might already be out of the market by that time)?

> > > Doing that and then switching to some other interface for real tuning
> > > seems especially odd and I'm not sure that's something that users are
> > > going to expect or understand.
>
> > As I'm saying above, users should not care about this first step of
> > platform description; not more than how much they care about other bits
> > in DTs that describe their platform.
>
> That may be your intention but I don't see how it is realistic to expect
> that this is what people will actually understand. It's a number, it
> has an effect and it's hard to see that people won't tune it; it's not
> like people don't have to edit DTs during system integration. People
> won't reliably read documentation or look in mailing list threads, and
> other than that it has all the properties of a tuning interface.

Eh, sad but true.
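[Editor's note: for readers unfamiliar with the table_efficiency mechanism quoted above, here is a simplified model of how the arm32 code combines the efficiency table with the DT clock-frequency property to get a relative capacity. The real code in arch/arm/kernel/topology.c uses a "middle_capacity" reference with extra rounding rules; this sketch only normalizes against the fastest CPU to illustrate the basic idea.]

```python
SCHED_CAPACITY_SCALE = 1024

# Mirror of the in-kernel table: compatible string -> relative efficiency.
TABLE_EFFICIENCY = {
    "arm,cortex-a15": 3891,
    "arm,cortex-a7": 2048,
}

def raw_capacity(compatible, clock_frequency_hz):
    eff = TABLE_EFFICIENCY.get(compatible)
    if eff is None:
        return None  # unknown core: the kernel falls back to the default scale
    # Shift the frequency down (as the kernel does with "rate >> 20") so the
    # product stays small, then weight it by the core's efficiency.
    return (clock_frequency_hz >> 20) * eff

def relative_capacities(cpus):
    """cpus: list of (compatible, clock-frequency in Hz) tuples."""
    raw = [raw_capacity(c, f) for c, f in cpus]
    best = max(r for r in raw if r is not None)
    return [SCHED_CAPACITY_SCALE if r is None
            else r * SCHED_CAPACITY_SCALE // best
            for r in raw]

cpus = [("arm,cortex-a15", 1_000_000_000), ("arm,cortex-a7", 1_000_000_000)]
print(relative_capacities(cpus))
# -> [1024, 538]  (2048/3891 of the A15's capacity, at equal clocks)
```

This also makes the extensibility concern concrete: adding a new core means choosing one more magic number relative to every existing entry.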
I guess we can, as we usually do, put more effort into documenting how
things are supposed to be used. Then, if people think that they can make
their system perform better without looking at the documentation or
asking around, I'm not sure there is much we can do to prevent them from
doing things wrong. There are already a lot of things people shouldn't
touch if they don't know what they are doing. :-/

> There's a tension here between what you're saying about people not being
> supposed to care much about the numbers for tuning and the very fact
> that there's a need for the DT to carry explicit numbers.

My point is that people with tuning needs shouldn't even look at DTs,
but put all their effort into describing (using appropriate APIs) their
needs and how they apply to the workload they care about. Our job is to
put together information coming from users and knowledge of the system
configuration to provide people the desired outcomes.

Best,

- Juri

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote:
> Hi Mark,

Hi Juri,

> On 15/12/15 14:01, Mark Rutland wrote:
> > I really don't want to see a table of magic numbers in the kernel.
>
> It doesn't seem to be a clean and scalable solution to me either. It is
> not easy to reconfigure when new core types come around, as I don't
> think relative data is always present or easy to derive, and it exposes
> some sort of centralized global information where everyone is compared
> against everyone.

I'm also concerned that it will be difficult to curate this to avoid
deceptive marketing numbers. These may not reflect reality.

> Where the DT solution is inherently per platform: no need to expose
> absolute values and no problems with knowing data regarding old core
> types.

The DT approach certainly avoids tying the kernel to a given idea of
particular microarchitectures.

> > If we cannot rely on external information, and want this information to
> > be derived by the kernel, then we need to perform some dynamic
> > benchmark. That would work for future CPUs the kernel knows nothing
> > about yet, and would cater for the pseudo-heterogeneous cases too.
>
> I've actually experimented a bit with this approach already, but I
> wasn't convinced of its viability. It is true that we remove the burden
> of coming up with default values from the user/integrator, but I'm
> pretty sure we will end up discussing endlessly which particular
> benchmark to pick

Regardless of which direction we go, there will be endless discussion as
to the benchmark. As Mark pointed out, that happened in the case of the
table, and it's happening now for the DT approach. I think we agree that
if this is something we can change later (i.e. we don't rely on an
external oracle like DT), the particular benchmark matters less, as it
can be changed given evidence of superiority.

> and the fact that it impacts boot time and such.
I was under the impression that the kernel already did RAID algorithm
benchmarking as part of the boot process. Maybe we can find a set of
similarly brief benchmarks.

Thanks,
Mark.
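[Editor's note: the brief boot-time benchmark discussed here could look something like the sketch below: time one fixed piece of work per CPU and derive relative capacities from the inverse of the elapsed time. This is purely illustrative; a real implementation would run in the kernel, pin the work to each CPU at maximum frequency, and control for interference. The timings fed in below are made up.]

```python
import time

def busy_work(iterations=2_000_000):
    """Fixed CPU-bound workload; elapsed time is ~ 1/performance."""
    acc = 0
    for i in range(iterations):
        acc += i * i
    return acc

def capacities_from_times(elapsed, scale=1024):
    """elapsed: mapping cpu id -> seconds taken for the same fixed workload.
    A faster CPU (smaller time) gets a higher capacity."""
    fastest = min(elapsed.values())
    return {cpu: round(fastest * scale / t) for cpu, t in elapsed.items()}

# In a real run you would execute busy_work() pinned to each CPU (e.g. via
# os.sched_setaffinity) and record time.perf_counter() deltas. Here we feed
# in hypothetical timings for a two-cluster system:
print(capacities_from_times({0: 0.50, 1: 0.50, 2: 1.15, 3: 1.15}))
# -> {0: 1024, 1: 1024, 2: 445, 3: 445}
```

The attraction of this scheme, as argued in the thread, is that the benchmark lives in the kernel and can be replaced later, unlike numbers baked into DT.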
On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
>> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
>> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
>> > > On 15/12/15 15:32, Mark Rutland wrote:
>> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
>> > > > > My expectation is that we just need good enough, not perfect, and
>> > > > > that seems to match what Juri is saying about the expectation that
>> > > > > most of the fine tuning is done via other knobs.
>> > > >
>> > > > My expectation is that if a ballpark figure is good enough, it
>> > > > should be possible to implement something trivial like the bogomips
>> > > > / loops_per_jiffy calculation.
>> > >
>> > > I didn't really follow that, so I might be wrong here, but hasn't
>> > > there already been a discussion about how we want to stop exposing
>> > > bogomips info, or relying on it for anything but in-kernel delay
>> > > loops?
>> >
>> > I meant that we could have a benchmark of that level of complexity,
>> > rather than those specific values.
>>
>> Or we could simply let user space use whatever benchmarks or hard-coded
>> values it wants and set the capacity via sysfs (during boot). By
>> default, the kernel would assume all CPUs equal.
>
> I assume that a userspace override would be available regardless of
> whatever mechanism the kernel uses to determine relative
> performance/efficiency.

Don't you think that if we give complete latitude to user space to set
whatever it wants, it will be used to abuse the kernel (and the
scheduler in particular), and that this will end in a real mess when
trying to understand why a task is not placed where it should be?
We can probably provide a debug mode to help SoC manufacturers define
their capacity values, but IMHO we should not allow complete latitude in
normal operation. In normal operation we need to provide some methods to
tweak the value, to reflect a memory-bound or integer-heavy workload (or
other kinds of work that currently run on the CPU), but no more.

Vincent

>
> I am not opposed to that mechanism being "assume equal".
>
> Mark.
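[Editor's note: one way Vincent's "tweak, but no more" position could be realized is to clamp any userspace-supplied capacity adjustment to a band around the firmware-provided value, rather than accepting arbitrary values. The sketch below is purely an illustration of that design choice; the 10% band is an arbitrary assumption, not something proposed in the thread.]

```python
def apply_user_tweak(dt_capacity, user_value, max_ratio=0.10):
    """Clamp a userspace-requested capacity to within +/- max_ratio of the
    value described by firmware/DT, instead of granting full latitude.

    The 10% default band is an arbitrary choice for illustration only.
    """
    lo = round(dt_capacity * (1 - max_ratio))
    hi = round(dt_capacity * (1 + max_ratio))
    return max(lo, min(hi, user_value))

print(apply_user_tweak(1024, 100))   # far too low: clamped up to 922
print(apply_user_tweak(1024, 1000))  # within the band: accepted as-is
```

This keeps the scheduler's view anchored to the platform description while still allowing workload-driven adjustment.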
On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
>
> > > Obviously people are going to get upset if we introduce performance
> > > regressions - but that's always true; we can also introduce problems
> > > with numbers people have put in DT. It seems like it'd be harder to
> > > manage regressions due to externally provided magic numbers since
> > > there's inherently less information there.
>
> > It's certainly still possible to have regressions in that case. Those
> > regressions would be due to code changes in the kernel, given the DT
> > didn't change.
>
> > I'm not sure I follow w.r.t. "inherently less information", unless you
> > mean trying to debug without access to that DTB?
>
> If what the kernel knows about the system is that it's got a bunch of
> cores with numbers assigned to them, then all it's really got is those
> numbers. If something changes that causes problems for some systems
> (e.g., because the numbers have been picked poorly but in a way that
> happened to work well with the old code), that's not a lot to go on; the
> more we know about the system, the more likely it is that we'll be able
> to adjust the assumptions in whatever new thing we do that causes
> problems for any particular systems where we run into trouble.

Regardless of where the numbers live (DT or kernel), all we have are
numbers. I can see that changing the in-kernel numbers would be possible
when modifying the DT is not, but I don't see how that gives you more
information.

Mark.
diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
new file mode 100644
index 0000000..2a00af0
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
@@ -0,0 +1,227 @@
+==========================================
+ARM CPUs capacity bindings
+==========================================
+
+==========================================
+1 - Introduction
+==========================================
+
+ARM systems may be configured to have cpus with different power/performance
+characteristics within the same chip. In this case, additional information
+has to be made available to the kernel (the scheduler in particular) for
+it to be aware of such differences and take decisions accordingly.
+
+==========================================
+2 - CPU capacity definition
+==========================================
+
+CPU capacity is a number that provides the scheduler information about CPUs
+heterogeneity. Such heterogeneity can come from micro-architectural
+differences (e.g., ARM big.LITTLE systems) or the maximum frequency at which
+CPUs can run (e.g., SMP systems with multiple frequency domains).
+Heterogeneity in this context is about differing performance characteristics;
+this binding tries to capture a first-order approximation of the relative
+performance of CPUs.
+
+One simple way to estimate CPU capacities is to iteratively run a well-known
+CPU user space benchmark (e.g., sysbench, dhrystone, etc.) on each CPU at
+maximum frequency and then normalize values w.r.t. the best performing CPU.
+One can also do a statistically significant study of a wide collection of
+benchmarks, but the pros of such an approach are not really evident at the
+time of writing.
+
+==========================================
+3 - capacity-scale
+==========================================
+
+CPU capacities are defined with respect to the capacity-scale property in the
+cpus node [1].
+The property is optional; if not defined, a capacity-scale of 1024 is
+assumed. This property defines both the highest CPU capacity present in the
+system and the granularity of CPU capacity values.
+
+==========================================
+4 - capacity
+==========================================
+
+capacity is an optional cpu node [1] property: a u32 value representing CPU
+capacity, relative to capacity-scale. It is required and enforced that
+capacity <= capacity-scale.
+
+===========================================
+5 - Examples
+===========================================
+
+Example 1 (ARM 64-bit, 6-cpu system, two clusters):
+capacity-scale is not defined, so it is assumed to be 1024
+
+cpus {
+    #address-cells = <2>;
+    #size-cells = <0>;
+
+    cpu-map {
+        cluster0 {
+            core0 {
+                cpu = <&A57_0>;
+            };
+            core1 {
+                cpu = <&A57_1>;
+            };
+        };
+
+        cluster1 {
+            core0 {
+                cpu = <&A53_0>;
+            };
+            core1 {
+                cpu = <&A53_1>;
+            };
+            core2 {
+                cpu = <&A53_2>;
+            };
+            core3 {
+                cpu = <&A53_3>;
+            };
+        };
+    };
+
+    idle-states {
+        entry-method = "arm,psci";
+
+        CPU_SLEEP_0: cpu-sleep-0 {
+            compatible = "arm,idle-state";
+            arm,psci-suspend-param = <0x0010000>;
+            local-timer-stop;
+            entry-latency-us = <100>;
+            exit-latency-us = <250>;
+            min-residency-us = <150>;
+        };
+
+        CLUSTER_SLEEP_0: cluster-sleep-0 {
+            compatible = "arm,idle-state";
+            arm,psci-suspend-param = <0x1010000>;
+            local-timer-stop;
+            entry-latency-us = <800>;
+            exit-latency-us = <700>;
+            min-residency-us = <2500>;
+        };
+    };
+
+    A57_0: cpu@0 {
+        compatible = "arm,cortex-a57","arm,armv8";
+        reg = <0x0 0x0>;
+        device_type = "cpu";
+        enable-method = "psci";
+        next-level-cache = <&A57_L2>;
+        clocks = <&scpi_dvfs 0>;
+        cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+        capacity = <1024>;
+    };
+
+    A57_1: cpu@1 {
+        compatible = "arm,cortex-a57","arm,armv8";
+        reg = <0x0 0x1>;
+        device_type = "cpu";
+        enable-method = "psci";
+        next-level-cache = <&A57_L2>;
+        clocks = <&scpi_dvfs 0>;
+        cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+        capacity = <1024>;
+    };
+
+    A53_0: cpu@100 {
+        compatible = "arm,cortex-a53","arm,armv8";
+        reg = <0x0 0x100>;
+        device_type = "cpu";
+        enable-method = "psci";
+        next-level-cache = <&A53_L2>;
+        clocks = <&scpi_dvfs 1>;
+        cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+        capacity = <447>;
+    };
+
+    A53_1: cpu@101 {
+        compatible = "arm,cortex-a53","arm,armv8";
+        reg = <0x0 0x101>;
+        device_type = "cpu";
+        enable-method = "psci";
+        next-level-cache = <&A53_L2>;
+        clocks = <&scpi_dvfs 1>;
+        cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+        capacity = <447>;
+    };
+
+    A53_2: cpu@102 {
+        compatible = "arm,cortex-a53","arm,armv8";
+        reg = <0x0 0x102>;
+        device_type = "cpu";
+        enable-method = "psci";
+        next-level-cache = <&A53_L2>;
+        clocks = <&scpi_dvfs 1>;
+        cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+        capacity = <447>;
+    };
+
+    A53_3: cpu@103 {
+        compatible = "arm,cortex-a53","arm,armv8";
+        reg = <0x0 0x103>;
+        device_type = "cpu";
+        enable-method = "psci";
+        next-level-cache = <&A53_L2>;
+        clocks = <&scpi_dvfs 1>;
+        cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+        capacity = <447>;
+    };
+
+    A57_L2: l2-cache0 {
+        compatible = "cache";
+    };
+
+    A53_L2: l2-cache1 {
+        compatible = "cache";
+    };
+};
+
+Example 2 (ARM 32-bit, 4-cpu system, two clusters,
+           cpus 0,1@1GHz, cpus 2,3@500MHz):
+capacity-scale is equal to 2, so the first cluster is twice as fast as the
+second cluster (which matches the clock frequencies)
+
+cpus {
+    #address-cells = <1>;
+    #size-cells = <0>;
+    capacity-scale = <2>;
+
+    cpu0: cpu@0 {
+        device_type = "cpu";
+        compatible = "arm,cortex-a15";
+        reg = <0>;
+        capacity = <2>;
+    };
+
+    cpu1: cpu@1 {
+        device_type = "cpu";
+        compatible = "arm,cortex-a15";
+        reg = <1>;
+        capacity = <2>;
+    };
+
+    cpu2: cpu@100 {
+        device_type = "cpu";
+        compatible = "arm,cortex-a15";
+        reg = <0x100>;
+        capacity = <1>;
+    };
+
+    cpu3: cpu@101 {
+        device_type = "cpu";
+        compatible = "arm,cortex-a15";
+        reg = <0x101>;
+        capacity = <1>;
+    };
+};
+
+===========================================
+6 - References
+===========================================
+
+[1] ARM Linux Kernel documentation - CPUs bindings
+    Documentation/devicetree/bindings/arm/cpus.txt
diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt
index 91e6e5c..7593584 100644
--- a/Documentation/devicetree/bindings/arm/cpus.txt
+++ b/Documentation/devicetree/bindings/arm/cpus.txt
@@ -62,6 +62,14 @@ nodes to be present and contain the properties described below.
 		Value type: <u32>
 		Definition: must be set to 0
 
+	A cpus node may also define the following optional property:
+
+	- capacity-scale
+		Usage: optional
+		Value type: <u32>
+		Definition: value used as a reference for CPU capacity [4]
+			    (see below).
+
 - cpu node
 
 	Description: Describes a CPU in an ARM based system
@@ -231,6 +239,13 @@ nodes to be present and contain the properties described below.
 			# List of phandles to idle state nodes supported
 			  by this cpu [3].
 
+	- capacity
+		Usage: Optional
+		Value type: <u32>
+		Definition:
+			# u32 value representing CPU capacity [4], relative to
+			  capacity-scale (see above).
+
 	- rockchip,pmu
 		Usage: optional for systems that have an "enable-method"
 		       property value of "rockchip,rk3066-smp"
@@ -437,3 +452,5 @@ cpus {
 [2] arm/msm/qcom,kpss-acc.txt
 [3] ARM Linux kernel documentation - idle states bindings
     Documentation/devicetree/bindings/arm/idle-states.txt
+[4] ARM Linux kernel documentation - cpu capacity bindings
+    Documentation/devicetree/bindings/arm/cpu-capacity.txt
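[Editor's note: as a sanity check of the binding semantics above, an OS parsing these properties might normalize each cpu node's capacity against capacity-scale, defaulting the scale to 1024 and enforcing capacity <= capacity-scale as the binding requires. A sketch follows; the internal 1024 target scale and the "missing capacity means full capacity" default are assumptions about the consumer, not part of the binding.]

```python
DEFAULT_CAPACITY_SCALE = 1024
SCHED_CAPACITY_SCALE = 1024  # assumed internal target scale of the consumer

def parse_capacity(capacity, capacity_scale=None):
    """Turn the DT capacity / capacity-scale pair into an internal value."""
    scale = DEFAULT_CAPACITY_SCALE if capacity_scale is None else capacity_scale
    if capacity is None:
        capacity = scale  # no property: assume a full-capacity CPU
    if not 0 < capacity <= scale:
        raise ValueError("binding requires 0 < capacity <= capacity-scale")
    return capacity * SCHED_CAPACITY_SCALE // scale

# Example 1 above: capacity-scale absent, A57s at 1024 and A53s at 447.
print(parse_capacity(447))                  # -> 447
# Example 2 above: capacity-scale = 2, first cluster twice the second.
print(parse_capacity(2, capacity_scale=2))  # -> 1024
print(parse_capacity(1, capacity_scale=2))  # -> 512
```

Note how Example 2's coarse scale of 2 can only express "half" or "full": capacity-scale really does set the granularity, as section 3 states.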
ARM systems may be configured to have cpus with different power/performance
characteristics within the same chip. In this case, additional information
has to be made available to the kernel (the scheduler in particular) for it
to be aware of such differences and take decisions accordingly.

Therefore, this patch aims at standardizing cpu capacity device tree
bindings for ARM platforms. The bindings define a cpu capacity parameter,
to allow operating systems to retrieve such information from the device
tree and initialize related kernel structures, paving the way for common
code in the kernel to deal with heterogeneity.

Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
Cc: Kumar Gala <galak@codeaurora.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: devicetree@vger.kernel.org
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 .../devicetree/bindings/arm/cpu-capacity.txt       | 227 +++++++++++++++++++++
 Documentation/devicetree/bindings/arm/cpus.txt     |  17 ++
 2 files changed, 244 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt

--
2.2.2