Message ID | 20231212072617.14756-1-lihuisong@huawei.com |
---|---|
State | New |
Series | cpufreq: CPPC: Resolve the large frequency discrepancy from cpuinfo_cur_freq |
On Tue, Dec 12, 2023 at 8:26 AM Huisong Li <lihuisong@huawei.com> wrote:
>
> Many developers found that the cpu current frequency is greater than
> the maximum frequency of the platform, please see [1], [2] and [3].
>
> In scenarios with high memory access pressure, patch [1] has shown
> that the significant latency of cpc_read(), which is used to obtain the
> delivered and reference performance counters, causes an absurd frequency.
> The sampling intervals for these counters are critical and are expected
> to be equal. However, the varying latency of cpc_read() has a direct
> impact on their sampling intervals.
>
> This patch adds an interface, cpc_read_arch_counters_on_cpu, to read the
> delivered and reference performance counters together. According to my
> test [4], the discrepancy of the cpu current frequency under high memory
> access pressure generated by stress-ng is about 0.2% or lower.
>
> [1] https://lore.kernel.org/all/20231025093847.3740104-4-zengheng4@huawei.com/
> [2] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
> [3] https://lore.kernel.org/all/20230418113459.12860-7-sumitg@nvidia.com/
>
> [4] My local test:
> The test platform enables SMT and has 128 logical CPUs in total, and
> the CPU base frequency is 2.7GHz. Reading "cpuinfo_cur_freq" for each
> physical core on the platform under high memory access pressure from
> stress-ng gives the following output:
>   0: 2699133    2: 2699942    4: 2698189    6: 2704347
>   8: 2704009   10: 2696277   12: 2702016   14: 2701388
>  16: 2700358   18: 2696741   20: 2700091   22: 2700122
>  24: 2701713   26: 2702025   28: 2699816   30: 2700121
>  32: 2700000   34: 2699788   36: 2698884   38: 2699109
>  40: 2704494   42: 2698350   44: 2699997   46: 2701023
>  48: 2703448   50: 2699501   52: 2700000   54: 2699999
>  56: 2702645   58: 2696923   60: 2697718   62: 2700547
>  64: 2700313   66: 2700000   68: 2699904   70: 2699259
>  72: 2699511   74: 2700644   76: 2702201   78: 2700000
>  80: 2700776   82: 2700364   84: 2702674   86: 2700255
>  88: 2699886   90: 2700359   92: 2699662   94: 2696188
>  96: 2705454   98: 2699260  100: 2701097  102: 2699630
> 104: 2700463  106: 2698408  108: 2697766  110: 2701181
> 112: 2699166  114: 2701804  116: 2701907  118: 2701973
> 120: 2699584  122: 2700474  124: 2700768  126: 2701963
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>

First off, please Cc ACPI-related patches to linux-acpi.

> ---
>  arch/arm64/kernel/topology.c | 43 ++++++++++++++++++++++++++++++++++--
>  drivers/acpi/cppc_acpi.c     | 22 +++++++++++++++---
>  include/acpi/cppc_acpi.h     |  5 +++++
>  3 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 7d37e458e2f5..c3122154d738 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -299,6 +299,11 @@ core_initcall(init_amu_fie);
>  #ifdef CONFIG_ACPI_CPPC_LIB
>  #include <acpi/cppc_acpi.h>
>
> +struct amu_counters {
> +	u64 corecnt;
> +	u64 constcnt;
> +};
> +
>  static void cpu_read_corecnt(void *val)
>  {
>  	/*
> @@ -322,8 +327,27 @@ static void cpu_read_constcnt(void *val)
>  		      0UL : read_constcnt();
>  }
>
> +static void cpu_read_amu_counters(void *data)
> +{
> +	struct amu_counters *cnt = (struct amu_counters *)data;
> +
> +	/*
> +	 * The running time of the this_cpu_has_cap() might have a couple of
> +	 * microseconds and is significantly increased to tens of microseconds.
> +	 * But AMU core and constant counter need to be read together without any
> +	 * time interval to reduce the calculation discrepancy using these counters.
> +	 */
> +	if (this_cpu_has_cap(ARM64_WORKAROUND_2457168)) {
> +		cnt->corecnt = read_corecnt();

This statement is present in both branches, so can it be moved before the if ()?

> +		cnt->constcnt = 0;
> +	} else {
> +		cnt->corecnt = read_corecnt();
> +		cnt->constcnt = read_constcnt();
> +	}
> +}
> +
>  static inline
> -int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
> +int counters_read_on_cpu(int cpu, smp_call_func_t func, void *data)
>  {
>  	/*
>  	 * Abort call on counterless CPU or when interrupts are
> @@ -335,7 +359,7 @@ int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
>  	if (WARN_ON_ONCE(irqs_disabled()))
>  		return -EPERM;
>
> -	smp_call_function_single(cpu, func, val, 1);
> +	smp_call_function_single(cpu, func, data, 1);
>
>  	return 0;
>  }
> @@ -364,6 +388,21 @@ bool cpc_ffh_supported(void)
>  	return true;
>  }
>
> +int cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference)
> +{
> +	struct amu_counters cnts = {0};
> +	int ret;
> +
> +	ret = counters_read_on_cpu(cpu, cpu_read_amu_counters, &cnts);
> +	if (ret)
> +		return ret;
> +
> +	*delivered = cnts.corecnt;
> +	*reference = cnts.constcnt;
> +
> +	return 0;
> +}
> +
>  int cpc_read_ffh(int cpu, struct cpc_reg *reg, u64 *val)
>  {
>  	int ret = -EOPNOTSUPP;
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index 7ff269a78c20..f303fabd7cfe 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -1299,6 +1299,11 @@ bool cppc_perf_ctrs_in_pcc(void)
>  }
>  EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
>
> +int __weak cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference)
> +{
> +	return 0;
> +}
> +
>  /**
>   * cppc_get_perf_ctrs - Read a CPU's performance feedback counters.
>   * @cpunum: CPU from which to read counters.
> @@ -1313,7 +1318,8 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  		*ref_perf_reg, *ctr_wrap_reg;
>  	int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
>  	struct cppc_pcc_data *pcc_ss_data = NULL;
> -	u64 delivered, reference, ref_perf, ctr_wrap_time;
> +	u64 delivered = 0, reference = 0;
> +	u64 ref_perf, ctr_wrap_time;
>  	int ret = 0, regs_in_pcc = 0;
>
>  	if (!cpc_desc) {
> @@ -1350,8 +1356,18 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  		}
>  	}
>
> -	cpc_read(cpunum, delivered_reg, &delivered);
> -	cpc_read(cpunum, reference_reg, &reference);
> +	if (cpc_ffh_supported()) {
> +		ret = cpc_read_arch_counters_on_cpu(cpunum, &delivered, &reference);
> +		if (ret) {
> +			pr_debug("read arch counters failed, ret=%d.\n", ret);
> +			ret = 0;
> +		}
> +	}

The above is surely not applicable to every platform using CPPC.

Also it looks like in the ARM64_WORKAROUND_2457168 enabled case it is just
pointless overhead, because "reference" is always going to be 0 here then.

Please clean that up.

> +	if (!delivered || !reference) {
> +		cpc_read(cpunum, delivered_reg, &delivered);
> +		cpc_read(cpunum, reference_reg, &reference);
> +	}
> +
>  	cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
>  	/*
> diff --git a/include/acpi/cppc_acpi.h b/include/acpi/cppc_acpi.h
> index 6126c977ece0..07d4fd82d499 100644
> --- a/include/acpi/cppc_acpi.h
> +++ b/include/acpi/cppc_acpi.h
> @@ -152,6 +152,7 @@ extern bool cpc_ffh_supported(void);
>  extern bool cpc_supported_by_cpu(void);
>  extern int cpc_read_ffh(int cpunum, struct cpc_reg *reg, u64 *val);
>  extern int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val);
> +extern int cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference);
>  extern int cppc_get_epp_perf(int cpunum, u64 *epp_perf);
>  extern int cppc_set_epp_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls, bool enable);
>  extern int cppc_get_auto_sel_caps(int cpunum, struct cppc_perf_caps *perf_caps);
> @@ -209,6 +210,10 @@ static inline int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val)
>  {
>  	return -ENOTSUPP;
>  }
> +static inline int cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference)
> +{
> +	return -EOPNOTSUPP;
> +}
>  static inline int cppc_set_epp_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls, bool enable)
>  {
>  	return -ENOTSUPP;
> --
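
The first review comment above asks whether the read that is common to both branches can be hoisted. A minimal user-space sketch of that shape follows, assuming the erratum check is evaluated before either counter is read so that it adds no delay between the two reads. The stub functions, their return values and the has_erratum parameter are hypothetical stand-ins for the kernel's read_corecnt(), read_constcnt() and this_cpu_has_cap(ARM64_WORKAROUND_2457168); this is not code posted in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stubs standing in for the kernel's AMU counter reads. */
static uint64_t read_corecnt(void)  { return 2700; }
static uint64_t read_constcnt(void) { return 1000; }

struct amu_counters {
	uint64_t corecnt;
	uint64_t constcnt;
};

/*
 * has_erratum is computed by the caller before this function runs, so
 * the potentially slow capability check does not sit between the two
 * counter reads. The core counter is read once, outside any branch.
 */
static void read_amu_counters(struct amu_counters *cnt, bool has_erratum)
{
	assert(cnt != NULL);

	cnt->corecnt = read_corecnt();
	cnt->constcnt = has_erratum ? 0 : read_constcnt();
}
```

With has_erratum set to false both fields are filled from the stubs; with true, constcnt is forced to 0, which matches the behaviour of the posted patch for the workaround case.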
Hi Rafael,

Thanks for your review.😁

On 2023/12/15 3:31, Rafael J. Wysocki wrote:
> On Tue, Dec 12, 2023 at 8:26 AM Huisong Li <lihuisong@huawei.com> wrote:
>> [...]
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> First off, please Cc ACPI-related patches to linux-acpi.

Got it.
+linux-acpi@vger.kernel.org

>> [...]
>> +static void cpu_read_amu_counters(void *data)
>> +{
>> [...]
>> +	if (this_cpu_has_cap(ARM64_WORKAROUND_2457168)) {
>> +		cnt->corecnt = read_corecnt();
> This statement is present in both branches, so can it be moved before the if ()?

Yes. Do you mean adding a blank line before if()?

>> +		cnt->constcnt = 0;
>> +	} else {
>> +		cnt->corecnt = read_corecnt();
>> +		cnt->constcnt = read_constcnt();
>> +	}
>> +}
>> [...]
>> +	if (cpc_ffh_supported()) {
>> +		ret = cpc_read_arch_counters_on_cpu(cpunum, &delivered, &reference);
>> +		if (ret) {
>> +			pr_debug("read arch counters failed, ret=%d.\n", ret);
>> +			ret = 0;
>> +		}
>> +	}
> The above is surely not applicable to every platform using CPPC. Also

cpc_ffh_supported() is intended to let only platforms that support FFH
take this path. cpc_read_arch_counters_on_cpu() also needs to be
implemented by each platform according to its requirements. This patch
only implements the interface for arm64.

> it looks like in the ARM64_WORKAROUND_2457168 enabled case it is just
> pointless overhead, because "reference" is always going to be 0 here
> then.

Right, it is always going to be 0 here when ARM64_WORKAROUND_2457168 is
enabled. But ARM64_WORKAROUND_2457168 is an ARM-specific macro. It does
not seem appropriate for this macro to appear in code that is common to
all platforms, right?

> Please clean that up.
>
>> [...]
Hi,

On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
> Many developers found that the cpu current frequency is greater than
> the maximum frequency of the platform, please see [1], [2] and [3].
>
> In scenarios with high memory access pressure, patch [1] has shown
> that the significant latency of cpc_read(), which is used to obtain the
> delivered and reference performance counters, causes an absurd frequency.
> The sampling intervals for these counters are critical and are expected
> to be equal. However, the varying latency of cpc_read() has a direct
> impact on their sampling intervals.
>

Would this [1] alternative solution work for you?

[1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/

Thanks,
Ionela.

> [...]
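
The sampling-interval argument quoted above can be made concrete with a small calculation. The sketch below assumes the usual ratio form, in which the frequency estimate is the reference rate multiplied by the delivered delta and divided by the reference delta. The function names, the 100 MHz reference rate and the window lengths are hypothetical values chosen for illustration, not measurements from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Counter ticks accumulated over interval_ns by a counter running at rate_khz. */
static uint64_t ticks(uint64_t rate_khz, uint64_t interval_ns)
{
	return rate_khz * interval_ns / 1000000;
}

/*
 * Frequency estimate in kHz from the two counter deltas. The result is
 * only meaningful if both deltas cover the same time window.
 */
static uint64_t freq_from_deltas(uint64_t ref_khz, uint64_t delta_delivered,
				 uint64_t delta_reference)
{
	assert(delta_reference != 0);

	return ref_khz * delta_delivered / delta_reference;
}
```

For a core running at 2700000 kHz with a 100000 kHz reference counter, equal 2000 ns windows give deltas of 5400 and 200 and an estimate of 2700000 kHz. If read latency shortens only the reference window to 1500 ns, the reference delta drops to 150 and the estimate becomes 3600000 kHz, well above the real frequency. Reading both counters back to back is meant to keep the two windows equal.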
On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>
>On 2024/1/4 1:53, Ionela Voinescu wrote:
>>Hi,
>>
>>On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>[...]
>>>
>>Would this [1] alternative solution work for you?
>It would work for me AFAICS, because "arch_freq_scale" is also derived
>from the AMU core and constant counters, which are read together.
>But, from their discussion thread, it seems that there are some tricky
>points to clarify or consider.

I think the changes in [1] would work better when CPUs may be idle. With this
patch we would have to wake any core that is in idle state to read the AMU
counters. Worst case, if core 0 is trying to read the CPU frequency of all
cores, it may need to wake up all the other cores to read the AMU counters.
For systems with 128 cores or more, this could be very expensive and happen
very frequently.

AFAICS, the approach in [1] would avoid this cost.

Thanks,
Vanshi

>>
>>[1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
>>
>>Thanks,
>>Ionela.
>>
>>>[...]
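
The arch_freq_scale based alternative discussed above can be pictured with a small calculation. This is a sketch under the assumption that the scale factor is a fixed-point value in which 1024 (the kernel's SCHED_CAPACITY_SCALE) means the CPU ran at its maximum frequency over the last update period. The function name, the local CAPACITY_SHIFT macro and the sample values are hypothetical, and the exact formula used by the series in [1] is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10	/* mirrors the kernel's SCHED_CAPACITY_SHIFT */

/*
 * Average frequency in kHz implied by a frequency scale factor.
 * No cross-CPU call is needed because the scale factor is a value the
 * target CPU already published at its last tick.
 */
static uint64_t avg_freq_khz(uint64_t max_freq_khz, uint64_t freq_scale)
{
	/* Guard the multiplication against overflow. */
	assert(freq_scale == 0 || max_freq_khz <= UINT64_MAX / freq_scale);

	return (max_freq_khz * freq_scale) >> CAPACITY_SHIFT;
}
```

With a maximum of 2700000 kHz, a scale of 1024 yields 2700000 kHz and a scale of 512 yields 1350000 kHz. The value reflects the last update period rather than the instant of the query.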
Hi Vanshi,

On 2024/1/5 8:48, Vanshidhar Konda wrote:
> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>
>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>> Hi,
>>>
>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>> Many developers found that the cpu current frequency is greater than
>>>> the maximum frequency of the platform, please see [1], [2] and [3].
>>>>
>>>> In the scenarios with high memory access pressure, the patch [1] has
>>>> proved the significant latency of cpc_read() which is used to obtain
>>>> delivered and reference performance counter cause an absurd frequency.
>>>> The sampling interval for this counters is very critical and is expected
>>>> to be equal. However, the different latency of cpc_read() has a direct
>>>> impact on their sampling interval.
>>>>
>>> Would this [1] alternative solution work for you?
>> It would work for me AFAICS.
>> Because the "arch_freq_scale" is also from AMU core and constant
>> counter, and read together.
>> But, from their discuss line, it seems that there are some tricky
>> points to clarify or consider.
>
> I think the changes in [1] would work better when CPUs may be idle. With this
> patch we would have to wake any core that is in idle state to read the AMU
> counters. Worst case, if core 0 is trying to read the CPU frequency of all
> cores, it may need to wake up all the other cores to read the AMU counters.

With the approach in [1], if all CPUs (one or more cores) under one policy
are idle, the CPU frequency still cannot be obtained, right?
In that case, the API in [1] will return 0 and we have to fall back to
calling cpufreq_driver->get() for cpuinfo_cur_freq.
Then we still face the issue this patch addresses.

> For systems with 128 cores or more, this could be very expensive and happen
> very frequently.
>
> AFAICS, the approach in [1] would avoid this cost.

But that CPU frequency is just an average over the last tick period, not
the frequency the CPU is actually running at now.
In addition, using 'arch_freq_scale' in this approach is subject to some
conditions.
So I'm not sure this approach can entirely cover the frequency
discrepancy issue.

/Huisong

>>>
>>> [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
>>>
>>> Thanks,
>>> Ionela.
>>>
>>>> This patch adds a interface, cpc_read_arch_counters_on_cpu, to read
>>>> delivered and reference performance counter together. According to my
>>>> test[4], the discrepancy of cpu current frequency in the scenarios with
>>>> high memory access pressure is lower than 0.2% by stress-ng application.
>>>>
>>>> [1] https://lore.kernel.org/all/20231025093847.3740104-4-zengheng4@huawei.com/
>>>> [2] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
>>>> [3] https://lore.kernel.org/all/20230418113459.12860-7-sumitg@nvidia.com/
>>>>
>>>> [4] My local test:
>>>> The testing platform enable SMT and include 128 logical CPU in total,
>>>> and CPU base frequency is 2.7GHz.
>>>> [snip]
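For readers following the sampling-interval argument quoted above, here is a minimal user-space sketch of why unequal read latency distorts the result. It is not kernel code: `struct ctr_sample`, `est_freq_khz()` and `sample_at()` are hypothetical names, and the 100 MHz reference clock and 2 us window used in the note below are illustrative assumptions. The formula is the usual ratio of counter deltas scaled by the reference frequency.

```c
#include <stdint.h>

/* One snapshot of the two feedback counters (illustrative, not a kernel type). */
struct ctr_sample {
	uint64_t delivered;	/* core cycles: advances at the actual CPU frequency */
	uint64_t reference;	/* constant cycles: advances at a fixed reference frequency */
};

/*
 * Estimate the frequency in kHz from two snapshots:
 *     freq = ref_khz * delta_delivered / delta_reference
 * Returns 0 if the reference counter did not advance.
 */
static uint64_t est_freq_khz(struct ctr_sample t0, struct ctr_sample t1,
			     uint64_t ref_khz)
{
	uint64_t dd = t1.delivered - t0.delivered;
	uint64_t dr = t1.reference - t0.reference;

	if (dr == 0)
		return 0;
	return ref_khz * dd / dr;
}

/*
 * Model a snapshot in which the delivered counter was read at td_ns and the
 * reference counter at tr_ns. With separate reads the two times differ by
 * the read latency; with a combined read they are (nearly) equal.
 * cycles = kHz * ns / 1e6
 */
static struct ctr_sample sample_at(uint64_t td_ns, uint64_t tr_ns,
				   uint64_t cpu_khz, uint64_t ref_khz)
{
	struct ctr_sample s;

	s.delivered = cpu_khz * td_ns / 1000000;
	s.reference = ref_khz * tr_ns / 1000000;
	return s;
}
```

Under these assumptions, a 2.7 GHz core sampled over a 2000 ns window with both counters read at the same instant yields 2700000 kHz. If the first reference read is delayed by 1000 ns while the second is not, the reference window shrinks to half of the delivered window and the estimate doubles to 5400000 kHz, which is the kind of above-maximum reading the thread describes.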
Hi,

On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
> Hi Vanshi,
>
> On 2024/1/5 8:48, Vanshidhar Konda wrote:
> > On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
> > >
> > > On 2024/1/4 1:53, Ionela Voinescu wrote:
> > > > Hi,
> > > >
> > > > On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
> > > > > Many developers found that the cpu current frequency is greater than
> > > > > the maximum frequency of the platform, please see [1], [2] and [3].
> > > > >
> > > > > In the scenarios with high memory access pressure, the patch [1] has
> > > > > proved the significant latency of cpc_read() which is used to obtain
> > > > > delivered and reference performance counter cause an absurd frequency.
> > > > > The sampling interval for this counters is very critical and is expected
> > > > > to be equal. However, the different latency of cpc_read() has a direct
> > > > > impact on their sampling interval.
> > > > >
> > > > Would this [1] alternative solution work for you?
> > > It would work for me AFAICS.
> > > Because the "arch_freq_scale" is also from AMU core and constant
> > > counter, and read together.
> > > But, from their discuss line, it seems that there are some tricky
> > > points to clarify or consider.
> >
> > I think the changes in [1] would work better when CPUs may be idle. With this
> > patch we would have to wake any core that is in idle state to read the AMU
> > counters. Worst case, if core 0 is trying to read the CPU frequency of all
> > cores, it may need to wake up all the other cores to read the AMU counters.
> From the approach in [1], if all CPUs (one or more cores) under one policy
> are idle, they still cannot be obtained the CPU frequency, right?
> In this case, the [1] API will return 0 and have to back to call
> cpufreq_driver->get() for cpuinfo_cur_freq.
> Then we still need to face the issue this patch mentioned.

With the implementation at [1], arch_freq_get_on_cpu() will not return 0
for idle CPUs and the get() callback will not be called to wake up the
CPUs.

Worst case, arch_freq_get_on_cpu() will return a frequency based on the
AMU counter values obtained on the last tick on that CPU. But if that CPU
is not a housekeeping CPU, a housekeeping CPU in the same policy will be
selected, as it would have had a more recent tick, and therefore a more
recent frequency value for the domain.

I understand that the frequency returned here will not be up to date,
but there's no proper frequency feedback for an idle CPU. If one only
wakes up a CPU to sample counters, before the CPU goes back to sleep,
the obtained frequency feedback is meaningless.

> > For systems with 128 cores or more, this could be very expensive and happen
> > very frequently.
> >
> > AFAICS, the approach in [1] would avoid this cost.
> But the CPU frequency is just an average value for the last tick period
> instead of the current one the CPU actually runs at.
> In addition, there are some conditions to use 'arch_freq_scale' in this
> approach.

What are the conditions you are referring to?

> So I'm not sure if this approach can entirely cover the frequency
> discrepancy issue.

Unfortunately there is no perfect frequency feedback. By the time you
observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency
of the CPU might have already changed. Therefore, an average value might
be a better indication of the recent performance level of a CPU.

Would you be able to test [1] on your platform and use case?

Many thanks,
Ionela.

>
> /Huisong
> > > >
> > > > [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
> > > >
> > > > Thanks,
> > > > Ionela.
> > > >
> > > > > This patch adds a interface, cpc_read_arch_counters_on_cpu, to read
> > > > > delivered and reference performance counter together.
> > > > > [snip]
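The housekeeping-CPU selection described in the reply above can be pictured with the following user-space sketch. It is not the code from [1]: `struct cpu_state`, `freq_from_scale()` and `CAPACITY_SHIFT` are hypothetical names, and the 0..1024 fixed-point scale factor is an assumption modeled on how a per-CPU frequency scale is commonly applied to a maximum frequency. Returning 0 when no ticking CPU is found is likewise an assumption about how a caller would be told to use another source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed fixed-point shift: a scale factor of 1024 means "running at max". */
#define CAPACITY_SHIFT 10

/* Hypothetical per-CPU state, not a kernel structure. */
struct cpu_state {
	bool housekeeping;		/* CPU receives the periodic tick */
	unsigned long freq_scale;	/* scale factor from the last tick, 0..1024 */
};

/*
 * Return a frequency estimate in kHz for `cpu`. If `cpu` itself does not
 * tick, borrow the scale factor of the first housekeeping CPU found in the
 * same policy. Returns 0 when the policy has no housekeeping CPU, so the
 * caller can fall back to another source.
 */
static unsigned int freq_from_scale(const struct cpu_state *cpus,
				    const int *policy_cpus, size_t n,
				    int cpu, unsigned int max_khz)
{
	int src = -1;
	size_t i;

	if (cpus[cpu].housekeeping) {
		src = cpu;
	} else {
		for (i = 0; i < n; i++) {
			if (cpus[policy_cpus[i]].housekeeping) {
				src = policy_cpus[i];
				break;
			}
		}
	}

	if (src < 0)
		return 0;

	/* freq = scale * max_freq / 1024 */
	return (unsigned int)(((unsigned long long)cpus[src].freq_scale * max_khz)
			      >> CAPACITY_SHIFT);
}
```

With one policy per CPU, a non-housekeeping CPU has no sibling to borrow from, so in this sketch it always takes the fallback path.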
Hi Ionela,

On 2024/1/8 22:03, Ionela Voinescu wrote:
> Hi,
>
> On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>> Hi Vanshi,
>>
>> On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>> Hi,
>>>>>
>>>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>> Many developers found that the cpu current frequency is greater than
>>>>>> the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>
>>>>>> In the scenarios with high memory access pressure, the patch [1] has
>>>>>> proved the significant latency of cpc_read() which is used to obtain
>>>>>> delivered and reference performance counter cause an absurd frequency.
>>>>>> The sampling interval for this counters is very critical and is expected
>>>>>> to be equal. However, the different latency of cpc_read() has a direct
>>>>>> impact on their sampling interval.
>>>>>>
>>>>> Would this [1] alternative solution work for you?
>>>> It would work for me AFAICS.
>>>> Because the "arch_freq_scale" is also from AMU core and constant
>>>> counter, and read together.
>>>> But, from their discuss line, it seems that there are some tricky
>>>> points to clarify or consider.
>>> I think the changes in [1] would work better when CPUs may be idle. With this
>>> patch we would have to wake any core that is in idle state to read the AMU
>>> counters. Worst case, if core 0 is trying to read the CPU frequency of all
>>> cores, it may need to wake up all the other cores to read the AMU counters.
>> From the approach in [1], if all CPUs (one or more cores) under one policy
>> are idle, they still cannot be obtained the CPU frequency, right?
>> In this case, the [1] API will return 0 and have to back to call
>> cpufreq_driver->get() for cpuinfo_cur_freq.
>> Then we still need to face the issue this patch mentioned.
> With the implementation at [1], arch_freq_get_on_cpu() will not return 0
> for idle CPUs and the get() callback will not be called to wake up the
> CPUs.

Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs.
However, for non-housekeeping CPUs, it will return 0 and the get()
callback has to be called, right?

>
> Worst case, arch_freq_get_on_cpu() will return a frequency based on the
> AMU counter values obtained on the last tick on that CPU. But if that CPU
> is not a housekeeping CPU, a housekeeping CPU in the same policy will be
> selected, as it would have had a more recent tick, and therefore a more
> recent frequency value for the domain.

But this frequency comes from the last tick, which may have been a long
time ago, with 'arch_freq_scale' not updated since then for reasons such
as the CPU being idle.
In addition, I'm not sure whether amu_scale_freq_tick() could be delayed
under high stress.
That would also affect the accuracy of the CPU frequency we query.

>
> I understand that the frequency returned here will not be up to date,
> but there's no proper frequency feedback for an idle CPU. If one only
> wakes up a CPU to sample counters, before the CPU goes back to sleep,
> the obtained frequency feedback is meaningless.
>
>>> For systems with 128 cores or more, this could be very expensive and happen
>>> very frequently.
>>>
>>> AFAICS, the approach in [1] would avoid this cost.
>> But the CPU frequency is just an average value for the last tick period
>> instead of the current one the CPU actually runs at.
>> In addition, there are some conditions to use 'arch_freq_scale' in this
>> approach.
> What are the conditions you are referring to?

It depends on the housekeeping CPUs.

>
>> So I'm not sure if this approach can entirely cover the frequency
>> discrepancy issue.
> Unfortunately there is no perfect frequency feedback. By the time you
> observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency
> of the CPU might have already changed. Therefore, an average value might
> be a better indication of the recent performance level of a CPU.

An average value for the CPU frequency is OK. It would be better if it
came without any delay.

The original implementation of cpuinfo_cur_freq better reflects its
meaning in the user guide [1], which says:
"cpuinfo_cur_freq : Current frequency of the CPU as obtained from the
hardware, in KHz.
This is the frequency the CPU actually runs at."

[1] https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt

>
> Would you be able to test [1] on your platform and usecase?

I have tested it on my platform (CPU number: 64, SMT: off, CPU base
frequency: 2.7GHz).
According to the test results:
1> Patches [1] and [2] cannot cover the non-housekeeping CPUs. These
   still face the large frequency discrepancy issue my patch describes.
2> Additionally, the frequency values of all CPUs are almost the same
   when using the 'arch_freq_scale' factor. I'm not sure if that is OK.

The patch [1] has been modified slightly as below:
-->
@@ -1756,7 +1756,10 @@ static unsigned int cpufreq_verify_current_freq(struct cpufreq_policy *policy, b
 {
        unsigned int new_freq;

-       new_freq = cpufreq_driver->get(policy->cpu);
+       new_freq = arch_freq_get_on_cpu(policy->cpu);
+       if (!new_freq)
+               new_freq = cpufreq_driver->get(policy->cpu);
+
        if (!new_freq)
                return 0;

And the result is as follows:

*Case 1: nohz_full not set, cpufreq using the performance governor*

*--> Step 1:* read 'cpuinfo_cur_freq' with no pressure
  0: 2699264     2: 2699264     4: 2699264     6: 2699264
  8: 2696628    10: 2696628    12: 2696628    14: 2699264
 16: 2699264    18: 2696628    20: 2699264    22: 2696628
 24: 2699264    26: 2696628    28: 2699264    30: 2696628
 32: 2696628    34: 2696628    36: 2696628    38: 2696628
 40: 2699264    42: 2699264    44: 2696628    46: 2696628
 48: 2696628    50: 2699264    52: 2699264    54: 2696628
 56: 2696628    58: 2696628    60: 2696628    62: 2696628
 64: 2696628    66: 2699264    68: 2696628    70: 2696628
 72: 2699264    74: 2696628    76: 2696628    78: 2699264
 80: 2696628    82: 2696628    84: 2699264    86: 2696628
 88: 2696628    90: 2696628    92: 2696628    94: 2699264
 96: 2696628    98: 2699264   100: 2699264   102: 2696628
104: 2699264   106: 2699264   108: 2699264   110: 2696628
112: 2699264   114: 2699264   116: 2699264   118: 2699264
120: 2696628   122: 2699264   124: 2696628   126: 2699264

Note: the frequencies of all CPUs are almost the same.

*--> Step 2:* read 'cpuinfo_cur_freq' under high memory access pressure
  0: 2696628     2: 2696628     4: 2696628     6: 2696628
  8: 2696628    10: 2696628    12: 2696628    14: 2696628
 16: 2696628    18: 2696628    20: 2696628    22: 2696628
 24: 2696628    26: 2696628    28: 2696628    30: 2696628
 32: 2696628    34: 2696628    36: 2696628    38: 2696628
 40: 2696628    42: 2696628    44: 2696628    46: 2696628
 48: 2696628    50: 2696628    52: 2696628    54: 2696628
 56: 2696628    58: 2696628    60: 2696628    62: 2696628
 64: 2696628    66: 2696628    68: 2696628    70: 2696628
 72: 2696628    74: 2696628    76: 2696628    78: 2696628
 80: 2696628    82: 2696628    84: 2696628    86: 2696628
 88: 2696628    90: 2696628    92: 2696628    94: 2696628
 96: 2696628    98: 2696628   100: 2696628   102: 2696628
104: 2696628   106: 2696628   108: 2696628   110: 2696628
112: 2696628   114: 2696628   116: 2696628   118: 2696628
120: 2696628   122: 2696628   124: 2696628   126: 2696628

*Case 2: nohz_full set, cpufreq using the ondemand governor*

/proc/cmdline contains "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50".

*--> Step 1:* set the ondemand governor on all policies and query
'cpuinfo_cur_freq' with no pressure.
The frequency of all CPUs is about 400MHz.

*--> Step 2:* read 'cpuinfo_cur_freq' under high memory access pressure.
The high memory access pressure comes from the command:
"stress-ng -c 64 --cpu-load 100% --taskset 0-63"
The result:
 0: 2696628    1:  400000    2:  400000    3:  400909
 4:  400000    5:  400000    6:  400000    7:  400000
 8:  400000    9:  400000   10:  400600   11: 2696628
12: 2696628   13: 2696628   14: 2696628   15: 2696628
16: 2696628   17: 2696628   18: 2696628   19: 2696628
20: 2696628   21: 2696628   22: 2696628   23: 2696628
24: 2696628   25: 2696628   26: 2696628   27: 2696628
28: 2696628   29: 2696628   30: 2696628   31: 2696628
32: 2696628   33: 2696628   34: 2696628   35: 2696628
36: 2696628   37: 2696628   38: 2696628   39: 2696628
40: 2696628   41:  400000   42:  400000   43:  400000
44:  400000   45:  398847   46:  400000   47:  400000
48:  400000   49:  400000   50:  400000   51: 2696628
52: 2696628   53: 2696628   54: 2696628   55: 2696628
56: 2696628   57: 2696628   58: 2696628   59: 2696628
60: 2696628   61: 2696628   62: 2696628   63: 2699264

Note:
(1) CPUs 1-10 and 41-50 run at the lowest frequency.
    It turns out that nohz_full was already in effect.
    I guess that stress-ng cannot use the CPUs in the nohz_full range,
    because the CPU frequency does increase to 2.7G when another
    application is bound to these CPUs.
(2) According to ftrace, the frequency of the nohz_full cores is
    calculated by the get() callback.

[1] https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/
[2] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/

>
> Many thanks,
> Ionela.
>
>> /Huisong
>>
>>>>> [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
>>>>>
>>>>> Thanks,
>>>>> Ionela.
>>>>>
>>>>>> This patch adds a interface, cpc_read_arch_counters_on_cpu, to read
>>>>>> delivered and reference performance counter together. According to my
>>>>>> test[4], the discrepancy of cpu current frequency in the scenarios with
>>>>>> high memory access pressure is lower than 0.2% by stress-ng application.
>>>>>> [snip]
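To put the readings reported above in perspective against the 2.7GHz base frequency, a small helper can express each reading as a deviation in parts per million. This is only an arithmetic aid for reading the tables; `deviation_ppm()` is a hypothetical name and treating 2700000 kHz as the nominal value is an assumption taken from the stated base frequency.

```c
#include <stdint.h>

/*
 * Absolute deviation of a frequency reading from a nominal frequency,
 * in parts per million (integer division, rounds down).
 * nominal_khz must be non-zero.
 */
static uint64_t deviation_ppm(uint64_t reading_khz, uint64_t nominal_khz)
{
	uint64_t diff = reading_khz > nominal_khz ?
			reading_khz - nominal_khz :
			nominal_khz - reading_khz;

	return diff * 1000000 / nominal_khz;
}
```

For the two values that dominate the tables, 2696628 kHz is 1248 ppm (about 0.125%) and 2699264 kHz is 272 ppm (about 0.027%) away from 2700000 kHz.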
On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>Hi Ionela,
>
>On 2024/1/8 22:03, Ionela Voinescu wrote:
>>Hi,
>>
>>On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>Hi Vanshi,
>>>
>>>On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>Hi,
>>>>>>
>>>>>>On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>Many developers found that the cpu current frequency is greater than
>>>>>>>the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>>
>>>>>>>In the scenarios with high memory access pressure, the patch [1] has
>>>>>>>proved the significant latency of cpc_read() which is used to obtain
>>>>>>>delivered and reference performance counter cause an absurd frequency.
>>>>>>>The sampling interval for this counters is very critical and is expected
>>>>>>>to be equal. However, the different latency of cpc_read() has a direct
>>>>>>>impact on their sampling interval.
>>>>>>>
>>>>>>Would this [1] alternative solution work for you?
>>>>>It would work for me AFAICS.
>>>>>Because the "arch_freq_scale" is also from AMU core and constant
>>>>>counter, and read together.
>>>>>But, from their discuss line, it seems that there are some tricky
>>>>>points to clarify or consider.
>>>>I think the changes in [1] would work better when CPUs may be idle. With
>>>>this patch we would have to wake any core that is in idle state to read
>>>>the AMU counters. Worst case, if core 0 is trying to read the CPU
>>>>frequency of all cores, it may need to wake up all the other cores to
>>>>read the AMU counters.
>>> From the approach in [1], if all CPUs (one or more cores) under one policy
>>>are idle, they still cannot be obtained the CPU frequency, right?
>>>In this case, the [1] API will return 0 and have to back to call
>>>cpufreq_driver->get() for cpuinfo_cur_freq.
>>>Then we still need to face the issue this patch mentioned. >>With the implementation at [1], arch_freq_get_on_cpu() will not return 0 >>for idle CPUs and the get() callback will not be called to wake up the >>CPUs. >Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >However, for no-housekeeping CPUs, it will return 0 and have to call >get() callback, right? >> >>Worst case, arch_freq_get_on_cpu() will return a frequency based on the >>AMU counter values obtained on the last tick on that CPU. But if that CPU >>is not a housekeeping CPU, a housekeeping CPU in the same policy will be >>selected, as it would have had a more recent tick, and therefore a more >>recent frequency value for the domain. >But this frequency is from the last tick, >this last tick is probably a long time ago and it doesn't update >'arch_freq_scale' for some reasons like CPU dile. >In addition, I'm not sure if there is possible that >amu_scale_freq_tick() is executed delayed under high stress case. >It also have an impact on the accuracy of the cpu frequency we query. >> >>I understand that the frequency returned here will not be up to date, >>but there's no proper frequency feedback for an idle CPU. If one only >>wakes up a CPU to sample counters, before the CPU goes back to sleep, >>the obtained frequency feedback is meaningless. >> >>>>For systems with 128 cores or more, this could be very expensive and >>>>happen >>>>very frequently. >>>> >>>>AFAICS, the approach in [1] would avoid this cost. >>>But the CPU frequency is just an average value for the last tick period >>>instead of the current one the CPU actually runs at. >>>In addition, there are some conditions to use 'arch_freq_scale' in this >>>approach. >>What are the conditions you are referring to? >It depends on the housekeeping CPUs. >> >>>So I'm not sure if this approach can entirely cover the frequency >>>discrepancy issue. >>Unfortunately there is no perfect frequency feedback. 
By the time you >>observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency >>of the CPU might have already changed. Therefore, an average value might >>be a better indication of the recent performance level of a CPU. >An average value for CPU frequency is ok. It may be better if it has >not any delaying. > >The original implementation for cpuinfo_cur_freq can more reflect their >meaning in the user-guide [1]. The user-guide said: >"cpuinfo_cur_freq : Current frequency of the CPU as obtained from the >hardware, in KHz. >This is the frequency the CPU actually runs at." > > >[1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt > >> >>Would you be able to test [1] on your platform and usecase? >I has tested it on my platform (CPU number: 64, SMT: off and CPU base >frequency: 2.7GHz). >Accoding to the testing result, >1> I found that patch [1] and [2] cannot cover the no housekeeping >CPUs. They still have to face the large frequency discrepancy issue my >patch mentioned. >2> Additionally, the frequency value of all CPUs are almost the same >by using the 'arch_freq_scale' factor way. I'm not sure if it is ok. 
> >The patch [1] has been modified silightly as below: >--> >@@ -1756,7 +1756,10 @@ static unsigned int >cpufreq_verify_current_freq(struct cpufreq_policy *policy, b > { > unsigned int new_freq; > >- new_freq = cpufreq_driver->get(policy->cpu); >+ new_freq = arch_freq_get_on_cpu(policy->cpu); >+ if (!new_freq) >+ new_freq = cpufreq_driver->get(policy->cpu); >+ > if (!new_freq) > return 0; > >And the result is as follows: >*case 1:**No setting the nohz_full and cpufreq use performance governor* >*--> Step1: *read 'cpuinfo_cur_freq' in no pressure > 0: 2699264 2: 2699264 4: 2699264 6: 2699264 > 8: 2696628 10: 2696628 12: 2696628 14: 2699264 > 16: 2699264 18: 2696628 20: 2699264 22: 2696628 > 24: 2699264 26: 2696628 28: 2699264 30: 2696628 > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > 40: 2699264 42: 2699264 44: 2696628 46: 2696628 > 48: 2696628 50: 2699264 52: 2699264 54: 2696628 > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > 64: 2696628 66: 2699264 68: 2696628 70: 2696628 > 72: 2699264 74: 2696628 76: 2696628 78: 2699264 > 80: 2696628 82: 2696628 84: 2699264 86: 2696628 > 88: 2696628 90: 2696628 92: 2696628 94: 2699264 > 96: 2696628 98: 2699264 100: 2699264 102: 2696628 >104: 2699264 106: 2699264 108: 2699264 110: 2696628 >112: 2699264 114: 2699264 116: 2699264 118: 2699264 >120: 2696628 122: 2699264 124: 2696628 126: 2699264 >Note: the frequency of all CPUs are almost the same. > >*--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure. 
> 0: 2696628 2: 2696628 4: 2696628 6: 2696628 > 8: 2696628 10: 2696628 12: 2696628 14: 2696628 > 16: 2696628 18: 2696628 20: 2696628 22: 2696628 > 24: 2696628 26: 2696628 28: 2696628 30: 2696628 > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > 40: 2696628 42: 2696628 44: 2696628 46: 2696628 > 48: 2696628 50: 2696628 52: 2696628 54: 2696628 > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > 64: 2696628 66: 2696628 68: 2696628 70: 2696628 > 72: 2696628 74: 2696628 76: 2696628 78: 2696628 > 80: 2696628 82: 2696628 84: 2696628 86: 2696628 > 88: 2696628 90: 2696628 92: 2696628 94: 2696628 > 96: 2696628 98: 2696628 100: 2696628 102: 2696628 >104: 2696628 106: 2696628 108: 2696628 110: 2696628 >112: 2696628 114: 2696628 116: 2696628 118: 2696628 >120: 2696628 122: 2696628 124: 2696628 126: 2696628 > >*Case 2: setting nohz_full and cpufreq use ondemand governor* >There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 >rcu_nocbs=1-10,41-50" in /proc/cmdline. >*--> Step 1: *setting ondemand governor to all policy and query >'cpuinfo_cur_freq' in no pressure case. >And the frequency of CPUs all are about 400MHz. >*--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. 
>The high memory access pressure is from the command: "stress-ng -c 64
>--cpu-load 100% --taskset 0-63"
>The result:
> 0: 2696628   1:  400000   2:  400000   3:  400909
> 4:  400000   5:  400000   6:  400000   7:  400000
> 8:  400000   9:  400000  10:  400600  11: 2696628
>12: 2696628  13: 2696628  14: 2696628  15: 2696628
>16: 2696628  17: 2696628  18: 2696628  19: 2696628
>20: 2696628  21: 2696628  22: 2696628  23: 2696628
>24: 2696628  25: 2696628  26: 2696628  27: 2696628
>28: 2696628  29: 2696628  30: 2696628  31: 2696628
>32: 2696628  33: 2696628  34: 2696628  35: 2696628
>36: 2696628  37: 2696628  38: 2696628  39: 2696628
>40: 2696628  41:  400000  42:  400000  43:  400000
>44:  400000  45:  398847  46:  400000  47:  400000
>48:  400000  49:  400000  50:  400000  51: 2696628
>52: 2696628  53: 2696628  54: 2696628  55: 2696628
>56: 2696628  57: 2696628  58: 2696628  59: 2696628
>60: 2696628  61: 2696628  62: 2696628  63: 2699264
>
>Note:
>(1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency.
>     It turned out that nohz full was already work.
>     I guess that stress-ng cannot use the CPU in the range of nohz full.
>     Because the CPU frequency will be increased to 2.7G by binding
>CPU to other application.
>(2) The frequency of the nohz full core is calculated by get()
>callback according to ftrace.

I think this is a good point. On systems with a large core count, a number
of CPUs may be isolated and have no kernel tick. The approach in [1] won't
work for those CPUs. The changes proposed in this patch make sure that the
reported CPU frequency is correct regardless of the scheduler or governor
configuration.

My concerns with this approach:
1. We will wake up idle CPUs to query the AMU counters.
   a. The result may not reflect the CPU frequency at the time the CPU was
      running. For example, a CPU was running at 2.7 GHz, then became idle
      and invoked WFI. The CPU implementation reduces the frequency to
      400 MHz when it enters WFI. If we now wake up the CPU to query the
      frequency, it will return 400 MHz. This might be misleading.
   b. It wastes energy, as we may wake up cores just to query the frequency
      and then have them go back to idle.

   To solve this, maybe we can cache the value of the AMU counters and
   return the cached value if the CPU is idle.

2. I think acpi_cppc should use FFH registers only if the firmware
   publishes FFH support for both the delivered and reference counters.
   That way the system designer (through firmware) can control whether the
   FFH registers are used for computing the frequency.

[1] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/

Thanks,
Vanshidhar

>
>[1] https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/
>[2] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/
>>
>>Many thanks,
>>Ionela.
>>
>>>/Huisong
>>>
>>>>>>[1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
>>>>>>
>>>>>>Thanks,
>>>>>>Ionela.
>[snip]
>>.
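The caching idea in concern 1 can be sketched roughly as follows. This is a userspace model under stated assumptions, not a proposal for the actual acpi_cppc code: struct freq_cache, counters_to_khz() and cached_freq_get() are hypothetical names, the idle state is passed in by the caller, and the frequency is taken to be the reference frequency scaled by the ratio of the delivered and reference counter deltas.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU cache; none of these names exist in the kernel. */
struct freq_cache {
	uint64_t delivered;  /* last sampled delivered-performance counter */
	uint64_t reference;  /* last sampled reference-performance counter */
	unsigned int khz;    /* last frequency computed while the CPU ran */
};

/* khz = ref_khz * delta_delivered / delta_reference (0 if no progress). */
static unsigned int counters_to_khz(uint64_t d_delivered, uint64_t d_reference,
				    unsigned int ref_khz)
{
	if (d_reference == 0)
		return 0;
	return (unsigned int)((uint64_t)ref_khz * d_delivered / d_reference);
}

/*
 * Return the cached frequency for an idle CPU instead of waking it.
 * For a running CPU, compute a fresh value from the counter deltas
 * since the previous sample and refresh the cache.
 */
unsigned int cached_freq_get(struct freq_cache *c, bool cpu_is_idle,
			     uint64_t delivered_now, uint64_t reference_now,
			     unsigned int ref_khz)
{
	unsigned int khz;

	assert(c != NULL);
	if (cpu_is_idle)
		return c->khz;  /* counters are not read, CPU stays asleep */

	khz = counters_to_khz(delivered_now - c->delivered,
			      reference_now - c->reference, ref_khz);
	c->delivered = delivered_now;
	c->reference = reference_now;
	if (khz)
		c->khz = khz;
	return c->khz;
}
```

In this model an idle CPU keeps reporting the last value computed while it was running (2.7 GHz in the example from concern 1a) rather than the frequency it drops to in WFI. How stale that value may become, and where the cache would be refreshed in a real implementation, is left open here.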
Hi,

Apologies for jumping in so late...

On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
> Hi Ionela,
>
> On 2024/1/8 22:03, Ionela Voinescu wrote:
> > Hi,
> >
> > On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
> > > Hi Vanshi,
> > >
> > > On 2024/1/5 8:48, Vanshidhar Konda wrote:
> > > > On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
> > > > > On 2024/1/4 1:53, Ionela Voinescu wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
> > > > > > > Many developers found that the cpu current frequency is greater than
> > > > > > > the maximum frequency of the platform, please see [1], [2] and [3].
> > > > > > >
> > > > > > > In the scenarios with high memory access pressure, the patch [1] has
> > > > > > > proved the significant latency of cpc_read() which is used to obtain
> > > > > > > delivered and reference performance counter cause an absurd frequency.
> > > > > > > The sampling interval for this counters is very critical and is expected
> > > > > > > to be equal. However, the different latency of cpc_read() has a direct
> > > > > > > impact on their sampling interval.
> > > > > > >
> > > > > > Would this [1] alternative solution work for you?
> > > > > It would work for me AFAICS.
> > > > > Because the "arch_freq_scale" is also from AMU core and constant
> > > > > counter, and read together.
> > > > > But, from their discuss line, it seems that there are some tricky
> > > > > points to clarify or consider.
> > > > I think the changes in [1] would work better when CPUs may be idle. With
> > > > this patch we would have to wake any core that is in idle state to read
> > > > the AMU counters. Worst case, if core 0 is trying to read the CPU
> > > > frequency of all cores, it may need to wake up all the other cores to
> > > > read the AMU counters.
> > > From the approach in [1], if all CPUs (one or more cores) under one policy > > > are idle, they still cannot be obtained the CPU frequency, right? > > > In this case, the [1] API will return 0 and have to back to call > > > cpufreq_driver->get() for cpuinfo_cur_freq. > > > Then we still need to face the issue this patch mentioned. > > With the implementation at [1], arch_freq_get_on_cpu() will not return 0 > > for idle CPUs and the get() callback will not be called to wake up the > > CPUs. > Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. > However, for no-housekeeping CPUs, it will return 0 and have to call get() > callback, right? > > > > Worst case, arch_freq_get_on_cpu() will return a frequency based on the > > AMU counter values obtained on the last tick on that CPU. But if that CPU > > is not a housekeeping CPU, a housekeeping CPU in the same policy will be > > selected, as it would have had a more recent tick, and therefore a more > > recent frequency value for the domain. > But this frequency is from the last tick, > this last tick is probably a long time ago and it doesn't update > 'arch_freq_scale' for some reasons like CPU dile. > In addition, I'm not sure if there is possible that amu_scale_freq_tick() is > executed delayed under high stress case. > It also have an impact on the accuracy of the cpu frequency we query. > > > > I understand that the frequency returned here will not be up to date, > > but there's no proper frequency feedback for an idle CPU. If one only > > wakes up a CPU to sample counters, before the CPU goes back to sleep, > > the obtained frequency feedback is meaningless. > > > > > > For systems with 128 cores or more, this could be very expensive and > > > > happen > > > > very frequently. > > > > > > > > AFAICS, the approach in [1] would avoid this cost. > > > But the CPU frequency is just an average value for the last tick period > > > instead of the current one the CPU actually runs at. 
> > > In addition, there are some conditions to use 'arch_freq_scale' in this
> > > approach.
> > What are the conditions you are referring to?
> It depends on the housekeeping CPUs.
> >
> > > So I'm not sure if this approach can entirely cover the frequency
> > > discrepancy issue.
> > Unfortunately there is no perfect frequency feedback. By the time you
> > observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency
> > of the CPU might have already changed. Therefore, an average value might
> > be a better indication of the recent performance level of a CPU.
> An average value for CPU frequency is ok. It may be better if it has not any
> delaying.
>
> The original implementation for cpuinfo_cur_freq can more reflect their
> meaning in the user-guide [1]. The user-guide said:
> "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the
> hardware, in KHz.
> This is the frequency the CPU actually runs at."
>
>
> [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt
>
> >
> > Would you be able to test [1] on your platform and usecase?
> I has tested it on my platform (CPU number: 64, SMT: off and CPU base
> frequency: 2.7GHz).
> Accoding to the testing result,
> 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs.
> They still have to face the large frequency discrepancy issue my patch
> mentioned.
> 2> Additionally, the frequency value of all CPUs are almost the same by
> using the 'arch_freq_scale' factor way. I'm not sure if it is ok.
>
> The patch [1] has been modified silightly as below:
> -->
> @@ -1756,7 +1756,10 @@ static unsigned int
> cpufreq_verify_current_freq(struct cpufreq_policy *policy, b
>  {
>         unsigned int new_freq;
>
> -       new_freq = cpufreq_driver->get(policy->cpu);
> +       new_freq = arch_freq_get_on_cpu(policy->cpu);
> +       if (!new_freq)
> +               new_freq = cpufreq_driver->get(policy->cpu);
> +

As pointed out, this change will not make it into the next version of the
patch. So I'd say you can safely ignore it and assume that
arch_freq_get_on_cpu() will only be wired up for the sysfs nodes
scaling_cur_freq/cpuinfo_cur_freq.

>         if (!new_freq)
>                 return 0;
>
> And the result is as follows:
> *case 1:**No setting the nohz_full and cpufreq use performance governor*
> *--> Step1: *read 'cpuinfo_cur_freq' in no pressure
>   0: 2699264   2: 2699264   4: 2699264   6: 2699264
>   8: 2696628  10: 2696628  12: 2696628  14: 2699264
>  16: 2699264  18: 2696628  20: 2699264  22: 2696628
>  24: 2699264  26: 2696628  28: 2699264  30: 2696628
>  32: 2696628  34: 2696628  36: 2696628  38: 2696628
>  40: 2699264  42: 2699264  44: 2696628  46: 2696628
>  48: 2696628  50: 2699264  52: 2699264  54: 2696628
>  56: 2696628  58: 2696628  60: 2696628  62: 2696628
>  64: 2696628  66: 2699264  68: 2696628  70: 2696628
>  72: 2699264  74: 2696628  76: 2696628  78: 2699264
>  80: 2696628  82: 2696628  84: 2699264  86: 2696628
>  88: 2696628  90: 2696628  92: 2696628  94: 2699264
>  96: 2696628  98: 2699264 100: 2699264 102: 2696628
> 104: 2699264 106: 2699264 108: 2699264 110: 2696628
> 112: 2699264 114: 2699264 116: 2699264 118: 2699264
> 120: 2696628 122: 2699264 124: 2696628 126: 2699264
> Note: the frequency of all CPUs are almost the same.

Were you expecting something else?

>
> *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure.
>   0: 2696628   2: 2696628   4: 2696628   6: 2696628
>   8: 2696628  10: 2696628  12: 2696628  14: 2696628
>  16: 2696628  18: 2696628  20: 2696628  22: 2696628
>  24: 2696628  26: 2696628  28: 2696628  30: 2696628
>  32: 2696628  34: 2696628  36: 2696628  38: 2696628
>  40: 2696628  42: 2696628  44: 2696628  46: 2696628
>  48: 2696628  50: 2696628  52: 2696628  54: 2696628
>  56: 2696628  58: 2696628  60: 2696628  62: 2696628
>  64: 2696628  66: 2696628  68: 2696628  70: 2696628
>  72: 2696628  74: 2696628  76: 2696628  78: 2696628
>  80: 2696628  82: 2696628  84: 2696628  86: 2696628
>  88: 2696628  90: 2696628  92: 2696628  94: 2696628
>  96: 2696628  98: 2696628 100: 2696628 102: 2696628
> 104: 2696628 106: 2696628 108: 2696628 110: 2696628
> 112: 2696628 114: 2696628 116: 2696628 118: 2696628
> 120: 2696628 122: 2696628 124: 2696628 126: 2696628
>
> *Case 2: setting nohz_full and cpufreq use ondemand governor*
> There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in
> /proc/cmdline.

Right, so if I remember correctly, nohz_full implies rcu_nocbs, so there is
no need to set that one. Now, AFAIR, isolcpus makes the selected CPUs
disappear from the scheduler's view (no balancing, no migrating), so unless
you explicitly affine something to those CPUs, you will not see much
activity there. I need to double check though, as it has been a while ...

> *--> Step 1: *setting ondemand governor to all policy and query
> 'cpuinfo_cur_freq' in no pressure case.
> And the frequency of CPUs all are about 400MHz.
> *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure.
> The high memory access pressure is from the command: "stress-ng -c 64
> --cpu-load 100% --taskset 0-63"

I'm not entirely convinced that this will affine to the isolated CPUs,
especially as the affinity mask spans all available CPUs. If that is the
case, no wonder your isolated CPUs are sitting idle. But I would have to
double check how this is handled.

> The result:
>  0: 2696628   1:  400000   2:  400000   3:  400909
>  4:  400000   5:  400000   6:  400000   7:  400000
>  8:  400000   9:  400000  10:  400600  11: 2696628
> 12: 2696628  13: 2696628  14: 2696628  15: 2696628
> 16: 2696628  17: 2696628  18: 2696628  19: 2696628
> 20: 2696628  21: 2696628  22: 2696628  23: 2696628
> 24: 2696628  25: 2696628  26: 2696628  27: 2696628
> 28: 2696628  29: 2696628  30: 2696628  31: 2696628
> 32: 2696628  33: 2696628  34: 2696628  35: 2696628
> 36: 2696628  37: 2696628  38: 2696628  39: 2696628
> 40: 2696628  41:  400000  42:  400000  43:  400000
> 44:  400000  45:  398847  46:  400000  47:  400000
> 48:  400000  49:  400000  50:  400000  51: 2696628
> 52: 2696628  53: 2696628  54: 2696628  55: 2696628
> 56: 2696628  57: 2696628  58: 2696628  59: 2696628
> 60: 2696628  61: 2696628  62: 2696628  63: 2699264
>
> Note:
> (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency.
>      It turned out that nohz full was already work.
>      I guess that stress-ng cannot use the CPU in the range of nohz full.
>      Because the CPU frequency will be increased to 2.7G by binding CPU to
> other application.
> (2) The frequency of the nohz full core is calculated by get() callback
> according to ftrace.

It is, as there is no sched tick on those CPUs, and apparently there is
nothing running on them either. Unless I am missing something.

---
BR
Beata

>
> [1] https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/
> [2] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/
> >
> > Many thanks,
> > Ionela.
> >
> > > /Huisong
> > >
> > > > > > [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
> > > > > >
> > > > > > Thanks,
> > > > > > Ionela.
> [snip]
> > .
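The premise of the original patch, quoted at the top of this message, is that the delivered and reference counters must be sampled over equal intervals. The sketch below illustrates why, as a back-of-the-envelope model only: reported_khz() and ticks_in() are hypothetical helpers, the 100 MHz reference counter and the 2 us and 3 us windows are assumed values chosen for the arithmetic, and real counter reads are of course not this idealized.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles that a counter running at `khz` accumulates in `us` microseconds. */
static uint64_t ticks_in(uint64_t khz, uint64_t us)
{
	return khz * us / 1000;
}

/*
 * Frequency computed from the two counter deltas when the delivered and
 * reference counters were observed over (possibly different) windows.
 * All parameters are illustrative assumptions.
 */
unsigned int reported_khz(unsigned int true_khz, unsigned int ref_khz,
			  unsigned int ref_window_us,
			  unsigned int delivered_window_us)
{
	uint64_t d_ref = ticks_in(ref_khz, ref_window_us);
	uint64_t d_del = ticks_in(true_khz, delivered_window_us);

	assert(ref_khz != 0);
	if (d_ref == 0)
		return 0;
	return (unsigned int)((uint64_t)ref_khz * d_del / d_ref);
}
```

With equal 2 us windows, a CPU that really runs at 2.7 GHz is reported as 2700000 kHz. If the delivered counter is effectively observed over 3 us while the reference counter covers 2 us, the same CPU is reported as 4050000 kHz, well above the platform maximum. That is the kind of discrepancy the cover letter attributes to unequal cpc_read() latency.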
On 2024/1/13 2:33, Vanshidhar Konda wrote:
> On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>> Hi Ionela,
>>
>> On 2024/1/8 22:03, Ionela Voinescu wrote:
>>> Hi,
>>>
>>> On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>> Hi Vanshi,
>>>>
>>>> On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>> Many developers found that the cpu current frequency is greater than
>>>>>>>> the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>>>
>>>>>>>> In the scenarios with high memory access pressure, the patch [1] has
>>>>>>>> proved the significant latency of cpc_read() which is used to obtain
>>>>>>>> delivered and reference performance counter cause an absurd frequency.
>>>>>>>> The sampling interval for this counters is very critical and is expected
>>>>>>>> to be equal. However, the different latency of cpc_read() has a direct
>>>>>>>> impact on their sampling interval.
>>>>>>>>
>>>>>>> Would this [1] alternative solution work for you?
>>>>>> It would work for me AFAICS.
>>>>>> Because the "arch_freq_scale" is also from AMU core and constant
>>>>>> counter, and read together.
>>>>>> But, from their discuss line, it seems that there are some tricky
>>>>>> points to clarify or consider.
>>>>> I think the changes in [1] would work better when CPUs may be idle. With
>>>>> this patch we would have to wake any core that is in idle state to read
>>>>> the AMU counters. Worst case, if core 0 is trying to read the CPU
>>>>> frequency of all cores, it may need to wake up all the other cores to
>>>>> read the AMU counters.
>>>> From the approach in [1], if all CPUs (one or more cores) under one >>>> policy >>>> are idle, they still cannot be obtained the CPU frequency, right? >>>> In this case, the [1] API will return 0 and have to back to call >>>> cpufreq_driver->get() for cpuinfo_cur_freq. >>>> Then we still need to face the issue this patch mentioned. >>> With the implementation at [1], arch_freq_get_on_cpu() will not >>> return 0 >>> for idle CPUs and the get() callback will not be called to wake up the >>> CPUs. >> Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >> However, for no-housekeeping CPUs, it will return 0 and have to call >> get() callback, right? >>> >>> Worst case, arch_freq_get_on_cpu() will return a frequency based on the >>> AMU counter values obtained on the last tick on that CPU. But if >>> that CPU >>> is not a housekeeping CPU, a housekeeping CPU in the same policy >>> will be >>> selected, as it would have had a more recent tick, and therefore a more >>> recent frequency value for the domain. >> But this frequency is from the last tick, >> this last tick is probably a long time ago and it doesn't update >> 'arch_freq_scale' for some reasons like CPU dile. >> In addition, I'm not sure if there is possible that >> amu_scale_freq_tick() is executed delayed under high stress case. >> It also have an impact on the accuracy of the cpu frequency we query. >>> >>> I understand that the frequency returned here will not be up to date, >>> but there's no proper frequency feedback for an idle CPU. If one only >>> wakes up a CPU to sample counters, before the CPU goes back to sleep, >>> the obtained frequency feedback is meaningless. >>> >>>>> For systems with 128 cores or more, this could be very expensive and >>>>> happen >>>>> very frequently. >>>>> >>>>> AFAICS, the approach in [1] would avoid this cost. >>>> But the CPU frequency is just an average value for the last tick >>>> period >>>> instead of the current one the CPU actually runs at. 
>>>> In addition, there are some conditions to use 'arch_freq_scale' in >>>> this >>>> approach. >>> What are the conditions you are referring to? >> It depends on the housekeeping CPUs. >>> >>>> So I'm not sure if this approach can entirely cover the frequency >>>> discrepancy issue. >>> Unfortunately there is no perfect frequency feedback. By the time you >>> observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the >>> frequency >>> of the CPU might have already changed. Therefore, an average value >>> might >>> be a better indication of the recent performance level of a CPU. >> An average value for CPU frequency is ok. It may be better if it has >> not any delaying. >> >> The original implementation for cpuinfo_cur_freq can more reflect their >> meaning in the user-guide [1]. The user-guide said: >> "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the >> hardware, in KHz. >> This is the frequency the CPU actually runs at." >> >> >> [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt >> >>> >>> Would you be able to test [1] on your platform and usecase? >> I has tested it on my platform (CPU number: 64, SMT: off and CPU base >> frequency: 2.7GHz). >> Accoding to the testing result, >> 1> I found that patch [1] and [2] cannot cover the no housekeeping >> CPUs. They still have to face the large frequency discrepancy issue >> my patch mentioned. >> 2> Additionally, the frequency value of all CPUs are almost the same >> by using the 'arch_freq_scale' factor way. I'm not sure if it is ok. 
>> >> The patch [1] has been modified silightly as below: >> --> >> @@ -1756,7 +1756,10 @@ static unsigned int >> cpufreq_verify_current_freq(struct cpufreq_policy *policy, b >> { >> unsigned int new_freq; >> >> - new_freq = cpufreq_driver->get(policy->cpu); >> + new_freq = arch_freq_get_on_cpu(policy->cpu); >> + if (!new_freq) >> + new_freq = cpufreq_driver->get(policy->cpu); >> + >> if (!new_freq) >> return 0; >> >> And the result is as follows: >> *case 1:**No setting the nohz_full and cpufreq use performance governor* >> *--> Step1: *read 'cpuinfo_cur_freq' in no pressure >> 0: 2699264 2: 2699264 4: 2699264 6: 2699264 >> 8: 2696628 10: 2696628 12: 2696628 14: 2699264 >> 16: 2699264 18: 2696628 20: 2699264 22: 2696628 >> 24: 2699264 26: 2696628 28: 2699264 30: 2696628 >> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >> 40: 2699264 42: 2699264 44: 2696628 46: 2696628 >> 48: 2696628 50: 2699264 52: 2699264 54: 2696628 >> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >> 64: 2696628 66: 2699264 68: 2696628 70: 2696628 >> 72: 2699264 74: 2696628 76: 2696628 78: 2699264 >> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >> 88: 2696628 90: 2696628 92: 2696628 94: 2699264 >> 96: 2696628 98: 2699264 100: 2699264 102: 2696628 >> 104: 2699264 106: 2699264 108: 2699264 110: 2696628 >> 112: 2699264 114: 2699264 116: 2699264 118: 2699264 >> 120: 2696628 122: 2699264 124: 2696628 126: 2699264 >> Note: the frequency of all CPUs are almost the same. >> >> *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access >> pressure. 
>> 0: 2696628 2: 2696628 4: 2696628 6: 2696628 >> 8: 2696628 10: 2696628 12: 2696628 14: 2696628 >> 16: 2696628 18: 2696628 20: 2696628 22: 2696628 >> 24: 2696628 26: 2696628 28: 2696628 30: 2696628 >> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >> 40: 2696628 42: 2696628 44: 2696628 46: 2696628 >> 48: 2696628 50: 2696628 52: 2696628 54: 2696628 >> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >> 72: 2696628 74: 2696628 76: 2696628 78: 2696628 >> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >> 88: 2696628 90: 2696628 92: 2696628 94: 2696628 >> 96: 2696628 98: 2696628 100: 2696628 102: 2696628 >> 104: 2696628 106: 2696628 108: 2696628 110: 2696628 >> 112: 2696628 114: 2696628 116: 2696628 118: 2696628 >> 120: 2696628 122: 2696628 124: 2696628 126: 2696628 >> >> *Case 2: setting nohz_full and cpufreq use ondemand governor* >> There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 >> rcu_nocbs=1-10,41-50" in /proc/cmdline. >> *--> Step 1: *setting ondemand governor to all policy and query >> 'cpuinfo_cur_freq' in no pressure case. >> And the frequency of CPUs all are about 400MHz. >> *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access >> pressure. 
>> The high memory access pressure is from the command: "stress-ng -c 64 >> --cpu-load 100% --taskset 0-63" >> The result: >> 0: 2696628 1: 400000 2: 400000 3: 400909 >> 4: 400000 5: 400000 6: 400000 7: 400000 >> 8: 400000 9: 400000 10: 400600 11: 2696628 >> 12: 2696628 13: 2696628 14: 2696628 15: 2696628 >> 16: 2696628 17: 2696628 18: 2696628 19: 2696628 >> 20: 2696628 21: 2696628 22: 2696628 23: 2696628 >> 24: 2696628 25: 2696628 26: 2696628 27: 2696628 >> 28: 2696628 29: 2696628 30: 2696628 31: 2696628 >> 32: 2696628 33: 2696628 34: 2696628 35: 2696628 >> 36: 2696628 37: 2696628 38: 2696628 39: 2696628 >> 40: 2696628 41: 400000 42: 400000 43: 400000 >> 44: 400000 45: 398847 46: 400000 47: 400000 >> 48: 400000 49: 400000 50: 400000 51: 2696628 >> 52: 2696628 53: 2696628 54: 2696628 55: 2696628 >> 56: 2696628 57: 2696628 58: 2696628 59: 2696628 >> 60: 2696628 61: 2696628 62: 2696628 63: 2699264 >> >> Note: >> (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. >> It turned out that nohz full was already work. >> I guess that stress-ng cannot use the CPU in the range of nohz >> full. >> Because the CPU frequency will be increased to 2.7G by binding >> CPU to other application. >> (2) The frequency of the nohz full core is calculated by get() >> callback according to ftrace. > > I think this is a good point. It is possible that on large core count > systems a number of CPUs are isolated and don't have a kernel tick. The > approach in [1] won't work for those CPUs. The changes proposed in this > patch make sure that regardless of the scheduler or governor > configuration, the CPU frequency reporting will be correct. > > My concerns with this approach: > 1. We will wake up idle CPUs and query the AMU counters. > a. This may not reflect the CPU frequency at the time the CPU was > running. For example, CPU was running at 2.7 GHz, then became idle and > invoked WFI. The CPU implementation reduces the CPU frequency to 400 > MHz when it enters WFI. 
Now if we wake up the CPU to query the
> frequency, it will return 400 MHz. This might be misleading.
>
> b. It wastes energy as we may wake up cores just to query frequency
> and then have them go back to idle.

Waking up an idle CPU just to query its frequency and then letting it go back
to idle wastes energy. The frequency of an idle CPU may be zero, but we cannot
display zero because that is not friendly to the user.
If we use the counters cached before the CPU went idle as the frequency
feedback, the value cannot reflect the real frequency of the CPU.
So there is no perfect frequency feedback for idle CPUs.

>
> To solve this, may be we can cache the value of AMU counters; return
> the cached value of the AMU counter if the CPU is idle.

The "arch_freq_scale", which comes from the AMU counters, is similar to the
cached value in Beata's approach, but it only works for housekeeping CPUs
that have a tick.
If using a cached frequency is OK, what should we do for nohz_full CPUs?
I have no good idea.

>
> 2. I think the acpi_cppc should use FFH registers only if the firmware
> publishes FFH support for both the delivered and reference counters.
> That way the system designer (through firmware) can control if the
> FFH registers are used for computing frequency.

Right, so this patch has to check whether FFH is supported.

>
> [1]
> https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/
>
> Thanks,
> Vanshidhar
>
>>
>> [1]
>> https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/
>> [2]
>> https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/
>>>
>>> Many thanks,
>>> Ionela.
>>>
>>>> /Huisong
>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ionela.
>>>>>>>
>>>>>>>> This patch adds a interface, cpc_read_arch_counters_on_cpu, to read
>>>>>>>> delivered and reference performance counter together.
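For reference, the effect described in the patch description (a frequency above the platform maximum when the two counters are not sampled over the same interval) can be reproduced with plain arithmetic. The sketch below is not the kernel code: `struct fb_ctrs`, `ctrs_to_khz()`, `sample_with_skew()` and the sample numbers are made up for illustration. It assumes a reference counter ticking at a fixed rate and a delivered counter ticking at the CPU clock, and models the extra cpc_read() latency as a delay before the second delivered-counter read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the two feedback counters (not a kernel struct). */
struct fb_ctrs {
	uint64_t delivered;
	uint64_t reference;
};

/*
 * Estimate the frequency in kHz from two snapshots:
 *   freq = ref_khz * delta(delivered) / delta(reference)
 * Returns 0 if the reference counter did not advance.
 */
static uint64_t ctrs_to_khz(uint64_t ref_khz, struct fb_ctrs t0,
			    struct fb_ctrs t1)
{
	uint64_t d_del = t1.delivered - t0.delivered;
	uint64_t d_ref = t1.reference - t0.reference;

	if (d_ref == 0)
		return 0;
	return ref_khz * d_del / d_ref;
}

/*
 * Model one sampling window of window_us microseconds on a CPU running at
 * cpu_khz. In the second snapshot the delivered counter is read skew_us
 * later than the reference counter, so it covers a longer interval.
 */
static uint64_t sample_with_skew(uint64_t ref_khz, uint64_t cpu_khz,
				 uint64_t window_us, uint64_t skew_us)
{
	struct fb_ctrs t0 = { 0, 0 };
	struct fb_ctrs t1;

	assert(ref_khz > 0 && window_us > 0);
	/* kHz * us / 1000 = number of cycles counted in that time span */
	t1.reference = ref_khz * window_us / 1000;
	t1.delivered = cpu_khz * (window_us + skew_us) / 1000;
	return ctrs_to_khz(ref_khz, t0, t1);
}
```

With an assumed 100 MHz reference counter and a 2 us window, `sample_with_skew(100000, 2700000, 2, 0)` gives 2700000 kHz, while a 1 us delay on the delivered read, `sample_with_skew(100000, 2700000, 2, 1)`, gives 4050000 kHz. The shorter the window, the more a fixed read latency distorts the ratio, which is why reading both counters together matters.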
Hi,

On 2024/1/16 22:10, Beata Michalska wrote:
> Hi,
>
> Apologies for jumping in so late....
>
> On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>> Hi Ionela,
>>
>> On 2024/1/8 22:03, Ionela Voinescu wrote:
>>> Hi,
>>>
>>> On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>> Hi Vanshi,
>>>>
>>>> On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>> Many developers found that the cpu current frequency is greater than
>>>>>>>> the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>>>
>>>>>>>> In the scenarios with high memory access pressure, the patch [1] has
>>>>>>>> proved the significant latency of cpc_read() which is used to obtain
>>>>>>>> delivered and reference performance counter cause an absurd frequency.
>>>>>>>> The sampling interval for this counters is very critical and is expected
>>>>>>>> to be equal. However, the different latency of cpc_read() has a direct
>>>>>>>> impact on their sampling interval.
>>>>>>>>
>>>>>>> Would this [1] alternative solution work for you?
>>>>>> It would work for me AFAICS.
>>>>>> Because the "arch_freq_scale" is also from AMU core and constant
>>>>>> counter, and read together.
>>>>>> But, from their discuss line, it seems that there are some tricky
>>>>>> points to clarify or consider.
>>>>> I think the changes in [1] would work better when CPUs may be idle. With this
>>>>> patch we would have to wake any core that is in idle state to read the AMU
>>>>> counters. Worst case, if core 0 is trying to read the CPU frequency of all
>>>>> cores, it may need to wake up all the other cores to read the AMU
>>>>> counters.
>>>> From the approach in [1], if all CPUs (one or more cores) under one policy >>>> are idle, they still cannot be obtained the CPU frequency, right? >>>> In this case, the [1] API will return 0 and have to back to call >>>> cpufreq_driver->get() for cpuinfo_cur_freq. >>>> Then we still need to face the issue this patch mentioned. >>> With the implementation at [1], arch_freq_get_on_cpu() will not return 0 >>> for idle CPUs and the get() callback will not be called to wake up the >>> CPUs. >> Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >> However, for no-housekeeping CPUs, it will return 0 and have to call get() >> callback, right? >>> Worst case, arch_freq_get_on_cpu() will return a frequency based on the >>> AMU counter values obtained on the last tick on that CPU. But if that CPU >>> is not a housekeeping CPU, a housekeeping CPU in the same policy will be >>> selected, as it would have had a more recent tick, and therefore a more >>> recent frequency value for the domain. >> But this frequency is from the last tick, >> this last tick is probably a long time ago and it doesn't update >> 'arch_freq_scale' for some reasons like CPU dile. >> In addition, I'm not sure if there is possible that amu_scale_freq_tick() is >> executed delayed under high stress case. >> It also have an impact on the accuracy of the cpu frequency we query. >>> I understand that the frequency returned here will not be up to date, >>> but there's no proper frequency feedback for an idle CPU. If one only >>> wakes up a CPU to sample counters, before the CPU goes back to sleep, >>> the obtained frequency feedback is meaningless. >>> >>>>> For systems with 128 cores or more, this could be very expensive and >>>>> happen >>>>> very frequently. >>>>> >>>>> AFAICS, the approach in [1] would avoid this cost. >>>> But the CPU frequency is just an average value for the last tick period >>>> instead of the current one the CPU actually runs at. 
>>>> In addition, there are some conditions to use 'arch_freq_scale' in this >>>> approach. >>> What are the conditions you are referring to? >> It depends on the housekeeping CPUs. >>>> So I'm not sure if this approach can entirely cover the frequency >>>> discrepancy issue. >>> Unfortunately there is no perfect frequency feedback. By the time you >>> observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency >>> of the CPU might have already changed. Therefore, an average value might >>> be a better indication of the recent performance level of a CPU. >> An average value for CPU frequency is ok. It may be better if it has not any >> delaying. >> >> The original implementation for cpuinfo_cur_freq can more reflect their >> meaning in the user-guide [1]. The user-guide said: >> "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the >> hardware, in KHz. >> This is the frequency the CPU actually runs at." >> >> >> [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt >> >>> Would you be able to test [1] on your platform and usecase? >> I has tested it on my platform (CPU number: 64, SMT: off and CPU base >> frequency: 2.7GHz). >> Accoding to the testing result, >> 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. >> They still have to face the large frequency discrepancy issue my patch >> mentioned. >> 2> Additionally, the frequency value of all CPUs are almost the same by >> using the 'arch_freq_scale' factor way. I'm not sure if it is ok. >> >> The patch [1] has been modified silightly as below: >> --> >> @@ -1756,7 +1756,10 @@ static unsigned int >> cpufreq_verify_current_freq(struct cpufreq_policy *policy, b >> { >> unsigned int new_freq; >> >> - new_freq = cpufreq_driver->get(policy->cpu); >> + new_freq = arch_freq_get_on_cpu(policy->cpu); >> + if (!new_freq) >> + new_freq = cpufreq_driver->get(policy->cpu); >> + > As pointed out this change will not make it to the next version of the patch. 
> So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will
> only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq
>>       if (!new_freq)
>>               return 0;
>>
>> And the result is as follows:
>> *case 1:**No setting the nohz_full and cpufreq use performance governor*
>> *--> Step1: *read 'cpuinfo_cur_freq' in no pressure
>>   0: 2699264     2: 2699264     4: 2699264     6: 2699264
>>   8: 2696628    10: 2696628    12: 2696628    14: 2699264
>>  16: 2699264    18: 2696628    20: 2699264    22: 2696628
>>  24: 2699264    26: 2696628    28: 2699264    30: 2696628
>>  32: 2696628    34: 2696628    36: 2696628    38: 2696628
>>  40: 2699264    42: 2699264    44: 2696628    46: 2696628
>>  48: 2696628    50: 2699264    52: 2699264    54: 2696628
>>  56: 2696628    58: 2696628    60: 2696628    62: 2696628
>>  64: 2696628    66: 2699264    68: 2696628    70: 2696628
>>  72: 2699264    74: 2696628    76: 2696628    78: 2699264
>>  80: 2696628    82: 2696628    84: 2699264    86: 2696628
>>  88: 2696628    90: 2696628    92: 2696628    94: 2699264
>>  96: 2696628    98: 2699264   100: 2699264   102: 2696628
>> 104: 2699264   106: 2699264   108: 2699264   110: 2696628
>> 112: 2699264   114: 2699264   116: 2699264   118: 2699264
>> 120: 2696628   122: 2699264   124: 2696628   126: 2699264
>> Note: the frequency of all CPUs are almost the same.
> Were you expecting smth else ?

I expected that each CPU might report a different frequency value, yet all
CPUs report the same value under high pressure.
I don't know what the behaviour is on other platforms.
Do you know who else has tested it?

>> *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure.
>> 0: 2696628 2: 2696628 4: 2696628 6: 2696628 >> 8: 2696628 10: 2696628 12: 2696628 14: 2696628 >> 16: 2696628 18: 2696628 20: 2696628 22: 2696628 >> 24: 2696628 26: 2696628 28: 2696628 30: 2696628 >> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >> 40: 2696628 42: 2696628 44: 2696628 46: 2696628 >> 48: 2696628 50: 2696628 52: 2696628 54: 2696628 >> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >> 72: 2696628 74: 2696628 76: 2696628 78: 2696628 >> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >> 88: 2696628 90: 2696628 92: 2696628 94: 2696628 >> 96: 2696628 98: 2696628 100: 2696628 102: 2696628 >> 104: 2696628 106: 2696628 108: 2696628 110: 2696628 >> 112: 2696628 114: 2696628 116: 2696628 118: 2696628 >> 120: 2696628 122: 2696628 124: 2696628 126: 2696628 >> >> *Case 2: setting nohz_full and cpufreq use ondemand governor* >> There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in >> /proc/cmdline. > Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to > set that one. > Now, afair, isolcpus will make the selected CPUs to disappear from the > schedulers view (no balancing, no migrating), so unless you affine smth > explicitly to those CPUs, you will not see much of an activity there. Correct. > Need to double check though as it has been a while ... >> *--> Step 1: *setting ondemand governor to all policy and query >> 'cpuinfo_cur_freq' in no pressure case. >> And the frequency of CPUs all are about 400MHz. >> *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. >> The high memory access pressure is from the command: "stress-ng -c 64 >> --cpu-load 100% --taskset 0-63" > I'm not entirely convinced that this will affine to isolated cpus, especially > that the affinity mask spans all available cpus. If that is the case, no wonder > your isolated cpus are getting wasted being idle. But I would have to double > check how this is being handled. 
>> The result: >> 0: 2696628 1: 400000 2: 400000 3: 400909 >> 4: 400000 5: 400000 6: 400000 7: 400000 >> 8: 400000 9: 400000 10: 400600 11: 2696628 >> 12: 2696628 13: 2696628 14: 2696628 15: 2696628 >> 16: 2696628 17: 2696628 18: 2696628 19: 2696628 >> 20: 2696628 21: 2696628 22: 2696628 23: 2696628 >> 24: 2696628 25: 2696628 26: 2696628 27: 2696628 >> 28: 2696628 29: 2696628 30: 2696628 31: 2696628 >> 32: 2696628 33: 2696628 34: 2696628 35: 2696628 >> 36: 2696628 37: 2696628 38: 2696628 39: 2696628 >> 40: 2696628 41: 400000 42: 400000 43: 400000 >> 44: 400000 45: 398847 46: 400000 47: 400000 >> 48: 400000 49: 400000 50: 400000 51: 2696628 >> 52: 2696628 53: 2696628 54: 2696628 55: 2696628 >> 56: 2696628 57: 2696628 58: 2696628 59: 2696628 >> 60: 2696628 61: 2696628 62: 2696628 63: 2699264 >> >> Note: >> (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. >> It turned out that nohz full was already work. >> I guess that stress-ng cannot use the CPU in the range of nohz full. >> Because the CPU frequency will be increased to 2.7G by binding CPU to >> other application. >> (2) The frequency of the nohz full core is calculated by get() callback >> according to ftrace. > It is as there is no sched tick on those, and apparently there is nothing > running on them either. Yes. If we select your approach and the above phenomenon is normal, the large frequency discrepancy issue can be resolved for CPUs with sched tick by the way. But the nohz full cores still have to face this issue. So this patch is also needed. BR /huisong > > Unless I am missing smth. > > --- > BR > Beata > >> [1] https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/ >> [2] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/ >>> Many thanks, >>> Ionela. >>> >>>> /Huisong >>>> >>>>>>> [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/ >>>>>>> >>>>>>> Thanks, >>>>>>> Ionela. 
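The fallback order exercised by the modified test patch quoted in this message (try the tick-based estimate first, call the driver's get() callback only when the estimate is 0) can be sketched in user space as follows. This is only an illustration of the selection logic under discussion: `freq_fn`, `current_freq_khz()` and the two `fake_*` stand-ins are hypothetical and are not kernel interfaces, and the CPU ranges and values are borrowed from the Case 2 results purely as sample data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns kHz, or 0 when no value is available. */
typedef unsigned int (*freq_fn)(int cpu);

/*
 * Prefer the tick-based estimate. Fall back to the driver callback only when
 * the estimate is unavailable, modelled here as a return value of 0 (for
 * example a nohz_full CPU that never gets a scheduler tick).
 */
static unsigned int current_freq_khz(int cpu, freq_fn tick_estimate,
				     freq_fn driver_get)
{
	unsigned int khz = 0;

	assert(driver_get != NULL);
	if (tick_estimate)
		khz = tick_estimate(cpu);
	if (!khz)
		khz = driver_get(cpu);
	return khz;
}

/* Stand-in: CPUs 1-10 are treated as nohz_full and have no estimate. */
static unsigned int fake_tick_estimate(int cpu)
{
	return (cpu >= 1 && cpu <= 10) ? 0 : 2696628;
}

/* Stand-in for a driver callback that samples the counters directly. */
static unsigned int fake_driver_get(int cpu)
{
	(void)cpu;
	return 400000;
}
```

Here `current_freq_khz(0, fake_tick_estimate, fake_driver_get)` takes the tick-based path, while CPU 5 falls through to the driver callback. That second path is the one that still depends on how the counters are sampled, which is the case this patch is concerned with.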
On Wed, Jan 17, 2024 at 05:18:40PM +0800, lihuisong (C) wrote:

Hi,

Again, apologies for the delay,

> Hi,
>
> On 2024/1/16 22:10, Beata Michalska wrote:
> > Hi,
> >
> > Apologies for jumping in so late....
> >
> > On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
> > > Hi Ionela,
> > >
> > > On 2024/1/8 22:03, Ionela Voinescu wrote:
> > > > Hi,
> > > >
> > > > On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
> > > > > Hi Vanshi,
> > > > >
> > > > > On 2024/1/5 8:48, Vanshidhar Konda wrote:
> > > > > > On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
> > > > > > > On 2024/1/4 1:53, Ionela Voinescu wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
> > > > > > > > > Many developers found that the cpu current frequency is greater than
> > > > > > > > > the maximum frequency of the platform, please see [1], [2] and [3].
> > > > > > > > >
> > > > > > > > > In the scenarios with high memory access pressure, the patch [1] has
> > > > > > > > > proved the significant latency of cpc_read() which is used to obtain
> > > > > > > > > delivered and reference performance counter cause an absurd frequency.
> > > > > > > > > The sampling interval for this counters is very critical and is expected
> > > > > > > > > to be equal. However, the different latency of cpc_read() has a direct
> > > > > > > > > impact on their sampling interval.
> > > > > > > > >
> > > > > > > > Would this [1] alternative solution work for you?
> > > > > > > It would work for me AFAICS.
> > > > > > > Because the "arch_freq_scale" is also from AMU core and constant
> > > > > > > counter, and read together.
> > > > > > > But, from their discuss line, it seems that there are some tricky
> > > > > > > points to clarify or consider.
> > > > > > I think the changes in [1] would work better when CPUs may be idle.
With > > > > > > this > > > > > > patch we would have to wake any core that is in idle state to read the > > > > > > AMU > > > > > > counters. Worst case, if core 0 is trying to read the CPU frequency of > > > > > > all > > > > > > cores, it may need to wake up all the other cores to read the AMU > > > > > > counters. > > > > > From the approach in [1], if all CPUs (one or more cores) under one policy > > > > > are idle, they still cannot be obtained the CPU frequency, right? > > > > > In this case, the [1] API will return 0 and have to back to call > > > > > cpufreq_driver->get() for cpuinfo_cur_freq. > > > > > Then we still need to face the issue this patch mentioned. > > > > With the implementation at [1], arch_freq_get_on_cpu() will not return 0 > > > > for idle CPUs and the get() callback will not be called to wake up the > > > > CPUs. > > > Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. > > > However, for no-housekeeping CPUs, it will return 0 and have to call get() > > > callback, right? > > > > Worst case, arch_freq_get_on_cpu() will return a frequency based on the > > > > AMU counter values obtained on the last tick on that CPU. But if that CPU > > > > is not a housekeeping CPU, a housekeeping CPU in the same policy will be > > > > selected, as it would have had a more recent tick, and therefore a more > > > > recent frequency value for the domain. > > > But this frequency is from the last tick, > > > this last tick is probably a long time ago and it doesn't update > > > 'arch_freq_scale' for some reasons like CPU dile. > > > In addition, I'm not sure if there is possible that amu_scale_freq_tick() is > > > executed delayed under high stress case. > > > It also have an impact on the accuracy of the cpu frequency we query. > > > > I understand that the frequency returned here will not be up to date, > > > > but there's no proper frequency feedback for an idle CPU. 
If one only > > > > wakes up a CPU to sample counters, before the CPU goes back to sleep, > > > > the obtained frequency feedback is meaningless. > > > > > > > > > > For systems with 128 cores or more, this could be very expensive and > > > > > > happen > > > > > > very frequently. > > > > > > > > > > > > AFAICS, the approach in [1] would avoid this cost. > > > > > But the CPU frequency is just an average value for the last tick period > > > > > instead of the current one the CPU actually runs at. > > > > > In addition, there are some conditions to use 'arch_freq_scale' in this > > > > > approach. > > > > What are the conditions you are referring to? > > > It depends on the housekeeping CPUs. > > > > > So I'm not sure if this approach can entirely cover the frequency > > > > > discrepancy issue. > > > > Unfortunately there is no perfect frequency feedback. By the time you > > > > observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency > > > > of the CPU might have already changed. Therefore, an average value might > > > > be a better indication of the recent performance level of a CPU. > > > An average value for CPU frequency is ok. It may be better if it has not any > > > delaying. > > > > > > The original implementation for cpuinfo_cur_freq can more reflect their > > > meaning in the user-guide [1]. The user-guide said: > > > "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the > > > hardware, in KHz. > > > This is the frequency the CPU actually runs at." > > > > > > > > > [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt > > > > > > > Would you be able to test [1] on your platform and usecase? > > > I has tested it on my platform (CPU number: 64, SMT: off and CPU base > > > frequency: 2.7GHz). > > > Accoding to the testing result, > > > 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. > > > They still have to face the large frequency discrepancy issue my patch > > > mentioned. 
> > > 2> Additionally, the frequency value of all CPUs are almost the same by > > > using the 'arch_freq_scale' factor way. I'm not sure if it is ok. > > > > > > The patch [1] has been modified silightly as below: > > > --> > > > @@ -1756,7 +1756,10 @@ static unsigned int > > > cpufreq_verify_current_freq(struct cpufreq_policy *policy, b > > > { > > > unsigned int new_freq; > > > > > > - new_freq = cpufreq_driver->get(policy->cpu); > > > + new_freq = arch_freq_get_on_cpu(policy->cpu); > > > + if (!new_freq) > > > + new_freq = cpufreq_driver->get(policy->cpu); > > > + > > As pointed out this change will not make it to the next version of the patch. > > So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will > > only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq > > > if (!new_freq) > > > return 0; > > > > > > And the result is as follows: > > > *case 1:**No setting the nohz_full and cpufreq use performance governor* > > > *--> Step1: *read 'cpuinfo_cur_freq' in no pressure > > > 0: 2699264 2: 2699264 4: 2699264 6: 2699264 > > > 8: 2696628 10: 2696628 12: 2696628 14: 2699264 > > > 16: 2699264 18: 2696628 20: 2699264 22: 2696628 > > > 24: 2699264 26: 2696628 28: 2699264 30: 2696628 > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > > > 40: 2699264 42: 2699264 44: 2696628 46: 2696628 > > > 48: 2696628 50: 2699264 52: 2699264 54: 2696628 > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > > > 64: 2696628 66: 2699264 68: 2696628 70: 2696628 > > > 72: 2699264 74: 2696628 76: 2696628 78: 2699264 > > > 80: 2696628 82: 2696628 84: 2699264 86: 2696628 > > > 88: 2696628 90: 2696628 92: 2696628 94: 2699264 > > > 96: 2696628 98: 2699264 100: 2699264 102: 2696628 > > > 104: 2699264 106: 2699264 108: 2699264 110: 2696628 > > > 112: 2699264 114: 2699264 116: 2699264 118: 2699264 > > > 120: 2696628 122: 2699264 124: 2696628 126: 2699264 > > > Note: the frequency of all CPUs are almost the same. > > Were you expecting smth else ? 
> The frequency of each CPU might have a different value.
> All value of all CPUs is the same under high pressure.
> I don't know what the phenomenon is on other platform.
> Do you know who else tested it?

So I might have rushed a bit with my previous comment/question: apologies for
that. The numbers above: those are on a fairly idle/lightly loaded system,
right? Would you mind having another go with just the arch_freq_get_on_cpu
implementation being added, dropping the changes in cpufreq, and then
reading 'scaling_cur_freq' several times at some interval?
The change has been tested on the RD-N2 model (Neoverse N2 reference platform);
it has also been discussed here [1].

> > > *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure.
> > >   0: 2696628     2: 2696628     4: 2696628     6: 2696628
> > >   8: 2696628    10: 2696628    12: 2696628    14: 2696628
> > >  16: 2696628    18: 2696628    20: 2696628    22: 2696628
> > >  24: 2696628    26: 2696628    28: 2696628    30: 2696628
> > >  32: 2696628    34: 2696628    36: 2696628    38: 2696628
> > >  40: 2696628    42: 2696628    44: 2696628    46: 2696628
> > >  48: 2696628    50: 2696628    52: 2696628    54: 2696628
> > >  56: 2696628    58: 2696628    60: 2696628    62: 2696628
> > >  64: 2696628    66: 2696628    68: 2696628    70: 2696628
> > >  72: 2696628    74: 2696628    76: 2696628    78: 2696628
> > >  80: 2696628    82: 2696628    84: 2696628    86: 2696628
> > >  88: 2696628    90: 2696628    92: 2696628    94: 2696628
> > >  96: 2696628    98: 2696628   100: 2696628   102: 2696628
> > > 104: 2696628   106: 2696628   108: 2696628   110: 2696628
> > > 112: 2696628   114: 2696628   116: 2696628   118: 2696628
> > > 120: 2696628   122: 2696628   124: 2696628   126: 2696628
> > >
> > > *Case 2: setting nohz_full and cpufreq use ondemand governor*
> > > There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in
> > > /proc/cmdline.
> > Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to
> > set that one.
> > Now, afair, isolcpus will make the selected CPUs to disappear from the > > schedulers view (no balancing, no migrating), so unless you affine smth > > explicitly to those CPUs, you will not see much of an activity there. > Correct. > > Need to double check though as it has been a while ... > > > *--> Step 1: *setting ondemand governor to all policy and query > > > 'cpuinfo_cur_freq' in no pressure case. > > > And the frequency of CPUs all are about 400MHz. > > > *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. > > > The high memory access pressure is from the command: "stress-ng -c 64 > > > --cpu-load 100% --taskset 0-63" > > I'm not entirely convinced that this will affine to isolated cpus, especially > > that the affinity mask spans all available cpus. If that is the case, no wonder > > your isolated cpus are getting wasted being idle. But I would have to double > > check how this is being handled. > > > The result: > > > 0: 2696628 1: 400000 2: 400000 3: 400909 > > > 4: 400000 5: 400000 6: 400000 7: 400000 > > > 8: 400000 9: 400000 10: 400600 11: 2696628 > > > 12: 2696628 13: 2696628 14: 2696628 15: 2696628 > > > 16: 2696628 17: 2696628 18: 2696628 19: 2696628 > > > 20: 2696628 21: 2696628 22: 2696628 23: 2696628 > > > 24: 2696628 25: 2696628 26: 2696628 27: 2696628 > > > 28: 2696628 29: 2696628 30: 2696628 31: 2696628 > > > 32: 2696628 33: 2696628 34: 2696628 35: 2696628 > > > 36: 2696628 37: 2696628 38: 2696628 39: 2696628 > > > 40: 2696628 41: 400000 42: 400000 43: 400000 > > > 44: 400000 45: 398847 46: 400000 47: 400000 > > > 48: 400000 49: 400000 50: 400000 51: 2696628 > > > 52: 2696628 53: 2696628 54: 2696628 55: 2696628 > > > 56: 2696628 57: 2696628 58: 2696628 59: 2696628 > > > 60: 2696628 61: 2696628 62: 2696628 63: 2699264 > > > > > > Note: > > > (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. > > > It turned out that nohz full was already work. 
> > > I guess that stress-ng cannot use the CPU in the range of nohz full.
> > > Because the CPU frequency will be increased to 2.7G by binding CPU to
> > > other application.
> > > (2) The frequency of the nohz full core is calculated by get() callback
> > > according to ftrace.
> > It is as there is no sched tick on those, and apparently there is nothing
> > running on them either.
> Yes.
> If we select your approach and the above phenomenon is normal,
> the large frequency discrepancy issue can be resolved for CPUs with sched
> tick by the way.
> But the nohz full cores still have to face this issue. So this patch is also
> needed.
>

Yes, nohz_full cores have to be handled by the cpufreq driver.

---
[1] https://lore.kernel.org/lkml/ZIHpd6unkOtYVEqP@e120325.cambridge.arm.com/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c
---
BR
Beata

> BR
> /huisong
> >
> > Unless I am missing smth.
> >
> > ---
> > BR
> > Beata
> >
> > > [1] https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/
> > > [2] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/
> > > > Many thanks,
> > > > Ionela.
> > > >
> > > > > /Huisong
> > > > >
> > > > > > > > [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Ionela.
> > > > > > > >
> > > > > > > > > This patch adds a interface, cpc_read_arch_counters_on_cpu, to read
> > > > > > > > > delivered and reference performance counter together. According to my
> > > > > > > > > test[4], the discrepancy of cpu current frequency in the scenarios with
> > > > > > > > > high memory access pressure is lower than 0.2% by stress-ng
> > > > > > > > > application.
> > > > > > > > > > > > > > > > > > [1] https://lore.kernel.org/all/20231025093847.3740104-4-zengheng4@huawei.com/ > > > > > > > > > [2] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/ > > > > > > > > > [3] > > > > > > > > > https://lore.kernel.org/all/20230418113459.12860-7-sumitg@nvidia.com/ > > > > > > > > > > > > > > > > > > [4] My local test: > > > > > > > > > The testing platform enable SMT and include 128 logical CPU in total, > > > > > > > > > and CPU base frequency is 2.7GHz. Reading "cpuinfo_cur_freq" for each > > > > > > > > > physical core on platform during the high memory access pressure from > > > > > > > > > stress-ng, and the output is as follows: > > > > > > > > > 0: 2699133 2: 2699942 4: 2698189 6: 2704347 > > > > > > > > > 8: 2704009 10: 2696277 12: 2702016 14: 2701388 > > > > > > > > > 16: 2700358 18: 2696741 20: 2700091 22: 2700122 > > > > > > > > > 24: 2701713 26: 2702025 28: 2699816 30: 2700121 > > > > > > > > > 32: 2700000 34: 2699788 36: 2698884 38: 2699109 > > > > > > > > > 40: 2704494 42: 2698350 44: 2699997 46: 2701023 > > > > > > > > > 48: 2703448 50: 2699501 52: 2700000 54: 2699999 > > > > > > > > > 56: 2702645 58: 2696923 60: 2697718 62: 2700547 > > > > > > > > > 64: 2700313 66: 2700000 68: 2699904 70: 2699259 > > > > > > > > > 72: 2699511 74: 2700644 76: 2702201 78: 2700000 > > > > > > > > > 80: 2700776 82: 2700364 84: 2702674 86: 2700255 > > > > > > > > > 88: 2699886 90: 2700359 92: 2699662 94: 2696188 > > > > > > > > > 96: 2705454 98: 2699260 100: 2701097 102: 2699630 > > > > > > > > > 104: 2700463 106: 2698408 108: 2697766 110: 2701181 > > > > > > > > > 112: 2699166 114: 2701804 116: 2701907 118: 2701973 > > > > > > > > > 120: 2699584 122: 2700474 124: 2700768 126: 2701963 > > > > > > > > > > > > > > > > > > Signed-off-by: Huisong Li <lihuisong@huawei.com> > > > > > > > > > --- > > > [snip] > > > > . > > .
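A minimal sketch of the sampling problem described in the patch description quoted above: the frequency estimate is the ratio of the delivered-counter delta to the reference-counter delta, so it is only meaningful when both counters are sampled over the same interval. The counter rates, the 2 us window and the helper names (`sample`, `estimate_khz`) are assumptions for illustration, not values or code taken from the patch.

```python
CORE_KHZ = 2_700_000  # assumed true core clock (2.7 GHz)
REF_KHZ = 100_000     # assumed constant rate of the reference counter


def sample(t_us, ref_lag_us):
    """Model one (delivered, reference) read.

    The delivered counter is read at t_us; the reference counter is read
    ref_lag_us later, which models extra latency on that read.
    """
    delivered = CORE_KHZ * t_us // 1000                # kHz * us / 1000 = cycles
    reference = REF_KHZ * (t_us + ref_lag_us) // 1000
    return delivered, reference


def estimate_khz(first, second):
    """Estimate = reference rate * delta(delivered) / delta(reference)."""
    delta_delivered = second[0] - first[0]
    delta_reference = second[1] - first[1]
    if delta_reference <= 0:
        return 0
    return REF_KHZ * delta_delivered // delta_reference


# Both reads see the same lag: the two intervals match and the estimate is exact.
same_lag = estimate_khz(sample(1000, 1), sample(1002, 1))
# The first reference read is 1 us late and the second is not: the reference
# interval shrinks from 2 us to 1 us and the estimate doubles.
skewed = estimate_khz(sample(1000, 1), sample(1002, 0))
```

With these assumed numbers `same_lag` stays at the true 2700000 kHz, while `skewed` comes out at 5400000 kHz, well above the platform maximum. A constant read latency cancels out; only a difference in latency between the two samples distorts the result.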
On 2024/2/2 16:08, Beata Michalska wrote:
> On Wed, Jan 17, 2024 at 05:18:40PM +0800, lihuisong (C) wrote:
>
> Hi ,
>
> Again, apologies for delay,
>
>> Hi,
>>
>> On 2024/1/16 22:10, Beata Michalska wrote:
>>> Hi,
>>>
>>> Apologies for jumping in so late....
>>>
>>> On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>>>> Hi Ionela,
>>>>
>>>> On 2024/1/8 22:03, Ionela Voinescu wrote:
>>>>> Hi,
>>>>>
>>>>> On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>>>> Hi Vanshi,
>>>>>>
>>>>>> On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>>>> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>>>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>>>> Many developers found that the cpu current frequency is greater than
>>>>>>>>>> the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>>>>>
>>>>>>>>>> In the scenarios with high memory access pressure, the patch [1] has
>>>>>>>>>> proved the significant latency of cpc_read() which is used to obtain
>>>>>>>>>> delivered and reference performance counter cause an absurd frequency.
>>>>>>>>>> The sampling interval for this counters is very critical and is expected
>>>>>>>>>> to be equal. However, the different latency of cpc_read() has a direct
>>>>>>>>>> impact on their sampling interval.
>>>>>>>>>>
>>>>>>>>> Would this [1] alternative solution work for you?
>>>>>>>> It would work for me AFAICS.
>>>>>>>> Because the "arch_freq_scale" is also from AMU core and constant
>>>>>>>> counter, and read together.
>>>>>>>> But, from their discuss line, it seems that there are some tricky
>>>>>>>> points to clarify or consider.
>>>>>>> I think the changes in [1] would work better when CPUs may be idle. With this
>>>>>>> patch we would have to wake any core that is in idle state to read the AMU
>>>>>>> counters.
Worst case, if core 0 is trying to read the CPU frequency of >>>>>>> all >>>>>>> cores, it may need to wake up all the other cores to read the AMU >>>>>>> counters. >>>>>> From the approach in [1], if all CPUs (one or more cores) under one policy >>>>>> are idle, they still cannot be obtained the CPU frequency, right? >>>>>> In this case, the [1] API will return 0 and have to back to call >>>>>> cpufreq_driver->get() for cpuinfo_cur_freq. >>>>>> Then we still need to face the issue this patch mentioned. >>>>> With the implementation at [1], arch_freq_get_on_cpu() will not return 0 >>>>> for idle CPUs and the get() callback will not be called to wake up the >>>>> CPUs. >>>> Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >>>> However, for no-housekeeping CPUs, it will return 0 and have to call get() >>>> callback, right? >>>>> Worst case, arch_freq_get_on_cpu() will return a frequency based on the >>>>> AMU counter values obtained on the last tick on that CPU. But if that CPU >>>>> is not a housekeeping CPU, a housekeeping CPU in the same policy will be >>>>> selected, as it would have had a more recent tick, and therefore a more >>>>> recent frequency value for the domain. >>>> But this frequency is from the last tick, >>>> this last tick is probably a long time ago and it doesn't update >>>> 'arch_freq_scale' for some reasons like CPU dile. >>>> In addition, I'm not sure if there is possible that amu_scale_freq_tick() is >>>> executed delayed under high stress case. >>>> It also have an impact on the accuracy of the cpu frequency we query. >>>>> I understand that the frequency returned here will not be up to date, >>>>> but there's no proper frequency feedback for an idle CPU. If one only >>>>> wakes up a CPU to sample counters, before the CPU goes back to sleep, >>>>> the obtained frequency feedback is meaningless. >>>>> >>>>>>> For systems with 128 cores or more, this could be very expensive and >>>>>>> happen >>>>>>> very frequently. 
>>>>>>> >>>>>>> AFAICS, the approach in [1] would avoid this cost. >>>>>> But the CPU frequency is just an average value for the last tick period >>>>>> instead of the current one the CPU actually runs at. >>>>>> In addition, there are some conditions to use 'arch_freq_scale' in this >>>>>> approach. >>>>> What are the conditions you are referring to? >>>> It depends on the housekeeping CPUs. >>>>>> So I'm not sure if this approach can entirely cover the frequency >>>>>> discrepancy issue. >>>>> Unfortunately there is no perfect frequency feedback. By the time you >>>>> observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency >>>>> of the CPU might have already changed. Therefore, an average value might >>>>> be a better indication of the recent performance level of a CPU. >>>> An average value for CPU frequency is ok. It may be better if it has not any >>>> delaying. >>>> >>>> The original implementation for cpuinfo_cur_freq can more reflect their >>>> meaning in the user-guide [1]. The user-guide said: >>>> "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the >>>> hardware, in KHz. >>>> This is the frequency the CPU actually runs at." >>>> >>>> >>>> [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt >>>> >>>>> Would you be able to test [1] on your platform and usecase? >>>> I has tested it on my platform (CPU number: 64, SMT: off and CPU base >>>> frequency: 2.7GHz). >>>> Accoding to the testing result, >>>> 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. >>>> They still have to face the large frequency discrepancy issue my patch >>>> mentioned. >>>> 2> Additionally, the frequency value of all CPUs are almost the same by >>>> using the 'arch_freq_scale' factor way. I'm not sure if it is ok. 
>>>> >>>> The patch [1] has been modified silightly as below: >>>> --> >>>> @@ -1756,7 +1756,10 @@ static unsigned int >>>> cpufreq_verify_current_freq(struct cpufreq_policy *policy, b >>>> { >>>> unsigned int new_freq; >>>> >>>> - new_freq = cpufreq_driver->get(policy->cpu); >>>> + new_freq = arch_freq_get_on_cpu(policy->cpu); >>>> + if (!new_freq) >>>> + new_freq = cpufreq_driver->get(policy->cpu); >>>> + >>> As pointed out this change will not make it to the next version of the patch. >>> So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will >>> only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq >>>> if (!new_freq) >>>> return 0; >>>> >>>> And the result is as follows: >>>> *case 1:**No setting the nohz_full and cpufreq use performance governor* >>>> *--> Step1: *read 'cpuinfo_cur_freq' in no pressure >>>> 0: 2699264 2: 2699264 4: 2699264 6: 2699264 >>>> 8: 2696628 10: 2696628 12: 2696628 14: 2699264 >>>> 16: 2699264 18: 2696628 20: 2699264 22: 2696628 >>>> 24: 2699264 26: 2696628 28: 2699264 30: 2696628 >>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>> 40: 2699264 42: 2699264 44: 2696628 46: 2696628 >>>> 48: 2696628 50: 2699264 52: 2699264 54: 2696628 >>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>> 64: 2696628 66: 2699264 68: 2696628 70: 2696628 >>>> 72: 2699264 74: 2696628 76: 2696628 78: 2699264 >>>> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >>>> 88: 2696628 90: 2696628 92: 2696628 94: 2699264 >>>> 96: 2696628 98: 2699264 100: 2699264 102: 2696628 >>>> 104: 2699264 106: 2699264 108: 2699264 110: 2696628 >>>> 112: 2699264 114: 2699264 116: 2699264 118: 2699264 >>>> 120: 2696628 122: 2699264 124: 2696628 126: 2699264 >>>> Note: the frequency of all CPUs are almost the same. >>> Were you expecting smth else ? >> The frequency of each CPU might have a different value. >> All value of all CPUs is the same under high pressure. >> I don't know what the phenomenon is on other platform. 
>> Do you know who else tested it?
> So I might have rushed a bit with my previous comment/question: apologies for
> that.
> The numbers above: those are on a fairly idle/lightly loaded system right?
Yes.
> Would you mind having another go with just the arch_freq_get_on_cpu
> implementation beign added and dropping the changes in the cpufreq and
All my tests were done with the cpufreq "performance" governor while the OS was
not under high load, reading "scaling_cur_freq" or "cpuinfo_cur_freq" for each
physical core on the platform.

The test results for "cpuinfo_cur_freq" with your changes, on both a fairly idle
and a highly loaded system, can also be found in this thread.

*A: the result with your changes*
--> Reading "scaling_cur_freq"
  0: 2688720     2: 2696628     4: 2699264     6: 2696628
  8: 2699264    10: 2696628    12: 2699264    14: 2699264
 16: 2699264    18: 2696628    20: 2696628    22: 2696628
 24: 2699264    26: 2696628    28: 2696628    30: 2696628
 32: 2699264    34: 2691356    36: 2696628    38: 2699264
 40: 2699264    42: 2696628    44: 2696628    46: 2699264
 48: 2699264    50: 2696628    52: 2696628    54: 2696628
 56: 2696628    58: 2699264    60: 2691356    62: 2696628
 64: 2696628    66: 2696628    68: 2696628    70: 2696628
 72: 2696628    74: 2696628    76: 2699264    78: 2696628
 80: 2696628    82: 2696628    84: 2699264    86: 2696628
 88: 2625456    90: 2696628    92: 2699264    94: 2696628
 96: 2696628    98: 2696628   100: 2699264   102: 2699264
104: 2699264   106: 2696628   108: 2699264   110: 2696628
112: 2699264   114: 2699264   116: 2696628   118: 2696628
120: 2696628   122: 2699264   124: 2696628   126: 2696628
--> Reading "cpuinfo_cur_freq"
  0: 2696628     2: 2696628     4: 2699264     6: 2688720
  8: 2699264    10: 2700000    12: 2696628    14: 2698322
 16: 2699264    18: 2699264    20: 2696628    22: 2699264
 24: 2699264    26: 2699264    28: 2699264    30: 2699264
 32: 2699264    34: 2693992    36: 2696628    38: 2696628
 40: 2699264    42: 2699264    44: 2699264    46: 2696628
 48: 2696628    50: 2699264    52: 2696628    54: 2696628
 56: 2699264    58: 2699264    60: 2696628    62: 2699264
 64: 2696628    66: 2699264    68: 2696628    70: 2699264
 72: 2696628    74: 2696628    76: 2696628    78: 2693992
 80: 2696628    82: 2696628    84: 2696628    86: 2696628
 88: 2696628    90: 2699264    92: 2696628    94: 2699264
 96: 2699264    98: 2696628   100: 2699264   102: 2699264
104: 2691356   106: 2699264   108: 2699264   110: 2699264
112: 2699264   114: 2696628   116: 2699264   118: 2699264
120: 2696628   122: 2696628   124: 2696628   126: 2696628

*B: the result without your changes*
--> Reading "scaling_cur_freq"
  0: 2698245     2: 2706690     4: 2699649     6: 2702105
  8: 2704362    10: 2697993    12: 2701672    14: 2704362
 16: 2701052    18: 2701052    20: 2694385    22: 2699650
 24: 2706802    26: 2702389    28: 2698299    30: 2698299
 32: 2697333    34: 2697993    36: 2701337    38: 2699328
 40: 2700330    42: 2700330    44: 2698019    46: 2697697
 48: 2699659    50: 2701700    52: 2703401    54: 2701700
 56: 2704013    58: 2697658    60: 2695000    62: 2697666
 64: 2697902    66: 2701052    68: 2698245    70: 2695789
 72: 2701315    74: 2696655    76: 2693666    78: 2695317
 80: 2704912    82: 2699649    84: 2698245    86: 2695454
 88: 2697966    90: 2697959    92: 2699319    94: 2700680
 96: 2695317    98: 2698996   100: 2700000   102: 2700334
104: 2701320   106: 2695065   108: 2700986   110: 2703960
112: 2697635   114: 2704421   116: 2700680   118: 2702040
120: 2700334   122: 2697993   124: 2700334   126: 2705351
--> Reading "cpuinfo_cur_freq"
  0: 2696853     2: 2695454     4: 2699649     6: 2706993
  8: 2706060    10: 2704362    12: 2704362    14: 2697658
 16: 2707719    18: 2697192    20: 2702456    22: 2699650
 24: 2705782    26: 2698299    28: 2703061    30: 2705802
 32: 2700000    34: 2700671    36: 2701337    38: 2697658
 40: 2700330    42: 2700330    44: 2699672    46: 2697697
 48: 2703061    50: 2696610    52: 2692542    54: 2704406
 56: 2695317    58: 2699331    60: 2698996    62: 2702675
 64: 2704912    66: 2703859    68: 2699649    70: 2698596
 72: 2703908    74: 2703355    76: 2697658    78: 2695317
 80: 2702105    82: 2707719    84: 2702105    86: 2699649
 88: 2697966    90: 2691525    92: 2701700    94: 2700680
 96: 2695317    98: 2698996   100: 2698666   102: 2700334
104: 2690429   106: 2707590   108: 2700986   110: 2701320
112: 2696283   114: 2692881   116: 2697627   118: 2704421
120: 2698996   122: 2696321   124: 2696655   126: 2695000

> then read 'scaling_cur_freq', doing several reads in some intervals ?
It seems that the above phenomenon has little to do with the reading interval.
> The change has been tested on RD-N2 model (Neoverse N2 ref platform),
> it has also been discussed here [1]
I couldn't find the test results for this platform in that thread.
>>>> *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure.
>>>>   0: 2696628     2: 2696628     4: 2696628     6: 2696628
>>>>   8: 2696628    10: 2696628    12: 2696628    14: 2696628
>>>>  16: 2696628    18: 2696628    20: 2696628    22: 2696628
>>>>  24: 2696628    26: 2696628    28: 2696628    30: 2696628
>>>>  32: 2696628    34: 2696628    36: 2696628    38: 2696628
>>>>  40: 2696628    42: 2696628    44: 2696628    46: 2696628
>>>>  48: 2696628    50: 2696628    52: 2696628    54: 2696628
>>>>  56: 2696628    58: 2696628    60: 2696628    62: 2696628
>>>>  64: 2696628    66: 2696628    68: 2696628    70: 2696628
>>>>  72: 2696628    74: 2696628    76: 2696628    78: 2696628
>>>>  80: 2696628    82: 2696628    84: 2696628    86: 2696628
>>>>  88: 2696628    90: 2696628    92: 2696628    94: 2696628
>>>>  96: 2696628    98: 2696628   100: 2696628   102: 2696628
>>>> 104: 2696628   106: 2696628   108: 2696628   110: 2696628
>>>> 112: 2696628   114: 2696628   116: 2696628   118: 2696628
>>>> 120: 2696628   122: 2696628   124: 2696628   126: 2696628
>>>>
>>>> *Case 2: setting nohz_full and cpufreq use ondemand governor*
>>>> There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in
>>>> /proc/cmdline.
>>> Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to
>>> set that one.
>>> Now, afair, isolcpus will make the selected CPUs to disappear from the
>>> schedulers view (no balancing, no migrating), so unless you affine smth
>>> explicitly to those CPUs, you will not see much of an activity there.
>> Correct.
>>> Need to double check though as it has been a while ...
>>>> *--> Step 1: *setting ondemand governor to all policy and query
>>>> 'cpuinfo_cur_freq' in no pressure case.
>>>> And the frequency of CPUs all are about 400MHz.
>>>> *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. >>>> The high memory access pressure is from the command: "stress-ng -c 64 >>>> --cpu-load 100% --taskset 0-63" >>> I'm not entirely convinced that this will affine to isolated cpus, especially >>> that the affinity mask spans all available cpus. If that is the case, no wonder >>> your isolated cpus are getting wasted being idle. But I would have to double >>> check how this is being handled. >>>> The result: >>>> 0: 2696628 1: 400000 2: 400000 3: 400909 >>>> 4: 400000 5: 400000 6: 400000 7: 400000 >>>> 8: 400000 9: 400000 10: 400600 11: 2696628 >>>> 12: 2696628 13: 2696628 14: 2696628 15: 2696628 >>>> 16: 2696628 17: 2696628 18: 2696628 19: 2696628 >>>> 20: 2696628 21: 2696628 22: 2696628 23: 2696628 >>>> 24: 2696628 25: 2696628 26: 2696628 27: 2696628 >>>> 28: 2696628 29: 2696628 30: 2696628 31: 2696628 >>>> 32: 2696628 33: 2696628 34: 2696628 35: 2696628 >>>> 36: 2696628 37: 2696628 38: 2696628 39: 2696628 >>>> 40: 2696628 41: 400000 42: 400000 43: 400000 >>>> 44: 400000 45: 398847 46: 400000 47: 400000 >>>> 48: 400000 49: 400000 50: 400000 51: 2696628 >>>> 52: 2696628 53: 2696628 54: 2696628 55: 2696628 >>>> 56: 2696628 57: 2696628 58: 2696628 59: 2696628 >>>> 60: 2696628 61: 2696628 62: 2696628 63: 2699264 >>>> >>>> Note: >>>> (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. >>>> It turned out that nohz full was already work. >>>> I guess that stress-ng cannot use the CPU in the range of nohz full. >>>> Because the CPU frequency will be increased to 2.7G by binding CPU to >>>> other application. >>>> (2) The frequency of the nohz full core is calculated by get() callback >>>> according to ftrace. >>> It is as there is no sched tick on those, and apparently there is nothing >>> running on them either. >> Yes. 
>> If we select your approach and the above phenomenon is normal, >> the large frequency discrepancy issue can be resolved for CPUs with sched >> tick by the way. >> But the nohz full cores still have to face this issue. So this patch is also >> needed. >> > Yes, nohz cores full have to be handled by the cpufreq driver. Correct. So we still have to face the issue in this patch and push this patch. Beata, would you please review this patch? /Huisong > > --- > [1] https://lore.kernel.org/lkml/ZIHpd6unkOtYVEqP@e120325.cambridge.arm.com/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c > --- > BR > Beata >> BR >> /huisong >>> Unless I am missing smth. >>> >>> --- >>> BR >>> Beata >>> >>>> [1] https://lore.kernel.org/lkml/20230418113459.12860-7-sumitg@nvidia.com/ >>>> [2] https://lore.kernel.org/lkml/20231127160838.1403404-3-beata.michalska@arm.com/ >>>>> Many thanks, >>>>> Ionela. >>>>> >>>>>> /Huisong >>>>>> >>>>>>>>> [1] https://lore.kernel.org/lkml/20231127160838.1403404-1-beata.michalska@arm.com/ >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Ionela. >>>>>>>>> >>>>>>>>>> This patch adds a interface, cpc_read_arch_counters_on_cpu, to read >>>>>>>>>> delivered and reference performance counter together. According to my >>>>>>>>>> test[4], the discrepancy of cpu current frequency in the >>>>>>>>>> scenarios with >>>>>>>>>> high memory access pressure is lower than 0.2% by stress-ng >>>>>>>>>> application. >>>>>>>>>> >>>>>>>>>> [1] https://lore.kernel.org/all/20231025093847.3740104-4-zengheng4@huawei.com/ >>>>>>>>>> [2] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/ >>>>>>>>>> [3] >>>>>>>>>> https://lore.kernel.org/all/20230418113459.12860-7-sumitg@nvidia.com/ >>>>>>>>>> >>>>>>>>>> [4] My local test: >>>>>>>>>> The testing platform enable SMT and include 128 logical CPU in total, >>>>>>>>>> and CPU base frequency is 2.7GHz. 
Reading "cpuinfo_cur_freq" for each >>>>>>>>>> physical core on platform during the high memory access pressure from >>>>>>>>>> stress-ng, and the output is as follows: >>>>>>>>>> 0: 2699133 2: 2699942 4: 2698189 6: 2704347 >>>>>>>>>> 8: 2704009 10: 2696277 12: 2702016 14: 2701388 >>>>>>>>>> 16: 2700358 18: 2696741 20: 2700091 22: 2700122 >>>>>>>>>> 24: 2701713 26: 2702025 28: 2699816 30: 2700121 >>>>>>>>>> 32: 2700000 34: 2699788 36: 2698884 38: 2699109 >>>>>>>>>> 40: 2704494 42: 2698350 44: 2699997 46: 2701023 >>>>>>>>>> 48: 2703448 50: 2699501 52: 2700000 54: 2699999 >>>>>>>>>> 56: 2702645 58: 2696923 60: 2697718 62: 2700547 >>>>>>>>>> 64: 2700313 66: 2700000 68: 2699904 70: 2699259 >>>>>>>>>> 72: 2699511 74: 2700644 76: 2702201 78: 2700000 >>>>>>>>>> 80: 2700776 82: 2700364 84: 2702674 86: 2700255 >>>>>>>>>> 88: 2699886 90: 2700359 92: 2699662 94: 2696188 >>>>>>>>>> 96: 2705454 98: 2699260 100: 2701097 102: 2699630 >>>>>>>>>> 104: 2700463 106: 2698408 108: 2697766 110: 2701181 >>>>>>>>>> 112: 2699166 114: 2701804 116: 2701907 118: 2701973 >>>>>>>>>> 120: 2699584 122: 2700474 124: 2700768 126: 2701963 >>>>>>>>>> >>>>>>>>>> Signed-off-by: Huisong Li <lihuisong@huawei.com> >>>>>>>>>> --- >>>> [snip] >>>>> . >>> . > .
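The "lower than 0.2%" figure from the patch description quoted above can be checked against any of the dumps in this thread with a few lines of script. This is only a sketch: `parse_dump` and `max_deviation_pct` are hypothetical helper names, the 2.7 GHz base frequency is the one stated in the patch description, and the two sample rows are copied from dumps in this thread.

```python
import re


def parse_dump(text):
    """Parse 'cpu: kHz' pairs in the layout used by the dumps in this thread."""
    return {int(cpu): int(khz) for cpu, khz in re.findall(r"(\d+):\s*(\d+)", text)}


def max_deviation_pct(freqs, base_khz):
    """Largest absolute deviation from base_khz, as a percentage of base_khz."""
    return max(abs(khz - base_khz) for khz in freqs.values()) * 100.0 / base_khz


BASE_KHZ = 2_700_000  # base frequency stated in the patch description

# First row of the dump in the patch description (test [4]).
patched = parse_dump("0: 2699133 2: 2699942 4: 2698189 6: 2704347")
# Row holding the lowest "scaling_cur_freq" value in result A of this message.
result_a = parse_dump("88: 2625456 90: 2696628 92: 2699264 94: 2696628")

patched_dev = max_deviation_pct(patched, BASE_KHZ)
result_a_dev = max_deviation_pct(result_a, BASE_KHZ)
```

Feeding a complete dump to `parse_dump` works the same way, since the regular expression ignores the quoting markers and line breaks around the pairs.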
On Tue, Feb 06, 2024 at 04:02:15PM +0800, lihuisong (C) wrote:
>
> On 2024/2/2 16:08, Beata Michalska wrote:
> > On Wed, Jan 17, 2024 at 05:18:40PM +0800, lihuisong (C) wrote:
> >
> > Hi ,
> >
> > Again, apologies for delay,
> >
> > > Hi,
> > >
> > > On 2024/1/16 22:10, Beata Michalska wrote:
> > > > Hi,
> > > >
> > > > Apologies for jumping in so late....
> > > >
> > > > On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
> > > > > Hi Ionela,
> > > > >
> > > > > On 2024/1/8 22:03, Ionela Voinescu wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
> > > > > > > Hi Vanshi,
> > > > > > >
> > > > > > > On 2024/1/5 8:48, Vanshidhar Konda wrote:
> > > > > > > > On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
> > > > > > > > > On 2024/1/4 1:53, Ionela Voinescu wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
> > > > > > > > > > > Many developers found that the cpu current frequency is greater than
> > > > > > > > > > > the maximum frequency of the platform, please see [1], [2] and [3].
> > > > > > > > > > >
> > > > > > > > > > > In the scenarios with high memory access pressure, the patch [1] has
> > > > > > > > > > > proved the significant latency of cpc_read() which is used to obtain
> > > > > > > > > > > delivered and reference performance counter cause an absurd frequency.
> > > > > > > > > > > The sampling interval for this counters is very critical and is expected
> > > > > > > > > > > to be equal. However, the different latency of cpc_read() has a direct
> > > > > > > > > > > impact on their sampling interval.
> > > > > > > > > > >
> > > > > > > > > > Would this [1] alternative solution work for you?
> > > > > > > > > It would work for me AFAICS.
> > > > > > > > > Because the "arch_freq_scale" is also from AMU core and constant
> > > > > > > > > counter, and read together.
> > > > > > > > > But, from their discuss line, it seems that there are some tricky > > > > > > > > > points to clarify or consider. > > > > > > > > I think the changes in [1] would work better when CPUs may be idle. With > > > > > > > > this > > > > > > > > patch we would have to wake any core that is in idle state to read the > > > > > > > > AMU > > > > > > > > counters. Worst case, if core 0 is trying to read the CPU frequency of > > > > > > > > all > > > > > > > > cores, it may need to wake up all the other cores to read the AMU > > > > > > > > counters. > > > > > > > From the approach in [1], if all CPUs (one or more cores) under one policy > > > > > > > are idle, they still cannot be obtained the CPU frequency, right? > > > > > > > In this case, the [1] API will return 0 and have to back to call > > > > > > > cpufreq_driver->get() for cpuinfo_cur_freq. > > > > > > > Then we still need to face the issue this patch mentioned. > > > > > > With the implementation at [1], arch_freq_get_on_cpu() will not return 0 > > > > > > for idle CPUs and the get() callback will not be called to wake up the > > > > > > CPUs. > > > > > Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. > > > > > However, for no-housekeeping CPUs, it will return 0 and have to call get() > > > > > callback, right? > > > > > > Worst case, arch_freq_get_on_cpu() will return a frequency based on the > > > > > > AMU counter values obtained on the last tick on that CPU. But if that CPU > > > > > > is not a housekeeping CPU, a housekeeping CPU in the same policy will be > > > > > > selected, as it would have had a more recent tick, and therefore a more > > > > > > recent frequency value for the domain. > > > > > But this frequency is from the last tick, > > > > > this last tick is probably a long time ago and it doesn't update > > > > > 'arch_freq_scale' for some reasons like CPU dile. 
> > > > > In addition, I'm not sure if there is possible that amu_scale_freq_tick() is > > > > > executed delayed under high stress case. > > > > > It also have an impact on the accuracy of the cpu frequency we query. > > > > > > I understand that the frequency returned here will not be up to date, > > > > > > but there's no proper frequency feedback for an idle CPU. If one only > > > > > > wakes up a CPU to sample counters, before the CPU goes back to sleep, > > > > > > the obtained frequency feedback is meaningless. > > > > > > > > > > > > > > For systems with 128 cores or more, this could be very expensive and > > > > > > > > happen > > > > > > > > very frequently. > > > > > > > > > > > > > > > > AFAICS, the approach in [1] would avoid this cost. > > > > > > > But the CPU frequency is just an average value for the last tick period > > > > > > > instead of the current one the CPU actually runs at. > > > > > > > In addition, there are some conditions to use 'arch_freq_scale' in this > > > > > > > approach. > > > > > > What are the conditions you are referring to? > > > > > It depends on the housekeeping CPUs. > > > > > > > So I'm not sure if this approach can entirely cover the frequency > > > > > > > discrepancy issue. > > > > > > Unfortunately there is no perfect frequency feedback. By the time you > > > > > > observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency > > > > > > of the CPU might have already changed. Therefore, an average value might > > > > > > be a better indication of the recent performance level of a CPU. > > > > > An average value for CPU frequency is ok. It may be better if it has not any > > > > > delaying. > > > > > > > > > > The original implementation for cpuinfo_cur_freq can more reflect their > > > > > meaning in the user-guide [1]. The user-guide said: > > > > > "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the > > > > > hardware, in KHz. > > > > > This is the frequency the CPU actually runs at." 
> > > > > > > > > > > > > > > [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt > > > > > > > > > > > Would you be able to test [1] on your platform and usecase? > > > > > I has tested it on my platform (CPU number: 64, SMT: off and CPU base > > > > > frequency: 2.7GHz). > > > > > Accoding to the testing result, > > > > > 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. > > > > > They still have to face the large frequency discrepancy issue my patch > > > > > mentioned. > > > > > 2> Additionally, the frequency value of all CPUs are almost the same by > > > > > using the 'arch_freq_scale' factor way. I'm not sure if it is ok. > > > > > > > > > > The patch [1] has been modified silightly as below: > > > > > --> > > > > > @@ -1756,7 +1756,10 @@ static unsigned int > > > > > cpufreq_verify_current_freq(struct cpufreq_policy *policy, b > > > > > { > > > > > unsigned int new_freq; > > > > > > > > > > - new_freq = cpufreq_driver->get(policy->cpu); > > > > > + new_freq = arch_freq_get_on_cpu(policy->cpu); > > > > > + if (!new_freq) > > > > > + new_freq = cpufreq_driver->get(policy->cpu); > > > > > + > > > > As pointed out this change will not make it to the next version of the patch. 
> > > > So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will > > > > only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq > > > > > if (!new_freq) > > > > > return 0; > > > > > > > > > > And the result is as follows: > > > > > *case 1:**No setting the nohz_full and cpufreq use performance governor* > > > > > *--> Step1: *read 'cpuinfo_cur_freq' in no pressure > > > > > 0: 2699264 2: 2699264 4: 2699264 6: 2699264 > > > > > 8: 2696628 10: 2696628 12: 2696628 14: 2699264 > > > > > 16: 2699264 18: 2696628 20: 2699264 22: 2696628 > > > > > 24: 2699264 26: 2696628 28: 2699264 30: 2696628 > > > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > > > > > 40: 2699264 42: 2699264 44: 2696628 46: 2696628 > > > > > 48: 2696628 50: 2699264 52: 2699264 54: 2696628 > > > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > > > > > 64: 2696628 66: 2699264 68: 2696628 70: 2696628 > > > > > 72: 2699264 74: 2696628 76: 2696628 78: 2699264 > > > > > 80: 2696628 82: 2696628 84: 2699264 86: 2696628 > > > > > 88: 2696628 90: 2696628 92: 2696628 94: 2699264 > > > > > 96: 2696628 98: 2699264 100: 2699264 102: 2696628 > > > > > 104: 2699264 106: 2699264 108: 2699264 110: 2696628 > > > > > 112: 2699264 114: 2699264 116: 2699264 118: 2699264 > > > > > 120: 2696628 122: 2699264 124: 2696628 126: 2699264 > > > > > Note: the frequency of all CPUs are almost the same. > > > > Were you expecting smth else ? > > > The frequency of each CPU might have a different value. > > > All value of all CPUs is the same under high pressure. > > > I don't know what the phenomenon is on other platform. > > > Do you know who else tested it? > > So I might have rushed a bit with my previous comment/question: apologies for > > that. > > The numbers above: those are on a fairly idle/lightly loaded system right? > Yes. 
> > Would you mind having another go with just the arch_freq_get_on_cpu > > implementation beign added and dropping the changes in the cpufreq and > All my tests are done when cpufreq policy is "performance" and OS isn't on a > high load. > Reading "scaling_cur_freq" or "scaling_cur_freq" for each physical core on > platform > > The testing result for "cpuinfo_cur_freq" with your changes on a fairly idle > and high loaded system can also be found in this thread. > *A: the result with your changes* > --> Reading "scaling_cur_freq" > 0: 2688720 2: 2696628 4: 2699264 6: 2696628 > 8: 2699264 10: 2696628 12: 2699264 14: 2699264 > 16: 2699264 18: 2696628 20: 2696628 22: 2696628 > 24: 2699264 26: 2696628 28: 2696628 30: 2696628 > 32: 2699264 34: 2691356 36: 2696628 38: 2699264 > 40: 2699264 42: 2696628 44: 2696628 46: 2699264 > 48: 2699264 50: 2696628 52: 2696628 54: 2696628 > 56: 2696628 58: 2699264 60: 2691356 62: 2696628 > 64: 2696628 66: 2696628 68: 2696628 70: 2696628 > 72: 2696628 74: 2696628 76: 2699264 78: 2696628 > 80: 2696628 82: 2696628 84: 2699264 86: 2696628 > 88: 2625456 90: 2696628 92: 2699264 94: 2696628 > 96: 2696628 98: 2696628 100: 2699264 102: 2699264 > 104: 2699264 106: 2696628 108: 2699264 110: 2696628 > 112: 2699264 114: 2699264 116: 2696628 118: 2696628 > 120: 2696628 122: 2699264 124: 2696628 126: 2696628 > -->Reading "cpuinfo_cur_freq" > 0: 2696628 2: 2696628 4: 2699264 6: 2688720 > 8: 2699264 10: 2700000 12: 2696628 14: 2698322 > 16: 2699264 18: 2699264 20: 2696628 22: 2699264 > 24: 2699264 26: 2699264 28: 2699264 30: 2699264 > 32: 2699264 34: 2693992 36: 2696628 38: 2696628 > 40: 2699264 42: 2699264 44: 2699264 46: 2696628 > 48: 2696628 50: 2699264 52: 2696628 54: 2696628 > 56: 2699264 58: 2699264 60: 2696628 62: 2699264 > 64: 2696628 66: 2699264 68: 2696628 70: 2699264 > 72: 2696628 74: 2696628 76: 2696628 78: 2693992 > 80: 2696628 82: 2696628 84: 2696628 86: 2696628 > 88: 2696628 90: 2699264 92: 2696628 94: 2699264 > 96: 2699264 98: 2696628 
100: 2699264 102: 2699264 > 104: 2691356 106: 2699264 108: 2699264 110: 2699264 > 112: 2699264 114: 2696628 116: 2699264 118: 2699264 > 120: 2696628 122: 2696628 124: 2696628 126: 2696628 > > *B: the result without your changes* > -->Reading "scaling_cur_freq" > 0: 2698245 2: 2706690 4: 2699649 6: 2702105 > 8: 2704362 10: 2697993 12: 2701672 14: 2704362 > 16: 2701052 18: 2701052 20: 2694385 22: 2699650 > 24: 2706802 26: 2702389 28: 2698299 30: 2698299 > 32: 2697333 34: 2697993 36: 2701337 38: 2699328 > 40: 2700330 42: 2700330 44: 2698019 46: 2697697 > 48: 2699659 50: 2701700 52: 2703401 54: 2701700 > 56: 2704013 58: 2697658 60: 2695000 62: 2697666 > 64: 2697902 66: 2701052 68: 2698245 70: 2695789 > 72: 2701315 74: 2696655 76: 2693666 78: 2695317 > 80: 2704912 82: 2699649 84: 2698245 86: 2695454 > 88: 2697966 90: 2697959 92: 2699319 94: 2700680 > 96: 2695317 98: 2698996 100: 2700000 102: 2700334 > 104: 2701320 106: 2695065 108: 2700986 110: 2703960 > 112: 2697635 114: 2704421 116: 2700680 118: 2702040 > 120: 2700334 122: 2697993 124: 2700334 126: 2705351 > -->Reading "cpuinfo_cur_freq" > 0: 2696853 2: 2695454 4: 2699649 6: 2706993 > 8: 2706060 10: 2704362 12: 2704362 14: 2697658 > 16: 2707719 18: 2697192 20: 2702456 22: 2699650 > 24: 2705782 26: 2698299 28: 2703061 30: 2705802 > 32: 2700000 34: 2700671 36: 2701337 38: 2697658 > 40: 2700330 42: 2700330 44: 2699672 46: 2697697 > 48: 2703061 50: 2696610 52: 2692542 54: 2704406 > 56: 2695317 58: 2699331 60: 2698996 62: 2702675 > 64: 2704912 66: 2703859 68: 2699649 70: 2698596 > 72: 2703908 74: 2703355 76: 2697658 78: 2695317 > 80: 2702105 82: 2707719 84: 2702105 86: 2699649 > 88: 2697966 90: 2691525 92: 2701700 94: 2700680 > 96: 2695317 98: 2698996 100: 2698666 102: 2700334 > 104: 2690429 106: 2707590 108: 2700986 110: 2701320 > 112: 2696283 114: 2692881 116: 2697627 118: 2704421 > 120: 2698996 122: 2696321 124: 2696655 126: 2695000 > So in both cases : whether you use arch_freq_get_on_cpu or not (so with and without 
the patch) you get roughly the same frequencies on all cores, or am I
missing something from the dump above?
And those reflect the max freq you provided earlier (?)
Note that arch_freq_get_on_cpu will return an average frequency for the
last tick, so even if your system is roughly idle with your performance
governor those numbers make sense (some/most of the cores might be idle but
you will see the last freq the core was running at before going idle).
I do not think there is agreement on what should be shown for an idle core
when querying its freq through sysfs. Showing the last known freq makes
sense, even more than waking up the core just to try to get one.

@Ionela: Please jump in if I got things wrong.

> > then read 'scaling_cur_freq', doing several reads in some intervals ?
> It seems that above phenomenon has not a lot to do with reading intervals.
> > The change has been tested on RD-N2 model (Neoverse N2 ref platform),
> > it has also been discussed here [1]
> I doesn't get the testing result on this platform in its thread.
It might be missing exact numbers but the conclusions should be here [1]

> > > > > *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure.
> > > > > 0: 2696628 2: 2696628 4: 2696628 6: 2696628 > > > > > 8: 2696628 10: 2696628 12: 2696628 14: 2696628 > > > > > 16: 2696628 18: 2696628 20: 2696628 22: 2696628 > > > > > 24: 2696628 26: 2696628 28: 2696628 30: 2696628 > > > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > > > > > 40: 2696628 42: 2696628 44: 2696628 46: 2696628 > > > > > 48: 2696628 50: 2696628 52: 2696628 54: 2696628 > > > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > > > > > 64: 2696628 66: 2696628 68: 2696628 70: 2696628 > > > > > 72: 2696628 74: 2696628 76: 2696628 78: 2696628 > > > > > 80: 2696628 82: 2696628 84: 2696628 86: 2696628 > > > > > 88: 2696628 90: 2696628 92: 2696628 94: 2696628 > > > > > 96: 2696628 98: 2696628 100: 2696628 102: 2696628 > > > > > 104: 2696628 106: 2696628 108: 2696628 110: 2696628 > > > > > 112: 2696628 114: 2696628 116: 2696628 118: 2696628 > > > > > 120: 2696628 122: 2696628 124: 2696628 126: 2696628 > > > > > > > > > > *Case 2: setting nohz_full and cpufreq use ondemand governor* > > > > > There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in > > > > > /proc/cmdline. > > > > Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to > > > > set that one. > > > > Now, afair, isolcpus will make the selected CPUs to disappear from the > > > > schedulers view (no balancing, no migrating), so unless you affine smth > > > > explicitly to those CPUs, you will not see much of an activity there. > > > Correct. > > > > Need to double check though as it has been a while ... > > > > > *--> Step 1: *setting ondemand governor to all policy and query > > > > > 'cpuinfo_cur_freq' in no pressure case. > > > > > And the frequency of CPUs all are about 400MHz. > > > > > *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. 
> > > > > The high memory access pressure is from the command: "stress-ng -c 64 > > > > > --cpu-load 100% --taskset 0-63" > > > > I'm not entirely convinced that this will affine to isolated cpus, especially > > > > that the affinity mask spans all available cpus. If that is the case, no wonder > > > > your isolated cpus are getting wasted being idle. But I would have to double > > > > check how this is being handled. > > > > > The result: > > > > > 0: 2696628 1: 400000 2: 400000 3: 400909 > > > > > 4: 400000 5: 400000 6: 400000 7: 400000 > > > > > 8: 400000 9: 400000 10: 400600 11: 2696628 > > > > > 12: 2696628 13: 2696628 14: 2696628 15: 2696628 > > > > > 16: 2696628 17: 2696628 18: 2696628 19: 2696628 > > > > > 20: 2696628 21: 2696628 22: 2696628 23: 2696628 > > > > > 24: 2696628 25: 2696628 26: 2696628 27: 2696628 > > > > > 28: 2696628 29: 2696628 30: 2696628 31: 2696628 > > > > > 32: 2696628 33: 2696628 34: 2696628 35: 2696628 > > > > > 36: 2696628 37: 2696628 38: 2696628 39: 2696628 > > > > > 40: 2696628 41: 400000 42: 400000 43: 400000 > > > > > 44: 400000 45: 398847 46: 400000 47: 400000 > > > > > 48: 400000 49: 400000 50: 400000 51: 2696628 > > > > > 52: 2696628 53: 2696628 54: 2696628 55: 2696628 > > > > > 56: 2696628 57: 2696628 58: 2696628 59: 2696628 > > > > > 60: 2696628 61: 2696628 62: 2696628 63: 2699264 > > > > > > > > > > Note: > > > > > (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. > > > > > It turned out that nohz full was already work. > > > > > I guess that stress-ng cannot use the CPU in the range of nohz full. > > > > > Because the CPU frequency will be increased to 2.7G by binding CPU to > > > > > other application. > > > > > (2) The frequency of the nohz full core is calculated by get() callback > > > > > according to ftrace. > > > > It is as there is no sched tick on those, and apparently there is nothing > > > > running on them either. > > > Yes. 
> > > If we select your approach and the above phenomenon is normal,
> > > the large frequency discrepancy issue can be resolved for CPUs with sched
> > > tick by the way.
> > > But the nohz full cores still have to face this issue. So this patch is also
> > > needed.
> > >
> > Yes, nohz cores full have to be handled by the cpufreq driver.
> Correct. So we still have to face the issue in this patch and push this
> patch.
> Beata, would you please review this patch?
Just to clarify for my benefit (apologies, but I do have to context switch
pretty often these days): by reviewing this patch do you mean:
1) review your changes (if so, I think there are a few comments already to be
addressed, but I can try to have another look)
2) review changes for AMU-based arch_freq_get_on_cpu?

*note: I will still try to have a look at the non-housekeeping cpus case

---
[1] https://lore.kernel.org/lkml/691d3eb2-cd93-f0fc-a7a4-2a8c0d44262c@nvidia.com/
---

BR
Beata
> > > /Huisong
> > [...]
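As a rough illustration of the affinity question raised in this message, the
sketch below compares the CPU list given to isolcpus/nohz_full
("1-10,41-50") with the list passed to stress-ng via --taskset ("0-63").
The helper name parse_cpu_list is hypothetical and the code only does set
arithmetic on the two strings quoted in the thread; it makes no claim about
how stress-ng or the scheduler actually place tasks on isolated CPUs.

```python
def parse_cpu_list(spec):
    """Expand a CPU list string such as "1-10,41-50" into a set of ints.

    Hypothetical helper for illustration; it handles only plain numbers
    and inclusive "lo-hi" ranges separated by commas.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values quoted in the thread: kernel command line and stress-ng --taskset.
isolated = parse_cpu_list("1-10,41-50")
taskset = parse_cpu_list("0-63")

# CPUs named in the taskset mask that are also isolated/nohz_full.
overlap = sorted(isolated & taskset)
# CPUs in the taskset mask that remain ordinary housekeeping CPUs.
housekeeping = sorted(taskset - isolated)

print("isolated CPUs inside the taskset mask:", len(overlap))
print("housekeeping CPUs inside the taskset mask:", len(housekeeping))
```

The overlap only shows that the mask covers the isolated CPUs; whether any
stressor actually runs there still has to be checked on the target system,
as noted above.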
Hi, On Friday 09 Feb 2024 at 11:55:08 (+0100), Beata Michalska wrote: > On Tue, Feb 06, 2024 at 04:02:15PM +0800, lihuisong (C) wrote: [..] > > > > > > > > > > > > > Would you be able to test [1] on your platform and usecase? > > > > > > I has tested it on my platform (CPU number: 64, SMT: off and CPU base > > > > > > frequency: 2.7GHz). > > > > > > Accoding to the testing result, > > > > > > 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. > > > > > > They still have to face the large frequency discrepancy issue my patch > > > > > > mentioned. > > > > > > 2> Additionally, the frequency value of all CPUs are almost the same by > > > > > > using the 'arch_freq_scale' factor way. I'm not sure if it is ok. > > > > > > > > > > > > The patch [1] has been modified silightly as below: > > > > > > --> > > > > > > @@ -1756,7 +1756,10 @@ static unsigned int > > > > > > cpufreq_verify_current_freq(struct cpufreq_policy *policy, b > > > > > > { > > > > > > unsigned int new_freq; > > > > > > > > > > > > - new_freq = cpufreq_driver->get(policy->cpu); > > > > > > + new_freq = arch_freq_get_on_cpu(policy->cpu); > > > > > > + if (!new_freq) > > > > > > + new_freq = cpufreq_driver->get(policy->cpu); > > > > > > + > > > > > As pointed out this change will not make it to the next version of the patch. 
> > > > > So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will > > > > > only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq > > > > > > if (!new_freq) > > > > > > return 0; > > > > > > > > > > > > And the result is as follows: > > > > > > *case 1:**No setting the nohz_full and cpufreq use performance governor* > > > > > > *--> Step1: *read 'cpuinfo_cur_freq' in no pressure > > > > > > 0: 2699264 2: 2699264 4: 2699264 6: 2699264 > > > > > > 8: 2696628 10: 2696628 12: 2696628 14: 2699264 > > > > > > 16: 2699264 18: 2696628 20: 2699264 22: 2696628 > > > > > > 24: 2699264 26: 2696628 28: 2699264 30: 2696628 > > > > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > > > > > > 40: 2699264 42: 2699264 44: 2696628 46: 2696628 > > > > > > 48: 2696628 50: 2699264 52: 2699264 54: 2696628 > > > > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > > > > > > 64: 2696628 66: 2699264 68: 2696628 70: 2696628 > > > > > > 72: 2699264 74: 2696628 76: 2696628 78: 2699264 > > > > > > 80: 2696628 82: 2696628 84: 2699264 86: 2696628 > > > > > > 88: 2696628 90: 2696628 92: 2696628 94: 2699264 > > > > > > 96: 2696628 98: 2699264 100: 2699264 102: 2696628 > > > > > > 104: 2699264 106: 2699264 108: 2699264 110: 2696628 > > > > > > 112: 2699264 114: 2699264 116: 2699264 118: 2699264 > > > > > > 120: 2696628 122: 2699264 124: 2696628 126: 2699264 > > > > > > Note: the frequency of all CPUs are almost the same. > > > > > Were you expecting smth else ? > > > > The frequency of each CPU might have a different value. > > > > All value of all CPUs is the same under high pressure. > > > > I don't know what the phenomenon is on other platform. > > > > Do you know who else tested it? > > > So I might have rushed a bit with my previous comment/question: apologies for > > > that. > > > The numbers above: those are on a fairly idle/lightly loaded system right? > > Yes. 
> > > Would you mind having another go with just the arch_freq_get_on_cpu > > > implementation beign added and dropping the changes in the cpufreq and > > All my tests are done when cpufreq policy is "performance" and OS isn't on a > > high load. > > Reading "scaling_cur_freq" or "scaling_cur_freq" for each physical core on > > platform > > > > The testing result for "cpuinfo_cur_freq" with your changes on a fairly idle > > and high loaded system can also be found in this thread. > > *A: the result with your changes* > > --> Reading "scaling_cur_freq" > > 0: 2688720 2: 2696628 4: 2699264 6: 2696628 > > 8: 2699264 10: 2696628 12: 2699264 14: 2699264 > > 16: 2699264 18: 2696628 20: 2696628 22: 2696628 > > 24: 2699264 26: 2696628 28: 2696628 30: 2696628 > > 32: 2699264 34: 2691356 36: 2696628 38: 2699264 > > 40: 2699264 42: 2696628 44: 2696628 46: 2699264 > > 48: 2699264 50: 2696628 52: 2696628 54: 2696628 > > 56: 2696628 58: 2699264 60: 2691356 62: 2696628 > > 64: 2696628 66: 2696628 68: 2696628 70: 2696628 > > 72: 2696628 74: 2696628 76: 2699264 78: 2696628 > > 80: 2696628 82: 2696628 84: 2699264 86: 2696628 > > 88: 2625456 90: 2696628 92: 2699264 94: 2696628 > > 96: 2696628 98: 2696628 100: 2699264 102: 2699264 > > 104: 2699264 106: 2696628 108: 2699264 110: 2696628 > > 112: 2699264 114: 2699264 116: 2696628 118: 2696628 > > 120: 2696628 122: 2699264 124: 2696628 126: 2696628 > > -->Reading "cpuinfo_cur_freq" > > 0: 2696628 2: 2696628 4: 2699264 6: 2688720 > > 8: 2699264 10: 2700000 12: 2696628 14: 2698322 > > 16: 2699264 18: 2699264 20: 2696628 22: 2699264 > > 24: 2699264 26: 2699264 28: 2699264 30: 2699264 > > 32: 2699264 34: 2693992 36: 2696628 38: 2696628 > > 40: 2699264 42: 2699264 44: 2699264 46: 2696628 > > 48: 2696628 50: 2699264 52: 2696628 54: 2696628 > > 56: 2699264 58: 2699264 60: 2696628 62: 2699264 > > 64: 2696628 66: 2699264 68: 2696628 70: 2699264 > > 72: 2696628 74: 2696628 76: 2696628 78: 2693992 > > 80: 2696628 82: 2696628 84: 2696628 86: 2696628 
> > 88: 2696628 90: 2699264 92: 2696628 94: 2699264 > > 96: 2699264 98: 2696628 100: 2699264 102: 2699264 > > 104: 2691356 106: 2699264 108: 2699264 110: 2699264 > > 112: 2699264 114: 2696628 116: 2699264 118: 2699264 > > 120: 2696628 122: 2696628 124: 2696628 126: 2696628 > > > > *B: the result without your changes* > > -->Reading "scaling_cur_freq" > > 0: 2698245 2: 2706690 4: 2699649 6: 2702105 > > 8: 2704362 10: 2697993 12: 2701672 14: 2704362 > > 16: 2701052 18: 2701052 20: 2694385 22: 2699650 > > 24: 2706802 26: 2702389 28: 2698299 30: 2698299 > > 32: 2697333 34: 2697993 36: 2701337 38: 2699328 > > 40: 2700330 42: 2700330 44: 2698019 46: 2697697 > > 48: 2699659 50: 2701700 52: 2703401 54: 2701700 > > 56: 2704013 58: 2697658 60: 2695000 62: 2697666 > > 64: 2697902 66: 2701052 68: 2698245 70: 2695789 > > 72: 2701315 74: 2696655 76: 2693666 78: 2695317 > > 80: 2704912 82: 2699649 84: 2698245 86: 2695454 > > 88: 2697966 90: 2697959 92: 2699319 94: 2700680 > > 96: 2695317 98: 2698996 100: 2700000 102: 2700334 > > 104: 2701320 106: 2695065 108: 2700986 110: 2703960 > > 112: 2697635 114: 2704421 116: 2700680 118: 2702040 > > 120: 2700334 122: 2697993 124: 2700334 126: 2705351 > > -->Reading "cpuinfo_cur_freq" > > 0: 2696853 2: 2695454 4: 2699649 6: 2706993 > > 8: 2706060 10: 2704362 12: 2704362 14: 2697658 > > 16: 2707719 18: 2697192 20: 2702456 22: 2699650 > > 24: 2705782 26: 2698299 28: 2703061 30: 2705802 > > 32: 2700000 34: 2700671 36: 2701337 38: 2697658 > > 40: 2700330 42: 2700330 44: 2699672 46: 2697697 > > 48: 2703061 50: 2696610 52: 2692542 54: 2704406 > > 56: 2695317 58: 2699331 60: 2698996 62: 2702675 > > 64: 2704912 66: 2703859 68: 2699649 70: 2698596 > > 72: 2703908 74: 2703355 76: 2697658 78: 2695317 > > 80: 2702105 82: 2707719 84: 2702105 86: 2699649 > > 88: 2697966 90: 2691525 92: 2701700 94: 2700680 > > 96: 2695317 98: 2698996 100: 2698666 102: 2700334 > > 104: 2690429 106: 2707590 108: 2700986 110: 2701320 > > 112: 2696283 114: 2692881 116: 2697627 
118: 2704421 > > 120: 2698996 122: 2696321 124: 2696655 126: 2695000 > > > So in both cases : whether you use arch_freq_get_on_cpu or not > (so with and without the patch) you get roughly the same frequencies > on all cores - or am I missing smth from the dump above ? > And those are reflecting max freq you have provided earlier (?) > Note that the arch_freq_get_on_cpu will return an average frequency for > the last tick, so even if your system is roughly idle with your performance > governor those numbers make sense (some/most of the cores might be idle > but you will see the last freq the core was running at before going to idle). > I do not think there is an agreement what should be shown for idle core when > querying their freq through sysfs. Showing last known freq makes sense, even > more than waking up core just to try to get one. > > @Ionela: Please jump in if I got things wrong. Yes, that's how I see things as well. When using the performance governor, when the CPU is active, the frequency of the CPU should be the maximum one (unless there has been firmware/hardware capping) and that would be reflected by cpuinfo_cur_freq, either through the use of the frequency scale factor (based on the samples on the last tick) or the driver's .get() function (having woken up the CPU to sample the counters). So the values above look alright to me. Thanks, Ionela. > > > > then read 'scaling_cur_freq', doing several reads in some intervals ? > > It seems that above phenomenon has not a lot to do with reading intervals. > > > The change has been tested on RD-N2 model (Neoverse N2 ref platform), > > > it has also been discussed here [1] > > I doesn't get the testing result on this platform in its thread. > It might be missing exact numbers but the conclusions should be here [1] > > > > > > > *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure. 
> > > > > > 0: 2696628 2: 2696628 4: 2696628 6: 2696628 > > > > > > 8: 2696628 10: 2696628 12: 2696628 14: 2696628 > > > > > > 16: 2696628 18: 2696628 20: 2696628 22: 2696628 > > > > > > 24: 2696628 26: 2696628 28: 2696628 30: 2696628 > > > > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628 > > > > > > 40: 2696628 42: 2696628 44: 2696628 46: 2696628 > > > > > > 48: 2696628 50: 2696628 52: 2696628 54: 2696628 > > > > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628 > > > > > > 64: 2696628 66: 2696628 68: 2696628 70: 2696628 > > > > > > 72: 2696628 74: 2696628 76: 2696628 78: 2696628 > > > > > > 80: 2696628 82: 2696628 84: 2696628 86: 2696628 > > > > > > 88: 2696628 90: 2696628 92: 2696628 94: 2696628 > > > > > > 96: 2696628 98: 2696628 100: 2696628 102: 2696628 > > > > > > 104: 2696628 106: 2696628 108: 2696628 110: 2696628 > > > > > > 112: 2696628 114: 2696628 116: 2696628 118: 2696628 > > > > > > 120: 2696628 122: 2696628 124: 2696628 126: 2696628 > > > > > > > > > > > > *Case 2: setting nohz_full and cpufreq use ondemand governor* > > > > > > There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in > > > > > > /proc/cmdline. > > > > > Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to > > > > > set that one. > > > > > Now, afair, isolcpus will make the selected CPUs to disappear from the > > > > > schedulers view (no balancing, no migrating), so unless you affine smth > > > > > explicitly to those CPUs, you will not see much of an activity there. > > > > Correct. > > > > > Need to double check though as it has been a while ... > > > > > > *--> Step 1: *setting ondemand governor to all policy and query > > > > > > 'cpuinfo_cur_freq' in no pressure case. > > > > > > And the frequency of CPUs all are about 400MHz. > > > > > > *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. 
> > > > > > The high memory access pressure is from the command: "stress-ng -c 64 > > > > > > --cpu-load 100% --taskset 0-63" > > > > > I'm not entirely convinced that this will affine to isolated cpus, especially > > > > > that the affinity mask spans all available cpus. If that is the case, no wonder > > > > > your isolated cpus are getting wasted being idle. But I would have to double > > > > > check how this is being handled. > > > > > > The result: > > > > > > 0: 2696628 1: 400000 2: 400000 3: 400909 > > > > > > 4: 400000 5: 400000 6: 400000 7: 400000 > > > > > > 8: 400000 9: 400000 10: 400600 11: 2696628 > > > > > > 12: 2696628 13: 2696628 14: 2696628 15: 2696628 > > > > > > 16: 2696628 17: 2696628 18: 2696628 19: 2696628 > > > > > > 20: 2696628 21: 2696628 22: 2696628 23: 2696628 > > > > > > 24: 2696628 25: 2696628 26: 2696628 27: 2696628 > > > > > > 28: 2696628 29: 2696628 30: 2696628 31: 2696628 > > > > > > 32: 2696628 33: 2696628 34: 2696628 35: 2696628 > > > > > > 36: 2696628 37: 2696628 38: 2696628 39: 2696628 > > > > > > 40: 2696628 41: 400000 42: 400000 43: 400000 > > > > > > 44: 400000 45: 398847 46: 400000 47: 400000 > > > > > > 48: 400000 49: 400000 50: 400000 51: 2696628 > > > > > > 52: 2696628 53: 2696628 54: 2696628 55: 2696628 > > > > > > 56: 2696628 57: 2696628 58: 2696628 59: 2696628 > > > > > > 60: 2696628 61: 2696628 62: 2696628 63: 2699264 > > > > > > > > > > > > Note: > > > > > > (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. > > > > > > It turned out that nohz full was already work. > > > > > > I guess that stress-ng cannot use the CPU in the range of nohz full. > > > > > > Because the CPU frequency will be increased to 2.7G by binding CPU to > > > > > > other application. > > > > > > (2) The frequency of the nohz full core is calculated by get() callback > > > > > > according to ftrace. > > > > > It is as there is no sched tick on those, and apparently there is nothing > > > > > running on them either. 
> > > > Yes. > > > > If we select your approach and the above phenomenon is normal, > > > > the large frequency discrepancy issue can be resolved for CPUs with sched > > > > tick by the way. > > > > But the nohz full cores still have to face this issue. So this patch is also > > > > needed. > > > > > > > Yes, nohz cores full have to be handled by the cpufreq driver. > > Correct. So we still have to face the issue in this patch and push this > > patch. > > Beata, would you please review this patch? > Just to clarify for my benefit (apologies but I do have to contex switch > pretty often these days): by reviewing this patch do you mean: > 1) review your changes (if so I think there are few comments already to be > addressed, but I can try to have another look) > 2) review changes for AMU-based arch_freq_get_on_cpu ? > > *note: I will still try to have a look at the non-housekeeping cpus case > > --- > [1] https://lore.kernel.org/lkml/691d3eb2-cd93-f0fc-a7a4-2a8c0d44262c@nvidia.com/ > --- > > BR > Beata > > > > > > /Huisong > > > > [...]
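To make the sampling-interval concern from the original patch description
more concrete, here is a minimal sketch. It assumes the frequency is
estimated as the reference rate multiplied by the ratio of the delivered
counter delta to the reference counter delta. The function names, the
100 MHz reference rate, the 2 us sampling window and the 200 ns of extra
read latency are all made-up values for illustration, not measurements
from this thread.

```python
def counter_delta(rate_khz, window_ns):
    """Ticks accumulated by a counter running at rate_khz over window_ns."""
    # kHz * ns / 1e6 gives the number of ticks in the window.
    return rate_khz * window_ns // 1_000_000


def estimate_freq_khz(actual_khz, ref_khz, delivered_window_ns, reference_window_ns):
    """Estimate CPU frequency from two counter deltas (illustrative model).

    The two windows are passed separately so that unequal read latency
    between the delivered and reference counters can be modelled.
    """
    delivered = counter_delta(actual_khz, delivered_window_ns)
    reference = counter_delta(ref_khz, reference_window_ns)
    if reference == 0:
        return 0
    return ref_khz * delivered // reference


ACTUAL_KHZ = 2_700_000   # CPU really running at 2.7 GHz
REF_KHZ = 100_000        # assumed reference counter rate

# Both counters sampled over the same 2 us window.
equal = estimate_freq_khz(ACTUAL_KHZ, REF_KHZ, 2000, 2000)
# Delivered counter window stretched by 200 ns of extra read latency.
skewed = estimate_freq_khz(ACTUAL_KHZ, REF_KHZ, 2200, 2000)

print("equal windows :", equal, "kHz")
print("skewed windows:", skewed, "kHz")
```

In this model a 10% mismatch between the two windows turns directly into a
10% error in the reported frequency, which is why reading both counters
together, or deriving the value from per-tick samples, is attractive.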
On 2024/2/9 18:55, Beata Michalska wrote:
> On Tue, Feb 06, 2024 at 04:02:15PM +0800, lihuisong (C) wrote:
>> On 2024/2/2 16:08, Beata Michalska wrote:
>>> On Wed, Jan 17, 2024 at 05:18:40PM +0800, lihuisong (C) wrote:
>>>
>>> Hi ,
>>>
>>> Again, apologies for delay,
>>>
>>>> Hi,
>>>>
>>>> On 2024/1/16 22:10, Beata Michalska wrote:
>>>>> Hi,
>>>>>
>>>>> Apologies for jumping in so late....
>>>>>
>>>>> On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>>>>>> Hi Ionela,
>>>>>>
>>>>>> On 2024/1/8 22:03, Ionela Voinescu wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>>>>>> Hi Vanshi,
>>>>>>>>
>>>>>>>> On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>>>>>> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>>>>>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>>>>>> Many developers found that the cpu current frequency is greater than
>>>>>>>>>>>> the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>>>>>>>
>>>>>>>>>>>> In the scenarios with high memory access pressure, the patch [1] has
>>>>>>>>>>>> proved the significant latency of cpc_read() which is used to obtain
>>>>>>>>>>>> delivered and reference performance counter cause an absurd frequency.
>>>>>>>>>>>> The sampling interval for this counters is very critical and
>>>>>>>>>>>> is expected
>>>>>>>>>>>> to be equal. However, the different latency of cpc_read() has a direct
>>>>>>>>>>>> impact on their sampling interval.
>>>>>>>>>>>>
>>>>>>>>>>> Would this [1] alternative solution work for you?
>>>>>>>>>> It would work for me AFAICS.
>>>>>>>>>> Because the "arch_freq_scale" is also from AMU core and constant
>>>>>>>>>> counter, and read together.
>>>>>>>>>> But, from their discuss line, it seems that there are some tricky
>>>>>>>>>> points to clarify or consider.
>>>>>>>>> I think the changes in [1] would work better when CPUs may be idle.
With >>>>>>>>> this >>>>>>>>> patch we would have to wake any core that is in idle state to read the >>>>>>>>> AMU >>>>>>>>> counters. Worst case, if core 0 is trying to read the CPU frequency of >>>>>>>>> all >>>>>>>>> cores, it may need to wake up all the other cores to read the AMU >>>>>>>>> counters. >>>>>>>> From the approach in [1], if all CPUs (one or more cores) under one policy >>>>>>>> are idle, they still cannot be obtained the CPU frequency, right? >>>>>>>> In this case, the [1] API will return 0 and have to back to call >>>>>>>> cpufreq_driver->get() for cpuinfo_cur_freq. >>>>>>>> Then we still need to face the issue this patch mentioned. >>>>>>> With the implementation at [1], arch_freq_get_on_cpu() will not return 0 >>>>>>> for idle CPUs and the get() callback will not be called to wake up the >>>>>>> CPUs. >>>>>> Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >>>>>> However, for no-housekeeping CPUs, it will return 0 and have to call get() >>>>>> callback, right? >>>>>>> Worst case, arch_freq_get_on_cpu() will return a frequency based on the >>>>>>> AMU counter values obtained on the last tick on that CPU. But if that CPU >>>>>>> is not a housekeeping CPU, a housekeeping CPU in the same policy will be >>>>>>> selected, as it would have had a more recent tick, and therefore a more >>>>>>> recent frequency value for the domain. >>>>>> But this frequency is from the last tick, >>>>>> this last tick is probably a long time ago and it doesn't update >>>>>> 'arch_freq_scale' for some reasons like CPU dile. >>>>>> In addition, I'm not sure if there is possible that amu_scale_freq_tick() is >>>>>> executed delayed under high stress case. >>>>>> It also have an impact on the accuracy of the cpu frequency we query. >>>>>>> I understand that the frequency returned here will not be up to date, >>>>>>> but there's no proper frequency feedback for an idle CPU. 
If one only >>>>>>> wakes up a CPU to sample counters, before the CPU goes back to sleep, >>>>>>> the obtained frequency feedback is meaningless. >>>>>>> >>>>>>>>> For systems with 128 cores or more, this could be very expensive and >>>>>>>>> happen >>>>>>>>> very frequently. >>>>>>>>> >>>>>>>>> AFAICS, the approach in [1] would avoid this cost. >>>>>>>> But the CPU frequency is just an average value for the last tick period >>>>>>>> instead of the current one the CPU actually runs at. >>>>>>>> In addition, there are some conditions to use 'arch_freq_scale' in this >>>>>>>> approach. >>>>>>> What are the conditions you are referring to? >>>>>> It depends on the housekeeping CPUs. >>>>>>>> So I'm not sure if this approach can entirely cover the frequency >>>>>>>> discrepancy issue. >>>>>>> Unfortunately there is no perfect frequency feedback. By the time you >>>>>>> observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency >>>>>>> of the CPU might have already changed. Therefore, an average value might >>>>>>> be a better indication of the recent performance level of a CPU. >>>>>> An average value for CPU frequency is ok. It may be better if it has not any >>>>>> delaying. >>>>>> >>>>>> The original implementation for cpuinfo_cur_freq can more reflect their >>>>>> meaning in the user-guide [1]. The user-guide said: >>>>>> "cpuinfo_cur_freq : Current frequency of the CPU as obtained from the >>>>>> hardware, in KHz. >>>>>> This is the frequency the CPU actually runs at." >>>>>> >>>>>> >>>>>> [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt >>>>>> >>>>>>> Would you be able to test [1] on your platform and usecase? >>>>>> I has tested it on my platform (CPU number: 64, SMT: off and CPU base >>>>>> frequency: 2.7GHz). >>>>>> Accoding to the testing result, >>>>>> 1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. >>>>>> They still have to face the large frequency discrepancy issue my patch >>>>>> mentioned. 
>>>>>> 2> Additionally, the frequency value of all CPUs are almost the same by >>>>>> using the 'arch_freq_scale' factor way. I'm not sure if it is ok. >>>>>> >>>>>> The patch [1] has been modified silightly as below: >>>>>> --> >>>>>> @@ -1756,7 +1756,10 @@ static unsigned int >>>>>> cpufreq_verify_current_freq(struct cpufreq_policy *policy, b >>>>>> { >>>>>> unsigned int new_freq; >>>>>> >>>>>> - new_freq = cpufreq_driver->get(policy->cpu); >>>>>> + new_freq = arch_freq_get_on_cpu(policy->cpu); >>>>>> + if (!new_freq) >>>>>> + new_freq = cpufreq_driver->get(policy->cpu); >>>>>> + >>>>> As pointed out this change will not make it to the next version of the patch. >>>>> So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will >>>>> only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq >>>>>> if (!new_freq) >>>>>> return 0; >>>>>> >>>>>> And the result is as follows: >>>>>> *case 1:**No setting the nohz_full and cpufreq use performance governor* >>>>>> *--> Step1: *read 'cpuinfo_cur_freq' in no pressure >>>>>> 0: 2699264 2: 2699264 4: 2699264 6: 2699264 >>>>>> 8: 2696628 10: 2696628 12: 2696628 14: 2699264 >>>>>> 16: 2699264 18: 2696628 20: 2699264 22: 2696628 >>>>>> 24: 2699264 26: 2696628 28: 2699264 30: 2696628 >>>>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>>>> 40: 2699264 42: 2699264 44: 2696628 46: 2696628 >>>>>> 48: 2696628 50: 2699264 52: 2699264 54: 2696628 >>>>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>>>> 64: 2696628 66: 2699264 68: 2696628 70: 2696628 >>>>>> 72: 2699264 74: 2696628 76: 2696628 78: 2699264 >>>>>> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >>>>>> 88: 2696628 90: 2696628 92: 2696628 94: 2699264 >>>>>> 96: 2696628 98: 2699264 100: 2699264 102: 2696628 >>>>>> 104: 2699264 106: 2699264 108: 2699264 110: 2696628 >>>>>> 112: 2699264 114: 2699264 116: 2699264 118: 2699264 >>>>>> 120: 2696628 122: 2699264 124: 2696628 126: 2699264 >>>>>> Note: the frequency of all CPUs are almost the 
same. >>>>> Were you expecting smth else ? >>>> The frequency of each CPU might have a different value. >>>> All value of all CPUs is the same under high pressure. >>>> I don't know what the phenomenon is on other platform. >>>> Do you know who else tested it? >>> So I might have rushed a bit with my previous comment/question: apologies for >>> that. >>> The numbers above: those are on a fairly idle/lightly loaded system right? >> Yes. >>> Would you mind having another go with just the arch_freq_get_on_cpu >>> implementation beign added and dropping the changes in the cpufreq and >> All my tests are done when cpufreq policy is "performance" and OS isn't on a >> high load. >> Reading "scaling_cur_freq" or "scaling_cur_freq" for each physical core on >> platform >> >> The testing result for "cpuinfo_cur_freq" with your changes on a fairly idle >> and high loaded system can also be found in this thread. >> *A: the result with your changes* >> --> Reading "scaling_cur_freq" >> 0: 2688720 2: 2696628 4: 2699264 6: 2696628 >> 8: 2699264 10: 2696628 12: 2699264 14: 2699264 >> 16: 2699264 18: 2696628 20: 2696628 22: 2696628 >> 24: 2699264 26: 2696628 28: 2696628 30: 2696628 >> 32: 2699264 34: 2691356 36: 2696628 38: 2699264 >> 40: 2699264 42: 2696628 44: 2696628 46: 2699264 >> 48: 2699264 50: 2696628 52: 2696628 54: 2696628 >> 56: 2696628 58: 2699264 60: 2691356 62: 2696628 >> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >> 72: 2696628 74: 2696628 76: 2699264 78: 2696628 >> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >> 88: 2625456 90: 2696628 92: 2699264 94: 2696628 >> 96: 2696628 98: 2696628 100: 2699264 102: 2699264 >> 104: 2699264 106: 2696628 108: 2699264 110: 2696628 >> 112: 2699264 114: 2699264 116: 2696628 118: 2696628 >> 120: 2696628 122: 2699264 124: 2696628 126: 2696628 >> -->Reading "cpuinfo_cur_freq" >> 0: 2696628 2: 2696628 4: 2699264 6: 2688720 >> 8: 2699264 10: 2700000 12: 2696628 14: 2698322 >> 16: 2699264 18: 2699264 20: 2696628 22: 2699264 >> 24: 
2699264 26: 2699264 28: 2699264 30: 2699264 >> 32: 2699264 34: 2693992 36: 2696628 38: 2696628 >> 40: 2699264 42: 2699264 44: 2699264 46: 2696628 >> 48: 2696628 50: 2699264 52: 2696628 54: 2696628 >> 56: 2699264 58: 2699264 60: 2696628 62: 2699264 >> 64: 2696628 66: 2699264 68: 2696628 70: 2699264 >> 72: 2696628 74: 2696628 76: 2696628 78: 2693992 >> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >> 88: 2696628 90: 2699264 92: 2696628 94: 2699264 >> 96: 2699264 98: 2696628 100: 2699264 102: 2699264 >> 104: 2691356 106: 2699264 108: 2699264 110: 2699264 >> 112: 2699264 114: 2696628 116: 2699264 118: 2699264 >> 120: 2696628 122: 2696628 124: 2696628 126: 2696628 >> >> *B: the result without your changes* >> -->Reading "scaling_cur_freq" >> 0: 2698245 2: 2706690 4: 2699649 6: 2702105 >> 8: 2704362 10: 2697993 12: 2701672 14: 2704362 >> 16: 2701052 18: 2701052 20: 2694385 22: 2699650 >> 24: 2706802 26: 2702389 28: 2698299 30: 2698299 >> 32: 2697333 34: 2697993 36: 2701337 38: 2699328 >> 40: 2700330 42: 2700330 44: 2698019 46: 2697697 >> 48: 2699659 50: 2701700 52: 2703401 54: 2701700 >> 56: 2704013 58: 2697658 60: 2695000 62: 2697666 >> 64: 2697902 66: 2701052 68: 2698245 70: 2695789 >> 72: 2701315 74: 2696655 76: 2693666 78: 2695317 >> 80: 2704912 82: 2699649 84: 2698245 86: 2695454 >> 88: 2697966 90: 2697959 92: 2699319 94: 2700680 >> 96: 2695317 98: 2698996 100: 2700000 102: 2700334 >> 104: 2701320 106: 2695065 108: 2700986 110: 2703960 >> 112: 2697635 114: 2704421 116: 2700680 118: 2702040 >> 120: 2700334 122: 2697993 124: 2700334 126: 2705351 >> -->Reading "cpuinfo_cur_freq" >> 0: 2696853 2: 2695454 4: 2699649 6: 2706993 >> 8: 2706060 10: 2704362 12: 2704362 14: 2697658 >> 16: 2707719 18: 2697192 20: 2702456 22: 2699650 >> 24: 2705782 26: 2698299 28: 2703061 30: 2705802 >> 32: 2700000 34: 2700671 36: 2701337 38: 2697658 >> 40: 2700330 42: 2700330 44: 2699672 46: 2697697 >> 48: 2703061 50: 2696610 52: 2692542 54: 2704406 >> 56: 2695317 58: 2699331 60: 2698996 62: 
2702675
>> 64: 2704912 66: 2703859 68: 2699649 70: 2698596
>> 72: 2703908 74: 2703355 76: 2697658 78: 2695317
>> 80: 2702105 82: 2707719 84: 2702105 86: 2699649
>> 88: 2697966 90: 2691525 92: 2701700 94: 2700680
>> 96: 2695317 98: 2698996 100: 2698666 102: 2700334
>> 104: 2690429 106: 2707590 108: 2700986 110: 2701320
>> 112: 2696283 114: 2692881 116: 2697627 118: 2704421
>> 120: 2698996 122: 2696321 124: 2696655 126: 2695000
>>
> So in both cases : whether you use arch_freq_get_on_cpu or not
> (so with and without the patch) you get roughly the same frequencies
> on all cores - or am I missing smth from the dump above ?
The "with/without your changes" I mentioned refers to your patch that
introduces arch_freq_get_on_cpu.
I just tested them as you requested.
> And those are reflecting max freq you have provided earlier (?)
I know it is an average frequency over the last tick when using
arch_freq_get_on_cpu.
I have no doubt that the frequency is the maximum value with the
performance governor.
I just want to point out the difference between having and not having your
patch.
The frequency values of all cores from cpuinfo_cur_freq and
scaling_cur_freq are almost the same when arch_freq_get_on_cpu is used on
my platform.
However, the frequency values differ from core to core when
arch_freq_get_on_cpu is not used and only .get() is used.
> Note that the arch_freq_get_on_cpu will return an average frequency for
> the last tick, so even if your system is roughly idle with your performance
> governor those numbers make sense (some/most of the cores might be idle
> but you will see the last freq the core was running at before going to idle).
> I do not think there is an agreement what should be shown for idle core when
> querying their freq through sysfs. Showing last known freq makes sense, even
> more than waking up core just to try to get one.
I'm not opposed to using the frequency scale factor to get the CPU
frequency. But it had better work correctly.
>
> @Ionela: Please jump in if I got things wrong.
> >>> then read 'scaling_cur_freq', doing several reads in some intervals ? >> It seems that above phenomenon has not a lot to do with reading intervals. >>> The change has been tested on RD-N2 model (Neoverse N2 ref platform), >>> it has also been discussed here [1] >> I doesn't get the testing result on this platform in its thread. > It might be missing exact numbers but the conclusions should be here [1] > >>>>>> *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure. >>>>>> 0: 2696628 2: 2696628 4: 2696628 6: 2696628 >>>>>> 8: 2696628 10: 2696628 12: 2696628 14: 2696628 >>>>>> 16: 2696628 18: 2696628 20: 2696628 22: 2696628 >>>>>> 24: 2696628 26: 2696628 28: 2696628 30: 2696628 >>>>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>>>> 40: 2696628 42: 2696628 44: 2696628 46: 2696628 >>>>>> 48: 2696628 50: 2696628 52: 2696628 54: 2696628 >>>>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>>>> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >>>>>> 72: 2696628 74: 2696628 76: 2696628 78: 2696628 >>>>>> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >>>>>> 88: 2696628 90: 2696628 92: 2696628 94: 2696628 >>>>>> 96: 2696628 98: 2696628 100: 2696628 102: 2696628 >>>>>> 104: 2696628 106: 2696628 108: 2696628 110: 2696628 >>>>>> 112: 2696628 114: 2696628 116: 2696628 118: 2696628 >>>>>> 120: 2696628 122: 2696628 124: 2696628 126: 2696628 >>>>>> >>>>>> *Case 2: setting nohz_full and cpufreq use ondemand governor* >>>>>> There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in >>>>>> /proc/cmdline. >>>>> Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to >>>>> set that one. >>>>> Now, afair, isolcpus will make the selected CPUs to disappear from the >>>>> schedulers view (no balancing, no migrating), so unless you affine smth >>>>> explicitly to those CPUs, you will not see much of an activity there. >>>> Correct. >>>>> Need to double check though as it has been a while ... 
>>>>>> *--> Step 1: *setting ondemand governor to all policy and query >>>>>> 'cpuinfo_cur_freq' in no pressure case. >>>>>> And the frequency of CPUs all are about 400MHz. >>>>>> *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. >>>>>> The high memory access pressure is from the command: "stress-ng -c 64 >>>>>> --cpu-load 100% --taskset 0-63" >>>>> I'm not entirely convinced that this will affine to isolated cpus, especially >>>>> that the affinity mask spans all available cpus. If that is the case, no wonder >>>>> your isolated cpus are getting wasted being idle. But I would have to double >>>>> check how this is being handled. >>>>>> The result: >>>>>> 0: 2696628 1: 400000 2: 400000 3: 400909 >>>>>> 4: 400000 5: 400000 6: 400000 7: 400000 >>>>>> 8: 400000 9: 400000 10: 400600 11: 2696628 >>>>>> 12: 2696628 13: 2696628 14: 2696628 15: 2696628 >>>>>> 16: 2696628 17: 2696628 18: 2696628 19: 2696628 >>>>>> 20: 2696628 21: 2696628 22: 2696628 23: 2696628 >>>>>> 24: 2696628 25: 2696628 26: 2696628 27: 2696628 >>>>>> 28: 2696628 29: 2696628 30: 2696628 31: 2696628 >>>>>> 32: 2696628 33: 2696628 34: 2696628 35: 2696628 >>>>>> 36: 2696628 37: 2696628 38: 2696628 39: 2696628 >>>>>> 40: 2696628 41: 400000 42: 400000 43: 400000 >>>>>> 44: 400000 45: 398847 46: 400000 47: 400000 >>>>>> 48: 400000 49: 400000 50: 400000 51: 2696628 >>>>>> 52: 2696628 53: 2696628 54: 2696628 55: 2696628 >>>>>> 56: 2696628 57: 2696628 58: 2696628 59: 2696628 >>>>>> 60: 2696628 61: 2696628 62: 2696628 63: 2699264 >>>>>> >>>>>> Note: >>>>>> (1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. >>>>>> It turned out that nohz full was already work. >>>>>> I guess that stress-ng cannot use the CPU in the range of nohz full. >>>>>> Because the CPU frequency will be increased to 2.7G by binding CPU to >>>>>> other application. >>>>>> (2) The frequency of the nohz full core is calculated by get() callback >>>>>> according to ftrace. 
>>>>> It is as there is no sched tick on those, and apparently there is nothing
>>>>> running on them either.
>>>> Yes.
>>>> If we select your approach and the above phenomenon is normal,
>>>> the large frequency discrepancy issue can be resolved for CPUs with sched
>>>> tick by the way.
>>>> But the nohz full cores still have to face this issue. So this patch is also
>>>> needed.
>>>>
>>> Yes, nohz cores full have to be handled by the cpufreq driver.
>> Correct. So we still have to face the issue in this patch and push this
>> patch.
>> Beata, would you please review this patch?
> Just to clarify for my benefit (apologies but I do have to contex switch
> pretty often these days): by reviewing this patch do you mean:
> 1) review your changes (if so I think there are few comments already to be
> addressed, but I can try to have another look)
Currently, the main comment is that my patch wakes up the CPU to get the
frequency.
BTW, the core has always been woken up to get the frequency on the FFH path
in cppc_acpi; please see cpc_read_ffh().
So it may be acceptable. After all, we don't query the CPU frequency very often.
But your patch doesn't cover the non-housekeeping CPUs.
> 2) review changes for AMU-based arch_freq_get_on_cpu ?
>
> *note: I will still try to have a look at the non-housekeeping cpus case
I very much hope that the issue my patch describes can be resolved ASAP.
So what's your plan for the non-housekeeping CPUs case?
>
> ---
> [1] https://lore.kernel.org/lkml/691d3eb2-cd93-f0fc-a7a4-2a8c0d44262c@nvidia.com/
> ---
>
> BR
> Beata
>>
>> /Huisong
> [...]
> .
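For readers following the thread, the "large frequency discrepancy" can be illustrated with a small sketch. It assumes the usual feedback-counter model in which the average frequency is the reference rate scaled by the ratio of the delivered and reference counter deltas. If the two counters are not sampled over the same window (for example, because one read is delayed under memory pressure), that ratio is inflated. The struct, the function names, the 100 MHz reference rate and the microsecond windows are illustrative assumptions, not the kernel's code.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two feedback counters. */
struct fb_ctrs {
	uint64_t delivered;	/* advances at the actual core frequency */
	uint64_t reference;	/* advances at a fixed reference frequency */
};

/*
 * Average frequency in kHz between two snapshots:
 *   freq = ref_khz * delta_delivered / delta_reference
 * Returns 0 if the reference counter did not advance.
 */
static uint64_t freq_khz_from_ctrs(const struct fb_ctrs *t0,
				   const struct fb_ctrs *t1,
				   uint64_t ref_khz)
{
	uint64_t d_del = t1->delivered - t0->delivered;
	uint64_t d_ref = t1->reference - t0->reference;

	if (d_ref == 0)
		return 0;
	return ref_khz * d_del / d_ref;
}

/*
 * Toy model: both counters start at 0. The reference counter is sampled
 * after window_us, but the delivered counter is sampled skew_us later,
 * so it covers a longer window than the reference counter does.
 */
static uint64_t simulate_freq_khz(uint64_t core_khz, uint64_t ref_khz,
				  uint64_t window_us, uint64_t skew_us)
{
	struct fb_ctrs t0 = { 0, 0 };
	struct fb_ctrs t1;

	/* kHz * us / 1000 = number of cycles in the window */
	t1.reference = ref_khz * window_us / 1000;
	t1.delivered = core_khz * (window_us + skew_us) / 1000;

	return freq_khz_from_ctrs(&t0, &t1, ref_khz);
}
```

With a 2.7 GHz core, a 100 MHz reference and a 2 us window, `simulate_freq_khz(2700000, 100000, 2, 0)` gives 2700000 kHz, while a 1 us skew on the delivered read gives 4050000 kHz, well above the platform maximum. This is why reading both counters together matters.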
On Mon, Feb 19, 2024 at 08:15:50PM +0800, lihuisong (C) wrote:
>
>On 2024/2/9 18:55, Beata Michalska wrote:
>>On Tue, Feb 06, 2024 at 04:02:15PM +0800, lihuisong (C) wrote:
>>>On 2024/2/2 16:08, Beata Michalska wrote:
>>>>On Wed, Jan 17, 2024 at 05:18:40PM +0800, lihuisong (C) wrote:
>>>>
>>>>Hi ,
>>>>
>>>>Again, apologies for delay,
>>>>
>>>>>Hi,
>>>>>
>>>>>On 2024/1/16 22:10, Beata Michalska wrote:
>>>>>>Hi,
>>>>>>
>>>>>>Apologies for jumping in so late....
>>>>>>
>>>>>>On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>>>>>>>Hi Ionela,
>>>>>>>
>>>>>>>On 2024/1/8 22:03, Ionela Voinescu wrote:
>>>>>>>>Hi,
>>>>>>>>
>>>>>>>>On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>>>>>>>Hi Vanshi,
>>>>>>>>>
>>>>>>>>>On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>>>>>>>On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>>>>>>>On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>>>>>>>Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>>>>>>>Many developers found that the cpu current frequency is greater than
>>>>>>>>>>>>>the maximum frequency of the platform, please see [1], [2] and [3].
>>>>>>>>>>>>>
>>>>>>>>>>>>>In the scenarios with high memory access pressure, the patch [1] has
>>>>>>>>>>>>>proved the significant latency of cpc_read() which is used to obtain
>>>>>>>>>>>>>delivered and reference performance counter cause an absurd frequency.
>>>>>>>>>>>>>The sampling interval for this counters is very critical and
>>>>>>>>>>>>>is expected
>>>>>>>>>>>>>to be equal. However, the different latency of cpc_read() has a direct
>>>>>>>>>>>>>impact on their sampling interval.
>>>>>>>>>>>>>
>>>>>>>>>>>>Would this [1] alternative solution work for you?
>>>>>>>>>>>It would work for me AFAICS.
>>>>>>>>>>>Because the "arch_freq_scale" is also from AMU core and constant
>>>>>>>>>>>counter, and read together.
>>>>>>>>>>>But, from their discuss line, it seems that there are some tricky
>>>>>>>>>>>points to clarify or consider.
>>>>>>>>>>I think the changes in [1] would work better when CPUs may be idle. With >>>>>>>>>>this >>>>>>>>>>patch we would have to wake any core that is in idle state to read the >>>>>>>>>>AMU >>>>>>>>>>counters. Worst case, if core 0 is trying to read the CPU frequency of >>>>>>>>>>all >>>>>>>>>>cores, it may need to wake up all the other cores to read the AMU >>>>>>>>>>counters. >>>>>>>>> From the approach in [1], if all CPUs (one or more cores) under one policy >>>>>>>>>are idle, they still cannot be obtained the CPU frequency, right? >>>>>>>>>In this case, the [1] API will return 0 and have to back to call >>>>>>>>>cpufreq_driver->get() for cpuinfo_cur_freq. >>>>>>>>>Then we still need to face the issue this patch mentioned. >>>>>>>>With the implementation at [1], arch_freq_get_on_cpu() will not return 0 >>>>>>>>for idle CPUs and the get() callback will not be called to wake up the >>>>>>>>CPUs. >>>>>>>Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >>>>>>>However, for no-housekeeping CPUs, it will return 0 and have to call get() >>>>>>>callback, right? >>>>>>>>Worst case, arch_freq_get_on_cpu() will return a frequency based on the >>>>>>>>AMU counter values obtained on the last tick on that CPU. But if that CPU >>>>>>>>is not a housekeeping CPU, a housekeeping CPU in the same policy will be >>>>>>>>selected, as it would have had a more recent tick, and therefore a more >>>>>>>>recent frequency value for the domain. >>>>>>>But this frequency is from the last tick, >>>>>>>this last tick is probably a long time ago and it doesn't update >>>>>>>'arch_freq_scale' for some reasons like CPU dile. >>>>>>>In addition, I'm not sure if there is possible that amu_scale_freq_tick() is >>>>>>>executed delayed under high stress case. >>>>>>>It also have an impact on the accuracy of the cpu frequency we query. >>>>>>>>I understand that the frequency returned here will not be up to date, >>>>>>>>but there's no proper frequency feedback for an idle CPU. 
If one only >>>>>>>>wakes up a CPU to sample counters, before the CPU goes back to sleep, >>>>>>>>the obtained frequency feedback is meaningless. >>>>>>>> >>>>>>>>>>For systems with 128 cores or more, this could be very expensive and >>>>>>>>>>happen >>>>>>>>>>very frequently. >>>>>>>>>> >>>>>>>>>>AFAICS, the approach in [1] would avoid this cost. >>>>>>>>>But the CPU frequency is just an average value for the last tick period >>>>>>>>>instead of the current one the CPU actually runs at. >>>>>>>>>In addition, there are some conditions to use 'arch_freq_scale' in this >>>>>>>>>approach. >>>>>>>>What are the conditions you are referring to? >>>>>>>It depends on the housekeeping CPUs. >>>>>>>>>So I'm not sure if this approach can entirely cover the frequency >>>>>>>>>discrepancy issue. >>>>>>>>Unfortunately there is no perfect frequency feedback. By the time you >>>>>>>>observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, the frequency >>>>>>>>of the CPU might have already changed. Therefore, an average value might >>>>>>>>be a better indication of the recent performance level of a CPU. >>>>>>>An average value for CPU frequency is ok. It may be better if it has not any >>>>>>>delaying. >>>>>>> >>>>>>>The original implementation for cpuinfo_cur_freq can more reflect their >>>>>>>meaning in the user-guide [1]. The user-guide said: >>>>>>>"cpuinfo_cur_freq : Current frequency of the CPU as obtained from the >>>>>>>hardware, in KHz. >>>>>>>This is the frequency the CPU actually runs at." >>>>>>> >>>>>>> >>>>>>>[1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt >>>>>>> >>>>>>>>Would you be able to test [1] on your platform and usecase? >>>>>>>I has tested it on my platform (CPU number: 64, SMT: off and CPU base >>>>>>>frequency: 2.7GHz). >>>>>>>Accoding to the testing result, >>>>>>>1> I found that patch [1] and [2] cannot cover the no housekeeping CPUs. >>>>>>>They still have to face the large frequency discrepancy issue my patch >>>>>>>mentioned. 
>>>>>>>2> Additionally, the frequency value of all CPUs are almost the same by >>>>>>>using the 'arch_freq_scale' factor way. I'm not sure if it is ok. >>>>>>> >>>>>>>The patch [1] has been modified silightly as below: >>>>>>>--> >>>>>>>@@ -1756,7 +1756,10 @@ static unsigned int >>>>>>>cpufreq_verify_current_freq(struct cpufreq_policy *policy, b >>>>>>> { >>>>>>> unsigned int new_freq; >>>>>>> >>>>>>>- new_freq = cpufreq_driver->get(policy->cpu); >>>>>>>+ new_freq = arch_freq_get_on_cpu(policy->cpu); >>>>>>>+ if (!new_freq) >>>>>>>+ new_freq = cpufreq_driver->get(policy->cpu); >>>>>>>+ >>>>>>As pointed out this change will not make it to the next version of the patch. >>>>>>So I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will >>>>>>only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq >>>>>>> if (!new_freq) >>>>>>> return 0; >>>>>>> >>>>>>>And the result is as follows: >>>>>>>*case 1:**No setting the nohz_full and cpufreq use performance governor* >>>>>>>*--> Step1: *read 'cpuinfo_cur_freq' in no pressure >>>>>>> 0: 2699264 2: 2699264 4: 2699264 6: 2699264 >>>>>>> 8: 2696628 10: 2696628 12: 2696628 14: 2699264 >>>>>>> 16: 2699264 18: 2696628 20: 2699264 22: 2696628 >>>>>>> 24: 2699264 26: 2696628 28: 2699264 30: 2696628 >>>>>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>>>>> 40: 2699264 42: 2699264 44: 2696628 46: 2696628 >>>>>>> 48: 2696628 50: 2699264 52: 2699264 54: 2696628 >>>>>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>>>>> 64: 2696628 66: 2699264 68: 2696628 70: 2696628 >>>>>>> 72: 2699264 74: 2696628 76: 2696628 78: 2699264 >>>>>>> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >>>>>>> 88: 2696628 90: 2696628 92: 2696628 94: 2699264 >>>>>>> 96: 2696628 98: 2699264 100: 2699264 102: 2696628 >>>>>>>104: 2699264 106: 2699264 108: 2699264 110: 2696628 >>>>>>>112: 2699264 114: 2699264 116: 2699264 118: 2699264 >>>>>>>120: 2696628 122: 2699264 124: 2696628 126: 2699264 >>>>>>>Note: the frequency of all 
CPUs are almost the same. >>>>>>Were you expecting smth else ? >>>>>The frequency of each CPU might have a different value. >>>>>All value of all CPUs is the same under high pressure. >>>>>I don't know what the phenomenon is on other platform. >>>>>Do you know who else tested it? >>>>So I might have rushed a bit with my previous comment/question: apologies for >>>>that. >>>>The numbers above: those are on a fairly idle/lightly loaded system right? >>>Yes. >>>>Would you mind having another go with just the arch_freq_get_on_cpu >>>>implementation beign added and dropping the changes in the cpufreq and >>>All my tests are done when cpufreq policy is "performance" and OS isn't on a >>>high load. >>>Reading "scaling_cur_freq" or "scaling_cur_freq" for each physical core on >>>platform >>> >>>The testing result for "cpuinfo_cur_freq" with your changes on a fairly idle >>>and high loaded system can also be found in this thread. >>>*A: the result with your changes* >>>--> Reading "scaling_cur_freq" >>> 0: 2688720 2: 2696628 4: 2699264 6: 2696628 >>> 8: 2699264 10: 2696628 12: 2699264 14: 2699264 >>> 16: 2699264 18: 2696628 20: 2696628 22: 2696628 >>> 24: 2699264 26: 2696628 28: 2696628 30: 2696628 >>> 32: 2699264 34: 2691356 36: 2696628 38: 2699264 >>> 40: 2699264 42: 2696628 44: 2696628 46: 2699264 >>> 48: 2699264 50: 2696628 52: 2696628 54: 2696628 >>> 56: 2696628 58: 2699264 60: 2691356 62: 2696628 >>> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >>> 72: 2696628 74: 2696628 76: 2699264 78: 2696628 >>> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >>> 88: 2625456 90: 2696628 92: 2699264 94: 2696628 >>> 96: 2696628 98: 2696628 100: 2699264 102: 2699264 >>>104: 2699264 106: 2696628 108: 2699264 110: 2696628 >>>112: 2699264 114: 2699264 116: 2696628 118: 2696628 >>>120: 2696628 122: 2699264 124: 2696628 126: 2696628 >>>-->Reading "cpuinfo_cur_freq" >>> 0: 2696628 2: 2696628 4: 2699264 6: 2688720 >>> 8: 2699264 10: 2700000 12: 2696628 14: 2698322 >>> 16: 2699264 18: 
2699264 20: 2696628 22: 2699264 >>> 24: 2699264 26: 2699264 28: 2699264 30: 2699264 >>> 32: 2699264 34: 2693992 36: 2696628 38: 2696628 >>> 40: 2699264 42: 2699264 44: 2699264 46: 2696628 >>> 48: 2696628 50: 2699264 52: 2696628 54: 2696628 >>> 56: 2699264 58: 2699264 60: 2696628 62: 2699264 >>> 64: 2696628 66: 2699264 68: 2696628 70: 2699264 >>> 72: 2696628 74: 2696628 76: 2696628 78: 2693992 >>> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >>> 88: 2696628 90: 2699264 92: 2696628 94: 2699264 >>> 96: 2699264 98: 2696628 100: 2699264 102: 2699264 >>>104: 2691356 106: 2699264 108: 2699264 110: 2699264 >>>112: 2699264 114: 2696628 116: 2699264 118: 2699264 >>>120: 2696628 122: 2696628 124: 2696628 126: 2696628 >>> >>>*B: the result without your changes* >>>-->Reading "scaling_cur_freq" >>> 0: 2698245 2: 2706690 4: 2699649 6: 2702105 >>> 8: 2704362 10: 2697993 12: 2701672 14: 2704362 >>> 16: 2701052 18: 2701052 20: 2694385 22: 2699650 >>> 24: 2706802 26: 2702389 28: 2698299 30: 2698299 >>> 32: 2697333 34: 2697993 36: 2701337 38: 2699328 >>> 40: 2700330 42: 2700330 44: 2698019 46: 2697697 >>> 48: 2699659 50: 2701700 52: 2703401 54: 2701700 >>> 56: 2704013 58: 2697658 60: 2695000 62: 2697666 >>> 64: 2697902 66: 2701052 68: 2698245 70: 2695789 >>> 72: 2701315 74: 2696655 76: 2693666 78: 2695317 >>> 80: 2704912 82: 2699649 84: 2698245 86: 2695454 >>> 88: 2697966 90: 2697959 92: 2699319 94: 2700680 >>> 96: 2695317 98: 2698996 100: 2700000 102: 2700334 >>>104: 2701320 106: 2695065 108: 2700986 110: 2703960 >>>112: 2697635 114: 2704421 116: 2700680 118: 2702040 >>>120: 2700334 122: 2697993 124: 2700334 126: 2705351 >>>-->Reading "cpuinfo_cur_freq" >>> 0: 2696853 2: 2695454 4: 2699649 6: 2706993 >>> 8: 2706060 10: 2704362 12: 2704362 14: 2697658 >>> 16: 2707719 18: 2697192 20: 2702456 22: 2699650 >>> 24: 2705782 26: 2698299 28: 2703061 30: 2705802 >>> 32: 2700000 34: 2700671 36: 2701337 38: 2697658 >>> 40: 2700330 42: 2700330 44: 2699672 46: 2697697 >>> 48: 2703061 50: 
2696610 52: 2692542 54: 2704406 >>> 56: 2695317 58: 2699331 60: 2698996 62: 2702675 >>> 64: 2704912 66: 2703859 68: 2699649 70: 2698596 >>> 72: 2703908 74: 2703355 76: 2697658 78: 2695317 >>> 80: 2702105 82: 2707719 84: 2702105 86: 2699649 >>> 88: 2697966 90: 2691525 92: 2701700 94: 2700680 >>> 96: 2695317 98: 2698996 100: 2698666 102: 2700334 >>>104: 2690429 106: 2707590 108: 2700986 110: 2701320 >>>112: 2696283 114: 2692881 116: 2697627 118: 2704421 >>>120: 2698996 122: 2696321 124: 2696655 126: 2695000 >>> >>So in both cases : whether you use arch_freq_get_on_cpu or not >>(so with and without the patch) you get roughly the same frequencies >>on all cores - or am I missing smth from the dump above ? >The changes in "with/without your changes" I said is your patch >intruduced arch_freq_get_on_cpu. >I just test them according to your requesting. >>And those are reflecting max freq you have provided earlier (?) >I know it is an average frequency for the last tickfor using >arch_freq_get_on_cpu. >I have no any doubt that the freq is maximum value on performance governor. >I just want to say the difference between having or not having your patch. >The frequency values of all cores from cpuinfo_cur_freq and >scaling_cur_freq are almost the same if use this arch_freq_get_on_cpu >on my platform. >However, the frequency values of all cores are different if doesn't >use this arch_freq_get_on_cpu and just use .get(). >>Note that the arch_freq_get_on_cpu will return an average frequency for >>the last tick, so even if your system is roughly idle with your performance >>governor those numbers make sense (some/most of the cores might be idle >>but you will see the last freq the core was running at before going to idle). >>I do not think there is an agreement what should be shown for idle core when >>querying their freq through sysfs. Showing last known freq makes sense, even >>more than waking up core just to try to get one. 
>I'm not opposed to using frequency scale factor to get CPU frequency. >But it better be okay. >> >>@Ionela: Please jump in if I got things wrong. >> >>>>then read 'scaling_cur_freq', doing several reads in some intervals ? >>>It seems that above phenomenon has not a lot to do with reading intervals. >>>>The change has been tested on RD-N2 model (Neoverse N2 ref platform), >>>>it has also been discussed here [1] >>>I doesn't get the testing result on this platform in its thread. >>It might be missing exact numbers but the conclusions should be here [1] >> >>>>>>>*--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access pressure. >>>>>>> 0: 2696628 2: 2696628 4: 2696628 6: 2696628 >>>>>>> 8: 2696628 10: 2696628 12: 2696628 14: 2696628 >>>>>>> 16: 2696628 18: 2696628 20: 2696628 22: 2696628 >>>>>>> 24: 2696628 26: 2696628 28: 2696628 30: 2696628 >>>>>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>>>>> 40: 2696628 42: 2696628 44: 2696628 46: 2696628 >>>>>>> 48: 2696628 50: 2696628 52: 2696628 54: 2696628 >>>>>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>>>>> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >>>>>>> 72: 2696628 74: 2696628 76: 2696628 78: 2696628 >>>>>>> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >>>>>>> 88: 2696628 90: 2696628 92: 2696628 94: 2696628 >>>>>>> 96: 2696628 98: 2696628 100: 2696628 102: 2696628 >>>>>>>104: 2696628 106: 2696628 108: 2696628 110: 2696628 >>>>>>>112: 2696628 114: 2696628 116: 2696628 118: 2696628 >>>>>>>120: 2696628 122: 2696628 124: 2696628 126: 2696628 >>>>>>> >>>>>>>*Case 2: setting nohz_full and cpufreq use ondemand governor* >>>>>>>There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in >>>>>>>/proc/cmdline. >>>>>>Right, so if I remember correctly nohz_full implies rcu_nocbs, so no need to >>>>>>set that one. 
>>>>>>Now, afair, isolcpus will make the selected CPUs to disappear from the >>>>>>schedulers view (no balancing, no migrating), so unless you affine smth >>>>>>explicitly to those CPUs, you will not see much of an activity there. >>>>>Correct. >>>>>>Need to double check though as it has been a while ... >>>>>>>*--> Step 1: *setting ondemand governor to all policy and query >>>>>>>'cpuinfo_cur_freq' in no pressure case. >>>>>>>And the frequency of CPUs all are about 400MHz. >>>>>>>*--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access pressure. >>>>>>>The high memory access pressure is from the command: "stress-ng -c 64 >>>>>>>--cpu-load 100% --taskset 0-63" >>>>>>I'm not entirely convinced that this will affine to isolated cpus, especially >>>>>>that the affinity mask spans all available cpus. If that is the case, no wonder >>>>>>your isolated cpus are getting wasted being idle. But I would have to double >>>>>>check how this is being handled. >>>>>>>The result: >>>>>>> 0: 2696628 1: 400000 2: 400000 3: 400909 >>>>>>> 4: 400000 5: 400000 6: 400000 7: 400000 >>>>>>> 8: 400000 9: 400000 10: 400600 11: 2696628 >>>>>>>12: 2696628 13: 2696628 14: 2696628 15: 2696628 >>>>>>>16: 2696628 17: 2696628 18: 2696628 19: 2696628 >>>>>>>20: 2696628 21: 2696628 22: 2696628 23: 2696628 >>>>>>>24: 2696628 25: 2696628 26: 2696628 27: 2696628 >>>>>>>28: 2696628 29: 2696628 30: 2696628 31: 2696628 >>>>>>>32: 2696628 33: 2696628 34: 2696628 35: 2696628 >>>>>>>36: 2696628 37: 2696628 38: 2696628 39: 2696628 >>>>>>>40: 2696628 41: 400000 42: 400000 43: 400000 >>>>>>>44: 400000 45: 398847 46: 400000 47: 400000 >>>>>>>48: 400000 49: 400000 50: 400000 51: 2696628 >>>>>>>52: 2696628 53: 2696628 54: 2696628 55: 2696628 >>>>>>>56: 2696628 57: 2696628 58: 2696628 59: 2696628 >>>>>>>60: 2696628 61: 2696628 62: 2696628 63: 2699264 >>>>>>> >>>>>>>Note: >>>>>>>(1) The frequency of 1-10 and 41-50 CPUs work on the lowest frequency. >>>>>>> It turned out that nohz full was already work. 
>>>>>>> I guess that stress-ng cannot use the CPU in the range of nohz full.
>>>>>>> Because the CPU frequency will be increased to 2.7G by binding CPU to
>>>>>>>other application.
>>>>>>>(2) The frequency of the nohz full core is calculated by get() callback
>>>>>>>according to ftrace.
>>>>>>It is as there is no sched tick on those, and apparently there is nothing
>>>>>>running on them either.
>>>>>Yes.
>>>>>If we select your approach and the above phenomenon is normal,
>>>>>the large frequency discrepancy issue can be resolved for CPUs with sched
>>>>>tick by the way.
>>>>>But the nohz full cores still have to face this issue. So this patch is also
>>>>>needed.
>>>>>
>>>>Yes, nohz cores full have to be handled by the cpufreq driver.
>>>Correct. So we still have to face the issue in this patch and push this
>>>patch.
>>>Beata, would you please review this patch?
>>Just to clarify for my benefit (apologies but I do have to contex switch
>>pretty often these days): by reviewing this patch do you mean:
>>1) review your changes (if so I think there are few comments already to be
>>addressed, but I can try to have another look)
>Currently, the main comments is that my patch will wake up CPU to get
>frequency.
>BTW, the core's always been wakened up to get the frequency for FFH
>way in cppc_acpi. please see cpc_read_ffh().
>So it may be acceptable. After all, we don't query CPU frequency very often.

Today's implementation of cpc_read_ffh() wakes up the core to read the AMU
counters - this is far from ideal. According to the architecture
specification, the CPU_CYCLES and CNT_CYCLES counters in the AMU do not
increment when the core is in WFI or WFE. If we cache the value of the AMU
counters before a PE goes idle, we may be able to avoid waking up a PE just
to read them. I'm wondering if it makes sense to cache the value in
cpu_do_idle() and return this cached value if idle_cpu() returns true.

>But your patch doesn't meet the non-housekeeping cpus.

For non-housekeeping CPUs, maybe it is better to just invoke the
cpufreq->get() call?

Thanks,
Vanshi

>>2) review changes for AMU-based arch_freq_get_on_cpu ?
>>
>>*note: I will still try to have a look at the non-housekeeping cpus case
>I am very much hope that this issue my patch mentioned can be resolved ASAP.
>So what's your plan about non-housekeeping cpus case?
>>
>>---
>>[1] https://lore.kernel.org/lkml/691d3eb2-cd93-f0fc-a7a4-2a8c0d44262c@nvidia.com/
>>---
>>
>>BR
>>Beata
>>>
>>>/Huisong
>>[...]
>>.
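The caching idea floated in the message above can be sketched as a small user-space model. It is only an illustration under stated assumptions. The counters are assumed not to advance while a CPU is idle, as the message says of WFI/WFE, so a snapshot taken on idle entry stays valid until the CPU wakes. Every name below (`sketch_cpu`, `sketch_enter_idle`, `sketch_read_cached`, and so on) is hypothetical and only stands in for hooks such as cpu_do_idle() and idle_cpu(). None of it is kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical pair of activity-monitor counter values. */
struct amu_sample {
	uint64_t core_cycles;	/* stands in for CPU_CYCLES */
	uint64_t const_cycles;	/* stands in for CNT_CYCLES */
};

/* Per-CPU model state; "live" plays the role of the hardware counters. */
struct sketch_cpu {
	bool idle;
	struct amu_sample live;
	struct amu_sample cached;	/* snapshot taken on idle entry */
	unsigned int forced_wakeups;	/* wakeups caused only by a reader */
};

static struct sketch_cpu sketch_cpus[SKETCH_NR_CPUS];

/* Called where the CPU is about to go idle: snapshot, then mark idle. */
static void sketch_enter_idle(int cpu)
{
	sketch_cpus[cpu].cached = sketch_cpus[cpu].live;
	sketch_cpus[cpu].idle = true;
}

static void sketch_exit_idle(int cpu)
{
	sketch_cpus[cpu].idle = false;
}

/* Baseline behaviour: an idle CPU is woken up just to read its counters. */
static struct amu_sample sketch_read_waking(int cpu)
{
	if (sketch_cpus[cpu].idle) {
		sketch_cpus[cpu].forced_wakeups++;
		sketch_exit_idle(cpu);
	}
	return sketch_cpus[cpu].live;
}

/* Proposed behaviour: an idle CPU is left alone and the snapshot is used. */
static struct amu_sample sketch_read_cached(int cpu)
{
	if (sketch_cpus[cpu].idle)
		return sketch_cpus[cpu].cached;
	return sketch_cpus[cpu].live;
}
```

In this model a reader that uses `sketch_read_cached()` never increments `forced_wakeups`, whereas `sketch_read_waking()` does so for every idle CPU it touches. That is the cost raised earlier in the thread for systems with 128 or more cores.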
On 2024/2/21 0:11, Vanshidhar Konda wrote:
> On Mon, Feb 19, 2024 at 08:15:50PM +0800, lihuisong (C) wrote:
>>
>> On 2024/2/9 18:55, Beata Michalska wrote:
>>> On Tue, Feb 06, 2024 at 04:02:15PM +0800, lihuisong (C) wrote:
>>>> On 2024/2/2 16:08, Beata Michalska wrote:
>>>>> On Wed, Jan 17, 2024 at 05:18:40PM +0800, lihuisong (C) wrote:
>>>>>
>>>>> Hi ,
>>>>>
>>>>> Again, apologies for delay,
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 2024/1/16 22:10, Beata Michalska wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Apologies for jumping in so late....
>>>>>>>
>>>>>>> On Wed, Jan 10, 2024 at 03:09:48PM +0800, lihuisong (C) wrote:
>>>>>>>> Hi Ionela,
>>>>>>>>
>>>>>>>> On 2024/1/8 22:03, Ionela Voinescu wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On Friday 05 Jan 2024 at 15:04:47 (+0800), lihuisong (C) wrote:
>>>>>>>>>> Hi Vanshi,
>>>>>>>>>>
>>>>>>>>>> On 2024/1/5 8:48, Vanshidhar Konda wrote:
>>>>>>>>>>> On Thu, Jan 04, 2024 at 05:36:51PM +0800, lihuisong (C) wrote:
>>>>>>>>>>>> On 2024/1/4 1:53, Ionela Voinescu wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tuesday 12 Dec 2023 at 15:26:17 (+0800), Huisong Li wrote:
>>>>>>>>>>>>>> Many developers found that the cpu current frequency is
>>>>>>>>>>>>>> greater than
>>>>>>>>>>>>>> the maximum frequency of the platform, please see [1],
>>>>>>>>>>>>>> [2] and [3].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the scenarios with high memory access pressure, the
>>>>>>>>>>>>>> patch [1] has
>>>>>>>>>>>>>> proved the significant latency of cpc_read() which is
>>>>>>>>>>>>>> used to obtain
>>>>>>>>>>>>>> delivered and reference performance counter cause an
>>>>>>>>>>>>>> absurd frequency.
>>>>>>>>>>>>>> The sampling interval for this counters is very critical and
>>>>>>>>>>>>>> is expected
>>>>>>>>>>>>>> to be equal. However, the different latency of cpc_read()
>>>>>>>>>>>>>> has a direct
>>>>>>>>>>>>>> impact on their sampling interval.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Would this [1] alternative solution work for you?
>>>>>>>>>>>> It would work for me AFAICS.
>>>>>>>>>>>> Because the "arch_freq_scale" is also from AMU core and >>>>>>>>>>>> constant >>>>>>>>>>>> counter, and read together. >>>>>>>>>>>> But, from their discuss line, it seems that there are some >>>>>>>>>>>> tricky >>>>>>>>>>>> points to clarify or consider. >>>>>>>>>>> I think the changes in [1] would work better when CPUs may >>>>>>>>>>> be idle. With >>>>>>>>>>> this >>>>>>>>>>> patch we would have to wake any core that is in idle state >>>>>>>>>>> to read the >>>>>>>>>>> AMU >>>>>>>>>>> counters. Worst case, if core 0 is trying to read the CPU >>>>>>>>>>> frequency of >>>>>>>>>>> all >>>>>>>>>>> cores, it may need to wake up all the other cores to read >>>>>>>>>>> the AMU >>>>>>>>>>> counters. >>>>>>>>>> From the approach in [1], if all CPUs (one or more cores) >>>>>>>>>> under one policy >>>>>>>>>> are idle, they still cannot be obtained the CPU frequency, >>>>>>>>>> right? >>>>>>>>>> In this case, the [1] API will return 0 and have to back to call >>>>>>>>>> cpufreq_driver->get() for cpuinfo_cur_freq. >>>>>>>>>> Then we still need to face the issue this patch mentioned. >>>>>>>>> With the implementation at [1], arch_freq_get_on_cpu() will >>>>>>>>> not return 0 >>>>>>>>> for idle CPUs and the get() callback will not be called to >>>>>>>>> wake up the >>>>>>>>> CPUs. >>>>>>>> Right, arch_freq_get_on_cpu() will not return 0 for idle CPUs. >>>>>>>> However, for no-housekeeping CPUs, it will return 0 and have to >>>>>>>> call get() >>>>>>>> callback, right? >>>>>>>>> Worst case, arch_freq_get_on_cpu() will return a frequency >>>>>>>>> based on the >>>>>>>>> AMU counter values obtained on the last tick on that CPU. But >>>>>>>>> if that CPU >>>>>>>>> is not a housekeeping CPU, a housekeeping CPU in the same >>>>>>>>> policy will be >>>>>>>>> selected, as it would have had a more recent tick, and >>>>>>>>> therefore a more >>>>>>>>> recent frequency value for the domain. 
>>>>>>>> But this frequency is from the last tick, >>>>>>>> this last tick is probably a long time ago and it doesn't update >>>>>>>> 'arch_freq_scale' for some reasons like CPU dile. >>>>>>>> In addition, I'm not sure if there is possible that >>>>>>>> amu_scale_freq_tick() is >>>>>>>> executed delayed under high stress case. >>>>>>>> It also have an impact on the accuracy of the cpu frequency we >>>>>>>> query. >>>>>>>>> I understand that the frequency returned here will not be up >>>>>>>>> to date, >>>>>>>>> but there's no proper frequency feedback for an idle CPU. If >>>>>>>>> one only >>>>>>>>> wakes up a CPU to sample counters, before the CPU goes back to >>>>>>>>> sleep, >>>>>>>>> the obtained frequency feedback is meaningless. >>>>>>>>> >>>>>>>>>>> For systems with 128 cores or more, this could be very >>>>>>>>>>> expensive and >>>>>>>>>>> happen >>>>>>>>>>> very frequently. >>>>>>>>>>> >>>>>>>>>>> AFAICS, the approach in [1] would avoid this cost. >>>>>>>>>> But the CPU frequency is just an average value for the last >>>>>>>>>> tick period >>>>>>>>>> instead of the current one the CPU actually runs at. >>>>>>>>>> In addition, there are some conditions to use >>>>>>>>>> 'arch_freq_scale' in this >>>>>>>>>> approach. >>>>>>>>> What are the conditions you are referring to? >>>>>>>> It depends on the housekeeping CPUs. >>>>>>>>>> So I'm not sure if this approach can entirely cover the >>>>>>>>>> frequency >>>>>>>>>> discrepancy issue. >>>>>>>>> Unfortunately there is no perfect frequency feedback. By the >>>>>>>>> time you >>>>>>>>> observe/use the value of scaling_cur_freq/cpuinfo_cur_freq, >>>>>>>>> the frequency >>>>>>>>> of the CPU might have already changed. Therefore, an average >>>>>>>>> value might >>>>>>>>> be a better indication of the recent performance level of a CPU. >>>>>>>> An average value for CPU frequency is ok. It may be better if >>>>>>>> it has not any >>>>>>>> delaying. 
>>>>>>>> >>>>>>>> The original implementation for cpuinfo_cur_freq can more >>>>>>>> reflect their >>>>>>>> meaning in the user-guide [1]. The user-guide said: >>>>>>>> "cpuinfo_cur_freq : Current frequency of the CPU as obtained >>>>>>>> from the >>>>>>>> hardware, in KHz. >>>>>>>> This is the frequency the CPU actually runs at." >>>>>>>> >>>>>>>> >>>>>>>> [1]https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt >>>>>>>> >>>>>>>> >>>>>>>>> Would you be able to test [1] on your platform and usecase? >>>>>>>> I has tested it on my platform (CPU number: 64, SMT: off and >>>>>>>> CPU base >>>>>>>> frequency: 2.7GHz). >>>>>>>> Accoding to the testing result, >>>>>>>> 1> I found that patch [1] and [2] cannot cover the no >>>>>>>> housekeeping CPUs. >>>>>>>> They still have to face the large frequency discrepancy issue >>>>>>>> my patch >>>>>>>> mentioned. >>>>>>>> 2> Additionally, the frequency value of all CPUs are almost the >>>>>>>> same by >>>>>>>> using the 'arch_freq_scale' factor way. I'm not sure if it is ok. >>>>>>>> >>>>>>>> The patch [1] has been modified silightly as below: >>>>>>>> --> >>>>>>>> @@ -1756,7 +1756,10 @@ static unsigned int >>>>>>>> cpufreq_verify_current_freq(struct cpufreq_policy *policy, b >>>>>>>> { >>>>>>>> unsigned int new_freq; >>>>>>>> >>>>>>>> - new_freq = cpufreq_driver->get(policy->cpu); >>>>>>>> + new_freq = arch_freq_get_on_cpu(policy->cpu); >>>>>>>> + if (!new_freq) >>>>>>>> + new_freq = cpufreq_driver->get(policy->cpu); >>>>>>>> + >>>>>>> As pointed out this change will not make it to the next version >>>>>>> of the patch. 
>>>>>>> So I'd say you can safely ignore it and assume that >>>>>>> arch_freq_get_on_cpu will >>>>>>> only be wired for sysfs nodes for scaling_cur_freq/cpuinfo_cur_freq >>>>>>>> if (!new_freq) >>>>>>>> return 0; >>>>>>>> >>>>>>>> And the result is as follows: >>>>>>>> *case 1:**No setting the nohz_full and cpufreq use performance >>>>>>>> governor* >>>>>>>> *--> Step1: *read 'cpuinfo_cur_freq' in no pressure >>>>>>>> 0: 2699264 2: 2699264 4: 2699264 6: 2699264 >>>>>>>> 8: 2696628 10: 2696628 12: 2696628 14: 2699264 >>>>>>>> 16: 2699264 18: 2696628 20: 2699264 22: 2696628 >>>>>>>> 24: 2699264 26: 2696628 28: 2699264 30: 2696628 >>>>>>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>>>>>> 40: 2699264 42: 2699264 44: 2696628 46: 2696628 >>>>>>>> 48: 2696628 50: 2699264 52: 2699264 54: 2696628 >>>>>>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>>>>>> 64: 2696628 66: 2699264 68: 2696628 70: 2696628 >>>>>>>> 72: 2699264 74: 2696628 76: 2696628 78: 2699264 >>>>>>>> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >>>>>>>> 88: 2696628 90: 2696628 92: 2696628 94: 2699264 >>>>>>>> 96: 2696628 98: 2699264 100: 2699264 102: 2696628 >>>>>>>> 104: 2699264 106: 2699264 108: 2699264 110: 2696628 >>>>>>>> 112: 2699264 114: 2699264 116: 2699264 118: 2699264 >>>>>>>> 120: 2696628 122: 2699264 124: 2696628 126: 2699264 >>>>>>>> Note: the frequency of all CPUs are almost the same. >>>>>>> Were you expecting smth else ? >>>>>> The frequency of each CPU might have a different value. >>>>>> All value of all CPUs is the same under high pressure. >>>>>> I don't know what the phenomenon is on other platform. >>>>>> Do you know who else tested it? >>>>> So I might have rushed a bit with my previous comment/question: >>>>> apologies for >>>>> that. >>>>> The numbers above: those are on a fairly idle/lightly loaded >>>>> system right? >>>> Yes. 
>>>>> Would you mind having another go with just the arch_freq_get_on_cpu >>>>> implementation beign added and dropping the changes in the cpufreq >>>>> and >>>> All my tests are done when cpufreq policy is "performance" and OS >>>> isn't on a >>>> high load. >>>> Reading "scaling_cur_freq" or "scaling_cur_freq" for each physical >>>> core on >>>> platform >>>> >>>> The testing result for "cpuinfo_cur_freq" with your changes on a >>>> fairly idle >>>> and high loaded system can also be found in this thread. >>>> *A: the result with your changes* >>>> --> Reading "scaling_cur_freq" >>>> 0: 2688720 2: 2696628 4: 2699264 6: 2696628 >>>> 8: 2699264 10: 2696628 12: 2699264 14: 2699264 >>>> 16: 2699264 18: 2696628 20: 2696628 22: 2696628 >>>> 24: 2699264 26: 2696628 28: 2696628 30: 2696628 >>>> 32: 2699264 34: 2691356 36: 2696628 38: 2699264 >>>> 40: 2699264 42: 2696628 44: 2696628 46: 2699264 >>>> 48: 2699264 50: 2696628 52: 2696628 54: 2696628 >>>> 56: 2696628 58: 2699264 60: 2691356 62: 2696628 >>>> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >>>> 72: 2696628 74: 2696628 76: 2699264 78: 2696628 >>>> 80: 2696628 82: 2696628 84: 2699264 86: 2696628 >>>> 88: 2625456 90: 2696628 92: 2699264 94: 2696628 >>>> 96: 2696628 98: 2696628 100: 2699264 102: 2699264 >>>> 104: 2699264 106: 2696628 108: 2699264 110: 2696628 >>>> 112: 2699264 114: 2699264 116: 2696628 118: 2696628 >>>> 120: 2696628 122: 2699264 124: 2696628 126: 2696628 >>>> -->Reading "cpuinfo_cur_freq" >>>> 0: 2696628 2: 2696628 4: 2699264 6: 2688720 >>>> 8: 2699264 10: 2700000 12: 2696628 14: 2698322 >>>> 16: 2699264 18: 2699264 20: 2696628 22: 2699264 >>>> 24: 2699264 26: 2699264 28: 2699264 30: 2699264 >>>> 32: 2699264 34: 2693992 36: 2696628 38: 2696628 >>>> 40: 2699264 42: 2699264 44: 2699264 46: 2696628 >>>> 48: 2696628 50: 2699264 52: 2696628 54: 2696628 >>>> 56: 2699264 58: 2699264 60: 2696628 62: 2699264 >>>> 64: 2696628 66: 2699264 68: 2696628 70: 2699264 >>>> 72: 2696628 74: 2696628 76: 2696628 78: 
2693992 >>>> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >>>> 88: 2696628 90: 2699264 92: 2696628 94: 2699264 >>>> 96: 2699264 98: 2696628 100: 2699264 102: 2699264 >>>> 104: 2691356 106: 2699264 108: 2699264 110: 2699264 >>>> 112: 2699264 114: 2696628 116: 2699264 118: 2699264 >>>> 120: 2696628 122: 2696628 124: 2696628 126: 2696628 >>>> >>>> *B: the result without your changes* >>>> -->Reading "scaling_cur_freq" >>>> 0: 2698245 2: 2706690 4: 2699649 6: 2702105 >>>> 8: 2704362 10: 2697993 12: 2701672 14: 2704362 >>>> 16: 2701052 18: 2701052 20: 2694385 22: 2699650 >>>> 24: 2706802 26: 2702389 28: 2698299 30: 2698299 >>>> 32: 2697333 34: 2697993 36: 2701337 38: 2699328 >>>> 40: 2700330 42: 2700330 44: 2698019 46: 2697697 >>>> 48: 2699659 50: 2701700 52: 2703401 54: 2701700 >>>> 56: 2704013 58: 2697658 60: 2695000 62: 2697666 >>>> 64: 2697902 66: 2701052 68: 2698245 70: 2695789 >>>> 72: 2701315 74: 2696655 76: 2693666 78: 2695317 >>>> 80: 2704912 82: 2699649 84: 2698245 86: 2695454 >>>> 88: 2697966 90: 2697959 92: 2699319 94: 2700680 >>>> 96: 2695317 98: 2698996 100: 2700000 102: 2700334 >>>> 104: 2701320 106: 2695065 108: 2700986 110: 2703960 >>>> 112: 2697635 114: 2704421 116: 2700680 118: 2702040 >>>> 120: 2700334 122: 2697993 124: 2700334 126: 2705351 >>>> -->Reading "cpuinfo_cur_freq" >>>> 0: 2696853 2: 2695454 4: 2699649 6: 2706993 >>>> 8: 2706060 10: 2704362 12: 2704362 14: 2697658 >>>> 16: 2707719 18: 2697192 20: 2702456 22: 2699650 >>>> 24: 2705782 26: 2698299 28: 2703061 30: 2705802 >>>> 32: 2700000 34: 2700671 36: 2701337 38: 2697658 >>>> 40: 2700330 42: 2700330 44: 2699672 46: 2697697 >>>> 48: 2703061 50: 2696610 52: 2692542 54: 2704406 >>>> 56: 2695317 58: 2699331 60: 2698996 62: 2702675 >>>> 64: 2704912 66: 2703859 68: 2699649 70: 2698596 >>>> 72: 2703908 74: 2703355 76: 2697658 78: 2695317 >>>> 80: 2702105 82: 2707719 84: 2702105 86: 2699649 >>>> 88: 2697966 90: 2691525 92: 2701700 94: 2700680 >>>> 96: 2695317 98: 2698996 100: 2698666 102: 2700334 
>>>> 104: 2690429 106: 2707590 108: 2700986 110: 2701320 >>>> 112: 2696283 114: 2692881 116: 2697627 118: 2704421 >>>> 120: 2698996 122: 2696321 124: 2696655 126: 2695000 >>>> >>> So in both cases : whether you use arch_freq_get_on_cpu or not >>> (so with and without the patch) you get roughly the same frequencies >>> on all cores - or am I missing smth from the dump above ? >> The changes in "with/without your changes" I said is your patch >> intruduced arch_freq_get_on_cpu. >> I just test them according to your requesting. >>> And those are reflecting max freq you have provided earlier (?) >> I know it is an average frequency for the last tickfor using >> arch_freq_get_on_cpu. >> I have no any doubt that the freq is maximum value on performance >> governor. >> I just want to say the difference between having or not having your >> patch. >> The frequency values of all cores from cpuinfo_cur_freq and >> scaling_cur_freq are almost the same if use this arch_freq_get_on_cpu >> on my platform. >> However, the frequency values of all cores are different if doesn't >> use this arch_freq_get_on_cpu and just use .get(). >>> Note that the arch_freq_get_on_cpu will return an average frequency for >>> the last tick, so even if your system is roughly idle with your >>> performance >>> governor those numbers make sense (some/most of the cores might be idle >>> but you will see the last freq the core was running at before going >>> to idle). >>> I do not think there is an agreement what should be shown for idle >>> core when >>> querying their freq through sysfs. Showing last known freq makes >>> sense, even >>> more than waking up core just to try to get one. >> I'm not opposed to using frequency scale factor to get CPU frequency. >> But it better be okay. >>> >>> @Ionela: Please jump in if I got things wrong. >>> >>>>> then read 'scaling_cur_freq', doing several reads in some intervals ? >>>> It seems that above phenomenon has not a lot to do with reading >>>> intervals. 
>>>>> The change has been tested on RD-N2 model (Neoverse N2 ref platform), >>>>> it has also been discussed here [1] >>>> I doesn't get the testing result on this platform in its thread. >>> It might be missing exact numbers but the conclusions should be here >>> [1] >>> >>>>>>>> *--> Step 2: *read 'cpuinfo_cur_freq' in the high memory access >>>>>>>> pressure. >>>>>>>> 0: 2696628 2: 2696628 4: 2696628 6: 2696628 >>>>>>>> 8: 2696628 10: 2696628 12: 2696628 14: 2696628 >>>>>>>> 16: 2696628 18: 2696628 20: 2696628 22: 2696628 >>>>>>>> 24: 2696628 26: 2696628 28: 2696628 30: 2696628 >>>>>>>> 32: 2696628 34: 2696628 36: 2696628 38: 2696628 >>>>>>>> 40: 2696628 42: 2696628 44: 2696628 46: 2696628 >>>>>>>> 48: 2696628 50: 2696628 52: 2696628 54: 2696628 >>>>>>>> 56: 2696628 58: 2696628 60: 2696628 62: 2696628 >>>>>>>> 64: 2696628 66: 2696628 68: 2696628 70: 2696628 >>>>>>>> 72: 2696628 74: 2696628 76: 2696628 78: 2696628 >>>>>>>> 80: 2696628 82: 2696628 84: 2696628 86: 2696628 >>>>>>>> 88: 2696628 90: 2696628 92: 2696628 94: 2696628 >>>>>>>> 96: 2696628 98: 2696628 100: 2696628 102: 2696628 >>>>>>>> 104: 2696628 106: 2696628 108: 2696628 110: 2696628 >>>>>>>> 112: 2696628 114: 2696628 116: 2696628 118: 2696628 >>>>>>>> 120: 2696628 122: 2696628 124: 2696628 126: 2696628 >>>>>>>> >>>>>>>> *Case 2: setting nohz_full and cpufreq use ondemand governor* >>>>>>>> There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 >>>>>>>> rcu_nocbs=1-10,41-50" in >>>>>>>> /proc/cmdline. >>>>>>> Right, so if I remember correctly nohz_full implies rcu_nocbs, >>>>>>> so no need to >>>>>>> set that one. >>>>>>> Now, afair, isolcpus will make the selected CPUs to disappear >>>>>>> from the >>>>>>> schedulers view (no balancing, no migrating), so unless you >>>>>>> affine smth >>>>>>> explicitly to those CPUs, you will not see much of an activity >>>>>>> there. >>>>>> Correct. >>>>>>> Need to double check though as it has been a while ... 
>>>>>>>> *--> Step 1: *setting ondemand governor to all policy and query >>>>>>>> 'cpuinfo_cur_freq' in no pressure case. >>>>>>>> And the frequency of CPUs all are about 400MHz. >>>>>>>> *--> Step 2:* read 'cpuinfo_cur_freq' in the high memory access >>>>>>>> pressure. >>>>>>>> The high memory access pressure is from the command: "stress-ng >>>>>>>> -c 64 >>>>>>>> --cpu-load 100% --taskset 0-63" >>>>>>> I'm not entirely convinced that this will affine to isolated >>>>>>> cpus, especially >>>>>>> that the affinity mask spans all available cpus. If that is the >>>>>>> case, no wonder >>>>>>> your isolated cpus are getting wasted being idle. But I would >>>>>>> have to double >>>>>>> check how this is being handled. >>>>>>>> The result: >>>>>>>> 0: 2696628 1: 400000 2: 400000 3: 400909 >>>>>>>> 4: 400000 5: 400000 6: 400000 7: 400000 >>>>>>>> 8: 400000 9: 400000 10: 400600 11: 2696628 >>>>>>>> 12: 2696628 13: 2696628 14: 2696628 15: 2696628 >>>>>>>> 16: 2696628 17: 2696628 18: 2696628 19: 2696628 >>>>>>>> 20: 2696628 21: 2696628 22: 2696628 23: 2696628 >>>>>>>> 24: 2696628 25: 2696628 26: 2696628 27: 2696628 >>>>>>>> 28: 2696628 29: 2696628 30: 2696628 31: 2696628 >>>>>>>> 32: 2696628 33: 2696628 34: 2696628 35: 2696628 >>>>>>>> 36: 2696628 37: 2696628 38: 2696628 39: 2696628 >>>>>>>> 40: 2696628 41: 400000 42: 400000 43: 400000 >>>>>>>> 44: 400000 45: 398847 46: 400000 47: 400000 >>>>>>>> 48: 400000 49: 400000 50: 400000 51: 2696628 >>>>>>>> 52: 2696628 53: 2696628 54: 2696628 55: 2696628 >>>>>>>> 56: 2696628 57: 2696628 58: 2696628 59: 2696628 >>>>>>>> 60: 2696628 61: 2696628 62: 2696628 63: 2699264 >>>>>>>> >>>>>>>> Note: >>>>>>>> (1) The frequency of 1-10 and 41-50 CPUs work on the lowest >>>>>>>> frequency. >>>>>>>> It turned out that nohz full was already work. >>>>>>>> I guess that stress-ng cannot use the CPU in the range >>>>>>>> of nohz full. >>>>>>>> Because the CPU frequency will be increased to 2.7G by >>>>>>>> binding CPU to >>>>>>>> other application. 
>>>>>>>> (2) The frequency of the nohz_full cores is calculated by the get()
>>>>>>>> callback, according to ftrace.
>>>>>>> That is because there is no sched tick on those, and apparently there
>>>>>>> is nothing running on them either.
>>>>>> Yes.
>>>>>> If we select your approach and the above phenomenon is normal,
>>>>>> the large frequency discrepancy issue can be resolved for CPUs
>>>>>> with a sched tick as a side effect.
>>>>>> But the nohz_full cores still have to face this issue, so this
>>>>>> patch is also needed.
>>>>>>
>>>>> Yes, nohz_full cores have to be handled by the cpufreq driver.
>>>> Correct. So we still have to face the issue addressed by this patch
>>>> and push this patch.
>>>> Beata, would you please review this patch?
>>> Just to clarify for my benefit (apologies, but I do have to context
>>> switch pretty often these days): by reviewing this patch do you mean:
>>> 1) review your changes (if so, I think there are a few comments already
>>> to be addressed, but I can try to have another look)
>> Currently, the main comment is that my patch will wake up the CPU to get
>> the frequency.
>> BTW, the core has always been woken up to get the frequency for the FFH
>> way in cppc_acpi. Please see cpc_read_ffh().
>> So it may be acceptable. After all, we don't query the CPU frequency very
>> often.
>
> Today's implementation of cpc_read_ffh() wakes up the core to read AMU
> counters - this is far from ideal. According to the architecture
> specification the CPU_CYCLES and CNT_CYCLES counters in AMU do not
> increment when the core is in WFI or WFE. If we cache the value of the
> AMU counter before a PE goes idle, we may be able to avoid waking up a
> PE just to read the AMU counters. I'm wondering if it makes sense to
> cache the value in cpu_do_idle() and return this cached value if
> idle_cpu() returns true.

That might only be useful for the idle state entered via WFI, right?
What about other idle states? It would be a little complex.
The 'cpuinfo_cur_freq' node is feedback for the CPU frequency, and we
first need to ensure that it works correctly.
What's more, I guess that users don't query the CPU frequency very often
when the system is idle, because it is not the center of attention then.
From this point of view, it is acceptable to wake up the core to read the
AMU counters and fundamentally resolve this issue.
What do you think?
>
>> But your patch doesn't cover the non-housekeeping CPUs.
>
> For non-housekeeping CPUs maybe it is better to just invoke the
> cpufreq->get() call?

Then we're still going to have this issue.
>
> Thanks,
> Vanshi
>
>>> 2) review changes for AMU-based arch_freq_get_on_cpu ?
>>>
>>> *note: I will still try to have a look at the non-housekeeping cpus
>>> case
>> I very much hope that the issue my patch mentions can be
>> resolved ASAP.
>> So what's your plan for the non-housekeeping CPUs case?
>>>
>>> ---
>>> [1]
>>> https://lore.kernel.org/lkml/691d3eb2-cd93-f0fc-a7a4-2a8c0d44262c@nvidia.com/
>>> ---
>>>
>>> BR
>>> Beata
>>>>
>>>> /Huisong
>>> [...]
>>> .
> .
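The caching idea raised in the quoted reply can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct sim_cpu` and the `sim_*` helpers are hypothetical names, not kernel APIs, and the model assumes the counters do not advance while the CPU is idle (which the quoted text states for WFI/WFE only; deeper idle states are exactly the open question above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one CPU's AMU counters; not a kernel structure. */
struct sim_cpu {
	uint64_t corecnt;	/* live core-cycle counter */
	uint64_t constcnt;	/* live constant-rate counter */
	uint64_t cached_core;	/* snapshot taken on idle entry */
	uint64_t cached_const;
	bool idle;
	unsigned int wakeups;	/* how often a read had to wake the CPU */
};

/* Advance the live counters; only a running CPU counts (assumption). */
static void sim_run(struct sim_cpu *c, uint64_t core, uint64_t cnst)
{
	assert(!c->idle);
	c->corecnt += core;
	c->constcnt += cnst;
}

/* Take the snapshot right before going idle. */
static void sim_enter_idle(struct sim_cpu *c)
{
	c->cached_core = c->corecnt;
	c->cached_const = c->constcnt;
	c->idle = true;
}

static void sim_exit_idle(struct sim_cpu *c)
{
	c->idle = false;
}

/*
 * Read both counters. With use_cache set, an idle CPU is served from the
 * snapshot; otherwise the read counts as a wakeup, as in the current
 * behaviour described in the thread.
 */
static void sim_read(struct sim_cpu *c, bool use_cache,
		     uint64_t *core, uint64_t *cnst)
{
	if (c->idle && use_cache) {
		*core = c->cached_core;
		*cnst = c->cached_const;
		return;
	}
	if (c->idle)
		c->wakeups++;
	*core = c->corecnt;
	*cnst = c->constcnt;
}
```

In this model the cached and live values agree for an idle CPU, so the only difference is whether a wakeup is charged. Whether that holds on real hardware for idle states other than WFI is not established in this thread.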
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 7d37e458e2f5..c3122154d738 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -299,6 +299,11 @@ core_initcall(init_amu_fie);
 #ifdef CONFIG_ACPI_CPPC_LIB
 #include <acpi/cppc_acpi.h>
 
+struct amu_counters {
+	u64 corecnt;
+	u64 constcnt;
+};
+
 static void cpu_read_corecnt(void *val)
 {
 	/*
@@ -322,8 +327,27 @@ static void cpu_read_constcnt(void *val)
 		      0UL : read_constcnt();
 }
 
+static void cpu_read_amu_counters(void *data)
+{
+	struct amu_counters *cnt = (struct amu_counters *)data;
+
+	/*
+	 * The running time of the this_cpu_has_cap() might have a couple of
+	 * microseconds and is significantly increased to tens of microseconds.
+	 * But AMU core and constant counter need to be read togeter without any
+	 * time interval to reduce the calculation discrepancy using this counters.
+	 */
+	if (this_cpu_has_cap(ARM64_WORKAROUND_2457168)) {
+		cnt->corecnt = read_corecnt();
+		cnt->constcnt = 0;
+	} else {
+		cnt->corecnt = read_corecnt();
+		cnt->constcnt = read_constcnt();
+	}
+}
+
 static inline
-int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
+int counters_read_on_cpu(int cpu, smp_call_func_t func, void *data)
 {
 	/*
 	 * Abort call on counterless CPU or when interrupts are
@@ -335,7 +359,7 @@ int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
 	if (WARN_ON_ONCE(irqs_disabled()))
 		return -EPERM;
 
-	smp_call_function_single(cpu, func, val, 1);
+	smp_call_function_single(cpu, func, data, 1);
 
 	return 0;
 }
@@ -364,6 +388,21 @@ bool cpc_ffh_supported(void)
 	return true;
 }
 
+int cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference)
+{
+	struct amu_counters cnts = {0};
+	int ret;
+
+	ret = counters_read_on_cpu(cpu, cpu_read_amu_counters, &cnts);
+	if (ret)
+		return ret;
+
+	*delivered = cnts.corecnt;
+	*reference = cnts.constcnt;
+
+	return 0;
+}
+
 int cpc_read_ffh(int cpu, struct cpc_reg *reg, u64 *val)
 {
 	int ret = -EOPNOTSUPP;
diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index 7ff269a78c20..f303fabd7cfe 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -1299,6 +1299,11 @@ bool cppc_perf_ctrs_in_pcc(void)
 }
 EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
 
+int __weak cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference)
+{
+	return 0;
+}
+
 /**
  * cppc_get_perf_ctrs - Read a CPU's performance feedback counters.
  * @cpunum: CPU from which to read counters.
@@ -1313,7 +1318,8 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
 		*ref_perf_reg, *ctr_wrap_reg;
 	int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
 	struct cppc_pcc_data *pcc_ss_data = NULL;
-	u64 delivered, reference, ref_perf, ctr_wrap_time;
+	u64 delivered = 0, reference = 0;
+	u64 ref_perf, ctr_wrap_time;
 	int ret = 0, regs_in_pcc = 0;
 
 	if (!cpc_desc) {
@@ -1350,8 +1356,18 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
 		}
 	}
 
-	cpc_read(cpunum, delivered_reg, &delivered);
-	cpc_read(cpunum, reference_reg, &reference);
+	if (cpc_ffh_supported()) {
+		ret = cpc_read_arch_counters_on_cpu(cpunum, &delivered, &reference);
+		if (ret) {
+			pr_debug("read arch counters failed, ret=%d.\n", ret);
+			ret = 0;
+		}
+	}
+	if (!delivered || !reference) {
+		cpc_read(cpunum, delivered_reg, &delivered);
+		cpc_read(cpunum, reference_reg, &reference);
+	}
+
 	cpc_read(cpunum, ref_perf_reg, &ref_perf);
 
 	/*
diff --git a/include/acpi/cppc_acpi.h b/include/acpi/cppc_acpi.h
index 6126c977ece0..07d4fd82d499 100644
--- a/include/acpi/cppc_acpi.h
+++ b/include/acpi/cppc_acpi.h
@@ -152,6 +152,7 @@ extern bool cpc_ffh_supported(void);
 extern bool cpc_supported_by_cpu(void);
 extern int cpc_read_ffh(int cpunum, struct cpc_reg *reg, u64 *val);
 extern int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val);
+extern int cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference);
 extern int cppc_get_epp_perf(int cpunum, u64 *epp_perf);
 extern int cppc_set_epp_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls, bool enable);
 extern int cppc_get_auto_sel_caps(int cpunum, struct cppc_perf_caps *perf_caps);
@@ -209,6 +210,10 @@ static inline int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val)
 {
 	return -ENOTSUPP;
 }
+static inline int cpc_read_arch_counters_on_cpu(int cpu, u64 *delivered, u64 *reference)
+{
+	return -EOPNOTSUPP;
+}
 static inline int cppc_set_epp_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls, bool enable)
 {
 	return -ENOTSUPP;
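The control flow that the cppc_get_perf_ctrs() hunk introduces (try one combined read, fall back to two separate reads when either value is still zero) can be sketched in user space as follows. The function-pointer types, `get_feedback_ctrs()` and the three stub readers are hypothetical names made up for this illustration; they only stand in for the arch hook and for cpc_read().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

enum { CTR_DELIVERED, CTR_REFERENCE };

/* Hypothetical stand-in for a combined read of both counters. */
typedef int (*read_pair_fn)(int cpu, u64 *delivered, u64 *reference);
/* Hypothetical stand-in for a single-register read such as cpc_read(). */
typedef int (*read_one_fn)(int cpu, int which, u64 *val);

static int get_feedback_ctrs(int cpu, read_pair_fn read_pair,
			     read_one_fn read_one,
			     u64 *delivered, u64 *reference)
{
	int ret;

	assert(delivered && reference && read_one);
	*delivered = 0;
	*reference = 0;

	/* Preferred path: both counters sampled in one call. */
	if (read_pair)
		(void)read_pair(cpu, delivered, reference);

	/* Fallback: either value still zero means the pair read was unusable. */
	if (!*delivered || !*reference) {
		ret = read_one(cpu, CTR_DELIVERED, delivered);
		if (ret)
			return ret;
		return read_one(cpu, CTR_REFERENCE, reference);
	}
	return 0;
}

/* Stub: a working combined read. */
static int pair_ok(int cpu, u64 *d, u64 *r)
{
	(void)cpu;
	*d = 5400;
	*r = 200;
	return 0;
}

/* Stub: reports success but fills nothing in, like the __weak default. */
static int pair_noop(int cpu, u64 *d, u64 *r)
{
	(void)cpu; (void)d; (void)r;
	return 0;
}

/* Stub: per-register read returning fixed, recognizable values. */
static int one_reg(int cpu, int which, u64 *val)
{
	(void)cpu;
	*val = (which == CTR_DELIVERED) ? 1111 : 2222;
	return 0;
}
```

Calling `get_feedback_ctrs(0, pair_noop, one_reg, &d, &r)` takes the fallback path, which mirrors why the patch can let the `__weak` default return 0: the zero-initialized outputs are what trigger the two cpc_read() calls. The same applies when the constant counter is reported as 0 under the erratum workaround.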
Many developers have found that the reported current CPU frequency is
greater than the maximum frequency of the platform; please see [1], [2]
and [3].

In scenarios with high memory access pressure, patch [1] has shown that
the significant latency of cpc_read(), which is used to obtain the
delivered and reference performance counters, causes an absurd frequency.
The sampling intervals for these two counters are critical and are
expected to be equal. However, the varying latency of cpc_read() has a
direct impact on their sampling intervals.

This patch adds an interface, cpc_read_arch_counters_on_cpu(), to read the
delivered and reference performance counters together. According to my
test [4], the discrepancy of the current CPU frequency under high memory
access pressure generated by stress-ng is lower than 0.2%.

[1] https://lore.kernel.org/all/20231025093847.3740104-4-zengheng4@huawei.com/
[2] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
[3] https://lore.kernel.org/all/20230418113459.12860-7-sumitg@nvidia.com/

[4] My local test:
The testing platform enables SMT and includes 128 logical CPUs in total,
and the CPU base frequency is 2.7GHz. "cpuinfo_cur_freq" was read for each
physical core on the platform under high memory access pressure from
stress-ng, and the output is as follows:
  0: 2699133     2: 2699942     4: 2698189     6: 2704347
  8: 2704009    10: 2696277    12: 2702016    14: 2701388
 16: 2700358    18: 2696741    20: 2700091    22: 2700122
 24: 2701713    26: 2702025    28: 2699816    30: 2700121
 32: 2700000    34: 2699788    36: 2698884    38: 2699109
 40: 2704494    42: 2698350    44: 2699997    46: 2701023
 48: 2703448    50: 2699501    52: 2700000    54: 2699999
 56: 2702645    58: 2696923    60: 2697718    62: 2700547
 64: 2700313    66: 2700000    68: 2699904    70: 2699259
 72: 2699511    74: 2700644    76: 2702201    78: 2700000
 80: 2700776    82: 2700364    84: 2702674    86: 2700255
 88: 2699886    90: 2700359    92: 2699662    94: 2696188
 96: 2705454    98: 2699260   100: 2701097   102: 2699630
104: 2700463   106: 2698408   108: 2697766   110: 2701181
112: 2699166   114: 2701804   116: 2701907   118: 2701973
120: 2699584   122: 2700474   124: 2700768   126: 2701963

Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
 arch/arm64/kernel/topology.c | 43 ++++++++++++++++++++++++++++++++++--
 drivers/acpi/cppc_acpi.c     | 22 +++++++++++++++---
 include/acpi/cppc_acpi.h     |  5 +++++
 3 files changed, 65 insertions(+), 5 deletions(-)
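Why unequal sampling intervals produce an absurd frequency can be shown with a little arithmetic. The following is a simplified user-space model, not the driver's code: it assumes the frequency is derived as ref_freq * delta_delivered / delta_reference, and the 100 MHz reference rate, the 2 us sampling window, and the names `fb_sample`, `sample_at()` and `freq_khz()` are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One snapshot of the two feedback counters (illustrative type). */
struct fb_sample {
	uint64_t delivered;	/* counts at the actual core frequency */
	uint64_t reference;	/* counts at a fixed reference frequency */
};

/*
 * Model a snapshot taken at time t_ns in which the reference counter is
 * read skew_ns later than the delivered counter (the extra read latency).
 * cycles = f_khz * t_ns / 1e6
 */
static struct fb_sample sample_at(uint64_t t_ns, uint64_t skew_ns,
				  uint64_t core_khz, uint64_t ref_khz)
{
	struct fb_sample s;

	s.delivered = core_khz * t_ns / 1000000;
	s.reference = ref_khz * (t_ns + skew_ns) / 1000000;
	return s;
}

/* freq = ref_freq * delta_delivered / delta_reference; 0 if no progress. */
static uint64_t freq_khz(const struct fb_sample *t0,
			 const struct fb_sample *t1, uint64_t ref_khz)
{
	uint64_t dd, dr;

	assert(t0 && t1);
	dd = t1->delivered - t0->delivered;
	dr = t1->reference - t0->reference;
	return dr ? ref_khz * dd / dr : 0;
}
```

With a 2.7 GHz core and two snapshots 2 us apart, zero skew in both snapshots gives 2700000 kHz, and so does an equal 1 us skew in both. A 1 us skew in the first snapshot only halves the reference window and gives 5400000 kHz, twice the real frequency. Reading both counters back to back in one call on the target CPU is aimed at keeping the two windows equal.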