Message ID | 1974978.nRy8TqEeLZ@kreacher |
---|---|
State | New |
Headers | show |
Series | [RFT,v1] cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known | expand |
On 2/15/21 1:24 PM, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover > boost frequencies") attempted to address a performance issue involving > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by > extending the frequency tables created by acpi-cpufreq to cover the > entire range of "turbo" (or "boost") frequencies, but that caused > frequencies reported via /proc/cpuinfo and the scaling_cur_freq > attribute in sysfs to change which may confuse users and monitoring > tools. > > For this reason, revert the part of commit 3c55e94c0ade adding the > extra entry to the frequency table and use the observation that > in principle cpuinfo.max_freq need not be equal to the maximum > frequency listed in the frequency table for the given policy. > > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq > drivers to set their own cpuinfo.max_freq above that frequency and > change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost > frequency found via CPPC. > > This should be sufficient to let all of the cpufreq subsystem know > the real maximum frequency of the CPU without changing frequency > reporting. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies") > Reported-by: Matt McDonald <gardotd426@gmail.com> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Michael, Giovanni, > > The fix for the EPYC performance regression that was merged into 5.11 introduced > an undesirable side-effect by distorting the CPU frequency reporting via > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). > > The patch below is reported to address this problem and it should still allow > schedutil to achieve desirable performance, because it simply sets > cpuinfo.max_freq without extending the frequency table of the CPU. > > Please test this one and let me know if it adversely affects performance. > > Thanks! When carrying out tests so far today on an EPYC 7F72 2P and Ryzen 9 5900X with workloads seeing impact from the prior patches, everything is looking good when comparing v5.11 to v5.11 + this patch. Not seeing any measurable difference on either of those systems as a result of this patch. Running some additional tests and on a few more boxes that should wrap up tomorrow but at least so far the patch isn't showing any measurable changes to performance. Michael > > --- > drivers/cpufreq/acpi-cpufreq.c | 62 ++++++++++------------------------------- > drivers/cpufreq/freq_table.c | 8 ++++- > 2 files changed, 23 insertions(+), 47 deletions(-) > > Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c > =================================================================== > --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c > +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c > @@ -54,7 +54,6 @@ struct acpi_cpufreq_data { > unsigned int resume; > unsigned int cpu_feature; > unsigned int acpi_perf_cpu; > - unsigned int first_perf_state; > cpumask_var_t freqdomain_cpus; > void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val); > u32 (*cpu_freq_read)(struct acpi_pct_register *reg); > @@ -223,10 +222,10 @@ static unsigned extract_msr(struct cpufr > > perf = to_perf_data(data); > > - cpufreq_for_each_entry(pos, policy->freq_table + data->first_perf_state) > + cpufreq_for_each_entry(pos, policy->freq_table) > if (msr == perf->states[pos->driver_data].status) > return pos->frequency; > - return policy->freq_table[data->first_perf_state].frequency; > + return policy->freq_table[0].frequency; > } > > static unsigned extract_freq(struct cpufreq_policy *policy, u32 val) > @@ -365,7 +364,6 @@ static unsigned int get_cur_freq_on_cpu( > struct cpufreq_policy *policy; > unsigned int freq; > unsigned int cached_freq; > - unsigned int state; > > pr_debug("%s (%d)\n", __func__, cpu); > > @@ -377,11 +375,7 @@ static unsigned int get_cur_freq_on_cpu( > if (unlikely(!data || !policy->freq_table)) > return 0; > > - state = to_perf_data(data)->state; > - if (state < data->first_perf_state) > - state = data->first_perf_state; > - > - cached_freq = policy->freq_table[state].frequency; > + cached_freq = policy->freq_table[to_perf_data(data)->state].frequency; > freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data)); > if (freq != cached_freq) { > /* > @@ -680,7 +674,6 @@ static int acpi_cpufreq_cpu_init(struct > struct cpuinfo_x86 *c = &cpu_data(cpu); > unsigned int valid_states = 0; > unsigned int result = 0; > - unsigned int state_count; > u64 max_boost_ratio; > unsigned int i; > #ifdef CONFIG_SMP > @@ -795,28 +788,8 @@ static int acpi_cpufreq_cpu_init(struct > goto err_unreg; > } > > - state_count = perf->state_count + 1; > - > - max_boost_ratio = get_max_boost_ratio(cpu); > - if (max_boost_ratio) { > - /* > - * Make a room for one more entry to represent the highest > - * available "boost" frequency. > - */ > - state_count++; > - valid_states++; > - data->first_perf_state = valid_states; > - } else { > - /* > - * If the maximum "boost" frequency is unknown, ask the arch > - * scale-invariance code to use the "nominal" performance for > - * CPU utilization scaling so as to prevent the schedutil > - * governor from selecting inadequate CPU frequencies. > - */ > - arch_set_max_freq_ratio(true); > - } > - > - freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL); > + freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table), > + GFP_KERNEL); > if (!freq_table) { > result = -ENOMEM; > goto err_unreg; > @@ -851,27 +824,25 @@ static int acpi_cpufreq_cpu_init(struct > } > freq_table[valid_states].frequency = CPUFREQ_TABLE_END; > > + max_boost_ratio = get_max_boost_ratio(cpu); > if (max_boost_ratio) { > - unsigned int state = data->first_perf_state; > - unsigned int freq = freq_table[state].frequency; > + unsigned int freq = freq_table[0].frequency; > > /* > * Because the loop above sorts the freq_table entries in the > * descending order, freq is the maximum frequency in the table. > * Assume that it corresponds to the CPPC nominal frequency and > - * use it to populate the frequency field of the extra "boost" > - * frequency entry. > + * use it to set cpuinfo.max_freq. > */ > - freq_table[0].frequency = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT; > + policy->cpuinfo.max_freq = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT; > + } else { > /* > - * The purpose of the extra "boost" frequency entry is to make > - * the rest of cpufreq aware of the real maximum frequency, but > - * the way to request it is the same as for the first_perf_state > - * entry that is expected to cover the entire range of "boost" > - * frequencies of the CPU, so copy the driver_data value from > - * that entry. > + * If the maximum "boost" frequency is unknown, ask the arch > + * scale-invariance code to use the "nominal" performance for > + * CPU utilization scaling so as to prevent the schedutil > + * governor from selecting inadequate CPU frequencies. > */ > - freq_table[0].driver_data = freq_table[state].driver_data; > + arch_set_max_freq_ratio(true); > } > > policy->freq_table = freq_table; > @@ -947,8 +918,7 @@ static void acpi_cpufreq_cpu_ready(struc > { > struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data, > policy->cpu); > - struct acpi_cpufreq_data *data = policy->driver_data; > - unsigned int freq = policy->freq_table[data->first_perf_state].frequency; > + unsigned int freq = policy->freq_table[0].frequency; > > if (perf->states[0].core_frequency * 1000 != freq) > pr_warn(FW_WARN "P-state 0 is not max freq\n"); > Index: linux-pm/drivers/cpufreq/freq_table.c > =================================================================== > --- linux-pm.orig/drivers/cpufreq/freq_table.c > +++ linux-pm/drivers/cpufreq/freq_table.c > @@ -52,7 +52,13 @@ int cpufreq_frequency_table_cpuinfo(stru > } > > policy->min = policy->cpuinfo.min_freq = min_freq; > - policy->max = policy->cpuinfo.max_freq = max_freq; > + policy->max = max_freq; > + /* > + * If the driver has set its own cpuinfo.max_freq above max_freq, leave > + * it as is. > + */ > + if (policy->cpuinfo.max_freq < max_freq) > + policy->max = policy->cpuinfo.max_freq = max_freq; > > if (policy->min == ~0) > return -EINVAL; > > >
On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover > boost frequencies") attempted to address a performance issue involving > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by > extending the frequency tables created by acpi-cpufreq to cover the > entire range of "turbo" (or "boost") frequencies, but that caused > frequencies reported via /proc/cpuinfo and the scaling_cur_freq > attribute in sysfs to change which may confuse users and monitoring > tools. > > For this reason, revert the part of commit 3c55e94c0ade adding the > extra entry to the frequency table and use the observation that > in principle cpuinfo.max_freq need not be equal to the maximum > frequency listed in the frequency table for the given policy. > > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq > drivers to set their own cpuinfo.max_freq above that frequency and > change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost > frequency found via CPPC. > > This should be sufficient to let all of the cpufreq subsystem know > the real maximum frequency of the CPU without changing frequency > reporting. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies") > Reported-by: Matt McDonald <gardotd426@gmail.com> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Michael, Giovanni, > > The fix for the EPYC performance regression that was merged into 5.11 introduced > an undesirable side-effect by distorting the CPU frequency reporting via > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). > > The patch below is reported to address this problem and it should still allow > schedutil to achieve desirable performance, because it simply sets > cpuinfo.max_freq without extending the frequency table of the CPU. > > Please test this one and let me know if it adversely affects performance. > > Thanks! > > --- > drivers/cpufreq/acpi-cpufreq.c | 62 ++++++++++------------------------------- > drivers/cpufreq/freq_table.c | 8 ++++- > 2 files changed, 23 insertions(+), 47 deletions(-) Hello Rafael, I've run the quick image processing test below and the performance is in line with v5.11. I'll send some more results as longer tests complete. TEST : Intel Open Image Denoise, www.openimagedenoise.org INVOCATION : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS CPU : MODEL : 2x AMD EPYC 7742 FREQUENCY TABLE : P2: 1.50 GHz P1: 2.00 GHz P0: 2.25 GHz MAX BOOST : 3.40 GHz Results: threads, msecs (ratio). Lower is better. v5.11 v5.11-patch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 1071.43 (1.00) 1068.57 (1.00) 2 541.50 (1.00) 542.26 (1.00) 4 276.38 (1.00) 276.96 (1.00) 8 149.51 (1.00) 149.24 (1.00) 16 78.57 (1.00) 78.57 (1.00) 24 57.59 (1.00) 57.67 (1.00) 32 46.40 (1.00) 46.30 (1.00) 48 37.48 (1.00) 38.28 (1.02) 64 33.18 (1.00) 33.69 (1.02) 80 30.73 (1.00) 31.24 (1.02) 96 28.06 (1.00) 28.79 (1.03) 112 27.82 (1.00) 28.14 (1.01) 120 28.33 (1.00) 29.16 (1.03) 128 28.44 (1.00) 28.35 (1.00) Giovanni
On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover > boost frequencies") attempted to address a performance issue involving > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by > extending the frequency tables created by acpi-cpufreq to cover the > entire range of "turbo" (or "boost") frequencies, but that caused > frequencies reported via /proc/cpuinfo and the scaling_cur_freq > attribute in sysfs to change which may confuse users and monitoring > tools. > > For this reason, revert the part of commit 3c55e94c0ade adding the > extra entry to the frequency table and use the observation that > in principle cpuinfo.max_freq need not be equal to the maximum > frequency listed in the frequency table for the given policy. > > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq > drivers to set their own cpuinfo.max_freq above that frequency and > change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost > frequency found via CPPC. > > This should be sufficient to let all of the cpufreq subsystem know > the real maximum frequency of the CPU without changing frequency > reporting. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies") > Reported-by: Matt McDonald <gardotd426@gmail.com> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Michael, Giovanni, > > The fix for the EPYC performance regression that was merged into 5.11 introduced > an undesirable side-effect by distorting the CPU frequency reporting via > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). > > The patch below is reported to address this problem and it should still allow > schedutil to achieve desirable performance, because it simply sets > cpuinfo.max_freq without extending the frequency table of the CPU. > > Please test this one and let me know if it adversely affects performance. > > Thanks! Hello Rafael, more extended testing confirms the initial feeling; performance with this patch is mostly identical to vanilla v5.11. Tbench shows an improvement. Thanks for the fix! Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz> Results follow. The machine has two sockets with an AMD EPYC 7742 each. The governor is always schedutil. Ratios of time, lower is better: v5.11 v5.11 vanilla patch - - - - - - - - - - - - - - - - - - - - - - - - - - - - NASA Parallel Benchmarks w/ MPI 1.00 0.96 NASA Parallel Benchmarks w/ OpenMP 1.00 ~ dbench on XFS 1.00 ~ Linux kernel compilation 1.00 ~ git unit test suite 1.00 ~ Ratio of throughput, higher is better: v5.11 v5.11 vanilla patch - - - - - - - - - - - - - - - - - - - - - - - - - - - - tbench on localhost 1.00 1.09 Tilde (~): no change wrt baseline. Giovanni
On 2/15/21 7:49 PM, Michael Larabel wrote: > > On 2/15/21 1:24 PM, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover >> boost frequencies") attempted to address a performance issue involving >> acpi-cpufreq, the schedutil governor and scale-invariance on x86 by >> extending the frequency tables created by acpi-cpufreq to cover the >> entire range of "turbo" (or "boost") frequencies, but that caused >> frequencies reported via /proc/cpuinfo and the scaling_cur_freq >> attribute in sysfs to change which may confuse users and monitoring >> tools. >> >> For this reason, revert the part of commit 3c55e94c0ade adding the >> extra entry to the frequency table and use the observation that >> in principle cpuinfo.max_freq need not be equal to the maximum >> frequency listed in the frequency table for the given policy. >> >> Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq >> drivers to set their own cpuinfo.max_freq above that frequency and >> change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost >> frequency found via CPPC. >> >> This should be sufficient to let all of the cpufreq subsystem know >> the real maximum frequency of the CPU without changing frequency >> reporting. >> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 >> Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover >> boost frequencies") >> Reported-by: Matt McDonald <gardotd426@gmail.com> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> --- >> >> Michael, Giovanni, >> >> The fix for the EPYC performance regression that was merged into 5.11 >> introduced >> an undesirable side-effect by distorting the CPU frequency reporting via >> /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). >> >> The patch below is reported to address this problem and it should >> still allow >> schedutil to achieve desirable performance, because it simply sets >> cpuinfo.max_freq without extending the frequency table of the CPU. >> >> Please test this one and let me know if it adversely affects >> performance. >> >> Thanks! > > > When carrying out tests so far today on an EPYC 7F72 2P and Ryzen 9 > 5900X with workloads seeing impact from the prior patches, everything > is looking good when comparing v5.11 to v5.11 + this patch. Not seeing > any measurable difference on either of those systems as a result of > this patch. > > Running some additional tests and on a few more boxes that should wrap > up tomorrow but at least so far the patch isn't showing any measurable > changes to performance. > > Michael > Linux 5.11 + this patch is still looking fine on the mix of EPYC (Zen 2) and Ryzen (Zen 2/3) systems tried with the previous workloads. Not seeing any measurable change in performance from this new patch, so it's looking fine on my end. Michael Tested-by: Michael Larabel <Michael@phoronix.com> > >> >> --- >> drivers/cpufreq/acpi-cpufreq.c | 62 >> ++++++++++------------------------------- >> drivers/cpufreq/freq_table.c | 8 ++++- >> 2 files changed, 23 insertions(+), 47 deletions(-) >> >> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c >> =================================================================== >> --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c >> +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c >> @@ -54,7 +54,6 @@ struct acpi_cpufreq_data { >> unsigned int resume; >> unsigned int cpu_feature; >> unsigned int acpi_perf_cpu; >> - unsigned int first_perf_state; >> cpumask_var_t freqdomain_cpus; >> void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val); >> u32 (*cpu_freq_read)(struct acpi_pct_register *reg); >> @@ -223,10 +222,10 @@ static unsigned extract_msr(struct cpufr >> perf = to_perf_data(data); >> - cpufreq_for_each_entry(pos, policy->freq_table + >> data->first_perf_state) >> + cpufreq_for_each_entry(pos, policy->freq_table) >> if (msr == perf->states[pos->driver_data].status) >> return pos->frequency; >> - return policy->freq_table[data->first_perf_state].frequency; >> + return policy->freq_table[0].frequency; >> } >> static unsigned extract_freq(struct cpufreq_policy *policy, u32 val) >> @@ -365,7 +364,6 @@ static unsigned int get_cur_freq_on_cpu( >> struct cpufreq_policy *policy; >> unsigned int freq; >> unsigned int cached_freq; >> - unsigned int state; >> pr_debug("%s (%d)\n", __func__, cpu); >> @@ -377,11 +375,7 @@ static unsigned int get_cur_freq_on_cpu( >> if (unlikely(!data || !policy->freq_table)) >> return 0; >> - state = to_perf_data(data)->state; >> - if (state < data->first_perf_state) >> - state = data->first_perf_state; >> - >> - cached_freq = policy->freq_table[state].frequency; >> + cached_freq = >> policy->freq_table[to_perf_data(data)->state].frequency; >> freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data)); >> if (freq != cached_freq) { >> /* >> @@ -680,7 +674,6 @@ static int acpi_cpufreq_cpu_init(struct >> struct cpuinfo_x86 *c = &cpu_data(cpu); >> unsigned int valid_states = 0; >> unsigned int result = 0; >> - unsigned int state_count; >> u64 max_boost_ratio; >> unsigned int i; >> #ifdef CONFIG_SMP >> @@ -795,28 +788,8 @@ static int acpi_cpufreq_cpu_init(struct >> goto err_unreg; >> } >> - state_count = perf->state_count + 1; >> - >> - max_boost_ratio = get_max_boost_ratio(cpu); >> - if (max_boost_ratio) { >> - /* >> - * Make a room for one more entry to represent the highest >> - * available "boost" frequency. >> - */ >> - state_count++; >> - valid_states++; >> - data->first_perf_state = valid_states; >> - } else { >> - /* >> - * If the maximum "boost" frequency is unknown, ask the arch >> - * scale-invariance code to use the "nominal" performance for >> - * CPU utilization scaling so as to prevent the schedutil >> - * governor from selecting inadequate CPU frequencies. >> - */ >> - arch_set_max_freq_ratio(true); >> - } >> - >> - freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL); >> + freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table), >> + GFP_KERNEL); >> if (!freq_table) { >> result = -ENOMEM; >> goto err_unreg; >> @@ -851,27 +824,25 @@ static int acpi_cpufreq_cpu_init(struct >> } >> freq_table[valid_states].frequency = CPUFREQ_TABLE_END; >> + max_boost_ratio = get_max_boost_ratio(cpu); >> if (max_boost_ratio) { >> - unsigned int state = data->first_perf_state; >> - unsigned int freq = freq_table[state].frequency; >> + unsigned int freq = freq_table[0].frequency; >> /* >> * Because the loop above sorts the freq_table entries in the >> * descending order, freq is the maximum frequency in the >> table. >> * Assume that it corresponds to the CPPC nominal frequency >> and >> - * use it to populate the frequency field of the extra "boost" >> - * frequency entry. >> + * use it to set cpuinfo.max_freq. >> */ >> - freq_table[0].frequency = freq * max_boost_ratio >> >> SCHED_CAPACITY_SHIFT; >> + policy->cpuinfo.max_freq = freq * max_boost_ratio >> >> SCHED_CAPACITY_SHIFT; >> + } else { >> /* >> - * The purpose of the extra "boost" frequency entry is to make >> - * the rest of cpufreq aware of the real maximum frequency, but >> - * the way to request it is the same as for the >> first_perf_state >> - * entry that is expected to cover the entire range of "boost" >> - * frequencies of the CPU, so copy the driver_data value from >> - * that entry. >> + * If the maximum "boost" frequency is unknown, ask the arch >> + * scale-invariance code to use the "nominal" performance for >> + * CPU utilization scaling so as to prevent the schedutil >> + * governor from selecting inadequate CPU frequencies. >> */ >> - freq_table[0].driver_data = freq_table[state].driver_data; >> + arch_set_max_freq_ratio(true); >> } >> policy->freq_table = freq_table; >> @@ -947,8 +918,7 @@ static void acpi_cpufreq_cpu_ready(struc >> { >> struct acpi_processor_performance *perf = >> per_cpu_ptr(acpi_perf_data, >> policy->cpu); >> - struct acpi_cpufreq_data *data = policy->driver_data; >> - unsigned int freq = >> policy->freq_table[data->first_perf_state].frequency; >> + unsigned int freq = policy->freq_table[0].frequency; >> if (perf->states[0].core_frequency * 1000 != freq) >> pr_warn(FW_WARN "P-state 0 is not max freq\n"); >> Index: linux-pm/drivers/cpufreq/freq_table.c >> =================================================================== >> --- linux-pm.orig/drivers/cpufreq/freq_table.c >> +++ linux-pm/drivers/cpufreq/freq_table.c >> @@ -52,7 +52,13 @@ int cpufreq_frequency_table_cpuinfo(stru >> } >> policy->min = policy->cpuinfo.min_freq = min_freq; >> - policy->max = policy->cpuinfo.max_freq = max_freq; >> + policy->max = max_freq; >> + /* >> + * If the driver has set its own cpuinfo.max_freq above >> max_freq, leave >> + * it as is. >> + */ >> + if (policy->cpuinfo.max_freq < max_freq) >> + policy->max = policy->cpuinfo.max_freq = max_freq; >> if (policy->min == ~0) >> return -EINVAL; >> >> >>
On Wed, Feb 17, 2021 at 11:46 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover > > boost frequencies") attempted to address a performance issue involving > > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by > > extending the frequency tables created by acpi-cpufreq to cover the > > entire range of "turbo" (or "boost") frequencies, but that caused > > frequencies reported via /proc/cpuinfo and the scaling_cur_freq > > attribute in sysfs to change which may confuse users and monitoring > > tools. > > > > For this reason, revert the part of commit 3c55e94c0ade adding the > > extra entry to the frequency table and use the observation that > > in principle cpuinfo.max_freq need not be equal to the maximum > > frequency listed in the frequency table for the given policy. > > > > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq > > drivers to set their own cpuinfo.max_freq above that frequency and > > change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost > > frequency found via CPPC. > > > > This should be sufficient to let all of the cpufreq subsystem know > > the real maximum frequency of the CPU without changing frequency > > reporting. > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 > > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies") > > Reported-by: Matt McDonald <gardotd426@gmail.com> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > > > Michael, Giovanni, > > > > The fix for the EPYC performance regression that was merged into 5.11 introduced > > an undesirable side-effect by distorting the CPU frequency reporting via > > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). > > > > The patch below is reported to address this problem and it should still allow > > schedutil to achieve desirable performance, because it simply sets > > cpuinfo.max_freq without extending the frequency table of the CPU. > > > > Please test this one and let me know if it adversely affects performance. > > > > Thanks! > > Hello Rafael, > > more extended testing confirms the initial feeling; performance with this > patch is mostly identical to vanilla v5.11. Thank you! > Tbench shows an improvement. Interesting. > Thanks for the fix! YW > Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz> > > Results follow. The machine has two sockets with an AMD EPYC 7742 each. > The governor is always schedutil. > > > Ratios of time, lower is better: > v5.11 v5.11 > vanilla patch > - - - - - - - - - - - - - - - - - - - - - - - - - - - - > NASA Parallel Benchmarks w/ MPI 1.00 0.96 > NASA Parallel Benchmarks w/ OpenMP 1.00 ~ > dbench on XFS 1.00 ~ > Linux kernel compilation 1.00 ~ > git unit test suite 1.00 ~ > > > Ratio of throughput, higher is better: > v5.11 v5.11 > vanilla patch > - - - - - - - - - - - - - - - - - - - - - - - - - - - - > tbench on localhost 1.00 1.09 > > > Tilde (~): no change wrt baseline. Thanks again! It would be good to hear from Michael too, but this is already sufficient for me to queue up the patch for 5.12-rc. Cheers!
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -54,7 +54,6 @@ struct acpi_cpufreq_data { unsigned int resume; unsigned int cpu_feature; unsigned int acpi_perf_cpu; - unsigned int first_perf_state; cpumask_var_t freqdomain_cpus; void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val); u32 (*cpu_freq_read)(struct acpi_pct_register *reg); @@ -223,10 +222,10 @@ static unsigned extract_msr(struct cpufr perf = to_perf_data(data); - cpufreq_for_each_entry(pos, policy->freq_table + data->first_perf_state) + cpufreq_for_each_entry(pos, policy->freq_table) if (msr == perf->states[pos->driver_data].status) return pos->frequency; - return policy->freq_table[data->first_perf_state].frequency; + return policy->freq_table[0].frequency; } static unsigned extract_freq(struct cpufreq_policy *policy, u32 val) @@ -365,7 +364,6 @@ static unsigned int get_cur_freq_on_cpu( struct cpufreq_policy *policy; unsigned int freq; unsigned int cached_freq; - unsigned int state; pr_debug("%s (%d)\n", __func__, cpu); @@ -377,11 +375,7 @@ static unsigned int get_cur_freq_on_cpu( if (unlikely(!data || !policy->freq_table)) return 0; - state = to_perf_data(data)->state; - if (state < data->first_perf_state) - state = data->first_perf_state; - - cached_freq = policy->freq_table[state].frequency; + cached_freq = policy->freq_table[to_perf_data(data)->state].frequency; freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data)); if (freq != cached_freq) { /* @@ -680,7 +674,6 @@ static int acpi_cpufreq_cpu_init(struct struct cpuinfo_x86 *c = &cpu_data(cpu); unsigned int valid_states = 0; unsigned int result = 0; - unsigned int state_count; u64 max_boost_ratio; unsigned int i; #ifdef CONFIG_SMP @@ -795,28 +788,8 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } - state_count = perf->state_count + 1; - - max_boost_ratio = get_max_boost_ratio(cpu); - if (max_boost_ratio) { - /* - * Make a room for one more entry to represent the highest - * available "boost" frequency. - */ - state_count++; - valid_states++; - data->first_perf_state = valid_states; - } else { - /* - * If the maximum "boost" frequency is unknown, ask the arch - * scale-invariance code to use the "nominal" performance for - * CPU utilization scaling so as to prevent the schedutil - * governor from selecting inadequate CPU frequencies. - */ - arch_set_max_freq_ratio(true); - } - - freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL); + freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table), + GFP_KERNEL); if (!freq_table) { result = -ENOMEM; goto err_unreg; @@ -851,27 +824,25 @@ static int acpi_cpufreq_cpu_init(struct } freq_table[valid_states].frequency = CPUFREQ_TABLE_END; + max_boost_ratio = get_max_boost_ratio(cpu); if (max_boost_ratio) { - unsigned int state = data->first_perf_state; - unsigned int freq = freq_table[state].frequency; + unsigned int freq = freq_table[0].frequency; /* * Because the loop above sorts the freq_table entries in the * descending order, freq is the maximum frequency in the table. * Assume that it corresponds to the CPPC nominal frequency and - * use it to populate the frequency field of the extra "boost" - * frequency entry. + * use it to set cpuinfo.max_freq. */ - freq_table[0].frequency = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT; + policy->cpuinfo.max_freq = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT; + } else { /* - * The purpose of the extra "boost" frequency entry is to make - * the rest of cpufreq aware of the real maximum frequency, but - * the way to request it is the same as for the first_perf_state - * entry that is expected to cover the entire range of "boost" - * frequencies of the CPU, so copy the driver_data value from - * that entry. + * If the maximum "boost" frequency is unknown, ask the arch + * scale-invariance code to use the "nominal" performance for + * CPU utilization scaling so as to prevent the schedutil + * governor from selecting inadequate CPU frequencies. */ - freq_table[0].driver_data = freq_table[state].driver_data; + arch_set_max_freq_ratio(true); } policy->freq_table = freq_table; @@ -947,8 +918,7 @@ static void acpi_cpufreq_cpu_ready(struc { struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data, policy->cpu); - struct acpi_cpufreq_data *data = policy->driver_data; - unsigned int freq = policy->freq_table[data->first_perf_state].frequency; + unsigned int freq = policy->freq_table[0].frequency; if (perf->states[0].core_frequency * 1000 != freq) pr_warn(FW_WARN "P-state 0 is not max freq\n"); Index: linux-pm/drivers/cpufreq/freq_table.c =================================================================== --- linux-pm.orig/drivers/cpufreq/freq_table.c +++ linux-pm/drivers/cpufreq/freq_table.c @@ -52,7 +52,13 @@ int cpufreq_frequency_table_cpuinfo(stru } policy->min = policy->cpuinfo.min_freq = min_freq; - policy->max = policy->cpuinfo.max_freq = max_freq; + policy->max = max_freq; + /* + * If the driver has set its own cpuinfo.max_freq above max_freq, leave + * it as is. + */ + if (policy->cpuinfo.max_freq < max_freq) + policy->max = policy->cpuinfo.max_freq = max_freq; if (policy->min == ~0) return -EINVAL;