Message ID | 20250212084232.2349984-1-dedekind1@gmail.com |
---|---|
State | New |
Headers | show |
Series | [1/2] intel_idle: Add C1 demotion on/off sysfs knob | expand |
Hi Rafael, thanks for reply! On Wed, 2025-02-12 at 21:11 +0100, Rafael J. Wysocki wrote: > > > +static ssize_t c1_demotion_store(struct device *dev, > > + struct device_attribute *attr, > > + const char *buf, size_t count) > > +{ > > + int err; > > + bool enable; > > + > > + err = kstrtobool(buf, &enable); > > + if (err) > > + return err; > > + > > + mutex_lock(&c1_demotion_mutex); > > + /* Enable/disable C1 demotion on all CPUs */ > > + on_each_cpu(c1_demotion_toggle, &enable, 1); > > + mutex_unlock(&c1_demotion_mutex); > > This is not the only place where MSR_PKG_CST_CONFIG_CONTROL gets > updated. The other one is drivers/platform/x86/intel/pmc/cnp.c > > There is no real conflict because the PMC core thing happens during > system suspend/resume on client platforms, but this is kind of > duplicated code. Any chance to consolidate this? Thanks for pointing this out. I'd propose to leave it as is because it is such a small amount of duplication and also trivial. Since the two code paths do not interfere with each other, I am not sure consolidation is worth it in this case. I was also hoping that this patch would be backported by OS vendors, because it makes a very significant difference on recent Intel server platforms, so I wanted to keep it simple for easier backporting. But I did not dare to CC stable, because it may not be perceived as fix. However, in practice enabling C1 demotion fixes performance issues on recent Xeons in some workloads. If you really think the MSR read and write code should be consolidated, I would propose to do this as a separate patch-set on top, so this one stays simple and easier to backport. To recap: * I propose not to consolidate it. * If you insist, I propose to do it on top of this one. Please, let me know. > > +static ssize_t c1_demotion_show(struct device *dev, > > + struct device_attribute *attr, char *buf) > > +{ > > + unsigned long long msr_val; > > + > > + rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val); > > + return sysfs_emit(buf, "%d\n", !!(msr_val & NHM_C1_AUTO_DEMOTE)); > > This reads the register on the current CPU with the assumption that > the specific bit value will be the same for all CPUs. Is this always > true? Yes, anything else would be either a BIOS bug or a user toggling the MSR bits directly via /dev/msr. I was trying to keep the driver simple and avoid detecting different values, because the only think we could do in that case is just print a warning. > What about systems with more than one package? Do they always > initialize this bit to the same value in all packages? Yes, anything else would be a misconfiguration or bug. > I guess so, > but then I would add a comment documenting this assumption and the > reasons for it. I'll add, thanks! > I'm not sure if the attr name is clear enough. I guess reading the > doc is really necessary to get an idea of what this is about, but it > might indicate that the demotion is done in hardware, like > "hw_c1_demotion". Well, let's see. I propose two approaches for naming this attribute. 1. Follow the Intel name, documented in the Intel SDM: c1_demotion. 2. Try to come up with a better and more intuitive name. I would suggest one of these: * demotion_to_c1 * auto_demotion_to_c1 I followed approach #1. I agree that C1 demotion is not an intuitive name, because it sounds like C1 is being demoted. But I am used to it, and it is also old and documented in the SDM. However, I can rename it if you direct me this way. In that case, I would propose one of the above two. Please, let me know. Thank you!
Hi Artem, On Fri, Feb 14, 2025 at 1:51 PM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > Hi Rafael, > > thanks for reply! > > On Wed, 2025-02-12 at 21:11 +0100, Rafael J. Wysocki wrote: > > > > > +static ssize_t c1_demotion_store(struct device *dev, > > > + struct device_attribute *attr, > > > + const char *buf, size_t count) > > > +{ > > > + int err; > > > + bool enable; > > > + > > > + err = kstrtobool(buf, &enable); > > > + if (err) > > > + return err; > > > + > > > + mutex_lock(&c1_demotion_mutex); > > > + /* Enable/disable C1 demotion on all CPUs */ > > > + on_each_cpu(c1_demotion_toggle, &enable, 1); > > > + mutex_unlock(&c1_demotion_mutex); > > > > This is not the only place where MSR_PKG_CST_CONFIG_CONTROL gets > > updated. The other one is drivers/platform/x86/intel/pmc/cnp.c > > > > There is no real conflict because the PMC core thing happens during > > system suspend/resume on client platforms, but this is kind of > > duplicated code. Any chance to consolidate this? > > Thanks for pointing this out. > > I'd propose to leave it as is because it is such a small amount of > duplication and also trivial. Since the two code paths do not interfere > with each other, I am not sure consolidation is worth it in this case. > > I was also hoping that this patch would be backported by OS vendors, > because it makes a very significant difference on recent Intel > server platforms, so I wanted to keep it simple for easier backporting. > > But I did not dare to CC stable, because it may not be perceived as fix. > However, in practice enabling C1 demotion fixes performance issues on > recent Xeons in some workloads. > > If you really think the MSR read and write code should be consolidated, > I would propose to do this as a separate patch-set on top, so this one > stays simple and easier to backport. > > To recap: > > * I propose not to consolidate it. > * If you insist, I propose to do it on top of this one. I would at least add a common wrapper around the RMW thing that would return the old value, to make it clear that it is not only used in this one place. That wrapper can be added in a separate patch, so it is possible to backport it without changing functionality. > Please, let me know. > > > > +static ssize_t c1_demotion_show(struct device *dev, > > > + struct device_attribute *attr, char *buf) > > > +{ > > > + unsigned long long msr_val; > > > + > > > + rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val); > > > + return sysfs_emit(buf, "%d\n", !!(msr_val & NHM_C1_AUTO_DEMOTE)); > > > > This reads the register on the current CPU with the assumption that > > the specific bit value will be the same for all CPUs. Is this always > > true? > > Yes, anything else would be either a BIOS bug or a user toggling the MSR > bits directly via /dev/msr. I was trying to keep the driver simple and > avoid detecting different values, because the only think we could do in > that case is just print a warning. > > > What about systems with more than one package? Do they always > > initialize this bit to the same value in all packages? > > Yes, anything else would be a misconfiguration or bug. > > > I guess so, > > but then I would add a comment documenting this assumption and the > > reasons for it. > > I'll add, thanks! > > > I'm not sure if the attr name is clear enough. I guess reading the > > doc is really necessary to get an idea of what this is about, but it > > might indicate that the demotion is done in hardware, like > > "hw_c1_demotion". > > Well, let's see. I propose two approaches for naming this attribute. > > 1. Follow the Intel name, documented in the Intel SDM: c1_demotion. > 2. Try to come up with a better and more intuitive name. I would suggest one of > these: > * demotion_to_c1 > * auto_demotion_to_c1 > > I followed approach #1. I agree that C1 demotion is not an intuitive name, > because it sounds like C1 is being demoted. But I am used to it, and it is also > old and documented in the SDM. This is not my point. You are essentially extending Linux ABI here which isn't Intel-specific and I'm just saying that "c1_demotion" needs some context, or it is even hard to say where it comes from. It is described in intel_idle documentation, but in order to find that description, one needs to know that this is an intel_idle feature which isn't really obvious. "intel_c1_demotion" would be better even IMV.
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 118fe1d37c22..f950e8e793fe 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -48,9 +48,11 @@ #include <trace/events/power.h> #include <linux/sched.h> #include <linux/sched/smt.h> +#include <linux/mutex.h> #include <linux/notifier.h> #include <linux/cpu.h> #include <linux/moduleparam.h> +#include <linux/sysfs.h> #include <asm/cpuid.h> #include <asm/cpu_device_id.h> #include <asm/intel-family.h> @@ -91,9 +93,15 @@ struct idle_cpu { unsigned long auto_demotion_disable_flags; bool byt_auto_demotion_disable_flag; bool disable_promotion_to_c1e; + bool c1_demotion_supported; bool use_acpi; }; +static bool c1_demotion_supported; +static DEFINE_MUTEX(c1_demotion_mutex); + +static struct device *sysfs_root __initdata; + static const struct idle_cpu *icpu __initdata; static struct cpuidle_state *cpuidle_state_table __initdata; @@ -1541,18 +1549,21 @@ static const struct idle_cpu idle_cpu_gmt __initconst = { static const struct idle_cpu idle_cpu_spr __initconst = { .state_table = spr_cstates, .disable_promotion_to_c1e = true, + .c1_demotion_supported = true, .use_acpi = true, }; static const struct idle_cpu idle_cpu_gnr __initconst = { .state_table = gnr_cstates, .disable_promotion_to_c1e = true, + .c1_demotion_supported = true, .use_acpi = true, }; static const struct idle_cpu idle_cpu_gnrd __initconst = { .state_table = gnrd_cstates, .disable_promotion_to_c1e = true, + .c1_demotion_supported = true, .use_acpi = true, }; @@ -1591,12 +1602,14 @@ static const struct idle_cpu idle_cpu_snr __initconst = { static const struct idle_cpu idle_cpu_grr __initconst = { .state_table = grr_cstates, .disable_promotion_to_c1e = true, + .c1_demotion_supported = true, .use_acpi = true, }; static const struct idle_cpu idle_cpu_srf __initconst = { .state_table = srf_cstates, .disable_promotion_to_c1e = true, + .c1_demotion_supported = true, .use_acpi = true, }; @@ -2291,6 +2304,85 @@ static void __init intel_idle_cpuidle_devices_uninit(void) cpuidle_unregister_device(per_cpu_ptr(intel_idle_cpuidle_devices, i)); } +static void c1_demotion_toggle(void *info) +{ + unsigned long long msr_val; + bool enable = *(bool *)info; + + rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val); + /* + * Enable/disable C1 undemotion along with C1 demotion, as this is the + * most sensible configuration in general. + */ + if (enable) + msr_val |= NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE; + else + msr_val &= ~(NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE); + wrmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val); +} + +static ssize_t c1_demotion_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int err; + bool enable; + + err = kstrtobool(buf, &enable); + if (err) + return err; + + mutex_lock(&c1_demotion_mutex); + /* Enable/disable C1 demotion on all CPUs */ + on_each_cpu(c1_demotion_toggle, &enable, 1); + mutex_unlock(&c1_demotion_mutex); + + return count; +} + +static ssize_t c1_demotion_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + unsigned long long msr_val; + + rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val); + return sysfs_emit(buf, "%d\n", !!(msr_val & NHM_C1_AUTO_DEMOTE)); +} +static DEVICE_ATTR_RW(c1_demotion); + +static int __init intel_idle_sysfs_init(void) +{ + int err; + + if (!c1_demotion_supported) + return 0; + + sysfs_root = bus_get_dev_root(&cpu_subsys); + if (!sysfs_root) + return 0; + + err = sysfs_add_file_to_group(&sysfs_root->kobj, + &dev_attr_c1_demotion.attr, + "cpuidle"); + if (err) { + put_device(sysfs_root); + return err; + } + + return 0; +} + +static void __init intel_idle_sysfs_uninit(void) +{ + if (!sysfs_root) + return; + + sysfs_remove_file_from_group(&sysfs_root->kobj, + &dev_attr_c1_demotion.attr, + "cpuidle"); + put_device(sysfs_root); +} + static int __init intel_idle_init(void) { const struct x86_cpu_id *id; @@ -2337,6 +2429,8 @@ static int __init intel_idle_init(void) auto_demotion_disable_flags = icpu->auto_demotion_disable_flags; if (icpu->disable_promotion_to_c1e) c1e_promotion = C1E_PROMOTION_DISABLE; + if (icpu->c1_demotion_supported) + c1_demotion_supported = true; if (icpu->use_acpi || force_use_acpi) intel_idle_acpi_cst_extract(); } else if (!intel_idle_acpi_cst_extract()) { @@ -2350,6 +2444,10 @@ static int __init intel_idle_init(void) if (!intel_idle_cpuidle_devices) return -ENOMEM; + retval = intel_idle_sysfs_init(); + if (retval) + pr_warn("failed to initialized sysfs"); + intel_idle_cpuidle_driver_init(&intel_idle_driver); retval = cpuidle_register_driver(&intel_idle_driver); @@ -2374,6 +2472,7 @@ static int __init intel_idle_init(void) intel_idle_cpuidle_devices_uninit(); cpuidle_unregister_driver(&intel_idle_driver); init_driver_fail: + intel_idle_sysfs_uninit(); free_percpu(intel_idle_cpuidle_devices); return retval;