Message ID | 20230525211655.627415-1-evalenti@kernel.org |
---|---|
Headers | show |
Series | thermal: hook in with reboot and crash | expand |
Hi, On 25/05/2023 23:16, Eduardo Valentin wrote: > From: Eduardo Valentin <eduval@amazon.com> > > Hello, > > This small series of changes teaches thermal core about > reboot and crash callbacks. The intention is to have the core > to get notified and the pass in the event to thermal governors > that are willing to perform actions during reboot or crash events. > The thermal workers will be teared down in the process too. What problem does it solve? > There is no code dependency this series was built on top of: > https://lkml.org/lkml/2023/5/18/1207 > > Separate governor changes will be sent in another series. > > BR, > > Eduardo Valentin (3): > thermal: core: introduce governor .reboot_prepare() > thermal: core: register reboot nb > thermal: core: register a crash callback > > drivers/thermal/thermal_core.c | 54 ++++++++++++++++++++++++++++++++++ > include/linux/thermal.h | 4 +++ > 2 files changed, 58 insertions(+) >
On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > Hi, > > On 25/05/2023 23:16, Eduardo Valentin wrote: > > From: Eduardo Valentin <eduval@amazon.com> > > > > Hello, > > > > This small series of changes teaches thermal core about > > reboot and crash callbacks. The intention is to have the core > > to get notified and the pass in the event to thermal governors > > that are willing to perform actions during reboot or crash events. > > The thermal workers will be teared down in the process too. > > What problem does it solve? This cover letter was not clear enough. In fact, the context for all patches I am sending now and will be sending in near future is when the thermal subsystem is configured to control temperature of a target device. The thermal subsystem is configured to have cooling devices that will act on the target system, and has input, temperature sensors, to have visibility to the target system temperature. In this case, the problem is when the controlling system becomes unresponsive upon reboot or crash, therefore losing control of temperature of the target system. This series solves the problem by giving knowledge to the governors of such events, allowing the governors to have opportunity to act before the actual event happens. > > > > There is no code dependency this series was built on top of: > > https://lkml.org/lkml/2023/5/18/1207 > > > > Separate governor changes will be sent in another series. > > > > BR, > > > > Eduardo Valentin (3): > > thermal: core: introduce governor .reboot_prepare() > > thermal: core: register reboot nb > > thermal: core: register a crash callback > > > > drivers/thermal/thermal_core.c | 54 ++++++++++++++++++++++++++++++++++ > > include/linux/thermal.h | 4 +++ > > 2 files changed, 58 insertions(+) > > > > -- > <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs > > Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | > <http://twitter.com/#!/linaroorg> Twitter | > <http://www.linaro.org/linaro-blog/> Blog >
On Mon, Jun 05, 2023 at 04:27:55PM -0700, Eduardo Valentin wrote: > > > > On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote: > > > > > > > > Hi, > > > > On 25/05/2023 23:16, Eduardo Valentin wrote: > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > Hello, > > > > > > This small series of changes teaches thermal core about > > > reboot and crash callbacks. The intention is to have the core > > > to get notified and the pass in the event to thermal governors > > > that are willing to perform actions during reboot or crash events. > > > The thermal workers will be teared down in the process too. > > > > What problem does it solve? > > This cover letter was not clear enough. In fact, the context for > all patches I am sending now and will be sending in near future > is when the thermal subsystem is configured to control temperature > of a target device. The thermal subsystem is configured to have > cooling devices that will act on the target system, and has > input, temperature sensors, to have visibility to the target system > temperature. In this case, the problem is when the controlling system > becomes unresponsive upon reboot or crash, therefore losing > control of temperature of the target system. This series solves the > problem by giving knowledge to the governors of such events, allowing > the governors to have opportunity to act before the actual event happens. Again, this is a different situation than a emergence shutdown due to temperature/overheat on the typical application of the thermal subsystem. Where it runs in the same system it controls the temperature of. Here we want to reduce the likelihood of loosing control of temperature of a target systems upon events where the controlling system is unavailable.
On Tue, Jun 6, 2023 at 1:32 AM Eduardo Valentin <evalenti@kernel.org> wrote: > > On Mon, Jun 05, 2023 at 04:27:55PM -0700, Eduardo Valentin wrote: > > > > > > > > On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > Hi, > > > > > > On 25/05/2023 23:16, Eduardo Valentin wrote: > > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > > > Hello, > > > > > > > > This small series of changes teaches thermal core about > > > > reboot and crash callbacks. The intention is to have the core > > > > to get notified and the pass in the event to thermal governors > > > > that are willing to perform actions during reboot or crash events. > > > > The thermal workers will be teared down in the process too. > > > > > > What problem does it solve? > > > > This cover letter was not clear enough. In fact, the context for > > all patches I am sending now and will be sending in near future > > is when the thermal subsystem is configured to control temperature > > of a target device. The thermal subsystem is configured to have > > cooling devices that will act on the target system, and has > > input, temperature sensors, to have visibility to the target system > > temperature. In this case, the problem is when the controlling system > > becomes unresponsive upon reboot or crash, therefore losing > > control of temperature of the target system. This series solves the > > problem by giving knowledge to the governors of such events, allowing > > the governors to have opportunity to act before the actual event happens. > > > Again, this is a different situation than a emergence shutdown due to > temperature/overheat on the typical application of the thermal subsystem. > Where it runs in the same system it controls the temperature of. > > Here we want to reduce the likelihood of loosing control of temperature of a target systems > upon events where the controlling system is unavailable. So the use case for this seems to be a BMC running Linux that is responsible for the thermal control of a host system. It kind of escapes me why you want a thermal governor in the BMC's kernel to be part of this.
Hey Rafael, On Thu, Jun 29, 2023 at 09:07:36PM +0200, Rafael J. Wysocki wrote: > > > > On Tue, Jun 6, 2023 at 1:32 AM Eduardo Valentin <evalenti@kernel.org> wrote: > > > > On Mon, Jun 05, 2023 at 04:27:55PM -0700, Eduardo Valentin wrote: > > > > > > > > > > > > On Fri, May 26, 2023 at 10:27:39AM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > > > > > Hi, > > > > > > > > On 25/05/2023 23:16, Eduardo Valentin wrote: > > > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > > > > > Hello, > > > > > > > > > > This small series of changes teaches thermal core about > > > > > reboot and crash callbacks. The intention is to have the core > > > > > to get notified and the pass in the event to thermal governors > > > > > that are willing to perform actions during reboot or crash events. > > > > > The thermal workers will be teared down in the process too. > > > > > > > > What problem does it solve? > > > > > > This cover letter was not clear enough. In fact, the context for > > > all patches I am sending now and will be sending in near future > > > is when the thermal subsystem is configured to control temperature > > > of a target device. The thermal subsystem is configured to have > > > cooling devices that will act on the target system, and has > > > input, temperature sensors, to have visibility to the target system > > > temperature. In this case, the problem is when the controlling system > > > becomes unresponsive upon reboot or crash, therefore losing > > > control of temperature of the target system. This series solves the > > > problem by giving knowledge to the governors of such events, allowing > > > the governors to have opportunity to act before the actual event happens. > > > > > > Again, this is a different situation than a emergence shutdown due to > > temperature/overheat on the typical application of the thermal subsystem. > > Where it runs in the same system it controls the temperature of. > > > > Here we want to reduce the likelihood of loosing control of temperature of a target systems > > upon events where the controlling system is unavailable. > > So the use case for this seems to be a BMC running Linux that is > responsible for the thermal control of a host system. BMC is a great example here, yes. But not limited to it. BMC is responsible of thermal control of not only host though, but the entire server thermal, considering not only host temperature sensor as inputs, but plenty of other sensing in the motherboard, for instance. The changes are not limited to BMC controlling a host temperature though. Any system controlling a target device temperature will fit. > > It kind of escapes me why you want a thermal governor in the BMC's > kernel to be part of this. Yeah, I understand the typical BMC design would have its control in userland. But as I said this not necessarily only about BMCs controlling host temperature. Primary reason to have in kernel is reliability compared to a userspace based control, avoiding downtime of the control, having the control available and stable during the entire life time of the controlling system, e.g from early boot, and avoiding unnecessary down time of the control due to application crashes. Sure the kernel can crash too :-), that is why this patch was written in the first place, but it is a more rare case. Userspace may still be involved here where user wants to have, for example a fixed override cooling state. But in that case the user would need to disable the in kernel control and take care of the cooling state and corner scenarios, like crash and reboot. I also do not see anything that would prevent a control of the controlling system, like a BMC, be put in the kernel, given all the abstractions needed, cooling devices, governor, sensors, are available there. Mechanically and physically the thermal zone is still something part of the hardware where the controlling system, say a BMC, is responsible for.
From: Eduardo Valentin <eduval@amazon.com> Hello, This small series of changes teaches thermal core about reboot and crash callbacks. The intention is to have the core to get notified and the pass in the event to thermal governors that are willing to perform actions during reboot or crash events. The thermal workers will be teared down in the process too. There is no code dependency this series was built on top of: https://lkml.org/lkml/2023/5/18/1207 Separate governor changes will be sent in another series. BR, Eduardo Valentin (3): thermal: core: introduce governor .reboot_prepare() thermal: core: register reboot nb thermal: core: register a crash callback drivers/thermal/thermal_core.c | 54 ++++++++++++++++++++++++++++++++++ include/linux/thermal.h | 4 +++ 2 files changed, 58 insertions(+)