mbox series

[RFC,0/3] thermal: Add CPU hotplug cooling driver

Message ID 20250309121324.29633-1-john.madieu.xa@bp.renesas.com
Headers show
Series thermal: Add CPU hotplug cooling driver | expand

Message

John Madieu March 9, 2025, 12:13 p.m. UTC
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This patch series introduces a new thermal cooling driver that implements CPU
hotplug-based thermal management. The driver dynamically takes CPUs offline
during thermal excursions to reduce power consumption and prevent overheating,
while maintaining system stability by keeping at least one CPU online. 

1- Problem Statement

Modern SoCs require robust thermal management to prevent overheating under heavy
workloads. Existing cooling mechanisms like frequency scaling may not always
provide sufficient thermal relief, especially in multi-core systems where
per-core thermal contributions can be significant. 

2- Solution Overview 

The driver:

 - Integrates with the Linux thermal framework as a cooling device  
 - Registers per-CPU cooling devices that respond to thermal trip points  
 - Uses CPU hotplug operations to reduce thermal load  
 - Maintains system stability by preserving the boot CPU from being put offline,
 regardless the CPUs that are specified in cooling device list. 
 - Implements proper state tracking and cleanup

Key Features:   

 - Dynamic CPU online/offline management based on thermal thresholds  
 - Device tree-based configuration via thermal zones and trip points  
 - Hysteresis support through thermal governor interactions  
 - Safe handling of CPU state transitions during module load/unload  
 - Compatibility with existing thermal management frameworks

Testing    

 - Verified on Renesas RZ/G3E platforms with multi-core CPU configurations  
 - Validated thermal response using artificial load generation (emul_temp)  
 - Confirmed proper interaction with other cooling devices
 - Verified support for 'plug' type trace events
 - Tested with step_wise governor

As the 'hot' type is already used for user space notification, I've choosen
'plug' for this new type. suggestions on this are welcome. Here is an example
of 'thermal-zone' that integrate 'plug' type:

```
thermal-zones {
	cpu-thermal {
		polling-delay = <1000>;
		polling-delay-passive = <250>;
		thermal-sensors = <&tsu>;

		cooling-maps {
			map0 {
				trip = <&target>;
				cooling-device = <&cpu0 0 3>, <&cpu3 0 3>;
				contribution = <1024>;
			};

			map1 {
				trip = <&trip_emergency>;
				cooling-device = <&cpu1 0 1>, <&cpu2 0 1>;
				contribution = <1024>;
			};

		};

		trips {
			target: trip-point {
				temperature = <95000>;
				hysteresis = <1000>;
				type = "passive";
			};

			trip_emergency: emergency {
				temperature = <110000>;
				hysteresis = <1000>;
				type = "plug";
			};

			sensor_crit: sensor-crit {
				temperature = <120000>;
				hysteresis = <1000>;
				type = "critical";
			};
		};
	};
};
```

Dependencies    

 - Requires standard thermal framework components (CONFIG_THERMAL)  
 - Depends on CPU hotplug support (CONFIG_HOTPLUG_CPU)  
 - Assumes device tree contains appropriate thermal zone definitions

This series also depends upon [1], more precisely on patch 6/7, 
arm64: dts: renesas: r9a09g047: Add TSU node.


3) Notes for Reviewers    

 - Focus areas: Thermal framework integration, CPU state management, and error handling  
 - Feedback on device tree binding requirements is particularly welcome  
 - Suggestions for interaction improvements with other governors are appreciated

I look forward to your feedback and guidance on this contribution.

[1] https://patchwork.kernel.org/project/linux-clk/cover/20250227122453.30480-1-john.madieu.xa@bp.renesas.com/

Regards,
John


John Madieu (3):
  thermal/cpuplog_cooling: Add CPU hotplug cooling driver
  tmon: Add support for THERMAL_TRIP_PLUG type
  arm64: dts: renesas: r9a09g047: Add thermal hotplug trip point

 arch/arm64/boot/dts/renesas/r9a09g047.dtsi |  13 +
 drivers/thermal/Kconfig                    |  12 +
 drivers/thermal/Makefile                   |   1 +
 drivers/thermal/cpuplug_cooling.c          | 363 +++++++++++++++++++++
 drivers/thermal/thermal_of.c               |   1 +
 drivers/thermal/thermal_trace.h            |   2 +
 drivers/thermal/thermal_trip.c             |   1 +
 include/uapi/linux/thermal.h               |   1 +
 tools/thermal/tmon/tmon.h                  |   1 +
 tools/thermal/tmon/tui.c                   |   3 +-
 10 files changed, 397 insertions(+), 1 deletion(-)
 create mode 100644 drivers/thermal/cpuplug_cooling.c

Comments

Krzysztof Kozlowski March 11, 2025, 7:30 a.m. UTC | #1
On 09/03/2025 13:13, John Madieu wrote:
> +
> +/* Check if a trip point is of type "plug" */
> +static bool is_plug_trip_point(struct device_node *trip_node)
> +{
> +	const char *trip_type_str;
> +
> +	if (!trip_node) {
> +		pr_err("Trip node is NULL\n");
> +		return false;
> +	}
> +
> +	if (of_property_read_string(trip_node, "type", &trip_type_str)) {
> +		pr_err("Trip node missing 'type' property\n");
> +		return false;
> +	}
> +
> +	pr_info("Trip type: '%s'\n", trip_type_str);
> +
> +	if (strcmp(trip_type_str, "plug") != 0) {

type is object, not string. Where is ABI documented? For the type and
its value?


> +		pr_debug("Trip type is '%s', not 'plug' - skipping\n",
> +			 trip_type_str);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +/* Init function */
> +static int __init cpu_hotplug_cooling_init(void)
> +{
> +	struct device_node *thermal_zones, *thermal_zone;
> +	int ret = 0;
> +	int count = 0;
> +
> +	bitmap_zero(cpu_cooling_registered, NR_CPUS);
> +
> +	thermal_zones = of_find_node_by_name(NULL, "thermal-zones");
> +	if (!thermal_zones) {
> +		pr_err("Missing thermal-zones node\n");
> +		return -EINVAL;
> +	}
> +
> +	/* Process each thermal zone */
> +	for_each_child_of_node(thermal_zones, thermal_zone) {
> +		struct device_node *trips, *trip;
> +		struct device_node *maps, *map;
> +		bool found_plug = false;
> +
> +		/* First find trips and get a specific plug trip */
> +		trips = of_find_node_by_name(thermal_zone, "trips");
> +		if (!trips)
> +			continue;
> +
> +		/* Find the emergency trip with type="plug" */
> +		for_each_child_of_node(trips, trip) {
> +			if (is_plug_trip_point(trip)) {
> +				found_plug = true;
> +				break;
> +			}
> +		}
> +
> +		/* If we didn't find a plug trip, no need to process this zone */
> +		if (!found_plug) {
> +			of_node_put(trips);
> +			continue;
> +		}
> +
> +		maps = of_find_node_by_name(thermal_zone, "cooling-maps");
> +		if (!maps) {
> +			of_node_put(trip);
> +			of_node_put(trips);
> +			continue;
> +		}
> +
> +		pr_info("Found 'plug' trip point, processing cooling devices\n");

dev_info, or just drop. You are not supposed to print successes of
standard DT parsing.

> +
> +		/* Find the specific cooling map that references our plug trip */
> +		for_each_child_of_node(maps, map) {
> +			struct device_node *trip_ref;
> +			struct of_phandle_args cooling_spec;
> +			int idx = 0;
> +
> +			trip_ref = of_parse_phandle(map, "trip", 0);
> +			if (!trip_ref || trip_ref != trip) {
> +				if (trip_ref)
> +					of_node_put(trip_ref);
> +				continue;
> +			}
> +			of_node_put(trip_ref);
> +
> +			if (!of_find_property(map, "cooling-device", NULL)) {
> +				pr_err("Missing cooling-device property\n");
> +				continue;
> +			}
> +
> +			/* Iterate through all cooling-device entries */
> +			while (of_parse_phandle_with_args(
> +				       map, "cooling-device",
> +				       "#cooling-cells", idx++,
> +				       &cooling_spec) == 0) {
> +				struct device_node *cpu_node = cooling_spec.np;
> +				int cpu;
> +
> +				if (!cpu_node) {
> +					pr_err("CPU node at index %d is NULL\n",
> +					       idx - 1);
> +					continue;
> +				}
> +
> +				cpu = of_cpu_node_to_id(cpu_node);
> +				if (cpu < 0) {
> +					pr_err("Failed to map CPU node %pOF to logical ID\n",
> +					       cpu_node);
> +					of_node_put(cpu_node);
> +					continue;
> +				}
> +
> +				if (cpu >= num_possible_cpus()) {
> +					pr_err("Invalid CPU ID %d (max %d)\n",
> +					       cpu, num_possible_cpus() - 1);
> +					of_node_put(cpu_node);
> +					continue;
> +				}
> +
> +				pr_info("Processing cooling device for CPU%d\n", cpu);
> +				ret = register_cpu_hotplug_cooling(cpu_node, cpu);
> +				if (ret == 0)
> +					count++;
> +
> +				of_node_put(cpu_node);
> +			}
> +			break; /* Only process the first map that references our trip */
> +		}
> +		of_node_put(maps);
> +		of_node_put(trip);
> +		of_node_put(trips);
> +	}
> +	of_node_put(thermal_zones);
> +
> +	if (count == 0) {
> +		pr_err("No cooling devices registered\n");
> +		return -ENODEV;
> +	}
> +
> +	pr_info("CPU hotplug cooling driver initialized with %d devices\n", count);

Drop. Why would you print this on MIPS machine which does not care about
it, just because someone loaded a module?

> +	return 0;
> +}
> +
> +/* Exit function */
> +static void __exit cpu_hotplug_cooling_exit(void)
> +{
> +	cleanup_cooling_devices();
> +	pr_info("CPU hotplug cooling driver removed\n");

No, drop


> +}
> +
> +module_init(cpu_hotplug_cooling_init);
> +module_exit(cpu_hotplug_cooling_exit);
> +
> +MODULE_AUTHOR("John Madieu <john.madieu.xa@bp.renesas.com>");
> +MODULE_DESCRIPTION("CPU Hotplug Thermal Cooling Device");
> +MODULE_LICENSE("GPL");
> \ No newline at end of file

Warning here

> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
> index 0eb92d57a1e2..41655af1e419 100644
> --- a/drivers/thermal/thermal_of.c
> +++ b/drivers/thermal/thermal_of.c
> @@ -28,6 +28,7 @@ static const char * const trip_types[] = {
>  	[THERMAL_TRIP_ACTIVE]	= "active",
>  	[THERMAL_TRIP_PASSIVE]	= "passive",
>  	[THERMAL_TRIP_HOT]	= "hot",
> +	[THERMAL_TRIP_PLUG]	= "plug",
>  	[THERMAL_TRIP_CRITICAL]	= "critical",
>  };
>  
> diff --git a/drivers/thermal/thermal_trace.h b/drivers/thermal/thermal_trace.h
> index df8f4edd6068..c26a3aa7de5f 100644
> --- a/drivers/thermal/thermal_trace.h
> +++ b/drivers/thermal/thermal_trace.h
> @@ -12,6 +12,7 @@
>  #include "thermal_core.h"
>  
>  TRACE_DEFINE_ENUM(THERMAL_TRIP_CRITICAL);
> +TRACE_DEFINE_ENUM(THERMAL_TRIP_PLUG);
>  TRACE_DEFINE_ENUM(THERMAL_TRIP_HOT);
>  TRACE_DEFINE_ENUM(THERMAL_TRIP_PASSIVE);
>  TRACE_DEFINE_ENUM(THERMAL_TRIP_ACTIVE);
> @@ -19,6 +20,7 @@ TRACE_DEFINE_ENUM(THERMAL_TRIP_ACTIVE);
>  #define show_tzt_type(type)					\
>  	__print_symbolic(type,					\
>  			 { THERMAL_TRIP_CRITICAL, "CRITICAL"},	\
> +			 { THERMAL_TRIP_PLUG,     "PLUG"},	\
>  			 { THERMAL_TRIP_HOT,      "HOT"},	\
>  			 { THERMAL_TRIP_PASSIVE,  "PASSIVE"},	\
>  			 { THERMAL_TRIP_ACTIVE,   "ACTIVE"})
> diff --git a/drivers/thermal/thermal_trip.c b/drivers/thermal/thermal_trip.c
> index 4b8238468b53..373f6aaaf0da 100644
> --- a/drivers/thermal/thermal_trip.c
> +++ b/drivers/thermal/thermal_trip.c
> @@ -13,6 +13,7 @@ static const char *trip_type_names[] = {
>  	[THERMAL_TRIP_ACTIVE] = "active",
>  	[THERMAL_TRIP_PASSIVE] = "passive",
>  	[THERMAL_TRIP_HOT] = "hot",
> +	[THERMAL_TRIP_PLUG]	= "plug",
>  	[THERMAL_TRIP_CRITICAL] = "critical",
>  };
>  
> diff --git a/include/uapi/linux/thermal.h b/include/uapi/linux/thermal.h
> index 46a2633d33aa..5f76360c6f69 100644
> --- a/include/uapi/linux/thermal.h
> +++ b/include/uapi/linux/thermal.h
> @@ -15,6 +15,7 @@ enum thermal_trip_type {
>  	THERMAL_TRIP_ACTIVE = 0,
>  	THERMAL_TRIP_PASSIVE,
>  	THERMAL_TRIP_HOT,
> +	THERMAL_TRIP_PLUG,
>  	THERMAL_TRIP_CRITICAL,
>  };
>  


Best regards,
Krzysztof
Geert Uytterhoeven March 11, 2025, 8:28 a.m. UTC | #2
Hi John,

On Sun, 9 Mar 2025 at 13:14, John Madieu <john.madieu.xa@bp.renesas.com> wrote:
> Add thermal cooling mechanism that dynamically manages CPU online/offline
> states to prevent overheating. It registers  per-CPU cooling devices that can
> take CPUs offline when thermal thresholds are excee and that integrates with
> the Linux thermal framework as a cooling devices.
>
> Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com>

Thanks for your patch!

> --- /dev/null
> +++ b/drivers/thermal/cpuplug_cooling.c

> +static int register_cpu_hotplug_cooling(struct device_node *cpu_node,
> +                                       int cpu_id)
> +{

> +       hotplug_cdev = kzalloc(sizeof(*hotplug_cdev), GFP_KERNEL);
> +       if (!hotplug_cdev) {
> +               pr_err("Failed to allocate memory for cooling device\n");

scripts/checkpatch.pl:

WARNING: Possible unnecessary 'out of memory' message

and checkpatch is right, as the memory core already takes care of
printing a message.

> +               return -ENOMEM;
> +       }

Gr{oetje,eeting}s,

                        Geert
Christian Loehle March 11, 2025, 10:53 a.m. UTC | #3
On 3/9/25 12:13, John Madieu wrote:
> Add CPU hotplug trip point to shutdown CPU1 and CPU2 when exceeding 110°C.
> 
> Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com>
> ---
>  arch/arm64/boot/dts/renesas/r9a09g047.dtsi | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/renesas/r9a09g047.dtsi b/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> index 93b57d7ad7b9..06bd394582e2 100644
> --- a/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> +++ b/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> @@ -533,6 +533,13 @@ map0 {
>  							 <&cpu2 0 3>, <&cpu3 0 3>;
>  					contribution = <1024>;
>  				};
> +
> +				map1 {
> +					trip = <&trip_emergency>;
> +					cooling-device = <&cpu1 0 1>, <&cpu2 0 1>;
> +					contribution = <1024>;
> +				};
> +
>  			};
>  
>  			trips {
> @@ -542,6 +549,12 @@ target: trip-point {
>  					type = "passive";
>  				};
>  
> +				trip_emergency: emergency {
> +					temperature = <110000>;
> +					hysteresis = <1000>;
> +					type = "plug";
> +				};
> +
>  				sensor_crit: sensor-crit {
>  					temperature = <120000>;
>  					hysteresis = <1000>;


Are there no other cooling methods?
How does it compare to idle inject?

Furthermore, couldn't the offlining of some CPUs lead to the rest being
operated at much higher OPPs therefore the overall power increase, too?
(Without having looked at if this is a possibility for this particular
SoC.)
Some numbers would be helpful IMO.
John Madieu March 11, 2025, 11:38 a.m. UTC | #4
Hi Krzysztof,

Thanks for the review.

> -----Original Message-----
> From: Krzysztof Kozlowski <krzk@kernel.org>
> Sent: Tuesday, March 11, 2025 8:31 AM
> To: John Madieu <john.madieu.xa@bp.renesas.com>; geert+renesas@glider.be;
> niklas.soderlund+renesas@ragnatech.se; conor+dt@kernel.org;
> krzk+dt@kernel.org; robh@kernel.org; rafael@kernel.org;
> daniel.lezcano@linaro.org
> Subject: Re: [RFC PATCH 1/3] thermal/cpuplog_cooling: Add CPU hotplug
> cooling driver
> 
> On 09/03/2025 13:13, John Madieu wrote:
> > +
> > +/* Check if a trip point is of type "plug" */ static bool
> > +is_plug_trip_point(struct device_node *trip_node) {
> > +	const char *trip_type_str;
> > +
> > +	if (!trip_node) {
> > +		pr_err("Trip node is NULL\n");
> > +		return false;
> > +	}
> > +
> > +	if (of_property_read_string(trip_node, "type", &trip_type_str)) {
> > +		pr_err("Trip node missing 'type' property\n");
> > +		return false;
> > +	}
> > +
> > +	pr_info("Trip type: '%s'\n", trip_type_str);
> > +
> > +	if (strcmp(trip_type_str, "plug") != 0) {
> 
> type is object, not string. Where is ABI documented? For the type and its
> value?

I'll prepare it for v2.

> 
> 
> > +		pr_debug("Trip type is '%s', not 'plug' - skipping\n",
> > +			 trip_type_str);
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +/* Init function */
> > +static int __init cpu_hotplug_cooling_init(void) {
> > +	struct device_node *thermal_zones, *thermal_zone;
> > +	int ret = 0;
> > +	int count = 0;
> > +
> > +	bitmap_zero(cpu_cooling_registered, NR_CPUS);
> > +
> > +	thermal_zones = of_find_node_by_name(NULL, "thermal-zones");
> > +	if (!thermal_zones) {
> > +		pr_err("Missing thermal-zones node\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Process each thermal zone */
> > +	for_each_child_of_node(thermal_zones, thermal_zone) {
> > +		struct device_node *trips, *trip;
> > +		struct device_node *maps, *map;
> > +		bool found_plug = false;
> > +
> > +		/* First find trips and get a specific plug trip */
> > +		trips = of_find_node_by_name(thermal_zone, "trips");
> > +		if (!trips)
> > +			continue;
> > +
> > +		/* Find the emergency trip with type="plug" */
> > +		for_each_child_of_node(trips, trip) {
> > +			if (is_plug_trip_point(trip)) {
> > +				found_plug = true;
> > +				break;
> > +			}
> > +		}
> > +
> > +		/* If we didn't find a plug trip, no need to process this zone
> */
> > +		if (!found_plug) {
> > +			of_node_put(trips);
> > +			continue;
> > +		}
> > +
> > +		maps = of_find_node_by_name(thermal_zone, "cooling-maps");
> > +		if (!maps) {
> > +			of_node_put(trip);
> > +			of_node_put(trips);
> > +			continue;
> > +		}
> > +
> > +		pr_info("Found 'plug' trip point, processing cooling
> devices\n");
> 
> dev_info, or just drop. You are not supposed to print successes of
> standard DT parsing.

Noted. Thanks!

> 
> > +
> > +		/* Find the specific cooling map that references our plug trip
> */
> > +		for_each_child_of_node(maps, map) {
> > +			struct device_node *trip_ref;
> > +			struct of_phandle_args cooling_spec;
> > +			int idx = 0;
> > +
> > +			trip_ref = of_parse_phandle(map, "trip", 0);
> > +			if (!trip_ref || trip_ref != trip) {
> > +				if (trip_ref)
> > +					of_node_put(trip_ref);
> > +				continue;
> > +			}
> > +			of_node_put(trip_ref);
> > +
> > +			if (!of_find_property(map, "cooling-device", NULL)) {
> > +				pr_err("Missing cooling-device property\n");
> > +				continue;
> > +			}
> > +
> > +			/* Iterate through all cooling-device entries */
> > +			while (of_parse_phandle_with_args(
> > +				       map, "cooling-device",
> > +				       "#cooling-cells", idx++,
> > +				       &cooling_spec) == 0) {
> > +				struct device_node *cpu_node = cooling_spec.np;
> > +				int cpu;
> > +
> > +				if (!cpu_node) {
> > +					pr_err("CPU node at index %d is NULL\n",
> > +					       idx - 1);
> > +					continue;
> > +				}
> > +
> > +				cpu = of_cpu_node_to_id(cpu_node);
> > +				if (cpu < 0) {
> > +					pr_err("Failed to map CPU node %pOF to
> logical ID\n",
> > +					       cpu_node);
> > +					of_node_put(cpu_node);
> > +					continue;
> > +				}
> > +
> > +				if (cpu >= num_possible_cpus()) {
> > +					pr_err("Invalid CPU ID %d (max %d)\n",
> > +					       cpu, num_possible_cpus() - 1);
> > +					of_node_put(cpu_node);
> > +					continue;
> > +				}
> > +
> > +				pr_info("Processing cooling device for CPU%d\n",
> cpu);
> > +				ret = register_cpu_hotplug_cooling(cpu_node, cpu);
> > +				if (ret == 0)
> > +					count++;
> > +
> > +				of_node_put(cpu_node);
> > +			}
> > +			break; /* Only process the first map that references our
> trip */
> > +		}
> > +		of_node_put(maps);
> > +		of_node_put(trip);
> > +		of_node_put(trips);
> > +	}
> > +	of_node_put(thermal_zones);
> > +
> > +	if (count == 0) {
> > +		pr_err("No cooling devices registered\n");
> > +		return -ENODEV;
> > +	}
> > +
> > +	pr_info("CPU hotplug cooling driver initialized with %d devices\n",
> > +count);
> 
> Drop. Why would you print this on MIPS machine which does not care about
> it, just because someone loaded a module?
> 

Will remove this in v2.

> > +	return 0;
> > +}
> > +
> > +/* Exit function */
> > +static void __exit cpu_hotplug_cooling_exit(void) {
> > +	cleanup_cooling_devices();
> > +	pr_info("CPU hotplug cooling driver removed\n");
> 
> No, drop
> 

Got it.

> 
> > +}
> > +
> > +module_init(cpu_hotplug_cooling_init);
> > +module_exit(cpu_hotplug_cooling_exit);
> > +
> > +MODULE_AUTHOR("John Madieu <john.madieu.xa@bp.renesas.com>");
> > +MODULE_DESCRIPTION("CPU Hotplug Thermal Cooling Device");
> > +MODULE_LICENSE("GPL");
> > \ No newline at end of file
> 
> Warning here
> 

Will be fixed in v2.

> > diff --git a/drivers/thermal/thermal_of.c
> > b/drivers/thermal/thermal_of.c index 0eb92d57a1e2..41655af1e419 100644
> > --- a/drivers/thermal/thermal_of.c
> > +++ b/drivers/thermal/thermal_of.c
> > @@ -28,6 +28,7 @@ static const char * const trip_types[] = {
> >  	[THERMAL_TRIP_ACTIVE]	= "active",
> >  	[THERMAL_TRIP_PASSIVE]	= "passive",
> >  	[THERMAL_TRIP_HOT]	= "hot",
> > +	[THERMAL_TRIP_PLUG]	= "plug",
> >  	[THERMAL_TRIP_CRITICAL]	= "critical",
> >  };
> >
> > diff --git a/drivers/thermal/thermal_trace.h
> > b/drivers/thermal/thermal_trace.h index df8f4edd6068..c26a3aa7de5f
> > 100644
> > --- a/drivers/thermal/thermal_trace.h
> > +++ b/drivers/thermal/thermal_trace.h
> > @@ -12,6 +12,7 @@
> >  #include "thermal_core.h"
> >
> >  TRACE_DEFINE_ENUM(THERMAL_TRIP_CRITICAL);
> > +TRACE_DEFINE_ENUM(THERMAL_TRIP_PLUG);
> >  TRACE_DEFINE_ENUM(THERMAL_TRIP_HOT);
> >  TRACE_DEFINE_ENUM(THERMAL_TRIP_PASSIVE);
> >  TRACE_DEFINE_ENUM(THERMAL_TRIP_ACTIVE);
> > @@ -19,6 +20,7 @@ TRACE_DEFINE_ENUM(THERMAL_TRIP_ACTIVE);
> >  #define show_tzt_type(type)					\
> >  	__print_symbolic(type,					\
> >  			 { THERMAL_TRIP_CRITICAL, "CRITICAL"},	\
> > +			 { THERMAL_TRIP_PLUG,     "PLUG"},	\
> >  			 { THERMAL_TRIP_HOT,      "HOT"},	\
> >  			 { THERMAL_TRIP_PASSIVE,  "PASSIVE"},	\
> >  			 { THERMAL_TRIP_ACTIVE,   "ACTIVE"})
> > diff --git a/drivers/thermal/thermal_trip.c
> > b/drivers/thermal/thermal_trip.c index 4b8238468b53..373f6aaaf0da
> > 100644
> > --- a/drivers/thermal/thermal_trip.c
> > +++ b/drivers/thermal/thermal_trip.c
> > @@ -13,6 +13,7 @@ static const char *trip_type_names[] = {
> >  	[THERMAL_TRIP_ACTIVE] = "active",
> >  	[THERMAL_TRIP_PASSIVE] = "passive",
> >  	[THERMAL_TRIP_HOT] = "hot",
> > +	[THERMAL_TRIP_PLUG]	= "plug",
> >  	[THERMAL_TRIP_CRITICAL] = "critical",  };
> >
> > diff --git a/include/uapi/linux/thermal.h
> > b/include/uapi/linux/thermal.h index 46a2633d33aa..5f76360c6f69 100644
> > --- a/include/uapi/linux/thermal.h
> > +++ b/include/uapi/linux/thermal.h
> > @@ -15,6 +15,7 @@ enum thermal_trip_type {
> >  	THERMAL_TRIP_ACTIVE = 0,
> >  	THERMAL_TRIP_PASSIVE,
> >  	THERMAL_TRIP_HOT,
> > +	THERMAL_TRIP_PLUG,
> >  	THERMAL_TRIP_CRITICAL,
> >  };
> >
> 
> 
> Best regards,
> Krzysztof

Regards,
John
John Madieu March 11, 2025, 11:57 a.m. UTC | #5
Hi Christian,

Thanks for reviewing.

> -----Original Message-----
> From: Christian Loehle <christian.loehle@arm.com>
> Sent: Tuesday, March 11, 2025 11:53 AM
> To: John Madieu <john.madieu.xa@bp.renesas.com>; geert+renesas@glider.be;
> niklas.soderlund+renesas@ragnatech.se; conor+dt@kernel.org;
> krzk+dt@kernel.org; robh@kernel.org; rafael@kernel.org;
> daniel.lezcano@linaro.org
> Cc: magnus.damm@gmail.com; Claudiu Beznea
> <claudiu.beznea.uj@bp.renesas.com>; devicetree@vger.kernel.org;
> john.madieu@gmail.com; rui.zhang@intel.com; linux-kernel@vger.kernel.org;
> linux-renesas-soc@vger.kernel.org; Biju Das <biju.das.jz@bp.renesas.com>;
> linux-pm@vger.kernel.org
> Subject: Re: [RFC PATCH 3/3] arm64: dts: renesas: r9a09g047: Add thermal
> hotplug trip point
> 
> On 3/9/25 12:13, John Madieu wrote:
> > Add CPU hotplug trip point to shutdown CPU1 and CPU2 when exceeding
> 110°C.
> >
> > Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com>
> > ---
> >  arch/arm64/boot/dts/renesas/r9a09g047.dtsi | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> > b/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> > index 93b57d7ad7b9..06bd394582e2 100644
> > --- a/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> > +++ b/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
> > @@ -533,6 +533,13 @@ map0 {
> >  							 <&cpu2 0 3>, <&cpu3 0 3>;
> >  					contribution = <1024>;
> >  				};
> > +
> > +				map1 {
> > +					trip = <&trip_emergency>;
> > +					cooling-device = <&cpu1 0 1>, <&cpu2 0 1>;
> > +					contribution = <1024>;
> > +				};
> > +
> >  			};
> >
> >  			trips {
> > @@ -542,6 +549,12 @@ target: trip-point {
> >  					type = "passive";
> >  				};
> >
> > +				trip_emergency: emergency {
> > +					temperature = <110000>;
> > +					hysteresis = <1000>;
> > +					type = "plug";
> > +				};
> > +
> >  				sensor_crit: sensor-crit {
> >  					temperature = <120000>;
> >  					hysteresis = <1000>;
> 
> 
> Are there no other cooling methods?
> How does it compare to idle inject?
> 
> Furthermore, couldn't the offlining of some CPUs lead to the rest being
> operated at much higher OPPs therefore the overall power increase, too?
> (Without having looked at if this is a possibility for this particular
> SoC.)
> Some numbers would be helpful IMO.

To clarify this, I tested with CPUFreq cooling, along with performance
Governor, with "plug" threshold higher than "passive" one. When passive
trip is crossed, we observe proper CPUs throttling, and when "plug" trip
is crossed, we observe target CPUs being put offline, while throttling
remains.

When "plug" targeted CPUs come back online, throttling is still operational.

Once I get comparison results with CPU idle cooling, I'll keep you posted.

Regards,
John.
Christian Loehle March 11, 2025, 3:29 p.m. UTC | #6
On 3/11/25 11:57, John Madieu wrote:
> Hi Christian,
> 
> Thanks for reviewing.
> 
>> -----Original Message-----
>> From: Christian Loehle <christian.loehle@arm.com>
>> Sent: Tuesday, March 11, 2025 11:53 AM
>> To: John Madieu <john.madieu.xa@bp.renesas.com>; geert+renesas@glider.be;
>> niklas.soderlund+renesas@ragnatech.se; conor+dt@kernel.org;
>> krzk+dt@kernel.org; robh@kernel.org; rafael@kernel.org;
>> daniel.lezcano@linaro.org
>> Cc: magnus.damm@gmail.com; Claudiu Beznea
>> <claudiu.beznea.uj@bp.renesas.com>; devicetree@vger.kernel.org;
>> john.madieu@gmail.com; rui.zhang@intel.com; linux-kernel@vger.kernel.org;
>> linux-renesas-soc@vger.kernel.org; Biju Das <biju.das.jz@bp.renesas.com>;
>> linux-pm@vger.kernel.org
>> Subject: Re: [RFC PATCH 3/3] arm64: dts: renesas: r9a09g047: Add thermal
>> hotplug trip point
>>
>> On 3/9/25 12:13, John Madieu wrote:
>>> Add CPU hotplug trip point to shutdown CPU1 and CPU2 when exceeding
>> 110°C.
>>>
>>> Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com>
>>> ---
>>>  arch/arm64/boot/dts/renesas/r9a09g047.dtsi | 13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>>
>>> diff --git a/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
>>> b/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
>>> index 93b57d7ad7b9..06bd394582e2 100644
>>> --- a/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
>>> +++ b/arch/arm64/boot/dts/renesas/r9a09g047.dtsi
>>> @@ -533,6 +533,13 @@ map0 {
>>>  							 <&cpu2 0 3>, <&cpu3 0 3>;
>>>  					contribution = <1024>;
>>>  				};
>>> +
>>> +				map1 {
>>> +					trip = <&trip_emergency>;
>>> +					cooling-device = <&cpu1 0 1>, <&cpu2 0 1>;
>>> +					contribution = <1024>;
>>> +				};
>>> +
>>>  			};
>>>
>>>  			trips {
>>> @@ -542,6 +549,12 @@ target: trip-point {
>>>  					type = "passive";
>>>  				};
>>>
>>> +				trip_emergency: emergency {
>>> +					temperature = <110000>;
>>> +					hysteresis = <1000>;
>>> +					type = "plug";
>>> +				};
>>> +
>>>  				sensor_crit: sensor-crit {
>>>  					temperature = <120000>;
>>>  					hysteresis = <1000>;
>>
>>
>> Are there no other cooling methods?
>> How does it compare to idle inject?
>>
>> Furthermore, couldn't the offlining of some CPUs lead to the rest being
>> operated at much higher OPPs therefore the overall power increase, too?
>> (Without having looked at if this is a possibility for this particular
>> SoC.)
>> Some numbers would be helpful IMO.
> 
> To clarify this, I tested with CPUFreq cooling, along with performance
> Governor, with "plug" threshold higher than "passive" one. When passive
> trip is crossed, we observe proper CPUs throttling, and when "plug" trip
> is crossed, we observe target CPUs being put offline, while throttling
> remains.
> 
> When "plug" targeted CPUs come back online, throttling is still operational.
> 
> Once I get comparison results with CPU idle cooling, I'll keep you posted.
> 

Thanks John!
Might make sense to also try this with schedutil, because my argument doesn't
hold with performance governor.
As long as we also have throttling that's not a concern anyway.