Message ID | 20201207190530.30334-1-daniel.lezcano@linaro.org |
---|---|
State | New |
Headers | show |
Series | thermal/core: Emit a warning if the thermal zone is updated without ops | expand |
Hi Daniel, On 12/7/20 7:05 PM, Daniel Lezcano wrote: > The actual code is silently ignoring a thermal zone update when a > driver is requesting it without a get_temp ops set. > > That looks not correct, as the caller should not have called this > function if the thermal zone is unable to read the temperature. > > That makes the code less robust as the check won't detect the driver > is inconsistently using the thermal API and that does not help to > improve the framework as these circumvolutions hide the problem at the > source. Make sense. > > In order to detect the situation when it happens, let's add a warning > when the update is requested without the get_temp() ops set. > > Any warning emitted will have to be fixed at the source of the > problem: the caller must not call thermal_zone_device_update if there > is not get_temp callback set. > > As the check is done in thermal_zone_get_temperature() via the > update_temperature() function, it is pointless to have the check and > the WARN in the thermal_zone_device_update() function. Just remove the > check and let the next call to raise the warning. > > Cc: Thara Gopinath <thara.gopinath@linaro.org> > Cc: Amit Kucheria <amitk@kernel.org> > Cc: linux-pm@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> > --- > drivers/thermal/thermal_core.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > index 90e38cc199f4..1bd23ff2247b 100644 > --- a/drivers/thermal/thermal_core.c > +++ b/drivers/thermal/thermal_core.c > @@ -448,17 +448,17 @@ static void handle_thermal_trip(struct thermal_zone_device *tz, int trip) > monitor_thermal_zone(tz); > } > > -static void update_temperature(struct thermal_zone_device *tz) > +static int update_temperature(struct thermal_zone_device *tz) > { > int temp, ret; > > ret = thermal_zone_get_temp(tz, &temp); > if (ret) { > if (ret != -EAGAIN) > - dev_warn(&tz->device, > - "failed to read out thermal zone (%d)\n", > - ret); > - return; > + dev_warn_once(&tz->device, > + "failed to read out thermal zone (%d)\n", > + ret); > + return ret; > } > > mutex_lock(&tz->lock); > @@ -469,6 +469,8 @@ static void update_temperature(struct thermal_zone_device *tz) > trace_thermal_temperature(tz); > > thermal_genl_sampling_temp(tz->id, temp); > + > + return 0; > } > > static void thermal_zone_device_init(struct thermal_zone_device *tz) > @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct thermal_zone_device *tz, > if (atomic_read(&in_suspend)) > return; > > - if (!tz->ops->get_temp) > + if (update_temperature(tz)) > return; > > - update_temperature(tz); > - I think the patch does a bit more. Previously we continued running the code below even when the thermal_zone_get_temp() returned an error (due to various reasons). Now we stop and probably would not schedule next polling, not calling: handle_thermal_trip() and monitor_thermal_zone() I would left update_temperature(tz) as it was and not check the return. The function thermal_zone_get_temp() can protect itself from missing tz->ops->get_temp(), so we should be safe. What do you think? Regards, Lukasz
Hi Lukasz, On 08/12/2020 10:36, Lukasz Luba wrote: > Hi Daniel, [ ... ] >> static void thermal_zone_device_init(struct thermal_zone_device *tz) >> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >> thermal_zone_device *tz, >> if (atomic_read(&in_suspend)) >> return; >> - if (!tz->ops->get_temp) >> + if (update_temperature(tz)) >> return; >> - update_temperature(tz); >> - > > I think the patch does a bit more. Previously we continued running the > code below even when the thermal_zone_get_temp() returned an error (due > to various reasons). Now we stop and probably would not schedule next > polling, not calling: > handle_thermal_trip() and monitor_thermal_zone() I agree there is a change in the behavior. > I would left update_temperature(tz) as it was and not check the return. > The function thermal_zone_get_temp() can protect itself from missing > tz->ops->get_temp(), so we should be safe. > > What do you think? Does it make sense to handle the trip point if we are unable to read the temperature? The lines following the update_temperature() are: - thermal_zone_set_trips() which needs a correct tz->temperature - handle_thermal_trip() which needs a correct tz->temperature to compare with - monitor_thermal_zone() which needs a consistent tz->passive. This one is updated by the governor which is in an inconsistent state because the temperature is not updated. The problem I see here is how the interrupt mode and the polling mode are existing in the same code path. The interrupt mode can call thermal_notify_framework() for critical/hot trip points without being followed by a monitoring. But for the other trip points, the get_temp is needed. IMHO, we should return if update_temperature() is failing. Perhaps, it would make sense to simply prevent to register a thermal zone if the get_temp ops is not defined. AFAICS, if the interrupt mode without get_temp callback are for hot and critical trip points which can be directly invoked from the sensor via a specified callback, no thermal zone would be needed in this case. -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog
On 12/8/20 1:51 PM, Daniel Lezcano wrote: > > Hi Lukasz, > > On 08/12/2020 10:36, Lukasz Luba wrote: >> Hi Daniel, > > [ ... ] > >>> static void thermal_zone_device_init(struct thermal_zone_device *tz) >>> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >>> thermal_zone_device *tz, >>> if (atomic_read(&in_suspend)) >>> return; >>> - if (!tz->ops->get_temp) >>> + if (update_temperature(tz)) >>> return; >>> - update_temperature(tz); >>> - >> >> I think the patch does a bit more. Previously we continued running the >> code below even when the thermal_zone_get_temp() returned an error (due >> to various reasons). Now we stop and probably would not schedule next >> polling, not calling: >> handle_thermal_trip() and monitor_thermal_zone() > > I agree there is a change in the behavior. > >> I would left update_temperature(tz) as it was and not check the return. >> The function thermal_zone_get_temp() can protect itself from missing >> tz->ops->get_temp(), so we should be safe. >> >> What do you think? > > Does it make sense to handle the trip point if we are unable to read the > temperature? > > The lines following the update_temperature() are: > > - thermal_zone_set_trips() which needs a correct tz->temperature > > - handle_thermal_trip() which needs a correct tz->temperature to > compare with > > - monitor_thermal_zone() which needs a consistent tz->passive. This one > is updated by the governor which is in an inconsistent state because the > temperature is not updated. > > The problem I see here is how the interrupt mode and the polling mode > are existing in the same code path. > > The interrupt mode can call thermal_notify_framework() for critical/hot > trip points without being followed by a monitoring. But for the other > trip points, the get_temp is needed. Yes, I agree that we can bail out when there is no .get_temp() callback and even not schedule next polling in such case. But I am just not sure if we can bail out and not schedule the next polling, when there is .get_temp() populated and the driver returned an error only at that moment, e.g. indicating some internal temporary, issue like send queue full, so such as -EBUSY, or -EAGAIN, etc. The thermal_zone_get_temp() would pass the error to update_temperature() but we return, losing the next try. We would not check the temperature again. > > IMHO, we should return if update_temperature() is failing. > > Perhaps, it would make sense to simply prevent to register a thermal > zone if the get_temp ops is not defined. > > AFAICS, if the interrupt mode without get_temp callback are for hot and > critical trip points which can be directly invoked from the sensor via a > specified callback, no thermal zone would be needed in this case. > > >
On 12/8/20 2:37 PM, Lukasz Luba wrote: > > > On 12/8/20 1:51 PM, Daniel Lezcano wrote: >> >> Hi Lukasz, >> >> On 08/12/2020 10:36, Lukasz Luba wrote: >>> Hi Daniel, >> >> [ ... ] >> >>>> static void thermal_zone_device_init(struct thermal_zone_device >>>> *tz) >>>> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >>>> thermal_zone_device *tz, >>>> if (atomic_read(&in_suspend)) >>>> return; >>>> - if (!tz->ops->get_temp) >>>> + if (update_temperature(tz)) >>>> return; >>>> - update_temperature(tz); >>>> - >>> >>> I think the patch does a bit more. Previously we continued running the >>> code below even when the thermal_zone_get_temp() returned an error (due >>> to various reasons). Now we stop and probably would not schedule next >>> polling, not calling: >>> handle_thermal_trip() and monitor_thermal_zone() >> >> I agree there is a change in the behavior. >> >>> I would left update_temperature(tz) as it was and not check the return. >>> The function thermal_zone_get_temp() can protect itself from missing >>> tz->ops->get_temp(), so we should be safe. >>> >>> What do you think? >> >> Does it make sense to handle the trip point if we are unable to read the >> temperature? >> >> The lines following the update_temperature() are: >> >> - thermal_zone_set_trips() which needs a correct tz->temperature >> >> - handle_thermal_trip() which needs a correct tz->temperature to >> compare with >> >> - monitor_thermal_zone() which needs a consistent tz->passive. This one >> is updated by the governor which is in an inconsistent state because the >> temperature is not updated. >> >> The problem I see here is how the interrupt mode and the polling mode >> are existing in the same code path. >> >> The interrupt mode can call thermal_notify_framework() for critical/hot >> trip points without being followed by a monitoring. But for the other >> trip points, the get_temp is needed. > > Yes, I agree that we can bail out when there is no .get_temp() callback > and even not schedule next polling in such case. > But I am just not sure if we can bail out and not schedule the next > polling, when there is .get_temp() populated and the driver returned > an error only at that moment, e.g. indicating some internal temporary, > issue like send queue full, so such as -EBUSY, or -EAGAIN, etc. > The thermal_zone_get_temp() would pass the error to update_temperature() > but we return, losing the next try. We would not check the temperature > again. > Some links to point you to such code where sensor has to deal with protocol and various error reasons [1][2][3] [1] https://elixir.bootlin.com/linux/latest/source/drivers/firmware/arm_scmi/sensors.c#L230 [2] https://elixir.bootlin.com/linux/latest/source/drivers/firmware/arm_scmi/driver.c#L425 [3] https://elixir.bootlin.com/linux/latest/source/drivers/firmware/arm_scmi/driver.c#L474
On 08/12/2020 15:37, Lukasz Luba wrote: > > > On 12/8/20 1:51 PM, Daniel Lezcano wrote: >> >> Hi Lukasz, >> >> On 08/12/2020 10:36, Lukasz Luba wrote: >>> Hi Daniel, >> >> [ ... ] >> >>>> static void thermal_zone_device_init(struct thermal_zone_device >>>> *tz) >>>> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >>>> thermal_zone_device *tz, >>>> if (atomic_read(&in_suspend)) >>>> return; >>>> - if (!tz->ops->get_temp) >>>> + if (update_temperature(tz)) >>>> return; >>>> - update_temperature(tz); >>>> - >>> >>> I think the patch does a bit more. Previously we continued running the >>> code below even when the thermal_zone_get_temp() returned an error (due >>> to various reasons). Now we stop and probably would not schedule next >>> polling, not calling: >>> handle_thermal_trip() and monitor_thermal_zone() >> >> I agree there is a change in the behavior. >> >>> I would left update_temperature(tz) as it was and not check the return. >>> The function thermal_zone_get_temp() can protect itself from missing >>> tz->ops->get_temp(), so we should be safe. >>> >>> What do you think? >> >> Does it make sense to handle the trip point if we are unable to read the >> temperature? >> >> The lines following the update_temperature() are: >> >> - thermal_zone_set_trips() which needs a correct tz->temperature >> >> - handle_thermal_trip() which needs a correct tz->temperature to >> compare with >> >> - monitor_thermal_zone() which needs a consistent tz->passive. This one >> is updated by the governor which is in an inconsistent state because the >> temperature is not updated. >> >> The problem I see here is how the interrupt mode and the polling mode >> are existing in the same code path. >> >> The interrupt mode can call thermal_notify_framework() for critical/hot >> trip points without being followed by a monitoring. But for the other >> trip points, the get_temp is needed. > > Yes, I agree that we can bail out when there is no .get_temp() callback > and even not schedule next polling in such case. > But I am just not sure if we can bail out and not schedule the next > polling, when there is .get_temp() populated and the driver returned > an error only at that moment, e.g. indicating some internal temporary, > issue like send queue full, so such as -EBUSY, or -EAGAIN, etc. > The thermal_zone_get_temp() would pass the error to update_temperature() > but we return, losing the next try. We would not check the temperature > again. Hmm, right. I agree with your point. What about the following changes: - Add the new APIs: thermal_zone_device_critical(struct thermal_zone_device *tz); => emergency poweroff thermal_zone_device_hot(struct thermal_zone_device *tz); => userspace notification - Add a big fat WARN when thermal_zone_device_update is called with .get_temp == NULL because that must not happen. If the .get_temp is NULL it is because we only have a HOT/CRITICAL thermal trip points where we don't care about the temperature and governor decision, right ? -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog
On 12/8/20 3:19 PM, Daniel Lezcano wrote: > On 08/12/2020 15:37, Lukasz Luba wrote: >> >> >> On 12/8/20 1:51 PM, Daniel Lezcano wrote: >>> >>> Hi Lukasz, >>> >>> On 08/12/2020 10:36, Lukasz Luba wrote: >>>> Hi Daniel, >>> >>> [ ... ] >>> >>>>> static void thermal_zone_device_init(struct thermal_zone_device >>>>> *tz) >>>>> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >>>>> thermal_zone_device *tz, >>>>> if (atomic_read(&in_suspend)) >>>>> return; >>>>> - if (!tz->ops->get_temp) >>>>> + if (update_temperature(tz)) >>>>> return; >>>>> - update_temperature(tz); >>>>> - >>>> >>>> I think the patch does a bit more. Previously we continued running the >>>> code below even when the thermal_zone_get_temp() returned an error (due >>>> to various reasons). Now we stop and probably would not schedule next >>>> polling, not calling: >>>> handle_thermal_trip() and monitor_thermal_zone() >>> >>> I agree there is a change in the behavior. >>> >>>> I would left update_temperature(tz) as it was and not check the return. >>>> The function thermal_zone_get_temp() can protect itself from missing >>>> tz->ops->get_temp(), so we should be safe. >>>> >>>> What do you think? >>> >>> Does it make sense to handle the trip point if we are unable to read the >>> temperature? >>> >>> The lines following the update_temperature() are: >>> >>> - thermal_zone_set_trips() which needs a correct tz->temperature >>> >>> - handle_thermal_trip() which needs a correct tz->temperature to >>> compare with >>> >>> - monitor_thermal_zone() which needs a consistent tz->passive. This one >>> is updated by the governor which is in an inconsistent state because the >>> temperature is not updated. >>> >>> The problem I see here is how the interrupt mode and the polling mode >>> are existing in the same code path. >>> >>> The interrupt mode can call thermal_notify_framework() for critical/hot >>> trip points without being followed by a monitoring. But for the other >>> trip points, the get_temp is needed. >> >> Yes, I agree that we can bail out when there is no .get_temp() callback >> and even not schedule next polling in such case. >> But I am just not sure if we can bail out and not schedule the next >> polling, when there is .get_temp() populated and the driver returned >> an error only at that moment, e.g. indicating some internal temporary, >> issue like send queue full, so such as -EBUSY, or -EAGAIN, etc. >> The thermal_zone_get_temp() would pass the error to update_temperature() >> but we return, losing the next try. We would not check the temperature >> again. > > Hmm, right. I agree with your point. > > What about the following changes: > > - Add the new APIs: > > thermal_zone_device_critical(struct thermal_zone_device *tz); > => emergency poweroff > > thermal_zone_device_hot(struct thermal_zone_device *tz); > => userspace notification They look promising, I have to look into the existing code. When they would be called? > > - Add a big fat WARN when thermal_zone_device_update is called with > .get_temp == NULL because that must not happen. Good idea > > If the .get_temp is NULL it is because we only have a HOT/CRITICAL > thermal trip points where we don't care about the temperature and > governor decision, right ? > That is a good question. Let me dig into the code. I would say yes - we don't have to hassle with governor in this circumstances.
On 09/12/2020 11:41, Lukasz Luba wrote: > > > On 12/8/20 3:19 PM, Daniel Lezcano wrote: >> On 08/12/2020 15:37, Lukasz Luba wrote: >>> >>> >>> On 12/8/20 1:51 PM, Daniel Lezcano wrote: >>>> >>>> Hi Lukasz, >>>> >>>> On 08/12/2020 10:36, Lukasz Luba wrote: >>>>> Hi Daniel, >>>> >>>> [ ... ] >>>> >>>>>> static void thermal_zone_device_init(struct thermal_zone_device >>>>>> *tz) >>>>>> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >>>>>> thermal_zone_device *tz, >>>>>> if (atomic_read(&in_suspend)) >>>>>> return; >>>>>> - if (!tz->ops->get_temp) >>>>>> + if (update_temperature(tz)) >>>>>> return; >>>>>> - update_temperature(tz); >>>>>> - >>>>> >>>>> I think the patch does a bit more. Previously we continued running the >>>>> code below even when the thermal_zone_get_temp() returned an error >>>>> (due >>>>> to various reasons). Now we stop and probably would not schedule next >>>>> polling, not calling: >>>>> handle_thermal_trip() and monitor_thermal_zone() >>>> >>>> I agree there is a change in the behavior. >>>> >>>>> I would left update_temperature(tz) as it was and not check the >>>>> return. >>>>> The function thermal_zone_get_temp() can protect itself from missing >>>>> tz->ops->get_temp(), so we should be safe. >>>>> >>>>> What do you think? >>>> >>>> Does it make sense to handle the trip point if we are unable to read >>>> the >>>> temperature? >>>> >>>> The lines following the update_temperature() are: >>>> >>>> - thermal_zone_set_trips() which needs a correct tz->temperature >>>> >>>> - handle_thermal_trip() which needs a correct tz->temperature to >>>> compare with >>>> >>>> - monitor_thermal_zone() which needs a consistent tz->passive. >>>> This one >>>> is updated by the governor which is in an inconsistent state because >>>> the >>>> temperature is not updated. >>>> >>>> The problem I see here is how the interrupt mode and the polling mode >>>> are existing in the same code path. >>>> >>>> The interrupt mode can call thermal_notify_framework() for critical/hot >>>> trip points without being followed by a monitoring. But for the other >>>> trip points, the get_temp is needed. >>> >>> Yes, I agree that we can bail out when there is no .get_temp() callback >>> and even not schedule next polling in such case. >>> But I am just not sure if we can bail out and not schedule the next >>> polling, when there is .get_temp() populated and the driver returned >>> an error only at that moment, e.g. indicating some internal temporary, >>> issue like send queue full, so such as -EBUSY, or -EAGAIN, etc. >>> The thermal_zone_get_temp() would pass the error to update_temperature() >>> but we return, losing the next try. We would not check the temperature >>> again. >> >> Hmm, right. I agree with your point. >> >> What about the following changes: >> >> - Add the new APIs: >> >> thermal_zone_device_critical(struct thermal_zone_device *tz); >> => emergency poweroff >> >> thermal_zone_device_hot(struct thermal_zone_device *tz); >> => userspace notification > > They look promising, I have to look into the existing code. > When they would be called? They can be called directly by the driver when there is no get_temp callback instead of calling thermal_zone_device_update, and by the usual code path via handle_critical_trip function. Also that can solve the issue [1] when registering a device which is already too hot [1] by adding the ops in the thermal zone. [1] https://lkml.org/lkml/2020/11/28/166 >> - Add a big fat WARN when thermal_zone_device_update is called with >> .get_temp == NULL because that must not happen. > > Good idea > >> >> If the .get_temp is NULL it is because we only have a HOT/CRITICAL >> thermal trip points where we don't care about the temperature and >> governor decision, right ? >> > > That is a good question. Let me dig into the code. I would say yes - we > don't have to hassle with governor in this circumstances. -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog
On 12/9/20 12:20 PM, Daniel Lezcano wrote: > On 09/12/2020 11:41, Lukasz Luba wrote: >> >> >> On 12/8/20 3:19 PM, Daniel Lezcano wrote: >>> On 08/12/2020 15:37, Lukasz Luba wrote: >>>> >>>> >>>> On 12/8/20 1:51 PM, Daniel Lezcano wrote: >>>>> >>>>> Hi Lukasz, >>>>> >>>>> On 08/12/2020 10:36, Lukasz Luba wrote: >>>>>> Hi Daniel, >>>>> >>>>> [ ... ] >>>>> >>>>>>> static void thermal_zone_device_init(struct thermal_zone_device >>>>>>> *tz) >>>>>>> @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct >>>>>>> thermal_zone_device *tz, >>>>>>> if (atomic_read(&in_suspend)) >>>>>>> return; >>>>>>> - if (!tz->ops->get_temp) >>>>>>> + if (update_temperature(tz)) >>>>>>> return; >>>>>>> - update_temperature(tz); >>>>>>> - >>>>>> >>>>>> I think the patch does a bit more. Previously we continued running the >>>>>> code below even when the thermal_zone_get_temp() returned an error >>>>>> (due >>>>>> to various reasons). Now we stop and probably would not schedule next >>>>>> polling, not calling: >>>>>> handle_thermal_trip() and monitor_thermal_zone() >>>>> >>>>> I agree there is a change in the behavior. >>>>> >>>>>> I would left update_temperature(tz) as it was and not check the >>>>>> return. >>>>>> The function thermal_zone_get_temp() can protect itself from missing >>>>>> tz->ops->get_temp(), so we should be safe. >>>>>> >>>>>> What do you think? >>>>> >>>>> Does it make sense to handle the trip point if we are unable to read >>>>> the >>>>> temperature? >>>>> >>>>> The lines following the update_temperature() are: >>>>> >>>>> - thermal_zone_set_trips() which needs a correct tz->temperature >>>>> >>>>> - handle_thermal_trip() which needs a correct tz->temperature to >>>>> compare with >>>>> >>>>> - monitor_thermal_zone() which needs a consistent tz->passive. >>>>> This one >>>>> is updated by the governor which is in an inconsistent state because >>>>> the >>>>> temperature is not updated. >>>>> >>>>> The problem I see here is how the interrupt mode and the polling mode >>>>> are existing in the same code path. >>>>> >>>>> The interrupt mode can call thermal_notify_framework() for critical/hot >>>>> trip points without being followed by a monitoring. But for the other >>>>> trip points, the get_temp is needed. >>>> >>>> Yes, I agree that we can bail out when there is no .get_temp() callback >>>> and even not schedule next polling in such case. >>>> But I am just not sure if we can bail out and not schedule the next >>>> polling, when there is .get_temp() populated and the driver returned >>>> an error only at that moment, e.g. indicating some internal temporary, >>>> issue like send queue full, so such as -EBUSY, or -EAGAIN, etc. >>>> The thermal_zone_get_temp() would pass the error to update_temperature() >>>> but we return, losing the next try. We would not check the temperature >>>> again. >>> >>> Hmm, right. I agree with your point. >>> >>> What about the following changes: >>> >>> - Add the new APIs: >>> >>> thermal_zone_device_critical(struct thermal_zone_device *tz); >>> => emergency poweroff >>> >>> thermal_zone_device_hot(struct thermal_zone_device *tz); >>> => userspace notification >> >> They look promising, I have to look into the existing code. >> When they would be called? > > They can be called directly by the driver when there is no get_temp > callback instead of calling thermal_zone_device_update, and by the usual > code path via handle_critical_trip function. > > Also that can solve the issue [1] when registering a device which is > already too hot [1] by adding the ops in the thermal zone. > > [1] https://lkml.org/lkml/2020/11/28/166 > Thank you for the link. I went through these discussions. Let me add some comment below your posted RFC.
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index 90e38cc199f4..1bd23ff2247b 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -448,17 +448,17 @@ static void handle_thermal_trip(struct thermal_zone_device *tz, int trip) monitor_thermal_zone(tz); } -static void update_temperature(struct thermal_zone_device *tz) +static int update_temperature(struct thermal_zone_device *tz) { int temp, ret; ret = thermal_zone_get_temp(tz, &temp); if (ret) { if (ret != -EAGAIN) - dev_warn(&tz->device, - "failed to read out thermal zone (%d)\n", - ret); - return; + dev_warn_once(&tz->device, + "failed to read out thermal zone (%d)\n", + ret); + return ret; } mutex_lock(&tz->lock); @@ -469,6 +469,8 @@ static void update_temperature(struct thermal_zone_device *tz) trace_thermal_temperature(tz); thermal_genl_sampling_temp(tz->id, temp); + + return 0; } static void thermal_zone_device_init(struct thermal_zone_device *tz) @@ -553,11 +555,9 @@ void thermal_zone_device_update(struct thermal_zone_device *tz, if (atomic_read(&in_suspend)) return; - if (!tz->ops->get_temp) + if (update_temperature(tz)) return; - update_temperature(tz); - thermal_zone_set_trips(tz); tz->notify_event = event;
The actual code is silently ignoring a thermal zone update when a driver is requesting it without a get_temp ops set. That looks not correct, as the caller should not have called this function if the thermal zone is unable to read the temperature. That makes the code less robust as the check won't detect the driver is inconsistently using the thermal API and that does not help to improve the framework as these circumvolutions hide the problem at the source. In order to detect the situation when it happens, let's add a warning when the update is requested without the get_temp() ops set. Any warning emitted will have to be fixed at the source of the problem: the caller must not call thermal_zone_device_update if there is not get_temp callback set. As the check is done in thermal_zone_get_temperature() via the update_temperature() function, it is pointless to have the check and the WARN in the thermal_zone_device_update() function. Just remove the check and let the next call to raise the warning. Cc: Thara Gopinath <thara.gopinath@linaro.org> Cc: Amit Kucheria <amitk@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> --- drivers/thermal/thermal_core.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) -- 2.17.1