Message ID | 20231129110853.94344-1-lukasz.luba@arm.com |
---|---|
Series | Introduce runtime modifiable Energy Model |
On 29/11/2023 12:08, Lukasz Luba wrote: > Hi all, > > This patch set adds a new feature which allows to modify Energy Model (EM) > power values at runtime. It will allow to better reflect power model of > a recent SoCs and silicon. Different characteristics of the power usage > can be leveraged and thus better decisions made during task placement in EAS. > > It's part of feature set know as Dynamic Energy Model. It has been presented > and discussed recently at OSPM2023 [3]. This patch set implements the 1st > improvement for the EM. > > The concepts: > 1. The CPU power usage can vary due to the workload that it's running or due > to the temperature of the SoC. The same workload can use more power when the > temperature of the silicon has increased (e.g. due to hot GPU or ISP). > In such situation the EM can be adjusted and reflect the fact of increased > power usage. That power increase is due to static power > (sometimes called simply: leakage). The CPUs in recent SoCs are different. > We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. > They are also built differently with High Performance (HP) cells or > Low Power (LP) cells. They are affected by the temperature increase > differently: HP cells have bigger leakage. The SW model can leverage that > knowledge. > > 2. It is also possible to change the EM to better reflect the currently > running workload. Usually the EM is derived from some average power values > taken from experiments with benchmark (e.g. Dhrystone). The model derived > from such scenario might not represent properly the workloads usually running > on the device. Therefore, runtime modification of the EM allows to switch to > a different model, when there is a need. > > 3. The EM can be adjusted after boot, when all the modules are loaded and > more information about the SoC is available e.g. chip binning. This would help > to better reflect the silicon characteristics. Thus, this EM modification > API allows it now. It wasn't possible in the past and the EM had to be > 'set in stone'. > > More detailed explanation and background can be found in presentations > during LPC2022 [1][2] or in the documentation patches. > > Some test results. > The EM can be updated to fit better the workload type. In the case below the EM > has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports > for the scheduler bits). The Jankbench was run 10 times for those two configurations, > to get more reliable data. > > 1. Janky frames percentage > +--------+-----------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+-----------------+---------------------+-------+-----------+ > | gmean | jank_percentage | EM_default | 2.0 | 0.0% | > | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% | > +--------+-----------------+---------------------+-------+-----------+ > > 2. Avg frame render time duration > +--------+---------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+---------------------+---------------------+-------+-----------+ > | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% | > | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% | > +--------+---------------------+---------------------+-------+-----------+ > > 3. 
Max frame render time duration > +--------+--------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+--------------------+---------------------+-------+-----------+ > | gmean | max_frame_duration | EM_default | 251.6 | 0.0% | > | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% | > +--------+--------------------+---------------------+-------+-----------+ > > 4. OS overutilized state percentage (when EAS is not working) > +--------------+---------------------+------+------------+------------+ > | metric | wa_path | time | total_time | percentage | > +--------------+---------------------+------+------------+------------+ > | overutilized | EM_default | 1.65 | 253.38 | 0.65 | > | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 | > +--------------+---------------------+------+------------+------------+ > > 5. All CPUs (Little+Mid+Big) power values in mW > +------------+--------+---------------------+-------+-----------+ > | channel | metric | kernel | value | perc_diff | > +------------+--------+---------------------+-------+-----------+ > | CPU | gmean | EM_default | 142.1 | 0.0% | > | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% | > +------------+--------+---------------------+-------+-----------+ > > The time cost to update the EM decreased in this v5 vs v4: > big: 5us vs 2us -> 2.6x faster > mid: 9us vs 3us -> 3x faster > little: 16us vs 16us -> no change I guess this is entirely due to the changes in em_dev_update_perf_domain()? Moving from per-OPP em_update_callback to switching the entire table (pd->runtime_table) inside em_dev_update_perf_domain()? > We still have to update the inefficiency in the cpufreq framework, thus > a bit of overhead will be there. > > Changelog: > v5: > - removed 2 tables design > - have only one table (runtime_table) used also in thermal (Wei, Rafael) Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in v5 this changed to only have one, the modifiable. IMHO it would be better to change the existing table to be modifiable rather than starting with two EM's and then removing the static one. I assume you end up with way less code changes and the patch-set will become easier to digest for reviewers. I would mention that 14/23 "PM: EM: Support late CPUs booting and capacity adjustment" is a test case for the modifiable EM built into the code changes. This relies on the table being modifiable. > - refactored update function and removed callback call for each opp > - added faster EM table swap, using only the RCU pointer update > - added memory allocation API and tracking with kref > - avoid overhead for computing 'cost' for each OPP in update, it can be > pre-computed in device drivers EM earlier > - add support for device drivers providing EM table > - added API for computing 'cost' values in EM for EAS > - added API for thermal/powercap to use EM (using RCU wrappers) > - switched to single allocation and 'state[]' array (Rafael) > - changed documentation to align with current design > - added helper API for computing cost values > - simplified EM free in unregister path (thanks to kref) > - split patch updating EM clients and changed them separetly > - added seperate patch removing old static EM table > - added EM debugfs change patch to dump the runtime_table > - addressed comments in v4 for spelling/comments/headers > - added review tags [...]
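[Editor's note: a minimal sketch of the whole-table swap idea asked about above. This is an illustration only, not the series' code: the struct layouts, the mutex and every helper name other than em_dev_update_perf_domain() are assumptions.]

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct em_perf_state {                           /* simplified to the relevant fields */
        unsigned long frequency, power, cost, performance;
};

struct em_perf_table {                           /* assumed: rcu head + kref + state[] */
        struct rcu_head rcu;
        struct kref kref;
        struct em_perf_state state[];
};

struct em_perf_domain {                          /* only the field discussed here */
        struct em_perf_table __rcu *runtime_table;
};

static DEFINE_MUTEX(em_pd_mutex);                /* assumed writer-side serialization */

static void em_table_release(struct kref *kref)  /* hypothetical release helper */
{
        struct em_perf_table *table = container_of(kref, struct em_perf_table, kref);

        kfree_rcu(table, rcu);                   /* free only after RCU readers are done */
}

/* The idea behind em_dev_update_perf_domain(): one pointer swap, no per-OPP callback. */
static void sketch_update_perf_domain(struct em_perf_domain *pd,
                                      struct em_perf_table *new_table)
{
        struct em_perf_table *old_table;

        kref_get(&new_table->kref);              /* the perf domain now holds a reference */

        mutex_lock(&em_pd_mutex);
        old_table = rcu_dereference_protected(pd->runtime_table,
                                              lockdep_is_held(&em_pd_mutex));
        rcu_assign_pointer(pd->runtime_table, new_table);
        mutex_unlock(&em_pd_mutex);

        kref_put(&old_table->kref, em_table_release);
}

Readers that dereference pd->runtime_table under rcu_read_lock() keep using the old table until they leave their critical section, which is why the update cost no longer scales with the number of OPPs.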
Hi Lukasz, On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Hi all, > > This patch set adds a new feature which allows to modify Energy Model (EM) > power values at runtime. It will allow to better reflect power model of > a recent SoCs and silicon. Different characteristics of the power usage > can be leveraged and thus better decisions made during task placement in EAS. > > It's part of feature set know as Dynamic Energy Model. It has been presented > and discussed recently at OSPM2023 [3]. This patch set implements the 1st > improvement for the EM. > > The concepts: > 1. The CPU power usage can vary due to the workload that it's running or due > to the temperature of the SoC. The same workload can use more power when the > temperature of the silicon has increased (e.g. due to hot GPU or ISP). > In such situation the EM can be adjusted and reflect the fact of increased > power usage. That power increase is due to static power > (sometimes called simply: leakage). The CPUs in recent SoCs are different. > We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. > They are also built differently with High Performance (HP) cells or > Low Power (LP) cells. They are affected by the temperature increase > differently: HP cells have bigger leakage. The SW model can leverage that > knowledge. > > 2. It is also possible to change the EM to better reflect the currently > running workload. Usually the EM is derived from some average power values > taken from experiments with benchmark (e.g. Dhrystone). The model derived > from such scenario might not represent properly the workloads usually running > on the device. Therefore, runtime modification of the EM allows to switch to > a different model, when there is a need. > > 3. The EM can be adjusted after boot, when all the modules are loaded and > more information about the SoC is available e.g. chip binning. This would help > to better reflect the silicon characteristics. Thus, this EM modification > API allows it now. It wasn't possible in the past and the EM had to be > 'set in stone'. > > More detailed explanation and background can be found in presentations > during LPC2022 [1][2] or in the documentation patches. > > Some test results. > The EM can be updated to fit better the workload type. In the case below the EM > has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports > for the scheduler bits). The Jankbench was run 10 times for those two configurations, > to get more reliable data. > > 1. Janky frames percentage > +--------+-----------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+-----------------+---------------------+-------+-----------+ > | gmean | jank_percentage | EM_default | 2.0 | 0.0% | > | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% | > +--------+-----------------+---------------------+-------+-----------+ > > 2. Avg frame render time duration > +--------+---------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+---------------------+---------------------+-------+-----------+ > | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% | > | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% | > +--------+---------------------+---------------------+-------+-----------+ > > 3. 
Max frame render time duration > +--------+--------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+--------------------+---------------------+-------+-----------+ > | gmean | max_frame_duration | EM_default | 251.6 | 0.0% | > | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% | > +--------+--------------------+---------------------+-------+-----------+ > > 4. OS overutilized state percentage (when EAS is not working) > +--------------+---------------------+------+------------+------------+ > | metric | wa_path | time | total_time | percentage | > +--------------+---------------------+------+------------+------------+ > | overutilized | EM_default | 1.65 | 253.38 | 0.65 | > | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 | > +--------------+---------------------+------+------------+------------+ > > 5. All CPUs (Little+Mid+Big) power values in mW > +------------+--------+---------------------+-------+-----------+ > | channel | metric | kernel | value | perc_diff | > +------------+--------+---------------------+-------+-----------+ > | CPU | gmean | EM_default | 142.1 | 0.0% | > | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% | > +------------+--------+---------------------+-------+-----------+ > > The time cost to update the EM decreased in this v5 vs v4: > big: 5us vs 2us -> 2.6x faster > mid: 9us vs 3us -> 3x faster > little: 16us vs 16us -> no change > > We still have to update the inefficiency in the cpufreq framework, thus > a bit of overhead will be there. > > Changelog: > v5: > - removed 2 tables design > - have only one table (runtime_table) used also in thermal (Wei, Rafael) > - refactored update function and removed callback call for each opp > - added faster EM table swap, using only the RCU pointer update > - added memory allocation API and tracking with kref > - avoid overhead for computing 'cost' for each OPP in update, it can be > pre-computed in device drivers EM earlier > - add support for device drivers providing EM table > - added API for computing 'cost' values in EM for EAS > - added API for thermal/powercap to use EM (using RCU wrappers) > - switched to single allocation and 'state[]' array (Rafael) > - changed documentation to align with current design > - added helper API for computing cost values > - simplified EM free in unregister path (thanks to kref) > - split patch updating EM clients and changed them separetly > - added seperate patch removing old static EM table > - added EM debugfs change patch to dump the runtime_table > - addressed comments in v4 for spelling/comments/headers > - added review tags I like this one more than the previous one and thanks for taking my feedback into account. I would still like other people having a vested interest in the EM to look at it and give feedback (or just tags), so I'm not inclined to apply it just yet. However, I don't have any specific comments on it.
Hi Dietmar, Thank you for the review, I will go one-by-one to respond your comments in patches as well. First comments are below. On 12/12/23 18:48, Dietmar Eggemann wrote: > On 29/11/2023 12:08, Lukasz Luba wrote: >> Hi all, >> >> This patch set adds a new feature which allows to modify Energy Model (EM) >> power values at runtime. It will allow to better reflect power model of >> a recent SoCs and silicon. Different characteristics of the power usage >> can be leveraged and thus better decisions made during task placement in EAS. >> >> It's part of feature set know as Dynamic Energy Model. It has been presented >> and discussed recently at OSPM2023 [3]. This patch set implements the 1st >> improvement for the EM. >> >> The concepts: >> 1. The CPU power usage can vary due to the workload that it's running or due >> to the temperature of the SoC. The same workload can use more power when the >> temperature of the silicon has increased (e.g. due to hot GPU or ISP). >> In such situation the EM can be adjusted and reflect the fact of increased >> power usage. That power increase is due to static power >> (sometimes called simply: leakage). The CPUs in recent SoCs are different. >> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. >> They are also built differently with High Performance (HP) cells or >> Low Power (LP) cells. They are affected by the temperature increase >> differently: HP cells have bigger leakage. The SW model can leverage that >> knowledge. >> >> 2. It is also possible to change the EM to better reflect the currently >> running workload. Usually the EM is derived from some average power values >> taken from experiments with benchmark (e.g. Dhrystone). The model derived >> from such scenario might not represent properly the workloads usually running >> on the device. Therefore, runtime modification of the EM allows to switch to >> a different model, when there is a need. >> >> 3. The EM can be adjusted after boot, when all the modules are loaded and >> more information about the SoC is available e.g. chip binning. This would help >> to better reflect the silicon characteristics. Thus, this EM modification >> API allows it now. It wasn't possible in the past and the EM had to be >> 'set in stone'. >> >> More detailed explanation and background can be found in presentations >> during LPC2022 [1][2] or in the documentation patches. >> >> Some test results. >> The EM can be updated to fit better the workload type. In the case below the EM >> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports >> for the scheduler bits). The Jankbench was run 10 times for those two configurations, >> to get more reliable data. >> >> 1. Janky frames percentage >> +--------+-----------------+---------------------+-------+-----------+ >> | metric | variable | kernel | value | perc_diff | >> +--------+-----------------+---------------------+-------+-----------+ >> | gmean | jank_percentage | EM_default | 2.0 | 0.0% | >> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% | >> +--------+-----------------+---------------------+-------+-----------+ >> >> 2. 
Avg frame render time duration >> +--------+---------------------+---------------------+-------+-----------+ >> | metric | variable | kernel | value | perc_diff | >> +--------+---------------------+---------------------+-------+-----------+ >> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% | >> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% | >> +--------+---------------------+---------------------+-------+-----------+ >> >> 3. Max frame render time duration >> +--------+--------------------+---------------------+-------+-----------+ >> | metric | variable | kernel | value | perc_diff | >> +--------+--------------------+---------------------+-------+-----------+ >> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% | >> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% | >> +--------+--------------------+---------------------+-------+-----------+ >> >> 4. OS overutilized state percentage (when EAS is not working) >> +--------------+---------------------+------+------------+------------+ >> | metric | wa_path | time | total_time | percentage | >> +--------------+---------------------+------+------------+------------+ >> | overutilized | EM_default | 1.65 | 253.38 | 0.65 | >> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 | >> +--------------+---------------------+------+------------+------------+ >> >> 5. All CPUs (Little+Mid+Big) power values in mW >> +------------+--------+---------------------+-------+-----------+ >> | channel | metric | kernel | value | perc_diff | >> +------------+--------+---------------------+-------+-----------+ >> | CPU | gmean | EM_default | 142.1 | 0.0% | >> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% | >> +------------+--------+---------------------+-------+-----------+ >> >> The time cost to update the EM decreased in this v5 vs v4: >> big: 5us vs 2us -> 2.6x faster >> mid: 9us vs 3us -> 3x faster >> little: 16us vs 16us -> no change > > I guess this is entirely due to the changes in > em_dev_update_perf_domain()? Moving from per-OPP em_update_callback to > switching the entire table (pd->runtime_table) inside > em_dev_update_perf_domain()? Yes correct, it's due to that design change. > >> We still have to update the inefficiency in the cpufreq framework, thus >> a bit of overhead will be there. >> >> Changelog: >> v5: >> - removed 2 tables design >> - have only one table (runtime_table) used also in thermal (Wei, Rafael) > > Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in > v5 this changed to only have one, the modifiable. IMHO it would be > better to change the existing table to be modifiable rather than staring > with two EM's and then removing the static one. I assume you end up with > way less code changes and the patch-set will become easier to digest for > reviewers. The patches are structured in this way following Daniel's recommendation I got when I was adding similar big changes to EM in 2020 (support all devices in kernel). The approach is as follows: 0. Do some basic clean-up/refactoring if needed for a new feature, to re-use some code if possible in future 1. Introduce new feature next to the existing one 2. Add API and all needed infrastructure (structures, fields) for drivers 3. 
Re-wire the existing drivers/frameworks to the new feature via the new API; ideally keep 1 patch per driver so the maintainer can easily grasp the changes and ACK it, because it will go via a different tree (Rafael's tree); in case of some code clash in the driver's code during merge - it will be a single driver so easier to handle 4. When all drivers and frameworks are wired up with the new feature, remove the old feature (structures, fields, APIs, etc) 5. Update the documentation with the latest state of the design In this approach the patches are less convoluted. If I removed the old feature and added the new one in a single patch (e.g. the main structure), that patch would have to modify all drivers to keep them compiling. It would be a big, messy patch for this re-design. I can see in a later comment from Rafael that he is OK with the current patch set structure. > > I would mention that 14/23 "PM: EM: Support late CPUs booting and > capacity adjustment" is a test case for the modifiable EM built into > the code changes. This relies on the table being modifiable. > Correct, that's the 1st user of the runtime modifiable EM, which is actually also built-in. I could add that to the cover letter. Regards, Lukasz
Hi Rafael, Thank you for having a loot at the series. On 12/12/23 18:49, Rafael J. Wysocki wrote: > Hi Lukasz, > > On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> Hi all, >> >> This patch set adds a new feature which allows to modify Energy Model (EM) >> power values at runtime. It will allow to better reflect power model of >> a recent SoCs and silicon. Different characteristics of the power usage >> can be leveraged and thus better decisions made during task placement in EAS. >> >> It's part of feature set know as Dynamic Energy Model. It has been presented >> and discussed recently at OSPM2023 [3]. This patch set implements the 1st >> improvement for the EM. >> >> The concepts: >> 1. The CPU power usage can vary due to the workload that it's running or due >> to the temperature of the SoC. The same workload can use more power when the >> temperature of the silicon has increased (e.g. due to hot GPU or ISP). >> In such situation the EM can be adjusted and reflect the fact of increased >> power usage. That power increase is due to static power >> (sometimes called simply: leakage). The CPUs in recent SoCs are different. >> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. >> They are also built differently with High Performance (HP) cells or >> Low Power (LP) cells. They are affected by the temperature increase >> differently: HP cells have bigger leakage. The SW model can leverage that >> knowledge. >> >> 2. It is also possible to change the EM to better reflect the currently >> running workload. Usually the EM is derived from some average power values >> taken from experiments with benchmark (e.g. Dhrystone). The model derived >> from such scenario might not represent properly the workloads usually running >> on the device. Therefore, runtime modification of the EM allows to switch to >> a different model, when there is a need. >> >> 3. The EM can be adjusted after boot, when all the modules are loaded and >> more information about the SoC is available e.g. chip binning. This would help >> to better reflect the silicon characteristics. Thus, this EM modification >> API allows it now. It wasn't possible in the past and the EM had to be >> 'set in stone'. >> >> More detailed explanation and background can be found in presentations >> during LPC2022 [1][2] or in the documentation patches. >> >> Some test results. >> The EM can be updated to fit better the workload type. In the case below the EM >> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports >> for the scheduler bits). The Jankbench was run 10 times for those two configurations, >> to get more reliable data. >> >> 1. Janky frames percentage >> +--------+-----------------+---------------------+-------+-----------+ >> | metric | variable | kernel | value | perc_diff | >> +--------+-----------------+---------------------+-------+-----------+ >> | gmean | jank_percentage | EM_default | 2.0 | 0.0% | >> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% | >> +--------+-----------------+---------------------+-------+-----------+ >> >> 2. 
Avg frame render time duration >> +--------+---------------------+---------------------+-------+-----------+ >> | metric | variable | kernel | value | perc_diff | >> +--------+---------------------+---------------------+-------+-----------+ >> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% | >> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% | >> +--------+---------------------+---------------------+-------+-----------+ >> >> 3. Max frame render time duration >> +--------+--------------------+---------------------+-------+-----------+ >> | metric | variable | kernel | value | perc_diff | >> +--------+--------------------+---------------------+-------+-----------+ >> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% | >> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% | >> +--------+--------------------+---------------------+-------+-----------+ >> >> 4. OS overutilized state percentage (when EAS is not working) >> +--------------+---------------------+------+------------+------------+ >> | metric | wa_path | time | total_time | percentage | >> +--------------+---------------------+------+------------+------------+ >> | overutilized | EM_default | 1.65 | 253.38 | 0.65 | >> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 | >> +--------------+---------------------+------+------------+------------+ >> >> 5. All CPUs (Little+Mid+Big) power values in mW >> +------------+--------+---------------------+-------+-----------+ >> | channel | metric | kernel | value | perc_diff | >> +------------+--------+---------------------+-------+-----------+ >> | CPU | gmean | EM_default | 142.1 | 0.0% | >> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% | >> +------------+--------+---------------------+-------+-----------+ >> >> The time cost to update the EM decreased in this v5 vs v4: >> big: 5us vs 2us -> 2.6x faster >> mid: 9us vs 3us -> 3x faster >> little: 16us vs 16us -> no change >> >> We still have to update the inefficiency in the cpufreq framework, thus >> a bit of overhead will be there. >> >> Changelog: >> v5: >> - removed 2 tables design >> - have only one table (runtime_table) used also in thermal (Wei, Rafael) >> - refactored update function and removed callback call for each opp >> - added faster EM table swap, using only the RCU pointer update >> - added memory allocation API and tracking with kref >> - avoid overhead for computing 'cost' for each OPP in update, it can be >> pre-computed in device drivers EM earlier >> - add support for device drivers providing EM table >> - added API for computing 'cost' values in EM for EAS >> - added API for thermal/powercap to use EM (using RCU wrappers) >> - switched to single allocation and 'state[]' array (Rafael) >> - changed documentation to align with current design >> - added helper API for computing cost values >> - simplified EM free in unregister path (thanks to kref) >> - split patch updating EM clients and changed them separetly >> - added seperate patch removing old static EM table >> - added EM debugfs change patch to dump the runtime_table >> - addressed comments in v4 for spelling/comments/headers >> - added review tags > > I like this one more than the previous one and thanks for taking my > feedback into account. > > I would still like other people having a vested interest in the EM to > look at it and give feedback (or just tags), so I'm not inclined to > apply it just yet. However, I don't have any specific comments on it. 
Let me contact offline some of the partners who were keen to have this in mainline (when I presented a first implementation in 2021 on the Android kernel review system). Regards, Lukasz
On 13/12/2023 10:23, Lukasz Luba wrote: > Hi Dietmar, > > Thank you for the review, I will go one-by-one to respond > your comments in patches as well. First comments are below. > > On 12/12/23 18:48, Dietmar Eggemann wrote: >> On 29/11/2023 12:08, Lukasz Luba wrote: [...] >>> Changelog: >>> v5: >>> - removed 2 tables design >>> - have only one table (runtime_table) used also in thermal (Wei, Rafael) >> >> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in >> v5 this changed to only have one, the modifiable. IMHO it would be >> better to change the existing table to be modifiable rather than staring >> with two EM's and then removing the static one. I assume you end up with >> way less code changes and the patch-set will become easier to digest for >> reviewers. > > The patches are structured in this way following Daniel's recommendation > I got when I was adding similar big changes to EM in 2020 (support all > devices in kernel). The approach is as follows: > 0. Do some basic clean-up/refactoring if needed for a new feature, to > re-use some code if possible in future > 1. Introduce new feature next to the existing one > 2. Add API and all needed infrastructure (structures, fields) for > drivers > 3. Re-wire the existing drivers/frameworks to the new feature via new > API; ideally keep 1 patch per driver so the maintainer can easily > grasp the changes and ACK it, because it will go via different tree > (Rafael's tree); in case of some code clash in the driver's code > during merge - it will be a single driver so easier to handle > 4. when all drivers and frameworks are wired up with the new feature > remove the old feature (structures, fields, APIs, etc) > 5. Update the documentation with new latest state of desing > > In this approach the patches are less convoluted. Because if I remove > the old feature and add new in a single patch (e.g. the main structure) > that patch will have to modify all drivers to still compile. It > would be a big messy patch for this re-design. > > I can see in some later comment from Rafael that he is OK with current > patch set structure. OK, in case Rafael and Daniel prefer this, then it's fine. I just find it weird that we now have 70 struct em_perf_domain { 71 struct em_perf_table __rcu *runtime_table; ^^^^^^^^^^^^^ as the only EM table.
On Wed, Dec 13, 2023 at 12:34 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 13/12/2023 10:23, Lukasz Luba wrote: > > Hi Dietmar, > > > > Thank you for the review, I will go one-by-one to respond > > your comments in patches as well. First comments are below. > > > > On 12/12/23 18:48, Dietmar Eggemann wrote: > >> On 29/11/2023 12:08, Lukasz Luba wrote: > > [...] > > >>> Changelog: > >>> v5: > >>> - removed 2 tables design > >>> - have only one table (runtime_table) used also in thermal (Wei, Rafael) > >> > >> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in > >> v5 this changed to only have one, the modifiable. IMHO it would be > >> better to change the existing table to be modifiable rather than staring > >> with two EM's and then removing the static one. I assume you end up with > >> way less code changes and the patch-set will become easier to digest for > >> reviewers. > > > > The patches are structured in this way following Daniel's recommendation > > I got when I was adding similar big changes to EM in 2020 (support all > > devices in kernel). The approach is as follows: > > 0. Do some basic clean-up/refactoring if needed for a new feature, to > > re-use some code if possible in future > > 1. Introduce new feature next to the existing one > > 2. Add API and all needed infrastructure (structures, fields) for > > drivers > > 3. Re-wire the existing drivers/frameworks to the new feature via new > > API; ideally keep 1 patch per driver so the maintainer can easily > > grasp the changes and ACK it, because it will go via different tree > > (Rafael's tree); in case of some code clash in the driver's code > > during merge - it will be a single driver so easier to handle > > 4. when all drivers and frameworks are wired up with the new feature > > remove the old feature (structures, fields, APIs, etc) > > 5. Update the documentation with new latest state of desing > > > > In this approach the patches are less convoluted. Because if I remove > > the old feature and add new in a single patch (e.g. the main structure) > > that patch will have to modify all drivers to still compile. It > > would be a big messy patch for this re-design. > > > > I can see in some later comment from Rafael that he is OK with current > > patch set structure. > > OK, in case Rafael and Daniel prefer this, then it's fine. > > I just find it weird that we now have > > 70 struct em_perf_domain { > 71 struct em_perf_table __rcu *runtime_table; > ^^^^^^^^^^^^^ > > as the only EM table. I agree that it would be better to call it something like em_table.
On 12/13/23 11:45, Rafael J. Wysocki wrote: > On Wed, Dec 13, 2023 at 12:34 PM Dietmar Eggemann > <dietmar.eggemann@arm.com> wrote: >> >> On 13/12/2023 10:23, Lukasz Luba wrote: >>> Hi Dietmar, >>> >>> Thank you for the review, I will go one-by-one to respond >>> your comments in patches as well. First comments are below. >>> >>> On 12/12/23 18:48, Dietmar Eggemann wrote: >>>> On 29/11/2023 12:08, Lukasz Luba wrote: >> >> [...] >> >>>>> Changelog: >>>>> v5: >>>>> - removed 2 tables design >>>>> - have only one table (runtime_table) used also in thermal (Wei, Rafael) >>>> >>>> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in >>>> v5 this changed to only have one, the modifiable. IMHO it would be >>>> better to change the existing table to be modifiable rather than staring >>>> with two EM's and then removing the static one. I assume you end up with >>>> way less code changes and the patch-set will become easier to digest for >>>> reviewers. >>> >>> The patches are structured in this way following Daniel's recommendation >>> I got when I was adding similar big changes to EM in 2020 (support all >>> devices in kernel). The approach is as follows: >>> 0. Do some basic clean-up/refactoring if needed for a new feature, to >>> re-use some code if possible in future >>> 1. Introduce new feature next to the existing one >>> 2. Add API and all needed infrastructure (structures, fields) for >>> drivers >>> 3. Re-wire the existing drivers/frameworks to the new feature via new >>> API; ideally keep 1 patch per driver so the maintainer can easily >>> grasp the changes and ACK it, because it will go via different tree >>> (Rafael's tree); in case of some code clash in the driver's code >>> during merge - it will be a single driver so easier to handle >>> 4. when all drivers and frameworks are wired up with the new feature >>> remove the old feature (structures, fields, APIs, etc) >>> 5. Update the documentation with new latest state of desing >>> >>> In this approach the patches are less convoluted. Because if I remove >>> the old feature and add new in a single patch (e.g. the main structure) >>> that patch will have to modify all drivers to still compile. It >>> would be a big messy patch for this re-design. >>> >>> I can see in some later comment from Rafael that he is OK with current >>> patch set structure. >> >> OK, in case Rafael and Daniel prefer this, then it's fine. >> >> I just find it weird that we now have >> >> 70 struct em_perf_domain { >> 71 struct em_perf_table __rcu *runtime_table; >> ^^^^^^^^^^^^^ >> >> as the only EM table. > > I agree that it would be better to call it something like em_table. > OK, I'll change that. Thanks Rafael and Dietmar!
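[Editor's note: for reference, a minimal sketch of the field after the rename agreed above; the rest of the real struct is omitted and the final naming is up to the next version of the series.]

struct em_perf_domain {
        struct em_perf_table __rcu *em_table;    /* previously: runtime_table */
        /* other members of the real struct omitted */
};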
Hi Abhijeet, It's been a while when we discussed an EM feature presented on some Android common kernel Gerrit (Nov 2021). On 11/29/23 11:08, Lukasz Luba wrote: > Hi all, > > This patch set adds a new feature which allows to modify Energy Model (EM) > power values at runtime. It will allow to better reflect power model of > a recent SoCs and silicon. Different characteristics of the power usage > can be leveraged and thus better decisions made during task placement in EAS. > > It's part of feature set know as Dynamic Energy Model. It has been presented > and discussed recently at OSPM2023 [3]. This patch set implements the 1st > improvement for the EM. > > The concepts: > 1. The CPU power usage can vary due to the workload that it's running or due > to the temperature of the SoC. The same workload can use more power when the > temperature of the silicon has increased (e.g. due to hot GPU or ISP). > In such situation the EM can be adjusted and reflect the fact of increased > power usage. That power increase is due to static power > (sometimes called simply: leakage). The CPUs in recent SoCs are different. > We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. > They are also built differently with High Performance (HP) cells or > Low Power (LP) cells. They are affected by the temperature increase > differently: HP cells have bigger leakage. The SW model can leverage that > knowledge. > > 2. It is also possible to change the EM to better reflect the currently > running workload. Usually the EM is derived from some average power values > taken from experiments with benchmark (e.g. Dhrystone). The model derived > from such scenario might not represent properly the workloads usually running > on the device. Therefore, runtime modification of the EM allows to switch to > a different model, when there is a need. > > 3. The EM can be adjusted after boot, when all the modules are loaded and > more information about the SoC is available e.g. chip binning. This would help > to better reflect the silicon characteristics. Thus, this EM modification > API allows it now. It wasn't possible in the past and the EM had to be > 'set in stone'. > > More detailed explanation and background can be found in presentations > during LPC2022 [1][2] or in the documentation patches. > > Some test results. > The EM can be updated to fit better the workload type. In the case below the EM > has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports > for the scheduler bits). The Jankbench was run 10 times for those two configurations, > to get more reliable data. > > 1. Janky frames percentage > +--------+-----------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+-----------------+---------------------+-------+-----------+ > | gmean | jank_percentage | EM_default | 2.0 | 0.0% | > | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% | > +--------+-----------------+---------------------+-------+-----------+ > > 2. Avg frame render time duration > +--------+---------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+---------------------+---------------------+-------+-----------+ > | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% | > | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% | > +--------+---------------------+---------------------+-------+-----------+ > > 3. 
Max frame render time duration > +--------+--------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+--------------------+---------------------+-------+-----------+ > | gmean | max_frame_duration | EM_default | 251.6 | 0.0% | > | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% | > +--------+--------------------+---------------------+-------+-----------+ > > 4. OS overutilized state percentage (when EAS is not working) > +--------------+---------------------+------+------------+------------+ > | metric | wa_path | time | total_time | percentage | > +--------------+---------------------+------+------------+------------+ > | overutilized | EM_default | 1.65 | 253.38 | 0.65 | > | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 | > +--------------+---------------------+------+------------+------------+ > > 5. All CPUs (Little+Mid+Big) power values in mW > +------------+--------+---------------------+-------+-----------+ > | channel | metric | kernel | value | perc_diff | > +------------+--------+---------------------+-------+-----------+ > | CPU | gmean | EM_default | 142.1 | 0.0% | > | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% | > +------------+--------+---------------------+-------+-----------+ > > The time cost to update the EM decreased in this v5 vs v4: > big: 5us vs 2us -> 2.6x faster > mid: 9us vs 3us -> 3x faster > little: 16us vs 16us -> no change > > We still have to update the inefficiency in the cpufreq framework, thus > a bit of overhead will be there. > > Changelog: > v5: > - removed 2 tables design > - have only one table (runtime_table) used also in thermal (Wei, Rafael) > - refactored update function and removed callback call for each opp > - added faster EM table swap, using only the RCU pointer update > - added memory allocation API and tracking with kref > - avoid overhead for computing 'cost' for each OPP in update, it can be > pre-computed in device drivers EM earlier > - add support for device drivers providing EM table > - added API for computing 'cost' values in EM for EAS > - added API for thermal/powercap to use EM (using RCU wrappers) > - switched to single allocation and 'state[]' array (Rafael) > - changed documentation to align with current design > - added helper API for computing cost values > - simplified EM free in unregister path (thanks to kref) > - split patch updating EM clients and changed them separetly > - added seperate patch removing old static EM table > - added EM debugfs change patch to dump the runtime_table > - addressed comments in v4 for spelling/comments/headers > - added review tags > v4 changes are here [4] > > Regards, > Lukasz Luba > > [1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf > [2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf > [3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H > [4] https://lore.kernel.org/lkml/20230925081139.1305766-1-lukasz.luba@arm.com/ > > > Lukasz Luba (23): > PM: EM: Add missing newline for the message log > PM: EM: Refactor em_cpufreq_update_efficiencies() arguments > PM: EM: Find first CPU active while updating OPP efficiency > PM: EM: Refactor em_pd_get_efficient_state() to be more flexible > PM: EM: Refactor a new function em_compute_costs() > PM: EM: Check if the get_cost() callback is present in > em_compute_costs() > PM: EM: 
Refactor how the EM table is allocated and populated > PM: EM: Introduce runtime modifiable table > PM: EM: Use runtime modified EM for CPUs energy estimation in EAS > PM: EM: Add API for memory allocations for new tables > PM: EM: Add API for updating the runtime modifiable EM > PM: EM: Add helpers to read under RCU lock the EM table > PM: EM: Add performance field to struct em_perf_state > PM: EM: Support late CPUs booting and capacity adjustment > PM: EM: Optimize em_cpu_energy() and remove division > powercap/dtpm_cpu: Use new Energy Model interface to get table > powercap/dtpm_devfreq: Use new Energy Model interface to get table > drivers/thermal/cpufreq_cooling: Use new Energy Model interface > drivers/thermal/devfreq_cooling: Use new Energy Model interface > PM: EM: Change debugfs configuration to use runtime EM table data > PM: EM: Remove old table > PM: EM: Add em_dev_compute_costs() as API for device drivers > Documentation: EM: Update with runtime modification design > > Documentation/power/energy-model.rst | 206 +++++++++++- > drivers/powercap/dtpm_cpu.c | 35 +- > drivers/powercap/dtpm_devfreq.c | 31 +- > drivers/thermal/cpufreq_cooling.c | 40 ++- > drivers/thermal/devfreq_cooling.c | 43 ++- > include/linux/energy_model.h | 163 +++++---- > kernel/power/energy_model.c | 479 +++++++++++++++++++++++---- > 7 files changed, 813 insertions(+), 184 deletions(-) > You were interested in this feature back then. I have a gentle ask, if you are still interested. It would be nice if you (or some other Qcom engineer) could leave a feedback comment (similar to the one you made for the original Gerrit series). I will be really grateful. In this cover letter there are some power saving numbers from a real phone, along with performance metrics (janky frames). You might be interested in those scenarios as well. Regards, Lukasz
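[Editor's note: an illustration of how a client such as one of the cooling devices listed above might read the table under RCU. The series adds dedicated helpers for this ("PM: EM: Add helpers to read under RCU lock the EM table"); the direct dereference below is only an assumption about the shape of that API, reusing the simplified types sketched earlier in the thread.]

static unsigned long sketch_read_state_power(struct em_perf_domain *pd, int i)
{
        struct em_perf_table *table;
        unsigned long power;

        rcu_read_lock();
        table = rcu_dereference(pd->runtime_table);
        power = table->state[i].power;           /* e.g. for a thermal/powercap power request */
        rcu_read_unlock();

        return power;
}

The point of the RCU wrappers is that such readers never block the writer performing the table swap and never observe a half-updated table.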
Hi Rafael, On 12/12/2023 18:49, Rafael J. Wysocki wrote: > Hi Lukasz, > > On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> [...] > > I like this one more than the previous one and thanks for taking my > feedback into account. > > I would still like other people having a vested interest in the EM to > look at it and give feedback (or just tags), so I'm not inclined to > apply it just yet. However, I don't have any specific comments on it. I do have a keen interest in this series, but mostly from the point of view of uclamp. Currently uclamp is able to send hints to the scheduler to bias task placement. Some CPU cores are known to have very different energy efficiency depending on the task. We know these tasks beforehand and can use uclamp to bias to certain CPUs which we know are more efficient for them. Personally I've always been wondering if this could just be reflected in the EM itself without emphasizing the task placement aspect of uclamp. The idea of this series LGTM and I'll take a deeper look. Hongyan
Hi Lukasz On 11/29/23 11:08, Lukasz Luba wrote: > Hi all, > > This patch set adds a new feature which allows to modify Energy Model (EM) > power values at runtime. It will allow to better reflect power model of > a recent SoCs and silicon. Different characteristics of the power usage > can be leveraged and thus better decisions made during task placement in EAS. > > It's part of feature set know as Dynamic Energy Model. It has been presented > and discussed recently at OSPM2023 [3]. This patch set implements the 1st > improvement for the EM. Thanks. The problem of EM accuracy has been observed in the field and would be nice to have a mainline solution for it. We carry our own out-of-tree change to enable modifying the EM. > > The concepts: > 1. The CPU power usage can vary due to the workload that it's running or due > to the temperature of the SoC. The same workload can use more power when the > temperature of the silicon has increased (e.g. due to hot GPU or ISP). > In such situation the EM can be adjusted and reflect the fact of increased > power usage. That power increase is due to static power > (sometimes called simply: leakage). The CPUs in recent SoCs are different. > We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. > They are also built differently with High Performance (HP) cells or > Low Power (LP) cells. They are affected by the temperature increase > differently: HP cells have bigger leakage. The SW model can leverage that > knowledge. One thing I'm not sure about is that in practice temperature of the SoC can vary a lot in a short period of time. What is the expectation here? I can see this useful in practice only if we average it over a window of time. Following it will be really hard. Big variations can happen in few ms scales. Driver interface for this part makes sense; as thermal framework will likely to know how feed things back to EM table, if necessary. > > 2. It is also possible to change the EM to better reflect the currently > running workload. Usually the EM is derived from some average power values > taken from experiments with benchmark (e.g. Dhrystone). The model derived > from such scenario might not represent properly the workloads usually running > on the device. Therefore, runtime modification of the EM allows to switch to > a different model, when there is a need. I didn't get how the new performance field is supposed to be controlled and modified by users. A driver interface doesn't seem suitable as there's no subsystem that knows the characteristic of the workload except userspace. In Android we do have contextual info about what the current top-app to enable modifying the capacities to match its characteristics. > > 3. The EM can be adjusted after boot, when all the modules are loaded and > more information about the SoC is available e.g. chip binning. This would help > to better reflect the silicon characteristics. Thus, this EM modification > API allows it now. It wasn't possible in the past and the EM had to be > 'set in stone'. > > More detailed explanation and background can be found in presentations > during LPC2022 [1][2] or in the documentation patches. > > Some test results. > The EM can be updated to fit better the workload type. In the case below the EM > has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports > for the scheduler bits). The Jankbench was run 10 times for those two configurations, > to get more reliable data. > > 1. 
Janky frames percentage > +--------+-----------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+-----------------+---------------------+-------+-----------+ > | gmean | jank_percentage | EM_default | 2.0 | 0.0% | > | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% | > +--------+-----------------+---------------------+-------+-----------+ > > 2. Avg frame render time duration > +--------+---------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+---------------------+---------------------+-------+-----------+ > | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% | > | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% | > +--------+---------------------+---------------------+-------+-----------+ > > 3. Max frame render time duration > +--------+--------------------+---------------------+-------+-----------+ > | metric | variable | kernel | value | perc_diff | > +--------+--------------------+---------------------+-------+-----------+ > | gmean | max_frame_duration | EM_default | 251.6 | 0.0% | > | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% | > +--------+--------------------+---------------------+-------+-----------+ > > 4. OS overutilized state percentage (when EAS is not working) > +--------------+---------------------+------+------------+------------+ > | metric | wa_path | time | total_time | percentage | > +--------------+---------------------+------+------------+------------+ > | overutilized | EM_default | 1.65 | 253.38 | 0.65 | > | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 | > +--------------+---------------------+------+------------+------------+ > > 5. All CPUs (Little+Mid+Big) power values in mW > +------------+--------+---------------------+-------+-----------+ > | channel | metric | kernel | value | perc_diff | > +------------+--------+---------------------+-------+-----------+ > | CPU | gmean | EM_default | 142.1 | 0.0% | > | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% | > +------------+--------+---------------------+-------+-----------+ How did you modify the EM here? Did you change both performance and power fields? How did you calculate the new ones? Did you try to simulate any heating effect during the run if you're taking temperature into account to modify the power? What was the variation like and at what rate was the EM being updated in this case? I think Jankbench in general wouldn't stress the SoC enough. It'd be insightful to look at frequency residencies between the two runs and power breakdown for each cluster if you have access to them. No worries if not! My brain started to fail me somewhere around patch 15. I'll have another look some time later in the week but generally looks good to me. If I have any worries it is about how it can be used with the provided interfaces. Especially expectations about managing fast thermal changes at the level you're targeting. Thanks! -- Qais Yousef
Hi Qais,

On 12/17/23 18:22, Qais Yousef wrote:
> Hi Lukasz
>
> On 11/29/23 11:08, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM) power values at runtime. It will allow to better reflect power model of a recent SoCs and silicon. Different characteristics of the power usage can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented and discussed recently at OSPM2023 [3]. This patch set implements the 1st improvement for the EM.
>
> Thanks. The problem of EM accuracy has been observed in the field and would be nice to have a mainline solution for it. We carry our own out-of-tree change to enable modifying the EM.

Thanks for that statement here.

>>
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due to the temperature of the SoC. The same workload can use more power when the temperature of the silicon has increased (e.g. due to hot GPU or ISP). In such situation the EM can be adjusted and reflect the fact of increased power usage. That power increase is due to static power (sometimes called simply: leakage). The CPUs in recent SoCs are different. We have heterogeneous SoCs with 3 (or even 4) different microarchitectures. They are also built differently with High Performance (HP) cells or Low Power (LP) cells. They are affected by the temperature increase differently: HP cells have bigger leakage. The SW model can leverage that knowledge.
>
> One thing I'm not sure about is that in practice temperature of the SoC can vary a lot in a short period of time. What is the expectation here? I can see this useful in practice only if we average it over a window of time. Following it will be really hard. Big variations can happen in few ms scales.

It's mostly for long running heavy workloads, which involve other devices than CPUs, e.g. GPU or ISP (Image Signal Processor). Those devices can heat up the SoC. In our game DrArm running on pixel6 the GPU uses 75-77% of the total power budget (starting from ~2.5W for the GPU + 1.3W for all CPUs). That 2.5W from the GPU is heating up the CPUs and mostly impacts the Big cores, which are made from High-Performance cells (thus leaking more). OverUtilization in the first 4-5min of gaming is ~4-9%, so EAS can work and save some power, if it has a good model. Later we have thermal throttling and OU goes to ~50%, but EAS still can work. If the model is more precise - i.e. adjusted for the rising leakage due to the temperature increase (caused by the GPU power) - then we can still make better use of that power budget and not waste it on leakage at higher OPPs.

> Driver interface for this part makes sense; as thermal framework will likely to know how feed things back to EM table, if necessary.

The thermal framework, or I would rather say a smarter thermal-dedicated driver which has a built-in power model and access to the sensor data. In this way it can provide an adjusted power model to the EM dynamically. It will also calculate the efficiency (the 'cost' field).

>>
>> 2. It is also possible to change the EM to better reflect the currently running workload. Usually the EM is derived from some average power values taken from experiments with benchmark (e.g. Dhrystone). The model derived from such scenario might not represent properly the workloads usually running on the device. Therefore, runtime modification of the EM allows to switch to a different model, when there is a need.
>
> I didn't get how the new performance field is supposed to be controlled and modified by users. A driver interface doesn't seem suitable as there's no subsystem that knows the characteristic of the workload except userspace. In Android we do have contextual info about what the current top-app to enable modifying the capacities to match its characteristics.

Well, in the latest public documentation (May 2023) for Cortex-X4 there are new Arm core features described: PDP and MPMM, which can change the 'performance' of the core in FW. Our SCMI kernel subsystem will get an interrupt, so the drivers can know about it. It could be used for recalculating the efficiency of the CPUs in the EM. When there is no hotplug and the long running app is still running, that FW policy would be reflected in the EM. It's just not done all in one step. Those patches will come later.

Second, I have used that 'performance' field to finally get rid of this runtime division in em_cpu_energy() hot path - which was annoying me for a very long time. It wasn't possible to optimize that last operation there, because not all CPUs have booted and the final CPU capacity is not known when we register EMs. With this feature I can finally remove that heavy operation. You can see more in that patch 15/23.

>>
>> 3. The EM can be adjusted after boot, when all the modules are loaded and more information about the SoC is available e.g. chip binning. This would help to better reflect the silicon characteristics. Thus, this EM modification API allows it now. It wasn't possible in the past and the EM had to be 'set in stone'.
>>
>> More detailed explanation and background can be found in presentations during LPC2022 [1][2] or in the documentation patches.
>>
>> Some test results.
>> The EM can be updated to fit better the workload type. In the case below the EM has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports for the scheduler bits). The Jankbench was run 10 times for those two configurations, to get more reliable data.
>>
>> 1. Janky frames percentage
>> +--------+-----------------+---------------------+-------+-----------+
>> | metric | variable        | kernel              | value | perc_diff |
>> +--------+-----------------+---------------------+-------+-----------+
>> | gmean  | jank_percentage | EM_default          | 2.0   | 0.0%      |
>> | gmean  | jank_percentage | EM_modified_runtime | 1.3   | -35.33%   |
>> +--------+-----------------+---------------------+-------+-----------+
>>
>> 2. Avg frame render time duration
>> +--------+---------------------+---------------------+-------+-----------+
>> | metric | variable            | kernel              | value | perc_diff |
>> +--------+---------------------+---------------------+-------+-----------+
>> | gmean  | mean_frame_duration | EM_default          | 10.5  | 0.0%      |
>> | gmean  | mean_frame_duration | EM_modified_runtime | 9.6   | -8.52%    |
>> +--------+---------------------+---------------------+-------+-----------+
>>
>> 3. Max frame render time duration
>> +--------+--------------------+---------------------+-------+-----------+
>> | metric | variable           | kernel              | value | perc_diff |
>> +--------+--------------------+---------------------+-------+-----------+
>> | gmean  | max_frame_duration | EM_default          | 251.6 | 0.0%      |
>> | gmean  | max_frame_duration | EM_modified_runtime | 115.5 | -54.09%   |
>> +--------+--------------------+---------------------+-------+-----------+
>>
>> 4. OS overutilized state percentage (when EAS is not working)
>> +--------------+---------------------+------+------------+------------+
>> | metric       | wa_path             | time | total_time | percentage |
>> +--------------+---------------------+------+------------+------------+
>> | overutilized | EM_default          | 1.65 | 253.38     | 0.65       |
>> | overutilized | EM_modified_runtime | 1.4  | 277.5      | 0.51       |
>> +--------------+---------------------+------+------------+------------+
>>
>> 5. All CPUs (Little+Mid+Big) power values in mW
>> +------------+--------+---------------------+-------+-----------+
>> | channel    | metric | kernel              | value | perc_diff |
>> +------------+--------+---------------------+-------+-----------+
>> | CPU        | gmean  | EM_default          | 142.1 | 0.0%      |
>> | CPU        | gmean  | EM_modified_runtime | 131.8 | -7.27%    |
>> +------------+--------+---------------------+-------+-----------+
>
> How did you modify the EM here? Did you change both performance and power fields? How did you calculate the new ones?

It was just the power values modified on my pixel6: for Littles 1.6x, Mid 0.8x, Big 1.3x of their boot power. TBH I don't know the chip binning of that SoC, but I suspect it could be due to this fact. You can find more about the possible error range in chip binning power values in my comment to patch 22/23.

> Did you try to simulate any heating effect during the run if you're taking temperature into account to modify the power? What was the variation like and

Yes, I did that experiment and presented it at OSPM 2023, slide 13. There is a big change in the CPU power plot over time, due to the GPU heat. All detailed data is there. The big CPU power is ~18-20% higher when the 1-1.5W GPU is heating up the whole SoC.

> at what rate was the EM being updated in this case? I think Jankbench in

In this experiment the EM was only set once w/ the values mentioned above. It could be due to the chip lottery. I cannot say with 100% certainty for this phone.

> general wouldn't stress the SoC enough.

True, this test is not power heavy as it can be seen. It's more to show that the default EM after boot might not be the optimal one.

> It'd be insightful to look at frequency residencies between the two runs and power breakdown for each cluster if you have access to them. No worries if not!

I'm afraid you're asking for too much ;)

> My brain started to fail me somewhere around patch 15. I'll have another look some time later in the week but generally looks good to me. If I have any worries it is about how it can be used with the provided interfaces. Especially expectations about managing fast thermal changes at the level you're targeting.

No worries, thanks for the review! The fast thermal changes, which are linked to the CPU's workload, are not an issue here and I'm not worried about those. The side effect of the heat from other devices is the issue. Thus, that thermal driver which modifies the EM should be aware of the 'whole SoC' situation (like mainline IPA does, when it manages all devices in a single thermal zone).

Regards,
Lukasz
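To make the driver-side flow discussed above more concrete, here is a rough sketch of how a thermal-aware driver could push an adjusted power model into the EM at runtime. It is only an illustration: the API names and signatures (em_table_alloc(), em_perf_state_from_pd(), em_dev_compute_costs(), em_dev_update_perf_domain(), em_table_free(), the 'state[]' array) are assumed from this series and may differ in the final version, and my_leakage() is a made-up placeholder for the driver's own temperature-dependent static-power model.

#include <linux/energy_model.h>
#include <linux/rcupdate.h>

/* hypothetical driver helper: extra static power at this OPP and temperature,
 * in the same unit as the EM 'power' field */
static unsigned long my_leakage(struct em_perf_domain *pd, int opp, int temp_mc);

static int my_driver_update_em(struct device *cpu_dev,
			       struct em_perf_domain *pd, int temp_mc)
{
	struct em_perf_table *new_table;
	struct em_perf_state *old_state;
	int i, ret;

	new_table = em_table_alloc(pd);
	if (!new_table)
		return -ENOMEM;

	rcu_read_lock();
	old_state = em_perf_state_from_pd(pd);	/* current runtime table */
	for (i = 0; i < pd->nr_perf_states; i++) {
		new_table->state[i] = old_state[i];
		/* add the extra static power estimated for this temperature */
		new_table->state[i].power += my_leakage(pd, i, temp_mc);
	}
	rcu_read_unlock();

	/* recompute the 'cost' (efficiency) values used by EAS */
	ret = em_dev_compute_costs(cpu_dev, new_table->state,
				   pd->nr_perf_states);
	if (!ret)
		/* publish the new table; readers switch via the RCU pointer */
		ret = em_dev_update_perf_domain(cpu_dev, new_table);

	/* drop this function's reference; kref keeps the table alive while in use */
	em_table_free(new_table);

	return ret;
}

The Pixel6 experiment mentioned above (Littles 1.6x, Mid 0.8x, Big 1.3x of the boot power) would follow the same flow, just scaling the power field per cluster instead of adding a leakage term.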
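A tiny sketch of the em_cpu_energy() division-removal idea mentioned above - purely illustrative, not the code from this series: because the runtime table can be rewritten once the final CPU capacity is known (patches 14/23 and 15/23), a per-domain constant such as the CPU capacity can be folded into the stored per-OPP values when the table is (re)built, so the scheduler hot path becomes a multiply plus a shift instead of a multiply and a 64-bit division.

#include <linux/math64.h>

/* Before: a 64-bit division on every energy estimation (hot path) */
static inline u64 nrg_estimate_old(u64 cost, unsigned long sum_util,
				   unsigned long scale_cpu)
{
	return div64_u64(cost * sum_util, scale_cpu);
}

/* Table update time (cold path): fold the constant in once, fixed-point */
static inline u64 cost_prescale(u64 cost, unsigned long scale_cpu)
{
	return div64_u64(cost << 10, scale_cpu);
}

/* After: the hot path is a multiply and a shift */
static inline u64 nrg_estimate_new(u64 prescaled_cost, unsigned long sum_util)
{
	return (prescaled_cost * sum_util) >> 10;
}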
On 12/19/23 10:22, Lukasz Luba wrote:
> > One thing I'm not sure about is that in practice temperature of the SoC can vary a lot in a short period of time. What is the expectation here? I can see this useful in practice only if we average it over a window of time. Following it will be really hard. Big variations can happen in few ms scales.
>
> It's mostly for long running heavy workloads, which involve other devices than CPUs, e.g. GPU or ISP (Image Signal Processor). Those devices can heat up the SoC. In our game DrArm running on pixel6 the GPU uses 75-77% of the total power budget (starting from ~2.5W for the GPU + 1.3W for all CPUs). That 2.5W from the GPU is heating up the CPUs and mostly impacts the Big cores, which are made from High-Performance cells (thus leaking more). OverUtilization in the first 4-5min of gaming is ~4-9%, so EAS can work and save some power, if it has a good model. Later we have thermal throttling and OU goes to ~50%, but EAS still can work. If the model is more precise - i.e. adjusted for the rising leakage due to the temperature increase (caused by the GPU power) - then we can still make better use of that power budget and not waste it on leakage at higher OPPs.

I can understand the need. But looking at one specific case vs a generalized form is different. So IIUC the expectation is to track temperature variations over minutes by sources external to the CPU.

> > I didn't get how the new performance field is supposed to be controlled and modified by users. A driver interface doesn't seem suitable as there's no subsystem that knows the characteristic of the workload except userspace. In Android we do have contextual info about what the current top-app to enable modifying the capacities to match its characteristics.
>
> Well, in the latest public documentation (May 2023) for Cortex-X4 there are new Arm core features described: PDP and MPMM, which can change the 'performance' of the core in FW. Our SCMI kernel subsystem will get an interrupt, so the drivers can know about it. It could be used for recalculating the efficiency of the CPUs in the EM. When there is no hotplug and the long running app is still running, that FW policy would be reflected in the EM. It's just not done all in one step. Those patches will come later.

I think these features are some form of thermal throttling IIUC. I was asking about handling the EM accuracy issue using the runtime model. I was expecting some sysfs knobs. Do you see this also requiring a vendor specific driver to try to account for the EM inaccuracy issues we're seeing?

> Second, I have used that 'performance' field to finally get rid of this runtime division in em_cpu_energy() hot path - which was annoying me for a very long time. It wasn't possible to optimize that last operation there, because not all CPUs have booted and the final CPU capacity is not known when we register EMs. With this feature I can finally remove that heavy operation. You can see more in that patch 15/23.

Yep, it's a good addition :)

> > > 5. All CPUs (Little+Mid+Big) power values in mW
> > > +------------+--------+---------------------+-------+-----------+
> > > | channel    | metric | kernel              | value | perc_diff |
> > > +------------+--------+---------------------+-------+-----------+
> > > | CPU        | gmean  | EM_default          | 142.1 | 0.0%      |
> > > | CPU        | gmean  | EM_modified_runtime | 131.8 | -7.27%    |
> > > +------------+--------+---------------------+-------+-----------+
> >
> > How did you modify the EM here? Did you change both performance and power fields? How did you calculate the new ones?
>
> It was just the power values modified on my pixel6: for Littles 1.6x, Mid 0.8x, Big 1.3x of their boot power. TBH I don't know the chip binning of that SoC, but I suspect it could be due to this fact. You can find more about the possible error range in chip binning power values in my comment to patch 22/23.

Strange that just modifying the power had this impact. It could be related to a similar impact I've seen when increasing the migration margin for the littles. By making the cost higher there, it'd move the residency to other cores and potentially reduce running at higher freq on the littles.

> > Did you try to simulate any heating effect during the run if you're taking temperature into account to modify the power? What was the variation like and
>
> Yes, I did that experiment and presented it at OSPM 2023, slide 13. There is a big change in the CPU power plot over time, due to the GPU heat. All detailed data is there. The big CPU power is ~18-20% higher when the 1-1.5W GPU is heating up the whole SoC.

I meant during your experiment above.

> > at what rate was the EM being updated in this case? I think Jankbench in
>
> In this experiment the EM was only set once w/ the values mentioned above. It could be due to the chip lottery. I cannot say with 100% certainty for this phone.
>
> > general wouldn't stress the SoC enough.
>
> True, this test is not power heavy as it can be seen. It's more to show that the default EM after boot might not be the optimal one.

I wouldn't reach that conclusion for this particular case. But the underlying issues exist for sure.

> > It'd be insightful to look at frequency residencies between the two runs and power breakdown for each cluster if you have access to them. No worries if not!
>
> I'm afraid you're asking for too much ;)

It should be easy to get them. It's hard to know where the benefit is coming from otherwise. But as I said, no worries if not. If you have perfetto traces I can help take a look.

> > My brain started to fail me somewhere around patch 15. I'll have another look some time later in the week but generally looks good to me. If I have any worries it is about how it can be used with the provided interfaces. Especially expectations about managing fast thermal changes at the level you're targeting.
>
> No worries, thanks for the review! The fast thermal changes, which are linked to the CPU's workload, are not an issue here and I'm not worried about those. The side effect of the heat from other devices is the issue. Thus, that thermal driver which modifies the EM should be aware of the 'whole SoC' situation (like mainline IPA does, when it manages all devices in a single thermal zone).

I think in practice there will be challenges to generalize the thermal impact. But overall, from the EM accuracy point of view (for all the various reasons mentioned), we need this ability to help handle them in practice. Booting with a single hardcoded EM doesn't work.

Cheers

--
Qais Yousef