diff mbox

[2/7,v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list

Message ID a8b6d188-060c-9af9-177b-55059738e99c@arm.com
State Superseded
Headers show

Commit Message

Dietmar Eggemann Sept. 21, 2016, 10:14 a.m. UTC
Hi Vincent,

On 12/09/16 08:47, Vincent Guittot wrote:
> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that

> a child will always be called before its parent.

> 

> The hierarchical order in shares update list has been introduced by

> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

> 

> With the current implementation a child can be still put after its parent.

> 

> Lets take the example of

>        root

>         \

>          b

>          /\

>          c d*

>            |

>            e*

> 

> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list

> looks like: head -> c -> b -> root -> tail

> 

> The branch d -> e will be added the first time that they are enqueued,

> starting with e then d.

> 

> When e is added, its parents is not already on the list so e is put at the

> tail : head -> c -> b -> root -> e -> tail

> 

> Then, d is added at the head because its parent is already on the list:

> head -> d -> c -> b -> root -> e -> tail

> 

> e is not placed at the right position and will be called the last whereas

> it should be called at the beginning.

> 

> Because it follows the bottom-up enqueue sequence, we are sure that we

> will finished to add either a cfs_rq without parent or a cfs_rq with a parent

> that is already on the list. We can use this event to detect when we have

> finished to add a new branch. For the others, whose parents are not already

> added, we have to ensure that they will be added after their children that

> have just been inserted the steps before, and after any potential parents that

> are already in the list. The easiest way is to put the cfs_rq just after the

> last inserted one and to keep track of it untl the branch is fully added.


[...]

I tested it with a multi-level task group hierarchy and it does the
right thing for this testcase (task t runs alternately on Cpu0 and Cpu1
in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).

I wonder if this patch is related to the comment "TODO: fix up out-of-order
children on enqueue." in update_shares_cpu() of commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation")?

I guess in the meantime we lost the functionality to remove a cfs_rq from the
leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
owned by Cpu1 stay in the leaf_cfs_rq list.

Shouldn't we reintegrate it? Following patch goes on top of this patch:

-- >8 --

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Date: Tue, 20 Sep 2016 17:30:09 +0100
Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
 update_blocked_averages()

There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
list in case there are no se's enqueued on it. 

The functionality had been initially introduced by commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
Rewrite runnable load and utilization average tracking").

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
1.9.1

Comments

Vincent Guittot Sept. 21, 2016, 12:34 p.m. UTC | #1
Hi Dietmar,

On 21 September 2016 at 12:14, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> Hi Vincent,

>

> On 12/09/16 08:47, Vincent Guittot wrote:

>> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that

>> a child will always be called before its parent.

>>

>> The hierarchical order in shares update list has been introduced by

>> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

>>

>> With the current implementation a child can be still put after its parent.

>>

>> Lets take the example of

>>        root

>>         \

>>          b

>>          /\

>>          c d*

>>            |

>>            e*

>>

>> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list

>> looks like: head -> c -> b -> root -> tail

>>

>> The branch d -> e will be added the first time that they are enqueued,

>> starting with e then d.

>>

>> When e is added, its parents is not already on the list so e is put at the

>> tail : head -> c -> b -> root -> e -> tail

>>

>> Then, d is added at the head because its parent is already on the list:

>> head -> d -> c -> b -> root -> e -> tail

>>

>> e is not placed at the right position and will be called the last whereas

>> it should be called at the beginning.

>>

>> Because it follows the bottom-up enqueue sequence, we are sure that we

>> will finished to add either a cfs_rq without parent or a cfs_rq with a parent

>> that is already on the list. We can use this event to detect when we have

>> finished to add a new branch. For the others, whose parents are not already

>> added, we have to ensure that they will be added after their children that

>> have just been inserted the steps before, and after any potential parents that

>> are already in the list. The easiest way is to put the cfs_rq just after the

>> last inserted one and to keep track of it untl the branch is fully added.

>

> [...]

>

> I tested it with a multi-level task group hierarchy and it does the

> right thing for this testcase (task t runs alternately on Cpu0 and Cpu1

> in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).


Thanks for testing

>

> I wonder if this patch is related to the comment "TODO: fix up out-of-order

> children on enqueue." in update_shares_cpu() of commit 82958366cfea

> ("sched: Replace update_shares weight distribution with per-entity

> computation")?


I had not that comment in mind when i have done this patch only to
ensure that the propagation of load and utilization in children will
be done before testing their parents
That being said, I agree that the comment probably points to the same
ordering issue than what this patch solves

>

> I guess in the meantime we lost the functionality to remove a cfs_rq from the

> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates

> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)

> owned by Cpu1 stay in the leaf_cfs_rq list.

>

> Shouldn't we reintegrate it? Following patch goes on top of this patch:


I see one problem: Once a cfs_rq has been removed , it will not be
periodically updated anymore until the next enqueue. A sleeping task
is only attached but not enqueued when it moves into a cfs_rq so its
load is added to cfs_rq's load which have to be periodically
updated/decayed. So adding a cfs_rq to the list only during an enqueue
is not enough in this case.

Then, only the highest child level of task group will be removed
because cfs_rq->nr_running will be > 0 as soon as a child task group
is created and enqueued into a task group. Or you should use
cfs.h_nr_running instead i don't know all implications

Regards,
Vincent

>

> -- >8 --

>

> From: Dietmar Eggemann <dietmar.eggemann@arm.com>

> Date: Tue, 20 Sep 2016 17:30:09 +0100

> Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into

>  update_blocked_averages()

>

> There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq

> list in case there are no se's enqueued on it.

>

> The functionality had been initially introduced by commit 82958366cfea

> ("sched: Replace update_shares weight distribution with per-entity

> computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:

> Rewrite runnable load and utilization average tracking").

>

> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

> ---

>  kernel/sched/fair.c | 6 +++++-

>  1 file changed, 5 insertions(+), 1 deletion(-)

>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index dd530359ef84..951c83337e2b 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)

>                         update_tg_load_avg(cfs_rq, 0);

>

>                 /* Propagate pending load changes to the parent */

> -               if (cfs_rq->tg->se[cpu])

> +               if (cfs_rq->tg->se[cpu]) {

>                         update_load_avg(cfs_rq->tg->se[cpu], 0, 0);

> +

> +                       if (!cfs_rq->nr_running)

> +                               list_del_leaf_cfs_rq(cfs_rq);

> +               }

>         }

>         raw_spin_unlock_irqrestore(&rq->lock, flags);

>  }

> --

> 1.9.1
diff mbox

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd530359ef84..951c83337e2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6584,8 +6584,12 @@  static void update_blocked_averages(int cpu)
                        update_tg_load_avg(cfs_rq, 0); 
 
                /* Propagate pending load changes to the parent */
-               if (cfs_rq->tg->se[cpu])
+               if (cfs_rq->tg->se[cpu]) {
                        update_load_avg(cfs_rq->tg->se[cpu], 0, 0); 
+
+                       if (!cfs_rq->nr_running)
+                               list_del_leaf_cfs_rq(cfs_rq);
+               }
        }
        raw_spin_unlock_irqrestore(&rq->lock, flags);
 }