| Message ID | 20160211171012.GS11415@e106622-lin |
|---|---|
| State | New |
On 12/02/16 18:05, Peter Zijlstra wrote:
> On Thu, Feb 11, 2016 at 05:10:12PM +0000, Juri Lelli wrote:
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 6368f43..1eccecf 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
>
> > +static void swap_task_ac_bw(struct task_struct *p,
> > +                            struct rq *from,
> > +                            struct rq *to)
> > +{
> > +        unsigned long flags;
> > +
> > +        lockdep_assert_held(&p->pi_lock);
> > +        local_irq_save(flags);
> > +        double_rq_lock(from, to);
> > +        __dl_sub_ac(from, p->dl.dl_bw);
> > +        __dl_add_ac(to, p->dl.dl_bw);
> > +        double_rq_unlock(from, to);
> > +        local_irq_restore(flags);
> > +}
>
> > +static void migrate_task_rq_dl(struct task_struct *p)
> > +{
> > +        if (p->fallback_cpu != -1)
> > +                swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
> > +}
>
> This patch scares me.
>

Yeah, same here. However, I haven't yet found a different way to fix
this and wanted some help :).

> Now, my brain is having an awfully hard time trying to re-engage after
> flu, but this looks very wrong.
>
> So we call sched_class::migrate_task_rq() from set_task_cpu(), and we
> call set_task_cpu() while potentially holding rq::lock's (try
> push_dl_task() for kicks).
>
> Sure, you play horrible games with fallback_cpu, but those games are
> just that, horrible.
>

Right. I'm counting on fallback_cpu to avoid calling swap_task_ac_bw()
(and taking the rq locks) during push/pull migrations.

I was actually thinking that we could have a non-locked version of the
swap and call that from migrate_task_rq_dl on the push/pull paths. But
this is most probably even more horrible.

>
> So your initial patch migrates the bandwidth along when a runnable task
> gets moved about, this hack seems to be mostly about waking up. The
> 'normal' accounting is done on enqueue/dequeue, while here you use the
> migration hook.
>

The problem is that I don't do anything in enqueue/dequeue (apart from
when cpuset migrates us while still on_rq), and I don't think we want
to do anything there, since a task's dl_bw should remain in ac_bw when
it goes to sleep, etc. This is the static view of admitted bw. We want
to save/restore the admitted bw in the root_domain even when tasks are
sleeping/blocked.

> Having two separate means of accounting this also feels more fragile
> than one would want.
>
> Let me think a bit about this.
>

I was looking at sending out a v2 with this as an RFC. I guess it's
better if I wait :).

Thanks!

Best,

- Juri
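For illustration, the non-locked variant Juri alludes to might look
roughly like the sketch below. The name __swap_task_ac_bw() and its
placement are assumptions, not code from this thread; the idea is
simply that the push/pull paths already hold both rq locks (via
double_lock_balance()), so migrate_task_rq_dl() could use an unlocked
helper there, while the wakeup path keeps a locked wrapper.

/*
 * Illustrative sketch only, not from the posted series: an unlocked
 * helper for callers that already hold both rq locks, as the push/pull
 * paths do after double_lock_balance(). Names are made up here.
 */
static void __swap_task_ac_bw(struct task_struct *p,
                              struct rq *from,
                              struct rq *to)
{
        lockdep_assert_held(&from->lock);
        lockdep_assert_held(&to->lock);

        __dl_sub_ac(from, p->dl.dl_bw);
        __dl_add_ac(to, p->dl.dl_bw);
}

/* The posted swap_task_ac_bw() would then reduce to a locked wrapper: */
static void swap_task_ac_bw(struct task_struct *p,
                            struct rq *from,
                            struct rq *to)
{
        unsigned long flags;

        lockdep_assert_held(&p->pi_lock);
        local_irq_save(flags);
        double_rq_lock(from, to);
        __swap_task_ac_bw(p, from, to);
        double_rq_unlock(from, to);
        local_irq_restore(flags);
}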
Hi Peter,

On 24/02/16 20:17, Peter Zijlstra wrote:
> On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote:
> > Having two separate means of accounting this also feels more fragile
> > than one would want.
> >
> > Let me think a bit about this.
>
> I think there's a fundamental problem that makes the whole notion of
> per-rq accounting 'impossible'.
>
> On hot-unplug we only migrate runnable tasks, all blocked tasks remain
> on the dead cpu. This would very much include their bandwidth
> requirements.
>
> This means that between a hot-unplug and the moment that _all_ those
> blocked tasks have run at least once, the sum of online bandwidth
> doesn't match and we can get into admission trouble (same for GRUB,
> which can also use per-rq bw like this).
>
> The main problem is that there is no real way to find blocked tasks;
> currently the only way is to iterate _all_ tasks and filter on
> task_cpu().
>
> We could of course add a blocked tree/list for deadline tasks, to
> explicitly keep track of all these; this would allow migrating blocked
> tasks on hotplug and avoid the real ugly, I think. But I've not tried
> yet.
>

Argh, this makes a lot of sense to me. I actually pondered a tree/list
solution, but then decided to try the cumulative approach because it
looked nicer. It contains holes, though, I'm afraid.

As Luca already said, GRUB shouldn't have these problems though.

I'll try and see what introducing a list of blocked/throttled deadline
tasks means, also considering the interaction with cpusets and such.
Maybe it's simpler than it seems. I'm not sure this will come anytime
soon, unfortunately; I'm almost 100% on the sched-freq/schedutil
discussion these days.

Anyway, do you also think that what we want for solving the root domain
issue is something based on rq_online/offline and per-rq information?
Everything else I tried or thought of was broken or more horrible. :-/

Best,

- Juri
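As a rough illustration of the direction Peter suggests (explicitly not
something posted in this thread), a per-rq list of blocked -deadline
tasks might be shaped like the sketch below. Every name in it
(blocked_node, blocked_list, dl_add_blocked(), dl_del_blocked()) is
invented for the example.

/*
 * Illustrative sketch only: track blocked -deadline tasks per rq so
 * they, and their bandwidth, can be found on hot-unplug. All names
 * below are assumptions, not code from this series.
 */

/* include/linux/sched.h, in struct sched_dl_entity: */
        struct list_head        blocked_node;

/* kernel/sched/sched.h, in struct dl_rq: */
        struct list_head        blocked_list;

/* kernel/sched/deadline.c (blocked_node initialized at fork time): */
static void dl_add_blocked(struct rq *rq, struct task_struct *p)
{
        lockdep_assert_held(&rq->lock);
        list_add(&p->dl.blocked_node, &rq->dl.blocked_list);
}

static void dl_del_blocked(struct rq *rq, struct task_struct *p)
{
        lockdep_assert_held(&rq->lock);
        if (!list_empty(&p->dl.blocked_node))
                list_del_init(&p->dl.blocked_node);
}

/*
 * dequeue_task_dl() would call dl_add_blocked() when DEQUEUE_SLEEP is
 * set, enqueue_task_dl() would call dl_del_blocked(), and
 * rq_offline_dl() could then walk rq->dl.blocked_list to migrate the
 * remaining tasks together with their dl_bw.
 */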
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..c582f9d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -199,6 +199,7 @@ extern struct task_group root_task_group;
 	.policy		= SCHED_NORMAL,					\
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.nr_cpus_allowed= NR_CPUS,					\
+	.fallback_cpu	= -1,						\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
 	.restart_block	= {						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..a6fc95c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,6 +1401,7 @@ struct task_struct {
 	struct task_struct *last_wakee;
 
 	int wake_cpu;
+	int fallback_cpu;
 #endif
 	int on_rq;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7fb9246..4e4bc41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1442,7 +1442,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 				continue;
 			if (!cpu_active(dest_cpu))
 				continue;
-			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p))) {
+				p->fallback_cpu = dest_cpu;
 				return dest_cpu;
 		}
 	}
@@ -1490,6 +1491,7 @@ out:
 		}
 	}
 
+	p->fallback_cpu = dest_cpu;
 	return dest_cpu;
 }
 
@@ -1954,6 +1956,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
+		p->fallback_cpu = -1;
 	}
 
 #endif /* CONFIG_SMP */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6368f43..1eccecf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1043,6 +1043,21 @@ static void yield_task_dl(struct rq *rq)
 
 #ifdef CONFIG_SMP
 
+static void swap_task_ac_bw(struct task_struct *p,
+			    struct rq *from,
+			    struct rq *to)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&p->pi_lock);
+	local_irq_save(flags);
+	double_rq_lock(from, to);
+	__dl_sub_ac(from, p->dl.dl_bw);
+	__dl_add_ac(to, p->dl.dl_bw);
+	double_rq_unlock(from, to);
+	local_irq_restore(flags);
+}
+
 static int find_later_rq(struct task_struct *task);
 
 static int
@@ -1077,8 +1092,10 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 		if (target != -1 &&
 				(dl_time_before(p->dl.deadline,
 					cpu_rq(target)->dl.earliest_dl.curr) ||
-				(cpu_rq(target)->dl.dl_nr_running == 0)))
+				(cpu_rq(target)->dl.dl_nr_running == 0))) {
 			cpu = target;
+			swap_task_ac_bw(p, rq, cpu_rq(target));
+		}
 	}
 	rcu_read_unlock();
 
@@ -1807,6 +1824,12 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		switched_to_dl(rq, p);
 }
 
+static void migrate_task_rq_dl(struct task_struct *p)
+{
+	if (p->fallback_cpu != -1)
+		swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
+}
+
 const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
 	.enqueue_task		= enqueue_task_dl,
@@ -1820,6 +1843,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed	= set_cpus_allowed_dl,
 	.rq_online		= rq_online_dl,
 	.rq_offline		= rq_offline_dl,
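The hunks above use __dl_add_ac() and __dl_sub_ac(), which come from an
earlier patch in the series and are not shown here. Presumably they
just adjust a per-rq admitted-bandwidth counter; a minimal sketch of
what they might look like follows. The ac_bw field name and the exact
signatures are guesses inferred from the call sites in
swap_task_ac_bw(), not the actual helpers from the series.

/*
 * Guesswork sketch of the accounting helpers used by swap_task_ac_bw();
 * the real ones are introduced earlier in the series. 'ac_bw' and the
 * signatures below are assumptions.
 */

/* kernel/sched/sched.h, in struct dl_rq: */
	u64			ac_bw;	/* admitted bandwidth on this rq */

static inline void __dl_add_ac(struct rq *rq, u64 dl_bw)
{
	/* Callers hold rq->lock (or both rq locks via double_rq_lock()). */
	rq->dl.ac_bw += dl_bw;
}

static inline void __dl_sub_ac(struct rq *rq, u64 dl_bw)
{
	WARN_ON_ONCE(rq->dl.ac_bw < dl_bw);
	rq->dl.ac_bw -= dl_bw;
}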