Message ID | 20160210113258.GX11415@e106622-lin |
---|---|
State | New |
On 10/02/16 12:43, Luca Abeni wrote:
> Hi all,
>

Hi Luca,

> On Wed, 10 Feb 2016 11:32:58 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > @@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p, int policy,
> >  	if (dl_policy(policy) && !task_has_dl_policy(p) &&
> >  	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
> >  		__dl_add(dl_b, new_bw);
> > +		__dl_add_ac(task_rq(p), new_bw);
> >  		err = 0;
> >  	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
> >  		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> >  		__dl_clear(dl_b, p->dl.dl_bw);
> > +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
> >  		__dl_add(dl_b, new_bw);
> > +		__dl_add_ac(task_rq(p), new_bw);
> >  		err = 0;
> >  	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> >  		__dl_clear(dl_b, p->dl.dl_bw);
> > +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
>
> Instead of adding __dl_add_ac() and __dl_sub_ac() calls here, maybe
> they can be added in switched_to_dl() and switched_from_dl()?

That might work too, yes. I think there is value if we are able to move
all __dl_{add,sub}_ac() calls into deadline.c. Actually, we should
probably move __dl_{add,clear}() there as well, so that changes to the
rq and to the root_domain happen at the same time.

> I'll test this idea locally, and I'll send an updated patch if it
> works.

Thanks! Will wait for it :).

Best,

- Juri
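For illustration, a minimal sketch of what Luca's suggestion could look
like in kernel/sched/deadline.c, assuming the __dl_{add,sub}_ac()
helpers introduced by the patch quoted at the end of this thread; the
elided lines stand for the callbacks' existing logic, and this is not
the code that was eventually posted:

/*
 * Sketch only: do the per-rq accounting in the class-switch callbacks
 * rather than in dl_overflow(). Assumes the __dl_add_ac()/__dl_sub_ac()
 * helpers from the patch below.
 */
static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
	/* p is leaving SCHED_DEADLINE: release its admitted bandwidth. */
	__dl_sub_ac(rq, p->dl.dl_bw);

	/* ... existing switched_from_dl() logic (timer handling, pull) ... */
}

static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
	/* p passed admission control: account its bandwidth on this rq. */
	__dl_add_ac(rq, p->dl.dl_bw);

	/* ... existing switched_to_dl() logic (push/resched) ... */
}

Hooking the accounting here would keep every policy-switch path in one
file, which is what the refactoring argued for above.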
On 10/02/16 13:48, Luca Abeni wrote:
> Hi,
>
> On Wed, 10 Feb 2016 11:32:58 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001
> > From: Juri Lelli <juri.lelli@arm.com>
> > Date: Sat, 6 Feb 2016 12:41:09 +0000
> > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted
> >  bandwidth
> >
> > Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks
> > that passed admission control at root_domain level only. This creates
> > problems when such data structure(s) are destroyed, when we
> > reconfigure scheduling domains for example.
> >
> > This is part one of two changes required to fix the problem. In this
> > patch we add per-rq tracking of admitted bandwidth. Tasks bring with
> > them their bandwidth contribution when they enter the system and are
> > enqueued for the first time. Contributions are then moved around when
> > migrations happen and removed when tasks die.
>
> I think this patch actually does two different things (addressing two
> separate problems):
>  1) it introduces the tracking of per-rq utilization (used in your
>     next patch to address the root domain issues)
>  2) it fixes a bug in the current utilization tracking mechanism.
>     Currently, a task doing
>	while(1) {
>		switch to SCHED_DEADLINE
>		switch to SCHED_OTHER
>	}
>     brings dl_b->total_bw below 0. Thanks to Juri for showing me this
>     problem (and how to reproduce it) in a private email.
>     This happens because when the task switches back from
>     SCHED_DEADLINE to SCHED_OTHER, switched_from_dl() does not clear
>     its deadline parameters (they will be cleared by the deadline
>     timer when it fires). But dl_overflow() removes its utilization
>     from dl_b->total_bw. When the task switches back to
>     SCHED_DEADLINE, the "if (new_bw == p->dl.dl_bw)" check in
>     dl_overflow() prevents __dl_add() from being called, and so when
>     the task switches back to SCHED_OTHER dl_b->total_bw becomes
>     negative.

Yep. That is also the reason why I think that, whatever refactoring we
are going to do, we should keep the per-rq and root_domain accounting
done together.

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9503d59..0ee0ec2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, int policy,
> >  	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> >  	int cpus, err = -1;
> >
> > -	if (new_bw == p->dl.dl_bw)
> > +	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
> >  		return 0;
>
> This hunk actually fixes issue 2) mentioned above, so I think it
> should be committed in a short time (independently from the rest of
> the patch). And maybe it is a good candidate for backporting to stable
> kernels?

Yes, this is a sensible fix per se. I can split it out and send it
separately.

Best,

- Juri
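A hedged userspace sketch of the reproducer described above: flipping a
task between SCHED_DEADLINE and SCHED_OTHER in a loop. It must run as
root, assumes kernel headers that define __NR_sched_setattr (Linux >=
3.14), and the 10ms/100ms reservation parameters are arbitrary
examples:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

/* glibc does not wrap sched_setattr(); define the attribute layout and
 * a raw syscall wrapper, as documented in sched_setattr(2). */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr)
{
	return syscall(__NR_sched_setattr, pid, attr, 0);
}

int main(void)
{
	struct sched_attr dl = {
		.size           = sizeof(dl),
		.sched_policy   = 6,			/* SCHED_DEADLINE */
		.sched_runtime  = 10 * 1000 * 1000,	/* 10 ms  */
		.sched_deadline = 100 * 1000 * 1000,	/* 100 ms */
		.sched_period   = 100 * 1000 * 1000,	/* 100 ms */
	};
	struct sched_attr other = {
		.size         = sizeof(other),
		.sched_policy = 0,			/* SCHED_OTHER */
	};

	for (;;) {
		/* DL -> OTHER removes dl_bw from total_bw; on a buggy
		 * kernel the stale p->dl.dl_bw makes the next DL
		 * admission a no-op, so the following switch back to
		 * OTHER subtracts the bandwidth a second time. */
		if (sched_setattr(0, &dl) < 0) {
			perror("sched_setattr(DEADLINE)");
			return 1;
		}
		if (sched_setattr(0, &other) < 0) {
			perror("sched_setattr(OTHER)");
			return 1;
		}
	}
}

On a kernel without the task_has_dl_policy() check this loop drives
dl_b->total_bw negative after a few iterations; with the fix, the
second admission is no longer short-circuited.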
On 10/02/16 09:37, Steven Rostedt wrote:
> On Wed, 10 Feb 2016 11:32:58 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
>
> > Hi,
> >
> > I've updated this patch since, with a bit more testing and talking
> > with Luca in private, I realized that the previous version didn't
> > manage switching back and forth from SCHED_DEADLINE correctly.
> > Thanks a lot Luca for your feedback (even if not visible on the
> > list).
> >
> > I updated the testing branch accordingly and added a test to my
> > suite that stresses switch-in/switch-out.
> >
> > I don't particularly like the fact that we break the scheduling
> > classes abstraction in __dl_overflow(), so I think a little bit of
> > refactoring is still needed. But that can also happen afterwards, if
> > we fix the problem with root domains.
> >
> > Best,
> >
> > - Juri
> >
> > --->8---
> >
> > From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001
> > From: Juri Lelli <juri.lelli@arm.com>
> > Date: Sat, 6 Feb 2016 12:41:09 +0000
> > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
>
> I applied this patch and patch 2 and hit this:
>
> [ 2298.134284] ------------[ cut here ]------------
> [ 2298.138933] WARNING: CPU: 4 PID: 0 at /home/rostedt/work/git/linux-trace.git/kernel/sched/sched.h:735 task_dead_dl+0xc5/0xd0()
> [ 2298.150350] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc bluetooth lockd grace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal snd_seq vhost_net snd_seq_device tun vhost macvtap macvlan coretemp iTCO_wdt snd_pcm hp_wmi rfkill kvm_intel sparse_keymap iTCO_vendor_support snd_timer snd acpi_cpufreq kvm i2c_i801 mei_me mei soundcore lpc_ich mfd_core irqbypass wmi serio_raw uinput i915 i2c_algo_bit e1000e drm_kms_helper crc32_pclmul ptp crc32c_intel drm pps_core i2c_core video sunrpc
> [ 2298.207495] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.5.0-rc1-test+ #204
> [ 2298.214392] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
> [ 2298.223371]  ffffffff81abc680 ffff880119433d68 ffffffff81411c3f 0000000000000000
> [ 2298.230904]  ffffffff81abc680 ffff880119433da0 ffffffff810acf66 ffff88011eb16f40
> [ 2298.238435]  ffff88001ee16200 ffffffff81fd4a00 00000000000aaaaa 0000000000000001
> [ 2298.245958] Call Trace:
> [ 2298.248431]  [<ffffffff81411c3f>] dump_stack+0x50/0xb1
> [ 2298.253597]  [<ffffffff810acf66>] warn_slowpath_common+0x86/0xc0
> [ 2298.259627]  [<ffffffff810ad05a>] warn_slowpath_null+0x1a/0x20
> [ 2298.265490]  [<ffffffff810f92a5>] task_dead_dl+0xc5/0xd0
> [ 2298.270828]  [<ffffffff810d833f>] finish_task_switch+0x16f/0x310
> [ 2298.276871]  [<ffffffff810fa7f3>] ? pick_next_task_dl+0xb3/0x250
> [ 2298.282906]  [<ffffffff817f07a3>] __schedule+0x3d3/0x9e0
> [ 2298.288252]  [<ffffffff817f1001>] schedule+0x41/0xc0
> [ 2298.293242]  [<ffffffff817f12c8>] schedule_preempt_disabled+0x18/0x30
> [ 2298.299712]  [<ffffffff810fc974>] cpu_startup_entry+0x74/0x4e0
> [ 2298.305573]  [<ffffffff8105d16f>] start_secondary+0x2bf/0x330
> [ 2298.311347] ---[ end trace 732d16efabe456f1 ]---
>
> It's the warning you added in __dl_sub_ac().

OK. There are still holes where we fail to properly update the per-rq
bw. It seems (by running your test) that we fail to move the per-rq bw
when we move the root_domain bw due to css_set_move_task(). So, the
final task_dead_dl() tries to remove bw from where there isn't any.

I'm trying to see how we can close this hole.

Thanks,

- Juri
On 10/02/16 16:27, Juri Lelli wrote:
> On 10/02/16 09:37, Steven Rostedt wrote:
> > On Wed, 10 Feb 2016 11:32:58 +0000
> > Juri Lelli <juri.lelli@arm.com> wrote:
>
> [...]
>
> > I applied this patch and patch 2 and hit this:
>
> [...]
>
> > It's the warning you added in __dl_sub_ac().
>
> OK. There are still holes where we fail to properly update the per-rq
> bw. It seems (by running your test) that we fail to move the per-rq bw
> when we move the root_domain bw due to css_set_move_task(). So, the
> final task_dead_dl() tries to remove bw from where there isn't any.
>
> I'm trying to see how we can close this hole.

So, just to give an update from yesterday (kind of tricky, this one :/).
I think we still have (at least) two problems:

 - select_task_rq_dl, if we select a different target
 - select_task_rq might make use of select_fallback_rq, if cpus_allowed
   changed after the task went to sleep

The second case is what creates the problem here, as we don't update the
ac_bw of task_rq(p) and of the fallback cpu. I was thinking we might do
so, maybe adding fallback_cpu to task_struct, from migrate_task_rq_dl()
(which has to be added, yes), but I fear that we would have to hold both
rq locks :/.

Luca, did you already face this problem (if I got it right) and think of
a way to fix it? I'll go back and stare a bit more at those paths.

Best,

- Juri
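A rough sketch of the migrate_task_rq_dl() idea from the mail above;
the callback did not exist for SCHED_DEADLINE at the time, so both its
presence and its signature here are assumptions, and it deliberately
leaves the locking problem unsolved:

/*
 * Sketch only: a hypothetical migrate_task_rq_dl() callback that moves
 * the admitted bandwidth when the wakeup path picked a CPU other than
 * task_cpu(p). Hooking in here would cover both the select_task_rq_dl()
 * and the select_fallback_rq() cases, because set_task_cpu() funnels
 * both. Unsolved caveat, as per the mail above: neither rq lock is
 * necessarily held on this path, so the two updates below could race
 * with a concurrent root_domain save/restore.
 */
static void migrate_task_rq_dl(struct task_struct *p, int new_cpu)
{
	__dl_sub_ac(task_rq(p), p->dl.dl_bw);
	__dl_add_ac(cpu_rq(new_cpu), p->dl.dl_bw);
}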
On 11/02/16 13:22, Luca Abeni wrote:
> Hi Juri,
>
> On Thu, 11 Feb 2016 12:12:57 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > I think we still have (at least) two problems:
> >
> >  - select_task_rq_dl, if we select a different target
> >  - select_task_rq might make use of select_fallback_rq, if
> >    cpus_allowed changed after the task went to sleep
> >
> > The second case is what creates the problem here, as we don't update
> > the ac_bw of task_rq(p) and of the fallback cpu. I was thinking we
> > might do so, maybe adding fallback_cpu to task_struct, from
> > migrate_task_rq_dl() (which has to be added, yes), but I fear that
> > we would have to hold both rq locks :/.
> >
> > Luca, did you already face this problem (if I got it right) and
> > think of a way to fix it? I'll go back and stare a bit more at those
> > paths.
> In my patch I took care of the first case (modifying
> select_task_rq_dl() to move the utilization from the "old rq" to the
> "new rq"), but I never managed to trigger select_fallback_rq() in my
> tests, so I overlooked that case.

Right, I was thinking to do the same. And you did that after grabbing
both locks, right?
On 11/02/16 13:40, Luca Abeni wrote:
> On Thu, 11 Feb 2016 12:27:54 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
>
> > On 11/02/16 13:22, Luca Abeni wrote:
> > > Hi Juri,
> > >
> > > On Thu, 11 Feb 2016 12:12:57 +0000
> > > Juri Lelli <juri.lelli@arm.com> wrote:
> > > [...]
> > > > I think we still have (at least) two problems:
> > > >
> > > >  - select_task_rq_dl, if we select a different target
> > > >  - select_task_rq might make use of select_fallback_rq, if
> > > >    cpus_allowed changed after the task went to sleep
> > > >
> > > > The second case is what creates the problem here, as we don't
> > > > update the ac_bw of task_rq(p) and of the fallback cpu. I was
> > > > thinking we might do so, maybe adding fallback_cpu to
> > > > task_struct, from migrate_task_rq_dl() (which has to be added,
> > > > yes), but I fear that we would have to hold both rq locks :/.
> > > >
> > > > Luca, did you already face this problem (if I got it right) and
> > > > think of a way to fix it? I'll go back and stare a bit more at
> > > > those paths.
> > > In my patch I took care of the first case (modifying
> > > select_task_rq_dl() to move the utilization from the "old rq" to
> > > the "new rq"), but I never managed to trigger select_fallback_rq()
> > > in my tests, so I overlooked that case.
> >
> > Right, I was thinking to do the same. And you did that after
> > grabbing both locks, right?
> Not sure if I did everything correctly, but my code in
> select_task_rq_dl() currently looks like this (you can obviously
> ignore the "migrate_active" and "*_running_bw()" parts, and focus on
> the "*_rq_bw()" stuff):
> [...]
>	if (rq != cpu_rq(cpu)) {
>		int migrate_active;
>
>		raw_spin_lock(&rq->lock);
>		migrate_active = hrtimer_active(&p->dl.inactive_timer);
>		if (migrate_active) {
>			hrtimer_try_to_cancel(&p->dl.inactive_timer);
>			sub_running_bw(&p->dl, &rq->dl);
>		}
>		sub_rq_bw(&p->dl, &rq->dl);
>		raw_spin_unlock(&rq->lock);
>		rq = cpu_rq(cpu);

Can't something happen here? My problem is that I use per-rq bw
tracking to save/restore root_domain state. So, I fear that a
root_domain update can happen while we are in the middle of moving bw
from one cpu to another.

>		raw_spin_lock(&rq->lock);
>		add_rq_bw(&p->dl, &rq->dl);
>		if (migrate_active)
>			add_running_bw(&p->dl, &rq->dl);
>		raw_spin_unlock(&rq->lock);
>	}
> [...]
>
> lockdep is not screaming, and I am not able to trigger any race
> condition or strange behaviour (I am currently at more than 24h of
> continuous stress-testing, but maybe my testcase is not so good at
> finding races here :)

Thanks for sharing what you have!
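One conceivable way to close the window Juri worries about is to hold
both rq locks across the move. The helper below is only a sketch with a
hypothetical name, not code from either patch set, and it assumes the
caller holds neither rq lock:

/*
 * Sketch only: move a task's admitted bandwidth between runqueues with
 * both rq locks held, so a concurrent root_domain save/restore can
 * never observe the bandwidth on neither (or on both) runqueues.
 * double_rq_lock() acquires the two locks in a fixed order to avoid
 * ABBA deadlocks.
 */
static void move_ac_bw(struct task_struct *p, struct rq *src_rq,
		       struct rq *dst_rq)
{
	double_rq_lock(src_rq, dst_rq);
	__dl_sub_ac(src_rq, p->dl.dl_bw);
	__dl_add_ac(dst_rq, p->dl.dl_bw);
	double_rq_unlock(src_rq, dst_rq);
}

The trade-off against Luca's unlock/relock sequence above is extra lock
contention on the destination rq in exchange for making the move atomic
with respect to root_domain updates.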
On 23/02/16 16:48, Peter Zijlstra wrote:
> On Wed, Feb 10, 2016 at 01:42:40PM +0000, Juri Lelli wrote:
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 9503d59..0ee0ec2 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, int policy,
> > > >  	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> > > >  	int cpus, err = -1;
> > > >
> > > > -	if (new_bw == p->dl.dl_bw)
> > > > +	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
> > > >  		return 0;
> > >
> > > This hunk actually fixes issue 2) mentioned above, so I think it
> > > should be committed in a short time (independently from the rest
> > > of the patch). And maybe it is a good candidate for backporting to
> > > stable kernels?
> >
> > Yes, this is a sensible fix per se. I can split it out and send it
> > separately.
>
> Did you ever send that?

No, I didn't. Will do ASAP.

Thanks,

- Juri
Hi,

On 24/02/16 14:53, luca abeni wrote:
> On Tue, 23 Feb 2016 16:42:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> > > switched_to_dl() can be used instead
> >
> > This seems unrelated to the other patches, and looks like a nice
> > cleanup.
>
> Ok; I'll rebase the patch on master and I'll run some more serious
> tests (Juri, your tests repository is available on github, right? Can
> I assume that if the patch passes your tests then it is ok?).
> If everything goes well, I'll submit the patch.

Yes, the tests reside here: https://github.com/jlelli/tests. They
should give you some confidence that things are not completely broken,
but of course they might still be broken and you would not notice by
running such tests. :-)

Please run the PI-related ones as well; they might be important in this
case. Anyway, I'd say that, once you are fine with it, you can submit
the patch and then we have a second look at it.

Thanks,

- Juri
Hi Luca,

On 03/03/16 10:03, Luca Abeni wrote:
> On Thu, 25 Feb 2016 09:46:55 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
>
> > Hi,
> >
> > On 24/02/16 14:53, luca abeni wrote:
> > > On Tue, 23 Feb 2016 16:42:49 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> > > > > switched_to_dl() can be used instead
> > > >
> > > > This seems unrelated to the other patches, and looks like a nice
> > > > cleanup.
> > >
> > > Ok; I'll rebase the patch on master and I'll run some more serious
> > > tests (Juri, your tests repository is available on github, right?
> > > Can I assume that if the patch passes your tests then it is ok?).
> > > If everything goes well, I'll submit the patch.
> >
> > Yes, the tests reside here: https://github.com/jlelli/tests. They
> > should give you some confidence that things are not completely
> > broken, but of course they might still be broken and you would not
> > notice by running such tests. :-)
> I am trying these tests, but... Some scripts use "schedtool"; where
> can I find a proper version of it (supporting SCHED_DEADLINE)?
> I tried https://github.com/scheduler-tools/schedtool-dl but it does
> not seem to work correctly...

That's the one that I use, and I'm not seeing any problems with it.
I'll send you the binary in private.

Best,

- Juri
Hi Steve,

On 03/03/16 09:23, Steven Rostedt wrote:
> On Thu, 3 Mar 2016 09:28:01 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
>
> > That's the one that I use, and I'm not seeing any problems with it.
> > I'll send you the binary in private.
>
> That's the one I use too. BTW, Juri, do you plan on submitting patches
> to schedtool upstream?

Good point, I should. But I don't have any plans ATM :-/. OTOH, if
anyone else wants to do that, I'll be more than happy. :-)

Best,

- Juri
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9503d59..0ee0ec2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
 	int cpus, err = -1;
 
-	if (new_bw == p->dl.dl_bw)
+	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
 		return 0;
 
 	/*
@@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p, int policy,
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		__dl_add(dl_b, new_bw);
+		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
+		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cd64c97..8ac0fee 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -83,6 +83,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+	dl_rq->ac_bw = 0;
 }
 
 #ifdef CONFIG_SMP
@@ -278,8 +279,10 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	 * By now the task is replenished and enqueued; migrate it.
 	 */
 	deactivate_task(rq, p, 0);
+	__dl_sub_ac(rq, p->dl.dl_bw);
 	set_task_cpu(p, later_rq->cpu);
 	activate_task(later_rq, p, 0);
+	__dl_add_ac(later_rq, p->dl.dl_bw);
 
 	if (!fallback)
 		resched_curr(later_rq);
@@ -597,7 +600,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 
 	/*
 	 * The task might have changed its scheduling policy to something
-	 * different than SCHED_DEADLINE (through switched_fromd_dl()).
+	 * different than SCHED_DEADLINE (through switched_from_dl()).
 	 */
 	if (!dl_task(p)) {
 		__dl_clear_params(p);
@@ -955,6 +958,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		__dl_add_ac(rq, p->dl.dl_bw);
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -980,6 +986,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		__dl_sub_ac(rq, p->dl.dl_bw);
 }
 
 /*
@@ -1219,6 +1227,8 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+	__dl_sub_ac(task_rq(p), p->dl.dl_bw);
+
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
 	 */
@@ -1511,8 +1521,10 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	__dl_sub_ac(rq, next_task->dl.dl_bw);
 	set_task_cpu(next_task, later_rq->cpu);
 	activate_task(later_rq, next_task, 0);
+	__dl_add_ac(later_rq, next_task->dl.dl_bw);
 	ret = 1;
 
 	resched_curr(later_rq);
@@ -1599,8 +1611,10 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			__dl_sub_ac(src_rq, p->dl.dl_bw);
 			set_task_cpu(p, this_cpu);
 			activate_task(this_rq, p, 0);
+			__dl_add_ac(this_rq, p->dl.dl_bw);
 			dmin = p->dl.deadline;
 
 			/*
 			 * Is there any other task even earlier?
 			 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..242907f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -519,6 +519,14 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+
+	/*
+	 * ac_bw keeps track of per rq admitted bandwidth. It only changes
+	 * when a new task is admitted, it dies, it changes scheduling policy
+	 * or is migrated to another rq. It is used to correctly save/restore
+	 * total_bw on root_domain changes.
+	 */
+	u64 ac_bw;
 };
 
 #ifdef CONFIG_SMP
@@ -720,6 +728,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+static inline
+void __dl_sub_ac(struct rq *rq, u64 tsk_bw)
+{
+	WARN_ON(rq->dl.ac_bw < tsk_bw);
+
+	rq->dl.ac_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add_ac(struct rq *rq, u64 tsk_bw)
+{
+	rq->dl.ac_bw += tsk_bw;
+}
+
 static inline u64 __rq_clock_broken(struct rq *rq)
 {
 	return READ_ONCE(rq->clock);
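To make the intent of ac_bw concrete, here is a hedged sketch of how
the follow-up patch (2/2, not shown in this thread) could repopulate a
freshly rebuilt root_domain from the per-rq values; restore_dl_bw() is
a hypothetical name, not code from the posted series:

/*
 * Sketch only: rebuild rd->dl_bw.total_bw from the per-rq admitted
 * bandwidth after the root_domain has been recreated (e.g. on sched
 * domain reconfiguration). Relies on ac_bw being kept up to date by
 * the hooks in the patch above.
 */
static void restore_dl_bw(struct root_domain *rd)
{
	int cpu;

	raw_spin_lock(&rd->dl_bw.lock);
	for_each_cpu(cpu, rd->span) {
		struct rq *rq = cpu_rq(cpu);

		/* Each rq contributes the bandwidth admitted on it. */
		__dl_add(&rd->dl_bw, rq->dl.ac_bw);
	}
	raw_spin_unlock(&rd->dl_bw.lock);
}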