Message ID | 1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org |
---|---|
Series | sched/deadline: fix cpusets bandwidth accounting |
Hi Mathieu,

On Wed, 16 Aug 2017 15:20:36 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> operations. When CPUhotplug and some CPUset manipulation take place root
> domains are destroyed and new ones created, losing at the same time DL
> accounting pertaining to utilisation.

Thanks for looking at this longstanding issue! I am just back from
vacations; in the next days I'll try your patches.
Do you have some kind of scripts for reproducing the issue
automatically? (I see that in the original email Steven described how
to reproduce it manually; I just wonder if anyone already scripted the
test).

> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
> rq_offline() methods, something that highlighted a problem with sleeping
> DL tasks. The email thread that followed envisioned creating a list of
> sleeping tasks to circle through when recomputing DL accounting.
>
> In this set the problem is addressed by relying on existing list of tasks
> (sleeping or not) already maintained by CPUsets. When CPUset or
> CPUhotplug operations have completed we circle through the list of tasks
> maintained by each CPUset looking for DL tasks. When a DL task is found
> its utilisation is added to the root domain it pertains to by way of its
> runqueue.
>
> The advantage of proceeding this way is that recomputing of DL accounting
> is done the same way for both active and inactive tasks, along with
> guaranteeing that DL accounting for tasks end up in the correct root
> domain regardless of the CPUset topology. The disadvantage is that
> circling through all the tasks in a CPUset can be time consuming. The
> counter argument is that both CPUset and CPUhotplug operations are time
> consuming in the first place.

I do not know the cpuset code too much, but I agree that your approach
looks better than creating an additional list for blocked deadline
tasks.

> OPEN ISSUE:
>
> Regardless of how we proceed (using existing CPUset list or new ones) we
> need to deal with DL tasks that span more than one root domain, something
> that will typically happen after a CPUset operation. For example, if we
> split the number of available CPUs on a system in two CPUsets and then turn
> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
> parent CPUset will end up spanning two root domains.
>
> One way to deal with this is to prevent CPUset operations from happening
> when such condition is detected, as enacted in this set.

I think this is the simplest (if not only?) solution if we want to use
gEDF in each root domain.

> Although simple
> this approach feels brittle and akin to a "whack-a-mole" game. A better
> and more reliable approach would be to teach the DL scheduler to deal with
> tasks that span multiple root domains, a serious and substantial
> undertaking.
>
> I am sending this as a starting point for discussion. I would be grateful
> if you could take the time to comment on the approach and most importantly
> provide input on how to deal with the open issue underlined above.

I suspect that if we want to guarantee bounded tardiness then we have to
go for a solution similar to the one suggested by Tommaso some time ago
(if I remember well):

if we want to create some "second level cpusets" inside a "parent
cpuset", allowing deadline tasks to be placed inside both the "parent
cpuset" and the "second level cpusets", then we have to subtract the
"second level cpusets" maximum utilizations from the "parent cpuset"
utilization.

I am not sure how difficult it can be to implement this...

If, instead, we want to guarantee the respect of all the deadlines,
then we need to have a look at Brandenburg's paper on arbitrary
affinities:
https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf

Thanks,
Luca
On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> Hi Mathieu,

Good day to you,

>
> On Wed, 16 Aug 2017 15:20:36 -0600
> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
>
>> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> operations. When CPUhotplug and some CPUset manipulation take place root
>> domains are destroyed and new ones created, losing at the same time DL
>> accounting pertaining to utilisation.
>
> Thanks for looking at this longstanding issue! I am just back from
> vacations; in the next days I'll try your patches.
> Do you have some kind of scripts for reproducing the issue
> automatically? (I see that in the original email Steven described how
> to reproduce it manually; I just wonder if anyone already scripted the
> test).

I didn't bother scripting it since it is so easy to do.  I'm eager to
see how things work out on your end.

> [...]
>
>> OPEN ISSUE:
>>
>> Regardless of how we proceed (using existing CPUset list or new ones) we
>> need to deal with DL tasks that span more than one root domain, something
>> that will typically happen after a CPUset operation. For example, if we
>> split the number of available CPUs on a system in two CPUsets and then turn
>> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> parent CPUset will end up spanning two root domains.
>>
>> One way to deal with this is to prevent CPUset operations from happening
>> when such condition is detected, as enacted in this set.
>
> I think this is the simplest (if not only?) solution if we want to use
> gEDF in each root domain.

Global Earliest Deadline First?  Is my interpretation correct?

> [...]
>
> I suspect that if we want to guarantee bounded tardiness then we have to
> go for a solution similar to the one suggested by Tommaso some time ago
> (if I remember well):
>
> if we want to create some "second level cpusets" inside a "parent
> cpuset", allowing deadline tasks to be placed inside both the "parent
> cpuset" and the "second level cpusets", then we have to subtract the
> "second level cpusets" maximum utilizations from the "parent cpuset"
> utilization.
>
> I am not sure how difficult it can be to implement this...

Humm... I am missing some context here.  Nonetheless the approach I
was contemplating was to repeat the current mathematics for all the
root domains accessible from a task's p->cpus_allowed mask.  As such
we'd have the same acceptance test but repeated for more than one root
domain.  Doing that can take time, but the real problem I see is
related to the current DL code.  It is geared around a single root
domain and changing that means meddling in a lot of places.  I had a
prototype that was beginning to address that but decided to gather
people's opinion before getting in too deep.

>
> If, instead, we want to guarantee the respect of all the deadlines,
> then we need to have a look at Brandenburg's paper on arbitrary
> affinities:
> https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf
>

Ouch, that's an extended read...

>
> Thanks,
> Luca
On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

[...]
> > Do you have some kind of scripts for reproducing the issue
> > automatically? (I see that in the original email Steven described how
> > to reproduce it manually; I just wonder if anyone already scripted the
> > test).
>
> I didn't bother scripting it since it is so easy to do.  I'm eager to
> see how things work out on your end.

Ok, so I'll try to reproduce the issue manually as described in Steven's
original email; I'll run some tests as soon as I finish with some stuff
that accumulated during vacations.

[...]
> >> One way to deal with this is to prevent CPUset operations from happening
> >> when such condition is detected, as enacted in this set.
> >
> > I think this is the simplest (if not only?) solution if we want to use
> > gEDF in each root domain.
>
> Global Earliest Deadline First?  Is my interpretation correct?

Right. As far as I understand, the original SCHED_DEADLINE design is to
partition the CPUs in disjoint sets, and then use global EDF scheduling
on each one of those sets (this guarantees bounded tardiness, and if
you run some additional admission tests in user space you can also
guarantee the hard respect of every deadline).

[...]
> > if we want to create some "second level cpusets" inside a "parent
> > cpuset", allowing deadline tasks to be placed inside both the "parent
> > cpuset" and the "second level cpusets", then we have to subtract the
> > "second level cpusets" maximum utilizations from the "parent cpuset"
> > utilization.
> >
> > I am not sure how difficult it can be to implement this...
>
> Humm... I am missing some context here.

Or maybe I misunderstood the issue you were seeing (I am no expert on
cpusets). Is it related to hierarchies of cpusets (with one cpuset
contained inside another one)?
Can you describe how to reproduce the problematic situation?

> Nonetheless the approach I
> was contemplating was to repeat the current mathematics for all the
> root domains accessible from a task's p->cpus_allowed mask.

I think in the original SCHED_DEADLINE design there should be only one
root domain compatible with the task's affinity... If this does not
happen, I suspect it is a bug (Juri, can you confirm?).

My understanding is that with SCHED_DEADLINE cpusets should be used to
partition the system's CPUs in disjoint sets (and I think there is one
root domain for each one of those disjoint sets). And the task affinity
mask should correspond with the CPUs composing the set in which the
task is executing.

> As such we'd
> have the same acceptance test but repeated for more than one root
> domain.  Doing that can take time, but the real problem I see is
> related to the current DL code.  It is geared around a single root
> domain and changing that means meddling in a lot of places.  I had a
> prototype that was beginning to address that but decided to gather
> people's opinion before getting in too deep.

I still do not fully understand this (I got the impression that this is
related to hierarchies of cpusets, but I am not sure if this
understanding is correct). Maybe an example would help me to understand.

Thanks,
Luca
Hi,

On 24/08/17 09:53, Luca Abeni wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
[...]
> > I didn't bother scripting it since it is so easy to do.  I'm eager to
> > see how things work out on your end.
>
> Ok, so I'll try to reproduce the issue manually as described in Steven's
> original email; I'll run some tests as soon as I finish with some stuff
> that accumulated during vacations.
>

I have to apologize, as I suspect I won't have much time to properly
review this set before LPC. :(

I'll try my best to have a look though.

[...]

> > Nonetheless the approach I
> > was contemplating was to repeat the current mathematics for all the
> > root domains accessible from a task's p->cpus_allowed mask.
>
> I think in the original SCHED_DEADLINE design there should be only one
> root domain compatible with the task's affinity... If this does not
> happen, I suspect it is a bug (Juri, can you confirm?).
>
> My understanding is that with SCHED_DEADLINE cpusets should be used to
> partition the system's CPUs in disjoint sets (and I think there is one
> root domain for each one of those disjoint sets). And the task affinity
> mask should correspond with the CPUs composing the set in which the
> task is executing.
>

Correct. No overlapping cpusets are allowed, and a task's affinity can't
be restricted to a subset of the cpuset's root domain cpus.

[...]

Thanks,

- Juri
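The two constraints confirmed above (cpusets do not overlap, and a DL task's affinity matches the CPUs of its set rather than a subset of them) can be written down as a small user-space check. This is only a sketch in Python, not kernel code: the helper names is_partition() and home_domain() are hypothetical, and root domains are modelled as plain sets of CPU numbers.

```python
def is_partition(domains):
    """Return True if no CPU appears in more than one domain."""
    seen = set()
    for cpus in domains:
        if seen & cpus:          # a CPU already claimed by another domain
            return False
        seen |= cpus
    return True


def home_domain(cpus_allowed, domains):
    """Return the domain whose CPUs equal the affinity mask, else None.

    A mask that is a strict subset of a domain, or that spans several
    domains, has no home domain in this model.
    """
    for cpus in domains:
        if cpus == cpus_allowed:
            return cpus
    return None


domains = [frozenset({0, 1}), frozenset({2, 3})]
print(is_partition(domains))                            # True
print(home_domain(frozenset({2, 3}), domains) is None)  # False
print(home_domain(frozenset({2}), domains))             # None
print(home_domain(frozenset({0, 1, 2, 3}), domains))    # None
```

In this model a mask covering all four CPUs is rejected for the same reason as a single-CPU mask: neither coincides with exactly one root domain.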
On 24 August 2017 at 01:53, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
[...]
>> > I am not sure how difficult it can be to implement this...
>>
>> Humm... I am missing some context here.
>
> Or maybe I misunderstood the issue you were seeing (I am no expert on
> cpusets). Is it related to hierarchies of cpusets (with one cpuset
> contained inside another one)?

Having spent a lot of time in the CPUset code, I can understand the
confusion.

CPUset allows creating a hierarchy of sets, _seemingly_ creating
overlapping root domains.  Fortunately that isn't the case -
overlapping CPUsets are morphed together to create non-overlapping
root domains.  The magic happens in rebuild_sched_domains_locked() [1]
where generate_sched_domains() [2] transforms any CPUset topology into
disjoint domains.

> Can you describe how to reproduce the problematic situation?

Let's start with a 4 CPU system (in this case the Q401c Dragon board)
where patches 1/7 and 2/7 have been applied to a vanilla kernel.  I'm
also using Juri's tools [3,4] as described in Steve's email [5].

root@linaro-developer:/home/linaro# uname -a
Linux linaro-developer 4.13.0-rc5-00012-g98bf1310205e #149 SMP PREEMPT Thu Aug 24 13:12:39 MDT 2017 aarch64 GNU/Linux
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# cat /sys/devices/system/cpu/online
0-3
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[2]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
root@linaro-developer:/home/linaro#

This checks out as expected.  Now let's create 2 CPUsets and make sure
new root domains are created by setting the 'sched_load_balance' flag
to '0' on the default CPUset.

root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@linaro-developer:/sys/fs/cgroup/cpuset#

At this time runqueue0 and runqueue1 point to root domain A while
runqueue2 and runqueue3 point to root domain B (something that can't
be seen without adding more instrumentation).

Newly created tasks can roam on all the CPUs available:

root@linaro-developer:/home/linaro# ./burn &
[1] 3973
root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
Cpus_allowed: f
root@linaro-developer:/home/linaro#

The above demonstrates that even if we have two CPUsets new tasks belong
to the "default" CPUset and as such can use all the available CPUs.

Now let's make task 3973 a DL task:

root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0         <------ Problem
dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0         <------ Problem
dl_rq[2]:
  .dl_nr_running                 : 1
  .dl_nr_migratory               : 1
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718    <------ As expected
dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718    <------ As expected
root@linaro-developer:/home/linaro/jlelli#

When task 3973 was promoted to a DL task it was running on either CPU2
or CPU3.  The acceptance test was done on root domain B and the task
utilisation added as expected.  But as pointed out above task 3973 can
still be scheduled on CPU0 and CPU1 and that is a problem since the
utilisation hasn't been added there as well.  The task is now spread
over two root domains rather than a single one, as currently expected
by the DL code (note that there are many ways to reproduce this
situation).

In its current form the patchset prevents specific operations from
being carried out if we recognise that a task could end up spanning
more than a single root domain.  But that will break as soon as we
find a new way to create a DL task that spans multiple domains (and I
may not have caught them all either).

Another way to fix this is to do an acceptance test on all the root
domains of a task.  So above we'd run the acceptance test on root
domains A and B before promoting the task.  Of course we'd also have to
add the utilisation of that task to both root domains.  Although simple
it goes to the core of the DL scheduler and touches pretty much every
aspect of it, something I'm reluctant to embark on.

[1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
[2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
[3]. https://github.com/jlelli/tests.git
[4]. https://github.com/jlelli/schedtool-dl.git
[5]. https://lkml.org/lkml/2016/2/3/966

[...]
> I still do not fully understand this (I got the impression that this is
> related to hierarchies of cpusets, but I am not sure if this
> understanding is correct). Maybe an example would help me to understand.

The above should say it all - please get back to me if I haven't
expressed myself clearly.

>
> Thanks,
> Luca
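For readers who want to check the numbers in the /proc/sched_debug output, and to see what "an acceptance test on all the root domains of a task" could look like, here is a minimal user-space sketch in Python. Everything in it is an assumption made for illustration: utilisation is taken to be runtime/period stored with 20 fractional bits (which reproduces 943718 for 900000:1000000, and 996147 for a 95% limit), the per-domain rule is taken to be "admitted bandwidth must fit in bw times the number of CPUs", and RootDomain and admit_on_all_domains() are hypothetical names, not kernel symbols.

```python
from dataclasses import dataclass

BW_SHIFT = 20  # assumption: 20 fractional bits, consistent with the values above


def to_ratio(period, runtime):
    """Utilisation runtime/period as a fixed-point integer."""
    return (runtime << BW_SHIFT) // period


@dataclass
class RootDomain:
    """Hypothetical stand-in for a root domain's DL bandwidth pool."""
    cpus: frozenset
    bw: int            # per-CPU limit, shown as dl_bw->bw
    total_bw: int = 0  # admitted bandwidth, shown as dl_bw->total_bw

    def would_overflow(self, new_bw):
        # Assumed per-domain rule: admitted bandwidth must fit in bw * nr_cpus.
        return self.bw * len(self.cpus) < self.total_bw + new_bw


def admit_on_all_domains(task_bw, cpus_allowed, domains):
    """Charge the task to every root domain it can run in, or to none."""
    reachable = [rd for rd in domains if rd.cpus & cpus_allowed]
    if not reachable or any(rd.would_overflow(task_bw) for rd in reachable):
        return False
    for rd in reachable:
        rd.total_bw += task_bw
    return True


limit = to_ratio(1_000_000, 950_000)     # 996147
task_bw = to_ratio(1_000_000, 900_000)   # 943718

domain_a = RootDomain(frozenset({0, 1}), limit)
domain_b = RootDomain(frozenset({2, 3}), limit)
everywhere = frozenset({0, 1, 2, 3})     # Cpus_allowed: f

print(admit_on_all_domains(task_bw, everywhere, [domain_a, domain_b]))  # True
print(domain_a.total_bw, domain_b.total_bw)  # 943718 943718
```

With this rule the task's utilisation shows up in both domains instead of only in root domain B. The sketch says nothing about the harder part discussed in the thread, namely how the rest of the DL code (migration, accounting on task exit) would cope with a task charged to more than one domain.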
Hi Mathieu,

On Thu, 24 Aug 2017 14:32:20 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
[...]
> > Or maybe I misunderstood the issue you were seeing (I am no expert
> > on cpusets). Is it related to hierarchies of cpusets (with one
> > cpuset contained inside another one)?
>
> Having spent a lot of time in the CPUset code, I can understand the
> confusion.
>
> CPUset allows creating a hierarchy of sets, _seemingly_ creating
> overlapping root domains.  Fortunately that isn't the case -
> overlapping CPUsets are morphed together to create non-overlapping
> root domains.  The magic happens in rebuild_sched_domains_locked() [1]
> where generate_sched_domains() [2] transforms any CPUset topology into
> disjoint domains.

Ok; thanks for explaining

[...]
> root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@linaro-developer:/sys/fs/cgroup/cpuset#
>
> At this time runqueue0 and runqueue1 point to root domain A while
> runqueue2 and runqueue3 point to root domain B (something that can't
> be seen without adding more instrumentation).

Ok; up to here, everything is clear to me ;-)

> Newly created tasks can roam on all the CPUs available:
>
> root@linaro-developer:/home/linaro# ./burn &
> [1] 3973
> root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
> Cpus_allowed: f
> root@linaro-developer:/home/linaro#

This happens because the task is neither in set1 nor in set2, right? I
_think_ (but I am not sure; I did not design this part of
SCHED_DEADLINE) that the original idea was that in this situation
SCHED_DEADLINE tasks can be only in set1 or in set2 (SCHED_DEADLINE
tasks are not allowed to be in the "default" CPUset, in this setup).
Is this what one of your later patches enforces?

> The above demonstrates that even if we have two CPUsets new tasks belong
> to the "default" CPUset and as such can use all the available CPUs.

I still have a doubt (probably showing all my ignorance about
CPUsets :)... In this situation, we have 3 CPUsets: "default",
set1, and set2... Is every one of these CPUsets associated with a
root domain (so, we have 3 root domains)? Or are only set1 and set2
associated with a root domain?

> Now let's make task 3973 a DL task:
>
> root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
> root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
> dl_rq[0]:
>   .dl_nr_running                 : 0
>   .dl_nr_migratory               : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0         <------ Problem

Ok; I think I understand the problem, now...

> dl_rq[3]:
>   .dl_nr_running                 : 0
>   .dl_nr_migratory               : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 943718    <------ As expected
> root@linaro-developer:/home/linaro/jlelli#
>
> When task 3973 was promoted to a DL task it was running on either CPU2
> or CPU3.  The acceptance test was done on root domain B and the task
> utilisation added as expected.  But as pointed out above task 3973 can
> still be scheduled on CPU0 and CPU1 and that is a problem since the
> utilisation hasn't been added there as well.  The task is now spread
> over two root domains rather than a single one, as currently expected
> by the DL code (note that there are many ways to reproduce this
> situation).

I think this is a bug, and the only reasonable solution is to allow the
task to become SCHED_DEADLINE only if it is in set1 or set2 (so, if its
affinity mask coincides exactly with all of the CPUs of the root domain
where the task utilization is added).

> In its current form the patchset prevents specific operations from
> being carried out if we recognise that a task could end up spanning
> more than a single root domain.

Good. I think this is the right way to go.

> But that will break as soon as we
> find a new way to create a DL task that spans multiple domains (and I
> may not have caught them all either).

So, we need to fix that too ;-)

> Another way to fix this is to do an acceptance test on all the root
> domains of a task.

I think we need to understand what the intended behaviour of
SCHED_DEADLINE is in this situation... My understanding is that
SCHED_DEADLINE is designed to do global EDF scheduling inside an
"isolated" CPUset; a SCHED_DEADLINE task spanning multiple domains would
break some SCHED_DEADLINE properties (from the scheduling theory point
of view) in some interesting ways...

I am not saying we should not do this, but I believe that allowing
tasks to span multiple domains requires some redesign of the admission
test and migration mechanisms in SCHED_DEADLINE.

I think this is related to the "generic affinities" issue that Peter
mentioned some time ago.

> So above we'd run the acceptance test on root
> domains A and B before promoting the task.  Of course we'd also have to
> add the utilisation of that task to both root domains.  Although simple
> it goes to the core of the DL scheduler and touches pretty much every
> aspect of it, something I'm reluctant to embark on.

I see... So, the "default" CPUset does not have any root domain
associated with it? If it had, we could just subtract the maximum
utilizations of set1 and set2 from it when creating the root domains of
set1 and set2.

Thanks,
Luca
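On the question of how many root domains this setup produces, a simplified model of the behaviour described for generate_sched_domains() may help. The Python sketch below is an assumption-laden illustration, not the kernel algorithm: root_domains() is a hypothetical helper that considers only cpusets with 'sched_load_balance' set and merges any that overlap, and it ignores what happens to CPUs that end up in no load-balanced cpuset.

```python
def root_domains(cpusets):
    """Derive disjoint domains from (cpus, sched_load_balance) pairs.

    Only load-balanced, non-empty cpusets contribute. Overlapping ones are
    merged, so the returned sets never share a CPU.
    """
    domains = []
    for cpus, load_balance in cpusets:
        if not load_balance or not cpus:
            continue
        merged = set(cpus)
        kept = []
        for d in domains:
            if d & merged:
                merged |= d      # overlap: fold the existing domain in
            else:
                kept.append(d)
        domains = kept + [merged]
    return sorted(domains, key=min)


# The setup from the thread: load balancing off in the default cpuset,
# left on in set1 (CPUs 0-1) and set2 (CPUs 2-3).
setup = [({0, 1, 2, 3}, False), ({0, 1}, True), ({2, 3}, True)]
print(len(root_domains(setup)))  # 2

# With load balancing still on in the default cpuset everything merges.
print(len(root_domains([({0, 1, 2, 3}, True), ({0, 1}, True), ({2, 3}, True)])))  # 1
```

Under these assumptions the "default" cpuset contributes no domain of its own once its 'sched_load_balance' flag is cleared, which leaves two domains for three cpusets.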
On Fri, 25 Aug 2017 08:02:43 +0200 luca abeni <luca.abeni@santannapisa.it> wrote: [...] > > The above demonstrate that even if we have two CPUsets new task belong > > to the "default" CPUset and as such can use all the available CPUs. > > I still have a doubt (probably showing all my ignorance about > CPUsets :)... In this situation, we have 3 CPUsets: "default", > set1, and set2... Is everyone of these CPUsets associated to a > root domain (so, we have 3 root domains)? Or only set1 and set2 are > associated to a root domain? Ok, after reading (and hopefully understanding better :) the code, I think this question was kind of silly... There are only 2 root domains, corresponding to set1 and set2 (right?). [...] > > So above we'd run the acceptance test on root > > domain A and B before promoting the task. Of course we'd also have to > > add the utilisation of that task to both root domain. Although simple > > it goes at the core of the DL scheduler and touches pretty much every > > aspect of it, something I'm reluctant to embark on. > > I see... So, the "default" CPUset does not have any root domain > associated to it? If it had, we could just subtract the maximum > utilizations of set1 and set2 to it when creating the root domains of > set1 and set2. ... So, this idea of mine had no sense. I think the correct solution is what you implemented in your patchset (if I understand it correctly). If we want to have task spanning multiple root domains, many more changes in the code are needed... I am wondering if it would make more sense to track utilizations per runqueue (instead of per root domain): - when a task tries to become SCHED_DEADLINE, we count how many CPUs are in its affinity mask. 
Let's call "n" this number - then, we sum u / n (where "u" is the task's utilization) to the utilization of every runqueue that is in its affinity mask, and we check if all the sums are below the schedulability bound For tasks spanning one single root domain, this should be equivalent to the current admission test. Moreover, this check should ensure that no root domain can be ever overloaded (even if tasks span multiple domains). But I do not know the locking implications for this idea... I suspect it will not scale :( Luca
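The per-runqueue admission test proposed above can be sketched in a few lines. This is a user-space model, not kernel code: `admit_per_runqueue` is a hypothetical name, utilizations are plain floats, the bound defaults to 1.0 as an assumption, and locking is ignored entirely. Whether the comparison against the bound should be strict is also an assumption here.

```python
from typing import Dict, Iterable


def admit_per_runqueue(rq_util: Dict[int, float],
                       affinity: Iterable[int],
                       u: float,
                       bound: float = 1.0) -> bool:
    """Admit a task with utilization u if every runqueue in its affinity
    mask can absorb u / n without exceeding the bound.

    rq_util maps a CPU number to the utilization already accounted on its
    runqueue; it is updated in place only when the task is admitted.
    """
    cpus = sorted(set(affinity))
    if not cpus:
        return False
    share = u / len(cpus)  # the "u / n" term from the proposal

    # Check every runqueue first so that a rejection leaves no trace.
    if any(rq_util[cpu] + share > bound for cpu in cpus):
        return False

    for cpu in cpus:
        rq_util[cpu] += share
    return True
```

In this model, if every task in a root domain of n CPUs has an affinity mask equal to the whole domain, each runqueue holds (u_1 + ... + u_k) / n, so the per-runqueue check reduces to u_1 + ... + u_k <= n * bound, which has the form of a per-domain test. A task spanning two root domains adds its share to the runqueues of both, so neither domain can be pushed past the bound unnoticed.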
Hi Mathieu, On Wed, 23 Aug 2017 13:47:13 -0600 Mathieu Poirier <mathieu.poirier@linaro.org> wrote: > On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote: > > Hi Mathieu, > > Good day to you, > > > > > On Wed, 16 Aug 2017 15:20:36 -0600 > > Mathieu Poirier <mathieu.poirier@linaro.org> wrote: > > > >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1] > >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug > >> operations. When CPUhotplug and some CPUset manipulation take place root > >> domains are destroyed and new ones created, losing at the same time DL > >> accounting pertaining to utilisation. > > > > Thanks for looking at this longstanding issue! I am just back from > > vacations; in the next days I'll try your patches. > > Do you have some kind of scripts for reproducing the issue > > automatically? (I see that in the original email Steven described how > > to reproduce it manually; I just wonder if anyone already scripted the > > test). > > I didn't bother scripting it since it is so easy to do. I'm eager to > see how things work out on your end. I ran some tests with your patchset, and I confirm that it fixes the issue originally pointed out by Steven. But I still need to run some more tests (I'll continue on Monday). I think I found an issue by: 1) creating two disjoint cpusets (CPUs 0 and 1 in the first cpuset, CPUs 2 and 3 in the second one) and setting sched_load_balance to 0 2) starting a task in one of the two cpusets, and making it SCHED_DEADLINE <--- up to here, everything looks fine 3) setting sched_load_balance to 1 <--- At this point, I think there is a bug: the system has only one root domain, and the task utilization is summed to it... But the task affinity mask is still the one of the "old root domain" that was associated with the cpuset where the task is executing. I still need to run some experiments about this. Thanks, Luca
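The three steps listed above could be scripted roughly as follows. This is a hedged sketch, not a script used by anyone in the thread. It assumes a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset, a single memory node (0), that sched_load_balance is toggled on the parent (root) cpuset, and that the util-linux `chrt` tool with SCHED_DEADLINE support is available (the schedtool-dl tool mentioned in the thread could be used instead). The names `repro_plan` and `apply_plan`, the cpuset names set1/set2 and the runtime/period values are all hypothetical. The plan is built as data so it can be inspected without root privileges.

```python
import os
import subprocess
from pathlib import PurePosixPath
from typing import List, Tuple

Step = Tuple[str, ...]


def repro_plan(pid: int,
               root: str = "/sys/fs/cgroup/cpuset",  # assumed mount point
               runtime_ns: int = 10_000_000,         # arbitrary example value
               period_ns: int = 100_000_000) -> List[Step]:
    """Build the ordered list of actions for the three reproduction steps."""
    base = PurePosixPath(root)
    plan: List[Step] = []

    # Step 1: two disjoint cpusets, then load balancing off on the parent.
    for name, cpus in (("set1", "0-1"), ("set2", "2-3")):
        plan.append(("mkdir", str(base / name)))
        plan.append(("write", str(base / name / "cpuset.cpus"), cpus))
        plan.append(("write", str(base / name / "cpuset.mems"), "0"))
    plan.append(("write", str(base / "cpuset.sched_load_balance"), "0"))

    # Step 2: move the task into set1 and make it SCHED_DEADLINE.
    plan.append(("write", str(base / "set1" / "tasks"), str(pid)))
    plan.append(("run", "chrt", "--deadline",
                 "--sched-runtime", str(runtime_ns),
                 "--sched-period", str(period_ns),
                 "--pid", "0", str(pid)))

    # Step 3: load balancing back on; the root domains get merged again.
    plan.append(("write", str(base / "cpuset.sched_load_balance"), "1"))
    return plan


def apply_plan(plan: List[Step]) -> None:
    """Execute a plan; needs root and modifies the running system."""
    for step in plan:
        if step[0] == "mkdir":
            os.makedirs(step[1], exist_ok=True)
        elif step[0] == "write":
            with open(step[1], "w") as f:
                f.write(step[2])
        else:
            subprocess.run(list(step[1:]), check=True)
```

Printing `repro_plan(pid)` shows the sequence without touching the system; `apply_plan` should only be run on a test machine.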
On 25 August 2017 at 03:52, Luca Abeni <luca.abeni@santannapisa.it> wrote: > On Fri, 25 Aug 2017 08:02:43 +0200 > luca abeni <luca.abeni@santannapisa.it> wrote: > [...] >> > The above demonstrate that even if we have two CPUsets new task belong >> > to the "default" CPUset and as such can use all the available CPUs. >> >> I still have a doubt (probably showing all my ignorance about >> CPUsets :)... In this situation, we have 3 CPUsets: "default", >> set1, and set2... Is everyone of these CPUsets associated to a >> root domain (so, we have 3 root domains)? Or only set1 and set2 are >> associated to a root domain? > > Ok, after reading (and hopefully understanding better :) the code, I > think this question was kind of silly... There are only 2 root domains, > corresponding to set1 and set2 (right?). Correct - although there is a default CPUset there isn't a default root domain. > > [...] > >> > So above we'd run the acceptance test on root >> > domain A and B before promoting the task. Of course we'd also have to >> > add the utilisation of that task to both root domain. Although simple >> > it goes at the core of the DL scheduler and touches pretty much every >> > aspect of it, something I'm reluctant to embark on. >> >> I see... So, the "default" CPUset does not have any root domain >> associated to it? If it had, we could just subtract the maximum >> utilizations of set1 and set2 to it when creating the root domains of >> set1 and set2. > ... > So, this idea of mine had no sense. > > I think the correct solution is what you implemented in your patchset > (if I understand it correctly). > > If we want to have task spanning multiple root domains, many more > changes in the code are needed... I am wondering if it would make more > sense to track utilizations per runqueue (instead of per root domain): > - when a task tries to become SCHED_DEADLINE, we count how many CPUs are > in its affinity mask. 
Let's call "n" this number > - then, we sum u / n (where "u" is the task's utilization) to the > utilization of every runqueue that is in its affinity mask, and we > check if all the sums are below the schedulability bound > > For tasks spanning one single root domain, this should be equivalent to > the current admission test. Moreover, this check should ensure that no > root domain can be ever overloaded (even if tasks span multiple > domains). This is an idea worth exploring. > But I do not know the locking implications for this idea... I suspect > it will not scale :( Right, scaling could be a problem - we'd have to prototype it and see how bad things get. We _may_ be able to figure something out with RCU trickery. As I mentioned in a previous email, I toyed with the idea of extending the DL code to support more than one root domain. Maybe it is time to go back to it, finish the admission test and publish just that part... At least we would have code to comment on. Regardless of the avenue we choose, I think we could use my current solution as a stepping stone while we figure out what we really want to do. At least it would be an improvement on the current situation. > > > > Luca
On 25 August 2017 at 08:37, Luca Abeni <luca.abeni@santannapisa.it> wrote: > Hi Mathieu, > > On Wed, 23 Aug 2017 13:47:13 -0600 > Mathieu Poirier <mathieu.poirier@linaro.org> wrote: > >> On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote: >> > Hi Mathieu, >> >> Good day to you, >> >> > >> > On Wed, 16 Aug 2017 15:20:36 -0600 >> > Mathieu Poirier <mathieu.poirier@linaro.org> wrote: >> > >> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1] >> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug >> >> operations. When CPUhotplug and some CPUset manipulation take place root >> >> domains are destroyed and new ones created, losing at the same time DL >> >> accounting pertaining to utilisation. >> > >> > Thanks for looking at this longstanding issue! I am just back from >> > vacations; in the next days I'll try your patches. >> > Do you have some kind of scripts for reproducing the issue >> > automatically? (I see that in the original email Steven described how >> > to reproduce it manually; I just wonder if anyone already scripted the >> > test). >> >> I didn't bother scripting it since it is so easy to do. I'm eager to >> see how things work out on your end. > > I ran some tests with your patchset, and I confirm that it fixes the > issue originally pointed out by Steven. > Good, at least it's a start. > But I still need to run some more tests (I'll continue on Monday). > > I think I found an issue by: > 1) creating two disjoint cpusets (CPUs 0 and 1 in the first cpuset, > CPUs 2 and 3 in the second one) and setting sched_load_balance to 0 > 2) starting a task in one of the two cpusets, and making it > SCHED_DEADLINE <--- up to here, everything looks fine > 3) setting sched_load_balance to 1 <--- At this point, I think there is > a bug: the system has only one root domain, and the task utilization > is summed to it... 
But the task affinity mask is still the one of > the "old root domain" that was associated with the cpuset where the > task is executing. I can reproduce the problem on my side as well. This is how CPUset works and the expected behaviour. For normal tasks it isn't a problem but I agree with you that for DL tasks, we need to address this. > > I still need to run some experiments about this. Thanks for the time, Mathieu > > > > Thanks, > Luca
On 25 August 2017 at 03:52, Luca Abeni <luca.abeni@santannapisa.it> wrote: > On Fri, 25 Aug 2017 08:02:43 +0200 > luca abeni <luca.abeni@santannapisa.it> wrote: > [...] >> > The above demonstrate that even if we have two CPUsets new task belong >> > to the "default" CPUset and as such can use all the available CPUs. >> >> I still have a doubt (probably showing all my ignorance about >> CPUsets :)... In this situation, we have 3 CPUsets: "default", >> set1, and set2... Is everyone of these CPUsets associated to a >> root domain (so, we have 3 root domains)? Or only set1 and set2 are >> associated to a root domain? > > Ok, after reading (and hopefully understanding better :) the code, I > think this question was kind of silly... There are only 2 root domains, > corresponding to set1 and set2 (right?). For this scenario yes, you are correct. > > [...] > >> > So above we'd run the acceptance test on root >> > domain A and B before promoting the task. Of course we'd also have to >> > add the utilisation of that task to both root domain. Although simple >> > it goes at the core of the DL scheduler and touches pretty much every >> > aspect of it, something I'm reluctant to embark on. >> >> I see... So, the "default" CPUset does not have any root domain >> associated to it? If it had, we could just subtract the maximum >> utilizations of set1 and set2 to it when creating the root domains of >> set1 and set2. > ... > So, this idea of mine had no sense. > > I think the correct solution is what you implemented in your patchset > (if I understand it correctly). > > If we want to have task spanning multiple root domains, many more > changes in the code are needed... I am wondering if it would make more > sense to track utilizations per runqueue (instead of per root domain): > - when a task tries to become SCHED_DEADLINE, we count how many CPUs are > in its affinity mask. 
Let's call "n" this number > - then, we sum u / n (where "u" is the task's utilization) to the > utilization of every runqueue that is in its affinity mask, and we > check if all the sums are below the schedulability bound > > For tasks spanning one single root domain, this should be equivalent to > the current admission test. Moreover, this check should ensure that no > root domain can be ever overloaded (even if tasks span multiple > domains). > But I do not know the locking implications for this idea... I suspect > it will not scale :( > > > > Luca
On Wed, Aug 16, 2017 at 03:20:36PM -0600, Mathieu Poirier wrote: > In this set the problem is addressed by relying on existing list of tasks > (sleeping or not) already maintained by CPUsets. Right, that's a much saner approach :-) > OPEN ISSUE: > > Regardless of how we proceed (using existing CPUset list or new ones) we > need to deal with DL tasks that span more than one root domain, something > that will typically happen after a CPUset operation. For example, if we > split the number of available CPUs on a system in two CPUsets and then turn > off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the > parent CPUset will end up spanning two root domains. > > One way to deal with this is to prevent CPUset operations from happening > when such condition is detected, as enacted in this set. Although simple > this approach feels brittle and akin to a "whack-a-mole" game. A better > and more reliable approach would be to teach the DL scheduler to deal with > tasks that span multiple root domains, a serious and substantial > undertaking. > > I am sending this as a starting point for discussion. I would be grateful > if you could take the time to comment on the approach and most importantly > provide input on how to deal with the open issue underlined above. Right, so teaching DEADLINE about arbitrary affinities is 'interesting'. Although the rules proposed by Tomasso; if found sufficient; would greatly simplify things. Also the online semi-partition approach to SMP could help with that. But yes, that's fairly massive surgery. For now I think we'll have to live and accept the limitations. So failing the various cpuset operations when they violate rules seems fine. Relaxing rules is always easier than tightening them (later). One 'series' you might be interested in when respinning these is: https://lkml.kernel.org/r/20171011094833.pdp4torvotvjdmkt@hirez.programming.kicks-ass.net By doing synchronous domain rebuild we lose a bunch of funnies.
On 11 October 2017 at 10:02, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Aug 16, 2017 at 03:20:36PM -0600, Mathieu Poirier wrote: > >> In this set the problem is addressed by relying on existing list of tasks >> (sleeping or not) already maintained by CPUsets. > > Right, that's a much saner approach :-) Luca and Juri had the same opinion so let's continue with that solution. > >> OPEN ISSUE: >> >> Regardless of how we proceed (using existing CPUset list or new ones) we >> need to deal with DL tasks that span more than one root domain, something >> that will typically happen after a CPUset operation. For example, if we >> split the number of available CPUs on a system in two CPUsets and then turn >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the >> parent CPUset will end up spanning two root domains. >> >> One way to deal with this is to prevent CPUset operations from happening >> when such condition is detected, as enacted in this set. Although simple >> this approach feels brittle and akin to a "whack-a-mole" game. A better >> and more reliable approach would be to teach the DL scheduler to deal with >> tasks that span multiple root domains, a serious and substantial >> undertaking. >> >> I am sending this as a starting point for discussion. I would be grateful >> if you could take the time to comment on the approach and most importantly >> provide input on how to deal with the open issue underlined above. > > Right, so teaching DEADLINE about arbitrary affinities is 'interesting'. > > Although the rules proposed by Tomasso; if found sufficient; would > greatly simplify things. Also the online semi-partition approach to SMP > could help with that. The "rules" proposed by Tomasso, are you referring to patches or the deadline/cgroup extension work that he presented at OSPM? I'd also be interested to know more about this "online semi-partition approach to SMP" you mentioned. 
Maybe that's a conversation we could have at the upcoming RT summit in Prague. > > But yes, that's fairly massive surgery. For now I think we'll have to > live and accept the limitations. So failing the various cpuset > operations when they violate rules seems fine. Relaxing rules is always > easier than tightening them (later). Agreed. > > One 'series' you might be interested in when respinning these is: > > https://lkml.kernel.org/r/20171011094833.pdp4torvotvjdmkt@hirez.programming.kicks-ass.net > > By doing synchronous domain rebuild we lose a bunch of funnies. Getting rid of the asynchronous nature of the hotplug path would be a delight - I'll start keeping track of that effort as well. Thanks for the review, Mathieu
Hi Mathieu, On Thu, 12 Oct 2017 10:57:09 -0600 Mathieu Poirier <mathieu.poirier@linaro.org> wrote: [...] > >> Regardless of how we proceed (using existing CPUset list or new ones) we > >> need to deal with DL tasks that span more than one root domain, something > >> that will typically happen after a CPUset operation. For example, if we > >> split the number of available CPUs on a system in two CPUsets and then turn > >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the > >> parent CPUset will end up spanning two root domains. > >> > >> One way to deal with this is to prevent CPUset operations from happening > >> when such condition is detected, as enacted in this set. Although simple > >> this approach feels brittle and akin to a "whack-a-mole" game. A better > >> and more reliable approach would be to teach the DL scheduler to deal with > >> tasks that span multiple root domains, a serious and substantial > >> undertaking. > >> > >> I am sending this as a starting point for discussion. I would be grateful > >> if you could take the time to comment on the approach and most importantly > >> provide input on how to deal with the open issue underlined above. > > > > Right, so teaching DEADLINE about arbitrary affinities is 'interesting'. > > > > Although the rules proposed by Tomasso; if found sufficient; would > > greatly simplify things. Also the online semi-partition approach to SMP > > could help with that. > > The "rules" proposed by Tomasso, are you referring to patches or the > deadline/cgroup extension work that he presented at OSPM? No, that is an unrelated thing... Tommaso previously proposed some improvements to the admission control mechanism to take arbitrary affinities into account. 
I think Tommaso's proposal is similar to what I previously proposed in this thread (to admit a SCHED_DEADLINE task with utilization u = runtime / period and affinity to N runqueues, we can account u / N to each one of the runqueues, and check if the sum of the utilizations on each runqueue is < 1). As previously noted by Peter, this might have some scalability issues (a naive implementation would lock the root domain while iterating over all the runqueues). A few days ago, I was discussing with Tommaso a possible solution based on not locking the root domain structure, and possibly using a roll-back strategy if the status of the root domain changes while we are updating it. I think in a previous email you mentioned RCU, which might result in a similar solution. Anyway, I am adding Tommaso to cc so that he can comment more. > I'd also be > interested to know more about this "online semi-partition approach to > SMP" you mentioned. It is basically an implementation (and extension to arbitrary affinities) of this work: http://drops.dagstuhl.de/opus/volltexte/2017/7165/ Luca > Maybe that's a conversation we could have at the > upcoming RT summit in Prague. > > > > > But yes, that's fairly massive surgery. For now I think we'll have to > > live and accept the limitations. So failing the various cpuset > > operations when they violate rules seems fine. Relaxing rules is always > > easier than tightening them (later). > > Agreed. > > > > > One 'series' you might be interested in when respinning these is: > > > > https://lkml.kernel.org/r/20171011094833.pdp4torvotvjdmkt@hirez.programming.kicks-ass.net > > > > By doing synchronous domain rebuild we lose a bunch of funnies. > > Getting rid of the asynchronous nature of the hotplug path would be a > delight - I'll start keeping track of that effort as well. > > Thanks for the review, > Mathieu
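One possible reading of the roll-back idea mentioned in this message is an optimistic update validated against a generation counter. The sketch below is a single-threaded user-space model, not the design discussed by Luca and Tommaso and not kernel code. The `RootDomain` class, its `generation` field, `try_admit` and the `on_update` hook are all hypothetical, and the hook exists only to simulate a concurrent root-domain change. A real implementation would still need per-runqueue locking or atomic updates.

```python
from typing import Callable, Dict, Iterable, Optional


class RootDomain:
    """Toy root domain: per-runqueue utilization plus a change counter."""

    def __init__(self, cpus: Iterable[int]) -> None:
        self.generation = 0  # assumed to be bumped on every domain rebuild
        self.rq_util: Dict[int, float] = {cpu: 0.0 for cpu in cpus}


def try_admit(rd: RootDomain,
              affinity: Iterable[int],
              u: float,
              bound: float = 1.0,
              max_retries: int = 3,
              on_update: Optional[Callable[[RootDomain], None]] = None) -> bool:
    """Account u / N on each runqueue without a domain-wide lock, then
    roll back if the bound is exceeded or the domain changed meanwhile."""
    cpus = sorted(set(affinity))
    if not cpus:
        return False
    share = u / len(cpus)

    for _ in range(max_retries):
        seen = rd.generation  # snapshot taken before touching anything
        applied = []
        over_bound = False

        for cpu in cpus:
            if rd.rq_util[cpu] + share > bound:
                over_bound = True
                break
            rd.rq_util[cpu] += share
            applied.append(cpu)
            if on_update is not None:
                on_update(rd)  # simulation hook, see lead-in

        if not over_bound and rd.generation == seen:
            return True  # nothing changed under us: keep the update

        for cpu in applied:  # roll back the partial update
            rd.rq_util[cpu] -= share
        if over_bound:
            return False  # genuine rejection, retrying would not help

    return False  # the domain kept changing: give up
```

The design choice illustrated here is that the common case pays nothing for domain-wide synchronisation, at the cost of occasionally redoing the work when a rebuild races with the admission.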