Message ID | 59b37c56b8fcb834f7d3234e776eaeff74ad117f.1556182965.git.viresh.kumar@linaro.org
---|---
State | New
Series | sched/fair: Fallback to sched-idle CPU for better performance
On Thu, Apr 25, 2019 at 03:07:40PM +0530, Viresh Kumar wrote:
> We target an idle CPU in select_idle_sibling() to run the next task, but in case we don't find idle CPUs it is better to pick a CPU which will run the task the soonest, for performance reasons. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criterion, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU should give better results as it should be able to run the task sooner than an idle CPU (which needs to be woken up from an idle state).
>
> This patch updates the fast path to fall back to a sched-idle CPU if an idle CPU isn't found; the slow path can be updated separately later.
>
> Following is the order in which select_idle_sibling() now picks the next CPU to run the task:
>
> 1. idle_cpu(target) OR sched_idle_cpu(target)
> 2. idle_cpu(prev) OR sched_idle_cpu(prev)
> 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
> 4. idle core(sd)
> 5. idle_cpu(sd)
> 6. sched_idle_cpu(sd)
> 7. idle_cpu(p) - smt
> 8. sched_idle_cpu(p) - smt
>
> Though the policy can be tweaked a bit if we want different priorities.

I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at.

There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible.
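For readers following along: SCHED_IDLE is the scheduling policy for tasks that should only consume CPU time nothing else wants. A minimal userspace sketch (not part of the series; assumes a Linux host with glibc) showing how a task becomes the "SCHED_IDLE activity" the patch description talks about:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* SCHED_IDLE only accepts a static priority of 0 */
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_IDLE, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}

	/* This busy loop now runs only when the CPU has nothing
	 * better to do; any waking normal fair task preempts it
	 * almost immediately. */
	for (;;)
		;
}

A CPU running only such loops is busy from nr_running's point of view, yet a waking fair task would preempt it right away, which is exactly why the patch treats such a CPU as nearly as good as an idle one.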
On 10-05-19, 09:21, Peter Zijlstra wrote:
> On Thu, Apr 25, 2019 at 03:07:40PM +0530, Viresh Kumar wrote:
> > We target an idle CPU in select_idle_sibling() to run the next task, but in case we don't find idle CPUs it is better to pick a CPU which will run the task the soonest, for performance reasons. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criterion, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU should give better results as it should be able to run the task sooner than an idle CPU (which needs to be woken up from an idle state).
> >
> > This patch updates the fast path to fall back to a sched-idle CPU if an idle CPU isn't found; the slow path can be updated separately later.
> >
> > Following is the order in which select_idle_sibling() now picks the next CPU to run the task:
> >
> > 1. idle_cpu(target) OR sched_idle_cpu(target)
> > 2. idle_cpu(prev) OR sched_idle_cpu(prev)
> > 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
> > 4. idle core(sd)
> > 5. idle_cpu(sd)
> > 6. sched_idle_cpu(sd)
> > 7. idle_cpu(p) - smt
> > 8. sched_idle_cpu(p) - smt
> >
> > Though the policy can be tweaked a bit if we want different priorities.
>
> I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at.
>
> There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible.

I am surely missing something, as I don't see how that patchset would make this patchset perform worse than it already does.

The idea of this patchset is to find a CPU which can run the task the soonest if no other CPU is idle. If a CPU is idle, we still want to run the task on it to finish the work as soon as possible. This patchset only updates the fast path right now and doesn't touch the slow path or the periodic/idle load-balance paths. That would be the next step for sure, though.

Steve's patchset (IIUC) adds a new, fast way of doing idle load balancing at the LLC level; that is no different from normal idle load balancing as far as this patchset is concerned. In fact, I would say that Steve's patchset makes our work easier to extend going forward, as we can capitalize on the new *fast* infrastructure to pull tasks even when a CPU isn't fully idle but only has sched-idle activity on it.

Does this make sense?

@Song: Thanks for giving this a try; I am really happy to see your results. I do see that we still don't get the performance we wanted, perhaps because we only touch the fast path. Maybe load balancing screws it up for us at a later point in time and CPUs are left with only sched-idle tasks. Not sure though.

--
viresh
On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
> On 10-05-19, 09:21, Peter Zijlstra wrote:
> > I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at.
> >
> > There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible.
>
> I am surely missing something, as I don't see how that patchset would make this patchset perform worse than it already does.

Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_sibling() _somewhere_ (as do I), and I just found the wrong one.

Basically everybody is complaining that select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too), which depends on it to kill their tail latency.

But I suppose we could still do this even if we scan only a subset of the LLC: just keep track of the last !idle CPU running only SCHED_IDLE tasks, and pick that if you do not (in your limited scan) find a better candidate.
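A bounded scan can keep such a sched-idle fallback essentially for free. A hypothetical sketch of what Peter describes (sched_idle_cpu() is the helper this patch introduces; the loop has the same shape as the select_idle_cpu() hunk in the diff further down):

/*
 * Hypothetical sketch: scan at most nr CPUs, remembering a sched-idle
 * candidate along the way and falling back to it when the budget runs
 * out or the scan completes without finding a fully idle CPU.
 */
static int bounded_scan(struct task_struct *p, struct sched_domain *sd,
			int target, int nr)
{
	int cpu, si_cpu = -1;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
		if (!--nr)
			return si_cpu;	/* budget exhausted: fall back */
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu))
			return cpu;	/* a fully idle CPU always wins */
		if (si_cpu == -1 && sched_idle_cpu(cpu))
			si_cpu = cpu;	/* remember the first sched-idle CPU */
	}

	return si_cpu;
}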
On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
>> On 10-05-19, 09:21, Peter Zijlstra wrote:
>
>>> I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at.
>>>
>>> There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible.
>>
>> I am surely missing something, as I don't see how that patchset would make this patchset perform worse than it already does.
>
> Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_sibling() _somewhere_ (as do I), and I just found the wrong one.
>
> Basically everybody is complaining that select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too), which depends on it to kill their tail latency.
>
> But I suppose we could still do this even if we scan only a subset of the LLC: just keep track of the last !idle CPU running only SCHED_IDLE tasks, and pick that if you do not (in your limited scan) find a better candidate.

Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined and searching a fixed number of CPUs from there. That technique is compatible with the one Viresh suggests; the incremental search would stop if a SCHED_IDLE CPU was found.

I also fiddled with select_idle_sibling(), maintaining a per-LLC bitmap of idle CPUs updated with atomic operations. Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes. That led me to ignore the push side and optimize the pull side with task stealing.

I would be very interested in hearing from folks that have workloads demonstrating that select_idle_sibling() is too expensive.

- Steve
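A sketch of that incremental search (SCAN_LIMIT and scan_rotor are hypothetical names; Subhra's actual patchset, linked in the next message, differs in detail), combined with the early stop on a sched-idle CPU that Steve mentions:

/*
 * Hypothetical sketch of an incremental LLC scan: resume where the
 * previous scan stopped and examine at most SCAN_LIMIT CPUs per
 * wakeup, stopping early on an idle or (per this series) sched-idle
 * CPU.
 */
#define SCAN_LIMIT	4

static DEFINE_PER_CPU(int, scan_rotor);

static int incremental_scan(struct task_struct *p, struct sched_domain *sd)
{
	int start = this_cpu_read(scan_rotor);
	int cpu, looked = 0;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), start) {
		if (++looked > SCAN_LIMIT)
			break;
		this_cpu_write(scan_rotor, cpu);	/* resume here next time */
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}

	return -1;	/* caller falls back to target/prev */
}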
On 5/14/19 9:03 AM, Steven Sistare wrote:
> On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
>> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
>>> On 10-05-19, 09:21, Peter Zijlstra wrote:
>>>> I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at.
>>>>
>>>> There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible.
>>> I am surely missing something, as I don't see how that patchset would make this patchset perform worse than it already does.
>> Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_sibling() _somewhere_ (as do I), and I just found the wrong one.
>>
>> Basically everybody is complaining that select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too), which depends on it to kill their tail latency.
>>
>> But I suppose we could still do this even if we scan only a subset of the LLC: just keep track of the last !idle CPU running only SCHED_IDLE tasks, and pick that if you do not (in your limited scan) find a better candidate.
> Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined and searching a fixed number of CPUs from there. That technique is compatible with the one Viresh suggests; the incremental search would stop if a SCHED_IDLE CPU was found.

This was the last version of the patchset I sent: https://lkml.org/lkml/2018/6/28/810

Also, select_idle_core() is a net negative for certain workloads like OLTP, so I had put in a SCHED_FEAT to be able to disable it.

Thanks,
Subhra

> I also fiddled with select_idle_sibling(), maintaining a per-LLC bitmap of idle CPUs updated with atomic operations. Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes. That led me to ignore the push side and optimize the pull side with task stealing.
>
> I would be very interested in hearing from folks that have workloads demonstrating that select_idle_sibling() is too expensive.
>
> - Steve
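For readers unfamiliar with it, SCHED_FEAT is the scheduler's existing mechanism for such runtime toggles (writable through /sys/kernel/debug/sched_features when CONFIG_SCHED_DEBUG is set). A sketch of the kind of guard Subhra describes; the feature name here is illustrative, not necessarily what his patchset used:

/* kernel/sched/features.h -- illustrative feature name */
SCHED_FEAT(SIS_CORE, true)

/* select_idle_sibling(): skip the whole-core scan when disabled */
if (sched_feat(SIS_CORE)) {
	i = select_idle_core(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;
}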
On 5/14/19 10:27 AM, Subhra Mazumdar wrote:
> On 5/14/19 9:03 AM, Steven Sistare wrote:
>> On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
>>> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
>>>> On 10-05-19, 09:21, Peter Zijlstra wrote:
>>>>> I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at.
>>>>>
>>>>> There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible.
>>>> I am surely missing something, as I don't see how that patchset would make this patchset perform worse than it already does.
>>> Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_sibling() _somewhere_ (as do I), and I just found the wrong one.
>>>
>>> Basically everybody is complaining that select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too), which depends on it to kill their tail latency.
>>>
>>> But I suppose we could still do this even if we scan only a subset of the LLC: just keep track of the last !idle CPU running only SCHED_IDLE tasks, and pick that if you do not (in your limited scan) find a better candidate.
>> Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined and searching a fixed number of CPUs from there. That technique is compatible with the one Viresh suggests; the incremental search would stop if a SCHED_IDLE CPU was found.
> This was the last version of the patchset I sent: https://lkml.org/lkml/2018/6/28/810
>
> Also, select_idle_core() is a net negative for certain workloads like OLTP, so I had put in a SCHED_FEAT to be able to disable it.

Forgot to add: the cpumask_weight() computation may not be O(1) with a large number of CPUs, so it needs to be precomputed in a per-CPU variable to optimize further. That part is missing from the above patchset.

>
> Thanks,
> Subhra
>> I also fiddled with select_idle_sibling(), maintaining a per-LLC bitmap of idle CPUs updated with atomic operations. Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes. That led me to ignore the push side and optimize the pull side with task stealing.
>>
>> I would be very interested in hearing from folks that have workloads demonstrating that select_idle_sibling() is too expensive.
>>
>> - Steve
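A sketch of that precomputation (names are illustrative; mainline in fact already caches the LLC span size in the per-CPU sd_llc_size, updated when sched domains are rebuilt, so a real implementation could reuse or extend that):

/*
 * Cache the LLC span weight per CPU so the wakeup fast path reads an
 * int instead of counting bits in a potentially large cpumask.
 */
static DEFINE_PER_CPU(int, llc_span_weight);

/* Called from the sched-domain rebuild path (e.g. on CPU hotplug). */
static void update_llc_span_weight(int cpu)
{
	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));

	per_cpu(llc_span_weight, cpu) =
		sd ? cpumask_weight(sched_domain_span(sd)) : 1;
}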
On 25-04-19, 15:07, Viresh Kumar wrote:
> We target an idle CPU in select_idle_sibling() to run the next task, but in case we don't find idle CPUs it is better to pick a CPU which will run the task the soonest, for performance reasons. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criterion, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU should give better results as it should be able to run the task sooner than an idle CPU (which needs to be woken up from an idle state).
>
> This patch updates the fast path to fall back to a sched-idle CPU if an idle CPU isn't found; the slow path can be updated separately later.
>
> Following is the order in which select_idle_sibling() now picks the next CPU to run the task:
>
> 1. idle_cpu(target) OR sched_idle_cpu(target)
> 2. idle_cpu(prev) OR sched_idle_cpu(prev)
> 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
> 4. idle core(sd)
> 5. idle_cpu(sd)
> 6. sched_idle_cpu(sd)
> 7. idle_cpu(p) - smt
> 8. sched_idle_cpu(p) - smt
>
> Though the policy can be tweaked a bit if we want different priorities.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  kernel/sched/fair.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)

Hi Peter,

I was looking to send V3 with the changes you suggested for patch 1/2; are there any changes I should make to this patch along with it?

--
viresh
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6511cb57acdd..fbaefb9a9296 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6057,6 +6057,15 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
+/* CPU only has SCHED_IDLE tasks enqueued */
+static int sched_idle_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+			rq->nr_running);
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6154,7 +6163,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, int target)
 {
-	int cpu;
+	int cpu, si_cpu = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6164,9 +6173,11 @@ static int select_idle_smt(struct task_struct *p, int target)
 			continue;
 		if (available_idle_cpu(cpu))
 			return cpu;
+		if (si_cpu == -1 && sched_idle_cpu(cpu))
+			si_cpu = cpu;
 	}
 
-	return -1;
+	return si_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6194,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, nr = INT_MAX, si_cpu = -1;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6222,11 +6233,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!--nr)
-			return -1;
+			return si_cpu;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (available_idle_cpu(cpu))
 			break;
+		if (si_cpu == -1 && sched_idle_cpu(cpu))
+			si_cpu = cpu;
 	}
 
 	time = local_clock() - time;
@@ -6245,13 +6258,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
-	if (available_idle_cpu(target))
+	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
 	/* Check a recently used CPU as a potential idle candidate: */
@@ -6259,7 +6273,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    available_idle_cpu(recent_used_cpu) &&
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
 		/*
 		 * Replace recent_used_cpu with prev as it is a potential
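To make the new predicate concrete, here is a worked example of the sched_idle_cpu() check above (values are illustrative; idle_h_nr_running is the hierarchical count of enqueued SCHED_IDLE tasks introduced by patch 1/2 of this series):

/*
 * sched_idle_cpu() is true only for a CPU that is busy, but busy
 * exclusively with SCHED_IDLE tasks. Illustrative values:
 *
 *   rq->nr_running            = 3
 *   rq->cfs.idle_h_nr_running = 3   (all three tasks are SCHED_IDLE)
 *   3 == 3 && 3               -> true: a fair task waking here would
 *                                preempt the SCHED_IDLE work at once.
 *
 *   rq->nr_running            = 4   (one normal fair task added)
 *   rq->cfs.idle_h_nr_running = 3
 *   4 == 3                    -> false: not a sched-idle CPU.
 *
 *   rq->nr_running            = 0   (fully idle)
 *   0 == 0 && 0               -> false: the trailing "&& rq->nr_running"
 *                                excludes the idle case, which
 *                                available_idle_cpu() already covers.
 */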
We target an idle CPU in select_idle_sibling() to run the next task, but in case we don't find idle CPUs it is better to pick a CPU which will run the task the soonest, for performance reasons. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criterion, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU should give better results as it should be able to run the task sooner than an idle CPU (which needs to be woken up from an idle state).

This patch updates the fast path to fall back to a sched-idle CPU if an idle CPU isn't found; the slow path can be updated separately later.

Following is the order in which select_idle_sibling() now picks the next CPU to run the task:

1. idle_cpu(target) OR sched_idle_cpu(target)
2. idle_cpu(prev) OR sched_idle_cpu(prev)
3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
4. idle core(sd)
5. idle_cpu(sd)
6. sched_idle_cpu(sd)
7. idle_cpu(p) - smt
8. sched_idle_cpu(p) - smt

Though the policy can be tweaked a bit if we want different priorities.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/fair.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

--
2.21.0.rc0.269.g1a574e7a288b