Message ID: 1311186383-24819-3-git-send-email-paulmck@linux.vnet.ibm.com
State: Accepted
Commit: be0e1e21ef707be4d16ea6a96ac9997463e4b8d2
On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> Given some common flag combinations, particularly -Os, gcc will inline
> rcu_read_unlock_special() despite its being in an unlikely() clause.
> Use noinline to prohibit this misoptimization.

Btw, I suspect that we should at least look at what it would mean if
we make the rcu_read_lock_nesting and the preempt counters both be
per-cpu variables instead of making them per-thread/process counters.

Then, when we switch threads, we'd just save/restore them from the
process register save area.

There are a lot of critical code sequences (spin-lock/unlock, rcu
read-lock/unlock) that currently fetch the thread/process pointer
only to then offset it and increment the count. I get the strong
feeling that code generation could be improved and we could avoid one
level of indirection by just making it a per-cpu counter.

For example, instead of __rcu_read_lock() looking like this (and being
an external function, partly because of header file dependencies on
the data structures involved):

        push   %rbp
        mov    %rsp,%rbp
        mov    %gs:0xb580,%rax
        incl   0x100(%rax)
        leaveq
        retq

it should inline to just something like

        incl %gs:0x100

instead. Same for the preempt counter.

Of course, it would need to involve making sure that we pick a good
cacheline etc that is already always dirty. But other than that, is
there any real downside?

                  Linus
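For reference, the per-task form that produces the six-instruction sequence quoted above is roughly the following (a minimal sketch; the actual function lives in kernel/rcutree_plugin.h and may differ in detail):

#include <linux/sched.h>

void __rcu_read_lock(void)
{
	/* Fetch current, then bump the nesting count stored in task_struct. */
	current->rcu_read_lock_nesting++;
	barrier();  /* compiler barrier: keep critical-section code after the increment */
}

The load of current (the %gs-relative load in the disassembly) and the offset addressing are exactly the indirection Linus is pointing at.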
On Wed, Jul 20, 2011 at 03:44:55PM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > Given some common flag combinations, particularly -Os, gcc will inline
> > rcu_read_unlock_special() despite its being in an unlikely() clause.
> > Use noinline to prohibit this misoptimization.
>
> Btw, I suspect that we should at least look at what it would mean if
> we make the rcu_read_lock_nesting and the preempt counters both be
> per-cpu variables instead of making them per-thread/process counters.
>
> Then, when we switch threads, we'd just save/restore them from the
> process register save area.
>
> There are a lot of critical code sequences (spin-lock/unlock, rcu
> read-lock/unlock) that currently fetch the thread/process pointer
> only to then offset it and increment the count. I get the strong
> feeling that code generation could be improved and we could avoid one
> level of indirection by just making it a per-cpu counter.
>
> For example, instead of __rcu_read_lock() looking like this (and being
> an external function, partly because of header file dependencies on
> the data structures involved):
>
>         push   %rbp
>         mov    %rsp,%rbp
>         mov    %gs:0xb580,%rax
>         incl   0x100(%rax)
>         leaveq
>         retq
>
> it should inline to just something like
>
>         incl %gs:0x100
>
> instead. Same for the preempt counter.
>
> Of course, it would need to involve making sure that we pick a good
> cacheline etc that is already always dirty. But other than that, is
> there any real downside?

We would need a form of per-CPU variable access that generated efficient
code, but that didn't complain about being used when preemption was
enabled.  __this_cpu_add_4() might do the trick, but I haven't dug
fully through it yet.

							Thanx, Paul
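A minimal sketch of the per-CPU variant being discussed, written with the generic this_cpu_*() accessors rather than the size-specific __this_cpu_add_4(); the rcu_nesting_pcp variable and the switch-time helper are hypothetical names for illustration, not anything in the tree:

#include <linux/percpu.h>
#include <linux/sched.h>

/* Hypothetical per-CPU nesting counter (sketch only). */
DEFINE_PER_CPU(int, rcu_nesting_pcp);

static inline void rcu_read_lock_pcp(void)
{
	/* On x86 this should compile down to a single incl %gs:<offset>. */
	this_cpu_inc(rcu_nesting_pcp);
}

static inline void rcu_read_unlock_pcp(void)
{
	this_cpu_dec(rcu_nesting_pcp);
	/* rcu_read_unlock_special() handling intentionally omitted. */
}

/*
 * On context switch, park the counter in the outgoing task and load the
 * incoming task's value, as Linus describes above (sketch only).
 */
static inline void rcu_pcp_switch(struct task_struct *prev,
				  struct task_struct *next)
{
	prev->rcu_read_lock_nesting = this_cpu_read(rcu_nesting_pcp);
	this_cpu_write(rcu_nesting_pcp, next->rcu_read_lock_nesting);
}

Whether such accessors warn when used with preemption enabled is exactly the open question Paul raises; the sketch just shows the intended shape of the fast path.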
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 3a0ae03..4d2c068 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -284,7 +284,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
  * notify RCU core processing or task having blocked during the RCU
  * read-side critical section.
  */
-static void rcu_read_unlock_special(struct task_struct *t)
+static noinline void rcu_read_unlock_special(struct task_struct *t)
 {
 	int empty;
 	int empty_exp;
@@ -391,11 +391,11 @@ void __rcu_read_unlock(void)
 	struct task_struct *t = current;
 
 	barrier();  /* needed if we ever invoke rcu_read_unlock in rcutree.c */
-	--t->rcu_read_lock_nesting;
-	barrier();  /* decrement before load of ->rcu_read_unlock_special */
-	if (t->rcu_read_lock_nesting == 0 &&
-	    unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
-		rcu_read_unlock_special(t);
+	if (--t->rcu_read_lock_nesting == 0) {
+		barrier();  /* decr before ->rcu_read_unlock_special load */
+		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
+			rcu_read_unlock_special(t);
+	}
 #ifdef CONFIG_PROVE_LOCKING
 	WARN_ON_ONCE(ACCESS_ONCE(t->rcu_read_lock_nesting) < 0);
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
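For readability, here is how __rcu_read_unlock() reads with the patch applied (reconstructed from the hunk above; surrounding code unchanged):

void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	barrier();  /* needed if we ever invoke rcu_read_unlock in rcutree.c */
	if (--t->rcu_read_lock_nesting == 0) {
		barrier();  /* decr before ->rcu_read_unlock_special load */
		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
			rcu_read_unlock_special(t);
	}
#ifdef CONFIG_PROVE_LOCKING
	WARN_ON_ONCE(ACCESS_ONCE(t->rcu_read_lock_nesting) < 0);
#endif /* #ifdef CONFIG_PROVE_LOCKING */
}

The decrement and the ->rcu_read_unlock_special check now sit under a single conditional, so a nested unlock takes only the decrement-and-compare, and the now-noinline slow path stays out of the generated fast path.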