Message ID: 20201106232908.364581-1-ira.weiny@intel.com
Series: PKS: Add Protection Keys Supervisor (PKS) support V3
Ira,

On Fri, Nov 06 2020 at 15:29, ira weiny wrote:

Subject prefix wants to be 'entry:'. This changes generic code and the
x86 part is just required to fix the generic code change.

> Currently struct irqentry_state_t only contains a single bool value
> which makes passing it by value reasonable. However, future patches
> propose to add information to this struct, for example the PKRS
> register/thread state.
>
> Adding information to irqentry_state_t makes passing by value less
> efficient. Therefore, change the entry/exit calls to pass irq_state by
> reference.

The PKRS muck needs to add a u32 to that struct. So how is that a
problem? The resulting struct still fits into 64bit which is by far
more efficiently passed by value than by reference. So which problem
are you solving here?

Thanks,

        tglx
On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
> Ira,
>
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
>
> Subject prefix wants to be 'entry:'. This changes generic code and the x86
> part is just required to fix the generic code change.

Sorry, yes that was carried incorrectly from earlier versions.

> > Currently struct irqentry_state_t only contains a single bool value
> > which makes passing it by value reasonable. However, future patches
> > propose to add information to this struct, for example the PKRS
> > register/thread state.
> >
> > Adding information to irqentry_state_t makes passing by value less
> > efficient. Therefore, change the entry/exit calls to pass irq_state by
> > reference.
>
> The PKRS muck needs to add a u32 to that struct. So how is that a
> problem?

There are more fields to be added for the kmap/pmem support, so this will be
needed eventually, even though it is not strictly necessary in the next patch.

> The resulting struct still fits into 64bit which is by far more
> efficiently passed by value than by reference. So which problem are you
> solving here?

I'm getting ahead of myself a bit. I will be adding more fields for the
kmap/pmem tracking.

Would you accept just a clean up for the variable names in this patch? I
could then add the pass by reference when I add the new fields later. Or
would an update to the commit message be ok to land this now?

Ira
Ira,

On Mon, Nov 16 2020 at 10:49, Ira Weiny wrote:
> On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
>> > Currently struct irqentry_state_t only contains a single bool value
>> > which makes passing it by value reasonable. However, future patches
>> > propose to add information to this struct, for example the PKRS
>> > register/thread state.
>> >
>> > Adding information to irqentry_state_t makes passing by value less
>> > efficient. Therefore, change the entry/exit calls to pass irq_state by
>> > reference.
>>
>> The PKRS muck needs to add a u32 to that struct. So how is that a
>> problem?
>
> There are more fields to be added for the kmap/pmem support, so this will
> be needed eventually, even though it is not strictly necessary in the next
> patch.

If you folks could start to write coherent changelogs with proper
explanations why something is needed and why it's going beyond the
immediate use case in the next patch, then getting stuff sorted would be
way easier. Sorry, my crystal ball got lost years ago and I'm just not
able to keep up with the flurry of patch sets which have dependencies in
one way or the other.

>> The resulting struct still fits into 64bit which is by far more
>> efficiently passed by value than by reference. So which problem are you
>> solving here?
>
> I'm getting ahead of myself a bit. I will be adding more fields for the
> kmap/pmem tracking.
>
> Would you accept just a clean up for the variable names in this patch? I
> could then add the pass by reference when I add the new fields later. Or
> would an update to the commit message be ok to land this now?

Can you provide a coherent explanation for the full set of things which
needs to be added here first, please?

Thanks,

        tglx
Thomas,

Is there any chance of this landing before the kmap stuff gets sorted out?
It would be nice to have this in 5.11 to build off of.

Ira

On Fri, Nov 06, 2020 at 03:28:58PM -0800, 'Ira Weiny' wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> Changes from V2 [4]
>     Rebased on tip-tree/core/entry
>     From Thomas Gleixner
>         Address bisectability
>         Drop patch:
>             x86/entry: Move nmi entry/exit into common code
>     From Greg KH
>         Remove WARN_ON's
>     From Dan Williams
>         Add __must_check to pks_key_alloc()
>     New patch: x86/pks: Add PKS defines and config options
>         Split from Enable patch to build on through the series
>     Fix compile errors
>
> Changes from V1
>     Rebase to TIP master; resolve conflicts and test
>     Clean up some kernel docs updates missed in V1
>     Add irqentry_state_t kernel doc for PKRS field
>     Removed redundant irq_state->pkrs
>         This is only needed when we add the global state and somehow
>         ended up in this patch series.  That will come back when we add
>         the global functionality in.
>     From Thomas Gleixner
>         Update commit messages
>         Add kernel doc for struct irqentry_state_t
>     From Dave Hansen
>         Add flags to pks_key_alloc()
>
> Changes from RFC V3 [3]
>     Rebase to TIP master
>     Update test error output
>     Standardize on 'irq_state' for state variables
>     From Dave Hansen
>         Update commit messages
>         Add/clean up comments
>         Add X86_FEATURE_PKS to disabled-features.h and remove some
>         explicit CONFIG checks
>         Move saved_pkrs member of thread_struct
>         Remove superfluous preempt_disable()
>         s/irq_save_pks/irq_save_set_pks/
>         Ensure PKRS is not seen in faults if not configured or not
>         supported
>         s/pks_mknoaccess/pks_mk_noaccess/
>         s/pks_mkread/pks_mk_readonly/
>         s/pks_mkrdwr/pks_mk_readwrite/
>         Change pks_key_alloc return to -EOPNOTSUPP when not supported
>     From Peter Zijlstra
>         Clean up attribution
>         Remove superfluous preempt_disable()
>         Add union to differentiate exit_rcu/lockdep use in
>         irqentry_state_t
>     From Thomas Gleixner
>         Add preliminary clean up patch and adjust series as needed
>
> Introduce a new page protection mechanism for supervisor pages, Protection
> Keys Supervisor (PKS).
>
> Two use cases for PKS are being developed: trusted keys and PMEM. Trusted
> keys is a newer use case which is still being explored. PMEM was submitted
> as part of the RFC (v2) series [1]. However, since then it was found that
> some callers of kmap() require a global implementation of PKS.
> Specifically, some users of kmap() expect mappings to be available to all
> kernel threads. While global use of PKS is rare, it needs to be included
> for correctness. Unfortunately the kmap() updates required a large patch
> series to make the needed changes at the various kmap() call sites, so
> that patch set has been split out. Because the global PKS feature is only
> required for that use case, it will be deferred to that set as well. [2]
> This patch set is being submitted as a precursor to both of the use cases.
>
> For an overview of the entire PKS ecosystem, a git tree including this
> series and 2 proposed use cases can be found here:
>
>     https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/
>     https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/
>
> PKS enables protections on 'domains' of supervisor pages to limit
> supervisor mode access to those pages beyond the normal paging
> protections. PKS works in a similar fashion to user space pkeys (PKU). As
> with PKU, supervisor pkeys are checked in addition to normal paging
> protections, and Access or Writes can be disabled via an MSR update
> without TLB flushes when permissions change. Also like PKU, a page mapping
> is assigned to a domain by setting pkey bits in the page table entry for
> that mapping.
>
> Access is controlled through a PKRS register which is updated via
> WRMSR/RDMSR.
>
> XSAVE is not supported for the PKRS MSR. Therefore the implementation
> saves/restores the MSR across context switches and during exceptions.
> Nested exceptions are supported by each exception getting a new PKS state.
>
> For consistent behavior with current paging protections, pkey 0 is
> reserved and configured to allow full access via the pkey mechanism, thus
> preserving the default paging protections on mappings with the default
> pkey value of 0.
>
> The other keys (1-15) are allocated by an allocator which prepares us for
> key contention from day one. Kernel users should be prepared for the
> allocator to fail either because of key exhaustion or due to PKS not being
> supported on the arch and/or CPU instance.
>
> The following are key attributes of PKS:
>
>     1) Fast switching of permissions
>         1a) Prevents access without page table manipulations
>         1b) No TLB flushes required
>     2) Works on a per thread basis
>
> PKS is available with 4 and 5 level paging. Like PKRU it consumes 4 bits
> from the PTE to store the pkey within the entry.
>
> [1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
> [2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.weiny@intel.com/
> [3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.weiny@intel.com/
> [4] https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
>
> Fenghua Yu (2):
>   x86/pks: Add PKS kernel API
>   x86/pks: Enable Protection Keys Supervisor (PKS)
>
> Ira Weiny (8):
>   x86/pkeys: Create pkeys_common.h
>   x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
>   x86/pks: Add PKS defines and Kconfig options
>   x86/pks: Preserve the PKRS MSR on context switch
>   x86/entry: Pass irqentry_state_t by reference
>   x86/entry: Preserve PKRS MSR across exceptions
>   x86/fault: Report the PKRS state on fault
>   x86/pks: Add PKS test code
>
>  Documentation/core-api/protection-keys.rst  | 103 ++-
>  arch/x86/Kconfig                            |   1 +
>  arch/x86/entry/common.c                     |  46 +-
>  arch/x86/include/asm/cpufeatures.h          |   1 +
>  arch/x86/include/asm/disabled-features.h    |   8 +-
>  arch/x86/include/asm/idtentry.h             |  25 +-
>  arch/x86/include/asm/msr-index.h            |   1 +
>  arch/x86/include/asm/pgtable.h              |  13 +-
>  arch/x86/include/asm/pgtable_types.h        |  12 +
>  arch/x86/include/asm/pkeys.h                |  15 +
>  arch/x86/include/asm/pkeys_common.h         |  40 ++
>  arch/x86/include/asm/processor.h            |  18 +-
>  arch/x86/include/uapi/asm/processor-flags.h |   2 +
>  arch/x86/kernel/cpu/common.c                |  15 +
>  arch/x86/kernel/cpu/mce/core.c              |   4 +-
>  arch/x86/kernel/fpu/xstate.c                |  22 +-
>  arch/x86/kernel/kvm.c                       |   6 +-
>  arch/x86/kernel/nmi.c                       |   4 +-
>  arch/x86/kernel/process.c                   |  26 +
>  arch/x86/kernel/traps.c                     |  21 +-
>  arch/x86/mm/fault.c                         |  87 ++-
>  arch/x86/mm/pkeys.c                         | 196 +++++-
>  include/linux/entry-common.h                |  31 +-
>  include/linux/pgtable.h                     |   4 +
>  include/linux/pkeys.h                       |  24 +
>  kernel/entry/common.c                       |  44 +-
>  lib/Kconfig.debug                           |  12 +
>  lib/Makefile                                |   3 +
>  lib/pks/Makefile                            |   3 +
>  lib/pks/pks_test.c                          | 692 ++++++++++++++++++++
>  mm/Kconfig                                  |   2 +
>  tools/testing/selftests/x86/Makefile        |   3 +-
>  tools/testing/selftests/x86/test_pks.c      |  66 ++
>  33 files changed, 1410 insertions(+), 140 deletions(-)
>  create mode 100644 arch/x86/include/asm/pkeys_common.h
>  create mode 100644 lib/pks/Makefile
>  create mode 100644 lib/pks/pks_test.c
>  create mode 100644 tools/testing/selftests/x86/test_pks.c
>
> --
> 2.28.0.rc0.12.gb6a658bd00c9
Ira,

On Mon, Dec 07 2020 at 14:14, Ira Weiny wrote:
> Is there any chance of this landing before the kmap stuff gets sorted out?

I have marked this as needing an update because the changelog of 5/10
sucks.

  https://lore.kernel.org/r/87lff1xcmv.fsf@nanos.tec.linutronix.de

> It would be nice to have this in 5.11 to build off of.

It would be nice if people followed up on review requests :)

Thanks,

        tglx
On Tue, Dec 08, 2020 at 04:55:54PM +0100, Thomas Gleixner wrote:
> Ira,
>
> On Mon, Dec 07 2020 at 14:14, Ira Weiny wrote:
> > Is there any chance of this landing before the kmap stuff gets sorted out?
>
> I have marked this as needing an update because the changelog of 5/10
> sucks.  https://lore.kernel.org/r/87lff1xcmv.fsf@nanos.tec.linutronix.de
>
> > It would be nice to have this in 5.11 to build off of.
>
> It would be nice if people followed up on review requests :)

I did, but just as an update to that patch. [1]  Sorry if this caused you
to miss the response.  It would have been better for me to ping you on that
patch. :-/

I was trying to avoid a whole new series just for that single commit
message.  Is that generally ok?  Is that commit message still lacking?

Ira

[1] https://lore.kernel.org/linux-doc/20201124060956.1405768-1-ira.weiny@intel.com/
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -43,6 +43,7 @@
>  #include <asm/io_bitmap.h>
>  #include <asm/proto.h>
>  #include <asm/frame.h>
> +#include <asm/pkeys_common.h>
>
>  #include "process.h"
>
> @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  	return ret;
>  }
>
> +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +DECLARE_PER_CPU(u32, pkrs_cache);
> +static inline void pks_init_task(struct task_struct *tsk)

First of all, I asked several times now not to glue stuff onto a function
without a newline in between. It's unreadable.

But what's worse is that the declaration of pkrs_cache, which is global, is
in a C file and not in a header. And pkrs_cache is not even used in this
file. So what?

> +{
> +	/* New tasks get the most restrictive PKRS value */
> +	tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> +}
> +static inline void pks_sched_in(void)

Newline between functions. It's fine for stubs, but not for a real
implementation.

> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index d1dfe743e79f..76a62419c446 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>
>  	return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);

Again, why is this global?

> +void write_pkrs(u32 new_pkrs)
> +{
> +	u32 *pkrs;
> +
> +	if (!static_cpu_has(X86_FEATURE_PKS))
> +		return;
> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);

So this is called from various places including schedule and also from the
low level entry/exit code. Why do we need to have an extra
preempt_disable/enable() there via get/put_cpu_ptr()?

Just because performance in those code paths does not matter?

> +	if (*pkrs != new_pkrs) {
> +		*pkrs = new_pkrs;
> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> +	}
> +	put_cpu_ptr(pkrs);

Now back to the context switch:

> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>
>  	if ((tifp ^ tifn) & _TIF_SLD)
>  		switch_to_sld(tifn);
> +
> +	pks_sched_in();
>  }

How is this supposed to work?

    switch_to() {
	....
	switch_to_extra() {
	     ....
	     if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
			  prev_tif & _TIF_WORK_CTXSW_PREV))
		       __switch_to_xtra(prev, next);

I.e. __switch_to_xtra() is only invoked when the above condition is true,
which is not guaranteed at all.

While I have to admit that I dropped the ball on the update for the entry
patch, I'm not too sorry about it anymore when looking at this.

Are you still sure that this is ready for merging?

Thanks,

        tglx
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
>  void disable_TSC(void)
> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>
>  	if ((tifp ^ tifn) & _TIF_SLD)
>  		switch_to_sld(tifn);
> +
> +	pks_sched_in();
>  }

Does the selftest for this ever actually schedule()?

I see it talking about context switching, but I don't immediately see how
it would.
On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
> +	/* Arm for context switch test */
> +	write(fd, "1", 1);
> +
> +	/* Context switch out... */
> +	sleep(4);
> +
> +	/* Check msr restored */
> +	write(fd, "2", 1);

These are always tricky.  What you ideally want here is:

1. Switch away from this task to a non-PKS task, or
2. Switch from this task to a PKS-using task, but one which has a
   different PKS value

then, switch back to this task and make sure PKS maintained its value.

*But*, there's no absolute guarantee that another task will run.  It would
not be totally unreasonable to have the kernel just sit in a loop without
context switching here if no other tasks can run.

The only way you *know* there is a context switch is by having two tasks
bound to the same logical CPU and making sure they run one after another.
This test just gets itself into a state where it *CAN* context switch and
prays that one will happen.

You can also run a bunch of these in parallel bound to a single CPU.  That
would also give you higher levels of assurance that *some* context switch
happens at sleep().

One critical thing with these tests is to sabotage the kernel and then run
them and make *sure* they fail.  Basically, if you screw up, do they
actually work to catch it?
On Thu, Dec 17 2020 at 15:50, Thomas Gleixner wrote:
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
>
>> +void write_pkrs(u32 new_pkrs)
>> +{
>> +	u32 *pkrs;
>> +
>> +	if (!static_cpu_has(X86_FEATURE_PKS))
>> +		return;
>> +
>> +	pkrs = get_cpu_ptr(&pkrs_cache);
>
> So this is called from various places including schedule and also from
> the low level entry/exit code. Why do we need to have an extra
> preempt_disable/enable() there via get/put_cpu_ptr()?
>
> Just because performance in those code paths does not matter?
>
>> +	if (*pkrs != new_pkrs) {
>> +		*pkrs = new_pkrs;
>> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
>> +	}
>> +	put_cpu_ptr(pkrs);

Which made me look at the other branch of your git repo just because I
wanted to know about the 'other' storage requirements and I found this gem:

> update_global_pkrs()
> ...
>	/*
>	 * If we are preventing access from the old value. Force the
>	 * update on all running threads.
>	 */
>	if (((old_val == 0) && protection) ||
>	    ((old_val & PKR_WD_BIT) && (protection & PKEY_DISABLE_ACCESS))) {
>		int cpu;
>
>		for_each_online_cpu(cpu) {
>			u32 *ptr = per_cpu_ptr(&pkrs_cache, cpu);
>
>			*ptr = update_pkey_val(*ptr, pkey, protection);
>			wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);
>			put_cpu_ptr(ptr);

1) per_cpu_ptr() -> put_cpu_ptr() is broken, as per_cpu_ptr() is not
   disabling preemption while put_cpu_ptr() enables it, which wreckages
   the preemption count.

   How was that ever tested at all with any debug option enabled?

   Answer: Not at all

2) How is that sequence:

	ptr = per_cpu_ptr(&pkrs_cache, cpu);
	*ptr = update_pkey_val(*ptr, pkey, protection);
	wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);

   supposed to be correct vs. a concurrent modification of the pkrs_cache
   of the remote CPU?

   Answer: Not at all

Also doing a wrmsrl_on_cpu() on _each_ online CPU is insane at best.

A smp function call on a remote CPU takes ~3-5us when the remote CPU is
not idle and can immediately respond. If the remote CPU is deep in idle it
can take up to 100us depending on the C-State it is in.

Even if the remote CPU is not idle and just has interrupts disabled for a
few dozen microseconds, this adds up.

So on a 256 CPU system, depending on the state of the remote CPUs, this
stalls the CPU doing the update for anything between 1 and 25ms worst
case.

Of course that also violates _all_ CPU isolation mechanisms.

What for?

Just for the theoretical chance that _all_ remote CPUs have seen that
global permission and have it still active? You're not serious about that,
right?

The only use case for this in your tree is: kmap() and the possible usage
of that mapping outside of the thread context which sets it up.

The only hint for doing this at all is:

    Some users, such as kmap(), sometimes requires PKS to be global.

'sometimes requires' is really _not_ a technical explanation.

Where is the explanation why kmap() usage 'sometimes' requires this global
trainwreck in the first place and where is the analysis why this can't be
solved differently?

Detailed use case analysis please.

Thanks,

        tglx
On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
> On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
> > +	/* Arm for context switch test */
> > +	write(fd, "1", 1);
> > +
> > +	/* Context switch out... */
> > +	sleep(4);
> > +
> > +	/* Check msr restored */
> > +	write(fd, "2", 1);
>
> These are always tricky.  What you ideally want here is:
>
> 1. Switch away from this task to a non-PKS task, or
> 2. Switch from this task to a PKS-using task, but one which has a
>    different PKS value

Or both...

> then, switch back to this task and make sure PKS maintained its value.
>
> *But*, there's no absolute guarantee that another task will run.  It
> would not be totally unreasonable to have the kernel just sit in a loop
> without context switching here if no other tasks can run.
>
> The only way you *know* there is a context switch is by having two tasks
> bound to the same logical CPU and make sure they run one after another.

Ah...  We do that.

...
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	/* Two processes run on CPU 0 so that they go through context switch. */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
...

I think this should be ensuring that both the parent and the child are
running on CPU 0.  At least according to the man page they should be.

<man>
A child created via fork(2) inherits its parent's CPU affinity mask.
</man>

Perhaps a better method would be to synchronize the 2 threads more to
ensure that we are really running at the 'same time' and forcing the
context switch.

> This just gets itself into a state where it *CAN* context switch and
> prays that one will happen.

Not sure what you mean by 'This'?  Do you mean that running on the same
CPU will sometimes not force a context switch?  Or do you mean that the
sleeps could be badly timed and the 2 threads could run 1 after the other
on the same CPU?  The latter is AFAICT the most likely case.

> You can also run a bunch of these in parallel bound to a single CPU.
> That would also give you higher levels of assurance that *some* context
> switch happens at sleep().

I think more cycles is a good idea for sure.  But I'm more comfortable
with forcing the test to be more synchronized so that it is actually
running in the order we think/want it to be.

> One critical thing with these tests is to sabotage the kernel and then
> run them and make *sure* they fail.  Basically, if you screw up, do they
> actually work to catch it?

I'll try and come up with a more stressful test.

Ira
On Thu, Dec 17, 2020 at 03:50:55PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -43,6 +43,7 @@
> >  #include <asm/io_bitmap.h>
> >  #include <asm/proto.h>
> >  #include <asm/frame.h>
> > +#include <asm/pkeys_common.h>
> >
> >  #include "process.h"
> >
> > @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> >  	return ret;
> >  }
> >
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +DECLARE_PER_CPU(u32, pkrs_cache);
> > +static inline void pks_init_task(struct task_struct *tsk)
>
> First of all. I asked several times now not to glue stuff onto a
> function without a newline inbetween. It's unreadable.

Fixed.

> But what's worse is that the declaration of pkrs_cache which is global
> is in a C file and not in a header. And pkrs_cache is not even used in
> this file. So what?

OK, this was just a complete rebase/refactor mess up on my part.

The global'ness is not required until we need a global update of the pkrs,
which was not part of this series.  I've removed it from this patch,
cleaned it up in patch 6/10 as well, and cleaned it up in the global pkrs
patch which you found in my git tree.

> > +{
> > +	/* New tasks get the most restrictive PKRS value */
> > +	tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> > +}
> > +static inline void pks_sched_in(void)
>
> Newline between functions. It's fine for stubs, but not for a real
> implementation.

Again my apologies.  Fixed.

> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index d1dfe743e79f..76a62419c446 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >
> >  	return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
>
> Again, why is this global?

In this patch it does not need to be.  I've changed it to static.

> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +	u32 *pkrs;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
>
> So this is called from various places including schedule and also from
> the low level entry/exit code. Why do we need to have an extra
> preempt_disable/enable() there via get/put_cpu_ptr()?
>
> Just because performance in those code paths does not matter?

Honestly I don't recall the full history at this point.  The
preempt_disable/enable() is required when this is called from
pks_update_protection(), i.e. when a user is trying to update the
protections of their key.

What I do remember is that this was originally not preempt safe and we had
a comment to that effect in the early patches. [1]  Somewhere along the
line the preempt discussion led us to make write_pkrs() 'self contained'
with the preemption protection here.  I just did not think about any
performance issues.

It is safe to call preempt_disable() from a preempt disabled region,
correct?  I seem to recall asking that and the answer was 'yes'.  I will
audit the calls again and adjust the preemption disable as needed.

[1] https://lore.kernel.org/lkml/20200717072056.73134-5-ira.weiny@intel.com/#t

> > +	if (*pkrs != new_pkrs) {
> > +		*pkrs = new_pkrs;
> > +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +	}
> > +	put_cpu_ptr(pkrs);
>
> Now back to the context switch:
>
> > @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >
> >  	if ((tifp ^ tifn) & _TIF_SLD)
> >  		switch_to_sld(tifn);
> > +
> > +	pks_sched_in();
> >  }
>
> How is this supposed to work?
>
> switch_to() {
>     ....
>     switch_to_extra() {
>          ....
>          if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
>                       prev_tif & _TIF_WORK_CTXSW_PREV))
>                    __switch_to_xtra(prev, next);
>
> I.e. __switch_to_xtra() is only invoked when the above condition is
> true, which is not guaranteed at all.

I did not know that.  I completely misunderstood what __switch_to_xtra()
meant.  I thought it was arch specific 'extra' stuff so it seemed
reasonable to me.  Also, our test seemed to work.  I'm still investigating
what may be wrong.

> While I have to admit that I dropped the ball on the update for the
> entry patch, I'm not too sorry about it anymore when looking at this.
>
> Are you still sure that this is ready for merging?

Nope...

Thanks for the review,
Ira
On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
> On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
> >  void disable_TSC(void)
> > @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >
> >  	if ((tifp ^ tifn) & _TIF_SLD)
> >  		switch_to_sld(tifn);
> > +
> > +	pks_sched_in();
> >  }
>
> Does the selftest for this ever actually schedule()?

At this point I'm not sure.  This code has been in since the beginning, so
it has seen a lot of soak time.

> I see it talking about context switching, but I don't immediately see
> how it would.

We were trying to force parent and child to run on the same CPU.  I
suspect something is wrong in the timing of that test.

Ira
On Thu, Dec 17 2020 at 23:43, Thomas Gleixner wrote:
> The only use case for this in your tree is: kmap() and the possible
> usage of that mapping outside of the thread context which sets it up.
>
> The only hint for doing this at all is:
>
>     Some users, such as kmap(), sometimes requires PKS to be global.
>
> 'sometimes requires' is really _not_ a technical explanation.
>
> Where is the explanation why kmap() usage 'sometimes' requires this
> global trainwreck in the first place and where is the analysis why this
> can't be solved differently?
>
> Detailed use case analysis please.

A lengthy conversation with Dan and Dave over IRC confirmed what I was
suspecting.

The approach of this whole PKS thing is to make _all_ existing code
magically "work". That means aside of the obvious thread local mappings,
the kmap() part is needed to solve the problem of async handling where the
mapping is handed to some other context which then uses it and notifies
the context which created the mapping when done. That's the principle
which was used to make highmem work a long time ago.

IMO that was a mistake back then. The right thing would have been to
change the code so that it does not rely on a temporary mapping created by
the initiator. Instead, let the initiator hand the page over to the other
context which then creates a temporary mapping for fiddling with it.
Water under the bridge...

Glueing PKS onto that kmap() thing is horrible and global PKS is pretty
much the opposite of what PKS wants to achieve. It's disabling protection
systemwide for an unspecified amount of time and for all contexts.

So instead of trying to make global PKS "work" we really should go and
take a smarter approach.

1) Many kmap() use cases are strictly thread local and the mapped address
   is never handed to some other context, which means this can be replaced
   with kmap_local() now, which preserves the mapping across preemption.
   PKS just works nicely on top of that.

2) Modify kmap() so that it marks the to be mapped page as 'globally
   unprotected' instead of doing this global unprotect PKS dance.
   kunmap() undoes that. That obviously needs some thought vs. refcounting
   if there are concurrent users, but that's a solvable problem either as
   part of struct page itself or stored in some global hash.

3) Have PKS modes:

   - STRICT:  No pardon

   - RELAXED: Warn and unprotect temporarily for the current context

   - SILENT:  Like RELAXED, but w/o warning to make sysadmins happy.
              Default should be RELAXED.

   - OFF:     Disable the whole PKS thing

4) Have a smart #PF mechanism which does:

	if (error_code & X86_PF_PK) {
		page = virt_to_page(address);

		if (!page || !page_is_globaly_unprotected(page))
			goto die;

		if (pks_mode == PKS_MODE_STRICT)
			goto die;

		WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ...");

		temporary_unprotect(page, regs);
		return;
	}

	temporary_unprotect(page, regs)
	{
		key = page_to_key(page);

		/* Return from #PF will establish this for the faulting context */
		extended_state(regs)->pks &= ~PKS_MASK(key);
	}

   This temporary unprotect is undone when the context is left, so
   depending on the context (thread, interrupt, softirq) the unprotected
   section might be way wider than actually needed, but that's still
   orders of magnitude better than having this fully unrestricted global
   PKS mode which is completely scopeless.

   The above is at least restricted to the pages which are in use for a
   particular operation. Stray pointers during that time are obviously not
   caught, but that's not any different from that proposed global thingy.

   The warning allows to find the non-obvious places so they can be
   analyzed and worked on.

5) The DAX case which you made "work" with dev_access_enable() and
   dev_access_disable(), i.e. with yet another lazy approach of avoiding
   to change a handful of usage sites.

   The use cases are strictly context local, which means the global magic
   is not used at all. Why does it exist in the first place?

   Aside of that, this global thing would never work at all because the
   refcounting is per thread and not global.

   So that DAX use case is just a matter of:

	grant/revoke_access(DEV_PKS_KEY, READ/WRITE)

   which is effective for the current execution context and really wants
   to be a distinct READ/WRITE protection and not the magic global thing
   which just has on/off. All usage sites know whether they want to read
   or write.

   That leaves the question about the refcount. AFAICT, nothing nests in
   that use case for a given execution context. I'm surely missing
   something subtle here. Hmm?

Thanks,

        tglx
On 12/17/20 8:10 PM, Ira Weiny wrote:
> On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
>> On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
>>>  void disable_TSC(void)
>>> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>>>
>>>  	if ((tifp ^ tifn) & _TIF_SLD)
>>>  		switch_to_sld(tifn);
>>> +
>>> +	pks_sched_in();
>>>  }
>>
>> Does the selftest for this ever actually schedule()?
>
> At this point I'm not sure. This code has been in since the beginning.
> So it's seen a lot of soak time.

Think about it another way. Let's say this didn't get called on the
first context switch away from the PKS-using task. Would anyone
notice? How likely is this to happen?

The function tracers or kprobes tend to be a great tool for this, at
least for testing whether the code path you expect to hit is getting
hit.
On Thu, Dec 17, 2020 at 8:05 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
> > On 11/6/20 3:29 PM, ira.weiny@intel.com wrote:
> > > +	/* Arm for context switch test */
> > > +	write(fd, "1", 1);
> > > +
> > > +	/* Context switch out... */
> > > +	sleep(4);
> > > +
> > > +	/* Check msr restored */
> > > +	write(fd, "2", 1);
> >
> > These are always tricky. What you ideally want here is:
> >
> > 1. Switch away from this task to a non-PKS task, or
> > 2. Switch from this task to a PKS-using task, but one which has a
> >    different PKS value
>
> Or both...
>
> > then, switch back to this task and make sure PKS maintained its value.
> >
> > *But*, there's no absolute guarantee that another task will run. It
> > would not be totally unreasonable to have the kernel just sit in a
> > loop without context switching here if no other tasks can run.
> >
> > The only way you *know* there is a context switch is by having two
> > tasks bound to the same logical CPU and make sure they run one after
> > another.
>
> Ah... We do that.
>
> ...
> +	CPU_ZERO(&cpuset);
> +	CPU_SET(0, &cpuset);
> +	/* Two processes run on CPU 0 so that they go through context switch. */
> +	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
> ...
>
> I think this should be ensuring that both the parent and the child are
> running on CPU 0. At least according to the man page they should be.
>
> <man>
> A child created via fork(2) inherits its parent's CPU affinity mask.
> </man>
>
> Perhaps a better method would be to synchronize the 2 threads more to
> ensure that we are really running at the 'same time' and forcing the
> context switch.
>
> > This just gets itself into a state where it *CAN* context switch and
> > prays that one will happen.
>
> Not sure what you mean by 'This'? Do you mean that running on the same
> CPU will sometimes not force a context switch? Or do you mean that the
> sleeps could be badly timed and the 2 threads could run 1 after the
> other on the same CPU? The latter is AFAICT the most likely case.

One way to guarantee that both threads run is to just pass a message
between them over a pipe and wait for the submitter to receive an ack
from the other end.
On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
[..]
> 5) The DAX case which you made "work" with dev_access_enable() and
>    dev_access_disable(), i.e. with yet another lazy approach of
>    avoiding to change a handful of usage sites.
>
>    The use cases are strictly context local which means the global
>    magic is not used at all. Why does it exist in the first place?
>
>    Aside of that this global thing would never work at all because
>    the refcounting is per thread and not global.
>
>    So that DAX use case is just a matter of:
>
>      grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
>
>    which is effective for the current execution context and really
>    wants to be a distinct READ/WRITE protection and not the magic
>    global thing which just has on/off. All usage sites know whether
>    they want to read or write.

I was tracking and nodding until this point. Yes, kill the global /
kmap() support, but if grant/revoke_access is not integrated behind
kmap_{local,atomic}() then it's not a "handful" of sites that need to
be instrumented, it's 100s. Are you suggesting that "relaxed" mode
enforcement is a way to distribute the work of teaching driver writers
that they need to incorporate explicit grant/revoke-read/write in
addition to kmap? The entire reason PTE_DEVMAP exists was to allow
get_user_pages() for PMEM and not require every downstream-GUP code
path to specifically consider whether it was talking to PMEM or RAM
pages, and certainly not whether they were reading or writing to it.
On Fri, Dec 18, 2020 at 02:57:51PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 17 2020 at 23:43, Thomas Gleixner wrote:
> > The only use case for this in your tree is: kmap() and the possible
> > usage of that mapping outside of the thread context which sets it
> > up.
> >
> > The only hint for doing this at all is:
> >
> >   Some users, such as kmap(), sometimes requires PKS to be global.
> >
> > 'sometimes requires' is really _not_ a technical explanation.
> >
> > Where is the explanation why kmap() usage 'sometimes' requires this
> > global trainwreck in the first place and where is the analysis why
> > this can't be solved differently?
> >
> > Detailed use case analysis please.
>
> A lengthy conversation with Dan and Dave over IRC confirmed what I was
> suspecting.
>
> The approach of this whole PKS thing is to make _all_ existing code
> magically "work". That means aside of the obvious thread local
> mappings, the kmap() part is needed to solve the problem of async
> handling where the mapping is handed to some other context which then
> uses it and notifies the context which created the mapping when done.
> That's the principle which was used to make highmem work a long time
> ago.
>
> IMO that was a mistake back then. The right thing would have been to
> change the code so that it does not rely on a temporary mapping
> created by the initiator. Instead let the initiator hand the page over
> to the other context which then creates a temporary mapping for
> fiddling with it. Water under the bridge...

But maybe not. We are getting rid of a lot of the kmaps, and once the
bulk are gone perhaps we can change this and remove kmap completely?

> Glueing PKS on to that kmap() thing is horrible and global PKS is
> pretty much the opposite of what PKS wants to achieve. It's disabling
> protection systemwide for an unspecified amount of time and for all
> contexts.

I agree. This is why I have been working on converting kmap() call
sites to kmap_local_page().[1]

> So instead of trying to make global PKS "work" we really should go and
> take a smarter approach.
>
> 1) Many kmap() use cases are strictly thread local and the mapped
>    address is never handed to some other context, which means this can
>    be replaced with kmap_local() now, which preserves the mapping
>    across preemption. PKS just works nicely on top of that.

Yes, hence the massive kmap->kmap_thread patch set which is now
becoming kmap_local_page().[2]

> 2) Modify kmap() so that it marks the to-be-mapped page as 'globally
>    unprotected' instead of doing this global unprotect PKS dance.
>    kunmap() undoes that. That obviously needs some thought
>    vs. refcounting if there are concurrent users, but that's a
>    solvable problem either as part of struct page itself or stored in
>    some global hash.

How would this globally unprotected flag work? I suppose if kmap
created a new PTE we could make that PTE non-PKS protected, then we
don't have to fiddle with the register... I think I like that idea.

> 3) Have PKS modes:
>
>    - STRICT:  No pardon
>
>    - RELAXED: Warn and unprotect temporarily for the current context
>
>    - SILENT:  Like RELAXED, but w/o warning to make sysadmins happy.
>               Default should be RELAXED.
>
>    - OFF:     Disable the whole PKS thing

I'm not really sure how this solves the global problem, but it is
probably worth having in general.

> 4) Have a smart #PF mechanism which does:
>
>      if (error_code & X86_PF_PK) {
>              page = virt_to_page(address);
>
>              if (!page || !page_is_globaly_unprotected(page))
>                      goto die;
>
>              if (pks_mode == PKS_MODE_STRICT)
>                      goto die;
>
>              WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ...");
>
>              temporary_unprotect(page, regs);
>              return;
>      }

I feel like this is very similar to what I had in the global patch you
found in my git tree, with the exception of the RELAXED mode. I simply
had globally unprotected or die. global_pkey_is_enabled() handles the
page_is_globaly_unprotected() and temporary_unprotect().[3]

Anyway, I'm sorry (but not sorry) that you found it. I've been trying
to get 0-day and other testing on it and my public tree was the
easiest way to do that. Anyway... The patch as a whole needs work. You
are 100% correct that if a mapping is handed to another context it is
going to suck performance wise. It has had some internal review but
not much.

Regardless, I think unprotecting a global context is the easy part.
The code you had a problem with (and I see is fully broken) was the
restriction of access. A failure to update in that direction would
only result in a wider window of access. I contemplated not doing a
global update at all and just leaving the access open until the next
context switch. But the code as it stands tries to force an update for
a couple of reasons:

1) kmap_local_page() removes most of the need for global pks. So I was
   thinking that global PKS could be a slow path.

2) kmap()s that are handed to other contexts are likely to be 'long
   term' and should not need to be updated 'too' often. I will admit
   that I don't know how often 'too often' is.

But IMO these questions are best left until after the kmaps are
converted. Thus this patch set was just basic support. Other use cases
beyond pmem, such as trusted keys or secret mem, don't need a global
pks feature and could build on the patch set submitted. I was trying
to break the problem down.

>      temporary_unprotect(page, regs)
>      {
>              key = page_to_key(page);
>
>              /* Return from #PF will establish this for the faulting context */
>              extended_state(regs)->pks &= ~PKS_MASK(key);
>      }
>
>    This temporary unprotect is undone when the context is left, so
>    depending on the context (thread, interrupt, softirq) the
>    unprotected section might be way wider than actually needed, but
>    that's still orders of magnitude better than having this fully
>    unrestricted global PKS mode which is completely scopeless.

I'm not sure I follow you. How would we know when the context is left?

>    The above is at least restricted to the pages which are in use for
>    a particular operation. Stray pointers during that time are
>    obviously not caught, but that's not any different from that
>    proposed global thingy.
>
>    The warning allows to find the non-obvious places so they can be
>    analyzed and worked on.

I could add the warning for sure.

> 5) The DAX case which you made "work" with dev_access_enable() and
>    dev_access_disable(), i.e. with yet another lazy approach of
>    avoiding to change a handful of usage sites.
>
>    The use cases are strictly context local which means the global
>    magic is not used at all. Why does it exist in the first place?

I'm not following. What is 'it'?

> Aside of that this global thing would never work at all because the
> refcounting is per thread and not global.
>
> So that DAX use case is just a matter of:
>
>   grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
>
> which is effective for the current execution context and really wants
> to be a distinct READ/WRITE protection and not the magic global thing
> which just has on/off. All usage sites know whether they want to read
> or write.
>
> That leaves the question about the refcount. AFAICT, nothing nests in
> that use case for a given execution context. I'm surely missing
> something subtle here.

The refcount is needed for non-global pks as well as global. I've not
resolved if anything needs to be done with the refcount on the global
update since the following is legal:

	kmap()
	kmap_local_page()
	kunmap()
	kunmap_local()

Which would be a problem. But I don't think it is ever actually done.
Another problem would be if the kmap and kunmap happened in different
contexts... :-/ I don't think that is done either, but I don't know
for certain.

Frankly, my main focus before any of this global support has been to
get rid of as many kmaps as possible.[1] Once that is done I think
more of these questions can be answered better.

Ira

[1] https://lore.kernel.org/lkml/20201210171834.2472353-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/
[3] Latest untested patch pushed for reference here because I can't
    find exactly the branch you found.
    https://github.com/weiny2/linux-kernel/commit/37439e91e141be58c13ccc4462f7782311680636

> Hmm?
>
> Thanks,
>
>         tglx
On 12/18/20 11:42 AM, Ira Weiny wrote:
> Another problem would be if the kmap and kunmap happened in different
> contexts... :-/ I don't think that is done either but I don't know
> for certain.

It would be really nice to put together some surveillance patches to
help become more certain about these things. Even a per-task counter
would be better than nothing.

On kmap:
	current->kmaps++;
On kunmap:
	current->kmaps--;
	WARN_ON(current->kmaps < 0);
On exit:
	WARN_ON(current->kmaps);

That would at least find imbalances. You could take it even further by
having a little array, say:

struct one_kmap {
	struct page            *page;
	depot_stack_handle_t    handle;
};

Then:

struct task_struct {
	...
+	struct one_kmap kmaps[10];
};

On kmap() you make a new entry in current->kmaps[], and on kunmap()
you try to find the corresponding entry. If you can't find one, in the
current task you can even go search all the other tasks and see who
might be responsible. If something goes and does more than 10
simultaneous kmap()s in one thread, dump a warning and give up. Or,
dynamically allocate the kmaps[] array.

Then you can dump out the stack of the kmap() culprit if it exits
after a kmap() but without a corresponding kunmap().

Something like that should be low overhead enough to get it into
things like the 0day debug kernel. It should be way cheaper than
something like lockdep.
On Fri, Dec 18 2020 at 11:20, Dan Williams wrote:
> On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> [..]
>> 5) The DAX case which you made "work" with dev_access_enable() and
>>    dev_access_disable(), i.e. with yet another lazy approach of
>>    avoiding to change a handful of usage sites.
>>
>>    The use cases are strictly context local which means the global
>>    magic is not used at all. Why does it exist in the first place?
>>
>>    Aside of that this global thing would never work at all because
>>    the refcounting is per thread and not global.
>>
>>    So that DAX use case is just a matter of:
>>
>>      grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
>>
>>    which is effective for the current execution context and really
>>    wants to be a distinct READ/WRITE protection and not the magic
>>    global thing which just has on/off. All usage sites know whether
>>    they want to read or write.
>
> I was tracking and nodding until this point. Yes, kill the global /
> kmap() support, but if grant/revoke_access is not integrated behind
> kmap_{local,atomic}() then it's not a "handful" of sites that need to
> be instrumented, it's 100s. Are you suggesting that "relaxed" mode
> enforcement is a way to distribute the work of teaching driver
> writers that they need to incorporate explicit
> grant/revoke-read/write in addition to kmap? The entire reason
> PTE_DEVMAP exists was to allow get_user_pages() for PMEM and not
> require every downstream-GUP code path to specifically consider
> whether it was talking to PMEM or RAM pages, and certainly not
> whether they were reading or writing to it.

kmap_local() is fine. That can work automatically because it's
strictly local to the context which does the mapping.

kmap() is dubious because it's a 'global' mapping as dictated per
HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to
identify cases where the mapped address is really handed to a
different execution context. We want to see those cases and analyse
whether this can't be solved in a different way. That's why I
suggested to do a warning in that case.

Also vs. the DAX use case I really meant the code in fs/dax and
drivers/dax/ itself which is handling this via dax_read_[un]lock.

Does that make more sense?

Thanks,

        tglx
On Fri, Dec 18 2020 at 11:42, Ira Weiny wrote:
> On Fri, Dec 18, 2020 at 02:57:51PM +0100, Thomas Gleixner wrote:
>> 2) Modify kmap() so that it marks the to-be-mapped page as 'globally
>>    unprotected' instead of doing this global unprotect PKS dance.
>>    kunmap() undoes that. That obviously needs some thought
>>    vs. refcounting if there are concurrent users, but that's a
>>    solvable problem either as part of struct page itself or stored
>>    in some global hash.
>
> How would this globally unprotected flag work? I suppose if kmap
> created a new PTE we could make that PTE non-PKS protected, then we
> don't have to fiddle with the register... I think I like that idea.

No. Look at the highmem implementation of kmap(). It's a terrible
idea, really. Don't even think about that.

There is _no_ global flag. The point is that the kmap is strictly
bound to a particular struct page. So you can simply do:

    kmap(page)
	if (page_is_access_protected(page))
		atomic_inc(&page->unprotect);

    kunmap(page)
	if (page_is_access_protected(page))
		atomic_dec(&page->unprotect);

and in the #PF handler:

	if (!page->unprotect)
		goto die;

The reason why I said "either in struct page itself or in a global
hash" is that struct page is already packed and people are not really
happy about increasing its size. But the principle is roughly the
same.

>> 4) Have a smart #PF mechanism which does:
>>
>>      if (error_code & X86_PF_PK) {
>>              page = virt_to_page(address);
>>
>>              if (!page || !page_is_globaly_unprotected(page))
>>                      goto die;
>>
>>              if (pks_mode == PKS_MODE_STRICT)
>>                      goto die;
>>
>>              WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ...");
>>
>>              temporary_unprotect(page, regs);
>>              return;
>>      }
>
> I feel like this is very similar to what I had in the global patch
> you found in my git tree, with the exception of the RELAXED mode. I
> simply had globally unprotected or die.

Your stuff depends on that global_pks_state which is not maintainable,
especially not the teardown side. This depends on per-page state which
is clearly way simpler and more focussed.

> Regardless, I think unprotecting a global context is the easy part.
> The code you had a problem with (and I see is fully broken) was the
> restriction of access. A failure to update in that direction would
> only result in a wider window of access. I contemplated not doing a
> global update at all and just leaving the access open until the next
> context switch. But the code as it stands tries to force an update
> for a couple of reasons:
>
> 1) kmap_local_page() removes most of the need for global pks. So I
>    was thinking that global PKS could be a slow path.
>
> 2) kmap()s that are handed to other contexts are likely to be 'long
>    term' and should not need to be updated 'too' often. I will admit
>    that I don't know how often 'too often' is.

Even once in a while is not a justification for stopping the world for
N milliseconds.

>>      temporary_unprotect(page, regs)
>>      {
>>              key = page_to_key(page);
>>
>>              /* Return from #PF will establish this for the faulting context */
>>              extended_state(regs)->pks &= ~PKS_MASK(key);
>>      }
>>
>>    This temporary unprotect is undone when the context is left, so
>>    depending on the context (thread, interrupt, softirq) the
>>    unprotected section might be way wider than actually needed, but
>>    that's still orders of magnitude better than having this fully
>>    unrestricted global PKS mode which is completely scopeless.
>
> I'm not sure I follow you. How would we know when the context is
> left?

The context goes away on its own. Either context switch or return from
interrupt. As I said, there is an extended window where the external
context still might have unprotected access even if the initiating
context has called kunmap() already. It's not pretty, but it's not the
end of the world either.

That's why I suggested to have that WARN_ONCE() so we can actually see
why and where that happens and think about solutions to make this go
into local context, e.g. by changing the vaddr pointer to a struct
page pointer for these particular use cases, and then the other
context can do kmap/kunmap_local().

>> 5) The DAX case which you made "work" with dev_access_enable() and
>>    dev_access_disable(), i.e. with yet another lazy approach of
>>    avoiding to change a handful of usage sites.
>>
>>    The use cases are strictly context local which means the global
>>    magic is not used at all. Why does it exist in the first place?
>
> I'm not following. What is 'it'?

That global argument to dev_access_enable()/disable().

>> That leaves the question about the refcount. AFAICT, nothing nests
>> in that use case for a given execution context. I'm surely missing
>> something subtle here.
>
> The refcount is needed for non-global pks as well as global. I've not
> resolved if anything needs to be done with the refcount on the global
> update since the following is legal:
>
>	kmap()
>	kmap_local_page()
>	kunmap()
>	kunmap_local()
>
> Which would be a problem. But I don't think it is ever actually done.

If it does not exist, why would we support it in the first place? We
can have some warning there to catch that case.

> Another problem would be if the kmap and kunmap happened in different
> contexts... :-/ I don't think that is done either, but I don't know
> for certain.
>
> Frankly, my main focus before any of this global support has been to
> get rid of as many kmaps as possible.[1] Once that is done I think
> more of these questions can be answered better.

I was expecting that you could answer these questions :)

Thanks,

        tglx
On Fri, Dec 18, 2020 at 1:06 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Dec 18 2020 at 11:20, Dan Williams wrote:
> > On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > [..]
> >> 5) The DAX case which you made "work" with dev_access_enable() and
> >>    dev_access_disable(), i.e. with yet another lazy approach of
> >>    avoiding to change a handful of usage sites.
> >>
> >>    The use cases are strictly context local which means the global
> >>    magic is not used at all. Why does it exist in the first place?
> >>
> >>    Aside of that this global thing would never work at all because
> >>    the refcounting is per thread and not global.
> >>
> >>    So that DAX use case is just a matter of:
> >>
> >>      grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
> >>
> >>    which is effective for the current execution context and really
> >>    wants to be a distinct READ/WRITE protection and not the magic
> >>    global thing which just has on/off. All usage sites know whether
> >>    they want to read or write.
> >
> > I was tracking and nodding until this point. Yes, kill the global /
> > kmap() support, but if grant/revoke_access is not integrated behind
> > kmap_{local,atomic}() then it's not a "handful" of sites that need
> > to be instrumented, it's 100s. Are you suggesting that "relaxed"
> > mode enforcement is a way to distribute the work of teaching driver
> > writers that they need to incorporate explicit
> > grant/revoke-read/write in addition to kmap? The entire reason
> > PTE_DEVMAP exists was to allow get_user_pages() for PMEM and not
> > require every downstream-GUP code path to specifically consider
> > whether it was talking to PMEM or RAM pages, and certainly not
> > whether they were reading or writing to it.
>
> kmap_local() is fine. That can work automatically because it's
> strictly local to the context which does the mapping.
>
> kmap() is dubious because it's a 'global' mapping as dictated per
> HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to
> identify cases where the mapped address is really handed to a
> different execution context. We want to see those cases and analyse
> whether this can't be solved in a different way. That's why I
> suggested to do a warning in that case.
>
> Also vs. the DAX use case I really meant the code in fs/dax and
> drivers/dax/ itself which is handling this via dax_read_[un]lock.
>
> Does that make more sense?

Yup, got it. The dax code can be precise wrt PKS in a way that
kmap_local() cannot.
On Fri, Dec 18 2020 at 13:58, Dan Williams wrote:
> On Fri, Dec 18, 2020 at 1:06 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> kmap_local() is fine. That can work automatically because it's
>> strictly local to the context which does the mapping.
>>
>> kmap() is dubious because it's a 'global' mapping as dictated per
>> HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to
>> identify cases where the mapped address is really handed to a
>> different execution context. We want to see those cases and analyse
>> whether this can't be solved in a different way. That's why I
>> suggested to do a warning in that case.
>>
>> Also vs. the DAX use case I really meant the code in fs/dax and
>> drivers/dax/ itself which is handling this via dax_read_[un]lock.
>>
>> Does that make more sense?
>
> Yup, got it. The dax code can be precise wrt PKS in a way that
> kmap_local() cannot.

Which makes me wonder whether we should have kmap_local_for_read() or
something like that, which could obviously only be RO enforced for the
real HIGHMEM case or the (for now x86 only) enforced kmap_local()
debug mechanics on 64bit.

So for the !highmem case it would not magically make the existing
kernel mapping RO, but this could be forwarded to the PKS protection.
Aside of that it's a nice annotation in the code. That could be used
right away for all the kmap[_atomic] -> kmap_local conversions.

Thanks,

        tglx

---
 include/linux/highmem-internal.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -32,6 +32,10 @@ static inline void kmap_flush_tlb(unsign
 #define kmap_prot PAGE_KERNEL
 #endif
 
+#ifndef kmap_prot_ro
+#define kmap_prot_ro PAGE_KERNEL_RO
+#endif
+
 void *kmap_high(struct page *page);
 void kunmap_high(struct page *page);
 void __kmap_flush_unused(void);
@@ -73,6 +77,11 @@ static inline void *kmap_local_page(stru
 	return __kmap_local_page_prot(page, kmap_prot);
 }
 
+static inline void *kmap_local_page_for_read(struct page *page)
+{
+	return __kmap_local_page_prot(page, kmap_prot_ro);
+}
+
 static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
 {
 	return __kmap_local_page_prot(page, prot);
@@ -169,6 +178,11 @@ static inline void *kmap_local_page_prot
 {
 	return kmap_local_page(page);
 }
+
+static inline void *kmap_local_page_for_read(struct page *page)
+{
+	return kmap_local_page(page);
+}
 
 static inline void *kmap_local_pfn(unsigned long pfn)
 {
From: Ira Weiny <ira.weiny@intel.com>

Changes from V2 [4]
	Rebased on tip-tree/core/entry
	From Thomas Gleixner
		Address bisectability
		Drop Patch:
			x86/entry: Move nmi entry/exit into common code
	From Greg KH
		Remove WARN_ON's
	From Dan Williams
		Add __must_check to pks_key_alloc()
	New patch: x86/pks: Add PKS defines and config options
		Split from Enable patch to build on through the series
	Fix compile errors

Changes from V1
	Rebase to TIP master; resolve conflicts and test
	Clean up some kernel docs updates missed in V1
	Add irqentry_state_t kernel doc for PKRS field
	Removed redundant irq_state->pkrs
		This is only needed when we add the global state and
		somehow ended up in this patch series. That will come
		back when we add the global functionality in.
	From Thomas Gleixner
		Update commit messages
		Add kernel doc for struct irqentry_state_t
	From Dave Hansen
		add flags to pks_key_alloc()

Changes from RFC V3 [3]
	Rebase to TIP master
	Update test error output
	Standardize on 'irq_state' for state variables
	From Dave Hansen
		Update commit messages
		Add/clean up comments
		Add X86_FEATURE_PKS to disabled-features.h and remove
		some explicit CONFIG checks
		Move saved_pkrs member of thread_struct
		Remove superfluous preempt_disable()
		s/irq_save_pks/irq_save_set_pks/
		Ensure PKRS is not seen in faults if not configured or
		not supported
		s/pks_mknoaccess/pks_mk_noaccess/
		s/pks_mkread/pks_mk_readonly/
		s/pks_mkrdwr/pks_mk_readwrite/
		Change pks_key_alloc return to -EOPNOTSUPP when not
		supported
	From Peter Zijlstra
		Clean up Attribution
		Remove superfluous preempt_disable()
		Add union to differentiate exit_rcu/lockdep use in
		irqentry_state_t
	From Thomas Gleixner
		Add preliminary clean up patch and adjust series as
		needed

Introduce a new page protection mechanism for supervisor pages,
Protection Key Supervisor (PKS).

2 use cases for PKS are being developed, trusted keys and PMEM.
Trusted keys is a newer use case which is still being explored. PMEM
was submitted as part of the RFC (v2) series[1]. However, since then
it was found that some callers of kmap() require a global
implementation of PKS. Specifically, some users of kmap() expect
mappings to be available to all kernel threads. While global use of
PKS is rare, it needs to be included for correctness. Unfortunately
the kmap() updates required a large patch series to make the needed
changes at the various kmap() call sites, so that patch set has been
split out. Because the global PKS feature is only required for that
use case, it will be deferred to that set as well.[2] This patch set
is being submitted as a precursor to both of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this
series and 2 proposed use cases can be found here:

	https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/
	https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/

PKS enables protections on 'domains' of supervisor pages to limit
supervisor mode access to those pages beyond the normal paging
protections. PKS works in a similar fashion to user space pkeys, PKU.
As with PKU, supervisor pkeys are checked in addition to normal paging
protections, and Access or Writes can be disabled via an MSR update
without TLB flushes when permissions change. Also like PKU, a page
mapping is assigned to a domain by setting pkey bits in the page table
entry for that mapping.

Access is controlled through a PKRS register which is updated via
WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR. Therefore the implementation
saves/restores the MSR across context switches and during exceptions.
Nested exceptions are supported by each exception getting a new PKS
state.

For consistent behavior with current paging protections, pkey 0 is
reserved and configured to allow full access via the pkey mechanism,
thus preserving the default paging protections on mappings with the
default pkey value of 0.

Other keys (1-15) are allocated by an allocator which prepares us for
key contention from day one. Kernel users should be prepared for the
allocator to fail either because of key exhaustion or due to PKS not
being supported on the arch and/or CPU instance.

The following are key attributes of PKS:

	1) Fast switching of permissions
		1a) Prevents access without page table manipulations
		1b) No TLB flushes required
	2) Works on a per thread basis

PKS is available with 4 and 5 level paging. Like PKRU it consumes 4
bits from the PTE to store the pkey within the entry.

[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.weiny@intel.com/
[3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.weiny@intel.com/
[4] https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/

Fenghua Yu (2):
  x86/pks: Add PKS kernel API
  x86/pks: Enable Protection Keys Supervisor (PKS)

Ira Weiny (8):
  x86/pkeys: Create pkeys_common.h
  x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  x86/pks: Add PKS defines and Kconfig options
  x86/pks: Preserve the PKRS MSR on context switch
  x86/entry: Pass irqentry_state_t by reference
  x86/entry: Preserve PKRS MSR across exceptions
  x86/fault: Report the PKRS state on fault
  x86/pks: Add PKS test code

 Documentation/core-api/protection-keys.rst  | 103 ++-
 arch/x86/Kconfig                            |   1 +
 arch/x86/entry/common.c                     |  46 +-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/disabled-features.h    |   8 +-
 arch/x86/include/asm/idtentry.h             |  25 +-
 arch/x86/include/asm/msr-index.h            |   1 +
 arch/x86/include/asm/pgtable.h              |  13 +-
 arch/x86/include/asm/pgtable_types.h        |  12 +
 arch/x86/include/asm/pkeys.h                |  15 +
 arch/x86/include/asm/pkeys_common.h         |  40 ++
 arch/x86/include/asm/processor.h            |  18 +-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/cpu/common.c                |  15 +
 arch/x86/kernel/cpu/mce/core.c              |   4 +-
 arch/x86/kernel/fpu/xstate.c                |  22 +-
 arch/x86/kernel/kvm.c                       |   6 +-
 arch/x86/kernel/nmi.c                       |   4 +-
 arch/x86/kernel/process.c                   |  26 +
 arch/x86/kernel/traps.c                     |  21 +-
 arch/x86/mm/fault.c                         |  87 ++-
 arch/x86/mm/pkeys.c                         | 196 +++++-
 include/linux/entry-common.h                |  31 +-
 include/linux/pgtable.h                     |   4 +
 include/linux/pkeys.h                       |  24 +
 kernel/entry/common.c                       |  44 +-
 lib/Kconfig.debug                           |  12 +
 lib/Makefile                                |   3 +
 lib/pks/Makefile                            |   3 +
 lib/pks/pks_test.c                          | 692 ++++++++++++++++++++
 mm/Kconfig                                  |   2 +
 tools/testing/selftests/x86/Makefile        |   3 +-
 tools/testing/selftests/x86/test_pks.c      |  66 ++
 33 files changed, 1410 insertions(+), 140 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c