
Simplify load_unaligned_zeropad() (was Re: [GIT PULL] Ceph updates for 5.20-rc1)

Message ID CAHk-=wh1xHi-WeytuAK1-iSsR0wi=6e4-WgFq6ZPt8Z1mvqoNA@mail.gmail.com
State New

Commit Message

Linus Torvalds Aug. 14, 2022, 9:14 p.m. UTC
On Sun, Aug 14, 2022 at 1:29 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I dug into it some more, and it is really "load_unaligned_zeropad()"
> that makes clang really uncomfortable.
>
> The problem ends up being that clang sees that it's inside that inner
> loop, and tries very hard to optimize the shift-and-mask that happens
> if the exception happens.
>
> The fact that the exception *never* happens unless DEBUG_PAGEALLOC is
> enabled - and very very seldom even then - is not something we can
> really explain to clang.

Hmm.

The solution to that is actually to move the 'zeropad' part into the
exception handler.

That makes both gcc and clang generate quite nice code for this all.
But perhaps equally importantly, it actually simplifies the code
noticeably:

 arch/x86/include/asm/extable_fixup_types.h |  2 ++
 arch/x86/include/asm/word-at-a-time.h      | 50 +++---------------------------
 arch/x86/mm/extable.c                      | 30 ++++++++++++++++++
 3 files changed, 37 insertions(+), 45 deletions(-)

and that's with 11 of those 37 new lines being a new block comment.

It's mainly because now there's no need to worry about
CONFIG_CC_HAS_ASM_GOTO_OUTPUT in load_unaligned_zeropad(), because the
asm becomes a simple "address in, data out" thing.

And to make the fixup code simple and straightforward, I just made
"load_unaligned_zeropad()" use fixed registers for address and output,
which doesn't bother the compilers at all, they'll happily adjust
their register allocation. The code generation ends up much better
than with the goto and the subtle address games that never trigger
anyway.

PeterZ - you've touched both the load_unaligned_zeropad() and the
exception code last, so let's run this past you, but this really does
seem to not only fix the code generation issue in fs/dcache.s, it just
looks simpler too. Comments?

             Linus

Comments

Kirill A. Shutemov Aug. 14, 2022, 10:54 p.m. UTC | #1
On Sun, Aug 14, 2022 at 02:14:08PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2022 at 1:29 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I dug into it some more, and it is really "load_unaligned_zeropad()"
> > that makes clang really uncomfortable.
> >
> > The problem ends up being that clang sees that it's inside that inner
> > loop, and tries very hard to optimize the shift-and-mask that happens
> > if the exception happens.
> >
> > The fact that the exception *never* happens unless DEBUG_PAGEALLOC is
> > enabled - and very very seldom even then - is not something we can
> > really explain to clang.
> 
> Hmm.
> 
> The solution to that is actually to move the 'zeropad' part into the
> exception handler.
> 
> That makes both gcc and clang generate quite nice code for this all.
> But perhaps equally importantly, it actually simplifies the code
> noticeably:
> 
>  arch/x86/include/asm/extable_fixup_types.h |  2 ++
>  arch/x86/include/asm/word-at-a-time.h      | 50 +++---------------------------
>  arch/x86/mm/extable.c                      | 30 ++++++++++++++++++
>  3 files changed, 37 insertions(+), 45 deletions(-)
> 
> and that's with 11 of those 37 new lines being a new block comment.
> 
> It's mainly because now there's no need to worry about
> CONFIG_CC_HAS_ASM_GOTO_OUTPUT in load_unaligned_zeropad(), because the
> asm becomes a simple "address in, data out" thing.
> 
> And to make the fixup code simple and straightforward, I just made
> "load_unaligned_zeropad()" use fixed registers for address and output,
> which doesn't bother the compilers at all, they'll happily adjust
> their register allocation. The code generation ends up much better
> than with the goto and the subtle address games that never trigger
> anyway.
> 
> PeterZ - you've touched both the load_unaligned_zeropad() and the
> exception code last, so let's run this past you, but this really does
> seem to not only fix the code generation issue in fs/dcache.s, it just
> looks simpler too. Comments?

A bit off-topic, but..

Recently, I dealt with a few[1][2] issues in the TDX guest that happen due
to unwanted loads generated by load_unaligned_zeropad(). They are tricky to
follow and I believe most developers are not aware of such sneaky
accesses (I was not, until I stepped on one).

Do we gain enough benefit from the micro-optimization to justify keeping
load_unaligned_zeropad()? The helper has rather confusing side effects.

[1] 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page")
[2] https://lore.kernel.org/lkml/20220614120231.48165-11-kirill.shutemov@linux.intel.com/
Linus Torvalds Aug. 14, 2022, 10:59 p.m. UTC | #2
On Sun, Aug 14, 2022 at 3:51 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Do we gain enough benefit from the micro-optimization to justify keeping
> load_unaligned_zeropad()? The helper has rather confusing side effects.

It's a *big* deal in pathname handling, yes. Doing things a byte at a
time is very noticeably slower.

If TDX has problems with it, then TDX needs to be fixed. And it's
simple enough - just make sure you have a guard page between any
kernel RAM mapping and whatever odd crazy page.

There is nothing subtle about this at all. You probably would want
that guard page *anyway*.

That "unaccepted memory" type just needs to be clearly fixed.

                Linus
Linus Torvalds Aug. 15, 2022, 3:43 a.m. UTC | #3
On Sun, Aug 14, 2022 at 3:59 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If TDX has problems with it, then TDX needs to be fixed. And it's
> simple enough - just make sure you have a guard page between any
> kernel RAM mapping and whatever odd crazy page.

.. thinking about this more, I thought we had already done that in the
memory initialization code - ie make sure that we always leave a gap
between any page we mark and any IO memory after it.

But it's possible that I'm confused with the IO window allocation
code, which does the reverse (ie it actively tries to avoid starting
allocations close to the end-of-RAM because there is often
undocumented stolen memory there).

I'd much rather lose one page from the page allocator at the end of a
RAM region than lose the ability to do word-at-a-time string operations.

Of course, it's also entirely possible that even if my memory about us
already trying to do that is right (which it might not be), we might
also have lost that whole thing over time, since we've had a lot of
updates to the bootmem/memblock setup.

Bringing in Mike Rapoport in case he can point to the code (or lack thereof).

Mike?

               Linus
Kirill A. Shutemov Aug. 15, 2022, 4:12 a.m. UTC | #4
On Sun, Aug 14, 2022 at 08:43:09PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2022 at 3:59 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > If TDX has problems with it, then TDX needs to be fixed. And it's
> > simple enough - just make sure you have a guard page between any
> > kernel RAM mapping and whatever odd crazy page.
> 
> .. thinking about this more, I thought we had already done that in the
> memory initialization code - ie make sure that we always leave a gap
> between any page we mark and any IO memory after it.

ioremap()ed memory should not be a problem as it is not RAM from the
kernel's PoV and it is separated from memory allocated by the buddy
allocator.

But a DMA buffer can be allocated from the general pool of memory. In TDX
we share it for I/O too. It does not cause problems as long as the direct
mapping is adjusted to map it as shared. The #VE handler is already aware
of load_unaligned_zeropad().

So far, so good.

But if somebody were to try to be clever -- allocate memory and vmap() it
as shared (with proper VMM notification), but leave the direct mapping
intact -- we have a problem. load_unaligned_zeropad() can step onto the
private mapping of the shared memory in the direct mapping and crash the
whole TD guest.

The worst part is that for somebody who is not aware of
load_unaligned_zeropad(), the vmap() trick is a totally reasonable
approach: it helps to avoid direct mapping fragmentation. We considered
the trick for one of the TDX-specific drivers.
Peter Zijlstra Aug. 15, 2022, 7:17 a.m. UTC | #5
On Sun, Aug 14, 2022 at 02:14:08PM -0700, Linus Torvalds wrote:

> PeterZ - you've touched both the load_unaligned_zeropad() and the
> exception code last, so let's run this past you, but this really does
> seem to not only fix the code generation issue in fs/dcache.s, it just
> looks simpler too. Comments?

Ha, freshly back from vacation and I barely know what a computer is :-)

>  arch/x86/include/asm/extable_fixup_types.h |  2 ++
>  arch/x86/include/asm/word-at-a-time.h      | 50 +++---------------------------
>  arch/x86/mm/extable.c                      | 30 ++++++++++++++++++
>  3 files changed, 37 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
> index 503622627400..b53f1919710b 100644
> --- a/arch/x86/include/asm/extable_fixup_types.h
> +++ b/arch/x86/include/asm/extable_fixup_types.h
> @@ -64,4 +64,6 @@
>  #define	EX_TYPE_UCOPY_LEN4		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
>  #define	EX_TYPE_UCOPY_LEN8		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))

(weird tab stuff there, but that's for another day I suppose)

> +#define EX_TYPE_ZEROPAD			20 /* load ax from dx zero-padded */
> +
>  #endif
> diff --git a/arch/x86/include/asm/word-at-a-time.h b/arch/x86/include/asm/word-at-a-time.h
> index 8338b0432b50..4893f1b30dd6 100644
> --- a/arch/x86/include/asm/word-at-a-time.h
> +++ b/arch/x86/include/asm/word-at-a-time.h
> @@ -77,58 +77,18 @@ static inline unsigned long find_zero(unsigned long mask)
>   * and the next page not being mapped, take the exception and
>   * return zeroes in the non-existing part.
>   */
>  static inline unsigned long load_unaligned_zeropad(const void *addr)
>  {
>  	unsigned long ret;
>  
> +	asm volatile(
> +		"1:	mov (%[addr]), %[ret]\n"
>  		"2:\n"
> +		_ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_ZEROPAD)
> +		: [ret] "=a" (ret)
> +		: [addr] "d" (addr));
>  
>  	return ret;
>  }
>  
>  #endif /* _ASM_WORD_AT_A_TIME_H */
> diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
> index 331310c29349..58c79077a496 100644
> --- a/arch/x86/mm/extable.c
> +++ b/arch/x86/mm/extable.c
> @@ -41,6 +41,34 @@ static bool ex_handler_default(const struct exception_table_entry *e,
>  	return true;
>  }
>  
> +/*
> + * This is the *very* rare case where we do a "load_unaligned_zeropad()"
> + * and it's a page crosser into a non-existent page.
> + *
> + * This happens when we optimistically load a pathname a word-at-a-time
> + * and the name is less than the full word and the  next page is not
> + * mapped. Typically that only happens for CONFIG_DEBUG_PAGEALLOC.
> + *
> + * NOTE! The load is always of the form "mov (%edx),%eax" to make the
> + * fixup simple.

So obviously we could use _ASM_EXTABLE_TYPE_REG together with something
like: "mov (%[reg]), %[reg]" to not depend on these fixed registers, but
yeah, that doesn't seem needed. Code-gen is fine as is.

> + */
> +static bool ex_handler_zeropad(const struct exception_table_entry *e,
> +			       struct pt_regs *regs,
> +			       unsigned long fault_addr)
> +{
> +	const unsigned long mask = sizeof(long) - 1;
> +	unsigned long offset, addr;
> +
> +	offset = regs->dx & mask;
> +	addr = regs->dx & ~mask;
> +	if (fault_addr != addr + sizeof(long))
> +		return false;
> +
> +	regs->ax = *(unsigned long *)addr >> (offset * 8);


> +	regs->ip = ex_fixup_addr(e);
> +	return true;

I think the convention here is to do:

	return ex_handler_default(e, regs);

instead; that ensures there's a bit of common post code.

> +}
> +
>  static bool ex_handler_fault(const struct exception_table_entry *fixup,
>  			     struct pt_regs *regs, int trapnr)
>  {

But yeah, looks good to me.
Mike Rapoport Aug. 15, 2022, 8:26 a.m. UTC | #6
On Sun, Aug 14, 2022 at 08:43:09PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2022 at 3:59 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > If TDX has problems with it, then TDX needs to be fixed. And it's
> > simple enough - just make sure you have a guard page between any
> > kernel RAM mapping and whatever odd crazy page.
> 
> .. thinking about this more, I thought we had already done that in the
> memory initialization code - ie make sure that we always leave a gap
> between any page we mark and any IO memory after it.
> 
> But it's possible that I'm confused with the IO window allocation
> code, which does the reverse (ie actively try to avoid starting
> allocations close to the end-of-RAM because there is often
> undocumented stolen memory there)
> 
> I'd much rather lose one page from the page allocator at the end of a
> RAM region than lose the ability to do string word operations.
> 
> Of course, it's also entirely possible that even if my memory about us
> already trying to do that is right (which it might not be), we might
> also have lost that whole thing over time, since we've had a lot of
> updates to the bootmem/memblock setup.

We don't create gaps in the memory initialization code; everything that's
E820_TYPE_RAM goes to the page allocator in the end, with the exception of
partial pages. And this didn't change in the last few years.
 
>                Linus
Linus Torvalds Aug. 15, 2022, 3:58 p.m. UTC | #7
On Mon, Aug 15, 2022 at 12:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So obviously we could use _ASM_EXTABLE_TYPE_REG together with something
> like: "mov (%[reg]), %[reg]" to not depend on these fixed registers, but
> yeah, that doesn't seem needed. Code-gen is fine as is.

I looked at using TYPE_REG, but it got *much* more complicated for no
obvious gain.

The biggest downside of the fixed registers is actually the address,
which in the fs/dcache.c case doesn't matter (the address is just in a
register anyway, and the compiler can pick the register it wants to
and seems to happily pick %rdx).

But in fs/namei.c, the code *used* to be able to use register indexed
addressing, and now needs to use a 'lea' to generate the address into
a single register.

Using _ASM_EXTABLE_TYPE_REG wouldn't help that case - we'd have to
actually disassemble the instruction that faulted, and figure out the
address that way. Because while the fault itself gives us an address,
it gives us the address of the fault, which will be the first byte of
the next page, not the beginning address for the access that we want.

And no, disassembling the instruction wouldn't kill us either (we know
it's a "mov" instruction, so it's just the modrm bytes), but again it
really didn't seem worth the pain. The generated code with the fixed
registers wasn't optimal, but it was close enough that it really
doesn't seem to matter.

> > +     regs->ip = ex_fixup_addr(e);
> > +     return true;
>
> I think the convention here is to do:
>
>         return ex_handler_default(e, regs);

Ahh, yes, except for the FP case where we restore it first (because
the code seems to want to print out the restored address for some
reason).

I had just started with the ex_handler_default() implementation as the
boilerplate, which is why I did that thing by hand.

Will fix.

The other question I had was actually the "return false" above - I
decided that if the address of the fault does *not* match the expected
"round %rdx up" address, we shouldn't do any fixup at all, and treat
it as a regular kernel oops - as if the exception table hadn't been
found at all.

That seems to be a good policy, but no other exception handler does
anything like that, so maybe somebody has comments about that thing?
All the other exception handler fixup functions always return true
unconditionally, but the fixup_exception() code is clearly written to
be able to return 0 for "no fixup".

I may have written that code originally, but it's _soo_ long ago that ....

              Linus
Peter Zijlstra Aug. 15, 2022, 5:53 p.m. UTC | #8
On Mon, Aug 15, 2022 at 08:58:21AM -0700, Linus Torvalds wrote:

> The other question I had was actually the "return false" above - I
> decided that if the address of the fault does *not* match the expected
> "round %rdx up" address, we shouldn't do any fixup at all, and treat
> it as a regular kernel oops - as if the exception table hadn't been
> found at all.

Ah yes, I did see that and I agree, that seems like the right thing to do.
Peter Zijlstra Aug. 15, 2022, 8:09 p.m. UTC | #9
On Mon, Aug 15, 2022 at 08:58:21AM -0700, Linus Torvalds wrote:
> And no, disassembling the instruction wouldn't kill us either (we know
> it's a "mov" instruction, so it's just the modrm bytes), but again it
> really didn't seem worth the pain. The generated code with the fixed
> registers wasn't optimal, but it was close enough that it really
> doesn't seem to matter.

I'm not at all suggesting we do this; but it might be
insn_get_addr_ref() does what is needed.
Linus Torvalds Aug. 15, 2022, 10:49 p.m. UTC | #10
On Mon, Aug 15, 2022 at 1:09 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I'm not at all suggesting we do this; but it might be
> insn_get_addr_ref() does what is needed.

.. you didn't suggest it at all, but I started doing it anyway.

It's a bit more complicated, and the fixup certainly isn't that
trivial thing any more, but you were right, it's not *that*
complicated, and it does allow us to use arbitrary 'mov' instructions.

And while it now has more added lines than deletions, and the diffstat now says

 3 files changed, 60 insertions(+), 43 deletions(-)

most of the added lines are still that block comment, and some *very*
anal but trivial sanity checks of the instruction decode.

So I could have made it smaller than it used to be by just not doing
any of those verifications, and maybe I went a bit overboard, but I
think this is such a rare case that it's better to be ridiculously
careful than to try to minimize the number of lines.

So it may be a few more lines, but I can argue that it is still at
least conceptually simpler than the conditional asm goto with outputs
code was.

And yeah, it makes some of the code generation places marginally better.

So since I was tricked into writing this patch, and it's even tested
(the second attachment has a truly stupid patch with my test-case), I
think it's worth doing.

Comments? I left your "Acked-by" from the previous version of this
thing, so holler now if you think this got too ugly in the meantime..

               Linus
Peter Zijlstra Aug. 16, 2022, 8:02 a.m. UTC | #11
On Mon, Aug 15, 2022 at 03:49:44PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 1:09 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > I'm not at all suggesting we do this; but it might be
> > insn_get_addr_ref() does what is needed.
> 
> .. you didn't suggest it at all, but I started doing it anyway.

> So since I was tricked into writing this patch, and it's even tested
> (the second attachment has a truly stupid patch with my test-case), I
> think it's worth doing.

Haha, couldn't help yourself eh ;-)

> Comments? I left your "Acked-by" from the previous version of this
> thing, so holler now if you think this got too ugly in the meantime..

That's quite all right; a few nits below, but overall I like it. And yes,
it's a bit over the top, but it's important to have fun..

> diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
> index 503622627400..b53f1919710b 100644
> --- a/arch/x86/include/asm/extable_fixup_types.h
> +++ b/arch/x86/include/asm/extable_fixup_types.h
> @@ -64,4 +64,6 @@
>  #define	EX_TYPE_UCOPY_LEN4		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
>  #define	EX_TYPE_UCOPY_LEN8		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))
>  
> +#define EX_TYPE_ZEROPAD			20 /* load ax from dx zero-padded */

This comment is now woefully incorrect.

> +
>  #endif
> diff --git a/arch/x86/include/asm/word-at-a-time.h b/arch/x86/include/asm/word-at-a-time.h
> index 8338b0432b50..46b4f1f7f354 100644
> --- a/arch/x86/include/asm/word-at-a-time.h
> +++ b/arch/x86/include/asm/word-at-a-time.h
> @@ -77,58 +77,18 @@ static inline unsigned long find_zero(unsigned long mask)
>   * and the next page not being mapped, take the exception and
>   * return zeroes in the non-existing part.
>   */
>  static inline unsigned long load_unaligned_zeropad(const void *addr)
>  {
>  	unsigned long ret;
>  
> +	asm volatile(
>  		"1:	mov %[mem], %[ret]\n"
>  		"2:\n"
> +		_ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_ZEROPAD)
> +		: [ret] "=r" (ret)
>  		: [mem] "m" (*(unsigned long *)addr));

That looks delightfully simple :-)

>  
>  	return ret;
>  }
>  
>  #endif /* _ASM_WORD_AT_A_TIME_H */
> diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
> index 331310c29349..60814e110a54 100644
> --- a/arch/x86/mm/extable.c
> +++ b/arch/x86/mm/extable.c
> @@ -41,6 +41,59 @@ static bool ex_handler_default(const struct exception_table_entry *e,
>  	return true;
>  }
>  
> +/*
> + * This is the *very* rare case where we do a "load_unaligned_zeropad()"
> + * and it's a page crosser into a non-existent page.
> + *
> + * This happens when we optimistically load a pathname a word-at-a-time
> + * and the name is less than the full word and the  next page is not
> + * mapped. Typically that only happens for CONFIG_DEBUG_PAGEALLOC.
> + *
> + * NOTE! The faulting address is always a 'mov mem,reg' type instruction
> + * of size 'long', and the exception fixup must always point to right
> + * after the instruction.
> + */
> +static bool ex_handler_zeropad(const struct exception_table_entry *e,
> +			       struct pt_regs *regs,
> +			       unsigned long fault_addr)
> +{
> +	struct insn insn;
> +	const unsigned long mask = sizeof(long) - 1;
> +	unsigned long offset, addr, next_ip, len;
> +	unsigned long *reg;
> +
> +	next_ip = ex_fixup_addr(e);
> +	len = next_ip - regs->ip;
> +	if (len > MAX_INSN_SIZE)
> +		return false;
> +
> +	if (insn_decode(&insn, (void *) regs->ip, len, INSN_MODE_KERN))
> +		return false;

We have insn_decode_kernel() for exactly this (very) common case.

	if (insn_decode_kernel(&insn, (void *)regs->ip))
		return false;

> +	if (insn.length != len)
> +		return false;
> +
> +	if (insn.opcode.bytes[0] != 0x8b)
> +		return false;

I was wondering if we want something like MOV_INSN_OPCODE for 0x8b to
enhance readability, otoh it's currently 0x8b all over the place, so
whatever. At some point you gotta have the insn tables with you anyway.

> +	if (insn.opnd_bytes != sizeof(long))
> +		return false;
> +
> +	addr = (unsigned long) insn_get_addr_ref(&insn, regs);
> +	if (addr == ~0ul)
> +		return false;
> +
> +	offset = addr & mask;
> +	addr = addr & ~mask;
> +	if (fault_addr != addr + sizeof(long))
> +		return false;
> +
> +	reg = insn_get_modrm_reg_ptr(&insn, regs);
> +	if (!reg)
> +		return false;
> +
> +	*reg = *(unsigned long *)addr >> (offset * 8);
> +	return ex_handler_default(e, regs);
> +}

Yep, that all looks about right.

> +
>  static bool ex_handler_fault(const struct exception_table_entry *fixup,
>  			     struct pt_regs *regs, int trapnr)
>  {
> @@ -217,6 +270,8 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
>  		return ex_handler_sgx(e, regs, trapnr);
>  	case EX_TYPE_UCOPY_LEN:
>  		return ex_handler_ucopy_len(e, regs, trapnr, reg, imm);
> +	case EX_TYPE_ZEROPAD:
> +		return ex_handler_zeropad(e, regs, fault_addr);
>  	}
>  	BUG();
>  }
Linus Torvalds Aug. 16, 2022, 5:57 p.m. UTC | #12
On Tue, Aug 16, 2022 at 1:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > +#define EX_TYPE_ZEROPAD                      20 /* load ax from dx zero-padded */
>
> This comment is now woefully incorrect.

Yes it is. Will fix.

> > +     if (insn_decode(&insn, (void *) regs->ip, len, INSN_MODE_KERN))
> > +             return false;
>
> We have insn_decode_kernel() for exactly this (very) common case.

I did that originally, and then I undid it in disgust, because that
interface is too simple.

In particular, it just uses MAX_INSN_SIZE blindly. Which I didn't want
to do when I actually had the instruction size.

Yes, yes, I also check the decode size after-the-fact, but I didn't
want the decoder to even look at the invalid bytes.

This exception case is about the data being at the end of the page, I
wanted the fixup to be aware of code being at the end of a page too.

(And yeah, I'm not convinced that the decoder is that careful, but at
that point I feel it's a decoder issue, and not an issue with the code
I write).

> > +     if (insn.length != len)
> > +             return false;
> > +
> > +     if (insn.opcode.bytes[0] != 0x8b)
> > +             return false;
>
> I was wondering if we want something like MOV_INSN_OPCODE for 0x8b to
> enhance readability, otoh it's currently 0x8b all over the place, so
> whatever. At some point you gotta have the insn tables with you anyway.

Oh, I didn't even notice that we had another case of 0x8b checking.
But yeah, the MMIO decoding wants to see what kind of access it is.

But it wouldn't be MOV_INSN_OPCODE, it would have to be something like
MOV_WORD_INSN_MODRM_REG_OPCODE, because that's what people are
checking for - not just that it's a 'mov', but direction and size too.

And then you'd have to also decide whether you describe those
#define's using the Intel ordering or the one we actually use in our
asm. So now the symbolic names are ambiguous anyway, in ways that the
actual byte value isn't.

So yeah, I suspect it ends up just being an issue of "you have to have
the opcode tables in front of you anyway".

Because you also need to check that that's the only encoding for "mov"
(I checked, and yes, it is - there are other 'mov' encodings that move
directly from memory into %rax, but those use absolute addresses
that don't make sense for a "this is an unaligned access that might be a
page crosser").

Side note: now that I look at it, I note that the MMIO decoding
doesn't handle the absolute address case. It's not really relevant for
the kernel, but I could *imagine* that it is relevant in user mode,
and the SEV case actually does have a "decode and emulate user mode
instruction case".

Not a big issue. If some crazy user even maps IO at a fixed address,
and then uses a "mov %eax <-> moffset" instruction, the kernel
emulation will print out an error and refuse to emulate it.

                  Linus
Peter Zijlstra Aug. 17, 2022, 7:45 a.m. UTC | #13
On Tue, Aug 16, 2022 at 10:57:45AM -0700, Linus Torvalds wrote:

> > > +     if (insn_decode(&insn, (void *) regs->ip, len, INSN_MODE_KERN))
> > > +             return false;
> >
> > We have insn_decode_kernel() for exactly this (very) common case.
> 
> I did that originally, and then I undid it in disgust, because that
> interface is too simple.
> 
> In particular, it just uses MAX_INSN_SIZE blindly. Which I didn't want
> to do when I actually had the instruction size.
> 
> Yes, yes, I also check the decode size after-the-fact, but I didn't
> want the decoder to even look at the invalid bytes.
> 
> This exception case is about the data being at the end of the page, I
> wanted the fixup to be aware of code being at the end of a page too.

I don't want to argue this point too much; but I will anyway :-)

IMO if the decoder ends up out of bounds it's a decoder bug either way
around. That is, we *know* there is a full instruction at the given IP
because we got into this exception path.

( it would be possible to add further constraints on trapnr )

Irrespective of the length constraint given to the decoder, it should
not decode/access things past this instruction (without being careful
about it).

Anyway, I'm fine with the patch as you have it.

Patch

 arch/x86/include/asm/extable_fixup_types.h |  2 ++
 arch/x86/include/asm/word-at-a-time.h      | 50 +++---------------------------
 arch/x86/mm/extable.c                      | 30 ++++++++++++++++++
 3 files changed, 37 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 503622627400..b53f1919710b 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -64,4 +64,6 @@ 
 #define	EX_TYPE_UCOPY_LEN4		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
 #define	EX_TYPE_UCOPY_LEN8		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))
 
+#define EX_TYPE_ZEROPAD			20 /* load ax from dx zero-padded */
+
 #endif
diff --git a/arch/x86/include/asm/word-at-a-time.h b/arch/x86/include/asm/word-at-a-time.h
index 8338b0432b50..4893f1b30dd6 100644
--- a/arch/x86/include/asm/word-at-a-time.h
+++ b/arch/x86/include/asm/word-at-a-time.h
@@ -77,58 +77,18 @@  static inline unsigned long find_zero(unsigned long mask)
  * and the next page not being mapped, take the exception and
  * return zeroes in the non-existing part.
  */
-#ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
-
 static inline unsigned long load_unaligned_zeropad(const void *addr)
 {
-	unsigned long offset, data;
 	unsigned long ret;
 
-	asm_volatile_goto(
-		"1:	mov %[mem], %[ret]\n"
-
-		_ASM_EXTABLE(1b, %l[do_exception])
-
-		: [ret] "=r" (ret)
-		: [mem] "m" (*(unsigned long *)addr)
-		: : do_exception);
-
-	return ret;
-
-do_exception:
-	offset = (unsigned long)addr & (sizeof(long) - 1);
-	addr = (void *)((unsigned long)addr & ~(sizeof(long) - 1));
-	data = *(unsigned long *)addr;
-	ret = data >> offset * 8;
-
-	return ret;
-}
-
-#else /* !CONFIG_CC_HAS_ASM_GOTO_OUTPUT */
-
-static inline unsigned long load_unaligned_zeropad(const void *addr)
-{
-	unsigned long offset, data;
-	unsigned long ret, err = 0;
-
-	asm(	"1:	mov %[mem], %[ret]\n"
+	asm volatile(
+		"1:	mov (%[addr]), %[ret]\n"
 		"2:\n"
-
-		_ASM_EXTABLE_FAULT(1b, 2b)
-
-		: [ret] "=&r" (ret), "+a" (err)
-		: [mem] "m" (*(unsigned long *)addr));
-
-	if (unlikely(err)) {
-		offset = (unsigned long)addr & (sizeof(long) - 1);
-		addr = (void *)((unsigned long)addr & ~(sizeof(long) - 1));
-		data = *(unsigned long *)addr;
-		ret = data >> offset * 8;
-	}
+		_ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_ZEROPAD)
+		: [ret] "=a" (ret)
+		: [addr] "d" (addr));
 
 	return ret;
 }
 
-#endif /* CONFIG_CC_HAS_ASM_GOTO_OUTPUT */
-
 #endif /* _ASM_WORD_AT_A_TIME_H */
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 331310c29349..58c79077a496 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -41,6 +41,34 @@  static bool ex_handler_default(const struct exception_table_entry *e,
 	return true;
 }
 
+/*
+ * This is the *very* rare case where we do a "load_unaligned_zeropad()"
+ * and it's a page crosser into a non-existent page.
+ *
+ * This happens when we optimistically load a pathname a word-at-a-time
+ * and the name is less than the full word and the  next page is not
+ * mapped. Typically that only happens for CONFIG_DEBUG_PAGEALLOC.
+ *
+ * NOTE! The load is always of the form "mov (%edx),%eax" to make the
+ * fixup simple.
+ */
+static bool ex_handler_zeropad(const struct exception_table_entry *e,
+			       struct pt_regs *regs,
+			       unsigned long fault_addr)
+{
+	const unsigned long mask = sizeof(long) - 1;
+	unsigned long offset, addr;
+
+	offset = regs->dx & mask;
+	addr = regs->dx & ~mask;
+	if (fault_addr != addr + sizeof(long))
+		return false;
+
+	regs->ax = *(unsigned long *)addr >> (offset * 8);
+	regs->ip = ex_fixup_addr(e);
+	return true;
+}
+
 static bool ex_handler_fault(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
@@ -217,6 +245,8 @@  int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 		return ex_handler_sgx(e, regs, trapnr);
 	case EX_TYPE_UCOPY_LEN:
 		return ex_handler_ucopy_len(e, regs, trapnr, reg, imm);
+	case EX_TYPE_ZEROPAD:
+		return ex_handler_zeropad(e, regs, fault_addr);
 	}
 	BUG();
 }