diff mbox series

[1/9] kernel: add a PF_FORCE_COMPAT flag

Message ID 20200918124533.3487701-2-hch@lst.de
State New
Headers show
Series [1/9] kernel: add a PF_FORCE_COMPAT flag | expand

Commit Message

Christoph Hellwig Sept. 18, 2020, 12:45 p.m. UTC
Add a flag to force processing a syscall as a compat syscall.  This is
required so that in_compat_syscall() works for I/O submitted by io_uring
helper threads on behalf of compat syscalls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/sparc/include/asm/compat.h | 3 ++-
 arch/x86/include/asm/compat.h   | 2 +-
 fs/io_uring.c                   | 9 +++++++++
 include/linux/compat.h          | 5 ++++-
 include/linux/sched.h           | 1 +
 5 files changed, 17 insertions(+), 3 deletions(-)

Comments

Al Viro Sept. 18, 2020, 1:40 p.m. UTC | #1
On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> Add a flag to force processing a syscall as a compat syscall.  This is

> required so that in_compat_syscall() works for I/O submitted by io_uring

> helper threads on behalf of compat syscalls.

> 

> Signed-off-by: Christoph Hellwig <hch@lst.de>

> ---

>  arch/sparc/include/asm/compat.h | 3 ++-

>  arch/x86/include/asm/compat.h   | 2 +-

>  fs/io_uring.c                   | 9 +++++++++

>  include/linux/compat.h          | 5 ++++-

>  include/linux/sched.h           | 1 +

>  5 files changed, 17 insertions(+), 3 deletions(-)

> 

> diff --git a/arch/sparc/include/asm/compat.h b/arch/sparc/include/asm/compat.h

> index 40a267b3bd5208..fee6c51d36e869 100644

> --- a/arch/sparc/include/asm/compat.h

> +++ b/arch/sparc/include/asm/compat.h

> @@ -211,7 +211,8 @@ static inline int is_compat_task(void)

>  static inline bool in_compat_syscall(void)

>  {

>  	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */

> -	return pt_regs_trap_type(current_pt_regs()) == 0x110;

> +	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||

> +		(current->flags & PF_FORCE_COMPAT);


Can't say I like that approach ;-/  Reasoning about the behaviour is much
harder when it's controlled like that - witness set_fs() shite...
Arnd Bergmann Sept. 18, 2020, 1:59 p.m. UTC | #2
On Fri, Sep 18, 2020 at 3:44 PM Christoph Hellwig <hch@lst.de> wrote:
>

> On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:

> > >     /* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */

> > > -   return pt_regs_trap_type(current_pt_regs()) == 0x110;

> > > +   return pt_regs_trap_type(current_pt_regs()) == 0x110 ||

> > > +           (current->flags & PF_FORCE_COMPAT);

> >

> > Can't say I like that approach ;-/  Reasoning about the behaviour is much

> > harder when it's controlled like that - witness set_fs() shite...

>

> I don't particularly like it either.  But do you have a better idea

> how to deal with io_uring vs compat tasks?


Do we need to worry about something other than the compat_iovec
struct for now? Regarding the code in io_import_iovec(), it would
seem that can easily be handled by exposing an internal helper.
Instead of

#ifdef CONFIG_COMPAT
     if (req->ctx->compat)
            return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
iovec, iter);
#endif
        return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);

This could do

    __import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec,
                     iter, req->ctx->compat);

With the normal import_iovec() becoming a trivial wrapper around
the same thing:

ssize_t import_iovec(int type, const struct iovec __user * uvector,
                 unsigned nr_segs, unsigned fast_segs,
                 struct iovec **iov, struct iov_iter *i)
{
     return __import_iovec(type, uvector, nr_segs, fast_segs, iov,
              i, in_compat_syscall());
}


         Arnd
David Laight Sept. 19, 2020, 2:53 p.m. UTC | #3
From: Al Viro

> Sent: 18 September 2020 14:58

> 

> On Fri, Sep 18, 2020 at 03:44:06PM +0200, Christoph Hellwig wrote:

> > On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:

> > > >  	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */

> > > > -	return pt_regs_trap_type(current_pt_regs()) == 0x110;

> > > > +	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||

> > > > +		(current->flags & PF_FORCE_COMPAT);

> > >

> > > Can't say I like that approach ;-/  Reasoning about the behaviour is much

> > > harder when it's controlled like that - witness set_fs() shite...

> >

> > I don't particularly like it either.  But do you have a better idea

> > how to deal with io_uring vs compat tasks?

> 

> <wry> git rm fs/io_uring.c would make a good starting point </wry>

> Yes, I know it's not going to happen, but one can dream...


Maybe the io_uring code needs some changes to make it vaguely safe.
- No support for 32-bit compat mixed working (or at all?).
  Plausibly a special worker could do 32bit work.
- ring structure (I'm assuming mapped by mmap()) never mapped
  in more than one process (not cloned by fork()).
- No implicit handover of files to another process.
  Would need an munmap, handover, mmap sequence.

In any case the io_ring rather abuses the import_iovec() interface.

The canonical sequence is (types from memory):
	struct iovec cache[8], *iov = cache;
	struct iter iter;
	...
	rval = import_iovec(..., &iov, 8, &iter);
	// Do read/write user using 'iter'
	free(iov);

I don't think there is any strict requirement that iter.iov
is set to either 'cache' or 'iov' (it probably must point
into one of them.)
But the io_uring code will make that assumption because the
actual copies can be done much later and it doesn't save 'iter'.
It gets itself in a right mess because it doesn't separate
the 'address I need to free' from 'the iov[] for any transfers'.

io_uring is also the only code that relies on import_iovec()
returning the iter.count on success.
It would be much better to have:
	iov = import_iovec(..., &cache, ...);
	free(iov);
and use ERR_PTR() et al for error detectoion.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Andy Lutomirski Sept. 19, 2020, 4:21 p.m. UTC | #4
On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
>

> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:

> > Said that, why not provide a variant that would take an explicit

> > "is it compat" argument and use it there?  And have the normal

> > one pass in_compat_syscall() to that...

>

> That would help to not introduce a regression with this series yes.

> But it wouldn't fix existing bugs when io_uring is used to access

> read or write methods that use in_compat_syscall().  One example that

> I recently ran into is drivers/scsi/sg.c.


Aside from the potentially nasty use of per-task variables, one thing
I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
going to have a generic mechanism for this, shouldn't we allow a full
override of the syscall arch instead of just allowing forcing compat
so that a compat syscall can do a non-compat operation?
Finn Thain Sept. 19, 2020, 9:52 p.m. UTC | #5
On Sat, 19 Sep 2020, Arnd Bergmann wrote:

> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:

> > On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:

> > > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:

> > > > Said that, why not provide a variant that would take an explicit 

> > > > "is it compat" argument and use it there?  And have the normal one 

> > > > pass in_compat_syscall() to that...

> > >

> > > That would help to not introduce a regression with this series yes. 

> > > But it wouldn't fix existing bugs when io_uring is used to access 

> > > read or write methods that use in_compat_syscall().  One example 

> > > that I recently ran into is drivers/scsi/sg.c.

> 

> Ah, so reading /dev/input/event* would suffer from the same issue, and 

> that one would in fact be broken by your patch in the hypothetical case 

> that someone tried to use io_uring to read /dev/input/event on x32...

> 

> For reference, I checked the socket timestamp handling that has a number 

> of corner cases with time32/time64 formats in compat mode, but none of 

> those appear to be affected by the problem.

> 

> > Aside from the potentially nasty use of per-task variables, one thing 

> > I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're 

> > going to have a generic mechanism for this, shouldn't we allow a full 

> > override of the syscall arch instead of just allowing forcing compat 

> > so that a compat syscall can do a non-compat operation?

> 

> The only reason it's needed here is that the caller is in a kernel 

> thread rather than a system call. Are there any possible scenarios where 

> one would actually need the opposite?

> 


Quite possibly. The ext4 vs. compat getdents bug is still unresolved. 
Please see,
https://lore.kernel.org/lkml/CAFEAcA9W+JK7_TrtTnL1P2ES1knNPJX9wcUvhfLwxLq9augq1w@mail.gmail.com/

>        Arnd

>
Andy Lutomirski Sept. 20, 2020, 7:14 p.m. UTC | #6
On Sun, Sep 20, 2020 at 11:07 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>

> On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:

> > On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:

> > > Add a flag to force processing a syscall as a compat syscall.  This is

> > > required so that in_compat_syscall() works for I/O submitted by io_uring

> > > helper threads on behalf of compat syscalls.

> >

> > Al doesn't like this much, but my suggestion is to introduce two new

> > opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code

> > can translate IORING_OP_READV to IORING_OP_READV32 and then the core

> > code can know what that user pointer is pointing to.

>

> Let's separate two issues:

>         1) compat syscalls want 32bit iovecs.  Nothing to do with the

> drivers, dealt with just fine.

>         2) a few drivers are really fucked in head.  They use different

> *DATA* layouts for reads/writes, depending upon the calling process.

> IOW, if you fork/exec a 32bit binary and your stdin is one of those,

> reads from stdin in parent and child will yield different data layouts.

> On the same struct file.

>         That's what Christoph worries about (/dev/sg he'd mentioned is

> one of those).

>

>         IMO we should simply have that dozen or so of pathological files

> marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -

> it describes the userland ABI provided by those.  And it's cast in stone.

>


I wonder if this is really quite cast in stone.  We could also have
FMODE_SHITTY_COMPAT and set that when a file like this is *opened* in
compat mode.  Then that particular struct file would be read and
written using the compat data format.  The change would be
user-visible, but the user that would see it would be very strange
indeed.

I don't have a strong opinion as to whether that is better or worse
than denying io_uring access to these things, but at least it moves
the special case out of io_uring.

--Andy
Andy Lutomirski Sept. 20, 2020, 7:28 p.m. UTC | #7
On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
>

> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:

> > IMO it's much saner to mark those and refuse to touch them from io_uring...

>

> Simpler solution is to remove io_uring from the 32-bit syscall list.

> If you're a 32-bit process, you don't get to use io_uring.  Would

> any real users actually care about that?


We could go one step farther and declare that we're done adding *any*
new compat syscalls :)



-- 
Andy Lutomirski
AMA Capital Management, LLC
David Laight Sept. 20, 2020, 9:13 p.m. UTC | #8
From: Arnd Bergmann

> Sent: 20 September 2020 21:49

> 

> On Sun, Sep 20, 2020 at 9:28 PM Andy Lutomirski <luto@kernel.org> wrote:

> > On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:

> > >

> > > On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:

> > > > IMO it's much saner to mark those and refuse to touch them from io_uring...

> > >

> > > Simpler solution is to remove io_uring from the 32-bit syscall list.

> > > If you're a 32-bit process, you don't get to use io_uring.  Would

> > > any real users actually care about that?

> >

> > We could go one step farther and declare that we're done adding *any*

> > new compat syscalls :)

> 

> Would you also stop adding system calls to native 32-bit systems then?

> 

> On memory constrained systems (less than 2GB a.t.m.), there is still a

> strong demand for running 32-bit user space, but all of the recent Arm

> cores (after Cortex-A55) dropped the ability to run 32-bit kernels, so

> that compat mode may eventually become the primary way to run

> Linux on cheap embedded systems.

> 

> I don't think there is any chance we can realistically take away io_uring

> from the 32-bit ABI any more now.


Can't it just run requests from 32bit apps in a kernel thread that has
the 'in_compat_syscall' flag set?
Not that i recall seeing the code where it saves the 'compat' nature
of any requests.

It is already completely f*cked if you try to pass the command ring
to a child process - it uses the wrong 'mm'.
I suspect there are some really horrid security holes in that area.

	David.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Pavel Begunkov Sept. 21, 2020, 4:31 p.m. UTC | #9
On 21/09/2020 00:13, David Laight wrote:
> From: Arnd Bergmann

>> Sent: 20 September 2020 21:49

>>

>> On Sun, Sep 20, 2020 at 9:28 PM Andy Lutomirski <luto@kernel.org> wrote:

>>> On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:

>>>>

>>>> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:

>>>>> IMO it's much saner to mark those and refuse to touch them from io_uring...

>>>>

>>>> Simpler solution is to remove io_uring from the 32-bit syscall list.

>>>> If you're a 32-bit process, you don't get to use io_uring.  Would

>>>> any real users actually care about that?

>>>

>>> We could go one step farther and declare that we're done adding *any*

>>> new compat syscalls :)

>>

>> Would you also stop adding system calls to native 32-bit systems then?

>>

>> On memory constrained systems (less than 2GB a.t.m.), there is still a

>> strong demand for running 32-bit user space, but all of the recent Arm

>> cores (after Cortex-A55) dropped the ability to run 32-bit kernels, so

>> that compat mode may eventually become the primary way to run

>> Linux on cheap embedded systems.

>>

>> I don't think there is any chance we can realistically take away io_uring

>> from the 32-bit ABI any more now.

> 

> Can't it just run requests from 32bit apps in a kernel thread that has

> the 'in_compat_syscall' flag set?

> Not that i recall seeing the code where it saves the 'compat' nature

> of any requests.

> 

> It is already completely f*cked if you try to pass the command ring

> to a child process - it uses the wrong 'mm'.


And how so? io_uring uses mm of a submitter. The exception is SQPOLL
mode, but it requires CAP_SYS_ADMIN or CAP_SYS_NICE anyway.

> I suspect there are some really horrid security holes in that area.

> 

> 	David.

> 

> -

> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK

> Registration No: 1397386 (Wales)

> 


-- 
Pavel Begunkov
diff mbox series

Patch

diff --git a/arch/sparc/include/asm/compat.h b/arch/sparc/include/asm/compat.h
index 40a267b3bd5208..fee6c51d36e869 100644
--- a/arch/sparc/include/asm/compat.h
+++ b/arch/sparc/include/asm/compat.h
@@ -211,7 +211,8 @@  static inline int is_compat_task(void)
 static inline bool in_compat_syscall(void)
 {
 	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
-	return pt_regs_trap_type(current_pt_regs()) == 0x110;
+	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
+		(current->flags & PF_FORCE_COMPAT);
 }
 #define in_compat_syscall in_compat_syscall
 #endif
diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index d4edf281fff49d..fbab072d4e5b31 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -208,7 +208,7 @@  static inline bool in_32bit_syscall(void)
 #ifdef CONFIG_COMPAT
 static inline bool in_compat_syscall(void)
 {
-	return in_32bit_syscall();
+	return in_32bit_syscall() || (current->flags & PF_FORCE_COMPAT);
 }
 #define in_compat_syscall in_compat_syscall	/* override the generic impl */
 #endif
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3790c7fe9fee22..5755d557c3f7bc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5449,6 +5449,9 @@  static int io_req_defer_prep(struct io_kiocb *req,
 	if (unlikely(ret))
 		return ret;
 
+	if (req->ctx->compat)
+		current->flags |= PF_FORCE_COMPAT;
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		break;
@@ -5546,6 +5549,7 @@  static int io_req_defer_prep(struct io_kiocb *req,
 		break;
 	}
 
+	current->flags &= ~PF_FORCE_COMPAT;
 	return ret;
 }
 
@@ -5669,6 +5673,9 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
+	if (ctx->compat)
+		current->flags |= PF_FORCE_COMPAT;
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		ret = io_nop(req, cs);
@@ -5898,6 +5905,8 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		break;
 	}
 
+	current->flags &= ~PF_FORCE_COMPAT;
+
 	if (ret)
 		return ret;
 
diff --git a/include/linux/compat.h b/include/linux/compat.h
index b354ce58966e2d..685066f7ad325f 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -891,7 +891,10 @@  asmlinkage long compat_sys_socketcall(int call, u32 __user *args);
  */
 
 #ifndef in_compat_syscall
-static inline bool in_compat_syscall(void) { return is_compat_task(); }
+static inline bool in_compat_syscall(void)
+{
+	return is_compat_task() || (current->flags & PF_FORCE_COMPAT);
+}
 #endif
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935fa..c8b183b5655a1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1491,6 +1491,7 @@  extern struct pid *cad_pid;
  */
 #define PF_IDLE			0x00000002	/* I am an IDLE thread */
 #define PF_EXITING		0x00000004	/* Getting shut down */
+#define PF_FORCE_COMPAT		0x00000008	/* acting as compat task */
 #define PF_VCPU			0x00000010	/* I'm a virtual CPU */
 #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */