Message ID: cover.1621424513.git.asml.silence@gmail.com
Series: io_uring BPF requests
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> There is a bunch of scattered around ctx fields that are almost never
> used, e.g. only on ring exit, plunge them to the end, better locality,
> better aesthetically.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> fs/io_uring.c | 36 +++++++++++++++++-------------------
> 1 file changed, 17 insertions(+), 19 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9ac5e278a91e..7e3410ce100a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -367,9 +367,6 @@ struct io_ring_ctx {
> 	unsigned cached_cq_overflow;
> 	unsigned long sq_check_overflow;
>
> -	/* hashed buffered write serialization */
> -	struct io_wq_hash *hash_map;
> -
> 	struct list_head defer_list;
> 	struct list_head timeout_list;
> 	struct list_head cq_overflow_list;
> @@ -386,9 +383,6 @@ struct io_ring_ctx {
>
> 	struct io_rings *rings;
>
> -	/* Only used for accounting purposes */
> -	struct mm_struct *mm_account;
> -
> 	const struct cred *sq_creds;	/* cred used for __io_sq_thread() */
> 	struct io_sq_data *sq_data;	/* if using sq thread polling */
>
> @@ -409,14 +403,6 @@ struct io_ring_ctx {
> 	unsigned nr_user_bufs;
> 	struct io_mapped_ubuf **user_bufs;
>
> -	struct user_struct *user;
> -
> -	struct completion ref_comp;
> -
> -#if defined(CONFIG_UNIX)
> -	struct socket *ring_sock;
> -#endif
> -
> 	struct xarray io_buffers;
>
> 	struct xarray personalities;
> @@ -460,12 +446,24 @@ struct io_ring_ctx {
>
> 	struct io_restriction restrictions;
>
> -	/* exit task_work */
> -	struct callback_head *exit_task_work;
> -
> 	/* Keep this last, we don't need it for the fast path */
> -	struct work_struct exit_work;
> -	struct list_head tctx_list;
> +	struct {

Why do we need an anonymous struct here? For cache line alignment?
Do we need ____cacheline_aligned_in_smp?

> +	#if defined(CONFIG_UNIX)
> +		struct socket *ring_sock;
> +	#endif
> +		/* hashed buffered write serialization */
> +		struct io_wq_hash *hash_map;
> +
> +		/* Only used for accounting purposes */
> +		struct user_struct *user;
> +		struct mm_struct *mm_account;
> +
> +		/* ctx exit and cancelation */
> +		struct callback_head *exit_task_work;
> +		struct work_struct exit_work;
> +		struct list_head tctx_list;
> +		struct completion ref_comp;
> +	};
> };
>
> struct io_uring_task {
> --
> 2.31.1
>
On 5/20/21 10:46 PM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>> There is a bunch of scattered around ctx fields that are almost never
>> used, e.g. only on ring exit, plunge them to the end, better locality,
>> better aesthetically.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>> fs/io_uring.c | 36 +++++++++++++++++-------------------
>> 1 file changed, 17 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 9ac5e278a91e..7e3410ce100a 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -367,9 +367,6 @@ struct io_ring_ctx {
>> 	unsigned cached_cq_overflow;
>> 	unsigned long sq_check_overflow;
>>
>> -	/* hashed buffered write serialization */
>> -	struct io_wq_hash *hash_map;
>> -
>> 	struct list_head defer_list;
>> 	struct list_head timeout_list;
>> 	struct list_head cq_overflow_list;
>> @@ -386,9 +383,6 @@ struct io_ring_ctx {
>>
>> 	struct io_rings *rings;
>>
>> -	/* Only used for accounting purposes */
>> -	struct mm_struct *mm_account;
>> -
>> 	const struct cred *sq_creds;	/* cred used for __io_sq_thread() */
>> 	struct io_sq_data *sq_data;	/* if using sq thread polling */
>>
>> @@ -409,14 +403,6 @@ struct io_ring_ctx {
>> 	unsigned nr_user_bufs;
>> 	struct io_mapped_ubuf **user_bufs;
>>
>> -	struct user_struct *user;
>> -
>> -	struct completion ref_comp;
>> -
>> -#if defined(CONFIG_UNIX)
>> -	struct socket *ring_sock;
>> -#endif
>> -
>> 	struct xarray io_buffers;
>>
>> 	struct xarray personalities;
>> @@ -460,12 +446,24 @@ struct io_ring_ctx {
>>
>> 	struct io_restriction restrictions;
>>
>> -	/* exit task_work */
>> -	struct callback_head *exit_task_work;
>> -
>> 	/* Keep this last, we don't need it for the fast path */
>> -	struct work_struct exit_work;
>> -	struct list_head tctx_list;
>> +	struct {
>
> Why do we need an anonymous struct here? For cache line alignment?
> Do we need ____cacheline_aligned_in_smp?

Rather as a visual hint, considering that most of the fields are historically
grouped into ____cacheline_aligned_in_smp structs. It also prepares for
potentially splitting this out of the ctx struct as it grows bigger.

The first 2-3 patches are not strictly related to bpf and will go separately
earlier; the set was just based on them.

>> +	#if defined(CONFIG_UNIX)
>> +		struct socket *ring_sock;
>> +	#endif
>> +		/* hashed buffered write serialization */
>> +		struct io_wq_hash *hash_map;
>> +
>> +		/* Only used for accounting purposes */
>> +		struct user_struct *user;
>> +		struct mm_struct *mm_account;
>> +
>> +		/* ctx exit and cancelation */
>> +		struct callback_head *exit_task_work;
>> +		struct work_struct exit_work;
>> +		struct list_head tctx_list;
>> +		struct completion ref_comp;
>> +	};
>> };
>>
>> struct io_uring_task {

--
Pavel Begunkov
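For readers who have not seen the idiom Pavel refers to, below is a minimal,
self-contained sketch (not the actual io_uring layout; struct and field names
are purely illustrative) of how ____cacheline_aligned_in_smp pushes a group of
cold, exit-only fields onto its own cache line, away from hot submission state:

/*
 * Illustrative sketch only. The cold group is placed behind a
 * cache-line-aligned boundary so it does not share lines with hot fields.
 */
#include <stddef.h>
#include <stdio.h>

#ifndef ____cacheline_aligned_in_smp
#define ____cacheline_aligned_in_smp __attribute__((aligned(64)))
#endif

struct example_ctx {
	/* hot, per-request fields stay at the front */
	unsigned int cached_sq_head;
	unsigned int cached_cq_tail;

	/* cold fields, only touched on ring exit, on their own cache line */
	struct {
		void *hash_map;
		void *mm_account;
		void *exit_task_work;
	} ____cacheline_aligned_in_smp;
};

int main(void)
{
	/* with the 64-byte fallback above, the cold group starts at offset 64 */
	printf("cold group starts at offset %zu\n",
	       offsetof(struct example_ctx, hash_map));
	return 0;
}

With the 64-byte alignment used in the fallback macro, the cold group begins a
full cache line after the hot fields, which is the locality effect the patch
is after even without the explicit annotation.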
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
> io_uring to execute BPF-based requests.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> fs/io_uring.c | 21 +++++++++++++++++++++
> include/linux/bpf_types.h | 2 ++
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/syscall.c | 1 +
> kernel/bpf/verifier.c | 5 ++++-
> 5 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 1a4c9e513ac9..882b16b5e5eb 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -10201,6 +10201,27 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> 	return ret;
> }
>
> +static const struct bpf_func_proto *
> +io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id);
> +}
> +
> +static bool io_bpf_is_valid_access(int off, int size,
> +				   enum bpf_access_type type,
> +				   const struct bpf_prog *prog,
> +				   struct bpf_insn_access_aux *info)
> +{
> +	return false;
> +}
> +
> +const struct bpf_prog_ops bpf_io_uring_prog_ops = {};
> +
> +const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
> +	.get_func_proto = io_bpf_func_proto,
> +	.is_valid_access = io_bpf_is_valid_access,
> +};
> +
> SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> 		void __user *, arg, unsigned int, nr_args)
> {
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 99f7fd657d87..d0b7954887bd 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -77,6 +77,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
> 	      void *, void *)
> #endif /* CONFIG_BPF_LSM */
> #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
> +	      void *, void *)
>
> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4ba4ef0ff63a..de544f0fbeef 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -206,6 +206,7 @@ enum bpf_prog_type {
> 	BPF_PROG_TYPE_EXT,
> 	BPF_PROG_TYPE_LSM,
> 	BPF_PROG_TYPE_SK_LOOKUP,
> +	BPF_PROG_TYPE_IOURING,
> };
>
> enum bpf_attach_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 250503482cda..6ef7a26f4dc3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
> 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
> 	case BPF_PROG_TYPE_SOCK_OPS:
> +	case BPF_PROG_TYPE_IOURING:
> 	case BPF_PROG_TYPE_EXT: /* extends any prog */
> 		return true;
> 	case BPF_PROG_TYPE_CGROUP_SKB:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 0399ac092b36..2a53f44618a7 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
> 	case BPF_PROG_TYPE_SK_LOOKUP:
> 		range = tnum_range(SK_DROP, SK_PASS);
> 		break;
> +	case BPF_PROG_TYPE_IOURING:
> +		range = tnum_const(0);
> +		break;
> 	case BPF_PROG_TYPE_EXT:
> 		/* freplace program can return anything as its return value
> 		 * depends on the to-be-replaced kernel func or bpf program.
> @@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> 	u64 key;
>
> 	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> -	    prog->type != BPF_PROG_TYPE_LSM) {
> +	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {

Is IOURING program sleepable? If so, please highlight that in the commit log
and update the warning below.

> 		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> 		return -EINVAL;
> 	}
> --
> 2.31.1
>
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Add a BPF_FUNC_iouring_queue_sqe BPF function as a demonstration of
> submmiting a new request by a BPF request.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++++----
> include/uapi/linux/bpf.h | 1 +
> 2 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 20fddc5945f2..aae786291c57 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -882,6 +882,7 @@ struct io_defer_entry {
> };
>
> struct io_bpf_ctx {
> +	struct io_ring_ctx *ctx;
> };
>
> struct io_op_def {
> @@ -6681,7 +6682,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> 		ret = -EBADF;
> 	}
>
> -	state->ios_left--;
> +	if (state->ios_left > 1)
> +		state->ios_left--;
> 	return ret;
> }
>
> @@ -10345,10 +10347,50 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> 	return ret;
> }
>
> +BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
> +	   const struct io_uring_sqe *, sqe,
> +	   u32, sqe_len)
> +{
> +	struct io_ring_ctx *ctx = bpf_ctx->ctx;
> +	struct io_kiocb *req;
> +
> +	if (sqe_len != sizeof(struct io_uring_sqe))
> +		return -EINVAL;
> +
> +	req = io_alloc_req(ctx);
> +	if (unlikely(!req))
> +		return -ENOMEM;
> +	if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
> +		kmem_cache_free(req_cachep, req);
> +		return -EAGAIN;
> +	}
> +	percpu_counter_add(&current->io_uring->inflight, 1);
> +	refcount_add(1, &current->usage);
> +
> +	/* returns number of submitted SQEs or an error */
> +	return !io_submit_sqe(ctx, req, sqe);
> +}
> +
> +const struct bpf_func_proto io_bpf_queue_sqe_proto = {
> +	.func = io_bpf_queue_sqe,
> +	.gpl_only = false,
> +	.ret_type = RET_INTEGER,
> +	.arg1_type = ARG_PTR_TO_CTX,
> +	.arg2_type = ARG_PTR_TO_MEM,
> +	.arg3_type = ARG_CONST_SIZE,
> +};
> +
> static const struct bpf_func_proto *
> io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> {
> -	return bpf_base_func_proto(func_id);
> +	switch (func_id) {
> +	case BPF_FUNC_copy_from_user:
> +		return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
> +	case BPF_FUNC_iouring_queue_sqe:
> +		return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
> +	default:
> +		return bpf_base_func_proto(func_id);
> +	}
> }
>
> static bool io_bpf_is_valid_access(int off, int size,
> @@ -10379,9 +10421,10 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
> 		     atomic_read(&req->task->io_uring->in_idle)))
> 		goto done;
>
> -	memset(&bpf_ctx, 0, sizeof(bpf_ctx));
> +	bpf_ctx.ctx = ctx;
> 	prog = req->bpf.prog;
>
> +	io_submit_state_start(&ctx->submit_state, 1);
> 	if (prog->aux->sleepable) {
> 		rcu_read_lock();
> 		bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> 	} else {
> 		bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> 	}
> -
> +	io_submit_state_end(&ctx->submit_state, ctx);
> 	ret = 0;
> done:
> 	__io_req_complete(req, issue_flags, ret, 0);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index de544f0fbeef..cc268f749a7d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4082,6 +4082,7 @@ union bpf_attr {
> 	FN(ima_inode_hash),		\
> 	FN(sock_from_file),		\
> 	FN(check_mtu),			\
> +	FN(iouring_queue_sqe),		\

We need to describe this function in the comment above, just like 20/23 does.

> 	/* */
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.31.1
>
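Helper descriptions in include/uapi/linux/bpf.h follow a fixed
Description/Return layout above the FN() list. A possible entry for this
helper, with wording that is only a guess based on what the patch itself does
(not taken from the series), might look like:

 * long bpf_iouring_queue_sqe(void *ctx, const struct io_uring_sqe *sqe, u32 sqe_len)
 *	Description
 *		Submit the io_uring SQE pointed to by *sqe*, of size *sqe_len*
 *		(which must equal sizeof(struct io_uring_sqe)), on behalf of
 *		the io_uring BPF request identified by *ctx*.
 *	Return
 *		The number of SQEs successfully queued (0 or 1) on success, or
 *		a negative error in case of failure.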
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> The main problem solved is feeding completion information of other
> requests in a form of CQEs back into BPF. I decided to wire up support
> for multiple completion queues (aka CQs) and give BPF programs access to
> them, so leaving userspace in control over synchronisation that should
> be much more flexible that the link-based approach.
>
> For instance, there can be a separate CQ for each BPF program, so no
> extra sync is needed, and communication can be done by submitting a
> request targeting a neighboring CQ or submitting a CQE there directly
> (see test3 below). CQ is choosen by sqe->cq_idx, so everyone can
> cross-fire if willing.
>
[...]
> bpf: add IOURING program type
> io_uring: implement bpf prog registration
> io_uring: add support for bpf requests
> io_uring: enable BPF to submit SQEs
> io_uring: enable bpf to submit CQEs
> io_uring: enable bpf to reap CQEs
> libbpf: support io_uring
> io_uring: pass user_data to bpf executor
> bpf: Add bpf_copy_to_user() helper
> io_uring: wire bpf copy to user
> io_uring: don't wait on CQ exclusively
> io_uring: enable bpf reqs to wait for CQs

Besides a few comments, these BPF-related patches look sane to me.
Please consider adding some selftests (tools/testing/selftests/bpf).

Thanks,
Song
On 5/21/21 12:34 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
>> io_uring to execute BPF-based requests.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>> fs/io_uring.c | 21 +++++++++++++++++++++
>> include/linux/bpf_types.h | 2 ++
>> include/uapi/linux/bpf.h | 1 +
>> kernel/bpf/syscall.c | 1 +
>> kernel/bpf/verifier.c | 5 ++++-
>> 5 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 1a4c9e513ac9..882b16b5e5eb 100644

[...]

>> +BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
>> +	      void *, void *)
>>
>> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 4ba4ef0ff63a..de544f0fbeef 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -206,6 +206,7 @@ enum bpf_prog_type {
>> 	BPF_PROG_TYPE_EXT,
>> 	BPF_PROG_TYPE_LSM,
>> 	BPF_PROG_TYPE_SK_LOOKUP,
>> +	BPF_PROG_TYPE_IOURING,
>> };
>>
>> enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 250503482cda..6ef7a26f4dc3 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
>> 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>> 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
>> 	case BPF_PROG_TYPE_SOCK_OPS:
>> +	case BPF_PROG_TYPE_IOURING:
>> 	case BPF_PROG_TYPE_EXT: /* extends any prog */
>> 		return true;
>> 	case BPF_PROG_TYPE_CGROUP_SKB:
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 0399ac092b36..2a53f44618a7 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
>> 	case BPF_PROG_TYPE_SK_LOOKUP:
>> 		range = tnum_range(SK_DROP, SK_PASS);
>> 		break;
>> +	case BPF_PROG_TYPE_IOURING:
>> +		range = tnum_const(0);
>> +		break;
>> 	case BPF_PROG_TYPE_EXT:
>> 		/* freplace program can return anything as its return value
>> 		 * depends on the to-be-replaced kernel func or bpf program.
>> @@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>> 	u64 key;
>>
>> 	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
>> -	    prog->type != BPF_PROG_TYPE_LSM) {
>> +	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {
>
> Is IOURING program sleepable? If so, please highlight that in the commit log

It's supposed to work with both sleepable and non-sleepable programs, but with
different sets of helpers, e.g. a program can't submit requests or do
bpf_copy_from_user() if it can't sleep. The only other difference in handling
is RCU around the non-sleepable case, but please shout if I forgot anything.

> and update the warning below.

Sure

>
>> 		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
>> 		return -EINVAL;
>> 	}
>> --
>> 2.31.1
>>
>

--
Pavel Begunkov
On 5/21/21 1:35 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>> The main problem solved is feeding completion information of other
>> requests in a form of CQEs back into BPF. I decided to wire up support
>> for multiple completion queues (aka CQs) and give BPF programs access to
>> them, so leaving userspace in control over synchronisation that should
>> be much more flexible that the link-based approach.
>>
>> For instance, there can be a separate CQ for each BPF program, so no
>> extra sync is needed, and communication can be done by submitting a
>> request targeting a neighboring CQ or submitting a CQE there directly
>> (see test3 below). CQ is choosen by sqe->cq_idx, so everyone can
>> cross-fire if willing.
>>
>
> [...]
>
>> bpf: add IOURING program type
>> io_uring: implement bpf prog registration
>> io_uring: add support for bpf requests
>> io_uring: enable BPF to submit SQEs
>> io_uring: enable bpf to submit CQEs
>> io_uring: enable bpf to reap CQEs
>> libbpf: support io_uring
>> io_uring: pass user_data to bpf executor
>> bpf: Add bpf_copy_to_user() helper
>> io_uring: wire bpf copy to user
>> io_uring: don't wait on CQ exclusively
>> io_uring: enable bpf reqs to wait for CQs
>
> Besides a few comments, these BPF-related patches look sane to me.
> Please consider adding some selftests (tools/testing/selftests/bpf).

The comments are noted. Thanks, Song.

--
Pavel Begunkov
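As a concrete starting point for such a selftest, a minimal program could look
like the sketch below. Only two details are taken from this thread: the
verifier change accepts nothing but a constant 0 return for this program type,
and io_bpf_is_valid_access() rejects direct ctx access. The SEC("iouring")
section name and the libbpf mapping to BPF_PROG_TYPE_IOURING are assumptions,
since the series' libbpf patch is not quoted here.

// Hypothetical sketch of a minimal selftest-style program; the "iouring"
// section name is an assumption, not taken from the quoted patches.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

int runs;	/* read back from userspace via the skeleton */

SEC("iouring")
int count_runs(void *ctx)
{
	/* ctx must not be dereferenced: io_bpf_is_valid_access() returns false */
	runs++;
	return 0;	/* check_return_code() only allows a constant 0 */
}

A selftest in tools/testing/selftests/bpf would then register and run the
program through io_uring and assert that "runs" was incremented.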