Message ID: cover.1621424513.git.asml.silence@gmail.com
Series: io_uring BPF requests
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> There is a bunch of scattered around ctx fields that are almost never
> used, e.g. only on ring exit, plunge them to the end, better locality,
> better aesthetically.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> fs/io_uring.c | 36 +++++++++++++++++-------------------
> 1 file changed, 17 insertions(+), 19 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9ac5e278a91e..7e3410ce100a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -367,9 +367,6 @@ struct io_ring_ctx {
> 	unsigned cached_cq_overflow;
> 	unsigned long sq_check_overflow;
>
> -	/* hashed buffered write serialization */
> -	struct io_wq_hash *hash_map;
> -
> 	struct list_head defer_list;
> 	struct list_head timeout_list;
> 	struct list_head cq_overflow_list;
> @@ -386,9 +383,6 @@ struct io_ring_ctx {
>
> 	struct io_rings *rings;
>
> -	/* Only used for accounting purposes */
> -	struct mm_struct *mm_account;
> -
> 	const struct cred *sq_creds;	/* cred used for __io_sq_thread() */
> 	struct io_sq_data *sq_data;	/* if using sq thread polling */
>
> @@ -409,14 +403,6 @@ struct io_ring_ctx {
> 	unsigned nr_user_bufs;
> 	struct io_mapped_ubuf **user_bufs;
>
> -	struct user_struct *user;
> -
> -	struct completion ref_comp;
> -
> -#if defined(CONFIG_UNIX)
> -	struct socket *ring_sock;
> -#endif
> -
> 	struct xarray io_buffers;
>
> 	struct xarray personalities;
> @@ -460,12 +446,24 @@ struct io_ring_ctx {
>
> 	struct io_restriction restrictions;
>
> -	/* exit task_work */
> -	struct callback_head *exit_task_work;
> -
> 	/* Keep this last, we don't need it for the fast path */
> -	struct work_struct exit_work;
> -	struct list_head tctx_list;
> +	struct {

Why do we need an anonymous struct here? For cache line alignment?
Do we need ____cacheline_aligned_in_smp?

> +	#if defined(CONFIG_UNIX)
> +		struct socket *ring_sock;
> +	#endif
> +		/* hashed buffered write serialization */
> +		struct io_wq_hash *hash_map;
> +
> +		/* Only used for accounting purposes */
> +		struct user_struct *user;
> +		struct mm_struct *mm_account;
> +
> +		/* ctx exit and cancelation */
> +		struct callback_head *exit_task_work;
> +		struct work_struct exit_work;
> +		struct list_head tctx_list;
> +		struct completion ref_comp;
> +	};
> };
>
> struct io_uring_task {
> --
> 2.31.1
>
On 5/20/21 10:46 PM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>> There is a bunch of scattered around ctx fields that are almost never
>> used, e.g. only on ring exit, plunge them to the end, better locality,
>> better aesthetically.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>> fs/io_uring.c | 36 +++++++++++++++++-------------------
>> 1 file changed, 17 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 9ac5e278a91e..7e3410ce100a 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -367,9 +367,6 @@ struct io_ring_ctx {
>> 	unsigned cached_cq_overflow;
>> 	unsigned long sq_check_overflow;
>>
>> -	/* hashed buffered write serialization */
>> -	struct io_wq_hash *hash_map;
>> -
>> 	struct list_head defer_list;
>> 	struct list_head timeout_list;
>> 	struct list_head cq_overflow_list;
>> @@ -386,9 +383,6 @@ struct io_ring_ctx {
>>
>> 	struct io_rings *rings;
>>
>> -	/* Only used for accounting purposes */
>> -	struct mm_struct *mm_account;
>> -
>> 	const struct cred *sq_creds;	/* cred used for __io_sq_thread() */
>> 	struct io_sq_data *sq_data;	/* if using sq thread polling */
>>
>> @@ -409,14 +403,6 @@ struct io_ring_ctx {
>> 	unsigned nr_user_bufs;
>> 	struct io_mapped_ubuf **user_bufs;
>>
>> -	struct user_struct *user;
>> -
>> -	struct completion ref_comp;
>> -
>> -#if defined(CONFIG_UNIX)
>> -	struct socket *ring_sock;
>> -#endif
>> -
>> 	struct xarray io_buffers;
>>
>> 	struct xarray personalities;
>> @@ -460,12 +446,24 @@ struct io_ring_ctx {
>>
>> 	struct io_restriction restrictions;
>>
>> -	/* exit task_work */
>> -	struct callback_head *exit_task_work;
>> -
>> 	/* Keep this last, we don't need it for the fast path */
>> -	struct work_struct exit_work;
>> -	struct list_head tctx_list;
>> +	struct {
>
> Why do we need an anonymous struct here? For cache line alignment?
> Do we need ____cacheline_aligned_in_smp?

Rather as a visual hint, considering that most of the fields are historically
grouped into ____cacheline_aligned_in_smp structs. It also prepares for
potentially splitting this out of the ctx struct as it grows bigger.

The first 2-3 patches are not strictly related to bpf and will go separately
earlier; the set was just based on them.

>> +	#if defined(CONFIG_UNIX)
>> +		struct socket *ring_sock;
>> +	#endif
>> +		/* hashed buffered write serialization */
>> +		struct io_wq_hash *hash_map;
>> +
>> +		/* Only used for accounting purposes */
>> +		struct user_struct *user;
>> +		struct mm_struct *mm_account;
>> +
>> +		/* ctx exit and cancelation */
>> +		struct callback_head *exit_task_work;
>> +		struct work_struct exit_work;
>> +		struct list_head tctx_list;
>> +		struct completion ref_comp;
>> +	};
>> };
>>
>> struct io_uring_task {

--
Pavel Begunkov
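For readers who have not seen the idiom Pavel refers to, below is a minimal,
self-contained sketch (not the actual io_uring layout; struct and field names
are purely illustrative) of how ____cacheline_aligned_in_smp pushes a group of
cold, exit-only fields onto its own cache line, away from hot submission state:

/*
 * Illustrative sketch only. The cold group is placed behind a
 * cache-line-aligned boundary so it does not share lines with hot fields.
 */
#include <stddef.h>
#include <stdio.h>

#ifndef ____cacheline_aligned_in_smp
#define ____cacheline_aligned_in_smp __attribute__((aligned(64)))
#endif

struct example_ctx {
	/* hot, per-request fields stay at the front */
	unsigned int cached_sq_head;
	unsigned int cached_cq_tail;

	/* cold fields, only touched on ring exit, on their own cache line */
	struct {
		void *hash_map;
		void *mm_account;
		void *exit_task_work;
	} ____cacheline_aligned_in_smp;
};

int main(void)
{
	/* with the 64-byte fallback above, the cold group starts at offset 64 */
	printf("cold group starts at offset %zu\n",
	       offsetof(struct example_ctx, hash_map));
	return 0;
}

With the 64-byte alignment used in the fallback macro, the cold group begins a
full cache line after the hot fields, which is the locality effect the patch
is after even without the explicit annotation.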
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
> io_uring to execute BPF-based requests.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> fs/io_uring.c | 21 +++++++++++++++++++++
> include/linux/bpf_types.h | 2 ++
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/syscall.c | 1 +
> kernel/bpf/verifier.c | 5 ++++-
> 5 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 1a4c9e513ac9..882b16b5e5eb 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -10201,6 +10201,27 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> 	return ret;
> }
>
> +static const struct bpf_func_proto *
> +io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id);
> +}
> +
> +static bool io_bpf_is_valid_access(int off, int size,
> +				   enum bpf_access_type type,
> +				   const struct bpf_prog *prog,
> +				   struct bpf_insn_access_aux *info)
> +{
> +	return false;
> +}
> +
> +const struct bpf_prog_ops bpf_io_uring_prog_ops = {};
> +
> +const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
> +	.get_func_proto = io_bpf_func_proto,
> +	.is_valid_access = io_bpf_is_valid_access,
> +};
> +
> SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> 		void __user *, arg, unsigned int, nr_args)
> {
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 99f7fd657d87..d0b7954887bd 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -77,6 +77,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
> 	      void *, void *)
> #endif /* CONFIG_BPF_LSM */
> #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
> +	      void *, void *)
>
> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4ba4ef0ff63a..de544f0fbeef 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -206,6 +206,7 @@ enum bpf_prog_type {
> 	BPF_PROG_TYPE_EXT,
> 	BPF_PROG_TYPE_LSM,
> 	BPF_PROG_TYPE_SK_LOOKUP,
> +	BPF_PROG_TYPE_IOURING,
> };
>
> enum bpf_attach_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 250503482cda..6ef7a26f4dc3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
> 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
> 	case BPF_PROG_TYPE_SOCK_OPS:
> +	case BPF_PROG_TYPE_IOURING:
> 	case BPF_PROG_TYPE_EXT: /* extends any prog */
> 		return true;
> 	case BPF_PROG_TYPE_CGROUP_SKB:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 0399ac092b36..2a53f44618a7 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
> 	case BPF_PROG_TYPE_SK_LOOKUP:
> 		range = tnum_range(SK_DROP, SK_PASS);
> 		break;
> +	case BPF_PROG_TYPE_IOURING:
> +		range = tnum_const(0);
> +		break;
> 	case BPF_PROG_TYPE_EXT:
> 		/* freplace program can return anything as its return value
> 		 * depends on the to-be-replaced kernel func or bpf program.
> @@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> 	u64 key;
>
> 	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> -	    prog->type != BPF_PROG_TYPE_LSM) {
> +	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {

Is IOURING program sleepable? If so, please highlight that in the commit log
and update the warning below.

> 		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> 		return -EINVAL;
> 	}
> --
> 2.31.1
>
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Add a BPF_FUNC_iouring_queue_sqe BPF function as a demonstration of
> submmiting a new request by a BPF request.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++++----
> include/uapi/linux/bpf.h | 1 +
> 2 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 20fddc5945f2..aae786291c57 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -882,6 +882,7 @@ struct io_defer_entry {
> };
>
> struct io_bpf_ctx {
> +	struct io_ring_ctx *ctx;
> };
>
> struct io_op_def {
> @@ -6681,7 +6682,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> 		ret = -EBADF;
> 	}
>
> -	state->ios_left--;
> +	if (state->ios_left > 1)
> +		state->ios_left--;
> 	return ret;
> }
>
> @@ -10345,10 +10347,50 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> 	return ret;
> }
>
> +BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
> +	   const struct io_uring_sqe *, sqe,
> +	   u32, sqe_len)
> +{
> +	struct io_ring_ctx *ctx = bpf_ctx->ctx;
> +	struct io_kiocb *req;
> +
> +	if (sqe_len != sizeof(struct io_uring_sqe))
> +		return -EINVAL;
> +
> +	req = io_alloc_req(ctx);
> +	if (unlikely(!req))
> +		return -ENOMEM;
> +	if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
> +		kmem_cache_free(req_cachep, req);
> +		return -EAGAIN;
> +	}
> +	percpu_counter_add(&current->io_uring->inflight, 1);
> +	refcount_add(1, &current->usage);
> +
> +	/* returns number of submitted SQEs or an error */
> +	return !io_submit_sqe(ctx, req, sqe);
> +}
> +
> +const struct bpf_func_proto io_bpf_queue_sqe_proto = {
> +	.func = io_bpf_queue_sqe,
> +	.gpl_only = false,
> +	.ret_type = RET_INTEGER,
> +	.arg1_type = ARG_PTR_TO_CTX,
> +	.arg2_type = ARG_PTR_TO_MEM,
> +	.arg3_type = ARG_CONST_SIZE,
> +};
> +
> static const struct bpf_func_proto *
> io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> {
> -	return bpf_base_func_proto(func_id);
> +	switch (func_id) {
> +	case BPF_FUNC_copy_from_user:
> +		return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
> +	case BPF_FUNC_iouring_queue_sqe:
> +		return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
> +	default:
> +		return bpf_base_func_proto(func_id);
> +	}
> }
>
> static bool io_bpf_is_valid_access(int off, int size,
> @@ -10379,9 +10421,10 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
> 		     atomic_read(&req->task->io_uring->in_idle)))
> 		goto done;
>
> -	memset(&bpf_ctx, 0, sizeof(bpf_ctx));
> +	bpf_ctx.ctx = ctx;
> 	prog = req->bpf.prog;
>
> +	io_submit_state_start(&ctx->submit_state, 1);
> 	if (prog->aux->sleepable) {
> 		rcu_read_lock();
> 		bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> 	} else {
> 		bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> 	}
> -
> +	io_submit_state_end(&ctx->submit_state, ctx);
> 	ret = 0;
> done:
> 	__io_req_complete(req, issue_flags, ret, 0);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index de544f0fbeef..cc268f749a7d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4082,6 +4082,7 @@ union bpf_attr {
> 	FN(ima_inode_hash),		\
> 	FN(sock_from_file),		\
> 	FN(check_mtu),			\
> +	FN(iouring_queue_sqe),		\

We need to describe this function in the comment above, just like 20/23 does.

> 	/* */
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.31.1
>
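Helper descriptions in include/uapi/linux/bpf.h follow a fixed
Description/Return layout above the FN() list. A possible entry for this
helper, with wording that is only a guess based on what the patch itself does
(not taken from the series), might look like:

 * long bpf_iouring_queue_sqe(void *ctx, const struct io_uring_sqe *sqe, u32 sqe_len)
 *	Description
 *		Submit the io_uring SQE pointed to by *sqe*, of size *sqe_len*
 *		(which must equal sizeof(struct io_uring_sqe)), on behalf of
 *		the io_uring BPF request identified by *ctx*.
 *	Return
 *		The number of SQEs successfully queued (0 or 1) on success, or
 *		a negative error in case of failure.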
> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> The main problem solved is feeding completion information of other
> requests in a form of CQEs back into BPF. I decided to wire up support
> for multiple completion queues (aka CQs) and give BPF programs access to
> them, so leaving userspace in control over synchronisation that should
> be much more flexible that the link-based approach.
>
> For instance, there can be a separate CQ for each BPF program, so no
> extra sync is needed, and communication can be done by submitting a
> request targeting a neighboring CQ or submitting a CQE there directly
> (see test3 below). CQ is choosen by sqe->cq_idx, so everyone can
> cross-fire if willing.
>
[...]
> bpf: add IOURING program type
> io_uring: implement bpf prog registration
> io_uring: add support for bpf requests
> io_uring: enable BPF to submit SQEs
> io_uring: enable bpf to submit CQEs
> io_uring: enable bpf to reap CQEs
> libbpf: support io_uring
> io_uring: pass user_data to bpf executor
> bpf: Add bpf_copy_to_user() helper
> io_uring: wire bpf copy to user
> io_uring: don't wait on CQ exclusively
> io_uring: enable bpf reqs to wait for CQs

Besides a few comments, these BPF-related patches look sane to me.
Please consider adding some selftests (tools/testing/selftests/bpf).

Thanks,
Song
On 5/21/21 12:34 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
>> io_uring to execute BPF-based requests.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>> fs/io_uring.c | 21 +++++++++++++++++++++
>> include/linux/bpf_types.h | 2 ++
>> include/uapi/linux/bpf.h | 1 +
>> kernel/bpf/syscall.c | 1 +
>> kernel/bpf/verifier.c | 5 ++++-
>> 5 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 1a4c9e513ac9..882b16b5e5eb 100644

[...]

>> +BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
>> +	      void *, void *)
>>
>> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 4ba4ef0ff63a..de544f0fbeef 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -206,6 +206,7 @@ enum bpf_prog_type {
>> 	BPF_PROG_TYPE_EXT,
>> 	BPF_PROG_TYPE_LSM,
>> 	BPF_PROG_TYPE_SK_LOOKUP,
>> +	BPF_PROG_TYPE_IOURING,
>> };
>>
>> enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 250503482cda..6ef7a26f4dc3 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
>> 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>> 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
>> 	case BPF_PROG_TYPE_SOCK_OPS:
>> +	case BPF_PROG_TYPE_IOURING:
>> 	case BPF_PROG_TYPE_EXT: /* extends any prog */
>> 		return true;
>> 	case BPF_PROG_TYPE_CGROUP_SKB:
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 0399ac092b36..2a53f44618a7 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
>> 	case BPF_PROG_TYPE_SK_LOOKUP:
>> 		range = tnum_range(SK_DROP, SK_PASS);
>> 		break;
>> +	case BPF_PROG_TYPE_IOURING:
>> +		range = tnum_const(0);
>> +		break;
>> 	case BPF_PROG_TYPE_EXT:
>> 		/* freplace program can return anything as its return value
>> 		 * depends on the to-be-replaced kernel func or bpf program.
>> @@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>> 	u64 key;
>>
>> 	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
>> -	    prog->type != BPF_PROG_TYPE_LSM) {
>> +	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {
>
> Is IOURING program sleepable? If so, please highlight that in the commit log

It's supposed to work with both sleepable and non-sleepable programs, but with
different sets of helpers, e.g. a program can't submit requests or do
bpf_copy_from_user() if it can't sleep. The only other difference in handling
is RCU around the non-sleepable case, but please shout if I forgot anything.

> and update the warning below.

Sure

>
>> 		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
>> 		return -EINVAL;
>> 	}
>> --
>> 2.31.1
>>
>

--
Pavel Begunkov
On 5/21/21 1:35 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>> The main problem solved is feeding completion information of other
>> requests in a form of CQEs back into BPF. I decided to wire up support
>> for multiple completion queues (aka CQs) and give BPF programs access to
>> them, so leaving userspace in control over synchronisation that should
>> be much more flexible that the link-based approach.
>>
>> For instance, there can be a separate CQ for each BPF program, so no
>> extra sync is needed, and communication can be done by submitting a
>> request targeting a neighboring CQ or submitting a CQE there directly
>> (see test3 below). CQ is choosen by sqe->cq_idx, so everyone can
>> cross-fire if willing.
>>
>
> [...]
>
>> bpf: add IOURING program type
>> io_uring: implement bpf prog registration
>> io_uring: add support for bpf requests
>> io_uring: enable BPF to submit SQEs
>> io_uring: enable bpf to submit CQEs
>> io_uring: enable bpf to reap CQEs
>> libbpf: support io_uring
>> io_uring: pass user_data to bpf executor
>> bpf: Add bpf_copy_to_user() helper
>> io_uring: wire bpf copy to user
>> io_uring: don't wait on CQ exclusively
>> io_uring: enable bpf reqs to wait for CQs
>
> Besides a few comments, these BPF-related patches look sane to me.
> Please consider adding some selftests (tools/testing/selftests/bpf).

The comments are noted. Thanks, Song.

--
Pavel Begunkov
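As a concrete starting point for such a selftest, a minimal program could look
like the sketch below. Only two details are taken from this thread: the
verifier change accepts nothing but a constant 0 return for this program type,
and io_bpf_is_valid_access() rejects direct ctx access. The SEC("iouring")
section name and the libbpf mapping to BPF_PROG_TYPE_IOURING are assumptions,
since the series' libbpf patch is not quoted here.

// Hypothetical sketch of a minimal selftest-style program; the "iouring"
// section name is an assumption, not taken from the quoted patches.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

int runs;	/* read back from userspace via the skeleton */

SEC("iouring")
int count_runs(void *ctx)
{
	/* ctx must not be dereferenced: io_bpf_is_valid_access() returns false */
	runs++;
	return 0;	/* check_return_code() only allows a constant 0 */
}

A selftest in tools/testing/selftests/bpf would then register and run the
program through io_uring and assert that "runs" was incremented.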