Message ID | 20241016112912.63542-8-anuj20.g@samsung.com |
---|---|
State | New |
Headers | show |
Series | Read/Write with meta/integrity | expand |
s/meta/metadata/ in the subject. > + const struct io_uring_meta *md = (struct io_uring_meta *)sqe->big_sqe_cmd; Overly long line. > + if (!meta_type) > + return 0; > + if (!(meta_type & META_TYPE_INTEGRITY)) > + return -EINVAL; What is the meta_type for? To distintinguish PI from non-PI metadata? Why doesn't this support non-PI metadata? Also PI or TO_PI might be a better name than the rather generic integrity. (but I'll defer to Martin if he has any good arguments for naming here). > static bool need_complete_io(struct io_kiocb *req) > { > + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); > + > + /* Exclude meta IO as we don't support partial completion for that */ > return req->flags & REQ_F_ISREG || > - S_ISBLK(file_inode(req->file)->i_mode); > + S_ISBLK(file_inode(req->file)->i_mode) || > + !(rw->kiocb.ki_flags & IOCB_HAS_METADATA); > } What partial ocmpletions aren't supported? Note that this would trigger easily as right now metadata is only added for block devices anyway. > + if (unlikely(kiocb->ki_flags & IOCB_HAS_METADATA)) { For a workload using metadata this is everything but unlikely. Is there a specific reason you're trying to override the existing branch predictor here (although on at least x86_64 gcc these kinds of unlikely calls tend to be no-ops anyway). > + struct io_async_rw *io = req->async_data; > + > + if (!(req->file->f_flags & O_DIRECT)) > + return -EOPNOTSUPP; I guess you are forcing this here rather in the file oerations instance because the union of the field in struct io_async_rw. Please add comment about that. > + /* wpq is for buffered io, while meta fields are used with direct io*/ missing whitespace before the closing of the comment.
On 10/17/24 2:10 AM, Christoph Hellwig wrote: > s/meta/metadata/ in the subject. > >> + const struct io_uring_meta *md = (struct io_uring_meta *)sqe->big_sqe_cmd; > > Overly long line. That's fine in io_uring, I prefer slightly longer lines rather than needlessly breaking them up.
> What is the meta_type for? To distintinguish PI from non-PI metadata? meta_type field is kept so that meta_types beyond integrity can also be supported in future. Pavel suggested this to Kanchan when this was discussed in LSF/MM. > Why doesn't this support non-PI metadata? It supports that. We have tested that (pi_type = 0 case). > Also PI or TO_PI might be > a better name than the rather generic integrity. (but I'll defer to > Martin if he has any good arguments for naming here). Open to a different/better name. > > > static bool need_complete_io(struct io_kiocb *req) > > { > > + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); > > + > > + /* Exclude meta IO as we don't support partial completion for that */ > > return req->flags & REQ_F_ISREG || > > - S_ISBLK(file_inode(req->file)->i_mode); > > + S_ISBLK(file_inode(req->file)->i_mode) || > > + !(rw->kiocb.ki_flags & IOCB_HAS_METADATA); > > } > > What partial ocmpletions aren't supported? Note that this would > trigger easily as right now metadata is only added for block devices > anyway. It seems that this scenario is less likely to happen. The plumbing seemed a bit non trivial. I have the plan to look at it, once the initial version of this series goes in. > > > + if (unlikely(kiocb->ki_flags & IOCB_HAS_METADATA)) { > > For a workload using metadata this is everything but unlikely. Is > there a specific reason you're trying to override the existing > branch predictor here (although on at least x86_64 gcc these kinds > of unlikely calls tend to be no-ops anyway). The branch predictions were added to make it a bit friendly for non-metadata read/write case.
Christoph, >> + if (!meta_type) >> + return 0; >> + if (!(meta_type & META_TYPE_INTEGRITY)) >> + return -EINVAL; > > What is the meta_type for? To distintinguish PI from non-PI metadata? > Why doesn't this support non-PI metadata? Also PI or TO_PI might be > a better name than the rather generic integrity. (but I'll defer to > Martin if he has any good arguments for naming here). It should probably be "META_TYPE_T10_PI". "Integrity" was meant as a protocol-agnostic name since there were other proposed protection information schemes being discussed in the standards at the time. I didn't want to limit the block layer changes to one particular storage protocol. NVMe implements features that are not defined by T10 PI such as the storage tag and the larger CRCs. But despite those, NVMe still follows the defined T10 PI model. So I think "META_TYPE_T10_PI" is fairly accurate.
On Mon, Oct 21, 2024 at 11:01:10AM +0530, Anuj Gupta wrote: > > What is the meta_type for? To distintinguish PI from non-PI metadata? > > meta_type field is kept so that meta_types beyond integrity can also > be supported in future. Pavel suggested this to Kanchan when this was > discussed in LSF/MM. > > > Why doesn't this support non-PI metadata? > > It supports that. We have tested that (pi_type = 0 case). What other metadata except for PI and plain non-integrity data do you plan to support? This seems like a weird field. In doubt just leave reserved space that is checked for 0 instead of adding an encondig that doesn't make much sense. If we actually do end up with a metadata scheme we can't encode into the pi_type we can still use that reserved space. > > > Also PI or TO_PI might be > > a better name than the rather generic integrity. (but I'll defer to > > Martin if he has any good arguments for naming here). > > Open to a different/better name. metadata? > > > + /* Exclude meta IO as we don't support partial completion for that */ > > > return req->flags & REQ_F_ISREG || > > > - S_ISBLK(file_inode(req->file)->i_mode); > > > + S_ISBLK(file_inode(req->file)->i_mode) || > > > + !(rw->kiocb.ki_flags & IOCB_HAS_METADATA); > > > } > > > > What partial ocmpletions aren't supported? Note that this would > > trigger easily as right now metadata is only added for block devices > > anyway. > > It seems that this scenario is less likely to happen. The plumbing > seemed a bit non trivial. I have the plan to look at it, once the > initial version of this series goes in. I still don't understand what this is trying to do, especially as it is dead code with the checks above. > > > > > + if (unlikely(kiocb->ki_flags & IOCB_HAS_METADATA)) { > > > > For a workload using metadata this is everything but unlikely. Is > > there a specific reason you're trying to override the existing > > branch predictor here (although on at least x86_64 gcc these kinds > > of unlikely calls tend to be no-ops anyway). > > The branch predictions were added to make it a bit friendly for > non-metadata read/write case. Throwing in these hints unless you have numbers justifying them is not a good idea.
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 86cb385fe0b5..1cd165720fcc 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -105,6 +105,32 @@ struct io_uring_sqe { */ __u8 cmd[0]; }; + /* + * If the ring is initialized with IORING_SETUP_SQE128, then + * this field is starting offset for 64 bytes of data. For meta io + * this contains 'struct io_uring_meta' + */ + __u8 big_sqe_cmd[0]; +}; + +enum io_uring_sqe_meta_type_bits { + META_TYPE_INTEGRITY_BIT, + /* not a real meta type; just to make sure that we don't overflow */ + META_TYPE_LAST_BIT, +}; + +/* meta type flags */ +#define META_TYPE_INTEGRITY (1U << META_TYPE_INTEGRITY_BIT) + +/* this goes to SQE128 */ +struct io_uring_meta { + __u16 meta_type; + __u16 meta_flags; + __u32 meta_len; + __u64 meta_addr; + __u32 seed; + __u16 app_tag; + __u8 pad[42]; }; /* diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index d7ad4ea5f40b..e5551e2e7bde 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3857,6 +3857,12 @@ static int __init io_uring_init(void) /* top 8bits are for internal use */ BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0); + BUILD_BUG_ON(sizeof(struct io_uring_meta) > + sizeof(struct io_uring_sqe)); + + BUILD_BUG_ON(META_TYPE_LAST_BIT > + 8 * sizeof_field(struct io_uring_meta, meta_type)); + io_uring_optable_init(); /* diff --git a/io_uring/rw.c b/io_uring/rw.c index 80ae3c2ebb70..b727e5ef19fc 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -257,6 +257,49 @@ static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import) return 0; } +static inline void io_meta_save_state(struct io_async_rw *io) +{ + io->meta_state.seed = io->meta.seed; + iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta); +} + +static inline void io_meta_restore(struct io_async_rw *io) +{ + io->meta.seed = io->meta_state.seed; + iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta); +} + +static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_rw *rw, int ddir) +{ + const struct io_uring_meta *md = (struct io_uring_meta *)sqe->big_sqe_cmd; + u16 meta_type = READ_ONCE(md->meta_type); + const struct io_issue_def *def; + struct io_async_rw *io; + int ret; + + if (!meta_type) + return 0; + if (!(meta_type & META_TYPE_INTEGRITY)) + return -EINVAL; + + def = &io_issue_defs[req->opcode]; + if (def->vectored) + return -EOPNOTSUPP; + + io = req->async_data; + io->meta.flags = READ_ONCE(md->meta_flags); + io->meta.app_tag = READ_ONCE(md->app_tag); + io->meta.seed = READ_ONCE(md->seed); + ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(md->meta_addr)), + READ_ONCE(md->meta_len), &io->meta.iter); + if (unlikely(ret < 0)) + return ret; + rw->kiocb.ki_flags |= IOCB_HAS_METADATA; + io_meta_save_state(io); + return ret; +} + static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, int ddir, bool do_import) { @@ -279,11 +322,18 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, rw->kiocb.ki_ioprio = get_current_ioprio(); } rw->kiocb.dio_complete = NULL; + rw->kiocb.ki_flags = 0; rw->addr = READ_ONCE(sqe->addr); rw->len = READ_ONCE(sqe->len); rw->flags = READ_ONCE(sqe->rw_flags); - return io_prep_rw_setup(req, ddir, do_import); + ret = io_prep_rw_setup(req, ddir, do_import); + + if (unlikely(ret)) + return ret; + if (unlikely(req->ctx->flags & IORING_SETUP_SQE128)) + ret = io_prep_rw_meta(req, sqe, rw, ddir); + return ret; } int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -410,7 +460,10 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req) static void io_resubmit_prep(struct io_kiocb *req) { struct io_async_rw *io = req->async_data; + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); + if (unlikely(rw->kiocb.ki_flags & IOCB_HAS_METADATA)) + io_meta_restore(io); iov_iter_restore(&io->iter, &io->iter_state); } @@ -777,8 +830,12 @@ static inline int io_iter_do_read(struct io_rw *rw, struct iov_iter *iter) static bool need_complete_io(struct io_kiocb *req) { + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); + + /* Exclude meta IO as we don't support partial completion for that */ return req->flags & REQ_F_ISREG || - S_ISBLK(file_inode(req->file)->i_mode); + S_ISBLK(file_inode(req->file)->i_mode) || + !(rw->kiocb.ki_flags & IOCB_HAS_METADATA); } static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) @@ -795,7 +852,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) if (!(req->flags & REQ_F_FIXED_FILE)) req->flags |= io_file_get_flags(file); - kiocb->ki_flags = file->f_iocb_flags; + kiocb->ki_flags |= file->f_iocb_flags; ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type); if (unlikely(ret)) return ret; @@ -824,6 +881,14 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) kiocb->ki_complete = io_complete_rw; } + if (unlikely(kiocb->ki_flags & IOCB_HAS_METADATA)) { + struct io_async_rw *io = req->async_data; + + if (!(req->file->f_flags & O_DIRECT)) + return -EOPNOTSUPP; + kiocb->private = &io->meta; + } + return 0; } @@ -898,6 +963,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags) * manually if we need to. */ iov_iter_restore(&io->iter, &io->iter_state); + if (unlikely(kiocb->ki_flags & IOCB_HAS_METADATA)) + io_meta_restore(io); do { /* @@ -1102,6 +1169,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) } else { ret_eagain: iov_iter_restore(&io->iter, &io->iter_state); + if (unlikely(kiocb->ki_flags & IOCB_HAS_METADATA)) + io_meta_restore(io); if (kiocb->ki_flags & IOCB_WRITE) io_req_end_write(req); return -EAGAIN; diff --git a/io_uring/rw.h b/io_uring/rw.h index 3f432dc75441..af9e338b2bd8 100644 --- a/io_uring/rw.h +++ b/io_uring/rw.h @@ -1,6 +1,12 @@ // SPDX-License-Identifier: GPL-2.0 #include <linux/pagemap.h> +#include <linux/bio-integrity.h> + +struct io_meta_state { + u32 seed; + struct iov_iter_state iter_meta; +}; struct io_async_rw { size_t bytes_done; @@ -9,7 +15,14 @@ struct io_async_rw { struct iovec fast_iov; struct iovec *free_iovec; int free_iov_nr; - struct wait_page_queue wpq; + /* wpq is for buffered io, while meta fields are used with direct io*/ + union { + struct wait_page_queue wpq; + struct { + struct uio_meta meta; + struct io_meta_state meta_state; + }; + }; }; int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);