Message ID | 20240219130109.341523-1-john.g.garry@oracle.com |
---|---|
Headers | show |
Series | block atomic writes | expand |
On Mon, Feb 19, 2024 at 01:00:59PM +0000, John Garry wrote: > Currently blk_queue_get_max_sectors() is passed a enum req_op. In future > the value returned from blk_queue_get_max_sectors() may depend on certain > request flags, so pass a request pointer. > > Also use rq->cmd_flags instead of rq->bio->bi_opf when possible. Looks good. Reviewed-by: Keith Busch <kbusch@kernel.org>
On Mon, Feb 19, 2024 at 01:01:01PM +0000, John Garry wrote: > From: Prasad Singamsetty <prasad.singamsetty@oracle.com> > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -301,9 +301,12 @@ typedef int __bitwise __kernel_rwf_t; > /* per-IO O_APPEND */ > #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) > > +/* Atomic Write */ > +#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040) Should this be 0x20 so it's the next bit after RWF_APPEND?
On Mon, Feb 19, 2024 at 01:01:08PM +0000, John Garry wrote: > From: Alan Adamson <alan.adamson@oracle.com> > > Add support to set block layer request_queue atomic write limits. The > limits will be derived from either the namespace or controller atomic > parameters. > > NVMe atomic-related parameters are grouped into "normal" and "power-fail" > (or PF) class of parameter. For atomic write support, only PF parameters > are of interest. The "normal" parameters are concerned with racing reads > and writes (which also applies to PF). See NVM Command Set Specification > Revision 1.0d section 2.1.4 for reference. > > Whether to use per namespace or controller atomic parameters is decided by > NSFEAT bit 1 - see Figure 97: Identify - Identify Namespace Data Structure, > #NVM Command Set. > > NVMe namespaces may define an atomic boundary, whereby no atomic guarantees > are provided for a write which straddles this per-lba space boundary. The > block layer merging policy is such that no merges may occur in which the > resultant request would straddle such a boundary. > > Unlike SCSI, NVMe specifies no granularity or alignment rules. In addition, > again unlike SCSI, there is no dedicated atomic write command - a write > which adheres to the atomic size limit and boundary is implicitly atomic. > > If NSFEAT bit 1 is set, the following parameters are of interest: > - NAWUPF (Namespace Atomic Write Unit Power Fail) > - NABSPF (Namespace Atomic Boundary Size Power Fail) > - NABO (Namespace Atomic Boundary Offset) > > and we set request_queue limits as follows: > - atomic_write_unit_max = rounddown_pow_of_two(NAWUPF) > - atomic_write_max_bytes = NAWUPF > - atomic_write_boundary = NABSPF > > If in the unlikely scenario that NABO is non-zero, then atomic writes will > not be supported at all as dealing with this adds extra complexity. This > policy may change in future. > > In all cases, atomic_write_unit_min is set to the logical block size. > > If NSFEAT bit 1 is unset, the following parameter is of interest: > - AWUPF (Atomic Write Unit Power Fail) > > and we set request_queue limits as follows: > - atomic_write_unit_max = rounddown_pow_of_two(AWUPF) > - atomic_write_max_bytes = AWUPF > - atomic_write_boundary = 0 > > The block layer requires that the atomic_write_boundary value is a > power-of-2. However, it is really only required that atomic_write_boundary > be a multiple of atomic_write_unit_max. As such, if NABSPF were not a > power-of-2, atomic_write_unit_max could be reduced such that it was > divisible into NABSPF. However, this complexity will not be yet supported. > > A helper function, nvme_valid_atomic_write(), is also added for the > submission path to verify that a request has been submitted to the driver > will actually be executed atomically. Maybe patch 11 should be folded into this one. No bigged, the series as a whole looks good. Reviewed-by: Keith Busch <kbusch@kernel.org>
On Mon, Feb 19, 2024 at 01:01:02PM +0000, John Garry wrote: > From: Prasad Singamsetty <prasad.singamsetty@oracle.com> > > Extend statx system call to return additional info for atomic write support > support for a file. > > Helper function generic_fill_statx_atomic_writes() can be used by FSes to > fill in the relevant statx fields. > > Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> > #jpg: relocate bdev support to another patch > Signed-off-by: John Garry <john.g.garry@oracle.com> > --- > fs/stat.c | 34 ++++++++++++++++++++++++++++++++++ > include/linux/fs.h | 3 +++ > include/linux/stat.h | 3 +++ > include/uapi/linux/stat.h | 9 ++++++++- > 4 files changed, 48 insertions(+), 1 deletion(-) > > diff --git a/fs/stat.c b/fs/stat.c > index 77cdc69eb422..522787a4ab6a 100644 > --- a/fs/stat.c > +++ b/fs/stat.c > @@ -89,6 +89,37 @@ void generic_fill_statx_attr(struct inode *inode, struct kstat *stat) > } > EXPORT_SYMBOL(generic_fill_statx_attr); > > +/** > + * generic_fill_statx_atomic_writes - Fill in the atomic writes statx attributes > + * @stat: Where to fill in the attribute flags > + * @unit_min: Minimum supported atomic write length > + * @unit_max: Maximum supported atomic write length > + * > + * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from > + * atomic write unit_min and unit_max values. > + */ > +void generic_fill_statx_atomic_writes(struct kstat *stat, > + unsigned int unit_min, > + unsigned int unit_max) > +{ > + /* Confirm that the request type is known */ > + stat->result_mask |= STATX_WRITE_ATOMIC; > + > + /* Confirm that the file attribute type is known */ > + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; > + > + if (unit_min) { > + stat->atomic_write_unit_min = unit_min; > + stat->atomic_write_unit_max = unit_max; > + /* Initially only allow 1x segment */ > + stat->atomic_write_segments_max = 1; > + > + /* Confirm atomic writes are actually supported */ > + stat->attributes |= STATX_ATTR_WRITE_ATOMIC; > + } > +} > +EXPORT_SYMBOL(generic_fill_statx_atomic_writes); What units are these in? Nothing in the patch or commit description tells us.... -Dave.
On Mon, Feb 19, 2024 at 01:01:01PM +0000, John Garry wrote: > @@ -3523,4 +3535,26 @@ extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len, > extern int generic_fadvise(struct file *file, loff_t offset, loff_t len, > int advice); > > +static inline bool atomic_write_valid(loff_t pos, struct iov_iter *iter, > + unsigned int unit_min, unsigned int unit_max) > +{ > + size_t len = iov_iter_count(iter); > + > + if (!iter_is_ubuf(iter)) > + return false; > + > + if (len == unit_min || len == unit_max) { > + /* ok if exactly min or max */ > + } else if (len < unit_min || len > unit_max) { > + return false; > + } else if (!is_power_of_2(len)) { > + return false; > + } This doesn't need if else if else if and it doesn't need to check for exact unit min/max matches. The exact matches require the length to be a power of 2, so the checks are simply: if (len < unit_min || len > unit_max) return false; if (!is_power_of_2(len)) return false; > + if (pos & (len - 1)) > + return false; This has typing issues - 64 bit value, 32 bit mask. probably should use: if (!IS_ALIGNED(pos, len)) return false; -Dave.
On Mon, Feb 19, 2024 at 12:21:14PM -0700, Keith Busch wrote: > Maybe patch 11 should be folded into this one. No bigged, the series as > a whole looks good. I did suggest just that last round already..
On Mon, Feb 19, 2024 at 01:01:07PM +0000, John Garry wrote: > Add initial support for atomic writes. > > As is standard method, feed device properties via modules param, those > being: > - atomic_max_size_blks > - atomic_alignment_blks > - atomic_granularity_blks > - atomic_max_size_with_boundary_blks > - atomic_max_boundary_blks > > These just match sbc4r22 section 6.6.4 - Block limits VPD page. > > We just support ATOMIC WRITE (16). > > The major change in the driver is how we lock the device for RW accesses. > > Currently the driver uses a per-device lock for accessing device metadata > and "media" data (calls to do_device_access()) atomically for the duration > of the whole read/write command. > > This should not suit verifying atomic writes. Reason being that currently > all reads/writes are atomic, so using atomic writes does not prove > anything. > > Change device access model to basis that regular writes only atomic on a > per-sector basis, while reads and atomic writes are fully atomic. > > As mentioned, since accessing metadata and device media is atomic, > continue to have regular writes involving metadata - like discard or PI - > as atomic. We can improve this later. > > Currently we only support model where overlapping going reads or writes > wait for current access to complete before commencing an atomic write. > This is described in 4.29.3.2 section of the SBC. However, we simplify, > things and wait for all accesses to complete (when issuing an atomic > write). > > Signed-off-by: John Garry <john.g.garry@oracle.com> > --- <snip> > +#define DEF_ATOMIC_WR 0 <snip> > +static unsigned int sdebug_atomic_wr = DEF_ATOMIC_WR; <snip> > +MODULE_PARM_DESC(atomic_write, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=1)"); Hi John, The default value here seems to be 0 and not 1. Got me a bit confused while testing :) Regards, ojaswin > MODULE_PARM_DESC(lowest_aligned, "lowest aligned lba (def=0)"); > MODULE_PARM_DESC(lun_format, "LUN format: 0->peripheral (def); 1 --> flat address method"); > MODULE_PARM_DESC(max_luns, "number of LUNs per target to simulate(def=1)"); > @@ -6260,6 +6575,11 @@ MODULE_PARM_DESC(unmap_alignment, "lowest aligned thin provisioning lba (def=0)" > MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)"); > MODULE_PARM_DESC(unmap_max_blocks, "max # of blocks can be unmapped in one cmd (def=0xffffffff)"); > MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)"); > +MODULE_PARM_DESC(atomic_wr_max_length, "max # of blocks can be atomically written in one cmd (def=8192)"); > +MODULE_PARM_DESC(atomic_wr_align, "minimum alignment of atomic write in blocks (def=2)"); > +MODULE_PARM_DESC(atomic_wr_gran, "minimum granularity of atomic write in blocks (def=2)"); > +MODULE_PARM_DESC(atomic_wr_max_length_bndry, "max # of blocks can be atomically written in one cmd with boundary set (def=8192)"); > +MODULE_PARM_DESC(atomic_wr_max_bndry, "max # boundaries per atomic write (def=128)"); > MODULE_PARM_DESC(uuid_ctl, > "1->use uuid for lu name, 0->don't, 2->all use same (def=0)"); > MODULE_PARM_DESC(virtual_gb, "virtual gigabyte (GiB) size (def=0 -> use dev_size_mb)"); > @@ -7406,6 +7726,7 @@ static int __init scsi_debug_init(void) > return -EINVAL; > } > } > + > xa_init_flags(per_store_ap, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ); > if (want_store) { > idx = sdebug_add_store(); > @@ -7613,7 +7934,9 @@ static int sdebug_add_store(void) > map_region(sip, 0, 2); > } > > - rwlock_init(&sip->macc_lck); > + rwlock_init(&sip->macc_data_lck); > + rwlock_init(&sip->macc_meta_lck); > + rwlock_init(&sip->macc_sector_lck); > return (int)n_idx; > err: > sdebug_erase_store((int)n_idx, sip); > -- > 2.31.1 >
On 19/02/2024 19:16, David Sterba wrote: > On Mon, Feb 19, 2024 at 01:01:01PM +0000, John Garry wrote: >> From: Prasad Singamsetty<prasad.singamsetty@oracle.com> >> --- a/include/uapi/linux/fs.h >> +++ b/include/uapi/linux/fs.h >> @@ -301,9 +301,12 @@ typedef int __bitwise __kernel_rwf_t; >> /* per-IO O_APPEND */ >> #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) >> >> +/* Atomic Write */ >> +#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040) > Should this be 0x20 so it's the next bit after RWF_APPEND? Support for new flag RWF_NOAPPEND - which has value 0x20 - has been picked up on the vfs tree for 6.9 . I had been just basing my series on Linus' release so far while also trying to anticipate any merge issue. Thanks, John
On 20/02/2024 06:55, Christoph Hellwig wrote: > On Mon, Feb 19, 2024 at 12:21:14PM -0700, Keith Busch wrote: >> Maybe patch 11 should be folded into this one. No bigged, the series as >> a whole looks good. > > I did suggest just that last round already.. > I created the helper function and folded it in, but not the callsite. Not folding in the callsite change did not make sense to me, but that was my understanding of the request. Anyway, I can fold everything into this patch. Thanks, John
> +EXPORT_SYMBOL(generic_fill_statx_atomic_writes);
EXPORT_SYMBOL_GPL for any new feature, please.
Thanks for writing a good commit message! > NVMe namespaces may define an atomic boundary, whereby no atomic guarantees > are provided for a write which straddles this per-lba space boundary. The > block layer merging policy is such that no merges may occur in which the > resultant request would straddle such a boundary. > > Unlike SCSI, NVMe specifies no granularity or alignment rules. Well, the boundary really is sort of a granularity and alignment, isn't it?
On 20/02/2024 08:31, Christoph Hellwig wrote: > Thanks for writing a good commit message! > >> NVMe namespaces may define an atomic boundary, whereby no atomic guarantees >> are provided for a write which straddles this per-lba space boundary. The >> block layer merging policy is such that no merges may occur in which the >> resultant request would straddle such a boundary. >> >> Unlike SCSI, NVMe specifies no granularity or alignment rules. > > Well, the boundary really is sort of a granularity and alignment, > isn't it? NVMe does indeed have the boundary rule, but it is not really the same as SCSI granularity and alignment. Anyway, I can word that statement to be clearer. Thanks, John
On 20/02/2024 07:12, Ojaswin Mujoo wrote: >> +MODULE_PARM_DESC(atomic_write, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=1)"); > Hi John, > > The default value here seems to be 0 and not 1. Got me a bit confused > while testing 🙂 I can fix that MODULE_PARM_DESC() text. I don't think that many disks support atomic writes, so I was leaving this disabled by default. Thanks, John
On 20/02/2024 08:20, Christoph Hellwig wrote: >> +EXPORT_SYMBOL(generic_fill_statx_atomic_writes); > EXPORT_SYMBOL_GPL for any new feature, please. ok, thanks.
On 19/02/2024 22:28, Dave Chinner wrote: >> >> +/** >> + * generic_fill_statx_atomic_writes - Fill in the atomic writes statx attributes >> + * @stat: Where to fill in the attribute flags >> + * @unit_min: Minimum supported atomic write length >> + * @unit_max: Maximum supported atomic write length >> + * >> + * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from >> + * atomic write unit_min and unit_max values. >> + */ >> +void generic_fill_statx_atomic_writes(struct kstat *stat, >> + unsigned int unit_min, >> + unsigned int unit_max) >> +{ >> + /* Confirm that the request type is known */ >> + stat->result_mask |= STATX_WRITE_ATOMIC; >> + >> + /* Confirm that the file attribute type is known */ >> + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; >> + >> + if (unit_min) { >> + stat->atomic_write_unit_min = unit_min; >> + stat->atomic_write_unit_max = unit_max; >> + /* Initially only allow 1x segment */ >> + stat->atomic_write_segments_max = 1; >> + >> + /* Confirm atomic writes are actually supported */ >> + stat->attributes |= STATX_ATTR_WRITE_ATOMIC; >> + } >> +} >> +EXPORT_SYMBOL(generic_fill_statx_atomic_writes); > What units are these in? Nothing in the patch or commit description > tells us.... I can append the current comments to mention that the unit is bytes. Thanks, John
On 19/02/2024 22:44, Dave Chinner wrote: > On Mon, Feb 19, 2024 at 01:01:01PM +0000, John Garry wrote: >> @@ -3523,4 +3535,26 @@ extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len, >> extern int generic_fadvise(struct file *file, loff_t offset, loff_t len, >> int advice); >> >> +static inline bool atomic_write_valid(loff_t pos, struct iov_iter *iter, >> + unsigned int unit_min, unsigned int unit_max) >> +{ >> + size_t len = iov_iter_count(iter); >> + >> + if (!iter_is_ubuf(iter)) >> + return false; >> + >> + if (len == unit_min || len == unit_max) { >> + /* ok if exactly min or max */ >> + } else if (len < unit_min || len > unit_max) { >> + return false; >> + } else if (!is_power_of_2(len)) { >> + return false; >> + } > This doesn't need if else if else if and it doesn't need to check > for exact unit min/max matches. This is fastpath code, and I thought it quicker to just check if min/max first. Based on recent discussions, for FS support I expect typically len == unit_max. But I can change to your simpler checking and later change to the current method if those FS assumptions hold true. > The exact matches require the > length to be a power of 2, so the checks are simply: > > if (len < unit_min || len > unit_max) > return false; > if (!is_power_of_2(len)) > return false; > >> + if (pos & (len - 1)) >> + return false; > This has typing issues - 64 bit value, 32 bit mask. probably should > use: > > if (!IS_ALIGNED(pos, len)) > return false; ok, good idea. Thanks, John