Message ID | 20250318073545.3518707-2-yi.zhang@huaweicloud.com |
---|---|
State | New |
Series | fallocate: introduce FALLOC_FL_WRITE_ZEROES flag |
On Tue, Mar 18, 2025 at 03:35:36PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Currently, disks primarily implement the write zeroes command (aka
> REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
> physically writing zeros to the disk media (e.g., HDDs), while the
> second performs an unmap operation on the logical blocks, effectively
> putting them into a deallocated state (e.g., SSDs). The first method is
> generally slow, while the second method is typically very fast.
>
> For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
> REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
> the write zeros operation by placing disk blocks into

Note that this is a can, not a must. The NVMe definition of Write
Zeroes is unfortunately pretty stupid.

> +		[RO] Devices that explicitly support the unmap write zeroes
> +		operation in which a single write zeroes request with the unmap
> +		bit set to zero out the range of contiguous blocks on storage
> +		by freeing blocks, rather than writing physical zeroes to the
> +		media.

This is not actually guaranteed for nvme or scsi.

On 2025/4/10 16:20, Keith Busch wrote:
> On Thu, Apr 10, 2025 at 09:15:59AM +0200, Christoph Hellwig wrote:
>> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
>>>
>>> Thank you for your review and comments. However, I'm not sure I fully
>>> understand your points. Could you please provide more details?
>>>
>>> AFAIK, the NVMe protocol has the following description in the latest
>>> NVM Command Set Specification Figure 82 and Figure 114:
>>>
>>> ===
>>> Deallocate (DEAC): If this bit is set to '1', then the host is
>>> requesting that the controller deallocate the specified logical blocks.
>>> If this bit is cleared to '0', then the host is not requesting that
>>> the controller deallocate the specified logical blocks...
>>>
>>> DLFEAT:
>>> Write Zeroes Deallocation Support (WZDS): If this bit is set to '1',
>>> then the controller supports the Deallocate bit in the Write Zeroes
>>> command for this namespace...
>>
>> Yes. The host is requesting, not the controller shall. It's not
>> guaranteed behavior and the controller might as well actually write
>> zeroes to the media. That is rather stupid, but still.
>
> I guess some controllers _really_ want specific alignments to
> successfully do a proper discard. While still not guaranteed in spec, I
> think it is safe to assume a proper deallocation will occur if you align
> to NPDA and NPDG. Otherwise, the controller may do a read-modify-write
> to ensure zeroes are returned for the requested LBA range on anything
> that straddles an implementation specific boundary.
>

I understand. A proper deallocation has certain constraints, but I guess
it should be useful for most scenarios. Thank you for the explanation.

Thanks,
Yi.

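The alignment point raised in the reply above can be illustrated with a small userspace sketch. It is not kernel or driver code. The helper name `range_is_dealloc_aligned` is hypothetical, and `align` and `gran` are assumed to be values already derived from the namespace's NPDA and NPDG fields and expressed as logical-block counts. Even when the check passes, deallocation is still only likely, not guaranteed by the spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not an existing API).
 *
 * slba:  starting logical block address of the Write Zeroes range
 * nlb:   number of logical blocks in the range
 * align: preferred deallocate alignment, in logical blocks (assumed to be
 *        derived from NPDA by the caller)
 * gran:  preferred deallocate granularity, in logical blocks (assumed to be
 *        derived from NPDG by the caller)
 *
 * Returns true when the range starts on an alignment boundary and covers a
 * whole number of granules, i.e. when no part of it straddles a boundary
 * that might force the controller into a read-modify-write.
 */
static bool range_is_dealloc_aligned(uint64_t slba, uint64_t nlb,
				     uint64_t align, uint64_t gran)
{
	/* Zero alignment or granularity would make the modulo undefined. */
	assert(align != 0 && gran != 0);

	if (nlb == 0)
		return false;

	return (slba % align) == 0 && (nlb % gran) == 0;
}
```

With an example alignment and granularity of 8 blocks, a range starting at block 0 with 64 blocks passes the check, while a range starting at block 4, or one with 60 blocks, does not.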
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 890cde28bf90..67513c0d9233 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -742,6 +742,20 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] Devices that explicitly support the unmap write zeroes
+		operation in which a single write zeroes request with the unmap
+		bit set to zero out the range of contiguous blocks on storage
+		by freeing blocks, rather than writing physical zeroes to the
+		media. If write_zeroes_unmap is 1, this indicates that the
+		device explicitly supports the write zero command. Otherwise,
+		the device either does not support it, or its support status is
+		unknown.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6b2dbe645d23..3331d07bd5d9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -697,6 +697,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -819,6 +821,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b, start);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d584461a1d84..6f00e9a8f8b6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -261,6 +261,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
 
 static ssize_t queue_poll_show(struct gendisk *disk, char *page)
 {
@@ -510,6 +511,7 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -656,6 +658,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..5d280c7fba65 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -342,6 +342,9 @@ typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */