diff mbox series

[v4,2/2] scsi: sd: Set a default optimal IO size if one is not defined

Message ID 20250613062909.2505759-3-dlemoal@kernel.org
State Superseded
Headers show
Series Improve optimal IO size initialization | expand

Commit Message

Damien Le Moal June 13, 2025, 6:29 a.m. UTC
Introduce the helper function sd_set_io_opt() to set a disk io_opt
limit. This new way of setting this limit falls back to using the
max_sectors limit if the host does not define an optimal sector limit
and the device did not indicate an optimal transfer size (e.g. as is
the case for ATA devices). io_opt calculation is done using a local
64-bits variable to avoid overflows. The final value is clamped to
UINT_MAX aligned down to the device physical block size.

This fallback io_opt limit avoids setting up the disk with a zero
io_opt limit, which result in the rather small 128 KB read_ahead_kb
attribute. The larger read_ahead_kb value set with the default non-zero
io_opt limit significantly improves buffered read performance with file
systems without any intervention from the user.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

Comments

Damien Le Moal June 16, 2025, 5:34 a.m. UTC | #1
On 6/13/25 23:31, John Garry wrote:
> On 13/06/2025 07:29, Damien Le Moal wrote:
>> Introduce the helper function sd_set_io_opt() to set a disk io_opt
>> limit. This new way of setting this limit falls back to using the
>> max_sectors limit if the host does not define an optimal sector limit
>> and the device did not indicate an optimal transfer size (e.g. as is
>> the case for ATA devices). io_opt calculation is done using a local
>> 64-bits variable to avoid overflows. The final value is clamped to
>> UINT_MAX aligned down to the device physical block size.
>>
>> This fallback io_opt limit avoids setting up the disk with a zero
>> io_opt limit, which result in the rather small 128 KB read_ahead_kb
>> attribute. The larger read_ahead_kb value set with the default non-zero
>> io_opt limit significantly improves buffered read performance with file
>> systems without any intervention from the user.
> 
> Out of curiosity, why do this just for sd.c and not always set up the 
> default like this in blk_validate_limits()?

Good point. Though I think we do not want to have a large io_opt for slow
devices like MMC/SD Cards. So something like this, which is indeed simpler than
hacking lim->io_opt in sd.c.

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..d3ec6f4100f4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -58,16 +58,24 @@ EXPORT_SYMBOL(blk_set_stacking_limits);
 void blk_apply_bdi_limits(struct backing_dev_info *bdi,
                struct queue_limits *lim)
 {
+       u64 io_opt = lim->io_opt;
+
        /*
         * For read-ahead of large files to be effective, we need to read ahead
-        * at least twice the optimal I/O size.
+        * at least twice the optimal I/O size. For rotational devices that do
+        * not report an optimal I/O size (e.g. ATA HDDs), use the maximum I/O
+        * size to avoid falling back to the (rather inefficient) small default
+        * read-ahead size.
         *
         * There is no hardware limitation for the read-ahead size and the user
         * might have increased the read-ahead size through sysfs, so don't ever
         * decrease it.
         */
+       if (!io_opt && (lim->features & BLK_FEAT_ROTATIONAL))
+               io_opt = lim->max_sectors;
+
        bdi->ra_pages = max3(bdi->ra_pages,
-                               lim->io_opt * 2 / PAGE_SIZE,
+                               io_opt * 2 >> PAGE_SHIFT,
                                VM_READAHEAD_PAGES);
        bdi->io_pages = lim->max_sectors >> PAGE_SECTORS_SHIFT;
 }

I will make a proper patch of this and send it out as a replacement.
Damien Le Moal June 16, 2025, 6:26 a.m. UTC | #2
On 6/16/25 14:34, Damien Le Moal wrote:
> On 6/13/25 23:31, John Garry wrote:
>> On 13/06/2025 07:29, Damien Le Moal wrote:
>>> Introduce the helper function sd_set_io_opt() to set a disk io_opt
>>> limit. This new way of setting this limit falls back to using the
>>> max_sectors limit if the host does not define an optimal sector limit
>>> and the device did not indicate an optimal transfer size (e.g. as is
>>> the case for ATA devices). io_opt calculation is done using a local
>>> 64-bits variable to avoid overflows. The final value is clamped to
>>> UINT_MAX aligned down to the device physical block size.
>>>
>>> This fallback io_opt limit avoids setting up the disk with a zero
>>> io_opt limit, which result in the rather small 128 KB read_ahead_kb
>>> attribute. The larger read_ahead_kb value set with the default non-zero
>>> io_opt limit significantly improves buffered read performance with file
>>> systems without any intervention from the user.
>>
>> Out of curiosity, why do this just for sd.c and not always set up the 
>> default like this in blk_validate_limits()?
> 
> Good point. Though I think we do not want to have a large io_opt for slow
> devices like MMC/SD Cards. So something like this, which is indeed simpler than
> hacking lim->io_opt in sd.c.
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a000daafbfb4..d3ec6f4100f4 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -58,16 +58,24 @@ EXPORT_SYMBOL(blk_set_stacking_limits);
>  void blk_apply_bdi_limits(struct backing_dev_info *bdi,
>                 struct queue_limits *lim)
>  {
> +       u64 io_opt = lim->io_opt;
> +
>         /*
>          * For read-ahead of large files to be effective, we need to read ahead
> -        * at least twice the optimal I/O size.
> +        * at least twice the optimal I/O size. For rotational devices that do
> +        * not report an optimal I/O size (e.g. ATA HDDs), use the maximum I/O
> +        * size to avoid falling back to the (rather inefficient) small default
> +        * read-ahead size.
>          *
>          * There is no hardware limitation for the read-ahead size and the user
>          * might have increased the read-ahead size through sysfs, so don't ever
>          * decrease it.
>          */
> +       if (!io_opt && (lim->features & BLK_FEAT_ROTATIONAL))
> +               io_opt = lim->max_sectors;

Oops... This should of course be:

		io_opt = (u64)lim->max_sectors << SECTOR_SHIFT;

> +
>         bdi->ra_pages = max3(bdi->ra_pages,
> -                               lim->io_opt * 2 / PAGE_SIZE,
> +                               io_opt * 2 >> PAGE_SHIFT,
>                                 VM_READAHEAD_PAGES);
>         bdi->io_pages = lim->max_sectors >> PAGE_SECTORS_SHIFT;
>  }
> 
> I will make a proper patch of this and send it out as a replacement.
>
diff mbox series

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index daddef2e9e87..8070356285a7 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3681,6 +3681,40 @@  static void sd_read_block_zero(struct scsi_disk *sdkp)
 	kfree(buffer);
 }
 
+/*
+ * Set the optimal I/O size: limit the default to the SCSI host optimal sector
+ * limit if it is set. There may be an impact on performance when the size of
+ * a request exceeds this host limit. If the host did not set any optimal
+ * sector limit and the device did not indicate an optimal transfer size
+ * (e.g. ATA devices), default to using the device max_sectors limit.
+ */
+static void sd_set_io_opt(struct scsi_disk *sdkp, unsigned int dev_max,
+			  struct queue_limits *lim)
+{
+	struct scsi_device *sdp = sdkp->device;
+	struct Scsi_Host *shost = sdp->host;
+	u64 io_opt;
+
+	io_opt = (u64)shost->opt_sectors << SECTOR_SHIFT;
+	if (sd_validate_opt_xfer_size(sdkp, dev_max))
+		io_opt = min_not_zero(io_opt,
+				logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
+	if (io_opt) {
+		lim->io_opt = ALIGN_DOWN(min_t(u64, io_opt, UINT_MAX),
+					 sdkp->physical_block_size - 1);
+		return;
+	}
+
+	/* Set default */
+	io_opt = (u64)lim->max_sectors << SECTOR_SHIFT;
+	lim->io_opt = ALIGN_DOWN(min_t(u64, io_opt, UINT_MAX),
+				 sdkp->physical_block_size - 1);
+
+	sd_first_printk(KERN_INFO, sdkp,
+			"Using default optimal transfer size of %u bytes\n",
+			lim->io_opt);
+}
+
 /**
  *	sd_revalidate_disk - called the first time a new disk is seen,
  *	performs disk spin up, read_capacity, etc.
@@ -3777,16 +3811,7 @@  static int sd_revalidate_disk(struct gendisk *disk)
 	else
 		lim.io_min = 0;
 
-	/*
-	 * Limit default to SCSI host optimal sector limit if set. There may be
-	 * an impact on performance for when the size of a request exceeds this
-	 * host limit.
-	 */
-	lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT;
-	if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
-		lim.io_opt = min_not_zero(lim.io_opt,
-				logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
-	}
+	sd_set_io_opt(sdkp, dev_max, &lim);
 
 	sdkp->first_scan = 0;