Message ID: 20230302144330.274947-1-ulf.hansson@linaro.org
State: New
Series: [RFC] mmc: core: Disable REQ_FUA if the eMMC supports an internal cache
On 02.03.23 3:43 PM, Ulf Hansson wrote:
> REQ_FUA is in general supported for eMMC cards, which translates into so
> called "reliable writes". To support these write operations, the CMD23
> (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common
> but not always the case.
>
> For some eMMC devices, it has been reported that reliable writes are quite
> costly, leading to performance degradations.
>
> In a way to improve the situation, let's avoid announcing REQ_FUA support
> if the eMMC supports an internal cache, as that allows us to rely solely on
> flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
> Note that, those mmc hosts that lacks CMD23 support are already using this
> type of configuration, whatever that could mean.
>
> Reported-by: Wenchao Chen <wenchao.chen666@gmail.com>
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

Acked-by: Bean Huo <beanhuo@micron.com>
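For context, the change under discussion is about what the mmc block driver advertises to the block layer. A minimal sketch of the idea, assuming the mainline helpers blk_queue_write_cache() and mmc_cache_enabled() (the function name and plumbing below are illustrative, not the literal hunk from the RFC):

/* Sketch only: advertise FUA solely when the eMMC has no internal cache;
 * otherwise rely on flush requests (REQ_OP_FLUSH) alone. */
static void mmc_blk_setup_cache_fua(struct mmc_card *card,
				    struct request_queue *q,
				    bool reliable_write_possible)
{
	bool cache = mmc_cache_enabled(card->host);
	bool fua = reliable_write_possible && !cache;	/* the RFC's core idea */

	blk_queue_write_cache(q, cache, fua);
}

When FUA is not advertised, the block layer turns a REQ_FUA write into a regular write followed by a flush, which is the "cache, no FUA" configuration benchmarked later in this thread.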
> Subject: [RFC PATCH] mmc: core: Disable REQ_FUA if the eMMC supports an internal cache

Hey Uffe,

I may have the setup to play with this. (Powerfailing eMMC devices)

> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
>
> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
>
> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.

Just note that reliable write is strictly weaker than turning the cache off/flushing: if the card loses power during cache off/flush programming/busy, sector-wise atomicity is not mandated by the spec.
(And that is assuming cache off/flush is actually respected by the card as intended by the spec; should some cards be checked?) Maybe some FS people can also chime in?

Regards,
Christian
> On 02.03.23 3:43 PM, Ulf Hansson wrote: > > REQ_FUA is in general supported for eMMC cards, which translates into > > so called "reliable writes". To support these write operations, the > > CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, > > which is common but not always the case. > > > > For some eMMC devices, it has been reported that reliable writes are > > quite costly, leading to performance degradations. > > > > In a way to improve the situation, let's avoid announcing REQ_FUA > > support if the eMMC supports an internal cache, as that allows us to > > rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a > lot cheaper. > > Note that, those mmc hosts that lacks CMD23 support are already using > > this type of configuration, whatever that could mean. > > > > Reported-by: Wenchao Chen<wenchao.chen666@gmail.com> > > Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org> > Acked-by: Bean Huo <beanhuo@micron.com> Acked-by: Avri Altman <avri.altman@wdc.com> Another option might be, allowing to report "broken_fua", should the platform owner chooses to, much like scsi does per sdev. Thanks, Avri
On Fri, 3 Mar 2023 at 10:39, Avri Altman <Avri.Altman@wdc.com> wrote: > > > On 02.03.23 3:43 PM, Ulf Hansson wrote: > > > REQ_FUA is in general supported for eMMC cards, which translates into > > > so called "reliable writes". To support these write operations, the > > > CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, > > > which is common but not always the case. > > > > > > For some eMMC devices, it has been reported that reliable writes are > > > quite costly, leading to performance degradations. > > > > > > In a way to improve the situation, let's avoid announcing REQ_FUA > > > support if the eMMC supports an internal cache, as that allows us to > > > rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a > > lot cheaper. > > > Note that, those mmc hosts that lacks CMD23 support are already using > > > this type of configuration, whatever that could mean. > > > > > > Reported-by: Wenchao Chen<wenchao.chen666@gmail.com> > > > Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org> > > Acked-by: Bean Huo <beanhuo@micron.com> > Acked-by: Avri Altman <avri.altman@wdc.com> Thanks! > > Another option might be, allowing to report "broken_fua", > should the platform owner chooses to, much like scsi does per sdev. Are you suggesting a static or dynamic configuration option? For mmc, we also have the card quirks that may be used to configure the support for FUA, based upon what would work best for the card. Is that what you were thinking of? > > Thanks, > Avri Kind regards Uffe
> On Fri, 3 Mar 2023 at 10:39, Avri Altman <Avri.Altman@wdc.com> wrote: > > > > > On 02.03.23 3:43 PM, Ulf Hansson wrote: > > > > REQ_FUA is in general supported for eMMC cards, which translates > > > > into so called "reliable writes". To support these write > > > > operations, the > > > > CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host > too, > > > > which is common but not always the case. > > > > > > > > For some eMMC devices, it has been reported that reliable writes > > > > are quite costly, leading to performance degradations. > > > > > > > > In a way to improve the situation, let's avoid announcing REQ_FUA > > > > support if the eMMC supports an internal cache, as that allows us > > > > to rely solely on flush-requests (REQ_OP_FLUSH) instead, which > > > > seems to be a > > > lot cheaper. > > > > Note that, those mmc hosts that lacks CMD23 support are already > > > > using this type of configuration, whatever that could mean. > > > > > > > > Reported-by: Wenchao Chen<wenchao.chen666@gmail.com> > > > > Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org> > > > Acked-by: Bean Huo <beanhuo@micron.com> > > Acked-by: Avri Altman <avri.altman@wdc.com> > > Thanks! > > > > > Another option might be, allowing to report "broken_fua", should the > > platform owner chooses to, much like scsi does per sdev. > > Are you suggesting a static or dynamic configuration option? Static > > For mmc, we also have the card quirks that may be used to configure the > support for FUA, based upon what would work best for the card. Is that > what you were thinking of? I was thinking to allow the platform owner the flexibility to decide if to use it or not, More like SDHCI_QUIRK_xxx But I am totally fine with your current suggestions. Thanks, Avri > > > > > Thanks, > > Avri > > Kind regards > Uffe
>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
>>
>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
>>
>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.
>
> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in?

Nevermind, the sector-wise atomicity should not matter on 5.1 cards, or if the block length isn't being played with, which it isn't in our case.
If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if
a) < sector chunks are committed to flash, or
b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write.

I guess the cards which increase performance do b)? Or something else?

Anyway, regarding FUA I don't have any concerns regarding reliability with cache flush.
I can add some performance comparisons with some eMMCs I have around though.

Regards,
Christian
On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote: > > > >> > >> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case. > >> > >> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations. > >> > >> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper. > >> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean. > > > > Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec. > > (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in? > > Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case. > If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if > a) < sector chunks are committed to flash > b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write. Right, I agree! Note 1) Reliable write was introduced way before cache management in the eMMC spec. So, if the support for reliable write would have a stricter implementation than needed, I would not be surprised. Note 2) In the eMMC v5.1 spec, the cache flushing support has been extended to allow an explicit barrier operation. Perhaps, we should let that option take precedence over a regular flush+barrier, for REQ_OP_FLUSH!? > > I guess the cards which increase performance do b)? Or something else? That's the tricky part to know, as it's the internals of the eMMC. Although, it seems like both Avri (WDC) and Bean (Micron) would be happy to proceed with $subject patch, which makes me more comfortable to move forward. > Anyway regarding FUA i don't have any concerns regarding reliability with cache flush. > I can add some performance comparisons with some eMMCs I have around though. That would be great, thanks a lot for helping out with testing! Kind regards Uffe
On 3/03/23 14:01, Ulf Hansson wrote: > On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote: >> >> >>>> >>>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case. >>>> >>>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations. >>>> >>>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper. >>>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean. >>> >>> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec. >>> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in? >> >> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case. >> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if >> a) < sector chunks are committed to flash >> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write. > > Right, I agree! > > Note 1) Reliable write was introduced way before cache management in > the eMMC spec. So, if the support for reliable write would have a > stricter implementation than needed, I would not be surprised. I am not sure when you say stricter than needed. Historically file systems assumed that sectors are updated atomically i.e. there is never a sector with a mixture of old and new data. The eMMC spec does not guarantee that, except for reliable write. File systems may use REQ_FUA for important information, like the superblock or a journal commit record, so using reliable write for REQ_FUA would seem to give better protection against file system corruption than a cache flush which could leave a sector half-written. On the other hand, sudden power loss is probably rare in battery powered systems because they are designed to monitor the battery power and shutdown when it gets too low. And file systems can use checksums to detect half-written updates. And there is anyway no protection for other (non REQ_FUA) writes a file system might do and expect not to tear sectors. And you are more likely to smash the screen than bounce the battery out and cause an unrecoverable file system error. Nevertheless, the commit message of this patch reads like the change is an optimization, whereas it seems more like a policy change. The commit message should perhaps say something like: "The consensus is that the benefit of improved performance by not using reliable-write for REQ_FUA is much greater than any potential benefit that reliable-write might provide to avoid file system corruption in the event of sudden power loss." As for allowing for the policy to be overridden, perhaps an mmc_core module option? > > Note 2) In the eMMC v5.1 spec, the cache flushing support has been > extended to allow an explicit barrier operation. 
Perhaps, we should > let that option take precedence over a regular flush+barrier, for > REQ_OP_FLUSH!? > >> >> I guess the cards which increase performance do b)? Or something else? > > That's the tricky part to know, as it's the internals of the eMMC. It is the natural conclusion though. The eMMC probably does not update mapping information with every write, instead if power is lost, it scans the updated areas at the next initialization. (The power-off notify feature would commit the mapping information to media to avoid that). So a reliable write might have to: 1. write information to record that the old mapping should be used, not what might be discovered by scanning 2. do the actual write 3. write mapping information to record the new mapping > > Although, it seems like both Avri (WDC) and Bean (Micron) would be > happy to proceed with $subject patch, which makes me more comfortable > to move forward. > >> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush. >> I can add some performance comparisons with some eMMCs I have around though. > > That would be great, thanks a lot for helping out with testing! > > Kind regards > Uffe
On Fri, 3 Mar 2023 at 15:41, Adrian Hunter <adrian.hunter@intel.com> wrote: > > On 3/03/23 14:01, Ulf Hansson wrote: > > On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote: > >> > >> > >>>> > >>>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case. > >>>> > >>>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations. > >>>> > >>>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper. > >>>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean. > >>> > >>> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec. > >>> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in? > >> > >> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case. > >> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if > >> a) < sector chunks are committed to flash > >> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write. > > > > Right, I agree! > > > > Note 1) Reliable write was introduced way before cache management in > > the eMMC spec. So, if the support for reliable write would have a > > stricter implementation than needed, I would not be surprised. > > I am not sure when you say stricter than needed. Historically > file systems assumed that sectors are updated atomically i.e. > there is never a sector with a mixture of old and new data. > The eMMC spec does not guarantee that, except for reliable > write. Yes, I agree. With stricker, I was merely thinking of whether the eMMC makes the entire write request (multiple sectors) being atomic, or the guarantee is only per sector basis. According to the eMMC spec, that seems to be implementation specific. One option could be heavier than the other, I guess. > > File systems may use REQ_FUA for important information, like the > superblock or a journal commit record, so using reliable write > for REQ_FUA would seem to give better protection against file system > corruption than a cache flush which could leave a sector > half-written. Yes, I agree. If we should fully conform to what is stated in the eMMC spec, we should probably keep the current path to support REQ_FUA. > > On the other hand, sudden power loss is probably rare in battery > powered systems because they are designed to monitor the battery > power and shutdown when it gets too low. > > And file systems can use checksums to detect half-written updates. > > And there is anyway no protection for other (non REQ_FUA) writes a > file system might do and expect not to tear sectors. > > And you are more likely to smash the screen than bounce the battery > out and cause an unrecoverable file system error. 
Right, these are good arguments to why $subject patch perhaps makes sense to move forward with anyway. Moreover, it seems like some eMMC vendors don't really have any concerns with us moving away from reliable writes, to instead use only "cache flushing". I guess it can indicate that the regular writes may already be treated in an atomic kind of way, but who knows. > > Nevertheless, the commit message of this patch reads like the change > is an optimization, whereas it seems more like a policy change. > The commit message should perhaps say something like: > "The consensus is that the benefit of improved performance by not > using reliable-write for REQ_FUA is much greater than any potential > benefit that reliable-write might provide to avoid file system > corruption in the event of sudden power loss." I agree. I will improve it along the lines of what you suggest. > > As for allowing for the policy to be overridden, perhaps an mmc_core > module option? Even if I am not very fond of module parameters, this seems like a reasonable thing to use for this case. I was also looking at using a card quirk. > > > > > Note 2) In the eMMC v5.1 spec, the cache flushing support has been > > extended to allow an explicit barrier operation. Perhaps, we should > > let that option take precedence over a regular flush+barrier, for > > REQ_OP_FLUSH!? > > > >> > >> I guess the cards which increase performance do b)? Or something else? > > > > That's the tricky part to know, as it's the internals of the eMMC. > > It is the natural conclusion though. The eMMC probably does not > update mapping information with every write, instead if power is > lost, it scans the updated areas at the next initialization. (The > power-off notify feature would commit the mapping information to > media to avoid that). So a reliable write might have to: > 1. write information to record that the old mapping > should be used, not what might be discovered by scanning > 2. do the actual write > 3. write mapping information to record the new mapping Yes. And depending on the eMMC device, some may be more clever than others for how to actually deal with this. > > > > > Although, it seems like both Avri (WDC) and Bean (Micron) would be > > happy to proceed with $subject patch, which makes me more comfortable > > to move forward. > > > >> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush. > >> I can add some performance comparisons with some eMMCs I have around though. > > > > That would be great, thanks a lot for helping out with testing! > > Kind regards Uffe
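A minimal sketch of what such an mmc module parameter could look like (the parameter name and placement are purely hypothetical; nothing like it exists in the thread or in mainline):

/* Hypothetical knob: let the platform owner keep reliable-write backed
 * REQ_FUA even when the eMMC advertises an internal cache. */
static bool mmc_blk_force_fua;
module_param_named(force_fua, mmc_blk_force_fua, bool, 0444);
MODULE_PARM_DESC(force_fua,
		 "Keep REQ_FUA (reliable write) even if the eMMC has a cache");

If it lived in the mmc_block module, it would be set as e.g. mmc_block.force_fua=1 on the kernel command line; a card quirk would instead key the behaviour off the CID of known-affected devices.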
On 6/03/23 18:09, Ulf Hansson wrote: > On Fri, 3 Mar 2023 at 15:41, Adrian Hunter <adrian.hunter@intel.com> wrote: >> >> On 3/03/23 14:01, Ulf Hansson wrote: >>> On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote: >>>> >>>> >>>>>> >>>>>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case. >>>>>> >>>>>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations. >>>>>> >>>>>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper. >>>>>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean. >>>>> >>>>> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec. >>>>> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in? >>>> >>>> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case. >>>> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if >>>> a) < sector chunks are committed to flash >>>> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write. >>> >>> Right, I agree! >>> >>> Note 1) Reliable write was introduced way before cache management in >>> the eMMC spec. So, if the support for reliable write would have a >>> stricter implementation than needed, I would not be surprised. >> >> I am not sure when you say stricter than needed. Historically >> file systems assumed that sectors are updated atomically i.e. >> there is never a sector with a mixture of old and new data. >> The eMMC spec does not guarantee that, except for reliable >> write. > > Yes, I agree. With stricker, I was merely thinking of whether the eMMC > makes the entire write request (multiple sectors) being atomic, or the > guarantee is only per sector basis. > > According to the eMMC spec, that seems to be implementation specific. > One option could be heavier than the other, I guess. > >> >> File systems may use REQ_FUA for important information, like the >> superblock or a journal commit record, so using reliable write >> for REQ_FUA would seem to give better protection against file system >> corruption than a cache flush which could leave a sector >> half-written. > > Yes, I agree. If we should fully conform to what is stated in the eMMC > spec, we should probably keep the current path to support REQ_FUA. > >> >> On the other hand, sudden power loss is probably rare in battery >> powered systems because they are designed to monitor the battery >> power and shutdown when it gets too low. >> >> And file systems can use checksums to detect half-written updates. >> >> And there is anyway no protection for other (non REQ_FUA) writes a >> file system might do and expect not to tear sectors. 
>> >> And you are more likely to smash the screen than bounce the battery >> out and cause an unrecoverable file system error. > > Right, these are good arguments to why $subject patch perhaps makes > sense to move forward with anyway. Yes > > Moreover, it seems like some eMMC vendors don't really have any > concerns with us moving away from reliable writes, to instead use only > "cache flushing". I guess it can indicate that the regular writes may > already be treated in an atomic kind of way, but who knows. Indeed > >> >> Nevertheless, the commit message of this patch reads like the change >> is an optimization, whereas it seems more like a policy change. >> The commit message should perhaps say something like: >> "The consensus is that the benefit of improved performance by not >> using reliable-write for REQ_FUA is much greater than any potential >> benefit that reliable-write might provide to avoid file system >> corruption in the event of sudden power loss." > > I agree. I will improve it along the lines of what you suggest. > >> >> As for allowing for the policy to be overridden, perhaps an mmc_core >> module option? > > Even if I am not very fond of module parameters, this seems like a > reasonable thing to use for this case. > > I was also looking at using a card quirk. Yes that makes sense > >> >>> >>> Note 2) In the eMMC v5.1 spec, the cache flushing support has been >>> extended to allow an explicit barrier operation. Perhaps, we should >>> let that option take precedence over a regular flush+barrier, for >>> REQ_OP_FLUSH!? >>> >>>> >>>> I guess the cards which increase performance do b)? Or something else? >>> >>> That's the tricky part to know, as it's the internals of the eMMC. >> >> It is the natural conclusion though. The eMMC probably does not >> update mapping information with every write, instead if power is >> lost, it scans the updated areas at the next initialization. (The >> power-off notify feature would commit the mapping information to >> media to avoid that). So a reliable write might have to: >> 1. write information to record that the old mapping >> should be used, not what might be discovered by scanning >> 2. do the actual write >> 3. write mapping information to record the new mapping > > Yes. And depending on the eMMC device, some may be more clever than > others for how to actually deal with this. > >> >>> >>> Although, it seems like both Avri (WDC) and Bean (Micron) would be >>> happy to proceed with $subject patch, which makes me more comfortable >>> to move forward. >>> >>>> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush. >>>> I can add some performance comparisons with some eMMCs I have around though. >>> >>> That would be great, thanks a lot for helping out with testing! >>> > > Kind regards > Uffe
I have benchmarked the FUA/Cache behavior a bit.
I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target, so I used:

# call with
# for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
mkfs.ext4 -F /dev/mmcblk2
mount /dev/mmcblk2 /mnt
for i in {1..3}
do
cp -r linux-6.2.2 /mnt/$i
done
for i in {1..3}
do
rm -r /mnt/$i
done
for i in {1..3}
do
cp -r linux-6.2.2 /mnt/$i
done


I found a couple of DUTs that I can link, I also tested one industrial card.

DUT1: blue PCB Foresee eMMC
https://pine64.com/product/32gb-emmc-module/
DUT2: green PCB SiliconGo eMMC
Couldn't find that one online anymore unfortunately
DUT3: orange hardkernel PCB 8GB
https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
DUT4: orange hardkernel PCB white dot
https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
DUT5: Industrial card


The test issued 461 DO_REL_WR during one of the iterations for DUT5.

DUT1:
Cache, no FUA:
13:04.49
13:13.82
13:30.59
13:28.13
13:20.64
FUA:
13:30.32
13:36.26
13:10.86
13:32.52
13:48.59

DUT2:
FUA:
8:11.24
7:47.73
7:48.00
7:48.18
7:47.38
Cache, no FUA:
8:10.30
7:48.97
7:48.47
7:47.93
7:44.18

DUT3:
Cache, no FUA:
7:02.82
6:58.94
7:03.20
7:00.27
7:00.88
FUA:
7:05.43
7:03.44
7:04.82
7:03.26
7:04.74

DUT4:
FUA:
7:23.92
7:20.15
7:20.52
7:19.10
7:20.71
Cache, no FUA:
7:20.23
7:20.48
7:19.94
7:18.90
7:19.88

DUT5:
Cache, no FUA:
7:19.36
7:02.11
7:01.53
7:01.35
7:00.37
Cache, no FUA CQE:
7:17.55
7:00.73
6:59.25
6:58.44
6:58.60
FUA:
7:15.10
6:58.99
6:58.94
6:59.17
6:60.00
FUA CQE:
7:11.03
6:58.04
6:56.89
6:56.43
6:56.28

If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.

Regards,
Christian
On Fri, 10 Mar 2023 at 14:43, Christian Löhle <CLoehle@hyperstone.com> wrote: > > I have benchmarked the FUA/Cache behavior a bit. > I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used: > > # call with > # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done > mkfs.ext4 -F /dev/mmcblk2 > mount /dev/mmcblk2 /mnt > for i in {1..3} > do > cp -r linux-6.2.2 /mnt/$i > done > for i in {1..3} > do > rm -r /mnt/$i > done > for i in {1..3} > do > cp -r linux-6.2.2 /mnt/$i > done > > > I found a couple of DUTs that I can link, I also tested one industrial card. > > DUT1: blue PCB Foresee eMMC > https://pine64.com/product/32gb-emmc-module/ > DUT2: green PCB SiliconGo eMMC > Couldn't find that one online anymore unfortunately > DUT3: orange hardkernel PCB 8GB > https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/ > DUT4: orange hardkernel PCB white dot > https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html > DUT5: Industrial card Thanks a lot for helping out with testing! Much appreciated! > > > The test issued 461 DO_REL_WR during one of the iterations for DUT5 > > DUT1: > Cache, no FUA: > 13:04.49 > 13:13.82 > 13:30.59 > 13:28:13 > 13:20:64 > FUA: > 13:30.32 > 13:36.26 > 13:10.86 > 13:32.52 > 13:48.59 > > DUT2: > FUA: > 8:11.24 > 7:47.73 > 7:48.00 > 7:48.18 > 7:47.38 > Cache, no FUA: > 8:10.30 > 7:48.97 > 7:48.47 > 7:47.93 > 7:44.18 > > DUT3: > Cache, no FUA: > 7:02.82 > 6:58.94 > 7:03.20 > 7:00.27 > 7:00.88 > FUA: > 7:05.43 > 7:03.44 > 7:04.82 > 7:03.26 > 7:04.74 > > DUT4: > FUA: > 7:23.92 > 7:20.15 > 7:20.52 > 7:19.10 > 7:20.71 > Cache, no FUA: > 7:20.23 > 7:20.48 > 7:19.94 > 7:18.90 > 7:19.88 Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch. Do you agree? > > Cache, no FUA: > 7:19.36 > 7:02.11 > 7:01.53 > 7:01.35 > 7:00.37 > Cache, no FUA CQE: > 7:17.55 > 7:00.73 > 6:59.25 > 6:58.44 > 6:58.60 > FUA: > 7:15.10 > 6:58.99 > 6:58.94 > 6:59.17 > 6:60.00 > FUA CQE: > 7:11.03 > 6:58.04 > 6:56.89 > 6:56.43 > 6:56:28 > > If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know. If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right? For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type(). When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE? Kind regards Uffe
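For reference, the dispatch in question lives in drivers/mmc/core/queue.c. Roughly paraphrased (a sketch; check the actual tree for the verbatim code), a flush only becomes a DCMD when the host advertises MMC_CAP2_CQE_DCMD:

/* Paraphrased sketch of mmc_cqe_issue_type() */
enum mmc_issue_type mmc_cqe_issue_type(struct mmc_host *host,
				       struct request *req)
{
	switch (req_op(req)) {
	case REQ_OP_DRV_IN:
	case REQ_OP_DRV_OUT:
		return MMC_ISSUE_SYNC;
	case REQ_OP_FLUSH:
		/* DCMD only if the host supports it (MMC_CAP2_CQE_DCMD) */
		return mmc_cqe_can_dcmd(host) ? MMC_ISSUE_DCMD : MMC_ISSUE_SYNC;
	default:
		return MMC_ISSUE_ASYNC;
	}
}

This is consistent with Christian's answer below that sdhci-of-arasan on rk3399 ends up with MMC_ISSUE_SYNC.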
>> >> I have benchmarked the FUA/Cache behavior a bit. >> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used: >> >> # call with >> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M >> of=/dev/mmcblk2; done; for loop in {1..5}; do time >> ./filesystembenchmark.sh; umount /mnt; done >> mkfs.ext4 -F /dev/mmcblk2 >> mount /dev/mmcblk2 /mnt >> for i in {1..3} >> do >> cp -r linux-6.2.2 /mnt/$i >> done >> for i in {1..3} >> do >> rm -r /mnt/$i >> done >> for i in {1..3} >> do >> cp -r linux-6.2.2 /mnt/$i >> done >> >> >> I found a couple of DUTs that I can link, I also tested one industrial card. >> >> DUT1: blue PCB Foresee eMMC >> https://pine64.com/product/32gb-emmc-module/ >> DUT2: green PCB SiliconGo eMMC >> Couldn't find that one online anymore unfortunately >> DUT3: orange hardkernel PCB 8GB >> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/ >> DUT4: orange hardkernel PCB white dot >> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro >> id-xu3.html >> DUT5: Industrial card > > Thanks a lot for helping out with testing! Much appreciated! No problem, glad to be of help. > >> >> >> The test issued 461 DO_REL_WR during one of the iterations for DUT5 >> >> DUT1: >> Cache, no FUA: >> 13:04.49 >> 13:13.82 >> 13:30.59 >> 13:28:13 >> 13:20:64 >> FUA: >> 13:30.32 >> 13:36.26 >> 13:10.86 >> 13:32.52 >> 13:48.59 >> >> DUT2: >> FUA: >> 8:11.24 >> 7:47.73 >> 7:48.00 >> 7:48.18 >> 7:47.38 >> Cache, no FUA: >> 8:10.30 >> 7:48.97 >> 7:48.47 >> 7:47.93 >> 7:44.18 >> >> DUT3: >> Cache, no FUA: >> 7:02.82 >> 6:58.94 >> 7:03.20 >> 7:00.27 >> 7:00.88 >> FUA: >> 7:05.43 >> 7:03.44 >> 7:04.82 >> 7:03.26 >> 7:04.74 >> >> DUT4: >> FUA: >> 7:23.92 >> 7:20.15 >> 7:20.52 >> 7:19.10 >> 7:20.71 >> Cache, no FUA: >> 7:20.23 >> 7:20.48 >> 7:19.94 >> 7:18.90 >> 7:19.88 > > Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch. > > Do you agree? That is a good question, that's why I just posted the data without further comment from my side. I was honestly expecting the difference to be much higher, given the original patch. If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO. If there are cards where the difference is much more significant then of course a quirk would be nicer. On the other side I don't see why not and any improvement is a good one? > >> >> Cache, no FUA: >> 7:19.36 >> 7:02.11 >> 7:01.53 >> 7:01.35 >> 7:00.37 >> Cache, no FUA CQE: >> 7:17.55 >> 7:00.73 >> 6:59.25 >> 6:58.44 >> 6:58.60 >> FUA: >> 7:15.10 >> 6:58.99 >> 6:58.94 >> 6:59.17 >> 6:60.00 >> FUA CQE: >> 7:11.03 >> 6:58.04 >> 6:56.89 >> 6:56.43 >> 6:56:28 >> >> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know. > > If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right? That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse. > > For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type(). 
It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential. > > When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE? Sure, I'm also interested in Adrian's take on this. Regards, Christian Hyperstone GmbH | Reichenaustr. 39a | 78467 Konstanz Managing Director: Dr. Jan Peter Berns. Commercial register of local courts: Freiburg HRB381782
Hi, > I have benchmarked the FUA/Cache behavior a bit. > I don't have an actual filesystem benchmark that does what I wanted and is > easy to port to the target so I used: I guess I'm a bit late here - sorry about that. If you happened to have traces of the below runs - I'll be happy to provide further analysis. You might also want to look into fio which allows you do write to a raw block device, And has various sync options - so you are much more in control of the payload size And how many reliable-writes / sync-caches were issued - if any. Thanks, Avri > # call with > # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; > done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; > done > mkfs.ext4 -F /dev/mmcblk2 > mount /dev/mmcblk2 /mnt > for i in {1..3} > do > cp -r linux-6.2.2 /mnt/$i > done > for i in {1..3} > do > rm -r /mnt/$i > done > for i in {1..3} > do > cp -r linux-6.2.2 /mnt/$i > done > > > I found a couple of DUTs that I can link, I also tested one industrial card. > > DUT1: blue PCB Foresee eMMC > https://pine64.com/product/32gb-emmc-module/ > DUT2: green PCB SiliconGo eMMC > Couldn't find that one online anymore unfortunately > DUT3: orange hardkernel PCB 8GB > https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/ > DUT4: orange hardkernel PCB white dot > https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for- > odroid-xu3.html > DUT5: Industrial card > > > The test issued 461 DO_REL_WR during one of the iterations for DUT5 > > DUT1: > Cache, no FUA: > 13:04.49 > 13:13.82 > 13:30.59 > 13:28:13 > 13:20:64 > FUA: > 13:30.32 > 13:36.26 > 13:10.86 > 13:32.52 > 13:48.59 > > DUT2: > FUA: > 8:11.24 > 7:47.73 > 7:48.00 > 7:48.18 > 7:47.38 > Cache, no FUA: > 8:10.30 > 7:48.97 > 7:48.47 > 7:47.93 > 7:44.18 > > DUT3: > Cache, no FUA: > 7:02.82 > 6:58.94 > 7:03.20 > 7:00.27 > 7:00.88 > FUA: > 7:05.43 > 7:03.44 > 7:04.82 > 7:03.26 > 7:04.74 > > DUT4: > FUA: > 7:23.92 > 7:20.15 > 7:20.52 > 7:19.10 > 7:20.71 > Cache, no FUA: > 7:20.23 > 7:20.48 > 7:19.94 > 7:18.90 > 7:19.88 > > Cache, no FUA: > 7:19.36 > 7:02.11 > 7:01.53 > 7:01.35 > 7:00.37 > Cache, no FUA CQE: > 7:17.55 > 7:00.73 > 6:59.25 > 6:58.44 > 6:58.60 > FUA: > 7:15.10 > 6:58.99 > 6:58.94 > 6:59.17 > 6:60.00 > FUA CQE: > 7:11.03 > 6:58.04 > 6:56.89 > 6:56.43 > 6:56:28 > > If anyone has any comments or disagrees with the benchmark, or has a > specific eMMC to test, let me know. > > Regards, > Christian > > Hyperstone GmbH | Reichenaustr. 39a | 78467 Konstanz Managing Director: > Dr. Jan Peter Berns. > Commercial register of local courts: Freiburg HRB381782
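Not from the thread, but as a concrete illustration of the fio suggestion, something along these lines (device path, sizes and job name are placeholders) gives much tighter control over the number of writes and syncs than the cp-based script:

# Hypothetical example: 4k random writes to the raw eMMC device, with an
# fsync after every write so each write is followed by a cache flush;
# swapping --fsync=1 for --sync=1 (O_SYNC) varies how the syncs are issued.
fio --name=syncwrite --filename=/dev/mmcblk2 --ioengine=psync \
    --direct=1 --rw=randwrite --bs=4k --size=64M --fsync=1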
On 10/03/23 19:06, Christian Löhle wrote: > >>> >>> I have benchmarked the FUA/Cache behavior a bit. >>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used: >>> >>> # call with >>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M >>> of=/dev/mmcblk2; done; for loop in {1..5}; do time >>> ./filesystembenchmark.sh; umount /mnt; done >>> mkfs.ext4 -F /dev/mmcblk2 >>> mount /dev/mmcblk2 /mnt >>> for i in {1..3} >>> do >>> cp -r linux-6.2.2 /mnt/$i >>> done >>> for i in {1..3} >>> do >>> rm -r /mnt/$i >>> done >>> for i in {1..3} >>> do >>> cp -r linux-6.2.2 /mnt/$i >>> done >>> >>> >>> I found a couple of DUTs that I can link, I also tested one industrial card. >>> >>> DUT1: blue PCB Foresee eMMC >>> https://pine64.com/product/32gb-emmc-module/ >>> DUT2: green PCB SiliconGo eMMC >>> Couldn't find that one online anymore unfortunately >>> DUT3: orange hardkernel PCB 8GB >>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/ >>> DUT4: orange hardkernel PCB white dot >>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro >>> id-xu3.html >>> DUT5: Industrial card >> >> Thanks a lot for helping out with testing! Much appreciated! > > No problem, glad to be of help. > >> >>> >>> >>> The test issued 461 DO_REL_WR during one of the iterations for DUT5 >>> >>> DUT1: >>> Cache, no FUA: >>> 13:04.49 >>> 13:13.82 >>> 13:30.59 >>> 13:28:13 >>> 13:20:64 >>> FUA: >>> 13:30.32 >>> 13:36.26 >>> 13:10.86 >>> 13:32.52 >>> 13:48.59 >>> >>> DUT2: >>> FUA: >>> 8:11.24 >>> 7:47.73 >>> 7:48.00 >>> 7:48.18 >>> 7:47.38 >>> Cache, no FUA: >>> 8:10.30 >>> 7:48.97 >>> 7:48.47 >>> 7:47.93 >>> 7:44.18 >>> >>> DUT3: >>> Cache, no FUA: >>> 7:02.82 >>> 6:58.94 >>> 7:03.20 >>> 7:00.27 >>> 7:00.88 >>> FUA: >>> 7:05.43 >>> 7:03.44 >>> 7:04.82 >>> 7:03.26 >>> 7:04.74 >>> >>> DUT4: >>> FUA: >>> 7:23.92 >>> 7:20.15 >>> 7:20.52 >>> 7:19.10 >>> 7:20.71 >>> Cache, no FUA: >>> 7:20.23 >>> 7:20.48 >>> 7:19.94 >>> 7:18.90 >>> 7:19.88 >> >> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch. >> >> Do you agree? > > That is a good question, that's why I just posted the data without further comment from my side. > I was honestly expecting the difference to be much higher, given the original patch. > If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO. > If there are cards where the difference is much more significant then of course a quirk would be nicer. > On the other side I don't see why not and any improvement is a good one? > >> >>> >>> Cache, no FUA: >>> 7:19.36 >>> 7:02.11 >>> 7:01.53 >>> 7:01.35 >>> 7:00.37 >>> Cache, no FUA CQE: >>> 7:17.55 >>> 7:00.73 >>> 6:59.25 >>> 6:58.44 >>> 6:58.60 >>> FUA: >>> 7:15.10 >>> 6:58.99 >>> 6:58.94 >>> 6:59.17 >>> 6:60.00 >>> FUA CQE: >>> 7:11.03 >>> 6:58.04 >>> 6:56.89 >>> 6:56.43 >>> 6:56:28 >>> >>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know. >> >> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right? > > That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse. > >> >> For CQE, it seems like FUA could be slightly even better, at least for DUT5. 
>> Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>
> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
>
>> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> Sure, I'm also interested in Adrian's take on this.

Testing an arbitrary system and looking only at individual I/Os,
which may not be representative of any use-case, resulted in
FUA always winning, see below.

All values are approximate and in microseconds.

                         With FUA                 Without FUA

With CQE      Reliable Write   350       Write    125
                                         Flush    300
              Total            350       Total    425

Without CQE   Reliable Write   350       Write    125
              CMD13            100       CMD13    100
                                         Flush    300
                                         CMD13    100
              Total            450       Total    625

FYI the test I was doing was:

# cat test.sh
#!/bin/sh

echo "hi" > /mnt/mmc/hi.txt

sync


# perf record --no-bpf-event -e mmc:* -a -- ./test.sh
# perf script --ns --deltatime


The conclusion in this case would seem to be that CQE
makes the case for removing FUA less bad.

Perhaps CQE is more common in newer eMMCs which in turn
have better FUA implementations.
On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote: > > On 10/03/23 19:06, Christian Löhle wrote: > > > >>> > >>> I have benchmarked the FUA/Cache behavior a bit. > >>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used: > >>> > >>> # call with > >>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M > >>> of=/dev/mmcblk2; done; for loop in {1..5}; do time > >>> ./filesystembenchmark.sh; umount /mnt; done > >>> mkfs.ext4 -F /dev/mmcblk2 > >>> mount /dev/mmcblk2 /mnt > >>> for i in {1..3} > >>> do > >>> cp -r linux-6.2.2 /mnt/$i > >>> done > >>> for i in {1..3} > >>> do > >>> rm -r /mnt/$i > >>> done > >>> for i in {1..3} > >>> do > >>> cp -r linux-6.2.2 /mnt/$i > >>> done > >>> > >>> > >>> I found a couple of DUTs that I can link, I also tested one industrial card. > >>> > >>> DUT1: blue PCB Foresee eMMC > >>> https://pine64.com/product/32gb-emmc-module/ > >>> DUT2: green PCB SiliconGo eMMC > >>> Couldn't find that one online anymore unfortunately > >>> DUT3: orange hardkernel PCB 8GB > >>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/ > >>> DUT4: orange hardkernel PCB white dot > >>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro > >>> id-xu3.html > >>> DUT5: Industrial card > >> > >> Thanks a lot for helping out with testing! Much appreciated! > > > > No problem, glad to be of help. > > > >> > >>> > >>> > >>> The test issued 461 DO_REL_WR during one of the iterations for DUT5 > >>> > >>> DUT1: > >>> Cache, no FUA: > >>> 13:04.49 > >>> 13:13.82 > >>> 13:30.59 > >>> 13:28:13 > >>> 13:20:64 > >>> FUA: > >>> 13:30.32 > >>> 13:36.26 > >>> 13:10.86 > >>> 13:32.52 > >>> 13:48.59 > >>> > >>> DUT2: > >>> FUA: > >>> 8:11.24 > >>> 7:47.73 > >>> 7:48.00 > >>> 7:48.18 > >>> 7:47.38 > >>> Cache, no FUA: > >>> 8:10.30 > >>> 7:48.97 > >>> 7:48.47 > >>> 7:47.93 > >>> 7:44.18 > >>> > >>> DUT3: > >>> Cache, no FUA: > >>> 7:02.82 > >>> 6:58.94 > >>> 7:03.20 > >>> 7:00.27 > >>> 7:00.88 > >>> FUA: > >>> 7:05.43 > >>> 7:03.44 > >>> 7:04.82 > >>> 7:03.26 > >>> 7:04.74 > >>> > >>> DUT4: > >>> FUA: > >>> 7:23.92 > >>> 7:20.15 > >>> 7:20.52 > >>> 7:19.10 > >>> 7:20.71 > >>> Cache, no FUA: > >>> 7:20.23 > >>> 7:20.48 > >>> 7:19.94 > >>> 7:18.90 > >>> 7:19.88 > >> > >> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch. > >> > >> Do you agree? > > > > That is a good question, that's why I just posted the data without further comment from my side. > > I was honestly expecting the difference to be much higher, given the original patch. > > If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO. > > If there are cards where the difference is much more significant then of course a quirk would be nicer. > > On the other side I don't see why not and any improvement is a good one? > > > >> > >>> > >>> Cache, no FUA: > >>> 7:19.36 > >>> 7:02.11 > >>> 7:01.53 > >>> 7:01.35 > >>> 7:00.37 > >>> Cache, no FUA CQE: > >>> 7:17.55 > >>> 7:00.73 > >>> 6:59.25 > >>> 6:58.44 > >>> 6:58.60 > >>> FUA: > >>> 7:15.10 > >>> 6:58.99 > >>> 6:58.94 > >>> 6:59.17 > >>> 6:60.00 > >>> FUA CQE: > >>> 7:11.03 > >>> 6:58.04 > >>> 6:56.89 > >>> 6:56.43 > >>> 6:56:28 > >>> > >>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know. 
> >> > >> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right? > > > > That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse. > > > >> > >> For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type(). > > It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential. > > > >> > >> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE? > > Sure, I'm also interested in Adrian's take on this. > > Testing an arbitrary system and looking only at individual I/Os, > which may not be representative of any use-case, resulted in > FUA always winning, see below. > > All values are approximate and in microseconds. > > With FUA Without FUA > > With CQE Reliable Write 350 Write 125 > Flush 300 > Total 350 425 > > Without CQE Reliable Write 350 Write 125 > CMD13 100 CMD13 100 > Flush 300 > CMD13 100 > Total 450 625 > > FYI the test I was doing was: > > # cat test.sh > #!/bin/sh > > echo "hi" > /mnt/mmc/hi.txt > > sync > > > # perf record --no-bpf-event -e mmc:* -a -- ./test.sh > # perf script --ns --deltatime > > > The conclusion in this case would seem to be that CQE > makes the case for removing FUA less bad. > > Perhaps CQE is more common in newer eMMCs which in turn > have better FUA implementations. Very interesting data - and thanks for helping out with tests! A few observations and thoughts from the above. 1) A "normal" use case would probably include additional writes (regular writes) and I guess that could impact the flushing behavior. Maybe the flushing becomes less heavy, if the device internally/occasionally needs to flush its cache anyway? Or - maybe it doesn't matter at all, because the reliable writes are triggering the cache to be flushed too. 2) Assuming that a reliable write is triggering the internal cache to be flushed too, then we need less number of commands to be sent/acked to the eMMC - compared to not using FUA. This means less protocol overhead when using FUA - and perhaps that's what your tests is actually telling us? Kind regards Uffe
On 14/03/23 09:56, Ulf Hansson wrote: > On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote: >> >> On 10/03/23 19:06, Christian Löhle wrote: >>> >>>>> >>>>> I have benchmarked the FUA/Cache behavior a bit. >>>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used: >>>>> >>>>> # call with >>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M >>>>> of=/dev/mmcblk2; done; for loop in {1..5}; do time >>>>> ./filesystembenchmark.sh; umount /mnt; done >>>>> mkfs.ext4 -F /dev/mmcblk2 >>>>> mount /dev/mmcblk2 /mnt >>>>> for i in {1..3} >>>>> do >>>>> cp -r linux-6.2.2 /mnt/$i >>>>> done >>>>> for i in {1..3} >>>>> do >>>>> rm -r /mnt/$i >>>>> done >>>>> for i in {1..3} >>>>> do >>>>> cp -r linux-6.2.2 /mnt/$i >>>>> done >>>>> >>>>> >>>>> I found a couple of DUTs that I can link, I also tested one industrial card. >>>>> >>>>> DUT1: blue PCB Foresee eMMC >>>>> https://pine64.com/product/32gb-emmc-module/ >>>>> DUT2: green PCB SiliconGo eMMC >>>>> Couldn't find that one online anymore unfortunately >>>>> DUT3: orange hardkernel PCB 8GB >>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/ >>>>> DUT4: orange hardkernel PCB white dot >>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro >>>>> id-xu3.html >>>>> DUT5: Industrial card >>>> >>>> Thanks a lot for helping out with testing! Much appreciated! >>> >>> No problem, glad to be of help. >>> >>>> >>>>> >>>>> >>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5 >>>>> >>>>> DUT1: >>>>> Cache, no FUA: >>>>> 13:04.49 >>>>> 13:13.82 >>>>> 13:30.59 >>>>> 13:28:13 >>>>> 13:20:64 >>>>> FUA: >>>>> 13:30.32 >>>>> 13:36.26 >>>>> 13:10.86 >>>>> 13:32.52 >>>>> 13:48.59 >>>>> >>>>> DUT2: >>>>> FUA: >>>>> 8:11.24 >>>>> 7:47.73 >>>>> 7:48.00 >>>>> 7:48.18 >>>>> 7:47.38 >>>>> Cache, no FUA: >>>>> 8:10.30 >>>>> 7:48.97 >>>>> 7:48.47 >>>>> 7:47.93 >>>>> 7:44.18 >>>>> >>>>> DUT3: >>>>> Cache, no FUA: >>>>> 7:02.82 >>>>> 6:58.94 >>>>> 7:03.20 >>>>> 7:00.27 >>>>> 7:00.88 >>>>> FUA: >>>>> 7:05.43 >>>>> 7:03.44 >>>>> 7:04.82 >>>>> 7:03.26 >>>>> 7:04.74 >>>>> >>>>> DUT4: >>>>> FUA: >>>>> 7:23.92 >>>>> 7:20.15 >>>>> 7:20.52 >>>>> 7:19.10 >>>>> 7:20.71 >>>>> Cache, no FUA: >>>>> 7:20.23 >>>>> 7:20.48 >>>>> 7:19.94 >>>>> 7:18.90 >>>>> 7:19.88 >>>> >>>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch. >>>> >>>> Do you agree? >>> >>> That is a good question, that's why I just posted the data without further comment from my side. >>> I was honestly expecting the difference to be much higher, given the original patch. >>> If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO. >>> If there are cards where the difference is much more significant then of course a quirk would be nicer. >>> On the other side I don't see why not and any improvement is a good one? >>> >>>> >>>>> >>>>> Cache, no FUA: >>>>> 7:19.36 >>>>> 7:02.11 >>>>> 7:01.53 >>>>> 7:01.35 >>>>> 7:00.37 >>>>> Cache, no FUA CQE: >>>>> 7:17.55 >>>>> 7:00.73 >>>>> 6:59.25 >>>>> 6:58.44 >>>>> 6:58.60 >>>>> FUA: >>>>> 7:15.10 >>>>> 6:58.99 >>>>> 6:58.94 >>>>> 6:59.17 >>>>> 6:60.00 >>>>> FUA CQE: >>>>> 7:11.03 >>>>> 6:58.04 >>>>> 6:56.89 >>>>> 6:56.43 >>>>> 6:56:28 >>>>> >>>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know. 
>>>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
>>>
>>> That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
>>>
>>>> For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
>>>
>>>> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
>>> Sure, I'm also interested in Adrian's take on this.
>>
>> Testing an arbitrary system and looking only at individual I/Os,
>> which may not be representative of any use-case, resulted in
>> FUA always winning, see below.
>>
>> All values are approximate and in microseconds.
>>
>>                          With FUA                 Without FUA
>>
>> With CQE      Reliable Write   350       Write    125
>>                                          Flush    300
>>               Total            350       Total    425
>>
>> Without CQE   Reliable Write   350       Write    125
>>               CMD13            100       CMD13    100
>>                                          Flush    300
>>                                          CMD13    100
>>               Total            450       Total    625
>>
>> FYI the test I was doing was:
>>
>> # cat test.sh
>> #!/bin/sh
>>
>> echo "hi" > /mnt/mmc/hi.txt
>>
>> sync
>>
>> # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>> # perf script --ns --deltatime
>>
>> The conclusion in this case would seem to be that CQE
>> makes the case for removing FUA less bad.
>>
>> Perhaps CQE is more common in newer eMMCs which in turn
>> have better FUA implementations.
>
> Very interesting data - and thanks for helping out with tests!
>
> A few observations and thoughts from the above.
>
> 1)
> A "normal" use case would probably include additional writes (regular
> writes) and I guess that could impact the flushing behavior. Maybe the
> flushing becomes less heavy, if the device internally/occasionally
> needs to flush its cache anyway? Or - maybe it doesn't matter at all,
> because the reliable writes are triggering the cache to be flushed
> too.

The sync is presumably causing an EXT4 journal commit which
seems to use REQ_PREFLUSH and REQ_FUA. That is:
	Flush (the journal to media)
	Write (the commit record) (FUA)
So it does a flush anyway. The no-FUA case is:
	Flush (the journal to media)
	Write (the commit record)
	Flush (the commit record)

> 2)
> Assuming that a reliable write is triggering the internal cache to be
> flushed too, then we need less number of commands to be sent/acked to
> the eMMC - compared to not using FUA. This means less protocol
> overhead when using FUA - and perhaps that's what your tests is
> actually telling us?

There is definitely less protocol overhead because the no-FUA case
has to do an extra CMD6 (flush) and CMD13. Note also, in this case
auto-CMD23 is being used, which is why it is not listed.

Using an older system (no CQE but also auto-CMD23), resulted in
a win for no-FUA:

                     With FUA                 Without FUA

    Reliable Write   1200         Write     850
    CMD13             100         CMD13     100
                                  Flush     120
                                  CMD13      65
    Total            1300         Total    1135
On Tue, 14 Mar 2023 at 09:58, Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 14/03/23 09:56, Ulf Hansson wrote:
> > On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> [benchmark script, DUT list, timing data and observation 1) snipped - quoted verbatim from the two messages above]
>
> > 2)
> > Assuming that a reliable write is triggering the internal cache to be
> > flushed too, then we need less number of commands to be sent/acked to
> > the eMMC - compared to not using FUA.
> > This means less protocol
> > overhead when using FUA - and perhaps that's what your tests is
> > actually telling us?
>
> There is definitely less protocol overhead because the no-FUA
> case has to do an extra CMD6 (flush) and CMD13.
>
> Note also, in this case auto-CMD23 is being used, which is why
> it is not listed.
>
> Using an older system (no CQE but also auto-CMD23), resulted
> in a win for no-FUA:
>
>                 With FUA                Without FUA
>
> Reliable Write  1200     Write  850
> CMD13            100     CMD13  100
>                          Flush  120
>                          CMD13   65
> Total           1300           1135
>

Alright, so it seems like just checking whether the cache control
feature is available isn't sufficient when deciding to avoid FUA.

That said, in the next version I am going to add a card quirk, which
needs to be set too, to avoid FUA. Then we can see for what cards it
should actually be set.

What eMMC cards did you use in your tests above?

Kind regards
Uffe
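Ulf's plan for a card quirk could look roughly like the sketch below. The quirk name MMC_QUIRK_AVOID_REL_WRITE and its placement are illustrative assumptions for this thread, not the actual follow-up patch:

	/* Hypothetical sketch only: skip announcing REQ_FUA when the card
	 * has an internal cache *and* is flagged as having costly reliable
	 * writes; otherwise keep the reliable-write (FUA) path as today.
	 */
	if (mmc_cache_enabled(card->host) &&
	    (card->quirks & MMC_QUIRK_AVOID_REL_WRITE)) {
		cache_enabled = true;
	} else if (md->flags & MMC_BLK_CMD23 &&
		   (card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN ||
		    card->ext_csd.rel_sectors)) {
		md->flags |= MMC_BLK_REL_WR;
		fua_enabled = true;
		cache_enabled = true;
	}

	blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);

Such a quirk could then be set per manufacturer/device in the usual card quirk tables, once it is known which eMMCs actually benefit.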
On 16/03/23 14:12, Ulf Hansson wrote:
> On Tue, 14 Mar 2023 at 09:58, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 14/03/23 09:56, Ulf Hansson wrote:
>>> On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> [benchmark script, DUT list, timing data and observation 1) snipped - quoted verbatim from the messages above]
>
>>> 2)
>>> Assuming that a reliable write is triggering the internal cache to be
>>> flushed too, then we need less number of commands to be sent/acked to
>>> the eMMC - compared to not using FUA.
>>> This means less protocol
>>> overhead when using FUA - and perhaps that's what your tests is
>>> actually telling us?
>>
>> There is definitely less protocol overhead because the no-FUA
>> case has to do an extra CMD6 (flush) and CMD13.
>>
>> Note also, in this case auto-CMD23 is being used, which is why
>> it is not listed.
>>
>> Using an older system (no CQE but also auto-CMD23), resulted
>> in a win for no-FUA:
>>
>>                 With FUA                Without FUA
>>
>> Reliable Write  1200     Write  850
>> CMD13            100     CMD13  100
>>                          Flush  120
>>                          CMD13   65
>> Total           1300           1135
>>
>
> Alright, so it seems like just checking whether the cache control
> feature is available isn't sufficient when deciding to avoid FUA.
>
> That said, in the next version I am going to add a card quirk, which
> needs to be set too, to avoid FUA. Then we can see for what cards it
> should actually be set.
>
> What eMMC cards did you use in your tests above?

I don't know the part numbers:

             Manu ID          Name     Date      Machine
    Newer:   0x11 (Toshiba)   064G30   11/2019   Jasper Lake
    Older:   0x45             SEM128   07/2014   Lenovo Thinkpad 10
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 672ab90c4b2d..2a49531bf023 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2490,15 +2490,20 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
 		md->flags |= MMC_BLK_CMD23;
 	}
 
-	if (md->flags & MMC_BLK_CMD23 &&
-	    ((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) ||
-	     card->ext_csd.rel_sectors)) {
+	/*
+	 * REQ_FUA is supported through eMMC reliable writes, which has been
+	 * reported to be quite costly for some eMMCs. Therefore, let's rely
+	 * on flush requests (REQ_OP_FLUSH), if an internal cache is supported.
+	 */
+	if (mmc_cache_enabled(card->host)) {
+		cache_enabled = true;
+	} else if (md->flags & MMC_BLK_CMD23 &&
+		   (card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN ||
+		    card->ext_csd.rel_sectors)) {
 		md->flags |= MMC_BLK_REL_WR;
 		fua_enabled = true;
 		cache_enabled = true;
 	}
-	if (mmc_cache_enabled(card->host))
-		cache_enabled = true;
 
 	blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);
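For context on the two booleans in the hunk above, here is a rough sketch of what the block layer does with them. This is background on blk_queue_write_cache() / blk-flush behaviour as generally understood, not part of the patch:

	/*
	 * Sketch of the effect of the flags passed to
	 * blk_queue_write_cache(struct request_queue *q, bool wc, bool fua):
	 *
	 *   wc = true,  fua = true   - volatile cache and native FUA advertised;
	 *                              on mmc, REQ_FUA writes go out as
	 *                              reliable writes.
	 *   wc = true,  fua = false  - cache only; the block layer emulates
	 *                              REQ_FUA by issuing the write and then a
	 *                              post-flush (what this patch switches to).
	 *   wc = false, fua = false  - no volatile cache; flush/FUA requests
	 *                              are completed without reaching the device.
	 */
	blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);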
REQ_FUA is in general supported for eMMC cards, which translates into so
called "reliable writes". To support these write operations, CMD23
(MMC_CAP_CMD23) needs to be supported by the mmc host too, which is common
but not always the case.

For some eMMC devices, it has been reported that reliable writes are quite
costly, leading to performance degradations.

As a way to improve the situation, let's avoid announcing REQ_FUA support
if the eMMC supports an internal cache, as that allows us to rely solely on
flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
Note that those mmc hosts that lack CMD23 support are already using this
type of configuration, for whatever that is worth.

Reported-by: Wenchao Chen <wenchao.chen666@gmail.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
Note that I haven't been able to test this patch myself, but am relying on
Wenchao and others to help out. Sharing some performance numbers before and
after the patch would be nice.

Moreover, what is not clear to me (hence the RFC) is whether relying solely
on flush requests is sufficient and, as such, whether it's a good idea after
all. Comments are highly appreciated in this regard.

Kind regards
Ulf Hansson
---
 drivers/mmc/core/block.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)
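As for what "relying solely on flush-requests" costs on the wire: on the eMMC side, REQ_OP_FLUSH ends up triggering a cache flush via a SWITCH command. A rough sketch, paraphrased from memory of drivers/mmc/core/mmc_ops.c (exact signature, timeout value and error handling differ between kernel versions):

	/* Sketch: a flush is a CMD6 (SWITCH) setting EXT_CSD_FLUSH_CACHE,
	 * followed by the usual CMD13 busy polling - the extra CMD6/CMD13
	 * pair visible in the timing tables earlier in the thread.
	 */
	static int flush_cache_sketch(struct mmc_host *host)
	{
		int err = 0;

		if (mmc_cache_enabled(host))
			err = mmc_switch(host->card, EXT_CSD_CMD_SET_NORMAL,
					 EXT_CSD_FLUSH_CACHE, 1,
					 30 * 1000 /* generous timeout, ms */);
		return err;
	}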