Message ID: 20210609145315.25750-1-saiprakash.ranjan@codeaurora.org
State: New
Series: iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list
On 2021-06-09 15:53, Sai Prakash Ranjan wrote: > Currently for iommu_unmap() of large scatter-gather list with page size > elements, the majority of time is spent in flushing of partial walks in > __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for > arm-smmu). > > For example: to unmap a 32MB scatter-gather list with page size elements > (8192 entries), there are 16->2MB buffer unmaps based on the pgsize (2MB > for 4K granule) and each of 2MB will further result in 512 TLBIVAs (2MB/4K) > resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a huge > overhead. > > So instead use io_pgtable_tlb_flush_all() to invalidate the entire context > if size (pgsize) is greater than the granule size (4K, 16K, 64K). For this > example of 32MB scatter-gather list unmap, this results in just 16 ASID > based TLB invalidations or tlb_flush_all() callback (TLBIASID in case of > arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the performance of > unmaps drastically. > > Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all() > because for any granule with supported pgsizes, we will have at least 512 > TLB invalidations for which tlb_flush_all() is already recommended. For > example, take 4K granule with 2MB pgsize, this will result in 512 TLBIVA > in partial walk flush. > > Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: > (average over 10 iterations) > > Before this optimization: > > size iommu_map_sg iommu_unmap > 4K 2.067 us 1.854 us > 64K 9.598 us 8.802 us > 1M 148.890 us 130.718 us > 2M 305.864 us 67.291 us > 12M 1793.604 us 390.838 us > 16M 2386.848 us 518.187 us > 24M 3563.296 us 775.989 us > 32M 4747.171 us 1033.364 us > > After this optimization: > > size iommu_map_sg iommu_unmap > 4K 1.723 us 1.765 us > 64K 9.880 us 8.869 us > 1M 155.364 us 135.223 us > 2M 303.906 us 5.385 us > 12M 1786.557 us 21.250 us > 16M 2391.890 us 27.437 us > 24M 3570.895 us 39.937 us > 32M 4755.234 us 51.797 us > > This is further reduced once the map/unmap_pages() support gets in which > will result in just 1 tlb_flush_all() as opposed to 16 tlb_flush_all(). > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> > --- > drivers/iommu/io-pgtable-arm.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c > index 87def58e79b5..c3cb9add3179 100644 > --- a/drivers/iommu/io-pgtable-arm.c > +++ b/drivers/iommu/io-pgtable-arm.c > @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data, > > if (!iopte_leaf(pte, lvl, iop->fmt)) { > /* Also flush any partial walks */ > - io_pgtable_tlb_flush_walk(iop, iova, size, > - ARM_LPAE_GRANULE(data)); > + if (size > ARM_LPAE_GRANULE(data)) > + io_pgtable_tlb_flush_all(iop); > + else Erm, when will the above condition ever not be true? ;) Taking a step back, though, what about the impact to drivers other than SMMUv2? In particular I'm thinking of SMMUv3.2 where the whole range can be invalidated by VA in a single command anyway, so the additional penalties of TLBIALL are undesirable. Robin. > + io_pgtable_tlb_flush_walk(iop, iova, size, > + ARM_LPAE_GRANULE(data)); > ptep = iopte_deref(pte, data); > __arm_lpae_free_pgtable(data, lvl + 1, ptep); > } else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) { >
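For reference, here is a minimal standalone sketch (not part of the patch, just reproducing the arithmetic in the commit message above) that computes the invalidation counts for a 32MB unmap with a 4K granule and a 2MB block size:

/*
 * Standalone illustration only: the counts behind "16->2MB unmaps,
 * 512 TLBIVAs each, 8192 TLBIVAs total" from the commit message.
 */
#include <stdio.h>

int main(void)
{
        unsigned long total   = 32UL << 20;     /* 32MB scatter-gather list */
        unsigned long block   = 2UL << 20;      /* 2MB table-level unmap (4K granule) */
        unsigned long granule = 4UL << 10;      /* 4K page size */

        unsigned long unmaps   = total / block;         /* 16 table unmaps */
        unsigned long per_walk = block / granule;       /* 512 TLBIVAs per walk flush */

        printf("TLBIVAs without the patch: %lu\n", unmaps * per_walk); /* 8192 */
        printf("TLBIASIDs with the patch:  %lu\n", unmaps);            /* 16 */
        return 0;
}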
Hi Robin, On 2021-06-10 00:14, Robin Murphy wrote: > On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >> Currently for iommu_unmap() of large scatter-gather list with page >> size >> elements, the majority of time is spent in flushing of partial walks >> in >> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for >> arm-smmu). >> >> For example: to unmap a 32MB scatter-gather list with page size >> elements >> (8192 entries), there are 16->2MB buffer unmaps based on the pgsize >> (2MB >> for 4K granule) and each of 2MB will further result in 512 TLBIVAs >> (2MB/4K) >> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a >> huge >> overhead. >> >> So instead use io_pgtable_tlb_flush_all() to invalidate the entire >> context >> if size (pgsize) is greater than the granule size (4K, 16K, 64K). For >> this >> example of 32MB scatter-gather list unmap, this results in just 16 >> ASID >> based TLB invalidations or tlb_flush_all() callback (TLBIASID in case >> of >> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >> performance of >> unmaps drastically. >> >> Condition (size > granule size) is chosen for >> io_pgtable_tlb_flush_all() >> because for any granule with supported pgsizes, we will have at least >> 512 >> TLB invalidations for which tlb_flush_all() is already recommended. >> For >> example, take 4K granule with 2MB pgsize, this will result in 512 >> TLBIVA >> in partial walk flush. >> >> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >> (average over 10 iterations) >> >> Before this optimization: >> >> size iommu_map_sg iommu_unmap >> 4K 2.067 us 1.854 us >> 64K 9.598 us 8.802 us >> 1M 148.890 us 130.718 us >> 2M 305.864 us 67.291 us >> 12M 1793.604 us 390.838 us >> 16M 2386.848 us 518.187 us >> 24M 3563.296 us 775.989 us >> 32M 4747.171 us 1033.364 us >> >> After this optimization: >> >> size iommu_map_sg iommu_unmap >> 4K 1.723 us 1.765 us >> 64K 9.880 us 8.869 us >> 1M 155.364 us 135.223 us >> 2M 303.906 us 5.385 us >> 12M 1786.557 us 21.250 us >> 16M 2391.890 us 27.437 us >> 24M 3570.895 us 39.937 us >> 32M 4755.234 us 51.797 us >> >> This is further reduced once the map/unmap_pages() support gets in >> which >> will result in just 1 tlb_flush_all() as opposed to 16 >> tlb_flush_all(). >> >> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> >> --- >> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/iommu/io-pgtable-arm.c >> b/drivers/iommu/io-pgtable-arm.c >> index 87def58e79b5..c3cb9add3179 100644 >> --- a/drivers/iommu/io-pgtable-arm.c >> +++ b/drivers/iommu/io-pgtable-arm.c >> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >> arm_lpae_io_pgtable *data, >> if (!iopte_leaf(pte, lvl, iop->fmt)) { >> /* Also flush any partial walks */ >> - io_pgtable_tlb_flush_walk(iop, iova, size, >> - ARM_LPAE_GRANULE(data)); >> + if (size > ARM_LPAE_GRANULE(data)) >> + io_pgtable_tlb_flush_all(iop); >> + else > > Erm, when will the above condition ever not be true? ;) > Ah right, silly me :) > Taking a step back, though, what about the impact to drivers other > than SMMUv2? Other drivers would be msm_iommu.c, qcom_iommu.c which does the same thing as arm-smmu-v2 (page based invalidations), then there is ipmmu-vmsa.c which does tlb_flush_all() for flush walk. > In particular I'm thinking of SMMUv3.2 where the whole > range can be invalidated by VA in a single command anyway, so the > additional penalties of TLBIALL are undesirable. 
>

Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In this case of arm-smmu-v3.2, we
can tie up ARM_SMMU_FEAT_RANGE_INV with this quirk and have something
like below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
        io_pgtable_tlb_flush_walk(iop, iova, size,
                                  ARM_LPAE_GRANULE(data));
else
        io_pgtable_tlb_flush_all(iop);

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
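As an aside, a hypothetical sketch of how the proposed IO_PGTABLE_QUIRK_RANGE_INV could be wired up end to end; the bit value and the exact hook point in the SMMUv3 driver are assumptions for illustration, not existing code:

/* include/linux/io-pgtable.h: hypothetical quirk bit */
#define IO_PGTABLE_QUIRK_RANGE_INV      BIT(7)

/*
 * drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c, while building pgtbl_cfg:
 * a range-capable SMMU keeps the per-VA walk flush, everyone else falls
 * back to tlb_flush_all().
 */
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
        pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_RANGE_INV;

/* drivers/iommu/io-pgtable-arm.c, __arm_lpae_unmap(), table case */
if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
        io_pgtable_tlb_flush_walk(iop, iova, size,
                                  ARM_LPAE_GRANULE(data));
else
        io_pgtable_tlb_flush_all(iop);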
On 2021-06-10 06:24, Sai Prakash Ranjan wrote: > Hi Robin, > > On 2021-06-10 00:14, Robin Murphy wrote: >> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>> Currently for iommu_unmap() of large scatter-gather list with page size >>> elements, the majority of time is spent in flushing of partial walks in >>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for >>> arm-smmu). >>> >>> For example: to unmap a 32MB scatter-gather list with page size elements >>> (8192 entries), there are 16->2MB buffer unmaps based on the pgsize (2MB >>> for 4K granule) and each of 2MB will further result in 512 TLBIVAs >>> (2MB/4K) >>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a huge >>> overhead. >>> >>> So instead use io_pgtable_tlb_flush_all() to invalidate the entire >>> context >>> if size (pgsize) is greater than the granule size (4K, 16K, 64K). For >>> this >>> example of 32MB scatter-gather list unmap, this results in just 16 ASID >>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in case of >>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>> performance of >>> unmaps drastically. >>> >>> Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all() >>> because for any granule with supported pgsizes, we will have at least >>> 512 >>> TLB invalidations for which tlb_flush_all() is already recommended. For >>> example, take 4K granule with 2MB pgsize, this will result in 512 TLBIVA >>> in partial walk flush. >>> >>> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >>> (average over 10 iterations) >>> >>> Before this optimization: >>> >>> size iommu_map_sg iommu_unmap >>> 4K 2.067 us 1.854 us >>> 64K 9.598 us 8.802 us >>> 1M 148.890 us 130.718 us >>> 2M 305.864 us 67.291 us >>> 12M 1793.604 us 390.838 us >>> 16M 2386.848 us 518.187 us >>> 24M 3563.296 us 775.989 us >>> 32M 4747.171 us 1033.364 us >>> >>> After this optimization: >>> >>> size iommu_map_sg iommu_unmap >>> 4K 1.723 us 1.765 us >>> 64K 9.880 us 8.869 us >>> 1M 155.364 us 135.223 us >>> 2M 303.906 us 5.385 us >>> 12M 1786.557 us 21.250 us >>> 16M 2391.890 us 27.437 us >>> 24M 3570.895 us 39.937 us >>> 32M 4755.234 us 51.797 us >>> >>> This is further reduced once the map/unmap_pages() support gets in which >>> will result in just 1 tlb_flush_all() as opposed to 16 tlb_flush_all(). >>> >>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> >>> --- >>> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>> 1 file changed, 5 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/iommu/io-pgtable-arm.c >>> b/drivers/iommu/io-pgtable-arm.c >>> index 87def58e79b5..c3cb9add3179 100644 >>> --- a/drivers/iommu/io-pgtable-arm.c >>> +++ b/drivers/iommu/io-pgtable-arm.c >>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>> arm_lpae_io_pgtable *data, >>> if (!iopte_leaf(pte, lvl, iop->fmt)) { >>> /* Also flush any partial walks */ >>> - io_pgtable_tlb_flush_walk(iop, iova, size, >>> - ARM_LPAE_GRANULE(data)); >>> + if (size > ARM_LPAE_GRANULE(data)) >>> + io_pgtable_tlb_flush_all(iop); >>> + else >> >> Erm, when will the above condition ever not be true? ;) >> > > Ah right, silly me :) > >> Taking a step back, though, what about the impact to drivers other >> than SMMUv2? > > Other drivers would be msm_iommu.c, qcom_iommu.c which does the same > thing as arm-smmu-v2 (page based invalidations), then there is ipmmu-vmsa.c > which does tlb_flush_all() for flush walk. 
> >> In particular I'm thinking of SMMUv3.2 where the whole >> range can be invalidated by VA in a single command anyway, so the >> additional penalties of TLBIALL are undesirable. >> > > Right, so I am thinking we can have a new generic quirk > IO_PGTABLE_QUIRK_RANGE_INV > to choose between range based invalidations(tlb_flush_walk) and > tlb_flush_all(). > In this case of arm-smmu-v3.2, we can tie up ARM_SMMU_FEAT_RANGE_INV > with this quirk > and have something like below, thoughts? > > if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) > io_pgtable_tlb_flush_walk(iop, iova, size, > ARM_LPAE_GRANULE(data)); > else > io_pgtable_tlb_flush_all(iop); The design here has always been that io-pgtable says *what* needs invalidating, and we left it up to the drivers to decide exactly *how*. Even though things have evolved a bit I don't think that has fundamentally changed - tlb_flush_walk is now only used in this one place (technically I suppose it could be renamed tlb_flush_table but it's not worth the churn), so drivers can implement their own preferred table-invalidating behaviour even more easily than choosing whether to bounce a quirk through the common code or not. Consider what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... I'm instinctively a little twitchy about making this a blanket optimisation for SMMUv2 since I still remember the palaver with our display and MMU-500 integrations, where it had to implement the dodgy "prefetch" register to trigger translations before scanning out a frame since it couldn't ever afford a TLB miss, thus TLBIALL when freeing an old buffer would be a dangerous hammer to swing. However IIRC it also had to ensure everything was mapped as 2MB blocks to guarantee fitting everything in the TLBs in the first place, so I guess it would still work out OK due to never realistically unmapping a whole table at once anyway. Cheers, Robin.
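For illustration, a hedged sketch of the driver-side approach suggested here: let arm-smmu's stage-1 tlb_flush_walk invalidate by ASID itself, with no change to io-pgtable. The helper name is taken from drivers/iommu/arm/arm-smmu/arm-smmu.c as of this thread; treat the exact signature as an assumption rather than as the actual merged patch:

static void arm_smmu_tlb_inv_walk_s1(unsigned long iova, size_t size,
                                     size_t granule, void *cookie)
{
        /*
         * tlb_flush_walk is only called when a table is freed, i.e. at
         * least one full level's worth of entries (>= 512 for any
         * granule) is going away, so one TLBIASID (with its sync) is
         * cheaper than hundreds of TLBIVAs.
         */
        arm_smmu_tlb_inv_context_s1(cookie);
}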
Hi Robin, On 2021-06-10 14:38, Robin Murphy wrote: > On 2021-06-10 06:24, Sai Prakash Ranjan wrote: >> Hi Robin, >> >> On 2021-06-10 00:14, Robin Murphy wrote: >>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>>> Currently for iommu_unmap() of large scatter-gather list with page >>>> size >>>> elements, the majority of time is spent in flushing of partial walks >>>> in >>>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for >>>> arm-smmu). >>>> >>>> For example: to unmap a 32MB scatter-gather list with page size >>>> elements >>>> (8192 entries), there are 16->2MB buffer unmaps based on the pgsize >>>> (2MB >>>> for 4K granule) and each of 2MB will further result in 512 TLBIVAs >>>> (2MB/4K) >>>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a >>>> huge >>>> overhead. >>>> >>>> So instead use io_pgtable_tlb_flush_all() to invalidate the entire >>>> context >>>> if size (pgsize) is greater than the granule size (4K, 16K, 64K). >>>> For this >>>> example of 32MB scatter-gather list unmap, this results in just 16 >>>> ASID >>>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in >>>> case of >>>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>>> performance of >>>> unmaps drastically. >>>> >>>> Condition (size > granule size) is chosen for >>>> io_pgtable_tlb_flush_all() >>>> because for any granule with supported pgsizes, we will have at >>>> least 512 >>>> TLB invalidations for which tlb_flush_all() is already recommended. >>>> For >>>> example, take 4K granule with 2MB pgsize, this will result in 512 >>>> TLBIVA >>>> in partial walk flush. >>>> >>>> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >>>> (average over 10 iterations) >>>> >>>> Before this optimization: >>>> >>>> size iommu_map_sg iommu_unmap >>>> 4K 2.067 us 1.854 us >>>> 64K 9.598 us 8.802 us >>>> 1M 148.890 us 130.718 us >>>> 2M 305.864 us 67.291 us >>>> 12M 1793.604 us 390.838 us >>>> 16M 2386.848 us 518.187 us >>>> 24M 3563.296 us 775.989 us >>>> 32M 4747.171 us 1033.364 us >>>> >>>> After this optimization: >>>> >>>> size iommu_map_sg iommu_unmap >>>> 4K 1.723 us 1.765 us >>>> 64K 9.880 us 8.869 us >>>> 1M 155.364 us 135.223 us >>>> 2M 303.906 us 5.385 us >>>> 12M 1786.557 us 21.250 us >>>> 16M 2391.890 us 27.437 us >>>> 24M 3570.895 us 39.937 us >>>> 32M 4755.234 us 51.797 us >>>> >>>> This is further reduced once the map/unmap_pages() support gets in >>>> which >>>> will result in just 1 tlb_flush_all() as opposed to 16 >>>> tlb_flush_all(). >>>> >>>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> >>>> --- >>>> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/iommu/io-pgtable-arm.c >>>> b/drivers/iommu/io-pgtable-arm.c >>>> index 87def58e79b5..c3cb9add3179 100644 >>>> --- a/drivers/iommu/io-pgtable-arm.c >>>> +++ b/drivers/iommu/io-pgtable-arm.c >>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>>> arm_lpae_io_pgtable *data, >>>> if (!iopte_leaf(pte, lvl, iop->fmt)) { >>>> /* Also flush any partial walks */ >>>> - io_pgtable_tlb_flush_walk(iop, iova, size, >>>> - ARM_LPAE_GRANULE(data)); >>>> + if (size > ARM_LPAE_GRANULE(data)) >>>> + io_pgtable_tlb_flush_all(iop); >>>> + else >>> >>> Erm, when will the above condition ever not be true? ;) >>> >> >> Ah right, silly me :) >> >>> Taking a step back, though, what about the impact to drivers other >>> than SMMUv2? 
>> >> Other drivers would be msm_iommu.c, qcom_iommu.c which does the same >> thing as arm-smmu-v2 (page based invalidations), then there is >> ipmmu-vmsa.c >> which does tlb_flush_all() for flush walk. >> >>> In particular I'm thinking of SMMUv3.2 where the whole >>> range can be invalidated by VA in a single command anyway, so the >>> additional penalties of TLBIALL are undesirable. >>> >> >> Right, so I am thinking we can have a new generic quirk >> IO_PGTABLE_QUIRK_RANGE_INV >> to choose between range based invalidations(tlb_flush_walk) and >> tlb_flush_all(). >> In this case of arm-smmu-v3.2, we can tie up ARM_SMMU_FEAT_RANGE_INV >> with this quirk >> and have something like below, thoughts? >> >> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) >> io_pgtable_tlb_flush_walk(iop, iova, size, >> ARM_LPAE_GRANULE(data)); >> else >> io_pgtable_tlb_flush_all(iop); > > The design here has always been that io-pgtable says *what* needs > invalidating, and we left it up to the drivers to decide exactly > *how*. Even though things have evolved a bit I don't think that has > fundamentally changed - tlb_flush_walk is now only used in this one > place (technically I suppose it could be renamed tlb_flush_table but > it's not worth the churn), so drivers can implement their own > preferred table-invalidating behaviour even more easily than choosing > whether to bounce a quirk through the common code or not. Consider > what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... > Thanks for the explanation, makes sense. If I am not mistaken, I see that you are suggesting to move this logic based on size and granule-size to arm-smmu-v2 driver and one more thing below.. > I'm instinctively a little twitchy about making this a blanket > optimisation for SMMUv2 since I still remember the palaver with our > display and MMU-500 integrations, where it had to implement the dodgy > "prefetch" register to trigger translations before scanning out a > frame since it couldn't ever afford a TLB miss, thus TLBIALL when > freeing an old buffer would be a dangerous hammer to swing. However > IIRC it also had to ensure everything was mapped as 2MB blocks to > guarantee fitting everything in the TLBs in the first place, so I > guess it would still work out OK due to never realistically unmapping > a whole table at once anyway. > You are also hinting to not do this for all SMMUv2 implementations and make it QCOM specific? If I am wrong in my assumptions here, please let me know otherwise I will prepare the patch :) Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
On 2021-06-10 10:36, Sai Prakash Ranjan wrote: > Hi Robin, > > On 2021-06-10 14:38, Robin Murphy wrote: >> On 2021-06-10 06:24, Sai Prakash Ranjan wrote: >>> Hi Robin, >>> >>> On 2021-06-10 00:14, Robin Murphy wrote: >>>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>>>> Currently for iommu_unmap() of large scatter-gather list with page >>>>> size >>>>> elements, the majority of time is spent in flushing of partial >>>>> walks in >>>>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for >>>>> arm-smmu). >>>>> >>>>> For example: to unmap a 32MB scatter-gather list with page size >>>>> elements >>>>> (8192 entries), there are 16->2MB buffer unmaps based on the pgsize >>>>> (2MB >>>>> for 4K granule) and each of 2MB will further result in 512 TLBIVAs >>>>> (2MB/4K) >>>>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a >>>>> huge >>>>> overhead. >>>>> >>>>> So instead use io_pgtable_tlb_flush_all() to invalidate the entire >>>>> context >>>>> if size (pgsize) is greater than the granule size (4K, 16K, 64K). >>>>> For this >>>>> example of 32MB scatter-gather list unmap, this results in just 16 >>>>> ASID >>>>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in >>>>> case of >>>>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>>>> performance of >>>>> unmaps drastically. >>>>> >>>>> Condition (size > granule size) is chosen for >>>>> io_pgtable_tlb_flush_all() >>>>> because for any granule with supported pgsizes, we will have at >>>>> least 512 >>>>> TLB invalidations for which tlb_flush_all() is already recommended. >>>>> For >>>>> example, take 4K granule with 2MB pgsize, this will result in 512 >>>>> TLBIVA >>>>> in partial walk flush. >>>>> >>>>> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >>>>> (average over 10 iterations) >>>>> >>>>> Before this optimization: >>>>> >>>>> size iommu_map_sg iommu_unmap >>>>> 4K 2.067 us 1.854 us >>>>> 64K 9.598 us 8.802 us >>>>> 1M 148.890 us 130.718 us >>>>> 2M 305.864 us 67.291 us >>>>> 12M 1793.604 us 390.838 us >>>>> 16M 2386.848 us 518.187 us >>>>> 24M 3563.296 us 775.989 us >>>>> 32M 4747.171 us 1033.364 us >>>>> >>>>> After this optimization: >>>>> >>>>> size iommu_map_sg iommu_unmap >>>>> 4K 1.723 us 1.765 us >>>>> 64K 9.880 us 8.869 us >>>>> 1M 155.364 us 135.223 us >>>>> 2M 303.906 us 5.385 us >>>>> 12M 1786.557 us 21.250 us >>>>> 16M 2391.890 us 27.437 us >>>>> 24M 3570.895 us 39.937 us >>>>> 32M 4755.234 us 51.797 us >>>>> >>>>> This is further reduced once the map/unmap_pages() support gets in >>>>> which >>>>> will result in just 1 tlb_flush_all() as opposed to 16 >>>>> tlb_flush_all(). >>>>> >>>>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> >>>>> --- >>>>> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/iommu/io-pgtable-arm.c >>>>> b/drivers/iommu/io-pgtable-arm.c >>>>> index 87def58e79b5..c3cb9add3179 100644 >>>>> --- a/drivers/iommu/io-pgtable-arm.c >>>>> +++ b/drivers/iommu/io-pgtable-arm.c >>>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>>>> arm_lpae_io_pgtable *data, >>>>> if (!iopte_leaf(pte, lvl, iop->fmt)) { >>>>> /* Also flush any partial walks */ >>>>> - io_pgtable_tlb_flush_walk(iop, iova, size, >>>>> - ARM_LPAE_GRANULE(data)); >>>>> + if (size > ARM_LPAE_GRANULE(data)) >>>>> + io_pgtable_tlb_flush_all(iop); >>>>> + else >>>> >>>> Erm, when will the above condition ever not be true? 
;) >>>> >>> >>> Ah right, silly me :) >>> >>>> Taking a step back, though, what about the impact to drivers other >>>> than SMMUv2? >>> >>> Other drivers would be msm_iommu.c, qcom_iommu.c which does the same >>> thing as arm-smmu-v2 (page based invalidations), then there is >>> ipmmu-vmsa.c >>> which does tlb_flush_all() for flush walk. >>> >>>> In particular I'm thinking of SMMUv3.2 where the whole >>>> range can be invalidated by VA in a single command anyway, so the >>>> additional penalties of TLBIALL are undesirable. >>>> >>> >>> Right, so I am thinking we can have a new generic quirk >>> IO_PGTABLE_QUIRK_RANGE_INV >>> to choose between range based invalidations(tlb_flush_walk) and >>> tlb_flush_all(). >>> In this case of arm-smmu-v3.2, we can tie up ARM_SMMU_FEAT_RANGE_INV >>> with this quirk >>> and have something like below, thoughts? >>> >>> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) >>> io_pgtable_tlb_flush_walk(iop, iova, size, >>> ARM_LPAE_GRANULE(data)); >>> else >>> io_pgtable_tlb_flush_all(iop); >> >> The design here has always been that io-pgtable says *what* needs >> invalidating, and we left it up to the drivers to decide exactly >> *how*. Even though things have evolved a bit I don't think that has >> fundamentally changed - tlb_flush_walk is now only used in this one >> place (technically I suppose it could be renamed tlb_flush_table but >> it's not worth the churn), so drivers can implement their own >> preferred table-invalidating behaviour even more easily than choosing >> whether to bounce a quirk through the common code or not. Consider >> what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... >> > > Thanks for the explanation, makes sense. If I am not mistaken, I see that > you are suggesting to move this logic based on size and granule-size to > arm-smmu-v2 driver and one more thing below.. Simpler than that - following on from my original comment above, tlb_flush_walk already knows it's invalidating at least one full level of table so there's nothing it even needs to check. Adding a size-based heuristic to arm_smmu_inv_range_* for leaf invalidations would be a separate concern (note that changing the non-leaf behaviour might allow cleaning up the "reg" indirection there too). >> I'm instinctively a little twitchy about making this a blanket >> optimisation for SMMUv2 since I still remember the palaver with our >> display and MMU-500 integrations, where it had to implement the dodgy >> "prefetch" register to trigger translations before scanning out a >> frame since it couldn't ever afford a TLB miss, thus TLBIALL when >> freeing an old buffer would be a dangerous hammer to swing. However >> IIRC it also had to ensure everything was mapped as 2MB blocks to >> guarantee fitting everything in the TLBs in the first place, so I >> guess it would still work out OK due to never realistically unmapping >> a whole table at once anyway. >> > > You are also hinting to not do this for all SMMUv2 implementations and make > it QCOM specific? No, I'm really just wary that the performance implication is more complex than a simple unmap latency benefit, possibly even for QCOM. Consider the access latency, power and memory bandwidth hit from all the additional pagetable walks incurred by other ongoing traffic fighting against those 16 successive TLBIASIDs. Whether it's an overall win really depends on the specific workload and system conditions as much as the SMMU implementation. 
Thinking some more, I wonder if the Tegra folks might have an opinion to
add here, given that their multiple-SMMU solution was seemingly about
trying to get enough TLB and pagetable walk bandwidth in the first place?

Robin.
Hi Robin, On 2021-06-10 17:03, Robin Murphy wrote: > On 2021-06-10 10:36, Sai Prakash Ranjan wrote: >> Hi Robin, >> >> On 2021-06-10 14:38, Robin Murphy wrote: >>> On 2021-06-10 06:24, Sai Prakash Ranjan wrote: >>>> Hi Robin, >>>> >>>> On 2021-06-10 00:14, Robin Murphy wrote: >>>>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>>>>> Currently for iommu_unmap() of large scatter-gather list with page >>>>>> size >>>>>> elements, the majority of time is spent in flushing of partial >>>>>> walks in >>>>>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA >>>>>> for >>>>>> arm-smmu). >>>>>> >>>>>> For example: to unmap a 32MB scatter-gather list with page size >>>>>> elements >>>>>> (8192 entries), there are 16->2MB buffer unmaps based on the >>>>>> pgsize (2MB >>>>>> for 4K granule) and each of 2MB will further result in 512 TLBIVAs >>>>>> (2MB/4K) >>>>>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing >>>>>> a huge >>>>>> overhead. >>>>>> >>>>>> So instead use io_pgtable_tlb_flush_all() to invalidate the entire >>>>>> context >>>>>> if size (pgsize) is greater than the granule size (4K, 16K, 64K). >>>>>> For this >>>>>> example of 32MB scatter-gather list unmap, this results in just 16 >>>>>> ASID >>>>>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in >>>>>> case of >>>>>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>>>>> performance of >>>>>> unmaps drastically. >>>>>> >>>>>> Condition (size > granule size) is chosen for >>>>>> io_pgtable_tlb_flush_all() >>>>>> because for any granule with supported pgsizes, we will have at >>>>>> least 512 >>>>>> TLB invalidations for which tlb_flush_all() is already >>>>>> recommended. For >>>>>> example, take 4K granule with 2MB pgsize, this will result in 512 >>>>>> TLBIVA >>>>>> in partial walk flush. >>>>>> >>>>>> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >>>>>> (average over 10 iterations) >>>>>> >>>>>> Before this optimization: >>>>>> >>>>>> size iommu_map_sg iommu_unmap >>>>>> 4K 2.067 us 1.854 us >>>>>> 64K 9.598 us 8.802 us >>>>>> 1M 148.890 us 130.718 us >>>>>> 2M 305.864 us 67.291 us >>>>>> 12M 1793.604 us 390.838 us >>>>>> 16M 2386.848 us 518.187 us >>>>>> 24M 3563.296 us 775.989 us >>>>>> 32M 4747.171 us 1033.364 us >>>>>> >>>>>> After this optimization: >>>>>> >>>>>> size iommu_map_sg iommu_unmap >>>>>> 4K 1.723 us 1.765 us >>>>>> 64K 9.880 us 8.869 us >>>>>> 1M 155.364 us 135.223 us >>>>>> 2M 303.906 us 5.385 us >>>>>> 12M 1786.557 us 21.250 us >>>>>> 16M 2391.890 us 27.437 us >>>>>> 24M 3570.895 us 39.937 us >>>>>> 32M 4755.234 us 51.797 us >>>>>> >>>>>> This is further reduced once the map/unmap_pages() support gets in >>>>>> which >>>>>> will result in just 1 tlb_flush_all() as opposed to 16 >>>>>> tlb_flush_all(). 
>>>>>> >>>>>> Signed-off-by: Sai Prakash Ranjan >>>>>> <saiprakash.ranjan@codeaurora.org> >>>>>> --- >>>>>> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/drivers/iommu/io-pgtable-arm.c >>>>>> b/drivers/iommu/io-pgtable-arm.c >>>>>> index 87def58e79b5..c3cb9add3179 100644 >>>>>> --- a/drivers/iommu/io-pgtable-arm.c >>>>>> +++ b/drivers/iommu/io-pgtable-arm.c >>>>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>>>>> arm_lpae_io_pgtable *data, >>>>>> if (!iopte_leaf(pte, lvl, iop->fmt)) { >>>>>> /* Also flush any partial walks */ >>>>>> - io_pgtable_tlb_flush_walk(iop, iova, size, >>>>>> - ARM_LPAE_GRANULE(data)); >>>>>> + if (size > ARM_LPAE_GRANULE(data)) >>>>>> + io_pgtable_tlb_flush_all(iop); >>>>>> + else >>>>> >>>>> Erm, when will the above condition ever not be true? ;) >>>>> >>>> >>>> Ah right, silly me :) >>>> >>>>> Taking a step back, though, what about the impact to drivers other >>>>> than SMMUv2? >>>> >>>> Other drivers would be msm_iommu.c, qcom_iommu.c which does the same >>>> thing as arm-smmu-v2 (page based invalidations), then there is >>>> ipmmu-vmsa.c >>>> which does tlb_flush_all() for flush walk. >>>> >>>>> In particular I'm thinking of SMMUv3.2 where the whole >>>>> range can be invalidated by VA in a single command anyway, so the >>>>> additional penalties of TLBIALL are undesirable. >>>>> >>>> >>>> Right, so I am thinking we can have a new generic quirk >>>> IO_PGTABLE_QUIRK_RANGE_INV >>>> to choose between range based invalidations(tlb_flush_walk) and >>>> tlb_flush_all(). >>>> In this case of arm-smmu-v3.2, we can tie up ARM_SMMU_FEAT_RANGE_INV >>>> with this quirk >>>> and have something like below, thoughts? >>>> >>>> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) >>>> io_pgtable_tlb_flush_walk(iop, iova, size, >>>> ARM_LPAE_GRANULE(data)); >>>> else >>>> io_pgtable_tlb_flush_all(iop); >>> >>> The design here has always been that io-pgtable says *what* needs >>> invalidating, and we left it up to the drivers to decide exactly >>> *how*. Even though things have evolved a bit I don't think that has >>> fundamentally changed - tlb_flush_walk is now only used in this one >>> place (technically I suppose it could be renamed tlb_flush_table but >>> it's not worth the churn), so drivers can implement their own >>> preferred table-invalidating behaviour even more easily than choosing >>> whether to bounce a quirk through the common code or not. Consider >>> what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... >>> >> >> Thanks for the explanation, makes sense. If I am not mistaken, I see >> that >> you are suggesting to move this logic based on size and granule-size >> to >> arm-smmu-v2 driver and one more thing below.. > > Simpler than that - following on from my original comment above, > tlb_flush_walk already knows it's invalidating at least one full level > of table so there's nothing it even needs to check. Adding a > size-based heuristic to arm_smmu_inv_range_* for leaf invalidations > would be a separate concern (note that changing the non-leaf behaviour > might allow cleaning up the "reg" indirection there too). Right, sorry I didn't mean to mention the size check as it was obvious from your first reply, but rather just calling impl->tlb_inv() in arm_smmu_tlb_inv_walk_s1(). 
> >>> I'm instinctively a little twitchy about making this a blanket >>> optimisation for SMMUv2 since I still remember the palaver with our >>> display and MMU-500 integrations, where it had to implement the dodgy >>> "prefetch" register to trigger translations before scanning out a >>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when >>> freeing an old buffer would be a dangerous hammer to swing. However >>> IIRC it also had to ensure everything was mapped as 2MB blocks to >>> guarantee fitting everything in the TLBs in the first place, so I >>> guess it would still work out OK due to never realistically unmapping >>> a whole table at once anyway. >>> >> >> You are also hinting to not do this for all SMMUv2 implementations and >> make >> it QCOM specific? > > No, I'm really just wary that the performance implication is more > complex than a simple unmap latency benefit, possibly even for QCOM. > Consider the access latency, power and memory bandwidth hit from all > the additional pagetable walks incurred by other ongoing traffic > fighting against those 16 successive TLBIASIDs. Whether it's an > overall win really depends on the specific workload and system > conditions as much as the SMMU implementation. No, the unmap latency is not just in some test case written, the issue is very real and we have workloads where camera is reporting frame drops because of this unmap latency in the order of 100s of milliseconds. And hardware team recommends using ASID based invalidations for anything larger than 128 TLB entries. So yes, we have taken note of impacts here before going this way and hence feel more inclined to make this qcom specific if required. > Thinking some more, I > wonder if the Tegra folks might have an opinion to add here, given > that their multiple-SMMU solution was seemingly about trying to get > enough TLB and pagetable walk bandwidth in the first place? > Sure but I do not see how that will help with the unmap latency? Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
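For illustration, a minimal sketch (not from the patch) of how the "anything larger than 128 TLB entries" guidance above could be applied as a size threshold in a driver's leaf range-invalidation path; inv_by_va() and inv_by_asid() are hypothetical placeholders for the driver's existing TLBIVA/TLBIASID writes:

#include <stddef.h>

static void inv_by_va(unsigned long iova, void *cookie) { /* write TLBIVA */ }
static void inv_by_asid(void *cookie)                   { /* write TLBIASID */ }

#define TLBI_VA_THRESHOLD       128     /* entries, per the HW guidance above */

static void inv_range(unsigned long iova, size_t size, size_t granule,
                      void *cookie)
{
        if (size >= (size_t)TLBI_VA_THRESHOLD * granule) {
                /* One ASID-wide invalidation beats 128+ per-page ones */
                inv_by_asid(cookie);
                return;
        }
        while (size) {
                inv_by_va(iova, cookie);
                iova += granule;
                size -= granule;
        }
}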
On Thu, Jun 10, 2021 at 12:33:56PM +0100, Robin Murphy wrote: > On 2021-06-10 10:36, Sai Prakash Ranjan wrote: > > Hi Robin, > > > > On 2021-06-10 14:38, Robin Murphy wrote: > > > On 2021-06-10 06:24, Sai Prakash Ranjan wrote: > > > > Hi Robin, > > > > > > > > On 2021-06-10 00:14, Robin Murphy wrote: > > > > > On 2021-06-09 15:53, Sai Prakash Ranjan wrote: > > > > > > Currently for iommu_unmap() of large scatter-gather list > > > > > > with page size > > > > > > elements, the majority of time is spent in flushing of > > > > > > partial walks in > > > > > > __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for > > > > > > arm-smmu). > > > > > > > > > > > > For example: to unmap a 32MB scatter-gather list with > > > > > > page size elements > > > > > > (8192 entries), there are 16->2MB buffer unmaps based on > > > > > > the pgsize (2MB > > > > > > for 4K granule) and each of 2MB will further result in > > > > > > 512 TLBIVAs (2MB/4K) > > > > > > resulting in a total of 8192 TLBIVAs (512*16) for > > > > > > 16->2MB causing a huge > > > > > > overhead. > > > > > > > > > > > > So instead use io_pgtable_tlb_flush_all() to invalidate > > > > > > the entire context > > > > > > if size (pgsize) is greater than the granule size (4K, > > > > > > 16K, 64K). For this > > > > > > example of 32MB scatter-gather list unmap, this results > > > > > > in just 16 ASID > > > > > > based TLB invalidations or tlb_flush_all() callback > > > > > > (TLBIASID in case of > > > > > > arm-smmu) as opposed to 8192 TLBIVAs thereby increasing > > > > > > the performance of > > > > > > unmaps drastically. > > > > > > > > > > > > Condition (size > granule size) is chosen for > > > > > > io_pgtable_tlb_flush_all() > > > > > > because for any granule with supported pgsizes, we will > > > > > > have at least 512 > > > > > > TLB invalidations for which tlb_flush_all() is already > > > > > > recommended. For > > > > > > example, take 4K granule with 2MB pgsize, this will > > > > > > result in 512 TLBIVA > > > > > > in partial walk flush. > > > > > > > > > > > > Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: > > > > > > (average over 10 iterations) > > > > > > > > > > > > Before this optimization: > > > > > > > > > > > > size iommu_map_sg iommu_unmap > > > > > > 4K 2.067 us 1.854 us > > > > > > 64K 9.598 us 8.802 us > > > > > > 1M 148.890 us 130.718 us > > > > > > 2M 305.864 us 67.291 us > > > > > > 12M 1793.604 us 390.838 us > > > > > > 16M 2386.848 us 518.187 us > > > > > > 24M 3563.296 us 775.989 us > > > > > > 32M 4747.171 us 1033.364 us > > > > > > > > > > > > After this optimization: > > > > > > > > > > > > size iommu_map_sg iommu_unmap > > > > > > 4K 1.723 us 1.765 us > > > > > > 64K 9.880 us 8.869 us > > > > > > 1M 155.364 us 135.223 us > > > > > > 2M 303.906 us 5.385 us > > > > > > 12M 1786.557 us 21.250 us > > > > > > 16M 2391.890 us 27.437 us > > > > > > 24M 3570.895 us 39.937 us > > > > > > 32M 4755.234 us 51.797 us > > > > > > > > > > > > This is further reduced once the map/unmap_pages() > > > > > > support gets in which > > > > > > will result in just 1 tlb_flush_all() as opposed to 16 > > > > > > tlb_flush_all(). 
> > > > > > > > > > > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> > > > > > > --- > > > > > > drivers/iommu/io-pgtable-arm.c | 7 +++++-- > > > > > > 1 file changed, 5 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/drivers/iommu/io-pgtable-arm.c > > > > > > b/drivers/iommu/io-pgtable-arm.c > > > > > > index 87def58e79b5..c3cb9add3179 100644 > > > > > > --- a/drivers/iommu/io-pgtable-arm.c > > > > > > +++ b/drivers/iommu/io-pgtable-arm.c > > > > > > @@ -589,8 +589,11 @@ static size_t > > > > > > __arm_lpae_unmap(struct arm_lpae_io_pgtable *data, > > > > > > if (!iopte_leaf(pte, lvl, iop->fmt)) { > > > > > > /* Also flush any partial walks */ > > > > > > - io_pgtable_tlb_flush_walk(iop, iova, size, > > > > > > - ARM_LPAE_GRANULE(data)); > > > > > > + if (size > ARM_LPAE_GRANULE(data)) > > > > > > + io_pgtable_tlb_flush_all(iop); > > > > > > + else > > > > > > > > > > Erm, when will the above condition ever not be true? ;) > > > > > > > > > > > > > Ah right, silly me :) > > > > > > > > > Taking a step back, though, what about the impact to drivers other > > > > > than SMMUv2? > > > > > > > > Other drivers would be msm_iommu.c, qcom_iommu.c which does the same > > > > thing as arm-smmu-v2 (page based invalidations), then there is > > > > ipmmu-vmsa.c > > > > which does tlb_flush_all() for flush walk. > > > > > > > > > In particular I'm thinking of SMMUv3.2 where the whole > > > > > range can be invalidated by VA in a single command anyway, so the > > > > > additional penalties of TLBIALL are undesirable. > > > > > > > > > > > > > Right, so I am thinking we can have a new generic quirk > > > > IO_PGTABLE_QUIRK_RANGE_INV > > > > to choose between range based invalidations(tlb_flush_walk) and > > > > tlb_flush_all(). > > > > In this case of arm-smmu-v3.2, we can tie up > > > > ARM_SMMU_FEAT_RANGE_INV with this quirk > > > > and have something like below, thoughts? > > > > > > > > if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) > > > > io_pgtable_tlb_flush_walk(iop, iova, size, > > > > ARM_LPAE_GRANULE(data)); > > > > else > > > > io_pgtable_tlb_flush_all(iop); > > > > > > The design here has always been that io-pgtable says *what* needs > > > invalidating, and we left it up to the drivers to decide exactly > > > *how*. Even though things have evolved a bit I don't think that has > > > fundamentally changed - tlb_flush_walk is now only used in this one > > > place (technically I suppose it could be renamed tlb_flush_table but > > > it's not worth the churn), so drivers can implement their own > > > preferred table-invalidating behaviour even more easily than choosing > > > whether to bounce a quirk through the common code or not. Consider > > > what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... > > > > > > > Thanks for the explanation, makes sense. If I am not mistaken, I see that > > you are suggesting to move this logic based on size and granule-size to > > arm-smmu-v2 driver and one more thing below.. > > Simpler than that - following on from my original comment above, > tlb_flush_walk already knows it's invalidating at least one full level of > table so there's nothing it even needs to check. Adding a size-based > heuristic to arm_smmu_inv_range_* for leaf invalidations would be a separate > concern (note that changing the non-leaf behaviour might allow cleaning up > the "reg" indirection there too). 
> > > > I'm instinctively a little twitchy about making this a blanket > > > optimisation for SMMUv2 since I still remember the palaver with our > > > display and MMU-500 integrations, where it had to implement the dodgy > > > "prefetch" register to trigger translations before scanning out a > > > frame since it couldn't ever afford a TLB miss, thus TLBIALL when > > > freeing an old buffer would be a dangerous hammer to swing. However > > > IIRC it also had to ensure everything was mapped as 2MB blocks to > > > guarantee fitting everything in the TLBs in the first place, so I > > > guess it would still work out OK due to never realistically unmapping > > > a whole table at once anyway. > > > > > > > You are also hinting to not do this for all SMMUv2 implementations and make > > it QCOM specific? > > No, I'm really just wary that the performance implication is more complex > than a simple unmap latency benefit, possibly even for QCOM. Consider the > access latency, power and memory bandwidth hit from all the additional > pagetable walks incurred by other ongoing traffic fighting against those 16 > successive TLBIASIDs. Whether it's an overall win really depends on the > specific workload and system conditions as much as the SMMU implementation. > Thinking some more, I wonder if the Tegra folks might have an opinion to add > here, given that their multiple-SMMU solution was seemingly about trying to > get enough TLB and pagetable walk bandwidth in the first place? Yes, so Tegra194 has three different instances of the SMMU. Two of them are programmed in an interleaved mode to basically double the bandwidth available. A third instance is specifically reserved for isochronous memory clients and is used by the display controller to avoid potential pressure on the dual-SMMU from interfering with display functionality. I'm not sure if we've ever measured the impact of map/unmap operations under heavy load. Krishna, do you know if this might be helpful for some of the use-cases we have on Tegra? Or if it might negatively impact performance under pressure? Thierry
On 2021-06-10 12:54, Sai Prakash Ranjan wrote: > Hi Robin, > > On 2021-06-10 17:03, Robin Murphy wrote: >> On 2021-06-10 10:36, Sai Prakash Ranjan wrote: >>> Hi Robin, >>> >>> On 2021-06-10 14:38, Robin Murphy wrote: >>>> On 2021-06-10 06:24, Sai Prakash Ranjan wrote: >>>>> Hi Robin, >>>>> >>>>> On 2021-06-10 00:14, Robin Murphy wrote: >>>>>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>>>>>> Currently for iommu_unmap() of large scatter-gather list with >>>>>>> page size >>>>>>> elements, the majority of time is spent in flushing of partial >>>>>>> walks in >>>>>>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for >>>>>>> arm-smmu). >>>>>>> >>>>>>> For example: to unmap a 32MB scatter-gather list with page size >>>>>>> elements >>>>>>> (8192 entries), there are 16->2MB buffer unmaps based on the >>>>>>> pgsize (2MB >>>>>>> for 4K granule) and each of 2MB will further result in 512 >>>>>>> TLBIVAs (2MB/4K) >>>>>>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing >>>>>>> a huge >>>>>>> overhead. >>>>>>> >>>>>>> So instead use io_pgtable_tlb_flush_all() to invalidate the >>>>>>> entire context >>>>>>> if size (pgsize) is greater than the granule size (4K, 16K, 64K). >>>>>>> For this >>>>>>> example of 32MB scatter-gather list unmap, this results in just >>>>>>> 16 ASID >>>>>>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in >>>>>>> case of >>>>>>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>>>>>> performance of >>>>>>> unmaps drastically. >>>>>>> >>>>>>> Condition (size > granule size) is chosen for >>>>>>> io_pgtable_tlb_flush_all() >>>>>>> because for any granule with supported pgsizes, we will have at >>>>>>> least 512 >>>>>>> TLB invalidations for which tlb_flush_all() is already >>>>>>> recommended. For >>>>>>> example, take 4K granule with 2MB pgsize, this will result in 512 >>>>>>> TLBIVA >>>>>>> in partial walk flush. >>>>>>> >>>>>>> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >>>>>>> (average over 10 iterations) >>>>>>> >>>>>>> Before this optimization: >>>>>>> >>>>>>> size iommu_map_sg iommu_unmap >>>>>>> 4K 2.067 us 1.854 us >>>>>>> 64K 9.598 us 8.802 us >>>>>>> 1M 148.890 us 130.718 us >>>>>>> 2M 305.864 us 67.291 us >>>>>>> 12M 1793.604 us 390.838 us >>>>>>> 16M 2386.848 us 518.187 us >>>>>>> 24M 3563.296 us 775.989 us >>>>>>> 32M 4747.171 us 1033.364 us >>>>>>> >>>>>>> After this optimization: >>>>>>> >>>>>>> size iommu_map_sg iommu_unmap >>>>>>> 4K 1.723 us 1.765 us >>>>>>> 64K 9.880 us 8.869 us >>>>>>> 1M 155.364 us 135.223 us >>>>>>> 2M 303.906 us 5.385 us >>>>>>> 12M 1786.557 us 21.250 us >>>>>>> 16M 2391.890 us 27.437 us >>>>>>> 24M 3570.895 us 39.937 us >>>>>>> 32M 4755.234 us 51.797 us >>>>>>> >>>>>>> This is further reduced once the map/unmap_pages() support gets >>>>>>> in which >>>>>>> will result in just 1 tlb_flush_all() as opposed to 16 >>>>>>> tlb_flush_all(). 
>>>>>>> >>>>>>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> >>>>>>> --- >>>>>>> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/iommu/io-pgtable-arm.c >>>>>>> b/drivers/iommu/io-pgtable-arm.c >>>>>>> index 87def58e79b5..c3cb9add3179 100644 >>>>>>> --- a/drivers/iommu/io-pgtable-arm.c >>>>>>> +++ b/drivers/iommu/io-pgtable-arm.c >>>>>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>>>>>> arm_lpae_io_pgtable *data, >>>>>>> if (!iopte_leaf(pte, lvl, iop->fmt)) { >>>>>>> /* Also flush any partial walks */ >>>>>>> - io_pgtable_tlb_flush_walk(iop, iova, size, >>>>>>> - ARM_LPAE_GRANULE(data)); >>>>>>> + if (size > ARM_LPAE_GRANULE(data)) >>>>>>> + io_pgtable_tlb_flush_all(iop); >>>>>>> + else >>>>>> >>>>>> Erm, when will the above condition ever not be true? ;) >>>>>> >>>>> >>>>> Ah right, silly me :) >>>>> >>>>>> Taking a step back, though, what about the impact to drivers other >>>>>> than SMMUv2? >>>>> >>>>> Other drivers would be msm_iommu.c, qcom_iommu.c which does the same >>>>> thing as arm-smmu-v2 (page based invalidations), then there is >>>>> ipmmu-vmsa.c >>>>> which does tlb_flush_all() for flush walk. >>>>> >>>>>> In particular I'm thinking of SMMUv3.2 where the whole >>>>>> range can be invalidated by VA in a single command anyway, so the >>>>>> additional penalties of TLBIALL are undesirable. >>>>>> >>>>> >>>>> Right, so I am thinking we can have a new generic quirk >>>>> IO_PGTABLE_QUIRK_RANGE_INV >>>>> to choose between range based invalidations(tlb_flush_walk) and >>>>> tlb_flush_all(). >>>>> In this case of arm-smmu-v3.2, we can tie up >>>>> ARM_SMMU_FEAT_RANGE_INV with this quirk >>>>> and have something like below, thoughts? >>>>> >>>>> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) >>>>> io_pgtable_tlb_flush_walk(iop, iova, size, >>>>> ARM_LPAE_GRANULE(data)); >>>>> else >>>>> io_pgtable_tlb_flush_all(iop); >>>> >>>> The design here has always been that io-pgtable says *what* needs >>>> invalidating, and we left it up to the drivers to decide exactly >>>> *how*. Even though things have evolved a bit I don't think that has >>>> fundamentally changed - tlb_flush_walk is now only used in this one >>>> place (technically I suppose it could be renamed tlb_flush_table but >>>> it's not worth the churn), so drivers can implement their own >>>> preferred table-invalidating behaviour even more easily than choosing >>>> whether to bounce a quirk through the common code or not. Consider >>>> what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... >>>> >>> >>> Thanks for the explanation, makes sense. If I am not mistaken, I see >>> that >>> you are suggesting to move this logic based on size and granule-size to >>> arm-smmu-v2 driver and one more thing below.. >> >> Simpler than that - following on from my original comment above, >> tlb_flush_walk already knows it's invalidating at least one full level >> of table so there's nothing it even needs to check. Adding a >> size-based heuristic to arm_smmu_inv_range_* for leaf invalidations >> would be a separate concern (note that changing the non-leaf behaviour >> might allow cleaning up the "reg" indirection there too). > > Right, sorry I didn't mean to mention the size check as it was obvious > from your first reply, but rather just calling impl->tlb_inv() in > arm_smmu_tlb_inv_walk_s1(). 
> >> >>>> I'm instinctively a little twitchy about making this a blanket >>>> optimisation for SMMUv2 since I still remember the palaver with our >>>> display and MMU-500 integrations, where it had to implement the dodgy >>>> "prefetch" register to trigger translations before scanning out a >>>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when >>>> freeing an old buffer would be a dangerous hammer to swing. However >>>> IIRC it also had to ensure everything was mapped as 2MB blocks to >>>> guarantee fitting everything in the TLBs in the first place, so I >>>> guess it would still work out OK due to never realistically unmapping >>>> a whole table at once anyway. >>>> >>> >>> You are also hinting to not do this for all SMMUv2 implementations >>> and make >>> it QCOM specific? >> >> No, I'm really just wary that the performance implication is more >> complex than a simple unmap latency benefit, possibly even for QCOM. >> Consider the access latency, power and memory bandwidth hit from all >> the additional pagetable walks incurred by other ongoing traffic >> fighting against those 16 successive TLBIASIDs. Whether it's an >> overall win really depends on the specific workload and system >> conditions as much as the SMMU implementation. > > No, the unmap latency is not just in some test case written, the issue > is very real and we have workloads where camera is reporting frame drops > because of this unmap latency in the order of 100s of milliseconds. > And hardware team recommends using ASID based invalidations for anything > larger than 128 TLB entries. So yes, we have taken note of impacts here > before going this way and hence feel more inclined to make this qcom > specific if required. OK, that's good to know. I never suggested that CPU unmap latency wasn't a valid concern in itself - obviously spending millions of cycles in, say, an interrupt handler doing pointless busy work has some serious downsides - just that it might not always be the most important concern for everyone, so I wanted to make sure this discussion was had in the open. TBH I *am* inclined to make this a core SMMU driver change provided nobody pops up with a strong counter-argument. >> Thinking some more, I >> wonder if the Tegra folks might have an opinion to add here, given >> that their multiple-SMMU solution was seemingly about trying to get >> enough TLB and pagetable walk bandwidth in the first place? >> > > Sure but I do not see how that will help with the unmap latency? It won't. However it implies a use-case which is already sensitive to translation bandwidth, and thus is somewhat more likely to be sensitive to over-invalidation. But even then they also have more to gain from reducing the number of MMIO writes that have to be duplicated :) Thanks, Robin.
Hi Robin, On 2021-06-10 20:59, Robin Murphy wrote: > On 2021-06-10 12:54, Sai Prakash Ranjan wrote: >> Hi Robin, >> >> On 2021-06-10 17:03, Robin Murphy wrote: >>> On 2021-06-10 10:36, Sai Prakash Ranjan wrote: >>>> Hi Robin, >>>> >>>> On 2021-06-10 14:38, Robin Murphy wrote: >>>>> On 2021-06-10 06:24, Sai Prakash Ranjan wrote: >>>>>> Hi Robin, >>>>>> >>>>>> On 2021-06-10 00:14, Robin Murphy wrote: >>>>>>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>>>>>>> Currently for iommu_unmap() of large scatter-gather list with >>>>>>>> page size >>>>>>>> elements, the majority of time is spent in flushing of partial >>>>>>>> walks in >>>>>>>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA >>>>>>>> for >>>>>>>> arm-smmu). >>>>>>>> >>>>>>>> For example: to unmap a 32MB scatter-gather list with page size >>>>>>>> elements >>>>>>>> (8192 entries), there are 16->2MB buffer unmaps based on the >>>>>>>> pgsize (2MB >>>>>>>> for 4K granule) and each of 2MB will further result in 512 >>>>>>>> TLBIVAs (2MB/4K) >>>>>>>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB >>>>>>>> causing a huge >>>>>>>> overhead. >>>>>>>> >>>>>>>> So instead use io_pgtable_tlb_flush_all() to invalidate the >>>>>>>> entire context >>>>>>>> if size (pgsize) is greater than the granule size (4K, 16K, >>>>>>>> 64K). For this >>>>>>>> example of 32MB scatter-gather list unmap, this results in just >>>>>>>> 16 ASID >>>>>>>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in >>>>>>>> case of >>>>>>>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>>>>>>> performance of >>>>>>>> unmaps drastically. >>>>>>>> >>>>>>>> Condition (size > granule size) is chosen for >>>>>>>> io_pgtable_tlb_flush_all() >>>>>>>> because for any granule with supported pgsizes, we will have at >>>>>>>> least 512 >>>>>>>> TLB invalidations for which tlb_flush_all() is already >>>>>>>> recommended. For >>>>>>>> example, take 4K granule with 2MB pgsize, this will result in >>>>>>>> 512 TLBIVA >>>>>>>> in partial walk flush. >>>>>>>> >>>>>>>> Test on QTI SM8150 SoC for 10 iterations of >>>>>>>> iommu_{map_sg}/unmap: >>>>>>>> (average over 10 iterations) >>>>>>>> >>>>>>>> Before this optimization: >>>>>>>> >>>>>>>> size iommu_map_sg iommu_unmap >>>>>>>> 4K 2.067 us 1.854 us >>>>>>>> 64K 9.598 us 8.802 us >>>>>>>> 1M 148.890 us 130.718 us >>>>>>>> 2M 305.864 us 67.291 us >>>>>>>> 12M 1793.604 us 390.838 us >>>>>>>> 16M 2386.848 us 518.187 us >>>>>>>> 24M 3563.296 us 775.989 us >>>>>>>> 32M 4747.171 us 1033.364 us >>>>>>>> >>>>>>>> After this optimization: >>>>>>>> >>>>>>>> size iommu_map_sg iommu_unmap >>>>>>>> 4K 1.723 us 1.765 us >>>>>>>> 64K 9.880 us 8.869 us >>>>>>>> 1M 155.364 us 135.223 us >>>>>>>> 2M 303.906 us 5.385 us >>>>>>>> 12M 1786.557 us 21.250 us >>>>>>>> 16M 2391.890 us 27.437 us >>>>>>>> 24M 3570.895 us 39.937 us >>>>>>>> 32M 4755.234 us 51.797 us >>>>>>>> >>>>>>>> This is further reduced once the map/unmap_pages() support gets >>>>>>>> in which >>>>>>>> will result in just 1 tlb_flush_all() as opposed to 16 >>>>>>>> tlb_flush_all(). 
>>>>>>>> >>>>>>>> Signed-off-by: Sai Prakash Ranjan >>>>>>>> <saiprakash.ranjan@codeaurora.org> >>>>>>>> --- >>>>>>>> drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/iommu/io-pgtable-arm.c >>>>>>>> b/drivers/iommu/io-pgtable-arm.c >>>>>>>> index 87def58e79b5..c3cb9add3179 100644 >>>>>>>> --- a/drivers/iommu/io-pgtable-arm.c >>>>>>>> +++ b/drivers/iommu/io-pgtable-arm.c >>>>>>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>>>>>>> arm_lpae_io_pgtable *data, >>>>>>>> if (!iopte_leaf(pte, lvl, iop->fmt)) { >>>>>>>> /* Also flush any partial walks */ >>>>>>>> - io_pgtable_tlb_flush_walk(iop, iova, size, >>>>>>>> - ARM_LPAE_GRANULE(data)); >>>>>>>> + if (size > ARM_LPAE_GRANULE(data)) >>>>>>>> + io_pgtable_tlb_flush_all(iop); >>>>>>>> + else >>>>>>> >>>>>>> Erm, when will the above condition ever not be true? ;) >>>>>>> >>>>>> >>>>>> Ah right, silly me :) >>>>>> >>>>>>> Taking a step back, though, what about the impact to drivers >>>>>>> other >>>>>>> than SMMUv2? >>>>>> >>>>>> Other drivers would be msm_iommu.c, qcom_iommu.c which does the >>>>>> same >>>>>> thing as arm-smmu-v2 (page based invalidations), then there is >>>>>> ipmmu-vmsa.c >>>>>> which does tlb_flush_all() for flush walk. >>>>>> >>>>>>> In particular I'm thinking of SMMUv3.2 where the whole >>>>>>> range can be invalidated by VA in a single command anyway, so the >>>>>>> additional penalties of TLBIALL are undesirable. >>>>>>> >>>>>> >>>>>> Right, so I am thinking we can have a new generic quirk >>>>>> IO_PGTABLE_QUIRK_RANGE_INV >>>>>> to choose between range based invalidations(tlb_flush_walk) and >>>>>> tlb_flush_all(). >>>>>> In this case of arm-smmu-v3.2, we can tie up >>>>>> ARM_SMMU_FEAT_RANGE_INV with this quirk >>>>>> and have something like below, thoughts? >>>>>> >>>>>> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) >>>>>> io_pgtable_tlb_flush_walk(iop, iova, size, >>>>>> ARM_LPAE_GRANULE(data)); >>>>>> else >>>>>> io_pgtable_tlb_flush_all(iop); >>>>> >>>>> The design here has always been that io-pgtable says *what* needs >>>>> invalidating, and we left it up to the drivers to decide exactly >>>>> *how*. Even though things have evolved a bit I don't think that has >>>>> fundamentally changed - tlb_flush_walk is now only used in this one >>>>> place (technically I suppose it could be renamed tlb_flush_table >>>>> but >>>>> it's not worth the churn), so drivers can implement their own >>>>> preferred table-invalidating behaviour even more easily than >>>>> choosing >>>>> whether to bounce a quirk through the common code or not. Consider >>>>> what you've already seen for the Renesas IPMMU, or SMMUv1 stage >>>>> 2... >>>>> >>>> >>>> Thanks for the explanation, makes sense. If I am not mistaken, I see >>>> that >>>> you are suggesting to move this logic based on size and granule-size >>>> to >>>> arm-smmu-v2 driver and one more thing below.. >>> >>> Simpler than that - following on from my original comment above, >>> tlb_flush_walk already knows it's invalidating at least one full >>> level >>> of table so there's nothing it even needs to check. Adding a >>> size-based heuristic to arm_smmu_inv_range_* for leaf invalidations >>> would be a separate concern (note that changing the non-leaf >>> behaviour >>> might allow cleaning up the "reg" indirection there too). 
>> >> Right, sorry I didn't mean to mention the size check as it was obvious >> from your first reply, but rather just calling impl->tlb_inv() in >> arm_smmu_tlb_inv_walk_s1(). >> >>> >>>>> I'm instinctively a little twitchy about making this a blanket >>>>> optimisation for SMMUv2 since I still remember the palaver with our >>>>> display and MMU-500 integrations, where it had to implement the >>>>> dodgy >>>>> "prefetch" register to trigger translations before scanning out a >>>>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when >>>>> freeing an old buffer would be a dangerous hammer to swing. However >>>>> IIRC it also had to ensure everything was mapped as 2MB blocks to >>>>> guarantee fitting everything in the TLBs in the first place, so I >>>>> guess it would still work out OK due to never realistically >>>>> unmapping >>>>> a whole table at once anyway. >>>>> >>>> >>>> You are also hinting to not do this for all SMMUv2 implementations >>>> and make >>>> it QCOM specific? >>> >>> No, I'm really just wary that the performance implication is more >>> complex than a simple unmap latency benefit, possibly even for QCOM. >>> Consider the access latency, power and memory bandwidth hit from all >>> the additional pagetable walks incurred by other ongoing traffic >>> fighting against those 16 successive TLBIASIDs. Whether it's an >>> overall win really depends on the specific workload and system >>> conditions as much as the SMMU implementation. >> >> No, the unmap latency is not just in some test case written, the issue >> is very real and we have workloads where camera is reporting frame >> drops >> because of this unmap latency in the order of 100s of milliseconds. >> And hardware team recommends using ASID based invalidations for >> anything >> larger than 128 TLB entries. So yes, we have taken note of impacts >> here >> before going this way and hence feel more inclined to make this qcom >> specific if required. > > OK, that's good to know. I never suggested that CPU unmap latency > wasn't a valid concern in itself - obviously spending millions of > cycles in, say, an interrupt handler doing pointless busy work has > some serious downsides - just that it might not always be the most > important concern for everyone, so I wanted to make sure this > discussion was had in the open. > Right, my mistake that I missed to mention these details of real world data in commit text, will add them in next version. > TBH I *am* inclined to make this a core SMMU driver change provided > nobody pops up with a strong counter-argument. > Ok that's even better in case it helps others as well. >>> Thinking some more, I >>> wonder if the Tegra folks might have an opinion to add here, given >>> that their multiple-SMMU solution was seemingly about trying to get >>> enough TLB and pagetable walk bandwidth in the first place? >>> >> >> Sure but I do not see how that will help with the unmap latency? > > It won't. However it implies a use-case which is already sensitive to > translation bandwidth, and thus is somewhat more likely to be > sensitive to over-invalidation. But even then they also have more to > gain from reducing the number of MMIO writes that have to be > duplicated :) > Ah I see, sorry I misunderstood. It's definitely better if this gets tested on other systems as well. Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
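[Editor's note: to make the direction converged on above concrete, here is a minimal sketch of handling this entirely inside the arm-smmu driver's stage-1 walk-flush path. The callback names (arm_smmu_tlb_inv_walk_s1, arm_smmu_tlb_inv_context_s1) are assumed from the arm-smmu driver of this era; this is only the shape of the idea being discussed, not the actual patch.]

/*
 * Sketch only: a tlb_flush_walk covers at least one full level of
 * pagetable (hundreds of entries), so issue a single per-context
 * (ASID) invalidation instead of one TLBIVA per granule.
 */
static void arm_smmu_tlb_inv_walk_s1(unsigned long iova, size_t size,
                                     size_t granule, void *cookie)
{
        arm_smmu_tlb_inv_context_s1(cookie);
}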
> > No, the unmap latency is not just in some test case written, the issue > is very real and we have workloads where camera is reporting frame > drops because of this unmap latency in the order of 100s of milliseconds. > > And hardware team recommends using ASID based invalidations for > > anything larger than 128 TLB entries. So yes, we have taken note of > > impacts here before going this way and hence feel more inclined to > > make this qcom specific if required. Seems like the real issue here is not the unmap API latency. It should be the high number of back-to-back SMMU TLB invalidate register writes that is resulting in lower ISO BW to the camera and overflow, isn't it? Even the Tegra186 SoC has a similar issue, and the HW team recommended rate limiting the number of back-to-back SMMU TLB invalidate register writes. The subsequent Tegra194 SoC has a dedicated SMMU for ISO clients to avoid the impact of TLB invalidates from NISO clients on ISO BW. >> Thinking some more, I >> wonder if the Tegra folks might have an opinion to add here, given >> that their multiple-SMMU solution was seemingly about trying to get >> enough TLB and pagetable walk bandwidth in the first place? While it is good to reduce the number of TLB register writes, arbitrarily flushing all TLB entries at context granularity can have a negative impact on active traffic and BW. I don't have much data on the possible impact at this point. Can the flushing at context granularity be made a quirk rather than the default behaviour? -KR
Hi Krishna, On 2021-06-11 06:07, Krishna Reddy wrote: >> > No, the unmap latency is not just in some test case written, the issue >> > is very real and we have workloads where camera is reporting frame >> > drops because of this unmap latency in the order of 100s of milliseconds. >> > And hardware team recommends using ASID based invalidations for >> > anything larger than 128 TLB entries. So yes, we have taken note of >> > impacts here before going this way and hence feel more inclined to >> > make this qcom specific if required. > > Seems like the real issue here is not the unmap API latency. > It should be the high number of back to back SMMU TLB invalidate > register writes that is resulting > in lower ISO BW to Camera and overflow. Isn't it? > Even Tegra186 SoC has similar issue and HW team recommended to rate > limit the number of > back to back SMMU tlb invalidate registers writes. The subsequent > Tegra194 SoC has a dedicated SMMU for > ISO clients to avoid the impact of TLB invalidates from NISO clients on > ISO BW. > Not exactly, this issue is not specific to camera. If you look at the numbers in the commit text, even for the test device it's the same observation. It depends on the buffer size we are unmapping, which affects the number of TLBIs issued. I am not aware of any such HW-side BW issues for camera specifically on QCOM devices. Thanks, Sai >>> Thinking some more, I >>> wonder if the Tegra folks might have an opinion to add here, given >>> that their multiple-SMMU solution was seemingly about trying to get >>> enough TLB and pagetable walk bandwidth in the first place? > > While it is good to reduce the number of tlb register writes, Flushing > all TLB entries at context granularity arbitrarily > can have negative impact on active traffic and BW. I don't have much > data on possible impact at this point. > Can the flushing at context granularity be made a quirk than > performing it as default? > > -KR -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Hi Sai, > >> > No, the unmap latency is not just in some test case written, the > >> > issue is very real and we have workloads where camera is reporting > >> > frame drops because of this unmap latency in the order of 100s of > milliseconds. > Not exactly, this issue is not specific to camera. If you look at the numbers in the > commit text, even for the test device its the same observation. It depends on > the buffer size we are unmapping which affects the number of TLBIs issue. I am > not aware of any such HW side bw issues for camera specifically on QCOM > devices. It is clear that reducing the number of TLBIs reduces the unmap API latency. But it comes at the expense of throwing away valid TLB entries. Quantifying the impact of arbitrary invalidation of valid TLB entries at context level is not straightforward and is use-case dependent. The side effects might be rare or won't be known until they are noticed. Can you provide more details on how the unmap latency is causing the camera to drop frames? Is unmap performed in the perf path? If unmap is queued and performed on a background thread, would it resolve the frame drops? -KR
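[Editor's note: for reference, the "queue the unmap and perform it on a background thread" idea above could be sketched with a standard workqueue, roughly as below. The structure and function names are hypothetical and only illustrate the suggestion; a real implementation would also have to guarantee the IOVA is not reused until the deferred unmap completes.]

/* Hypothetical illustration of deferring unmaps to a workqueue. */
struct deferred_unmap {
        struct work_struct work;
        struct iommu_domain *domain;
        unsigned long iova;
        size_t size;
};

static void deferred_unmap_fn(struct work_struct *work)
{
        struct deferred_unmap *d = container_of(work, struct deferred_unmap, work);

        /* The TLBI-heavy part now runs off the caller's critical path. */
        iommu_unmap(d->domain, d->iova, d->size);
        kfree(d);
}

static int queue_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
{
        struct deferred_unmap *d = kzalloc(sizeof(*d), GFP_KERNEL);

        if (!d)
                return -ENOMEM;
        d->domain = domain;
        d->iova = iova;
        d->size = size;
        INIT_WORK(&d->work, deferred_unmap_fn);
        schedule_work(&d->work);
        return 0;   /* caller returns immediately */
}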
Hi Krishna, On 2021-06-11 22:19, Krishna Reddy wrote: > Hi Sai, >> >> > No, the unmap latency is not just in some test case written, the >> >> > issue is very real and we have workloads where camera is reporting >> >> > frame drops because of this unmap latency in the order of 100s of >> milliseconds. > >> Not exactly, this issue is not specific to camera. If you look at the >> numbers in the >> commit text, even for the test device its the same observation. It >> depends on >> the buffer size we are unmapping which affects the number of TLBIs >> issue. I am >> not aware of any such HW side bw issues for camera specifically on >> QCOM >> devices. > > It is clear that reducing number of TLBIs reduces the umap API > latency. But, It is > at the expense of throwing away valid tlb entries. > Quantifying the impact of arbitrary invalidation of valid tlb entries > at context level is not straight forward and > use case dependent. The side-effects might be rare or won't be known > until they are noticed. Right but we won't know until we profile the specific usecases or try them in generic workload to see if they affect the performance. Sure, over invalidation is a concern where multiple buffers can be mapped to same context and the cache is not usable at the time for lookup and such but we don't do it for small buffers and only for large buffers which means thousands of TLB entry mappings in which case TLBIASID is preferred (note: I mentioned the HW team recommendation to use it for anything greater than 128 TLB entries) in my earlier reply. And also note that we do this only for partial walk flush, we are not arbitrarily changing all the TLBIs to ASID based. > Can you provide more details on How the unmap latency is causing > camera to drop frames? > Is unmap performed in the perf path? I am no camera expert but from what the camera team mentioned is that there is a thread which frees memory(large unused memory buffers) periodically which ends up taking around 100+ms and causing some camera test failures with frame drops. Parallel efforts are already being made to optimize this usage of thread but as I mentioned previously, this is *not a camera specific*, lets say someone else invokes such large unmaps, it's going to face the same issue. > If unmap is queued and performed on a back ground thread, would it > resolve the frame drops? Not sure I understand what you mean by queuing on background thread but with that or not, we still do the same number of TLBIs and hop through iommu->io-pgtable->arm-smmu to perform the the unmap, so how will that help? Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
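[Editor's note: the "more than 128 TLB entries" guidance quoted above is essentially a size-based heuristic for leaf invalidations. The sketch below is purely illustrative; the threshold name and helper functions are hypothetical, not existing arm-smmu code.]

/*
 * Hypothetical heuristic: below the threshold, invalidate precisely by
 * VA; above it, pay for one context-wide (ASID) invalidation instead.
 */
#define TLBI_RANGE_THRESHOLD    128     /* illustrative value from the discussion */

static void tlb_inv_range_leaf(unsigned long iova, size_t size,
                               size_t granule, void *cookie)
{
        if (size > (size_t)TLBI_RANGE_THRESHOLD * granule) {
                tlb_inv_context_asid(cookie);           /* one TLBIASID */
                return;
        }

        for (; size; size -= granule, iova += granule)
                tlb_inv_va(iova, cookie);               /* one TLBIVA per page */
}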
> Right but we won't know until we profile the specific usecases or try them in > generic workload to see if they affect the performance. Sure, over invalidation is > a concern where multiple buffers can be mapped to same context and the cache > is not usable at the time for lookup and such but we don't do it for small buffers > and only for large buffers which means thousands of TLB entry mappings in > which case TLBIASID is preferred (note: I mentioned the HW team > recommendation to use it for anything greater than 128 TLB entries) in my > earlier reply. And also note that we do this only for partial walk flush, we are not > arbitrarily changing all the TLBIs to ASID based. Most of the heavy bw use cases does involve processing larger buffers. When the physical memory is allocated dis-contiguously at page_size (let's use 4KB here) granularity, each aligned 2MB chunks IOVA unmap would involve performing a TLBIASID as 2MB is not a leaf. Essentially, It happens all the time during large buffer unmaps and potentially impact active traffic on other large buffers. Depending on how much latency HW engines can absorb, the overflow/underflow issues for ISO engines can be sporadic and vendor specific. Performing TLBIASID as default for all SoCs is not a safe operation. > I am no camera expert but from what the camera team mentioned is that there > is a thread which frees memory(large unused memory buffers) periodically which > ends up taking around 100+ms and causing some camera test failures with > frame drops. Parallel efforts are already being made to optimize this usage of > thread but as I mentioned previously, this is *not a camera specific*, lets say > someone else invokes such large unmaps, it's going to face the same issue. From the above, It doesn't look like the root cause of frame drops is fully understood. Why is 100+ms delay causing camera frame drop? Is the same thread submitting the buffers to camera after unmap is complete? If not, how is the unmap latency causing issue here? > > If unmap is queued and performed on a back ground thread, would it > > resolve the frame drops? > > Not sure I understand what you mean by queuing on background thread but with > that or not, we still do the same number of TLBIs and hop through > iommu->io-pgtable->arm-smmu to perform the the unmap, so how will that > help? I mean adding the unmap requests into a queue and processing them from a different thread. It is not to reduce the TLBIs. But, not to block subsequent buffer allocation, IOVA map requests, if they are being requested from same thread that is performing unmap. If unmap is already performed from a different thread, then the issue still need to be root caused to understand it fully. Check for any serialization issues. -KR
Hi Krishna, On 2021-06-14 23:18, Krishna Reddy wrote: >> Right but we won't know until we profile the specific usecases or try >> them in >> generic workload to see if they affect the performance. Sure, over >> invalidation is >> a concern where multiple buffers can be mapped to same context and the >> cache >> is not usable at the time for lookup and such but we don't do it for >> small buffers >> and only for large buffers which means thousands of TLB entry mappings >> in >> which case TLBIASID is preferred (note: I mentioned the HW team >> recommendation to use it for anything greater than 128 TLB entries) in >> my >> earlier reply. And also note that we do this only for partial walk >> flush, we are not >> arbitrarily changing all the TLBIs to ASID based. > > Most of the heavy bw use cases does involve processing larger buffers. > When the physical memory is allocated dis-contiguously at page_size > (let's use 4KB here) > granularity, each aligned 2MB chunks IOVA unmap would involve > performing a TLBIASID > as 2MB is not a leaf. Essentially, It happens all the time during > large buffer unmaps and > potentially impact active traffic on other large buffers. Depending on > how much > latency HW engines can absorb, the overflow/underflow issues for ISO > engines can be > sporadic and vendor specific. > Performing TLBIASID as default for all SoCs is not a safe operation. > Ok so from what I gather from this is that its not easy to test for the negative impact and you don't have data on such yet and the behaviour is very vendor specific. To add on qcom impl, we have several performance improvements for TLB cache invalidations in HW like wait-for-safe(for realtime clients such as camera and display) and few others to allow for cache lookups/updates when TLBI is in progress for the same context bank, so atleast we are good here. > >> I am no camera expert but from what the camera team mentioned is that >> there >> is a thread which frees memory(large unused memory buffers) >> periodically which >> ends up taking around 100+ms and causing some camera test failures >> with >> frame drops. Parallel efforts are already being made to optimize this >> usage of >> thread but as I mentioned previously, this is *not a camera specific*, >> lets say >> someone else invokes such large unmaps, it's going to face the same >> issue. > > From the above, It doesn't look like the root cause of frame drops is > fully understood. > Why is 100+ms delay causing camera frame drop? Is the same thread > submitting the buffers > to camera after unmap is complete? If not, how is the unmap latency > causing issue here? > Ok since you are interested in camera usecase, I have requested for more details from the camera team and will give it once they comeback. However I don't think its good to have unmap latency at all and that is being addressed by this patch. > >> > If unmap is queued and performed on a back ground thread, would it >> > resolve the frame drops? >> >> Not sure I understand what you mean by queuing on background thread >> but with >> that or not, we still do the same number of TLBIs and hop through >> iommu->io-pgtable->arm-smmu to perform the the unmap, so how will that >> help? > > I mean adding the unmap requests into a queue and processing them from > a different thread. > It is not to reduce the TLBIs. But, not to block subsequent buffer > allocation, IOVA map requests, if they > are being requested from same thread that is performing unmap. 
If > unmap is already performed from > a different thread, then the issue still need to be root caused to > understand it fully. Check for any > serialization issues. > This patch is to optimize unmap latency caused by the large number of MMIO writes (TLBIVAs) wasting CPU cycles, not to fix the camera issue, which can probably be solved by parallelization. It seems to me that you are OK with the unmap latency in general, which we are not, and we want to avoid that latency. Hi @Robin, from these discussions it seems they are not OK with making this change for all SoC vendor implementations and do not have any data on such an impact. As I mentioned above, on QCOM platforms we do have several optimizations in HW for TLBIs and would like to make use of them to reduce the unmap latency. What do you think, should this be made implementation specific? Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
On 2021-06-15 12:51, Sai Prakash Ranjan wrote: > Hi Krishna, > > On 2021-06-14 23:18, Krishna Reddy wrote: >>> Right but we won't know until we profile the specific usecases or try >>> them in >>> generic workload to see if they affect the performance. Sure, over >>> invalidation is >>> a concern where multiple buffers can be mapped to same context and >>> the cache >>> is not usable at the time for lookup and such but we don't do it for >>> small buffers >>> and only for large buffers which means thousands of TLB entry >>> mappings in >>> which case TLBIASID is preferred (note: I mentioned the HW team >>> recommendation to use it for anything greater than 128 TLB entries) >>> in my >>> earlier reply. And also note that we do this only for partial walk >>> flush, we are not >>> arbitrarily changing all the TLBIs to ASID based. >> >> Most of the heavy bw use cases does involve processing larger buffers. >> When the physical memory is allocated dis-contiguously at page_size >> (let's use 4KB here) >> granularity, each aligned 2MB chunks IOVA unmap would involve >> performing a TLBIASID >> as 2MB is not a leaf. Essentially, It happens all the time during >> large buffer unmaps and >> potentially impact active traffic on other large buffers. Depending on >> how much >> latency HW engines can absorb, the overflow/underflow issues for ISO >> engines can be >> sporadic and vendor specific. >> Performing TLBIASID as default for all SoCs is not a safe operation. >> > > Ok so from what I gather from this is that its not easy to test for the > negative impact and you don't have data on such yet and the behaviour is > very vendor specific. To add on qcom impl, we have several performance > improvements for TLB cache invalidations in HW like wait-for-safe(for > realtime > clients such as camera and display) and few others to allow for cache > lookups/updates when TLBI is in progress for the same context bank, so > atleast > we are good here. > >> >>> I am no camera expert but from what the camera team mentioned is that >>> there >>> is a thread which frees memory(large unused memory buffers) >>> periodically which >>> ends up taking around 100+ms and causing some camera test failures with >>> frame drops. Parallel efforts are already being made to optimize this >>> usage of >>> thread but as I mentioned previously, this is *not a camera >>> specific*, lets say >>> someone else invokes such large unmaps, it's going to face the same >>> issue. >> >> From the above, It doesn't look like the root cause of frame drops is >> fully understood. >> Why is 100+ms delay causing camera frame drop? Is the same thread >> submitting the buffers >> to camera after unmap is complete? If not, how is the unmap latency >> causing issue here? >> > > Ok since you are interested in camera usecase, I have requested for more > details > from the camera team and will give it once they comeback. However I > don't think > its good to have unmap latency at all and that is being addressed by > this patch. > >> >>> > If unmap is queued and performed on a back ground thread, would it >>> > resolve the frame drops? >>> >>> Not sure I understand what you mean by queuing on background thread >>> but with >>> that or not, we still do the same number of TLBIs and hop through >>> iommu->io-pgtable->arm-smmu to perform the the unmap, so how will that >>> help? >> >> I mean adding the unmap requests into a queue and processing them from >> a different thread. >> It is not to reduce the TLBIs. 
But, not to block subsequent buffer >> allocation, IOVA map requests, if they >> are being requested from same thread that is performing unmap. If >> unmap is already performed from >> a different thread, then the issue still need to be root caused to >> understand it fully. Check for any >> serialization issues. >> > > This patch is to optimize unmap latency because of large number of mmio > writes(TLBIVAs) > wasting CPU cycles and not to fix camera issue which can probably be > solved by > parallelization. It seems to me like you are ok with the unmap latency > in general > which we are not and want to avoid that latency. > > Hi @Robin, from these discussions it seems they are not ok with the change > for all SoC vendor implementations and do not have any data on such impact. > As I mentioned above, on QCOM platforms we do have several optimizations > in HW > for TLBIs and would like to make use of it and reduce the unmap latency. > What do you think, should this be made implementation specific? Yes, it sounds like there's enough uncertainty for now that this needs to be an opt-in feature. However, I still think that non-strict mode could use it generically, since that's all about over-invalidating to save time on individual unmaps - and relatively non-deterministic - already. So maybe we have a second set of iommu_flush_ops, or just a flag somewhere to control the tlb_flush_walk functions internally, and the choice can be made in the iommu_get_dma_strict() test, but also forced on all the time by your init_context hook. What do you reckon? Robin.
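[Editor's note: a rough sketch of the "second set of iommu_flush_ops" option mentioned above, assuming the iommu_flush_ops layout of that kernel (tlb_flush_all/tlb_flush_walk/tlb_add_page). The smmu_tlb_inv_* names are placeholders; which set a domain gets could be decided by the iommu_get_dma_strict() test or forced by an implementation's init_context hook, as discussed.]

/* Two flush-op sets differing only in how table-level flushes are done. */
static const struct iommu_flush_ops smmu_s1_tlb_ops_precise = {
        .tlb_flush_all  = smmu_tlb_inv_context_s1,
        .tlb_flush_walk = smmu_tlb_inv_walk_s1,         /* per-VA TLBIVA loop */
        .tlb_add_page   = smmu_tlb_add_page_s1,
};

static const struct iommu_flush_ops smmu_s1_tlb_ops_coarse = {
        .tlb_flush_all  = smmu_tlb_inv_context_s1,
        .tlb_flush_walk = smmu_tlb_inv_walk_asid_s1,    /* single TLBIASID */
        .tlb_add_page   = smmu_tlb_add_page_s1,
};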
Hi Robin, On 2021-06-15 19:23, Robin Murphy wrote: > On 2021-06-15 12:51, Sai Prakash Ranjan wrote: <snip>... >> Hi @Robin, from these discussions it seems they are not ok with the >> change >> for all SoC vendor implementations and do not have any data on such >> impact. >> As I mentioned above, on QCOM platforms we do have several >> optimizations in HW >> for TLBIs and would like to make use of it and reduce the unmap >> latency. >> What do you think, should this be made implementation specific? > > Yes, it sounds like there's enough uncertainty for now that this needs > to be an opt-in feature. However, I still think that non-strict mode > could use it generically, since that's all about over-invalidating to > save time on individual unmaps - and relatively non-deterministic - > already. > > So maybe we have a second set of iommu_flush_ops, or just a flag > somewhere to control the tlb_flush_walk functions internally, and the > choice can be made in the iommu_get_dma_strict() test, but also forced > on all the time by your init_context hook. What do you reckon? > Sounds good to me. Since you mentioned non-strict mode using it generically, can't we just set tlb_flush_all() in io_pgtable_tlb_flush_walk() like below based on quirk so that we don't need to add any check in iommu_get_dma_strict() and just force the new flush_ops in init_context hook? if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) { iop->cfg.tlb->tlb_flush_all(iop->cookie); return; } Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
On 2021-06-16 12:28, Sai Prakash Ranjan wrote: > Hi Robin, > > On 2021-06-15 19:23, Robin Murphy wrote: >> On 2021-06-15 12:51, Sai Prakash Ranjan wrote: > > <snip>... > >>> Hi @Robin, from these discussions it seems they are not ok with the >>> change >>> for all SoC vendor implementations and do not have any data on such >>> impact. >>> As I mentioned above, on QCOM platforms we do have several >>> optimizations in HW >>> for TLBIs and would like to make use of it and reduce the unmap >>> latency. >>> What do you think, should this be made implementation specific? >> >> Yes, it sounds like there's enough uncertainty for now that this needs >> to be an opt-in feature. However, I still think that non-strict mode >> could use it generically, since that's all about over-invalidating to >> save time on individual unmaps - and relatively non-deterministic - >> already. >> >> So maybe we have a second set of iommu_flush_ops, or just a flag >> somewhere to control the tlb_flush_walk functions internally, and the >> choice can be made in the iommu_get_dma_strict() test, but also forced >> on all the time by your init_context hook. What do you reckon? >> > > Sounds good to me. Since you mentioned non-strict mode using it > generically, > can't we just set tlb_flush_all() in io_pgtable_tlb_flush_walk() like > below > based on quirk so that we don't need to add any check in > iommu_get_dma_strict() > and just force the new flush_ops in init_context hook? > > if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) { > iop->cfg.tlb->tlb_flush_all(iop->cookie); > return; > } > Instead of flush_ops in init_context hook, perhaps a io_pgtable quirk since this is related to tlb, probably a bad name but IO_PGTABLE_QUIRK_TLB_INV which will be set in init_context impl hook and the prev condition in io_pgtable_tlb_flush_walk() becomes something like below. Seems very minimal and neat instead of poking into tlb_flush_walk functions or touching dma strict with some flag? if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT || iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) { iop->cfg.tlb->tlb_flush_all(iop->cookie); return; } Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
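[Editor's note: putting the pieces of this proposal together, the io-pgtable side might look roughly like the sketch below. IO_PGTABLE_QUIRK_TLB_INV is the name floated above (renamed later in the thread) and its bit position is an assumption; the helper follows the shape of the existing inline in include/linux/io-pgtable.h, and the init_context hook signature is assumed from the arm-smmu impl interface of that era.]

/* Hypothetical new quirk bit; the value is illustrative only. */
#define IO_PGTABLE_QUIRK_TLB_INV        BIT(7)

static inline void
io_pgtable_tlb_flush_walk(struct io_pgtable *iop, unsigned long iova,
                          size_t size, size_t granule)
{
        if (iop->cfg.quirks & (IO_PGTABLE_QUIRK_NON_STRICT |
                               IO_PGTABLE_QUIRK_TLB_INV)) {
                iop->cfg.tlb->tlb_flush_all(iop->cookie);
                return;
        }

        if (iop->cfg.tlb && iop->cfg.tlb->tlb_flush_walk)
                iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
}

/* ...and an implementation's init_context hook could force it on: */
static int qcom_smmu_init_context(struct arm_smmu_domain *smmu_domain,
                                  struct io_pgtable_cfg *pgtbl_cfg,
                                  struct device *dev)
{
        pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_TLB_INV;
        return 0;
}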
> Instead of flush_ops in init_context hook, perhaps a io_pgtable quirk since this is > related to tlb, probably a bad name but IO_PGTABLE_QUIRK_TLB_INV which will > be set in init_context impl hook and the prev condition in > io_pgtable_tlb_flush_walk() > becomes something like below. Seems very minimal and neat instead of poking > into tlb_flush_walk functions or touching dma strict with some flag? > > if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT || > iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) { > iop->cfg.tlb->tlb_flush_all(iop->cookie); > return; > } Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID? -KR
Hi Krishna, On 2021-06-18 02:48, Krishna Reddy wrote: >> Instead of flush_ops in init_context hook, perhaps a io_pgtable quirk >> since this is >> related to tlb, probably a bad name but IO_PGTABLE_QUIRK_TLB_INV which >> will >> be set in init_context impl hook and the prev condition in >> io_pgtable_tlb_flush_walk() >> becomes something like below. Seems very minimal and neat instead of >> poking >> into tlb_flush_walk functions or touching dma strict with some flag? >> >> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT || >> iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) { >> iop->cfg.tlb->tlb_flush_all(iop->cookie); >> return; >> } > > Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or > IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID? > tlb_flush_all() callback implementations can use TLBIALL or TLBIASID. so having ASID in the quirk name doesn't sound right given this quirk should be generic enough to be usable on other implementations as well. Instead I will go with IO_PGTABLE_QUIRK_TLB_INV_ALL and will be happy to change if others have some other preference. Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
On 2021-06-15 17:21, Sai Prakash Ranjan wrote: > Hi Krishna, > > On 2021-06-14 23:18, Krishna Reddy wrote: >>> Right but we won't know until we profile the specific usecases or try them in >>> generic workload to see if they affect the performance. Sure, over invalidation is >>> a concern where multiple buffers can be mapped to same context and the cache >>> is not usable at the time for lookup and such but we don't do it for small buffers >>> and only for large buffers which means thousands of TLB entry mappings in >>> which case TLBIASID is preferred (note: I mentioned the HW team >>> recommendation to use it for anything greater than 128 TLB entries) in my >>> earlier reply. And also note that we do this only for partial walk flush, we are not >>> arbitrarily changing all the TLBIs to ASID based. >> >> Most of the heavy bw use cases does involve processing larger buffers. >> When the physical memory is allocated dis-contiguously at page_size >> (let's use 4KB here) >> granularity, each aligned 2MB chunks IOVA unmap would involve >> performing a TLBIASID >> as 2MB is not a leaf. Essentially, It happens all the time during >> large buffer unmaps and >> potentially impact active traffic on other large buffers. Depending on how much >> latency HW engines can absorb, the overflow/underflow issues for ISO >> engines can be >> sporadic and vendor specific. >> Performing TLBIASID as default for all SoCs is not a safe operation. >> > > Ok so from what I gather from this is that its not easy to test for the > negative impact and you don't have data on such yet and the behaviour is > very vendor specific. To add on qcom impl, we have several performance > improvements for TLB cache invalidations in HW like wait-for-safe(for realtime > clients such as camera and display) and few others to allow for cache > lookups/updates when TLBI is in progress for the same context bank, so atleast > we are good here. > >> >>> I am no camera expert but from what the camera team mentioned is that there >>> is a thread which frees memory(large unused memory buffers) periodically which >>> ends up taking around 100+ms and causing some camera test failures with >>> frame drops. Parallel efforts are already being made to optimize this usage of >>> thread but as I mentioned previously, this is *not a camera specific*, lets say >>> someone else invokes such large unmaps, it's going to face the same issue. >> >> From the above, It doesn't look like the root cause of frame drops is >> fully understood. >> Why is 100+ms delay causing camera frame drop? Is the same thread >> submitting the buffers >> to camera after unmap is complete? If not, how is the unmap latency >> causing issue here? >> > > Ok since you are interested in camera usecase, I have requested for more details > from the camera team and will give it once they comeback. However I don't think > its good to have unmap latency at all and that is being addressed by this patch. > As promised, here are some more details shared by camera team: Mapping of a framework buffer happens at the time of process request and unmapping of a framework buffer happens once the buffer is available from hardware and result will be notified to camera framework. * When there is a delay in unmapping of a buffer, result notification to framework will be delayed and based on pipeline delay depth, new requests from framework will be delayed. * Camera stack uses internal buffer managers for internal and framework buffers. 
These managers are accessed during both mapping and unmapping under a common lock, so the calls are blocking. A delay in unmapping therefore delays the mapping of a new request and leads to frame drops. Map and unmap happen in the camera service process context; there is no separate perf path for unmapping. In the camera stack, in addition to the map/unmap delay there are further delays due to HW, so HW must receive requests from SW in time to avoid frame drops. Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
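[Editor's note: purely as an illustration of the serialization described by the camera team, the problem boils down to map and unmap sharing one lock, so a slow unmap directly stalls the next map. None of these names correspond to real camera code; the IOMMU API calls are shown with the signatures of that kernel era.]

/* Hypothetical buffer manager: map and unmap share one lock. */
struct buf_mgr {
        struct mutex lock;
        struct iommu_domain *domain;
        /* ... buffer tracking ... */
};

static void buf_mgr_unmap(struct buf_mgr *m, unsigned long iova, size_t size)
{
        mutex_lock(&m->lock);
        iommu_unmap(m->domain, iova, size);     /* 100+ ms here blocks everyone */
        mutex_unlock(&m->lock);
}

static int buf_mgr_map(struct buf_mgr *m, unsigned long iova,
                       phys_addr_t pa, size_t size, int prot)
{
        int ret;

        /* Stuck behind any in-flight unmap, delaying the next frame. */
        mutex_lock(&m->lock);
        ret = iommu_map(m->domain, iova, pa, size, prot);
        mutex_unlock(&m->lock);
        return ret;
}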
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 87def58e79b5..c3cb9add3179 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data, if (!iopte_leaf(pte, lvl, iop->fmt)) { /* Also flush any partial walks */ - io_pgtable_tlb_flush_walk(iop, iova, size, - ARM_LPAE_GRANULE(data)); + if (size > ARM_LPAE_GRANULE(data)) + io_pgtable_tlb_flush_all(iop); + else + io_pgtable_tlb_flush_walk(iop, iova, size, + ARM_LPAE_GRANULE(data)); ptep = iopte_deref(pte, data); __arm_lpae_free_pgtable(data, lvl + 1, ptep); } else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
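[Editor's note: per Robin's first comment, the size check in this hunk is redundant: the non-leaf path only runs when the unmap covers at least one full level of table, so size is always greater than the granule there. A sketch of the hunk with that observation applied is below; whether it remains unconditional or becomes opt-in depends on the mechanism debated above, so this is illustrative only.]

	if (!iopte_leaf(pte, lvl, iop->fmt)) {
		/*
		 * The whole next-level table is going away, so a single
		 * context-wide invalidation covers it in one operation.
		 */
		io_pgtable_tlb_flush_all(iop);
		ptep = iopte_deref(pte, data);
		__arm_lpae_free_pgtable(data, lvl + 1, ptep);
	} else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {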
Currently for iommu_unmap() of large scatter-gather list with page size elements, the majority of time is spent in flushing of partial walks in __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for arm-smmu). For example: to unmap a 32MB scatter-gather list with page size elements (8192 entries), there are 16->2MB buffer unmaps based on the pgsize (2MB for 4K granule) and each of 2MB will further result in 512 TLBIVAs (2MB/4K) resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a huge overhead. So instead use io_pgtable_tlb_flush_all() to invalidate the entire context if size (pgsize) is greater than the granule size (4K, 16K, 64K). For this example of 32MB scatter-gather list unmap, this results in just 16 ASID based TLB invalidations or tlb_flush_all() callback (TLBIASID in case of arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the performance of unmaps drastically. Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all() because for any granule with supported pgsizes, we will have at least 512 TLB invalidations for which tlb_flush_all() is already recommended. For example, take 4K granule with 2MB pgsize, this will result in 512 TLBIVA in partial walk flush. Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: (average over 10 iterations) Before this optimization: size iommu_map_sg iommu_unmap 4K 2.067 us 1.854 us 64K 9.598 us 8.802 us 1M 148.890 us 130.718 us 2M 305.864 us 67.291 us 12M 1793.604 us 390.838 us 16M 2386.848 us 518.187 us 24M 3563.296 us 775.989 us 32M 4747.171 us 1033.364 us After this optimization: size iommu_map_sg iommu_unmap 4K 1.723 us 1.765 us 64K 9.880 us 8.869 us 1M 155.364 us 135.223 us 2M 303.906 us 5.385 us 12M 1786.557 us 21.250 us 16M 2391.890 us 27.437 us 24M 3570.895 us 39.937 us 32M 4755.234 us 51.797 us This is further reduced once the map/unmap_pages() support gets in which will result in just 1 tlb_flush_all() as opposed to 16 tlb_flush_all(). Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> --- drivers/iommu/io-pgtable-arm.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
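[Editor's note: the invalidation counts quoted in the commit message can be reproduced with a little arithmetic. The snippet below is a throwaway userspace illustration of those numbers, not kernel code.]

#include <stdio.h>

int main(void)
{
        const unsigned long sz_4k = 4096, sz_2m = 2 * 1024 * 1024;
        const unsigned long total = 32 * 1024 * 1024;   /* 32MB SG list */

        unsigned long block_unmaps = total / sz_2m;       /* 16 x 2MB unmaps   */
        unsigned long tlbiva_per_block = sz_2m / sz_4k;   /* 512 TLBIVAs each  */

        printf("VA-based:   %lu TLBIVAs\n", block_unmaps * tlbiva_per_block); /* 8192 */
        printf("ASID-based: %lu TLBIASIDs\n", block_unmaps);                  /* 16   */
        return 0;
}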