iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

Message ID 20210609145315.25750-1-saiprakash.ranjan@codeaurora.org
State New
Series iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

Commit Message

Sai Prakash Ranjan June 9, 2021, 2:53 p.m. UTC
Currently, for iommu_unmap() of a large scatter-gather list with
page-size elements, the majority of time is spent flushing partial
walks in __arm_lpae_unmap(), which issues VA-based TLB invalidations
(TLBIVA for arm-smmu).

For example: to unmap a 32MB scatter-gather list with page-size elements
(8192 entries), there are 16 2MB buffer unmaps based on the pgsize (2MB
for a 4K granule), and each 2MB unmap further results in 512 TLBIVAs
(2MB/4K), for a total of 8192 TLBIVAs (512*16) across the 16 2MB chunks,
causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, that results
in just 16 ASID-based TLB invalidations via the tlb_flush_all() callback
(TLBIASID in case of arm-smmu) as opposed to 8192 TLBIVAs, drastically
improving unmap performance.

The condition (size > granule size) is chosen for
io_pgtable_tlb_flush_all() because, for any granule with supported
pgsizes, we will have at least 512 TLB invalidations, for which
tlb_flush_all() is already recommended. For example, take a 4K granule
with a 2MB pgsize: this results in 512 TLBIVAs in a partial walk flush.
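
The arithmetic above can be sanity-checked with a small standalone C
sketch (purely illustrative, not part of the patch; the sizes are the
ones from the example):

    #include <stdio.h>

    int main(void)
    {
            const unsigned long granule = 4096;       /* 4K granule      */
            const unsigned long pgsize  = 2UL << 20;  /* 2MB block size  */
            const unsigned long total   = 32UL << 20; /* 32MB sg-list    */

            unsigned long blocks  = total / pgsize;   /* 16 x 2MB unmaps */
            unsigned long tlbivas = blocks * (pgsize / granule);

            printf("VA-based invalidations (TLBIVA):     %lu\n", tlbivas);
            /* prints 8192 */
            printf("ASID-based invalidations (TLBIASID): %lu\n", blocks);
            /* prints 16 */
            return 0;
    }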

Tested on QTI SM8150 SoC for 10 iterations of iommu_{map_sg,unmap}:
(average over 10 iterations)

Before this optimization:

    size        iommu_map_sg      iommu_unmap
      4K            2.067 us         1.854 us
     64K            9.598 us         8.802 us
      1M          148.890 us       130.718 us
      2M          305.864 us        67.291 us
     12M         1793.604 us       390.838 us
     16M         2386.848 us       518.187 us
     24M         3563.296 us       775.989 us
     32M         4747.171 us      1033.364 us

After this optimization:

    size        iommu_map_sg      iommu_unmap
      4K            1.723 us         1.765 us
     64K            9.880 us         8.869 us
      1M          155.364 us       135.223 us
      2M          303.906 us         5.385 us
     12M         1786.557 us        21.250 us
     16M         2391.890 us        27.437 us
     24M         3570.895 us        39.937 us
     32M         4755.234 us        51.797 us

This will be reduced further once map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all() calls.

Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Robin Murphy June 9, 2021, 6:44 p.m. UTC | #1
On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
> Currently, for iommu_unmap() of a large scatter-gather list with
> page-size elements, the majority of time is spent flushing partial
> walks in __arm_lpae_unmap(), which issues VA-based TLB invalidations
> (TLBIVA for arm-smmu).
> 
> [...]
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 87def58e79b5..c3cb9add3179 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   
>   		if (!iopte_leaf(pte, lvl, iop->fmt)) {
>   			/* Also flush any partial walks */
> -			io_pgtable_tlb_flush_walk(iop, iova, size,
> -						  ARM_LPAE_GRANULE(data));
> +			if (size > ARM_LPAE_GRANULE(data))
> +				io_pgtable_tlb_flush_all(iop);
> +			else

Erm, when will the above condition ever not be true? ;)

Taking a step back, though, what about the impact to drivers other than 
SMMUv2? In particular I'm thinking of SMMUv3.2 where the whole range can 
be invalidated by VA in a single command anyway, so the additional 
penalties of TLBIALL are undesirable.

Robin.

> +				io_pgtable_tlb_flush_walk(iop, iova, size,
> +							  ARM_LPAE_GRANULE(data));
>   			ptep = iopte_deref(pte, data);
>   			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
>   		} else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
>
Sai Prakash Ranjan June 10, 2021, 5:24 a.m. UTC | #2
Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:
> On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
>> [...]
>>
>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>
>>   		if (!iopte_leaf(pte, lvl, iop->fmt)) {
>>   			/* Also flush any partial walks */
>> -			io_pgtable_tlb_flush_walk(iop, iova, size,
>> -						  ARM_LPAE_GRANULE(data));
>> +			if (size > ARM_LPAE_GRANULE(data))
>> +				io_pgtable_tlb_flush_all(iop);
>> +			else
>
> Erm, when will the above condition ever not be true? ;)

Ah right, silly me :)

> Taking a step back, though, what about the impact to drivers other
> than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for the flush walk.

> In particular I'm thinking of SMMUv3.2 where the whole
> range can be invalidated by VA in a single command anyway, so the
> additional penalties of TLBIALL are undesirable.

Right, so I am thinking we can have a new generic quirk,
IO_PGTABLE_QUIRK_RANGE_INV, to choose between range based
invalidations (tlb_flush_walk) and tlb_flush_all(). In the case of
arm-smmu-v3.2, we can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and
have something like below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
         io_pgtable_tlb_flush_walk(iop, iova, size,
                                   ARM_LPAE_GRANULE(data));
else
         io_pgtable_tlb_flush_all(iop);
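
(For illustration: IO_PGTABLE_QUIRK_RANGE_INV is only a name proposed in
this thread, not an existing flag, while ARM_SMMU_FEAT_RANGE_INV is the
existing SMMUv3 feature bit. The driver-side opt-in being described
could look roughly like this at pgtable-config time:)

     /* Hypothetical: arm-smmu-v3 ties the proposed quirk to its
      * existing range-invalidation feature detection. */
     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
             pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_RANGE_INV;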

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of Code Aurora Forum, hosted by The Linux Foundation
Robin Murphy June 10, 2021, 9:08 a.m. UTC | #3
On 2021-06-10 06:24, Sai Prakash Ranjan wrote:
> Hi Robin,
>
> On 2021-06-10 00:14, Robin Murphy wrote:
>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
>
> [...]
>
>> In particular I'm thinking of SMMUv3.2 where the whole
>> range can be invalidated by VA in a single command anyway, so the
>> additional penalties of TLBIALL are undesirable.
>
> Right, so I am thinking we can have a new generic quirk,
> IO_PGTABLE_QUIRK_RANGE_INV, to choose between range based
> invalidations (tlb_flush_walk) and tlb_flush_all(). In the case of
> arm-smmu-v3.2, we can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and
> have something like below, thoughts?
>
> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
>          io_pgtable_tlb_flush_walk(iop, iova, size,
>                                    ARM_LPAE_GRANULE(data));
> else
>          io_pgtable_tlb_flush_all(iop);

The design here has always been that io-pgtable says *what* needs 
invalidating, and we left it up to the drivers to decide exactly *how*. 
Even though things have evolved a bit I don't think that has 
fundamentally changed - tlb_flush_walk is now only used in this one 
place (technically I suppose it could be renamed tlb_flush_table but 
it's not worth the churn), so drivers can implement their own preferred 
table-invalidating behaviour even more easily than choosing whether to 
bounce a quirk through the common code or not. Consider what you've 
already seen for the Renesas IPMMU, or SMMUv1 stage 2...

I'm instinctively a little twitchy about making this a blanket 
optimisation for SMMUv2 since I still remember the palaver with our 
display and MMU-500 integrations, where it had to implement the dodgy 
"prefetch" register to trigger translations before scanning out a frame 
since it couldn't ever afford a TLB miss, thus TLBIALL when freeing an 
old buffer would be a dangerous hammer to swing. However IIRC it also 
had to ensure everything was mapped as 2MB blocks to guarantee fitting 
everything in the TLBs in the first place, so I guess it would still 
work out OK due to never realistically unmapping a whole table at once 
anyway.

Cheers,
Robin.
Sai Prakash Ranjan June 10, 2021, 9:36 a.m. UTC | #4
Hi Robin,

On 2021-06-10 14:38, Robin Murphy wrote:
> On 2021-06-10 06:24, Sai Prakash Ranjan wrote:
>
> [...]
>
>> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
>>          io_pgtable_tlb_flush_walk(iop, iova, size,
>>                                    ARM_LPAE_GRANULE(data));
>> else
>>          io_pgtable_tlb_flush_all(iop);
>
> The design here has always been that io-pgtable says *what* needs
> invalidating, and we left it up to the drivers to decide exactly
> *how*. Even though things have evolved a bit I don't think that has
> fundamentally changed - tlb_flush_walk is now only used in this one
> place (technically I suppose it could be renamed tlb_flush_table but
> it's not worth the churn), so drivers can implement their own
> preferred table-invalidating behaviour even more easily than choosing
> whether to bounce a quirk through the common code or not. Consider
> what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2...

Thanks for the explanation, makes sense. If I am not mistaken, I see
that you are suggesting to move this logic based on size and
granule-size to the arm-smmu-v2 driver, and one more thing below..

> I'm instinctively a little twitchy about making this a blanket
> optimisation for SMMUv2 since I still remember the palaver with our
> display and MMU-500 integrations, where it had to implement the dodgy
> "prefetch" register to trigger translations before scanning out a
> frame since it couldn't ever afford a TLB miss, thus TLBIALL when
> freeing an old buffer would be a dangerous hammer to swing. However
> IIRC it also had to ensure everything was mapped as 2MB blocks to
> guarantee fitting everything in the TLBs in the first place, so I
> guess it would still work out OK due to never realistically unmapping
> a whole table at once anyway.

You are also hinting to not do this for all SMMUv2 implementations and
make it QCOM specific?

If I am wrong in my assumptions here, please let me know; otherwise I
will prepare the patch :)

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of Code Aurora Forum, hosted by The Linux Foundation
Robin Murphy June 10, 2021, 11:33 a.m. UTC | #5
On 2021-06-10 10:36, Sai Prakash Ranjan wrote:
> Hi Robin,
>
> On 2021-06-10 14:38, Robin Murphy wrote:
>
> [...]
>
>> The design here has always been that io-pgtable says *what* needs
>> invalidating, and we left it up to the drivers to decide exactly
>> *how*. [...] Consider what you've already seen for the Renesas
>> IPMMU, or SMMUv1 stage 2...
>
> Thanks for the explanation, makes sense. If I am not mistaken, I see
> that you are suggesting to move this logic based on size and
> granule-size to the arm-smmu-v2 driver, and one more thing below..

Simpler than that - following on from my original comment above, 
tlb_flush_walk already knows it's invalidating at least one full level 
of table so there's nothing it even needs to check. Adding a size-based 
heuristic to arm_smmu_inv_range_* for leaf invalidations would be a 
separate concern (note that changing the non-leaf behaviour might allow 
cleaning up the "reg" indirection there too).
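
(For the leaf-invalidation heuristic mentioned above, a hypothetical
sketch using the existing arm-smmu.c helper names - the 128-entry
threshold is the hardware guidance Sai cites later in this thread, not
an agreed value:)

     static void arm_smmu_tlb_inv_range_s1(unsigned long iova, size_t size,
                                           size_t granule, void *cookie,
                                           int reg)
     {
             if (size > 128 * granule) { /* too many per-page TLBIVA(L)s */
                     arm_smmu_tlb_inv_context_s1(cookie);
                     return;
             }
             /* ... otherwise fall through to the existing per-page
              * invalidation loop using iova, granule and reg ... */
     }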

>> I'm instinctively a little twitchy about making this a blanket
>> optimisation for SMMUv2 since I still remember the palaver with our
>> display and MMU-500 integrations, where it had to implement the dodgy
>> "prefetch" register to trigger translations before scanning out a
>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when
>> freeing an old buffer would be a dangerous hammer to swing. However
>> IIRC it also had to ensure everything was mapped as 2MB blocks to
>> guarantee fitting everything in the TLBs in the first place, so I
>> guess it would still work out OK due to never realistically unmapping
>> a whole table at once anyway.
>
> You are also hinting to not do this for all SMMUv2 implementations
> and make it QCOM specific?

No, I'm really just wary that the performance implication is more 
complex than a simple unmap latency benefit, possibly even for QCOM. 
Consider the access latency, power and memory bandwidth hit from all the 
additional pagetable walks incurred by other ongoing traffic fighting 
against those 16 successive TLBIASIDs. Whether it's an overall win 
really depends on the specific workload and system conditions as much as 
the SMMU implementation. Thinking some more, I wonder if the Tegra folks 
might have an opinion to add here, given that their multiple-SMMU 
solution was seemingly about trying to get enough TLB and pagetable walk 
bandwidth in the first place?

Robin.
Sai Prakash Ranjan June 10, 2021, 11:54 a.m. UTC | #6
Hi Robin,

On 2021-06-10 17:03, Robin Murphy wrote:
> On 2021-06-10 10:36, Sai Prakash Ranjan wrote:
>
> [...]
>
>> Thanks for the explanation, makes sense. If I am not mistaken, I see
>> that you are suggesting to move this logic based on size and
>> granule-size to the arm-smmu-v2 driver, and one more thing below..
>
> Simpler than that - following on from my original comment above,
> tlb_flush_walk already knows it's invalidating at least one full level
> of table so there's nothing it even needs to check. Adding a
> size-based heuristic to arm_smmu_inv_range_* for leaf invalidations
> would be a separate concern (note that changing the non-leaf behaviour
> might allow cleaning up the "reg" indirection there too).


Right, sorry I didn't mean to mention the size check as it was obvious
from your first reply, but rather just calling impl->tlb_inv() in
arm_smmu_tlb_inv_walk_s1().

>>> I'm instinctively a little twitchy about making this a blanket
>>> optimisation for SMMUv2 since I still remember the palaver with our
>>> display and MMU-500 integrations, where it had to implement the dodgy
>>> "prefetch" register to trigger translations before scanning out a
>>> frame since it couldn't ever afford a TLB miss, thus TLBIALL when
>>> freeing an old buffer would be a dangerous hammer to swing. However
>>> IIRC it also had to ensure everything was mapped as 2MB blocks to
>>> guarantee fitting everything in the TLBs in the first place, so I
>>> guess it would still work out OK due to never realistically unmapping
>>> a whole table at once anyway.
>>
>> You are also hinting to not do this for all SMMUv2 implementations
>> and make it QCOM specific?
>
> No, I'm really just wary that the performance implication is more
> complex than a simple unmap latency benefit, possibly even for QCOM.
> Consider the access latency, power and memory bandwidth hit from all
> the additional pagetable walks incurred by other ongoing traffic
> fighting against those 16 successive TLBIASIDs. Whether it's an
> overall win really depends on the specific workload and system
> conditions as much as the SMMU implementation.

No, the unmap latency is not just in some test case we wrote; the issue
is very real and we have workloads where the camera reports frame drops
because of this unmap latency being in the order of 100s of
milliseconds. And the hardware team recommends using ASID based
invalidations for anything larger than 128 TLB entries. So yes, we have
taken note of the impacts here before going this way and hence feel
more inclined to make this qcom specific if required.

> Thinking some more, I wonder if the Tegra folks might have an opinion
> to add here, given that their multiple-SMMU solution was seemingly
> about trying to get enough TLB and pagetable walk bandwidth in the
> first place?

Sure, but I do not see how that will help with the unmap latency?

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of Code Aurora Forum, hosted by The Linux Foundation
Thierry Reding June 10, 2021, 12:03 p.m. UTC | #7
On Thu, Jun 10, 2021 at 12:33:56PM +0100, Robin Murphy wrote:
> On 2021-06-10 10:36, Sai Prakash Ranjan wrote:
>
> [...]
>
> No, I'm really just wary that the performance implication is more complex
> than a simple unmap latency benefit, possibly even for QCOM. Consider the
> access latency, power and memory bandwidth hit from all the additional
> pagetable walks incurred by other ongoing traffic fighting against those 16
> successive TLBIASIDs. Whether it's an overall win really depends on the
> specific workload and system conditions as much as the SMMU implementation.
> Thinking some more, I wonder if the Tegra folks might have an opinion to add
> here, given that their multiple-SMMU solution was seemingly about trying to
> get enough TLB and pagetable walk bandwidth in the first place?


Yes, so Tegra194 has three different instances of the SMMU. Two of them
are programmed in an interleaved mode to basically double the bandwidth
available. A third instance is specifically reserved for isochronous
memory clients and is used by the display controller to avoid potential
pressure on the dual-SMMU from interfering with display functionality.

I'm not sure if we've ever measured the impact of map/unmap operations
under heavy load. Krishna, do you know if this might be helpful for some
of the use-cases we have on Tegra? Or if it might negatively impact
performance under pressure?

Thierry
Robin Murphy June 10, 2021, 3:29 p.m. UTC | #8
On 2021-06-10 12:54, Sai Prakash Ranjan wrote:
> Hi Robin,
>
> On 2021-06-10 17:03, Robin Murphy wrote:
>
> [...]
>
> No, the unmap latency is not just in some test case we wrote; the issue
> is very real and we have workloads where the camera reports frame drops
> because of this unmap latency being in the order of 100s of
> milliseconds. And the hardware team recommends using ASID based
> invalidations for anything larger than 128 TLB entries. So yes, we have
> taken note of the impacts here before going this way and hence feel
> more inclined to make this qcom specific if required.

OK, that's good to know. I never suggested that CPU unmap latency wasn't 
a valid concern in itself - obviously spending millions of cycles in, 
say, an interrupt handler doing pointless busy work has some serious 
downsides - just that it might not always be the most important concern 
for everyone, so I wanted to make sure this discussion was had in the open.

TBH I *am* inclined to make this a core SMMU driver change provided 
nobody pops up with a strong counter-argument.
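
For illustration, a minimal sketch of the kind of core change I mean,
against the current arm-smmu driver (untested, and assuming the existing
context-bank helpers), might be:

/* Sketch: a table walk flush already implies at least one full level
 * of table, so just invalidate the whole ASID rather than range by VA. */
static void arm_smmu_tlb_inv_walk_s1(unsigned long iova, size_t size,
				     size_t granule, void *cookie)
{
	struct arm_smmu_domain *smmu_domain = cookie;
	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;

	/* Ensure PTEs cleared by this CPU are visible before the TLBI */
	wmb();
	arm_smmu_cb_write(smmu_domain->smmu, cfg->cbndx,
			  ARM_SMMU_CB_S1_TLBIASID, cfg->asid);
	arm_smmu_tlb_sync_context(smmu_domain);
}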

>> Thinking some more, I
>> wonder if the Tegra folks might have an opinion to add here, given
>> that their multiple-SMMU solution was seemingly about trying to get
>> enough TLB and pagetable walk bandwidth in the first place?
>>
> 
> Sure but I do not see how that will help with the unmap latency?

It won't. However it implies a use-case which is already sensitive to 
translation bandwidth, and thus is somewhat more likely to be sensitive 
to over-invalidation. But even then they also have more to gain from 
reducing the number of MMIO writes that have to be duplicated :)

Thanks,
Robin.
Sai Prakash Ranjan June 10, 2021, 3:51 p.m. UTC | #9
Hi Robin,

On 2021-06-10 20:59, Robin Murphy wrote:
> <snip>...
> 
> OK, that's good to know. I never suggested that CPU unmap latency
> wasn't a valid concern in itself - obviously spending millions of
> cycles in, say, an interrupt handler doing pointless busy work has
> some serious downsides - just that it might not always be the most
> important concern for everyone, so I wanted to make sure this
> discussion was had in the open.
> 

Right, my mistake that I missed mentioning these details of real-world
data in the commit text; will add them in the next version.

> TBH I *am* inclined to make this a core SMMU driver change provided
> nobody pops up with a strong counter-argument.
> 

OK, that's even better in case it helps others as well.

>>> Thinking some more, I
>>> wonder if the Tegra folks might have an opinion to add here, given
>>> that their multiple-SMMU solution was seemingly about trying to get
>>> enough TLB and pagetable walk bandwidth in the first place?
>>>
>> 
>> Sure but I do not see how that will help with the unmap latency?
> 
> It won't. However it implies a use-case which is already sensitive to
> translation bandwidth, and thus is somewhat more likely to be
> sensitive to over-invalidation. But even then they also have more to
> gain from reducing the number of MMIO writes that have to be
> duplicated :)
> 

Ah I see, sorry I misunderstood. It's definitely better if this gets
tested on other systems as well.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Krishna Reddy June 11, 2021, 12:37 a.m. UTC | #10
> > No, the unmap latency is not just in some test case written, the issue
> > is very real and we have workloads where camera is reporting frame
> > drops because of this unmap latency in the order of 100s of milliseconds.
> > And hardware team recommends using ASID based invalidations for
> > anything larger than 128 TLB entries. So yes, we have taken note of
> > impacts here before going this way and hence feel more inclined to
> > make this qcom specific if required.

Seems like the real issue here is not the unmap API latency.
It should be the high number of back-to-back SMMU TLB invalidate register writes that is resulting
in lower ISO BW to Camera and overflow. Isn't it?
Even Tegra186 SoC has a similar issue and the HW team recommended rate limiting the number of
back-to-back SMMU TLB invalidate register writes. The subsequent Tegra194 SoC has a dedicated SMMU for
ISO clients to avoid the impact of TLB invalidates from NISO clients on ISO BW.

>> Thinking some more, I
>> wonder if the Tegra folks might have an opinion to add here, given
>> that their multiple-SMMU solution was seemingly about trying to get
>> enough TLB and pagetable walk bandwidth in the first place?

While it is good to reduce the number of TLB register writes, flushing all TLB entries at context granularity arbitrarily
can have a negative impact on active traffic and BW. I don't have much data on the possible impact at this point.
Can the flushing at context granularity be made a quirk rather than the default?

-KR
Sai Prakash Ranjan June 11, 2021, 12:54 a.m. UTC | #11
Hi Krishna,

On 2021-06-11 06:07, Krishna Reddy wrote:
>> > <snip>...
> 
> Seems like the real issue here is not the unmap API latency.
> It should be the high number of back-to-back SMMU TLB invalidate
> register writes that is resulting
> in lower ISO BW to Camera and overflow. Isn't it?
> Even Tegra186 SoC has a similar issue and the HW team recommended rate
> limiting the number of
> back-to-back SMMU TLB invalidate register writes. The subsequent
> Tegra194 SoC has a dedicated SMMU for
> ISO clients to avoid the impact of TLB invalidates from NISO clients on
> ISO BW.
> 

Not exactly, this issue is not specific to camera. If you look at
the numbers in the commit text, even for the test device it's the
same observation. It depends on the buffer size we are unmapping,
which affects the number of TLBIs issued. I am not aware of any
such HW-side BW issues for camera specifically on QCOM devices.
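
To put rough numbers on that scaling (an illustrative sketch only;
num_tlbivas() is a hypothetical helper, using the 4K-granule figures
from the commit message and SZ_* constants from <linux/sizes.h>):

/* TLBIVA count for unmapping `size` bytes of page-size mappings,
 * freed in 2MB table-sized chunks with a 4K granule. */
static size_t num_tlbivas(size_t size)
{
	size_t chunks = size / SZ_2M;		/* 32MB -> 16 chunks */

	return chunks * (SZ_2M / SZ_4K);	/* 16 * 512 = 8192 TLBIVAs */
}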

Thanks,
Sai

>>> Thinking some more, I
>>> wonder if the Tegra folks might have an opinion to add here, given
>>> that their multiple-SMMU solution was seemingly about trying to get
>>> enough TLB and pagetable walk bandwidth in the first place?
> 
> While it is good to reduce the number of TLB register writes, flushing
> all TLB entries at context granularity arbitrarily
> can have a negative impact on active traffic and BW. I don't have much
> data on the possible impact at this point.
> Can the flushing at context granularity be made a quirk rather than
> the default?
> 
> -KR

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Krishna Reddy June 11, 2021, 4:49 p.m. UTC | #12
Hi Sai,
> >> > No, the unmap latency is not just in some test case written, the
> >> > issue is very real and we have workloads where camera is reporting
> >> > frame drops because of this unmap latency in the order of 100s of milliseconds.
> 
> Not exactly, this issue is not specific to camera. If you look at the numbers in the
> commit text, even for the test device it's the same observation. It depends on
> the buffer size we are unmapping, which affects the number of TLBIs issued. I am
> not aware of any such HW-side BW issues for camera specifically on QCOM
> devices.

It is clear that reducing the number of TLBIs reduces the unmap API latency. But it is
at the expense of throwing away valid TLB entries.
Quantifying the impact of arbitrary invalidation of valid TLB entries at context level is not straightforward and
is use case dependent. The side-effects might be rare or won't be known until they are noticed.
Can you provide more details on how the unmap latency is causing camera to drop frames?
Is unmap performed in the perf path?
If unmap is queued and performed on a background thread, would it resolve the frame drops?

-KR
Sai Prakash Ranjan June 12, 2021, 2:46 a.m. UTC | #13
Hi Krishna,

On 2021-06-11 22:19, Krishna Reddy wrote:
> Hi Sai,

>> <snip>...
> 
> It is clear that reducing the number of TLBIs reduces the unmap API
> latency. But it is
> at the expense of throwing away valid TLB entries.
> Quantifying the impact of arbitrary invalidation of valid TLB entries
> at context level is not straightforward and
> is use case dependent. The side-effects might be rare or won't be known
> until they are noticed.

Right, but we won't know until we profile the specific use cases or try them
in a generic workload to see if they affect the performance. Sure, over-invalidation
is a concern where multiple buffers can be mapped to the same context
and the cache is not usable at the time for lookup and such, but we don't do it
for small buffers, only for large buffers, which means thousands of TLB entry
mappings, in which case TLBIASID is preferred (note: I mentioned the HW team
recommendation to use it for anything greater than 128 TLB entries) in my earlier
reply. And also note that we do this only for the partial walk flush; we are not
arbitrarily changing all the TLBIs to ASID based.
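
As a sketch of the cutoff being described (prefer_asid_inval() is a
hypothetical helper for illustration, not something in this patch):

/* Follow the HW team guideline: prefer ASID-based invalidation once a
 * range covers more than 128 TLB entries. */
static bool prefer_asid_inval(size_t size, size_t granule)
{
	return (size / granule) > 128;	/* e.g. 2MB / 4K = 512 entries */
}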

> Can you provide more details on how the unmap latency is causing
> camera to drop frames?
> Is unmap performed in the perf path?

I am no camera expert but from what the camera team mentioned,
there is a thread which frees memory (large unused memory buffers)
periodically, which ends up taking around 100+ms and causing some camera test
failures with frame drops. Parallel efforts are already being made to
optimize this usage of the thread, but as I mentioned previously, this is *not
camera specific*; let's say someone else invokes such large unmaps, it's going
to face the same issue.

> If unmap is queued and performed on a background thread, would it
> resolve the frame drops?

Not sure I understand what you mean by queuing on a background thread, but
with that or not, we still do the same number of TLBIs and hop through
iommu->io-pgtable->arm-smmu to perform the unmap, so how will that help?

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Krishna Reddy June 14, 2021, 5:48 p.m. UTC | #14
> Right, but we won't know until we profile the specific use cases or try them in
> a generic workload to see if they affect the performance. Sure, over-invalidation is
> a concern where multiple buffers can be mapped to the same context and the cache
> is not usable at the time for lookup and such, but we don't do it for small buffers,
> only for large buffers, which means thousands of TLB entry mappings, in
> which case TLBIASID is preferred (note: I mentioned the HW team
> recommendation to use it for anything greater than 128 TLB entries) in my
> earlier reply. And also note that we do this only for the partial walk flush; we are not
> arbitrarily changing all the TLBIs to ASID based.

Most of the heavy BW use cases do involve processing larger buffers.
When the physical memory is allocated discontiguously at page_size (let's use 4KB here)
granularity, each aligned 2MB chunk's IOVA unmap would involve performing a TLBIASID,
as 2MB is not a leaf. Essentially, it happens all the time during large buffer unmaps and
potentially impacts active traffic on other large buffers. Depending on how much
latency HW engines can absorb, the overflow/underflow issues for ISO engines can be
sporadic and vendor specific.
Performing TLBIASID as the default for all SoCs is not a safe operation.


> I am no camera expert but from what the camera team mentioned, there
> is a thread which frees memory (large unused memory buffers) periodically, which
> ends up taking around 100+ms and causing some camera test failures with
> frame drops. Parallel efforts are already being made to optimize this usage of
> the thread, but as I mentioned previously, this is *not camera specific*; let's say
> someone else invokes such large unmaps, it's going to face the same issue.

From the above, it doesn't look like the root cause of the frame drops is fully understood.
Why is a 100+ms delay causing camera frame drops? Is the same thread submitting the buffers
to camera after unmap is complete? If not, how is the unmap latency causing an issue here?

> > If unmap is queued and performed on a background thread, would it
> > resolve the frame drops?
> 
> Not sure I understand what you mean by queuing on a background thread, but with
> that or not, we still do the same number of TLBIs and hop through
> iommu->io-pgtable->arm-smmu to perform the unmap, so how will that
> help?

I mean adding the unmap requests into a queue and processing them from a different thread.
It is not to reduce the TLBIs, but to avoid blocking subsequent buffer allocation and IOVA map requests, if they
are being requested from the same thread that is performing unmap. If unmap is already performed from
a different thread, then the issue still needs to be root caused to understand it fully. Check for any
serialization issues.


-KR
Sai Prakash Ranjan June 15, 2021, 11:51 a.m. UTC | #15
Hi Krishna,

On 2021-06-14 23:18, Krishna Reddy wrote:
>> <snip>...
> 
> Most of the heavy BW use cases do involve processing larger buffers.
> When the physical memory is allocated discontiguously at page_size
> (let's use 4KB here)
> granularity, each aligned 2MB chunk's IOVA unmap would involve
> performing a TLBIASID,
> as 2MB is not a leaf. Essentially, it happens all the time during
> large buffer unmaps and
> potentially impacts active traffic on other large buffers. Depending on
> how much
> latency HW engines can absorb, the overflow/underflow issues for ISO
> engines can be
> sporadic and vendor specific.
> Performing TLBIASID as the default for all SoCs is not a safe operation.
> 

OK, so what I gather from this is that it's not easy to test for the
negative impact, you don't have data on such yet, and the behaviour is
very vendor specific. To add, on the qcom impl we have several performance
improvements for TLB cache invalidations in HW, like wait-for-safe (for
realtime clients such as camera and display) and a few others to allow for cache
lookups/updates when a TLBI is in progress for the same context bank, so at least
we are good here.

>> <snip>...
> 
> From the above, it doesn't look like the root cause of the frame drops is
> fully understood.
> Why is a 100+ms delay causing camera frame drops? Is the same thread
> submitting the buffers
> to camera after unmap is complete? If not, how is the unmap latency
> causing an issue here?
> 

OK, since you are interested in the camera use case, I have requested more details
from the camera team and will give them once they come back. However I don't think
it's good to have such unmap latency at all, and that is being addressed by this patch.

>> <snip>...
> 
> I mean adding the unmap requests into a queue and processing them from
> a different thread.
> It is not to reduce the TLBIs, but to avoid blocking subsequent buffer
> allocation and IOVA map requests, if they
> are being requested from the same thread that is performing unmap. If
> unmap is already performed from
> a different thread, then the issue still needs to be root caused to
> understand it fully. Check for any
> serialization issues.
> 

This patch is to optimize the unmap latency caused by the large number of MMIO
writes (TLBIVAs) wasting CPU cycles, and not to fix the camera issue, which can
probably be solved by parallelization. It seems to me like you are OK with the
unmap latency in general, which we are not, and we want to avoid that latency.

Hi @Robin, from these discussions it seems they are not OK with the change
for all SoC vendor implementations and do not have any data on such impact.
As I mentioned above, on QCOM platforms we do have several optimizations in HW
for TLBIs and would like to make use of them and reduce the unmap latency.
What do you think, should this be made implementation specific?

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Robin Murphy June 15, 2021, 1:53 p.m. UTC | #16
On 2021-06-15 12:51, Sai Prakash Ranjan wrote:
> Hi Krishna,
> 
> <snip>...
> 
> Hi @Robin, from these discussions it seems they are not OK with the change
> for all SoC vendor implementations and do not have any data on such impact.
> As I mentioned above, on QCOM platforms we do have several optimizations
> in HW
> for TLBIs and would like to make use of them and reduce the unmap latency.
> What do you think, should this be made implementation specific?


Yes, it sounds like there's enough uncertainty for now that this needs 
to be an opt-in feature. However, I still think that non-strict mode 
could use it generically, since that's all about over-invalidating to 
save time on individual unmaps - and relatively non-deterministic - already.

So maybe we have a second set of iommu_flush_ops, or just a flag 
somewhere to control the tlb_flush_walk functions internally, and the 
choice can be made in the iommu_get_dma_strict() test, but also forced 
on all the time by your init_context hook. What do you reckon?
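
To sketch the init_context side of that (the quirk name is hypothetical,
and I'm assuming the current impl hook signature):

/* Sketch for arm-smmu-qcom.c: unconditionally opt this implementation
 * in to full-context invalidation on table walks, via a hypothetical
 * io-pgtable quirk. */
static int qcom_smmu_init_context(struct arm_smmu_domain *smmu_domain,
				  struct io_pgtable_cfg *pgtbl_cfg,
				  struct device *dev)
{
	pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_TLB_INV_ALL; /* hypothetical */
	return 0;
}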

Robin.
Sai Prakash Ranjan June 16, 2021, 6:58 a.m. UTC | #17
Hi Robin,

On 2021-06-15 19:23, Robin Murphy wrote:
> On 2021-06-15 12:51, Sai Prakash Ranjan wrote:

<snip>...

>> Hi @Robin, from these discussions it seems they are not OK with the
>> change
>> for all SoC vendor implementations and do not have any data on such
>> impact.
>> As I mentioned above, on QCOM platforms we do have several
>> optimizations in HW
>> for TLBIs and would like to make use of them and reduce the unmap
>> latency.
>> What do you think, should this be made implementation specific?
> 
> Yes, it sounds like there's enough uncertainty for now that this needs
> to be an opt-in feature. However, I still think that non-strict mode
> could use it generically, since that's all about over-invalidating to
> save time on individual unmaps - and relatively non-deterministic -
> already.
> 
> So maybe we have a second set of iommu_flush_ops, or just a flag
> somewhere to control the tlb_flush_walk functions internally, and the
> choice can be made in the iommu_get_dma_strict() test, but also forced
> on all the time by your init_context hook. What do you reckon?
> 

Sounds good to me. Since you mentioned non-strict mode using it generically,
can't we just call tlb_flush_all() in io_pgtable_tlb_flush_walk() like below,
based on the quirk, so that we don't need to add any check in iommu_get_dma_strict()
and just force the new flush_ops in the init_context hook?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
	iop->cfg.tlb->tlb_flush_all(iop->cookie);
	return;
}

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Sai Prakash Ranjan June 16, 2021, 9:03 a.m. UTC | #18
On 2021-06-16 12:28, Sai Prakash Ranjan wrote:
> Hi Robin,
> 
> <snip>...
> 
> Sounds good to me. Since you mentioned non-strict mode using it
> generically,
> can't we just call tlb_flush_all() in io_pgtable_tlb_flush_walk() like
> below,
> based on the quirk, so that we don't need to add any check in
> iommu_get_dma_strict()
> and just force the new flush_ops in the init_context hook?
> 
> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
> 	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> 	return;
> }
> 

Instead of flush_ops in the init_context hook, perhaps an io_pgtable quirk, since this
is related to the TLB; probably a bad name, but IO_PGTABLE_QUIRK_TLB_INV, which will be
set in the init_context impl hook, and the previous condition in io_pgtable_tlb_flush_walk()
becomes something like below. Seems very minimal and neat instead of poking into
the tlb_flush_walk functions or touching DMA strict with some flag?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT ||
    iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) {
	iop->cfg.tlb->tlb_flush_all(iop->cookie);
	return;
}
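
In full, the helper in include/linux/io-pgtable.h would then read
something like this (a sketch only; the quirk name is still up for
debate):

static inline void
io_pgtable_tlb_flush_walk(struct io_pgtable *iop, unsigned long iova,
			  size_t size, size_t granule)
{
	/* Over-invalidate: nuke the whole context instead of walking */
	if (iop->cfg.quirks & (IO_PGTABLE_QUIRK_NON_STRICT |
			       IO_PGTABLE_QUIRK_TLB_INV)) {
		io_pgtable_tlb_flush_all(iop);
		return;
	}
	if (iop->cfg.tlb && iop->cfg.tlb->tlb_flush_walk)
		iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
}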

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation
Krishna Reddy June 17, 2021, 9:18 p.m. UTC | #19
> Instead of flush_ops in the init_context hook, perhaps an io_pgtable quirk, since this is
> related to the TLB; probably a bad name, but IO_PGTABLE_QUIRK_TLB_INV, which will
> be set in the init_context impl hook, and the previous condition in
> io_pgtable_tlb_flush_walk()
> becomes something like below. Seems very minimal and neat instead of poking
> into the tlb_flush_walk functions or touching DMA strict with some flag?
> 
> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT ||
>     iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) {
> 	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> 	return;
> }

Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID?

-KR
Sai Prakash Ranjan June 18, 2021, 2:47 a.m. UTC | #20
Hi Krishna,

On 2021-06-18 02:48, Krishna Reddy wrote:
>> <snip>...
> 
> Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or
> IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID?
> 

tlb_flush_all() callback implementations can use TLBIALL or TLBIASID, so
having ASID in the quirk name doesn't sound right, given this quirk should
be generic enough to be usable by other implementations as well.
Instead I will go with IO_PGTABLE_QUIRK_TLB_INV_ALL, and will be happy to
change it if others have some other preference.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Sai Prakash Ranjan June 18, 2021, 4:04 a.m. UTC | #21
On 2021-06-15 17:21, Sai Prakash Ranjan wrote:
> Hi Krishna,
> 
> <snip>...
> 
> OK, since you are interested in the camera use case, I have requested more details
> from the camera team and will give them once they come back. However I don't think
> it's good to have such unmap latency at all, and that is being addressed by this patch.

As promised, here are some more details shared by the camera team:

Mapping of a framework buffer happens at the time of a process request, and unmapping
of a framework buffer happens once the buffer is available from hardware and the result
is notified to the camera framework.
 * When there is a delay in unmapping a buffer, the result notification to the framework
   will be delayed and, based on the pipeline delay depth, new requests from the framework
   will be delayed.
 * The camera stack uses internal buffer managers for internal and framework buffers.
   These managers are accessed while mapping and unmapping, under a common lock,
   and hence it is a blocking call. So an unmapping delay will delay the mapping
   of a new request and lead to frame drops.

Map and unmap happen in the camera service process context. There is no separate perf
path to perform unmapping.

In the camera stack, along with the map/unmap delay, additional delays are due to HW. So HW should
be able to get the requests in time from SW to avoid frame drops.

Thanks,
Sai
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
diff mbox series

Patch

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 
 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else
+				io_pgtable_tlb_flush_walk(iop, iova, size,
+							  ARM_LPAE_GRANULE(data));
 			ptep = iopte_deref(pte, data);
 			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
 		} else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {