
[RFC,1/1] genirq: Make threaded handler use irq affinity for managed interrupt

Message ID 1575642904-58295-2-git-send-email-john.garry@huawei.com
State New
Series Threaded handler uses irq affinity for when the interrupt is managed

Commit Message

John Garry Dec. 6, 2019, 2:35 p.m. UTC
Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single cpu. As such,
the threaded handler would always run on the same cpu as the hard irq.

We have seen scenarios in high data-rate throughput testing that the cpu
handling the interrupt can be totally saturated handling both the hard
interrupt and threaded handler parts, limiting throughput.

For when the interrupt is managed, allow the threaded part to run on all
cpus in the irq affinity mask.

Signed-off-by: John Garry <john.garry@huawei.com>

---
 kernel/irq/manage.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
2.17.1
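
For readers skimming the discussion below: the substance of the patch is which cpumask gets copied into the IRQ thread's allowed mask in irq_thread_check_affinity(). A condensed view of the code with the change applied (the full hunk is quoted in the replies):

    if (cpumask_available(desc->irq_common_data.affinity)) {
        const struct cpumask *m;

        /* Managed interrupt: let the thread roam the whole affinity mask */
        if (irqd_affinity_is_managed(&desc->irq_data))
            m = desc->irq_common_data.affinity;
        else
            /* Otherwise keep following the hard irq's effective affinity */
            m = irq_data_get_effective_affinity_mask(&desc->irq_data);
        cpumask_copy(mask, m);
    } else {
        valid = false;
    }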

Comments

Marc Zyngier Dec. 6, 2019, 3:22 p.m. UTC | #1
Hi John,

On 2019-12-06 14:35, John Garry wrote:
> Currently the cpu allowed mask for the threaded part of a threaded irq
> handler will be set to the effective affinity of the hard irq.
>
> Typically the effective affinity of the hard irq will be for a single
> cpu. As such, the threaded handler would always run on the same cpu as
> the hard irq.
>
> We have seen scenarios in high data-rate throughput testing that the
> cpu handling the interrupt can be totally saturated handling both the
> hard interrupt and threaded handler parts, limiting throughput.
>
> For when the interrupt is managed, allow the threaded part to run on
> all cpus in the irq affinity mask.
>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  kernel/irq/manage.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1753486b440c..8e7f8e758a88 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>  		const struct cpumask *m;
>
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		if (irqd_affinity_is_managed(&desc->irq_data))
> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(
> +					&desc->irq_data);
>  		cpumask_copy(mask, m);
>  	} else {
>  		valid = false;


Although I completely understand that there are cases where you
really want to let your thread roam all CPUs, I feel like changing
this based on a seemingly unrelated property is likely to trigger
yet another whack-a-mole episode. I'd feel much more comfortable
if there was a way to let the IRQ subsystem know about what is best.

Shouldn't the endpoint driver know better about it? Note that
I have no data supporting an approach or the other, hence playing
the role of the village idiot here.
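
[Purely as an illustration of the "let the IRQ subsystem know" idea: a driver-side opt-in could look roughly like the request-time flag below. The flag, handler and device names are hypothetical, not an existing kernel API.]

    /*
     * Hypothetical sketch only: IRQF_THREAD_ROAM_AFFINITY is not a real flag.
     * A driver that knows its threaded handler benefits from running anywhere
     * in the managed affinity mask could opt in explicitly, instead of the
     * behaviour changing for every managed interrupt.
     */
    ret = request_threaded_irq(irq, my_hard_handler, my_thread_fn,
                               IRQF_ONESHOT | IRQF_THREAD_ROAM_AFFINITY,
                               "my-dev", my_dev);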

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 6, 2019, 4:16 p.m. UTC | #2
On 06/12/2019 15:22, Marc Zyngier wrote:

Hi Marc,

>
> On 2019-12-06 14:35, John Garry wrote:
>> Currently the cpu allowed mask for the threaded part of a threaded irq
>> handler will be set to the effective affinity of the hard irq.
>>
>> Typically the effective affinity of the hard irq will be for a single
>> cpu. As such, the threaded handler would always run on the same cpu as
>> the hard irq.
>>
>> We have seen scenarios in high data-rate throughput testing that the cpu
>> handling the interrupt can be totally saturated handling both the hard
>> interrupt and threaded handler parts, limiting throughput.
>>
>> For when the interrupt is managed, allow the threaded part to run on all
>> cpus in the irq affinity mask.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>  kernel/irq/manage.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 1753486b440c..8e7f8e758a88 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>>      if (cpumask_available(desc->irq_common_data.affinity)) {
>>          const struct cpumask *m;
>>
>> -        m = irq_data_get_effective_affinity_mask(&desc->irq_data);
>> +        if (irqd_affinity_is_managed(&desc->irq_data))
>> +            m = desc->irq_common_data.affinity;
>> +        else
>> +            m = irq_data_get_effective_affinity_mask(
>> +                    &desc->irq_data);
>>          cpumask_copy(mask, m);
>>      } else {
>>          valid = false;
>
> Although I completely understand that there are cases where you
> really want to let your thread roam all CPUs, I feel like changing
> this based on a seemingly unrelated property is likely to trigger
> yet another whack-a-mole episode. I'd feel much more comfortable
> if there was a way to let the IRQ subsystem know about what is best.
>
> Shouldn't the endpoint driver know better about it?


I did propose that same idea here:
https://lore.kernel.org/lkml/fd7d6101-37f4-2d34-f2f7-cfeade610278@huawei.com/

And that fits my agenda to get best throughput figures, while not 
possibly affecting others.

But it seems that we could do better to make this a common policy: allow 
the threaded part to roam when that CPU is overloaded, but how...?

> Note that
> I have no data supporting an approach or the other, hence playing
> the role of the village idiot here.
>

Understood. My data is that we get an ~11% throughput boost for our 
storage test with this change.

> Thanks,
>
>          M.


Thanks,
John
Ming Lei Dec. 7, 2019, 8:03 a.m. UTC | #3
On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
> Currently the cpu allowed mask for the threaded part of a threaded irq
> handler will be set to the effective affinity of the hard irq.
>
> Typically the effective affinity of the hard irq will be for a single cpu. As such,
> the threaded handler would always run on the same cpu as the hard irq.
>
> We have seen scenarios in high data-rate throughput testing that the cpu
> handling the interrupt can be totally saturated handling both the hard
> interrupt and threaded handler parts, limiting throughput.


Frankly speaking, I never observed that single CPU is saturated by one storage
completion queue's interrupt load. Because CPU is still much quicker than
current storage device. 

If there are more drives, one CPU won't handle more than one queue(drive)'s
interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

So could you describe your case in a bit detail? Then we can confirm
if this change is really needed.

>
> For when the interrupt is managed, allow the threaded part to run on all
> cpus in the irq affinity mask.


I remembered that performance drop is observed by this approach in some
test.


Thanks, 
Ming
John Garry Dec. 9, 2019, 2:30 p.m. UTC | #4
On 07/12/2019 08:03, Ming Lei wrote:
> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
>> Currently the cpu allowed mask for the threaded part of a threaded irq
>> handler will be set to the effective affinity of the hard irq.
>>
>> Typically the effective affinity of the hard irq will be for a single cpu. As such,
>> the threaded handler would always run on the same cpu as the hard irq.
>>
>> We have seen scenarios in high data-rate throughput testing that the cpu
>> handling the interrupt can be totally saturated handling both the hard
>> interrupt and threaded handler parts, limiting throughput.
>


Hi Ming,

> Frankly speaking, I never observed that single CPU is saturated by one storage
> completion queue's interrupt load. Because CPU is still much quicker than
> current storage device.
>
> If there are more drives, one CPU won't handle more than one queue(drive)'s
> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.


Are things this simple? I mean, can you guarantee that fio processes are 
evenly distributed as such?

>
> So could you describe your case in a bit detail? Then we can confirm
> if this change is really needed.


The issue is that the CPU is saturated in servicing the hard and 
threaded part of the interrupt together - here's the sort of thing which 
we saw previously:
Before:
CPU	%usr	%sys	%irq	%soft	%idle
all	2.9	13.1	1.2	4.6	78.2				
0	0.0	29.3	10.1	58.6	2.0
1	18.2	39.4	0.0	1.0	41.4
2	0.0	2.0	0.0	0.0	98.0

CPU0 has effectively no idle.

Then, by allowing the threaded part to roam:
After:
CPU	%usr	%sys	%irq	%soft	%idle
all	3.5	18.4	2.7	6.8	68.6
0	0.0	20.6	29.9	29.9	19.6
1	0.0	39.8	0.0	50.0	10.2

Note: I think that I may be able to reduce the irq hard part load in the
endpoint driver, but not by so much that we would no longer see this issue.

>
>>
>> For when the interrupt is managed, allow the threaded part to run on all
>> cpus in the irq affinity mask.
>
> I remembered that performance drop is observed by this approach in some
> test.


 From checking the thread about the NVMe interrupt swamp, just switching 
to threaded handler alone degrades performance. I didn't see any 
specific results for this change from Long Li - 
https://lkml.org/lkml/2019/8/21/128

Thanks,
John
Hannes Reinecke Dec. 9, 2019, 3:09 p.m. UTC | #5
On 12/9/19 3:30 PM, John Garry wrote:
> On 07/12/2019 08:03, Ming Lei wrote:
>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
>>> Currently the cpu allowed mask for the threaded part of a threaded irq
>>> handler will be set to the effective affinity of the hard irq.
>>>
>>> Typically the effective affinity of the hard irq will be for a single
>>> cpu. As such, the threaded handler would always run on the same cpu as
>>> the hard irq.
>>>
>>> We have seen scenarios in high data-rate throughput testing that the cpu
>>> handling the interrupt can be totally saturated handling both the hard
>>> interrupt and threaded handler parts, limiting throughput.
>>
>
> Hi Ming,
>
>> Frankly speaking, I never observed that single CPU is saturated by one
>> storage completion queue's interrupt load. Because CPU is still much
>> quicker than current storage device.
>>
>> If there are more drives, one CPU won't handle more than one
>> queue(drive)'s interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.
>
> Are things this simple? I mean, can you guarantee that fio processes are
> evenly distributed as such?
>

I would assume that it does, seeing that that was the primary goal of
fio ...

>>
>> So could you describe your case in a bit detail? Then we can confirm
>> if this change is really needed.
>
> The issue is that the CPU is saturated in servicing the hard and
> threaded part of the interrupt together - here's the sort of thing which
> we saw previously:
> Before:
> CPU    %usr    %sys    %irq    %soft    %idle
> all    2.9    13.1    1.2    4.6    78.2
> 0    0.0    29.3    10.1    58.6    2.0
> 1    18.2    39.4    0.0    1.0    41.4
> 2    0.0    2.0    0.0    0.0    98.0
>
> CPU0 has no effectively no idle.
>
> Then, by allowing the threaded part to roam:
> After:
> CPU    %usr    %sys    %irq    %soft    %idle
> all    3.5    18.4    2.7    6.8    68.6
> 0    0.0    20.6    29.9    29.9    19.6
> 1    0.0    39.8    0.0    50.0    10.2
>
> Note: I think that I may be able to reduce the irq hard part load in the
> endpoint driver, but not that much such that we see still this issue.

> 

Well ... to get a _real_ comparison you would need to specify the number 
of irqs handled (and the resulting IOPS) alongside the cpu load.
It might well be that by spreading out the interrupts to other CPUs 
we're increasing the latency, thus trivially reducing the load ...

My idea here is slightly different: can't we leverage SMT?
Most modern CPUs do SMT (I guess even ARM does it nowadays)
(Yes, I know about spectre and things. We're talking performance here :-)

So for 2-way SMT one could move the submisson queue on one side, and the 
completion queue handling (ie the irq handling) on the other side.
Due to SMT we shouldn't suffer from cache misses (keep fingers crossed),
and might even get better performance.

John, would such a scenario work on your boxes?
IE can we tweak the interrupt and queue assignment?

Initially I would love to test things out, just to see what'll be 
happening; might be that it doesn't bring any benefit at all, but it'd 
be interesting to test out anyway.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Marc Zyngier Dec. 9, 2019, 3:17 p.m. UTC | #6
On 2019-12-09 15:09, Hannes Reinecke wrote:

[slight digression]

> My idea here is slightly different: can't we leverage SMT?
> Most modern CPUs do SMT (I guess even ARM does it nowadays)
> (Yes, I know about spectre and things. We're talking performance here :-)


I only know two of those: Cavium TX2 and ARM Neoverse-E1.
ARM SMT CPUs are the absolute minority (and I can't say I'm 
displeased).

         M,
-- 
Jazz is not dead. It just smells funny...
Hannes Reinecke Dec. 9, 2019, 3:25 p.m. UTC | #7
On 12/9/19 4:17 PM, Marc Zyngier wrote:
> On 2019-12-09 15:09, Hannes Reinecke wrote:
>
> [slight digression]
>
>> My idea here is slightly different: can't we leverage SMT?
>> Most modern CPUs do SMT (I guess even ARM does it nowadays)
>> (Yes, I know about spectre and things. We're talking performance here :-)
>
> I only know two of those: Cavium TX2 and ARM Neoverse-E1.
> ARM SMT CPUs are the absolute minority (and I can't say I'm displeased).
>

Ach, too bad.

Still a nice idea, putting SMT finally to some use ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Marc Zyngier Dec. 9, 2019, 3:36 p.m. UTC | #8
On 2019-12-09 15:25, Hannes Reinecke wrote:
> On 12/9/19 4:17 PM, Marc Zyngier wrote:
>> On 2019-12-09 15:09, Hannes Reinecke wrote:
>> [slight digression]
>>
>>> My idea here is slightly different: can't we leverage SMT?
>>> Most modern CPUs do SMT (I guess even ARM does it nowadays)
>>> (Yes, I know about spectre and things. We're talking performance here :-)
>> I only know two of those: Cavium TX2 and ARM Neoverse-E1.
>> ARM SMT CPUs are the absolute minority (and I can't say I'm displeased).
>
> Ach, too bad.
>
> Still a nice idea, putting SMT finally to some use ...


But isn't your SMT idea just a special case of providing an affinity
for the thread (and in this case relative to the affinity of the hard
IRQ)?

You could apply the same principle to target any CPU affinity, and maybe
provide hints for the placement if you're really keen (same L3, for
example).

         M.
-- 
Jazz is not dead. It just smells funny...
Qais Yousef Dec. 9, 2019, 3:49 p.m. UTC | #9
On 12/09/19 15:17, Marc Zyngier wrote:
> On 2019-12-09 15:09, Hannes Reinecke wrote:
>
> [slight digression]
>
> > My idea here is slightly different: can't we leverage SMT?
> > Most modern CPUs do SMT (I guess even ARM does it nowadays)
> > (Yes, I know about spectre and things. We're talking performance here
> > :-)
>
> I only know two of those: Cavium TX2 and ARM Neoverse-E1.


There's the Cortex-A65 too.

--
Qais Yousef

> ARM SMT CPUs are the absolute minority (and I can't say I'm displeased).
>
>         M,
> --
> Jazz is not dead. It just smells funny...
Marc Zyngier Dec. 9, 2019, 3:55 p.m. UTC | #10
On 2019-12-09 15:49, Qais Yousef wrote:
> On 12/09/19 15:17, Marc Zyngier wrote:
>> On 2019-12-09 15:09, Hannes Reinecke wrote:
>>
>> [slight digression]
>>
>> > My idea here is slightly different: can't we leverage SMT?
>> > Most modern CPUs do SMT (I guess even ARM does it nowadays)
>> > (Yes, I know about spectre and things. We're talking performance here
>> > :-)
>>
>> I only know two of those: Cavium TX2 and ARM Neoverse-E1.
>
> There's the Cortex-A65 too.


Which is the exact same core as E1 (but don't tell anyone... ;-).

         M.
-- 
Jazz is not dead. It just smells funny...
Ming Lei Dec. 10, 2019, 1:43 a.m. UTC | #11
On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote:
> On 07/12/2019 08:03, Ming Lei wrote:

> > On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:

> > > Currently the cpu allowed mask for the threaded part of a threaded irq

> > > handler will be set to the effective affinity of the hard irq.

> > > 

> > > Typically the effective affinity of the hard irq will be for a single cpu. As such,

> > > the threaded handler would always run on the same cpu as the hard irq.

> > > 

> > > We have seen scenarios in high data-rate throughput testing that the cpu

> > > handling the interrupt can be totally saturated handling both the hard

> > > interrupt and threaded handler parts, limiting throughput.

> > 

> 

> Hi Ming,

> 

> > Frankly speaking, I never observed that single CPU is saturated by one storage

> > completion queue's interrupt load. Because CPU is still much quicker than

> > current storage device.

> > 

> > If there are more drives, one CPU won't handle more than one queue(drive)'s

> > interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

> 

> Are things this simple? I mean, can you guarantee that fio processes are

> evenly distributed as such?


That is why I ask you for the details of your test.

If you mean hisilicon SAS, the interrupt load should have been distributed
well given the device has multiple reply queues for distributing interrupt
load.

> 

> > 

> > So could you describe your case in a bit detail? Then we can confirm

> > if this change is really needed.

> 

> The issue is that the CPU is saturated in servicing the hard and threaded

> part of the interrupt together - here's the sort of thing which we saw

> previously:

> Before:

> CPU	%usr	%sys	%irq	%soft	%idle

> all	2.9	13.1	1.2	4.6	78.2				

> 0	0.0	29.3	10.1	58.6	2.0

> 1	18.2	39.4	0.0	1.0	41.4

> 2	0.0	2.0	0.0	0.0	98.0

> 

> CPU0 has no effectively no idle.


The result just shows the saturation, we need to root cause it instead
of workaround it via random changes.

> 

> Then, by allowing the threaded part to roam:

> After:

> CPU	%usr	%sys	%irq	%soft	%idle

> all	3.5	18.4	2.7	6.8	68.6

> 0	0.0	20.6	29.9	29.9	19.6

> 1	0.0	39.8	0.0	50.0	10.2

> 

> Note: I think that I may be able to reduce the irq hard part load in the

> endpoint driver, but not that much such that we see still this issue.

> 

> > 

> > > 

> > > For when the interrupt is managed, allow the threaded part to run on all

> > > cpus in the irq affinity mask.

> > 

> > I remembered that performance drop is observed by this approach in some

> > test.

> 

> From checking the thread about the NVMe interrupt swamp, just switching to

> threaded handler alone degrades performance. I didn't see any specific

> results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128


I am pretty clear the reason for Azure, which is caused by aggressive interrupt
coalescing, and this behavior shouldn't be very common, and it can be
addressed by the following patch:

http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html

Then please share your lockup story, such as, which HBA/drivers, test steps,
if you complete IOs from multiple disks(LUNs) on single CPU, if you have
multiple queues, how many active LUNs involved in the test, ...


Thanks,
Ming
John Garry Dec. 10, 2019, 9:45 a.m. UTC | #12
On 10/12/2019 01:43, Ming Lei wrote:
> On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote:

>> On 07/12/2019 08:03, Ming Lei wrote:

>>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:

>>>> Currently the cpu allowed mask for the threaded part of a threaded irq

>>>> handler will be set to the effective affinity of the hard irq.

>>>>

>>>> Typically the effective affinity of the hard irq will be for a single cpu. As such,

>>>> the threaded handler would always run on the same cpu as the hard irq.

>>>>

>>>> We have seen scenarios in high data-rate throughput testing that the cpu

>>>> handling the interrupt can be totally saturated handling both the hard

>>>> interrupt and threaded handler parts, limiting throughput.

>>>


Hi Ming,

>>> Frankly speaking, I never observed that single CPU is saturated by one storage

>>> completion queue's interrupt load. Because CPU is still much quicker than

>>> current storage device.

>>>

>>> If there are more drives, one CPU won't handle more than one queue(drive)'s

>>> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

>>

>> Are things this simple? I mean, can you guarantee that fio processes are

>> evenly distributed as such?

> 

> That is why I ask you for the details of your test.

> 

> If you mean hisilicon SAS,


Yes, it is.

  the interrupt load should have been distributed
> well given the device has multiple reply queues for distributing interrupt

> load.

> 

>>

>>>

>>> So could you describe your case in a bit detail? Then we can confirm

>>> if this change is really needed.

>>

>> The issue is that the CPU is saturated in servicing the hard and threaded

>> part of the interrupt together - here's the sort of thing which we saw

>> previously:

>> Before:

>> CPU	%usr	%sys	%irq	%soft	%idle

>> all	2.9	13.1	1.2	4.6	78.2				

>> 0	0.0	29.3	10.1	58.6	2.0

>> 1	18.2	39.4	0.0	1.0	41.4

>> 2	0.0	2.0	0.0	0.0	98.0

>>

>> CPU0 has no effectively no idle.

> 

> The result just shows the saturation, we need to root cause it instead

> of workaround it via random changes.

> 

>>

>> Then, by allowing the threaded part to roam:

>> After:

>> CPU	%usr	%sys	%irq	%soft	%idle

>> all	3.5	18.4	2.7	6.8	68.6

>> 0	0.0	20.6	29.9	29.9	19.6

>> 1	0.0	39.8	0.0	50.0	10.2

>>

>> Note: I think that I may be able to reduce the irq hard part load in the

>> endpoint driver, but not that much such that we see still this issue.

>>

>>>

>>>>

>>>> For when the interrupt is managed, allow the threaded part to run on all

>>>> cpus in the irq affinity mask.

>>>

>>> I remembered that performance drop is observed by this approach in some

>>> test.

>>

>>  From checking the thread about the NVMe interrupt swamp, just switching to

>> threaded handler alone degrades performance. I didn't see any specific

>> results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128

> 

> I am pretty clear the reason for Azure, which is caused by aggressive interrupt

> coalescing, and this behavior shouldn't be very common, and it can be

> addressed by the following patch:

> 

> http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html

> 

> Then please share your lockup story, such as, which HBA/drivers, test steps,

> if you complete IOs from multiple disks(LUNs) on single CPU, if you have

> multiple queues, how many active LUNs involved in the test, ...


There is no lockup, just a potential performance boost in this change.

My colleague Xiang Chen can provide specifics of the test, as he is the 
one running it.

But one key bit of info - which I did not think most relevant before - 
that is we have 2x SAS controllers running the throughput test on the 
same host.

As such, the completion queue interrupts would be spread identically 
over the CPUs for each controller. I notice that ARM GICv3 ITS interrupt 
controller (which we use) does not use the generic irq matrix allocator, 
which I think would really help with this.

Hi Marc,

Is there any reason for which we couldn't utilise the generic irq
matrix allocator for GICv3?

Thanks,
John
Ming Lei Dec. 10, 2019, 10:06 a.m. UTC | #13
On Tue, Dec 10, 2019 at 09:45:45AM +0000, John Garry wrote:
> On 10/12/2019 01:43, Ming Lei wrote:

> > On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote:

> > > On 07/12/2019 08:03, Ming Lei wrote:

> > > > On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:

> > > > > Currently the cpu allowed mask for the threaded part of a threaded irq

> > > > > handler will be set to the effective affinity of the hard irq.

> > > > > 

> > > > > Typically the effective affinity of the hard irq will be for a single cpu. As such,

> > > > > the threaded handler would always run on the same cpu as the hard irq.

> > > > > 

> > > > > We have seen scenarios in high data-rate throughput testing that the cpu

> > > > > handling the interrupt can be totally saturated handling both the hard

> > > > > interrupt and threaded handler parts, limiting throughput.

> > > > 

> 

> Hi Ming,

> 

> > > > Frankly speaking, I never observed that single CPU is saturated by one storage

> > > > completion queue's interrupt load. Because CPU is still much quicker than

> > > > current storage device.

> > > > 

> > > > If there are more drives, one CPU won't handle more than one queue(drive)'s

> > > > interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

> > > 

> > > Are things this simple? I mean, can you guarantee that fio processes are

> > > evenly distributed as such?

> > 

> > That is why I ask you for the details of your test.

> > 

> > If you mean hisilicon SAS,

> 

> Yes, it is.

> 

>  the interrupt load should have been distributed

> > well given the device has multiple reply queues for distributing interrupt

> > load.

> > 

> > > 

> > > > 

> > > > So could you describe your case in a bit detail? Then we can confirm

> > > > if this change is really needed.

> > > 

> > > The issue is that the CPU is saturated in servicing the hard and threaded

> > > part of the interrupt together - here's the sort of thing which we saw

> > > previously:

> > > Before:

> > > CPU	%usr	%sys	%irq	%soft	%idle

> > > all	2.9	13.1	1.2	4.6	78.2				

> > > 0	0.0	29.3	10.1	58.6	2.0

> > > 1	18.2	39.4	0.0	1.0	41.4

> > > 2	0.0	2.0	0.0	0.0	98.0

> > > 

> > > CPU0 has no effectively no idle.

> > 

> > The result just shows the saturation, we need to root cause it instead

> > of workaround it via random changes.

> > 

> > > 

> > > Then, by allowing the threaded part to roam:

> > > After:

> > > CPU	%usr	%sys	%irq	%soft	%idle

> > > all	3.5	18.4	2.7	6.8	68.6

> > > 0	0.0	20.6	29.9	29.9	19.6

> > > 1	0.0	39.8	0.0	50.0	10.2

> > > 

> > > Note: I think that I may be able to reduce the irq hard part load in the

> > > endpoint driver, but not that much such that we see still this issue.

> > > 

> > > > 

> > > > > 

> > > > > For when the interrupt is managed, allow the threaded part to run on all

> > > > > cpus in the irq affinity mask.

> > > > 

> > > > I remembered that performance drop is observed by this approach in some

> > > > test.

> > > 

> > >  From checking the thread about the NVMe interrupt swamp, just switching to

> > > threaded handler alone degrades performance. I didn't see any specific

> > > results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128

> > 

> > I am pretty clear the reason for Azure, which is caused by aggressive interrupt

> > coalescing, and this behavior shouldn't be very common, and it can be

> > addressed by the following patch:

> > 

> > http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html

> > 

> > Then please share your lockup story, such as, which HBA/drivers, test steps,

> > if you complete IOs from multiple disks(LUNs) on single CPU, if you have

> > multiple queues, how many active LUNs involved in the test, ...

> 

> There is no lockup, just a potential performance boost in this change.

> 

> My colleague Xiang Chen can provide specifics of the test, as he is the one

> running it.

> 

> But one key bit of info - which I did not think most relevant before - that

> is we have 2x SAS controllers running the throughput test on the same host.

> 

> As such, the completion queue interrupts would be spread identically over

> the CPUs for each controller. I notice that ARM GICv3 ITS interrupt

> controller (which we use) does not use the generic irq matrix allocator,

> which I think would really help with this.


Yeah, it looks like only x86 uses the irq matrix, which seems to have been
abstracted from x86 arch code, and multiple NVMe drives may perform worse
on a non-x86 server.

Also, when running IO against multiple LUNs in a single HBA, there is a
chance of saturating the completion CPU, given that multiple disks may be
quicker than the single CPU. The IRQ matrix can't help in this case.


Thanks, 
Ming
Marc Zyngier Dec. 10, 2019, 10:28 a.m. UTC | #14
On 2019-12-10 09:45, John Garry wrote:
> On 10/12/2019 01:43, Ming Lei wrote:

>> On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote:

>>> On 07/12/2019 08:03, Ming Lei wrote:

>>>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:

>>>>> Currently the cpu allowed mask for the threaded part of a 

>>>>> threaded irq

>>>>> handler will be set to the effective affinity of the hard irq.

>>>>>

>>>>> Typically the effective affinity of the hard irq will be for a 

>>>>> single cpu. As such,

>>>>> the threaded handler would always run on the same cpu as the hard 

>>>>> irq.

>>>>>

>>>>> We have seen scenarios in high data-rate throughput testing that 

>>>>> the cpu

>>>>> handling the interrupt can be totally saturated handling both the 

>>>>> hard

>>>>> interrupt and threaded handler parts, limiting throughput.

>>>>

>

> Hi Ming,

>

>>>> Frankly speaking, I never observed that single CPU is saturated by 

>>>> one storage

>>>> completion queue's interrupt load. Because CPU is still much 

>>>> quicker than

>>>> current storage device.

>>>>

>>>> If there are more drives, one CPU won't handle more than one 

>>>> queue(drive)'s

>>>> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

>>>

>>> Are things this simple? I mean, can you guarantee that fio 

>>> processes are

>>> evenly distributed as such?

>> That is why I ask you for the details of your test.

>> If you mean hisilicon SAS,

>

> Yes, it is.

>

>  the interrupt load should have been distributed

>> well given the device has multiple reply queues for distributing 

>> interrupt

>> load.

>>

>>>

>>>>

>>>> So could you describe your case in a bit detail? Then we can 

>>>> confirm

>>>> if this change is really needed.

>>>

>>> The issue is that the CPU is saturated in servicing the hard and 

>>> threaded

>>> part of the interrupt together - here's the sort of thing which we 

>>> saw

>>> previously:

>>> Before:

>>> CPU	%usr	%sys	%irq	%soft	%idle

>>> all	2.9	13.1	1.2	4.6	78.2

>>> 0	0.0	29.3	10.1	58.6	2.0

>>> 1	18.2	39.4	0.0	1.0	41.4

>>> 2	0.0	2.0	0.0	0.0	98.0

>>>

>>> CPU0 has no effectively no idle.

>> The result just shows the saturation, we need to root cause it 

>> instead

>> of workaround it via random changes.

>>

>>>

>>> Then, by allowing the threaded part to roam:

>>> After:

>>> CPU	%usr	%sys	%irq	%soft	%idle

>>> all	3.5	18.4	2.7	6.8	68.6

>>> 0	0.0	20.6	29.9	29.9	19.6

>>> 1	0.0	39.8	0.0	50.0	10.2

>>>

>>> Note: I think that I may be able to reduce the irq hard part load 

>>> in the

>>> endpoint driver, but not that much such that we see still this 

>>> issue.

>>>

>>>>

>>>>>

>>>>> For when the interrupt is managed, allow the threaded part to run 

>>>>> on all

>>>>> cpus in the irq affinity mask.

>>>>

>>>> I remembered that performance drop is observed by this approach in 

>>>> some

>>>> test.

>>>

>>>  From checking the thread about the NVMe interrupt swamp, just 

>>> switching to

>>> threaded handler alone degrades performance. I didn't see any 

>>> specific

>>> results for this change from Long Li - 

>>> https://lkml.org/lkml/2019/8/21/128

>> I am pretty clear the reason for Azure, which is caused by 

>> aggressive interrupt

>> coalescing, and this behavior shouldn't be very common, and it can 

>> be

>> addressed by the following patch:

>> 

>> http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html

>> Then please share your lockup story, such as, which HBA/drivers, 

>> test steps,

>> if you complete IOs from multiple disks(LUNs) on single CPU, if you 

>> have

>> multiple queues, how many active LUNs involved in the test, ...

>

> There is no lockup, just a potential performance boost in this 

> change.

>

> My colleague Xiang Chen can provide specifics of the test, as he is

> the one running it.

>

> But one key bit of info - which I did not think most relevant before

> - that is we have 2x SAS controllers running the throughput test on

> the same host.

>

> As such, the completion queue interrupts would be spread identically

> over the CPUs for each controller. I notice that ARM GICv3 ITS

> interrupt controller (which we use) does not use the generic irq

> matrix allocator, which I think would really help with this.

>

> Hi Marc,

>

> Is there any reason for which we couldn't utilise of the generic irq

> matrix allocator for GICv3?


For a start, the ITS code predates the matrix allocator by about three
years. Also, my understanding of this allocator is that it allows
x86 to cope with a very small number of possible interrupt vectors
per CPU. The ITS doesn't have such issue, as:

1) the namespace is global, and not per CPU
2) the namespace is *huge*

Now, what property of the matrix allocator is the ITS code missing?
I'd be more than happy to improve it.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 10, 2019, 10:59 a.m. UTC | #15
>>

>> There is no lockup, just a potential performance boost in this change.

>>

>> My colleague Xiang Chen can provide specifics of the test, as he is

>> the one running it.

>>

>> But one key bit of info - which I did not think most relevant before

>> - that is we have 2x SAS controllers running the throughput test on

>> the same host.

>>

>> As such, the completion queue interrupts would be spread identically

>> over the CPUs for each controller. I notice that ARM GICv3 ITS

>> interrupt controller (which we use) does not use the generic irq

>> matrix allocator, which I think would really help with this.

>>

>> Hi Marc,

>>

>> Is there any reason for which we couldn't utilise of the generic irq

>> matrix allocator for GICv3?

> 


Hi Marc,

> For a start, the ITS code predates the matrix allocator by about three

> years. Also, my understanding of this allocator is that it allows

> x86 to cope with a very small number of possible interrupt vectors

> per CPU. The ITS doesn't have such issue, as:

> 

> 1) the namespace is global, and not per CPU

> 2) the namespace is *huge*

> 

> Now, what property of the matrix allocator is the ITS code missing?

> I'd be more than happy to improve it.


I think specifically the property that the matrix allocator will try to 
find a CPU for irq affinity which "has the lowest number of managed IRQs 
allocated" - I'm quoting the comment on matrix_find_best_cpu_managed().

The ITS code will make the lowest online CPU in the affinity mask the 
target CPU for the interrupt, which may result in some CPUs handling so 
many interrupts.
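
[A simplified sketch of the two placement policies being contrasted here; this is not the actual driver code (the real logic lives in its_set_affinity()/its_irq_domain_activate() and kernel/irq/matrix.c), and per_cpu_irq_count() is a stand-in for whatever bookkeeping would be used:]

    /* ITS today: bind the LPI to the first online CPU in the mask */
    cpu = cpumask_first_and(cpu_mask, cpu_online_mask);

    /*
     * Matrix-allocator style: pick the online CPU in the mask that
     * currently has the fewest interrupts assigned to it.
     */
    best = nr_cpu_ids;
    min_count = INT_MAX;
    for_each_cpu_and(tmp, cpu_mask, cpu_online_mask) {
        if (per_cpu_irq_count(tmp) < min_count) {
            min_count = per_cpu_irq_count(tmp);
            best = tmp;
        }
    }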

Thanks,
John
Marc Zyngier Dec. 10, 2019, 11:36 a.m. UTC | #16
On 2019-12-10 10:59, John Garry wrote:
>>>

>>> There is no lockup, just a potential performance boost in this 

>>> change.

>>>

>>> My colleague Xiang Chen can provide specifics of the test, as he is

>>> the one running it.

>>>

>>> But one key bit of info - which I did not think most relevant 

>>> before

>>> - that is we have 2x SAS controllers running the throughput test on

>>> the same host.

>>>

>>> As such, the completion queue interrupts would be spread 

>>> identically

>>> over the CPUs for each controller. I notice that ARM GICv3 ITS

>>> interrupt controller (which we use) does not use the generic irq

>>> matrix allocator, which I think would really help with this.

>>>

>>> Hi Marc,

>>>

>>> Is there any reason for which we couldn't utilise of the generic 

>>> irq

>>> matrix allocator for GICv3?

>>

>

> Hi Marc,

>

>> For a start, the ITS code predates the matrix allocator by about 

>> three

>> years. Also, my understanding of this allocator is that it allows

>> x86 to cope with a very small number of possible interrupt vectors

>> per CPU. The ITS doesn't have such issue, as:

>> 1) the namespace is global, and not per CPU

>> 2) the namespace is *huge*

>> Now, what property of the matrix allocator is the ITS code missing?

>> I'd be more than happy to improve it.

>

> I think specifically the property that the matrix allocator will try

> to find a CPU for irq affinity which "has the lowest number of 

> managed

> IRQs allocated" - I'm quoting the comment on 

> matrix_find_best_cpu_managed().


But that decision is due to allocation constraints. You can have at most
256 interrupts per CPU, so the allocator tries to balance it.

On the contrary, the ITS doesn't care about how many interrupts target any
given CPU. The whole 2^24 interrupt namespace can be thrown at a single
CPU.

> The ITS code will make the lowest online CPU in the affinity mask the

> target CPU for the interrupt, which may result in some CPUs handling

> so many interrupts.


If what you want is for the *default* affinity to be spread around,
that should be achieved pretty easily. Let me have a think about how
to do that.

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 10, 2019, 12:05 p.m. UTC | #17
On 10/12/2019 11:36, Marc Zyngier wrote:
> On 2019-12-10 10:59, John Garry wrote:

>>>>

>>>> There is no lockup, just a potential performance boost in this change.

>>>>

>>>> My colleague Xiang Chen can provide specifics of the test, as he is

>>>> the one running it.

>>>>

>>>> But one key bit of info - which I did not think most relevant before

>>>> - that is we have 2x SAS controllers running the throughput test on

>>>> the same host.

>>>>

>>>> As such, the completion queue interrupts would be spread identically

>>>> over the CPUs for each controller. I notice that ARM GICv3 ITS

>>>> interrupt controller (which we use) does not use the generic irq

>>>> matrix allocator, which I think would really help with this.

>>>>

>>>> Hi Marc,

>>>>

>>>> Is there any reason for which we couldn't utilise of the generic irq

>>>> matrix allocator for GICv3?

>>>

>>

>> Hi Marc,

>>

>>> For a start, the ITS code predates the matrix allocator by about three

>>> years. Also, my understanding of this allocator is that it allows

>>> x86 to cope with a very small number of possible interrupt vectors

>>> per CPU. The ITS doesn't have such issue, as:

>>> 1) the namespace is global, and not per CPU

>>> 2) the namespace is *huge*

>>> Now, what property of the matrix allocator is the ITS code missing?

>>> I'd be more than happy to improve it.

>>

>> I think specifically the property that the matrix allocator will try

>> to find a CPU for irq affinity which "has the lowest number of managed

>> IRQs allocated" - I'm quoting the comment on 

>> matrix_find_best_cpu_managed().

> 

> But that decision is due to allocation constraints. You can have at most

> 256 interrupts per CPU, so the allocator tries to balance it.

> 

> On the contrary, the ITS does care about how many interrupt target any

> given CPU. The whole 2^24 interrupt namespace can be thrown at a single

> CPU.

> 

>> The ITS code will make the lowest online CPU in the affinity mask the

>> target CPU for the interrupt, which may result in some CPUs handling

>> so many interrupts.

> 

> If what you want is for the *default* affinity to be spread around,

> that should be achieved pretty easily. Let me have a think about how

> to do that.


Cool, I anticipate that it should help my case.

I can also seek out some NVMe cards to see how it would help a more 
"generic" scenario.

Cheers,
John
Marc Zyngier Dec. 10, 2019, 6:32 p.m. UTC | #18
On 2019-12-10 12:05, John Garry wrote:
> On 10/12/2019 11:36, Marc Zyngier wrote:

>> On 2019-12-10 10:59, John Garry wrote:

>>>>>

>>>>> There is no lockup, just a potential performance boost in this 

>>>>> change.

>>>>>

>>>>> My colleague Xiang Chen can provide specifics of the test, as he 

>>>>> is

>>>>> the one running it.

>>>>>

>>>>> But one key bit of info - which I did not think most relevant 

>>>>> before

>>>>> - that is we have 2x SAS controllers running the throughput test 

>>>>> on

>>>>> the same host.

>>>>>

>>>>> As such, the completion queue interrupts would be spread 

>>>>> identically

>>>>> over the CPUs for each controller. I notice that ARM GICv3 ITS

>>>>> interrupt controller (which we use) does not use the generic irq

>>>>> matrix allocator, which I think would really help with this.

>>>>>

>>>>> Hi Marc,

>>>>>

>>>>> Is there any reason for which we couldn't utilise of the generic 

>>>>> irq

>>>>> matrix allocator for GICv3?

>>>>

>>>

>>> Hi Marc,

>>>

>>>> For a start, the ITS code predates the matrix allocator by about 

>>>> three

>>>> years. Also, my understanding of this allocator is that it allows

>>>> x86 to cope with a very small number of possible interrupt vectors

>>>> per CPU. The ITS doesn't have such issue, as:

>>>> 1) the namespace is global, and not per CPU

>>>> 2) the namespace is *huge*

>>>> Now, what property of the matrix allocator is the ITS code 

>>>> missing?

>>>> I'd be more than happy to improve it.

>>>

>>> I think specifically the property that the matrix allocator will 

>>> try

>>> to find a CPU for irq affinity which "has the lowest number of 

>>> managed

>>> IRQs allocated" - I'm quoting the comment on 

>>> matrix_find_best_cpu_managed().

>> But that decision is due to allocation constraints. You can have at 

>> most

>> 256 interrupts per CPU, so the allocator tries to balance it.

>> On the contrary, the ITS does care about how many interrupt target 

>> any

>> given CPU. The whole 2^24 interrupt namespace can be thrown at a 

>> single

>> CPU.

>>

>>> The ITS code will make the lowest online CPU in the affinity mask 

>>> the

>>> target CPU for the interrupt, which may result in some CPUs 

>>> handling

>>> so many interrupts.

>> If what you want is for the *default* affinity to be spread around,

>> that should be achieved pretty easily. Let me have a think about how

>> to do that.

>

> Cool, I anticipate that it should help my case.

>

> I can also seek out some NVMe cards to see how it would help a more

> "generic" scenario.


Can you give the following a go? It probably has all kind of warts on
top of the quality debug information, but I managed to get my D05 and
a couple of guests to boot with it. It will probably eat your data,
so use caution! ;-)

Thanks,

         M.

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index e05673bcd52b..301ee3bc0602 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -177,6 +177,8 @@ static DEFINE_IDA(its_vpeid_ida);
  #define gic_data_rdist_rd_base()	(gic_data_rdist()->rd_base)
  #define gic_data_rdist_vlpi_base()	(gic_data_rdist_rd_base() + SZ_128K)

+static DEFINE_PER_CPU(atomic_t, cpu_lpi_count);
+
  static u16 get_its_list(struct its_vm *vm)
  {
  	struct its_node *its;
@@ -1287,42 +1289,76 @@ static void its_unmask_irq(struct irq_data *d)
  	lpi_update_config(d, 0, LPI_PROP_ENABLED);
  }

+static int its_pick_target_cpu(struct its_device *its_dev, const struct cpumask *cpu_mask)
+{
+	unsigned int cpu = nr_cpu_ids, tmp;
+	int count = S32_MAX;
+
+	for_each_cpu_and(tmp, cpu_mask, cpu_online_mask) {
+		int this_count = per_cpu(cpu_lpi_count, tmp).counter;
+		if (this_count < count) {
+			cpu = tmp;
+		        count = this_count;
+		}
+	}
+
+	return cpu;
+}
+
  static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
  			    bool force)
  {
-	unsigned int cpu;
-	const struct cpumask *cpu_mask = cpu_online_mask;
  	struct its_device *its_dev = irq_data_get_irq_chip_data(d);
-	struct its_collection *target_col;
+	int ret = IRQ_SET_MASK_OK_DONE;
  	u32 id = its_get_event_id(d);
+	cpumask_var_t tmpmask;

  	/* A forwarded interrupt should use irq_set_vcpu_affinity */
  	if (irqd_is_forwarded_to_vcpu(d))
  		return -EINVAL;

-       /* lpi cannot be routed to a redistributor that is on a foreign node */
-	if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) {
-		if (its_dev->its->numa_node >= 0) {
-			cpu_mask = cpumask_of_node(its_dev->its->numa_node);
-			if (!cpumask_intersects(mask_val, cpu_mask))
-				return -EINVAL;
+	if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
+		return -ENOMEM;
+
+	cpumask_and(tmpmask, mask_val, cpu_online_mask);
+
+	if (its_dev->its->numa_node >= 0)
+		cpumask_and(tmpmask, tmpmask, cpumask_of_node(its_dev->its->numa_node));
+
+	if (cpumask_empty(tmpmask)) {
+		/* LPI cannot be routed to a redistributor that is on a foreign node */
+		if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) {
+			ret = -EINVAL;
+			goto out;
  		}
+
+		cpumask_copy(tmpmask, cpu_online_mask);
  	}

-	cpu = cpumask_any_and(mask_val, cpu_mask);
+	if (!cpumask_test_cpu(its_dev->event_map.col_map[id], tmpmask)) {
+		struct its_collection *target_col;
+		int cpu;

-	if (cpu >= nr_cpu_ids)
-		return -EINVAL;
+		cpu = its_pick_target_cpu(its_dev, tmpmask);
+		if (cpu >= nr_cpu_ids) {
+			ret = -EINVAL;
+			goto out;
+		}

-	/* don't set the affinity when the target cpu is same as current one */
-	if (cpu != its_dev->event_map.col_map[id]) {
+		pr_info("IRQ%d CPU%d -> CPU%d\n",
+			d->irq, its_dev->event_map.col_map[id], cpu);
+		atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));
+		atomic_dec(per_cpu_ptr(&cpu_lpi_count,
+				       its_dev->event_map.col_map[id]));
  		target_col = &its_dev->its->collections[cpu];
  		its_send_movi(its_dev, target_col, id);
  		its_dev->event_map.col_map[id] = cpu;
  		irq_data_update_effective_affinity(d, cpumask_of(cpu));
  	}

-	return IRQ_SET_MASK_OK_DONE;
+out:
+	free_cpumask_var(tmpmask);
+	return ret;
  }

  static u64 its_irq_get_msi_base(struct its_device *its_dev)
@@ -2773,22 +2809,28 @@ static int its_irq_domain_activate(struct irq_domain *domain,
  {
  	struct its_device *its_dev = irq_data_get_irq_chip_data(d);
  	u32 event = its_get_event_id(d);
-	const struct cpumask *cpu_mask = cpu_online_mask;
-	int cpu;
+	int cpu = nr_cpu_ids;

-	/* get the cpu_mask of local node */
-	if (its_dev->its->numa_node >= 0)
-		cpu_mask = cpumask_of_node(its_dev->its->numa_node);
+	/* Find the least loaded CPU on the local node */
+	if (its_dev->its->numa_node >= 0) {
+		cpu = its_pick_target_cpu(its_dev,
+					  cpumask_of_node(its_dev->its->numa_node));
+		if (cpu < 0)
+			return cpu;

-	/* Bind the LPI to the first possible CPU */
-	cpu = cpumask_first_and(cpu_mask, cpu_online_mask);
-	if (cpu >= nr_cpu_ids) {
-		if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144)
+		if (cpu >= nr_cpu_ids &&
+		    (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144))
  			return -EINVAL;
+	}

-		cpu = cpumask_first(cpu_online_mask);
+	if (cpu >= nr_cpu_ids) {
+		cpu = its_pick_target_cpu(its_dev, cpu_online_mask);
+		if (cpu < 0)
+			return cpu;
  	}

+	pr_info("picked CPU%d IRQ%d\n", cpu, d->irq);
+	atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));
  	its_dev->event_map.col_map[event] = cpu;
  	irq_data_update_effective_affinity(d, cpumask_of(cpu));

@@ -2803,6 +2845,8 @@ static void its_irq_domain_deactivate(struct irq_domain *domain,
  	struct its_device *its_dev = irq_data_get_irq_chip_data(d);
  	u32 event = its_get_event_id(d);

+	atomic_dec(per_cpu_ptr(&cpu_lpi_count,
+			       its_dev->event_map.col_map[event]));
  	/* Stop the delivery of interrupts */
  	its_send_discard(its_dev, event);
  }

-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 11, 2019, 9:41 a.m. UTC | #19
On 10/12/2019 18:32, Marc Zyngier wrote:
>>>> The ITS code will make the lowest online CPU in the affinity mask

>>>> the

>>>> target CPU for the interrupt, which may result in some CPUs

>>>> handling

>>>> so many interrupts.

>>> If what you want is for the*default*  affinity to be spread around,

>>> that should be achieved pretty easily. Let me have a think about how

>>> to do that.

>> Cool, I anticipate that it should help my case.

>>

>> I can also seek out some NVMe cards to see how it would help a more

>> "generic" scenario.

> Can you give the following a go? It probably has all kind of warts on

> top of the quality debug information, but I managed to get my D05 and

> a couple of guests to boot with it. It will probably eat your data,

> so use caution!;-)

> 


Hi Marc,

Ok, we'll give it a spin.

Thanks,
John

> Thanks,

> 

>           M.

> 

> diff --git a/drivers/irqchip/irq-gic-v3-its.c

> b/drivers/irqchip/irq-gic-v3-its.c

> index e05673bcd52b..301ee3bc0602 100644

> --- a/drivers/irqchip/irq-gic-v3-its.c

> +++ b/drivers/irqchip/irq-gic-v3-its.c

> @@ -177,6 +177,8 @@ static DEFINE_IDA(its_vpeid_ida);
John Garry Dec. 11, 2019, 5:09 p.m. UTC | #20
On 10/12/2019 01:43, Ming Lei wrote:
>>>> For when the interrupt is managed, allow the threaded part to run on all

>>>> cpus in the irq affinity mask.

>>> I remembered that performance drop is observed by this approach in some

>>> test.

>>  From checking the thread about the NVMe interrupt swamp, just switching to

>> threaded handler alone degrades performance. I didn't see any specific

>> results for this change from Long Li -https://lkml.org/lkml/2019/8/21/128


Hi Ming,

> I am pretty clear the reason for Azure, which is caused by aggressive interrupt

> coalescing, and this behavior shouldn't be very common, and it can be

> addressed by the following patch:


I am running some NVMe perf tests with Marc's patch.

I see this almost always eventually (with or without that patch):

[   66.018140] rcu: INFO: rcu_preempt self-detected stall on CPU2% done] 
[5058MB/0KB/0KB /s] [1295K/0/0 iops] [eta 01m:39s]
[   66.023885] rcu: 12-....: (5250 ticks this GP) 
idle=182/1/0x4000000000000004 softirq=517/517 fqs=2529
[   66.033306] (t=5254 jiffies g=733 q=2241)
[   66.037394] Task dump for CPU 12:
[   66.040696] fio             R  running task        0   798    796 
0x00000002
[   66.047733] Call trace:
[   66.050173]  dump_backtrace+0x0/0x1a0
[   66.053823]  show_stack+0x14/0x20
[   66.057126]  sched_show_task+0x164/0x1a0
[   66.061036]  dump_cpu_task+0x40/0x2e8
[   66.064686]  rcu_dump_cpu_stacks+0xa0/0xe0
[   66.068769]  rcu_sched_clock_irq+0x6d8/0xaa8
[   66.073027]  update_process_times+0x2c/0x50
[   66.077198]  tick_sched_handle.isra.14+0x30/0x50
[   66.081802]  tick_sched_timer+0x48/0x98
[   66.085625]  __hrtimer_run_queues+0x120/0x1b8
[   66.089968]  hrtimer_interrupt+0xd4/0x250
[   66.093966]  arch_timer_handler_phys+0x28/0x40
[   66.098398]  handle_percpu_devid_irq+0x80/0x140
[   66.102915]  generic_handle_irq+0x24/0x38
[   66.106911]  __handle_domain_irq+0x5c/0xb0
[   66.110995]  gic_handle_irq+0x5c/0x148
[   66.114731]  el1_irq+0xb8/0x180
[   66.117858]  efi_header_end+0x94/0x234
[   66.121595]  irq_exit+0xd0/0xd8
[   66.124724]  __handle_domain_irq+0x60/0xb0
[   66.128806]  gic_handle_irq+0x5c/0x148
[   66.132542]  el0_irq_naked+0x4c/0x54
[   97.152870] rcu: INFO: rcu_preempt self-detected stall on CPU8% done] 
[4736MB/0KB/0KB /s] [1212K/0/0 iops] [eta 01m:08s]
[   97.158616] rcu: 8-....: (1 GPs behind) idle=08e/1/0x4000000000000002 
softirq=462/505 fqs=2621
[   97.167414] (t=5253 jiffies g=737 q=5507)
[   97.171498] Task dump for CPU 8:
[pu_task+0x40/0x2e8
[   97.198705]  rcu_dump_cpu_stacks+0xa0/0xe0
[   97.202788]  rcu_sched_clock_irq+0x6d8/0xaa8
[   97.207046]  update_process_times+0x2c/0x50
[   97.211217]  tick_sched_handle.isra.14+0x30/0x50
[   97.215820]  tick_sched_timer+0x48/0x98
[   97.219644]  __hrtimer_run_queues+0x120/0x1b8
[   97.223989]  hrtimer_interrupt+0xd4/0x250
[   97.227987]  arch_timer_handler_phys+0x28/0x40
[   97.232418]  handle_percpu_devid_irq+0x80/0x140
[   97.236935]  generic_handle_irq+0x24/0x38
[   97.240931]  __handle_domain_irq+0x5c/0xb0
[   97.245015]  gic_handle_irq+0x5c/0x148
[   97.248751]  el1_irq+0xb8/0x180
[   97.251880]  find_busiest_group+0x18c/0x9e8
[   97.256050]  load_balance+0x154/0xb98
[   97.259700]  rebalance_domains+0x1cc/0x2f8
[   97.263783]  run_rebalance_domains+0x78/0xe0
[   97.268040]  efi_header_end+0x114/0x234
[   97.271864]  run_ksoftirqd+0x38/0x48
[   97.275427]  smpboot_thread_fn+0x16c/0x270
[   97.279511]  kthread+0x118/0x120
[   97.282726]  ret_from_fork+0x10/0x18
[   97.286289] Task dump for CPU 12:
[   97.289591] kworker/12:1    R  running task        0   570      2 
0x0000002a
[   97.296634] Workqueue:  0x0 (mm_percpu_wq)
[   97.300718] Call trace:
[   97.303152]  __switch_to+0xbc/0x218
[   97.306632]  page_wait_table+0x1500/0x1800

Would this be the same interrupt "swamp" issue?

> 

> http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html

> 


What is the status of these patches? I did not see them in mainline.

> Then please share your lockup story, such as, which HBA/drivers, test steps,

> if you complete IOs from multiple disks(LUNs) on single CPU, if you have

> multiple queues, how many active LUNs involved in the test, ...

> 

> 


Thanks,
John
Ming Lei Dec. 12, 2019, 10:38 p.m. UTC | #21
On Wed, Dec 11, 2019 at 05:09:18PM +0000, John Garry wrote:
> On 10/12/2019 01:43, Ming Lei wrote:

> > > > > For when the interrupt is managed, allow the threaded part to run on all

> > > > > cpus in the irq affinity mask.

> > > > I remembered that performance drop is observed by this approach in some

> > > > test.

> > >  From checking the thread about the NVMe interrupt swamp, just switching to

> > > threaded handler alone degrades performance. I didn't see any specific

> > > results for this change from Long Li -https://lkml.org/lkml/2019/8/21/128

> 

> Hi Ming,

> 

> > I am pretty clear the reason for Azure, which is caused by aggressive interrupt

> > coalescing, and this behavior shouldn't be very common, and it can be

> > addressed by the following patch:

> 

> I am running some NVMe perf tests with Marc's patch.


We need to confirm whether Marc's patch works as expected; could you
collect a log via the attached script?

> 

> I see this almost always eventually (with or without that patch):

> 

> [   66.018140] rcu: INFO: rcu_preempt self-detected stall on CPU2% done]

> [5058MB/0KB/0KB /s] [1295K/0/0 iops] [eta 01m:39s]

> [   66.023885] rcu: 12-....: (5250 ticks this GP)

> idle=182/1/0x4000000000000004 softirq=517/517 fqs=2529

> [   66.033306] (t=5254 jiffies g=733 q=2241)

> [   66.037394] Task dump for CPU 12:

> [   66.040696] fio             R  running task        0   798    796

> 0x00000002

> [   66.047733] Call trace:

> [   66.050173]  dump_backtrace+0x0/0x1a0

> [   66.053823]  show_stack+0x14/0x20

> [   66.057126]  sched_show_task+0x164/0x1a0

> [   66.061036]  dump_cpu_task+0x40/0x2e8

> [   66.064686]  rcu_dump_cpu_stacks+0xa0/0xe0

> [   66.068769]  rcu_sched_clock_irq+0x6d8/0xaa8

> [   66.073027]  update_process_times+0x2c/0x50

> [   66.077198]  tick_sched_handle.isra.14+0x30/0x50

> [   66.081802]  tick_sched_timer+0x48/0x98

> [   66.085625]  __hrtimer_run_queues+0x120/0x1b8

> [   66.089968]  hrtimer_interrupt+0xd4/0x250

> [   66.093966]  arch_timer_handler_phys+0x28/0x40

> [   66.098398]  handle_percpu_devid_irq+0x80/0x140

> [   66.102915]  generic_handle_irq+0x24/0x38

> [   66.106911]  __handle_domain_irq+0x5c/0xb0

> [   66.110995]  gic_handle_irq+0x5c/0x148

> [   66.114731]  el1_irq+0xb8/0x180

> [   66.117858]  efi_header_end+0x94/0x234

> [   66.121595]  irq_exit+0xd0/0xd8

> [   66.124724]  __handle_domain_irq+0x60/0xb0

> [   66.128806]  gic_handle_irq+0x5c/0x148

> [   66.132542]  el0_irq_naked+0x4c/0x54

> [   97.152870] rcu: INFO: rcu_preempt self-detected stall on CPU8% done]

> [4736MB/0KB/0KB /s] [1212K/0/0 iops] [eta 01m:08s]

> [   97.158616] rcu: 8-....: (1 GPs behind) idle=08e/1/0x4000000000000002

> softirq=462/505 fqs=2621

> [   97.167414] (t=5253 jiffies g=737 q=5507)

> [   97.171498] Task dump for CPU 8:

> [pu_task+0x40/0x2e8

> [   97.198705]  rcu_dump_cpu_stacks+0xa0/0xe0

> [   97.202788]  rcu_sched_clock_irq+0x6d8/0xaa8

> [   97.207046]  update_process_times+0x2c/0x50

> [   97.211217]  tick_sched_handle.isra.14+0x30/0x50

> [   97.215820]  tick_sched_timer+0x48/0x98

> [   97.219644]  __hrtimer_run_queues+0x120/0x1b8

> [   97.223989]  hrtimer_interrupt+0xd4/0x250

> [   97.227987]  arch_timer_handler_phys+0x28/0x40

> [   97.232418]  handle_percpu_devid_irq+0x80/0x140

> [   97.236935]  generic_handle_irq+0x24/0x38

> [   97.240931]  __handle_domain_irq+0x5c/0xb0

> [   97.245015]  gic_handle_irq+0x5c/0x148

> [   97.248751]  el1_irq+0xb8/0x180

> [   97.251880]  find_busiest_group+0x18c/0x9e8

> [   97.256050]  load_balance+0x154/0xb98

> [   97.259700]  rebalance_domains+0x1cc/0x2f8

> [   97.263783]  run_rebalance_domains+0x78/0xe0

> [   97.268040]  efi_header_end+0x114/0x234

> [   97.271864]  run_ksoftirqd+0x38/0x48

> [   97.275427]  smpboot_thread_fn+0x16c/0x270

> [   97.279511]  kthread+0x118/0x120

> [   97.282726]  ret_from_fork+0x10/0x18

> [   97.286289] Task dump for CPU 12:

> [   97.289591] kworker/12:1    R  running task        0   570      2

> 0x0000002a

> [   97.296634] Workqueue:  0x0 (mm_percpu_wq)

> [   97.300718] Call trace:

> [   97.303152]  __switch_to+0xbc/0x218

> [   97.306632]  page_wait_table+0x1500/0x1800

> 

> Would this be the same interrupt "swamp" issue?


It could be, but the reason needs to be investigated.

You never provided the test details (how many drives, how many disks
attached to each drive) that I asked for, so I can't comment on the
reason; nothing so far shows that the patch is a good fix.

My theory is simple: so far, the CPU is still much quicker than
current storage, as long as the IO isn't coming from multiple disks
connected to the same drive.

Thanks, 
Ming
#!/bin/sh

get_disk_from_pcid()
{
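	# map a PCI ID (e.g. 81:00.0) to the matching block device under /sys/block, if any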
	PCID=$1

	DISKS=`find /sys/block -name "*"`
	for DISK in $DISKS; do
		DISKP=`realpath $DISK/device`
		echo $DISKP | grep $PCID > /dev/null
		[ $? -eq 0 ] && echo `basename $DISK` && break
	done
}

dump_irq_affinity()
{
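	# list every MSI vector of the given PCI function with its configured and effective CPU affinity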
	PCID=$1
	PCIP=`find /sys/devices -name *$PCID | grep pci`

	[[ ! -d $PCIP/msi_irqs ]] && return

	IRQS=`ls $PCIP/msi_irqs`

	[ $? -ne 0 ] && return

	DISK=`get_disk_from_pcid $PCID`
	echo "PCI name is $PCID: $DISK"

	for IRQ in $IRQS; do
	    [ -f /proc/irq/$IRQ/smp_affinity_list ] && CPUS=`cat /proc/irq/$IRQ/smp_affinity_list`
	    [ -f /proc/irq/$IRQ/effective_affinity_list ] && ECPUS=`cat /proc/irq/$IRQ/effective_affinity_list`
	    echo -e "\tirq $IRQ, cpu list $CPUS, effective list $ECPUS"
	done
}


if [ $# -ge 1 ]; then
	PCIDS=$1
else
#	PCID=`lspci | grep "Non-Volatile memory" | cut -c1-7`
	PCIDS=`lspci | grep "Non-Volatile memory controller" | awk '{print $1}'`
fi

echo "kernel version: "
uname -a

for PCID in $PCIDS; do
	dump_irq_affinity $PCID
done
John Garry Dec. 13, 2019, 10:07 a.m. UTC | #22
On 11/12/2019 09:41, John Garry wrote:
> On 10/12/2019 18:32, Marc Zyngier wrote:

>>>>> The ITS code will make the lowest online CPU in the affinity mask

>>>>> the

>>>>> target CPU for the interrupt, which may result in some CPUs

>>>>> handling

>>>>> so many interrupts.

>>>> If what you want is for the*default*  affinity to be spread around,

>>>> that should be achieved pretty easily. Let me have a think about how

>>>> to do that.

>>> Cool, I anticipate that it should help my case.

>>>

>>> I can also seek out some NVMe cards to see how it would help a more

>>> "generic" scenario.

>> Can you give the following a go? It probably has all kind of warts on

>> top of the quality debug information, but I managed to get my D05 and

>> a couple of guests to boot with it. It will probably eat your data,

>> so use caution!;-)

>>

> 

> Hi Marc,

> 

> Ok, we'll give it a spin.

> 

> Thanks,

> John


Hi Marc,

JFYI, we're still testing this and the patch itself seems to work as 
intended.

Here's the kernel log if you just want to see how the interrupts are 
getting assigned:
https://pastebin.com/hh3r810g

For me, I did get a performance boost for NVMe testing, but my colleague 
Xiang Chen saw a drop for our storage test of interest  - that's the 
HiSi SAS controller. We're trying to make sense of it now.

Thanks,
John

> 

>> Thanks,

>>

>>           M.

>>

>> diff --git a/drivers/irqchip/irq-gic-v3-its.c

>> b/drivers/irqchip/irq-gic-v3-its.c

>> index e05673bcd52b..301ee3bc0602 100644

>> --- a/drivers/irqchip/irq-gic-v3-its.c

>> +++ b/drivers/irqchip/irq-gic-v3-its.c

>> @@ -177,6 +177,8 @@ static DEFINE_IDA(its_vpeid_ida);

>
Marc Zyngier Dec. 13, 2019, 10:31 a.m. UTC | #23
Hi John,

On 2019-12-13 10:07, John Garry wrote:
> On 11/12/2019 09:41, John Garry wrote:

>> On 10/12/2019 18:32, Marc Zyngier wrote:

>>>>>> The ITS code will make the lowest online CPU in the affinity 

>>>>>> mask

>>>>>> the

>>>>>> target CPU for the interrupt, which may result in some CPUs

>>>>>> handling

>>>>>> so many interrupts.

>>>>> If what you want is for the*default*  affinity to be spread 

>>>>> around,

>>>>> that should be achieved pretty easily. Let me have a think about 

>>>>> how

>>>>> to do that.

>>>> Cool, I anticipate that it should help my case.

>>>>

>>>> I can also seek out some NVMe cards to see how it would help a 

>>>> more

>>>> "generic" scenario.

>>> Can you give the following a go? It probably has all kind of warts 

>>> on

>>> top of the quality debug information, but I managed to get my D05 

>>> and

>>> a couple of guests to boot with it. It will probably eat your data,

>>> so use caution!;-)

>>>

>> Hi Marc,

>> Ok, we'll give it a spin.

>> Thanks,

>> John

>

> Hi Marc,

>

> JFYI, we're still testing this and the patch itself seems to work as

> intended.

>

> Here's the kernel log if you just want to see how the interrupts are

> getting assigned:

> https://pastebin.com/hh3r810g


It is a bit hard to make sense of this dump, especially on such a wide
machine (I want one!) without really knowing the topology of the 
system.

> For me, I did get a performance boost for NVMe testing, but my

> colleague Xiang Chen saw a drop for our storage test of interest  -

> that's the HiSi SAS controller. We're trying to make sense of it now.


One of the differences is that with this patch, the initial affinity
is picked inside the NUMA node that matches the ITS. In your case,
that's either node 0 or 2. But it is unclear which CPUs these
map to.

Given that I see interrupts mapped to CPUs 0-23 on one side, and 48-71
on the other, it looks like half of your machine gets starved, and that
may be because no ITS targets the NUMA nodes they are part of. It would
be interesting to see what happens if you manually set the affinity
of the interrupts outside of the NUMA node.
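
Something like this would be the obvious way to try it for a
non-managed interrupt (rough sketch only; irq 147 is just an example
number, pick one of the actual vectors):

  # retarget the vector at CPUs on the other socket, then re-run the test
  echo 48-71 > /proc/irq/147/smp_affinity_list
  # and check where it actually landed
  cat /proc/irq/147/effective_affinity_list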

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 13, 2019, 11:12 a.m. UTC | #24
Hi Ming,

>> I am running some NVMe perf tests with Marc's patch.

> 

> We need to confirm that if Marc's patch works as expected, could you

> collect log via the attached script?


As immediately below, I see this on vanilla mainline, so let's see what 
the issue is without that patch.

>  >

> You never provide the test details(how many drives, how many disks

> attached to each drive) as I asked, so I can't comment on the reason,

> also no reason shows that the patch is a good fix.


So I have only 2x ES3000 V3s. This looks like the same one:
https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf

> 

> My theory is simple, so far, the CPU is still much quicker than

> current storage in case that IO aren't from multiple disks which are

> connected to same drive.


Hopefully this is all the info you need:

Last login: Fri Dec 13 10:41:55 GMT 2019 on ttyAMA0
Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 
5.5.0-rc1-00001-g3779c27ad995-dirty aarch64)

  * Documentation:  https://help.ubuntu.com
  * Management:     https://landscape.canonical.com
  * Support:        https://ubuntu.com/advantage

Failed to connect to https://changelogs.ubuntu.com/meta-release-lts. 
Check your Internet connection or proxy settings

john@ubuntu:~$ lstopo
Machine (14GB total)
   Package L#0
     NUMANode L#0 (P#0 14GB)
       L3 L#0 (32MB)
         L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + 
PU L#0 (P#0)
         L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + 
PU L#1 (P#1)
         L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + 
PU L#2 (P#2)
         L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + 
PU L#3 (P#3)
         L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + 
PU L#4 (P#4)
         L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + 
PU L#5 (P#5)
         L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + 
PU L#6 (P#6)
         L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + 
PU L#7 (P#7)
         L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + 
PU L#8 (P#8)
         L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + 
PU L#9 (P#9)
         L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 
+ PU L#10 (P#10)
         L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 
+ PU L#11 (P#11)
         L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 
+ PU L#12 (P#12)
         L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 
+ PU L#13 (P#13)
         L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 
+ PU L#14 (P#14)
         L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 
+ PU L#15 (P#15)
         L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 
+ PU L#16 (P#16)
         L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 
+ PU L#17 (P#17)
         L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 
+ PU L#18 (P#18)
         L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 
+ PU L#19 (P#19)
         L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 
+ PU L#20 (P#20)
         L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 
+ PU L#21 (P#21)
         L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
       HostBridge L#0
         PCIBridge
           2 x { PCI 8086:10fb }
         PCIBridge
           PCI 19e5:0123
         PCIBridge
           PCI 19e5:1711
       HostBridge L#4
         PCIBridge
           PCI 19e5:a250
         PCI 19e5:a230
         PCI 19e5:a235
       HostBridge L#6
         PCIBridge
           PCI 19e5:a222
             Net L#0 "eno1"
           PCI 19e5:a222
             Net L#1 "eno2"
           PCI 19e5:a222
             Net L#2 "eno3"
           PCI 19e5:a221
             Net L#3 "eno4"
     NUMANode L#1 (P#1) + L3 L#1 (32MB)
       L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + 
PU L#24 (P#24)
       L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + 
PU L#25 (P#25)
       L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + 
PU L#26 (P#26)
       L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + 
PU L#27 (P#27)
       L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + 
PU L#28 (P#28)
       L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + 
PU L#29 (P#29)
       L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + 
PU L#30 (P#30)
       L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + 
PU L#31 (P#31)
       L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + 
PU L#32 (P#32)
       L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + 
PU L#33 (P#33)
       L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + 
PU L#34 (P#34)
       L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + 
PU L#35 (P#35)
       L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + 
PU L#36 (P#36)
       L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + 
PU L#37 (P#37)
       L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + 
PU L#38 (P#38)
       L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + 
PU L#39 (P#39)
       L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + 
PU L#40 (P#40)
       L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + 
PU L#41 (P#41)
       L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + 
PU L#42 (P#42)
       L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + 
PU L#43 (P#43)
       L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + 
PU L#44 (P#44)
       L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + 
PU L#45 (P#45)
       L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + 
PU L#46 (P#46)
       L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + 
PU L#47 (P#47)
   Package L#1
     NUMANode L#2 (P#2)
       L3 L#2 (32MB)
         L2 L#48 (512KB) + L1d L#48 (64KB) + L1i L#48 (64KB) + Core L#48 
+ PU L#48 (P#48)
         L2 L#49 (512KB) + L1d L#49 (64KB) + L1i L#49 (64KB) + Core L#49 
+ PU L#49 (P#49)
         L2 L#50 (512KB) + L1d L#50 (64KB) + L1i L#50 (64KB) + Core L#50 
+ PU L#50 (P#50)
         L2 L#51 (512KB) + L1d L#51 (64KB) + L1i L#51 (64KB) + Core L#51 
+ PU L#51 (P#51)
         L2 L#52 (512KB) + L1d L#52 (64KB) + L1i L#52 (64KB) + Core L#52 
+ PU L#52 (P#52)
         L2 L#53 (512KB) + L1d L#53 (64KB) + L1i L#53 (64KB) + Core L#53 
+ PU L#53 (P#53)
         L2 L#54 (512KB) + L1d L#54 (64KB) + L1i L#54 (64KB) + Core L#54 
+ PU L#54 (P#54)
         L2 L#55 (512KB) + L1d L#55 (64KB) + L1i L#55 (64KB) + Core L#55 
+ PU L#55 (P#55)
         L2 L#56 (512KB) + L1d L#56 (64KB) + L1i L#56 (64KB) + Core L#56 
+ PU L#56 (P#56)
         L2 L#57 (512KB) + L1d L#57 (64KB) + L1i L#57 (64KB) + Core L#57 
+ PU L#57 (P#57)
         L2 L#58 (512KB) + L1d L#58 (64KB) + L1i L#58 (64KB) + Core L#58 
+ PU L#58 (P#58)
         L2 L#59 (512KB) + L1d L#59 (64KB) + L1i L#59 (64KB) + Core L#59 
+ PU L#59 (P#59)
         L2 L#60 (512KB) + L1d L#60 (64KB) + L1i L#60 (64KB) + Core L#60 
+ PU L#60 (P#60)
         L2 L#61 (512KB) + L1d L#61 (64KB) + L1i L#61 (64KB) + Core L#61 
+ PU L#61 (P#61)
         L2 L#62 (512KB) + L1d L#62 (64KB) + L1i L#62 (64KB) + Core L#62 
+ PU L#62 (P#62)
         L2 L#63 (512KB) + L1d L#63 (64KB) + L1i L#63 (64KB) + Core L#63 
+ PU L#63 (P#63)
         L2 L#64 (512KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 
+ PU L#64 (P#64)
         L2 L#65 (512KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 
+ PU L#65 (P#65)
         L2 L#66 (512KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 
+ PU L#66 (P#66)
         L2 L#67 (512KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 
+ PU L#67 (P#67)
         L2 L#68 (512KB) + L1d L#68 (64KB) + L1i L#68 (64KB) + Core L#68 
+ PU L#68 (P#68)
         L2 L#69 (512KB) + L1d L#69 (64KB) + L1i L#69 (64KB) + Core L#69 
+ PU L#69 (P#69)
         L2 L#70 (512KB) + L1d L#70 (64KB) + L1i L#70 (64KB) + Core L#70 
+ PU L#70 (P#70)
         L2 L#71 (512KB) + L1d L#71 (64KB) + L1i L#71 (64KB) + Core L#71 
+ PU L#71 (P#71)
       HostBridge L#8
         PCIBridge
           PCI 19e5:0123
       HostBridge L#10
         PCIBridge
           PCI 19e5:a250
       HostBridge L#12
         PCIBridge
           PCI 19e5:a226
             Net L#4 "eno5"
     NUMANode L#3 (P#3) + L3 L#3 (32MB)
       L2 L#72 (512KB) + L1d L#72 (64KB) + L1i L#72 (64KB) + Core L#72 + 
PU L#72 (P#72)
       L2 L#73 (512KB) + L1d L#73 (64KB) + L1i L#73 (64KB) + Core L#73 + 
PU L#73 (P#73)
       L2 L#74 (512KB) + L1d L#74 (64KB) + L1i L#74 (64KB) + Core L#74 + 
PU L#74 (P#74)
       L2 L#75 (512KB) + L1d L#75 (64KB) + L1i L#75 (64KB) + Core L#75 + 
PU L#75 (P#75)
       L2 L#76 (512KB) + L1d L#76 (64KB) + L1i L#76 (64KB) + Core L#76 + 
PU L#76 (P#76)
       L2 L#77 (512KB) + L1d L#77 (64KB) + L1i L#77 (64KB) + Core L#77 + 
PU L#77 (P#77)
       L2 L#78 (512KB) + L1d L#78 (64KB) + L1i L#78 (64KB) + Core L#78 + 
PU L#78 (P#78)
       L2 L#79 (512KB) + L1d L#79 (64KB) + L1i L#79 (64KB) + Core L#79 + 
PU L#79 (P#79)
       L2 L#80 (512KB) + L1d L#80 (64KB) + L1i L#80 (64KB) + Core L#80 + 
PU L#80 (P#80)
       L2 L#81 (512KB) + L1d L#81 (64KB) + L1i L#81 (64KB) + Core L#81 + 
PU L#81 (P#81)
       L2 L#82 (512KB) + L1d L#82 (64KB) + L1i L#82 (64KB) + Core L#82 + 
PU L#82 (P#82)
       L2 L#83 (512KB) + L1d L#83 (64KB) + L1i L#83 (64KB) + Core L#83 + 
PU L#83 (P#83)
       L2 L#84 (512KB) + L1d L#84 (64KB) + L1i L#84 (64KB) + Core L#84 + 
PU L#84 (P#84)
       L2 L#85 (512KB) + L1d L#85 (64KB) + L1i L#85 (64KB) + Core L#85 + 
PU L#85 (P#85)
       L2 L#86 (512KB) + L1d L#86 (64KB) + L1i L#86 (64KB) + Core L#86 + 
PU L#86 (P#86)
       L2 L#87 (512KB) + L1d L#87 (64KB) + L1i L#87 (64KB) + Core L#87 + 
PU L#87 (P#87)
       L2 L#88 (512KB) + L1d L#88 (64KB) + L1i L#88 (64KB) + Core L#88 + 
PU L#88 (P#88)
       L2 L#89 (512KB) + L1d L#89 (64KB) + L1i L#89 (64KB) + Core L#89 + 
PU L#89 (P#89)
       L2 L#90 (512KB) + L1d L#90 (64KB) + L1i L#90 (64KB) + Core L#90 + 
PU L#90 (P#90)
       L2 L#91 (512KB) + L1d L#91 (64KB) + L1i L#91 (64KB) + Core L#91 + 
PU L#91 (P#91)
       L2 L#92 (512KB) + L1d L#92 (64KB) + L1i L#92 (64KB) + Core L#92 + 
PU L#92 (P#92)
       L2 L#93 (512KB) + L1d L#93 (64KB) + L1i L#93 (64KB) + Core L#93 + 
PU L#93 (P#93)
       L2 L#94 (512KB) + L1d L#94 (64KB) + L1i L#94 (64KB) + Core L#94 + 
PU L#94 (P#94)
       L2 L#95 (512KB) + L1d L#95 (64KB) + L1i L#95 (64KB) + Core L#95 + 
PU L#95 (P#95)
john@ubuntu:~$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           0x48
Model:               0
Stepping:            0x0
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics 
cpuid asimdrdm dcpop
john@ubuntu:~$ dmesg | grep "Linux v"
[    0.000000] Linux version 5.5.0-rc1-00001-g3779c27ad995-dirty 
(john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 
[linaro-7.3-2018.05-rc1 revision 
38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] (Linaro GCC 7.3-2018.05-rc1)) 
#1436 SMP PREEMPT Fri Dec 13 10:51:46 GMT 2019
john@ubuntu:~$
john@ubuntu:~$ lspci
00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
00:04.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
00:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
00:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
00:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
00:12.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection (rev 01)
04:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. 
ES3000 V3 NVMe PCIe SSD (rev 45)
05:00.0 VGA compatible controller: Huawei Technologies Co., Ltd. Hi1710 
[iBMC Intelligent Management system chip w/VGA support] (rev 01)
74:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI 
Bridge (rev 20)
74:02.0 Serial Attached SCSI controller: Huawei Technologies Co., Ltd. 
HiSilicon SAS 3.0 HBA (rev 20)
74:03.0 SATA controller: Huawei Technologies Co., Ltd. HiSilicon AHCI 
HBA (rev 20)
75:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon 
ZIP Engine (rev 20)
78:00.0 Network and computing encryption device: Huawei Technologies 
Co., Ltd. HiSilicon HPRE Engine (rev 20)
7a:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 
2-port Host Controller (rev 20)
7a:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 
2-port Host Controller (rev 20)
7a:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 
Host Controller (rev 20)
7c:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI 
Bridge (rev 20)
7d:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS 
GE/10GE/25GE RDMA Network Controller (rev 20)
7d:00.1 Ethernet controller: Huawei Technologies Co., Ltd. HNS 
GE/10GE/25GE RDMA Network Controller (rev 20)
7d:00.2 Ethernet controller: Huawei Technologies Co., Ltd. HNS 
GE/10GE/25GE RDMA Network Controller (rev 20)
7d:00.3 Ethernet controller: Huawei Technologies Co., Ltd. HNS 
GE/10GE/25GE Network Controller (rev 20)
80:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
80:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
80:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
80:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root 
Port with Gen4 (rev 45)
81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. 
ES3000 V3 NVMe PCIe SSD (rev 45)
b4:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI 
Bridge (rev 20)
b5:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon 
ZIP Engine (rev 20)
b8:00.0 Network and computing encryption device: Huawei Technologies 
Co., Ltd. HiSilicon HPRE Engine (rev 20)
ba:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 
2-port Host Controller (rev 20)
ba:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 
2-port Host Controller (rev 20)
ba:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 
Host Controller (rev 20)
bc:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI 
Bridge (rev 20)
bd:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS 
GE/10GE/25GE/50GE/100GE RDMA Network Controller (rev 20)
john@ubuntu:~$ sudo /bin/bash create_fio_task_cpu_liuyifan_nvme.sh 4k 
read 20 1
Creat 4k_read_depth20_fiotest file sucessfully
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
fio-3.1
Starting 40 processes
[  175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU 
IOPS][eta 00m:18s]
[  175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004 
softirq=1589/1589 fqs=2322
Jobs: 40 (f=40): [R(40)][100.0%][r=4270MiB/s,w=0KiB/s][r=1093k,w=0 
IOPS][eta 00m:00s]
job1: (groupid=0, jobs=40): err= 0: pid=1227: Fri Dec 13 10:57:49 2019
    read: IOPS=952k, BW=3719MiB/s (3900MB/s)(145GiB/40003msec)
     slat (usec): min=2, max=20126k, avg=10.66, stdev=9637.70
     clat (usec): min=13, max=20156k, avg=517.95, stdev=31017.58
      lat (usec): min=21, max=20156k, avg=528.77, stdev=32487.76
     clat percentiles (usec):
      |  1.00th=[  103],  5.00th=[  113], 10.00th=[  147], 20.00th=[  200],
      | 30.00th=[  260], 40.00th=[  318], 50.00th=[  375], 60.00th=[  429],
      | 70.00th=[  486], 80.00th=[  578], 90.00th=[  799], 95.00th=[  996],
      | 99.00th=[ 1958], 99.50th=[ 2114], 99.90th=[ 2311], 99.95th=[ 2474],
      | 99.99th=[ 7767]
    bw (  KiB/s): min=  112, max=745026, per=4.60%, avg=175285.03, 
stdev=117592.37, samples=1740
    iops        : min=   28, max=186256, avg=43821.06, stdev=29398.12, 
samples=1740
   lat (usec)   : 20=0.01%, 50=0.01%, 100=0.14%, 250=28.38%, 500=43.76%
   lat (usec)   : 750=16.17%, 1000=6.65%
   lat (msec)   : 2=4.02%, 4=0.86%, 10=0.01%, 20=0.01%, 50=0.01%
   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 2000=0.01%
   lat (msec)   : >=2000=0.01%
   cpu          : usr=3.67%, sys=15.82%, ctx=20799355, majf=0, minf=4275
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
 >=64=0.0%

      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%

      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%

      issued rwt: total=38086812,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=20

Run status group 0 (all jobs):
    READ: bw=3719MiB/s (3900MB/s), 3719MiB/s-3719MiB/s 
(3900MB/s-3900MB/s), io=145GiB (156GB), run=40003-40003msec

Disk stats (read0/0, ticks=5002739/0, in_queue=540, util=99.83%
john@ubuntu:~$ dmesg | tail -n 100
[   20.380611] Key type dns_resolver registered
[   20.385000] registered taskstats version 1
[   20.389092] Loading compiled-in X.509 certificates
[   20.394494] pcieport 0000:00:00.0: Adding to iommu group 9
[   20.401556] pcieport 0000:00:04.0: Adding to iommu group 10
[   20.408695] pcieport 0000:00:08.0: Adding to iommu group 11
[   20.415767] pcieport 0000:00:0c.0: Adding to iommu group 12
[   20.422842] pcieport 0000:00:10.0: Adding to iommu group 13
[   20.429932] pcieport 0000:00:12.0: Adding to iommu group 14
[   20.437077] pcieport 0000:7c:00.0: Adding to iommu group 15
[   20.443397] pcieport 0000:74:00.0: Adding to iommu group 16
[   20.449790] pcieport 0000:80:00.0: Adding to iommu group 17
[   20.453983] usb 1-2: new high-speed USB device number 3 using ehci-pci
[   20.457553] pcieport 0000:80:08.0: Adding to iommu group 18
[   20.469455] pcieport 0000:80:0c.0: Adding to iommu group 19
[   20.477037] pcieport 0000:80:10.0: Adding to iommu group 20
[   20.484712] pcieport 0000:bc:00.0: Adding to iommu group 21
[   20.491155] pcieport 0000:b4:00.0: Adding to iommu group 22
[   20.517723] rtc-efi rtc-efi: setting system clock to 
2019-12-13T10:54:56 UTC (1576234496)
[   20.525913] ALSA device list:
[   20.528878]   No soundcards found.
[   20.618601] hub 1-2:1.0: USB hub found
[   20.622440] hub 1-2:1.0: 4 ports detected
[   20.744970] EXT4-fs (sdd1): recovery complete
[   20.759425] EXT4-fs (sdd1): mounted filesystem with ordered data 
mode. Opts: (null)
[   20.767090] VFS: Mounted root (ext4 filesystem) on device 8:49.
[   20.788837] devtmpfs: mounted
[   20.793124] Freeing unused kernel memory: 5184K
[   20.797817] Run /sbin/init as init process
[   20.913986] usb 1-2.1: new full-speed USB device number 4 using ehci-pci
[   21.379891] systemd[1]: systemd 237 running in system mode. (+PAM 
+AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP 
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN 
-PCRE2 default-hierarchy=hybrid)
[   21.401921] systemd[1]: Detected architecture arm64.
[   21.459107] systemd[1]: Set hostname to <ubuntu>.
[   21.474734] systemd[1]: Couldn't move remaining userspace processes, 
ignoring: Input/output error
[   21.947303] systemd[1]: File 
/lib/systemd/system/systemd-journald.service:36 configures an IP 
firewall (IPAddressDeny=any), but the local system does not support 
BPF/cgroup based firewalling.
[   21.964340] systemd[1]: Proceeding WITHOUT firewalling in effect! 
(This warning is only shown for the first loaded unit using IP firewalling.)
[   22.268240] random: systemd: uninitialized urandom read (16 bytes read)
[   22.274946] systemd[1]: Started Forward Password Requests to Wall 
Directory Watch.
[   22.298022] random: systemd: uninitialized urandom read (16 bytes read)
[   22.304894] systemd[1]: Created slice User and Session Slice.
[   22.322032] random: systemd: uninitialized urandom read (16 bytes read)
[   22.328850] systemd[1]: Created slice System Slice.
[   22.346109] systemd[1]: Listening on Syslog Socket.
[   22.552644] random: crng init done
[   22.558740] random: 7 urandom warning(s) missed due to ratelimiting
[   23.370478] EXT4-fs (sdd1): re-mounted. Opts: errors=remount-ro
[   23.547390] systemd-journald[806]: Received request to flush runtime 
journal from PID 1
[   23.633956] systemd-journald[806]: File 
/var/log/journal/f0ef8dc5ede84b5eb7431c01908d3558/system.journal 
corrupted or uncleanly shut down, renaming and replacing.
[   23.814035] Adding 2097148k swap on /swapfile.  Priority:-2 extents:6 
across:2260988k
[   25.012707] hns3 0000:7d:00.2 eno3: renamed from eth2
[   25.054228] hns3 0000:7d:00.3 eno4: renamed from eth3
[   25.086971] hns3 0000:7d:00.1 eno2: renamed from eth1
[   25.118154] hns3 0000:7d:00.0 eno1: renamed from eth0
[   25.154467] hns3 0000:bd:00.0 eno5: renamed from eth4
[   26.130742] input: Keyboard/Mouse KVM 1.1.0 as 
/devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.0/0003:12D1:0003.0001/input/input1
[   26.190189] hid-generic 0003:12D1:0003.0001: input: USB HID v1.10 
Keyboard [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input0
[   26.191049] input: Keyboard/Mouse KVM 1.1.0 as 
/devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.1/0003:12D1:0003.0002/input/input2
[   26.191090] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 
Mouse [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1
[  175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU
[  175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004 
softirq=1589/1589 fqs=2322
[  175.657102] (t=5253 jiffies g=2893 q=3123)
[  175.657105] Task dump for CPU 0:
[  175.657108] fio             R  running task        0  1254   1224 
0x00000002
[  175.657112] Call trace:
[  175.657122]  dump_backtrace+0x0/0x1a0
[  175.657126]  show_stack+0x14/0x20
[  175.657130]  sched_show_task+0x164/0x1a0
[  175.657133]  dump_cpu_task+0x40/0x2e8
[  175.657137]  rcu_dump_cpu_stacks+0xa0/0xe0
[  175.657139]  rcu_sched_clock_irq+0x6d8/0xaa8
[  175.657143]  update_process_times+0x2c/0x50
[  175.657147]  tick_sched_handle.isra.14+0x30/0x50
[  175.657149]  tick_sched_timer+0x48/0x98
[  175.657152]  __hrtimer_run_queues+0x120/0x1b8
[  175.657154]  hrtimer_interrupt+0xd4/0x250
[  175.657159]  arch_timer_handler_phys+0x28/0x40
[  175.657162]  handle_percpu_devid_irq+0x80/0x140
[  175.657165]  generic_handle_irq+0x24/0x38
[  175.657167]  __handle_domain_irq+0x5c/0xb0
[  175.657170]  gic_handle_irq+0x5c/0x148
[  175.657172]  el1_irq+0xb8/0x180
[  175.657175]  efi_header_end+0x94/0x234
[  175.657178]  irq_exit+0xd0/0xd8
[  175.657180]  __handle_domain_irq+0x60/0xb0
[  175.657182]  gic_handle_irq+0x5c/0x148
[  175.657184]  el1_irq+0xb8/0x180
[  175.657194]  nvme_open+0x80/0xc8
[  175.657199]  __blkdev_get+0x3f8/0x4f0
[  175.657201]  blkdev_get+0x110/0x180
[  175.657204]  blkdev_open+0x8c/0xa0
[  175.657207]  do_dentry_open+0x1c4/0x3d8
[  175.657210]  vfs_open+0x28/0x30
[  175.657212]  path_openat+0x2a8/0x12a0
[  175.657214]  do_filp_open+0x78/0xf8
[  175.657217]  do_sys_open+0x19c/0x258
[  175.657219]  __arm64_sys_openat+0x20/0x28
[  175.657222]  el0_svc_common.constprop.2+0x64/0x160
[  175.657225]  el0_svc_handler+0x20/0x80
[  175.657227]  el0_sync_handler+0xe4/0x188
[  175.657229]  el0_sync+0x140/0x180

john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00001-g3779c27ad995-dirty #1436 SMP PREEMPT Fri 
Dec 13 10:51:46 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 75
irq 60, cpu list 24-28, effective list 24
irq 61, cpu list 29-33, effective list 29
irq 62, cpu list 34-38, effective list 34
irq 63, cpu list 39-43, effective list 39
irq 64, cpu list 44-47, effective list 44
irq 65, cpu list 48-51, effective list 48
irq 66, cpu list 52-55, effective list 52
irq 67, cpu list 56-59, effective list 56
irq 68, cpu list 60-63, effective list 60
irq 69, cpu list 64-67, effective list 64
irq 70, cpu list 68-71, effective list 68
irq 71, cpu list 72-75, effective list 72
irq 72, cpu list 76-79, effective list 76
irq 73, cpu list 80-83, effective list 80
irq 74, cpu list 84-87, effective list 84
irq 75, cpu list 88-91, effective list 88
irq 76, cpu list 92-95, effective list 92
irq 77, cpu list 0-3, effective list 0
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 12
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 20
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 4
irq 102, cpu list 8-11, effective list 8
irq 103, cpu list 12-15, effective list 12
irq 104, cpu list 16-19, effective list 16
irq 105, cpu list 20-23, effective list 20
irq 57, cpu list 63, effective list 63
irq 83, cpu list 24-28, effective list 24
irq 84, cpu list 29-33, effective list 29
irq 85, cpu list 34-38, effective list 34
irq 86, cpu list 39-43, effective list 39
irq 87, cpu list 44-47, effective list 44
irq 88, cpu list 48-51, effective list 48
irq 89, cpu list 52-55, effective list 52
irq 90, cpu list 56-59, effective list 56
irq 91, cpu list 60-63, effective list 60
irq 92, cpu list 64-67, effective list 64
irq 93, cpu list 68-71, effective list 68
irq 94, cpu list 72-75, effective list 72
irq 95, cpu list 76-79, effective list 76
irq 96, cpu list 80-83, effective list 80
irq 97, cpu list 84-87, effective list 84
irq 98, cpu list 88-91, effective list 88
irq 99, cpu list 92-95, effective list 92

john@ubuntu:~$ more create_fio_task_cpu_liuyifan_nvme.sh
#!/bin/bash
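# usage: create_fio_task_cpu_liuyifan_nvme.sh <bs> <rw> <iodepth> <disks per job>, e.g. 4k read 20 1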
#
#
#echo "$1_$2_$3_test" > $filename
echo "
[global]
rw=$2
direct=1
ioengine=libaio
iodepth=$3
numjobs=20
bs=$1
;size=10240000m
;zero_buffers=1
group_reporting=1
group_reporting=1
;ioscheduler=noop
;cpumask=0xfe
;cpus_allowed=0-3
;gtod_reduce=1
;iodepth_batch=2
;iodepth_batch_complete=2
runtime=40
;thread
loops = 10000
" > $1_$2_depth$3_fiotest

declare -i new_count=1
declare -i diskcount=0
#fdisk -l |grep "Disk /dev/sd" > fdiskinfo
#cat fdiskinfo |awk '{print $2}' |awk -F ":" '{print $1}'  > devinfo
ls /dev/nvme0n1 /dev/nvme1n1 > devinfo
new_num=`sed -n '$=' devinfo`
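# group the devices in devinfo into sets of $4 disks and emit one [job1] section per set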

while [ $new_count -le $new_num ]
do
if [ "$diskcount" != "$4" ]; then
new_disk=`sed -n "$new_count"p devinfo`
disk_list=${new_disk}":"${disk_list}
((new_count++))
((diskcount++))
if [ $new_count -gt $new_num ]; then
echo "[job1]" >> $1_$2_depth$3_fiotest
echo "filename=$disk_list" >> $1_$2_depth$3_fiotest
fi
continue
fi
# if [ "$new_disk" = "/dev/sda" ]; then
# continue
# fi
     echo "[job1]" >> $1_$2_depth$3_fiotest
echo "filename=$disk_list" >> $1_$2_depth$3_fiotest
diskcount=0
disk_list=""
done

echo "Creat $1_$2_depth$3_fiotest file sucessfully"

fio $1_$2_depth$3_fiotest
john@ubuntu:~$

Thanks,
John
John Garry Dec. 13, 2019, 12:08 p.m. UTC | #25
Hi Marc,

>> JFYI, we're still testing this and the patch itself seems to work as

>> intended.

>>

>> Here's the kernel log if you just want to see how the interrupts are

>> getting assigned:

>> https://pastebin.com/hh3r810g

> 

> It is a bit hard to make sense of this dump, specially on such a wide

> machine (I want one!) 


So do I :) That's the newer "D06CS" board.

> without really knowing the topology of the system.

So it's 2x socket, each socket has 2x CPU dies, and each die has 6 
clusters of 4 CPUs, which gives 96 in total.

> 

>> For me, I did get a performance boost for NVMe testing, but my

>> colleague Xiang Chen saw a drop for our storage test of interest  -

>> that's the HiSi SAS controller. We're trying to make sense of it now.

> 

> One of the difference is that with this patch, the initial affinity

> is picked inside the NUMA node that matches the ITS. 


Is that even for managed interrupts? We're testing the storage 
controller which uses managed interrupts. I should have made that clearer.

> In your case,
> that's either node 0 or 2. But it is unclear whether which CPUs these

> map to.

> 

> Given that I see interrupts mapped to CPUs 0-23 on one side, and 48-71

> on the other, it looks like half of your machine gets starved, 


Seems that way.

So this is a mystery to me:

[   23.584192] picked CPU62 IRQ147

147:   (all 96 per-CPU counts are 0)   ITS-MSI 94404626 Edge      hisi_sas_v3_hw cq


and

[   25.896728] picked CPU62 IRQ183

183:   (all 96 per-CPU counts are 0)   ITS-MSI 94437398 Edge      hisi_sas_v3_hw cq


But mpstat reports for CPU62:

12:44:58 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal 
  %guest  %gnice   %idle
12:45:00 AM   62    6.54    0.00   42.99    0.00    6.54   12.15    0.00 
    0.00    6.54   25.23

I don't know what interrupts they are...
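
(For what it's worth, a rough way to see which rows in /proc/interrupts
are actually counting on CPU62 is something like the below; sketch
only, with the column index taken from the header line:)

awk 'NR==1 { for (i = 1; i <= NF; i++) if ($i == "CPU62") col = i + 1 }
     NR>1 && $col+0 > 0 { print $1, $col }' /proc/interrupts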

It's the "hisi_sas_v3_hw cq" interrupts which we're interested in.

> and that
> may be because no ITS targets the NUMA nodes they are part of.


So both storage controllers (which we're interested in for this test) 
are on socket #0, node #0.

> It would
> be interesting to see what happens if you manually set the affinity

> of the interrupts outside of the NUMA node.

> 


Again, managed, so I don't think it's possible.
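
E.g. a quick sketch of what I'd expect, assuming irq 147 is one of the
managed hisi_sas cq vectors:

echo 48-71 > /proc/irq/147/smp_affinity_list
# expected to fail with "Input/output error", since userspace is not
# allowed to change the affinity of a managed interrupt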

Thanks,
John
Ming Lei Dec. 13, 2019, 1:18 p.m. UTC | #26
Hi John,

On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:
> Hi Ming,

> 

> > > I am running some NVMe perf tests with Marc's patch.

> > 

> > We need to confirm that if Marc's patch works as expected, could you

> > collect log via the attached script?

> 

> As immediately below, I see this on vanilla mainline, so let's see what the

> issue is without that patch.


IMO, the interrupt load needs to be distributed the way the x86 IRQ
matrix does it. If the ARM64 server doesn't do that, the first step
should be to align with that.

Also do you pass 'use_threaded_interrupts=1' in your test?
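
One quick way to check on the running system: if threaded handlers are
in use, each vector gets its own irq kernel thread, so e.g.

ps -e -o comm= | grep '^irq/' | grep nvme

should list irq/<N>-nvme0q<M> entries.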

> 

> >  >

> > You never provide the test details(how many drives, how many disks

> > attached to each drive) as I asked, so I can't comment on the reason,

> > also no reason shows that the patch is a good fix.

> 

> So I have only 2x ES3000 V3s. This looks like the same one:

> https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf

> 

> > 

> > My theory is simple, so far, the CPU is still much quicker than

> > current storage in case that IO aren't from multiple disks which are

> > connected to same drive.

> 

> Hopefully this is all the info you need:

> 

> Last login: Fri Dec 13 10:41:55 GMT 2019 on ttyAMA0

> Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 5.5.0-rc1-00001-g3779c27ad995-dirty

> aarch64)

> 

>  * Documentation:  https://help.ubuntu.com

>  * Management:     https://landscape.canonical.com

>  * Support:        https://ubuntu.com/advantage

> 

> Failed to connect to https://changelogs.ubuntu.com/meta-release-lts. Check

> your Internet connection or proxy settings

> 

> john@ubuntu:~$ lstopo

> Machine (14GB total)

>   Package L#0

>     NUMANode L#0 (P#0 14GB)

>       L3 L#0 (32MB)

>         L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0

> (P#0)

>         L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1

> (P#1)

>         L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2

> (P#2)

>         L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3

> (P#3)

>         L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4

> (P#4)

>         L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5

> (P#5)

>         L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6

> (P#6)

>         L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7

> (P#7)

>         L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8

> (P#8)

>         L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9

> (P#9)

>         L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU

> L#10 (P#10)

>         L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU

> L#11 (P#11)

>         L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU

> L#12 (P#12)

>         L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU

> L#13 (P#13)

>         L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU

> L#14 (P#14)

>         L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU

> L#15 (P#15)

>         L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU

> L#16 (P#16)

>         L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU

> L#17 (P#17)

>         L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU

> L#18 (P#18)

>         L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU

> L#19 (P#19)

>         L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU

> L#20 (P#20)

>         L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU

> L#21 (P#21)

>         L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)

>       HostBridge L#0

>         PCIBridge

>           2 x { PCI 8086:10fb }

>         PCIBridge

>           PCI 19e5:0123

>         PCIBridge

>           PCI 19e5:1711

>       HostBridge L#4

>         PCIBridge

>           PCI 19e5:a250

>         PCI 19e5:a230

>         PCI 19e5:a235

>       HostBridge L#6

>         PCIBridge

>           PCI 19e5:a222

>             Net L#0 "eno1"

>           PCI 19e5:a222

>             Net L#1 "eno2"

>           PCI 19e5:a222

>             Net L#2 "eno3"

>           PCI 19e5:a221

>             Net L#3 "eno4"

>     NUMANode L#1 (P#1) + L3 L#1 (32MB)

>       L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU

> L#24 (P#24)

>       L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU

> L#25 (P#25)

>       L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU

> L#26 (P#26)

>       L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU

> L#27 (P#27)

>       L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU

> L#28 (P#28)

>       L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU

> L#29 (P#29)

>       L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU

> L#30 (P#30)

>       L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU

> L#31 (P#31)

>       L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU

> L#32 (P#32)

>       L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU

> L#33 (P#33)

>       L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU

> L#34 (P#34)

>       L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU

> L#35 (P#35)

>       L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU

> L#36 (P#36)

>       L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU

> L#37 (P#37)

>       L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU

> L#38 (P#38)

>       L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU

> L#39 (P#39)

>       L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU

> L#40 (P#40)

>       L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU

> L#41 (P#41)

>       L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU

> L#42 (P#42)

>       L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU

> L#43 (P#43)

>       L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU

> L#44 (P#44)

>       L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU

> L#45 (P#45)

>       L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU

> L#46 (P#46)

>       L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU

> L#47 (P#47)

>   Package L#1

>     NUMANode L#2 (P#2)

>       L3 L#2 (32MB)

>         L2 L#48 (512KB) + L1d L#48 (64KB) + L1i L#48 (64KB) + Core L#48 + PU

> L#48 (P#48)

>         L2 L#49 (512KB) + L1d L#49 (64KB) + L1i L#49 (64KB) + Core L#49 + PU

> L#49 (P#49)

>         L2 L#50 (512KB) + L1d L#50 (64KB) + L1i L#50 (64KB) + Core L#50 + PU

> L#50 (P#50)

>         L2 L#51 (512KB) + L1d L#51 (64KB) + L1i L#51 (64KB) + Core L#51 + PU

> L#51 (P#51)

>         L2 L#52 (512KB) + L1d L#52 (64KB) + L1i L#52 (64KB) + Core L#52 + PU

> L#52 (P#52)

>         L2 L#53 (512KB) + L1d L#53 (64KB) + L1i L#53 (64KB) + Core L#53 + PU

> L#53 (P#53)

>         L2 L#54 (512KB) + L1d L#54 (64KB) + L1i L#54 (64KB) + Core L#54 + PU

> L#54 (P#54)

>         L2 L#55 (512KB) + L1d L#55 (64KB) + L1i L#55 (64KB) + Core L#55 + PU

> L#55 (P#55)

>         L2 L#56 (512KB) + L1d L#56 (64KB) + L1i L#56 (64KB) + Core L#56 + PU

> L#56 (P#56)

>         L2 L#57 (512KB) + L1d L#57 (64KB) + L1i L#57 (64KB) + Core L#57 + PU

> L#57 (P#57)

>         L2 L#58 (512KB) + L1d L#58 (64KB) + L1i L#58 (64KB) + Core L#58 + PU

> L#58 (P#58)

>         L2 L#59 (512KB) + L1d L#59 (64KB) + L1i L#59 (64KB) + Core L#59 + PU

> L#59 (P#59)

>         L2 L#60 (512KB) + L1d L#60 (64KB) + L1i L#60 (64KB) + Core L#60 + PU

> L#60 (P#60)

>         L2 L#61 (512KB) + L1d L#61 (64KB) + L1i L#61 (64KB) + Core L#61 + PU

> L#61 (P#61)

>         L2 L#62 (512KB) + L1d L#62 (64KB) + L1i L#62 (64KB) + Core L#62 + PU

> L#62 (P#62)

>         L2 L#63 (512KB) + L1d L#63 (64KB) + L1i L#63 (64KB) + Core L#63 + PU

> L#63 (P#63)

>         L2 L#64 (512KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 + PU

> L#64 (P#64)

>         L2 L#65 (512KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 + PU

> L#65 (P#65)

>         L2 L#66 (512KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 + PU

> L#66 (P#66)

>         L2 L#67 (512KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 + PU

> L#67 (P#67)

>         L2 L#68 (512KB) + L1d L#68 (64KB) + L1i L#68 (64KB) + Core L#68 + PU

> L#68 (P#68)

>         L2 L#69 (512KB) + L1d L#69 (64KB) + L1i L#69 (64KB) + Core L#69 + PU

> L#69 (P#69)

>         L2 L#70 (512KB) + L1d L#70 (64KB) + L1i L#70 (64KB) + Core L#70 + PU

> L#70 (P#70)

>         L2 L#71 (512KB) + L1d L#71 (64KB) + L1i L#71 (64KB) + Core L#71 + PU

> L#71 (P#71)

>       HostBridge L#8

>         PCIBridge

>           PCI 19e5:0123

>       HostBridge L#10

>         PCIBridge

>           PCI 19e5:a250

>       HostBridge L#12

>         PCIBridge

>           PCI 19e5:a226

>             Net L#4 "eno5"

>     NUMANode L#3 (P#3) + L3 L#3 (32MB)

>       L2 L#72 (512KB) + L1d L#72 (64KB) + L1i L#72 (64KB) + Core L#72 + PU

> L#72 (P#72)

>       L2 L#73 (512KB) + L1d L#73 (64KB) + L1i L#73 (64KB) + Core L#73 + PU

> L#73 (P#73)

>       L2 L#74 (512KB) + L1d L#74 (64KB) + L1i L#74 (64KB) + Core L#74 + PU

> L#74 (P#74)

>       L2 L#75 (512KB) + L1d L#75 (64KB) + L1i L#75 (64KB) + Core L#75 + PU

> L#75 (P#75)

>       L2 L#76 (512KB) + L1d L#76 (64KB) + L1i L#76 (64KB) + Core L#76 + PU

> L#76 (P#76)

>       L2 L#77 (512KB) + L1d L#77 (64KB) + L1i L#77 (64KB) + Core L#77 + PU

> L#77 (P#77)

>       L2 L#78 (512KB) + L1d L#78 (64KB) + L1i L#78 (64KB) + Core L#78 + PU

> L#78 (P#78)

>       L2 L#79 (512KB) + L1d L#79 (64KB) + L1i L#79 (64KB) + Core L#79 + PU

> L#79 (P#79)

>       L2 L#80 (512KB) + L1d L#80 (64KB) + L1i L#80 (64KB) + Core L#80 + PU

> L#80 (P#80)

>       L2 L#81 (512KB) + L1d L#81 (64KB) + L1i L#81 (64KB) + Core L#81 + PU

> L#81 (P#81)

>       L2 L#82 (512KB) + L1d L#82 (64KB) + L1i L#82 (64KB) + Core L#82 + PU

> L#82 (P#82)

>       L2 L#83 (512KB) + L1d L#83 (64KB) + L1i L#83 (64KB) + Core L#83 + PU

> L#83 (P#83)

>       L2 L#84 (512KB) + L1d L#84 (64KB) + L1i L#84 (64KB) + Core L#84 + PU

> L#84 (P#84)

>       L2 L#85 (512KB) + L1d L#85 (64KB) + L1i L#85 (64KB) + Core L#85 + PU

> L#85 (P#85)

>       L2 L#86 (512KB) + L1d L#86 (64KB) + L1i L#86 (64KB) + Core L#86 + PU

> L#86 (P#86)

>       L2 L#87 (512KB) + L1d L#87 (64KB) + L1i L#87 (64KB) + Core L#87 + PU

> L#87 (P#87)

>       L2 L#88 (512KB) + L1d L#88 (64KB) + L1i L#88 (64KB) + Core L#88 + PU

> L#88 (P#88)

>       L2 L#89 (512KB) + L1d L#89 (64KB) + L1i L#89 (64KB) + Core L#89 + PU

> L#89 (P#89)

>       L2 L#90 (512KB) + L1d L#90 (64KB) + L1i L#90 (64KB) + Core L#90 + PU

> L#90 (P#90)

>       L2 L#91 (512KB) + L1d L#91 (64KB) + L1i L#91 (64KB) + Core L#91 + PU

> L#91 (P#91)

>       L2 L#92 (512KB) + L1d L#92 (64KB) + L1i L#92 (64KB) + Core L#92 + PU

> L#92 (P#92)

>       L2 L#93 (512KB) + L1d L#93 (64KB) + L1i L#93 (64KB) + Core L#93 + PU

> L#93 (P#93)

>       L2 L#94 (512KB) + L1d L#94 (64KB) + L1i L#94 (64KB) + Core L#94 + PU

> L#94 (P#94)

>       L2 L#95 (512KB) + L1d L#95 (64KB) + L1i L#95 (64KB) + Core L#95 + PU

> L#95 (P#95)

> john@ubuntu:~$ lscpu

> Architecture:        aarch64

> Byte Order:          Little Endian

> CPU(s):              96

> On-line CPU(s) list: 0-95

> Thread(s) per core:  1

> Core(s) per socket:  48

> Socket(s):           2

> NUMA node(s):        4

> Vendor ID:           0x48

> Model:               0

> Stepping:            0x0

> BogoMIPS:            200.00

> L1d cache:           64K

> L1i cache:           64K

> L2 cache:            512K

> L3 cache:            32768K

> NUMA node0 CPU(s):   0-23

> NUMA node1 CPU(s):   24-47

> NUMA node2 CPU(s):   48-71

> NUMA node3 CPU(s):   72-95

> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics

> cpuid asimdrdm dcpop

> john@ubuntu:~$ dmesg | grep "Linux v"

> [    0.000000] Linux version 5.5.0-rc1-00001-g3779c27ad995-dirty

> (john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425

> [linaro-7.3-2018.05-rc1 revision 38aec9a676236eaa42ca03ccb3a6c1dd0182c29f]

> (Linaro GCC 7.3-2018.05-rc1)) #1436 SMP PREEMPT Fri Dec 13 10:51:46 GMT 2019

> john@ubuntu:~$

> john@ubuntu:~$ lspci

> 00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 00:04.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 00:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 00:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 00:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 00:12.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+

> Network Connection (rev 01)

> 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+

> Network Connection (rev 01)

> 04:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. ES3000

> V3 NVMe PCIe SSD (rev 45)

> 05:00.0 VGA compatible controller: Huawei Technologies Co., Ltd. Hi1710

> [iBMC Intelligent Management system chip w/VGA support] (rev 01)

> 74:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge

> (rev 20)

> 74:02.0 Serial Attached SCSI controller: Huawei Technologies Co., Ltd.

> HiSilicon SAS 3.0 HBA (rev 20)

> 74:03.0 SATA controller: Huawei Technologies Co., Ltd. HiSilicon AHCI HBA

> (rev 20)

> 75:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon ZIP

> Engine (rev 20)

> 78:00.0 Network and computing encryption device: Huawei Technologies Co.,

> Ltd. HiSilicon HPRE Engine (rev 20)

> 7a:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0

> 2-port Host Controller (rev 20)

> 7a:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0

> 2-port Host Controller (rev 20)

> 7a:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 Host

> Controller (rev 20)

> 7c:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge

> (rev 20)

> 7d:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE

> RDMA Network Controller (rev 20)

> 7d:00.1 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE

> RDMA Network Controller (rev 20)

> 7d:00.2 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE

> RDMA Network Controller (rev 20)

> 7d:00.3 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE

> Network Controller (rev 20)

> 80:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 80:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 80:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 80:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port

> with Gen4 (rev 45)

> 81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. ES3000

> V3 NVMe PCIe SSD (rev 45)

> b4:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge

> (rev 20)

> b5:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon ZIP

> Engine (rev 20)

> b8:00.0 Network and computing encryption device: Huawei Technologies Co.,

> Ltd. HiSilicon HPRE Engine (rev 20)

> ba:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0

> 2-port Host Controller (rev 20)

> ba:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0

> 2-port Host Controller (rev 20)

> ba:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 Host

> Controller (rev 20)

> bc:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge

> (rev 20)

> bd:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS

> GE/10GE/25GE/50GE/100GE RDMA Network Controller (rev 20)

> john@ubuntu:~$ sudo /bin/bash create_fio_task_cpu_liuyifan_nvme.sh 4k read

> 20 1

> Creat 4k_read_depth20_fiotest file sucessfully

> job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B,

> ioengine=libaio, iodepth=20

> ...

> job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B,

> ioengine=libaio, iodepth=20

> ...

> fio-3.1

> Starting 40 processes

> [  175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU IOPS][eta

> 00m:18s]

> [  175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004

> softirq=1589/1589 fqs=2322

> Jobs: 40 (f=40): [R(40)][100.0%][r=4270MiB/s,w=0KiB/s][r=1093k,w=0 IOPS][eta

> 00m:00s]

> job1: (groupid=0, jobs=40): err= 0: pid=1227: Fri Dec 13 10:57:49 2019

>    read: IOPS=952k, BW=3719MiB/s (3900MB/s)(145GiB/40003msec)

>     slat (usec): min=2, max=20126k, avg=10.66, stdev=9637.70

>     clat (usec): min=13, max=20156k, avg=517.95, stdev=31017.58

>      lat (usec): min=21, max=20156k, avg=528.77, stdev=32487.76

>     clat percentiles (usec):

>      |  1.00th=[  103],  5.00th=[  113], 10.00th=[  147], 20.00th=[  200],

>      | 30.00th=[  260], 40.00th=[  318], 50.00th=[  375], 60.00th=[  429],

>      | 70.00th=[  486], 80.00th=[  578], 90.00th=[  799], 95.00th=[  996],

>      | 99.00th=[ 1958], 99.50th=[ 2114], 99.90th=[ 2311], 99.95th=[ 2474],

>      | 99.99th=[ 7767]

>    bw (  KiB/s): min=  112, max=745026, per=4.60%, avg=175285.03,

> stdev=117592.37, samples=1740

>    iops        : min=   28, max=186256, avg=43821.06, stdev=29398.12,

> samples=1740

>   lat (usec)   : 20=0.01%, 50=0.01%, 100=0.14%, 250=28.38%, 500=43.76%

>   lat (usec)   : 750=16.17%, 1000=6.65%

>   lat (msec)   : 2=4.02%, 4=0.86%, 10=0.01%, 20=0.01%, 50=0.01%

>   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 2000=0.01%

>   lat (msec)   : >=2000=0.01%

>   cpu          : usr=3.67%, sys=15.82%, ctx=20799355, majf=0, minf=4275

>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,

> >=64=0.0%

>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,

> >=64=0.0%

>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,

> >=64=0.0%

>      issued rwt: total=38086812,0,0, short=0,0,0, dropped=0,0,0

>      latency   : target=0, window=0, percentile=100.00%, depth=20

> 

> Run status group 0 (all jobs):

>    READ: bw=3719MiB/s (3900MB/s), 3719MiB/s-3719MiB/s (3900MB/s-3900MB/s),

> io=145GiB (156GB), run=40003-40003msec

> 

> Disk stats (read0/0, ticks=5002739/0, in_queue=540, util=99.83%

> john@ubuntu:~$ dmesg | tail -n 100

> [   20.380611] Key type dns_resolver registered

> [   20.385000] registered taskstats version 1

> [   20.389092] Loading compiled-in X.509 certificates

> [   20.394494] pcieport 0000:00:00.0: Adding to iommu group 9

> [   20.401556] pcieport 0000:00:04.0: Adding to iommu group 10

> [   20.408695] pcieport 0000:00:08.0: Adding to iommu group 11

> [   20.415767] pcieport 0000:00:0c.0: Adding to iommu group 12

> [   20.422842] pcieport 0000:00:10.0: Adding to iommu group 13

> [   20.429932] pcieport 0000:00:12.0: Adding to iommu group 14

> [   20.437077] pcieport 0000:7c:00.0: Adding to iommu group 15

> [   20.443397] pcieport 0000:74:00.0: Adding to iommu group 16

> [   20.449790] pcieport 0000:80:00.0: Adding to iommu group 17

> [   20.453983] usb 1-2: new high-speed USB device number 3 using ehci-pci

> [   20.457553] pcieport 0000:80:08.0: Adding to iommu group 18

> [   20.469455] pcieport 0000:80:0c.0: Adding to iommu group 19

> [   20.477037] pcieport 0000:80:10.0: Adding to iommu group 20

> [   20.484712] pcieport 0000:bc:00.0: Adding to iommu group 21

> [   20.491155] pcieport 0000:b4:00.0: Adding to iommu group 22

> [   20.517723] rtc-efi rtc-efi: setting system clock to 2019-12-13T10:54:56

> UTC (1576234496)

> [   20.525913] ALSA device list:

> [   20.528878]   No soundcards found.

> [   20.618601] hub 1-2:1.0: USB hub found

> [   20.622440] hub 1-2:1.0: 4 ports detected

> [   20.744970] EXT4-fs (sdd1): recovery complete

> [   20.759425] EXT4-fs (sdd1): mounted filesystem with ordered data mode.

> Opts: (null)

> [   20.767090] VFS: Mounted root (ext4 filesystem) on device 8:49.

> [   20.788837] devtmpfs: mounted

> [   20.793124] Freeing unused kernel memory: 5184K

> [   20.797817] Run /sbin/init as init process

> [   20.913986] usb 1-2.1: new full-speed USB device number 4 using ehci-pci

> [   21.379891] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT

> +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT

> +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2

> default-hierarchy=hybrid)

> [   21.401921] systemd[1]: Detected architecture arm64.

> [   21.459107] systemd[1]: Set hostname to <ubuntu>.

> [   21.474734] systemd[1]: Couldn't move remaining userspace processes,

> ignoring: Input/output error

> [   21.947303] systemd[1]: File

> /lib/systemd/system/systemd-journald.service:36 configures an IP firewall

> (IPAddressDeny=any), but the local system does not support BPF/cgroup based

> firewalling.

> [   21.964340] systemd[1]: Proceeding WITHOUT firewalling in effect! (This

> warning is only shown for the first loaded unit using IP firewalling.)

> [   22.268240] random: systemd: uninitialized urandom read (16 bytes read)

> [   22.274946] systemd[1]: Started Forward Password Requests to Wall

> Directory Watch.

> [   22.298022] random: systemd: uninitialized urandom read (16 bytes read)

> [   22.304894] systemd[1]: Created slice User and Session Slice.

> [   22.322032] random: systemd: uninitialized urandom read (16 bytes read)

> [   22.328850] systemd[1]: Created slice System Slice.

> [   22.346109] systemd[1]: Listening on Syslog Socket.

> [   22.552644] random: crng init done

> [   22.558740] random: 7 urandom warning(s) missed due to ratelimiting

> [   23.370478] EXT4-fs (sdd1): re-mounted. Opts: errors=remount-ro

> [   23.547390] systemd-journald[806]: Received request to flush runtime

> journal from PID 1

> [   23.633956] systemd-journald[806]: File

> /var/log/journal/f0ef8dc5ede84b5eb7431c01908d3558/system.journal corrupted

> or uncleanly shut down, renaming and replacing.

> [   23.814035] Adding 2097148k swap on /swapfile.  Priority:-2 extents:6

> across:2260988k

> [   25.012707] hns3 0000:7d:00.2 eno3: renamed from eth2

> [   25.054228] hns3 0000:7d:00.3 eno4: renamed from eth3

> [   25.086971] hns3 0000:7d:00.1 eno2: renamed from eth1

> [   25.118154] hns3 0000:7d:00.0 eno1: renamed from eth0

> [   25.154467] hns3 0000:bd:00.0 eno5: renamed from eth4

> [   26.130742] input: Keyboard/Mouse KVM 1.1.0 as /devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.0/0003:12D1:0003.0001/input/input1

> [   26.190189] hid-generic 0003:12D1:0003.0001: input: USB HID v1.10

> Keyboard [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input0

> [   26.191049] input: Keyboard/Mouse KVM 1.1.0 as /devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.1/0003:12D1:0003.0002/input/input2

> [   26.191090] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse

> [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1

> [  175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU

> [  175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004

> softirq=1589/1589 fqs=2322

> [  175.657102] (t=5253 jiffies g=2893 q=3123)

> [  175.657105] Task dump for CPU 0:

> [  175.657108] fio             R  running task        0  1254   1224

> 0x00000002

> [  175.657112] Call trace:

> [  175.657122]  dump_backtrace+0x0/0x1a0

> [  175.657126]  show_stack+0x14/0x20

> [  175.657130]  sched_show_task+0x164/0x1a0

> [  175.657133]  dump_cpu_task+0x40/0x2e8

> [  175.657137]  rcu_dump_cpu_stacks+0xa0/0xe0

> [  175.657139]  rcu_sched_clock_irq+0x6d8/0xaa8

> [  175.657143]  update_process_times+0x2c/0x50

> [  175.657147]  tick_sched_handle.isra.14+0x30/0x50

> [  175.657149]  tick_sched_timer+0x48/0x98

> [  175.657152]  __hrtimer_run_queues+0x120/0x1b8

> [  175.657154]  hrtimer_interrupt+0xd4/0x250

> [  175.657159]  arch_timer_handler_phys+0x28/0x40

> [  175.657162]  handle_percpu_devid_irq+0x80/0x140

> [  175.657165]  generic_handle_irq+0x24/0x38

> [  175.657167]  __handle_domain_irq+0x5c/0xb0

> [  175.657170]  gic_handle_irq+0x5c/0x148

> [  175.657172]  el1_irq+0xb8/0x180

> [  175.657175]  efi_header_end+0x94/0x234

> [  175.657178]  irq_exit+0xd0/0xd8

> [  175.657180]  __handle_domain_irq+0x60/0xb0

> [  175.657182]  gic_handle_irq+0x5c/0x148

> [  175.657184]  el1_irq+0xb8/0x180

> [  175.657194]  nvme_open+0x80/0xc8

> [  175.657199]  __blkdev_get+0x3f8/0x4f0

> [  175.657201]  blkdev_get+0x110/0x180

> [  175.657204]  blkdev_open+0x8c/0xa0

> [  175.657207]  do_dentry_open+0x1c4/0x3d8

> [  175.657210]  vfs_open+0x28/0x30

> [  175.657212]  path_openat+0x2a8/0x12a0

> [  175.657214]  do_filp_open+0x78/0xf8

> [  175.657217]  do_sys_open+0x19c/0x258

> [  175.657219]  __arm64_sys_openat+0x20/0x28

> [  175.657222]  el0_svc_common.constprop.2+0x64/0x160

> [  175.657225]  el0_svc_handler+0x20/0x80

> [  175.657227]  el0_sync_handler+0xe4/0x188

> [  175.657229]  el0_sync+0x140/0x180

> 

> john@ubuntu:~$ ./dump-io-irq-affinity

> kernel version:

> Linux ubuntu 5.5.0-rc1-00001-g3779c27ad995-dirty #1436 SMP PREEMPT Fri Dec

> 13 10:51:46 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

> PCI name is 04:00.0: nvme0n1

> irq 56, cpu list 75, effective list 75

> irq 60, cpu list 24-28, effective list 24

> irq 61, cpu list 29-33, effective list 29

> irq 62, cpu list 34-38, effective list 34

> irq 63, cpu list 39-43, effective list 39

> irq 64, cpu list 44-47, effective list 44

> irq 65, cpu list 48-51, effective list 48

> irq 66, cpu list 52-55, effective list 52

> irq 67, cpu list 56-59, effective list 56

> irq 68, cpu list 60-63, effective list 60

> irq 69, cpu list 64-67, effective list 64

> irq 70, cpu list 68-71, effective list 68

> irq 71, cpu list 72-75, effective list 72

> irq 72, cpu list 76-79, effective list 76

> irq 73, cpu list 80-83, effective list 80

> irq 74, cpu list 84-87, effective list 84

> irq 75, cpu list 88-91, effective list 88

> irq 76, cpu list 92-95, effective list 92

> irq 77, cpu list 0-3, effective list 0

> irq 78, cpu list 4-7, effective list 4

> irq 79, cpu list 8-11, effective list 8

> irq 80, cpu list 12-15, effective list 12

> irq 81, cpu list 16-19, effective list 16

> irq 82, cpu list 20-23, effective list 20

> PCI name is 81:00.0: nvme1n1

> irq 100, cpu list 0-3, effective list 0

> irq 101, cpu list 4-7, effective list 4

> irq 102, cpu list 8-11, effective list 8

> irq 103, cpu list 12-15, effective list 12

> irq 104, cpu list 16-19, effective list 16

> irq 105, cpu list 20-23, effective list 20

> irq 57, cpu list 63, effective list 63

> irq 83, cpu list 24-28, effective list 24

> irq 84, cpu list 29-33, effective list 29

> irq 85, cpu list 34-38, effective list 34

> irq 86, cpu list 39-43, effective list 39

> irq 87, cpu list 44-47, effective list 44

> irq 88, cpu list 48-51, effective list 48

> irq 89, cpu list 52-55, effective list 52

> irq 90, cpu list 56-59, effective list 56

> irq 91, cpu list 60-63, effective list 60

> irq 92, cpu list 64-67, effective list 64

> irq 93, cpu list 68-71, effective list 68

> irq 94, cpu list 72-75, effective list 72

> irq 95, cpu list 76-79, effective list 76

> irq 96, cpu list 80-83, effective list 80

> irq 97, cpu list 84-87, effective list 84

> irq 98, cpu list 88-91, effective list 88

> irq 99, cpu list 92-95, effective list 92

 
The above log shows there are two nvme drives, each drive has 24 hw
queues.

Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,
each hw queue can be assigned one unique effective CPU for handling
the queue's interrupt.

Because arm64's gic driver doesn't distribute the irq's effective cpu affinity,
each hw queue is assigned the same CPU to handle its interrupt.

As you saw, the detected RCU stall is on CPU0, which is for handling
both irq 77 and irq 100.

Please apply Marc's patch and observe if unique effective CPU is
assigned to each hw queue's irq.

If a unique effective CPU is assigned to each hw queue's irq and the RCU
stall can still be triggered, let's investigate further, given that a single
ARM64 CPU core should be quick enough to handle IO completion from a single
NVMe drive.


Thanks,
Ming
John Garry Dec. 13, 2019, 3:43 p.m. UTC | #27
On 13/12/2019 13:18, Ming Lei wrote:

Hi Ming,

> 

> On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:

>> Hi Ming,

>>

>>>> I am running some NVMe perf tests with Marc's patch.

>>>

>>> We need to confirm that if Marc's patch works as expected, could you

>>> collect log via the attached script?

>>

>> As immediately below, I see this on vanilla mainline, so let's see what the

>> issue is without that patch.

> 

> IMO, the interrupt load needs to be distributed as what X86 IRQ matrix

> does. If the ARM64 server doesn't do that, the 1st step should align to

> that.


That would make sense. But still, I would like to think that a CPU could 
sink the interrupts from 2x queues.

> 

> Also do you pass 'use_threaded_interrupts=1' in your test?


When I set this then, as I anticipated, there is no lockup. But IOPS drops 
from ~1M to ~800K.

> 

>>

>>>   >

>>> You never provide the test details(how many drives, how many disks

>>> attached to each drive) as I asked, so I can't comment on the reason,

>>> also no reason shows that the patch is a good fix.

>>

>> So I have only 2x ES3000 V3s. This looks like the same one:

>> https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf

>>

>>>

>>> My theory is simple, so far, the CPU is still much quicker than

>>> current storage in case that IO aren't from multiple disks which are

>>> connected to same drive.

>>


[...]

>> irq 98, cpu list 88-91, effective list 88

>> irq 99, cpu list 92-95, effective list 92

>   

> The above log shows there are two nvme drives, each drive has 24 hw

> queues.

> 

> Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,

> each hw queue can be assigned one unique effective CPU for handling

> the queue's interrupt.

> 

> Because arm64's gic driver doesn't distribute irq's effective cpu affinity,

> each hw queue is assigned same CPU to handle its interrupt.

> 

> As you saw, the detected RCU stall is on CPU0, which is for handling

> both irq 77 and irq 100.

> 

> Please apply Marc's patch and observe if unique effective CPU is

> assigned to each hw queue's irq.

> 


Same issue:

979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse 
[Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1
[   38.772536] IRQ25 CPU14 -> CPU3
[   38.777138] IRQ58 CPU8 -> CPU17
[  119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU
[  119.505202] rcu: 16-....: (1 GPs behind) 
idle=a8a/1/0x4000000000000002 softirq=952/1211 fqs=2625
[  119.514188] (t=5253 jiffies g=2613 q=4573)
[  119.514193] Task dump for CPU 16:
[  119.514197] ksoftirqd/16    R  running task        0    91      2 
0x0000002a
[  119.514206] Call trace:
[  119.514224]  dump_backtrace+0x0/0x1a0
[  119.514228]  show_stack+0x14/0x20
[  119.514236]  sched_show_task+0x164/0x1a0
[  119.514240]  dump_cpu_task+0x40/0x2e8
[  119.514245]  rcu_dump_cpu_stacks+0xa0/0xe0
[  119.514247]  rcu_sched_clock_irq+0x6d8/0xaa8
[  119.514251]  update_process_times+0x2c/0x50
[  119.514258]  tick_sched_handle.isra.14+0x30/0x50
[  119.514261]  tick_sched_timer+0x48/0x98
[  119.514264]  __hrtimer_run_queues+0x120/0x1b8
[  119.514266]  hrtimer_interrupt+0xd4/0x250
[  119.514277]  arch_timer_handler_phys+0x28/0x40
[  119.514280]  handle_percpu_devid_irq+0x80/0x140
[  119.514283]  generic_handle_irq+0x24/0x38
[  119.514285]  __handle_domain_irq+0x5c/0xb0
[  119.514299]  gic_handle_irq+0x5c/0x148
[  119.514301]  el1_irq+0xb8/0x180
[  119.514305]  load_balance+0x478/0xb98
[  119.514308]  rebalance_domains+0x1cc/0x2f8
[  119.514311]  run_rebalance_domains+0x78/0xe0
[  119.514313]  efi_header_end+0x114/0x234
[  119.514317]  run_ksoftirqd+0x38/0x48
[  119.514322]  smpboot_thread_fn+0x16c/0x270
[  119.514324]  kthread+0x118/0x120
[  119.514326]  ret_from_fork+0x10/0x18
john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri 
Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 5
irq 60, cpu list 24-28, effective list 10
irq 61, cpu list 29-33, effective list 7
irq 62, cpu list 34-38, effective list 5
irq 63, cpu list 39-43, effective list 6
irq 64, cpu list 44-47, effective list 8
irq 65, cpu list 48-51, effective list 9
irq 66, cpu list 52-55, effective list 10
irq 67, cpu list 56-59, effective list 11
irq 68, cpu list 60-63, effective list 12
irq 69, cpu list 64-67, effective list 13
irq 70, cpu list 68-71, effective list 14
irq 71, cpu list 72-75, effective list 15
irq 72, cpu list 76-79, effective list 16
irq 73, cpu list 80-83, effective list 17
irq 74, cpu list 84-87, effective list 18
irq 75, cpu list 88-91, effective list 19
irq 76, cpu list 92-95, effective list 20
irq 77, cpu list 0-3, effective list 3
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 12
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 23
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 5
irq 102, cpu list 8-11, effective list 9
irq 103, cpu list 12-15, effective list 13
irq 104, cpu list 16-19, effective list 17
irq 105, cpu list 20-23, effective list 21
irq 57, cpu list 63, effective list 7
irq 83, cpu list 24-28, effective list 5
irq 84, cpu list 29-33, effective list 6
irq 85, cpu list 34-38, effective list 8
irq 86, cpu list 39-43, effective list 9
irq 87, cpu list 44-47, effective list 10
irq 88, cpu list 48-51, effective list 11
irq 89, cpu list 52-55, effective list 12
irq 90, cpu list 56-59, effective list 13
irq 91, cpu list 60-63, effective list 14
irq 92, cpu list 64-67, effective list 15
irq 93, cpu list 68-71, effective list 16
irq 94, cpu list 72-75, effective list 17
irq 95, cpu list 76-79, effective list 18
irq 96, cpu list 80-83, effective list 19
irq 97, cpu list 84-87, effective list 20
irq 98, cpu list 88-91, effective list 21
irq 99, cpu list 92-95, effective list 22
john@ubuntu:~$

but you can see that CPU16 is handling irq72, 81, and 93.

> If unique effective CPU is assigned to each hw queue's irq, and the RCU

> stall can still be triggered, let's investigate further, given one single

> ARM64 CPU core should be quick enough to handle IO completion from single

> NVNe drive.


If I remove the code for bringing the affinity within the ITS numa node 
mask - as Marc hinted - then I still get a lockup, and we still have 
CPUs serving multiple interrupts:

[  116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU
[  116.181432] Task dump for CPU 4:
[  116.181502] Task dump for CPU 8:
john@ubuntu:~$ ./dump-io-irq-affinity
kernel version:
Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri 
Dec 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 75
irq 60, cpu list 24-28, effective list 25
irq 61, cpu list 29-33, effective list 29
irq 62, cpu list 34-38, effective list 34
irq 63, cpu list 39-43, effective list 39
irq 64, cpu list 44-47, effective list 44
irq 65, cpu list 48-51, effective list 49
irq 66, cpu list 52-55, effective list 55
irq 67, cpu list 56-59, effective list 56
irq 68, cpu list 60-63, effective list 61
irq 69, cpu list 64-67, effective list 64
irq 70, cpu list 68-71, effective list 68
irq 71, cpu list 72-75, effective list 73
irq 72, cpu list 76-79, effective list 76
irq 73, cpu list 80-83, effective list 80
irq 74, cpu list 84-87, effective list 85
irq 75, cpu list 88-91, effective list 88
irq 76, cpu list 92-95, effective list 92
irq 77, cpu list 0-3, effective list 1
irq 78, cpu list 4-7, effective list 4
irq 79, cpu list 8-11, effective list 8
irq 80, cpu list 12-15, effective list 14
irq 81, cpu list 16-19, effective list 16
irq 82, cpu list 20-23, effective list 20
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 4
irq 102, cpu list 8-11, effective list 8
irq 103, cpu list 12-15, effective list 13
irq 104, cpu list 16-19, effective list 16
irq 105, cpu list 20-23, effective list 20
irq 57, cpu list 63, effective list 63
irq 83, cpu list 24-28, effective list 26
irq 84, cpu list 29-33, effective list 31
irq 85, cpu list 34-38, effective list 35
irq 86, cpu list 39-43, effective list 40
irq 87, cpu list 44-47, effective list 45
irq 88, cpu list 48-51, effective list 50
irq 89, cpu list 52-55, effective list 52
irq 90, cpu list 56-59, effective list 57
irq 91, cpu list 60-63, effective list 62
irq 92, cpu list 64-67, effective list 65
irq 93, cpu list 68-71, effective list 69
irq 94, cpu list 72-75, effective list 74
irq 95, cpu list 76-79, effective list 77
irq 96, cpu list 80-83, effective list 81
irq 97, cpu list 84-87, effective list 86
irq 98, cpu list 88-91, effective list 89
irq 99, cpu list 92-95, effective list 93
john@ubuntu:~$

I'm now thinking that we should just attempt this intelligent CPU 
affinity assignment for managed interrupts.

Thanks,
John
Ming Lei Dec. 13, 2019, 5:12 p.m. UTC | #28
On Fri, Dec 13, 2019 at 03:43:07PM +0000, John Garry wrote:
> On 13/12/2019 13:18, Ming Lei wrote:

> 

> Hi Ming,

> 

> > 

> > On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:

> > > Hi Ming,

> > > 

> > > > > I am running some NVMe perf tests with Marc's patch.

> > > > 

> > > > We need to confirm that if Marc's patch works as expected, could you

> > > > collect log via the attached script?

> > > 

> > > As immediately below, I see this on vanilla mainline, so let's see what the

> > > issue is without that patch.

> > 

> > IMO, the interrupt load needs to be distributed as what X86 IRQ matrix

> > does. If the ARM64 server doesn't do that, the 1st step should align to

> > that.

> 

> That would make sense. But still, I would like to think that a CPU could

> sink the interrupts from 2x queues.

> 

> > 

> > Also do you pass 'use_threaded_interrupts=1' in your test?

> 

> When I set this, then, as I anticipated, no lockup. But IOPS drops from ~ 1M

> IOPS->800K.

> 

> > 

> > > 

> > > >   >

> > > > You never provide the test details(how many drives, how many disks

> > > > attached to each drive) as I asked, so I can't comment on the reason,

> > > > also no reason shows that the patch is a good fix.

> > > 

> > > So I have only 2x ES3000 V3s. This looks like the same one:

> > > https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf

> > > 

> > > > 

> > > > My theory is simple, so far, the CPU is still much quicker than

> > > > current storage in case that IO aren't from multiple disks which are

> > > > connected to same drive.

> > > 

> 

> [...]

> 

> > > irq 98, cpu list 88-91, effective list 88

> > > irq 99, cpu list 92-95, effective list 92

> > The above log shows there are two nvme drives, each drive has 24 hw

> > queues.

> > 

> > Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,

> > each hw queue can be assigned one unique effective CPU for handling

> > the queue's interrupt.

> > 

> > Because arm64's gic driver doesn't distribute irq's effective cpu affinity,

> > each hw queue is assigned same CPU to handle its interrupt.

> > 

> > As you saw, the detected RCU stall is on CPU0, which is for handling

> > both irq 77 and irq 100.

> > 

> > Please apply Marc's patch and observe if unique effective CPU is

> > assigned to each hw queue's irq.

> > 

> 

> Same issue:

> 

> 979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse

> [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1

> [   38.772536] IRQ25 CPU14 -> CPU3

> [   38.777138] IRQ58 CPU8 -> CPU17

> [  119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU

> [  119.505202] rcu: 16-....: (1 GPs behind) idle=a8a/1/0x4000000000000002

> softirq=952/1211 fqs=2625

> [  119.514188] (t=5253 jiffies g=2613 q=4573)

> [  119.514193] Task dump for CPU 16:

> [  119.514197] ksoftirqd/16    R  running task        0    91      2

> 0x0000002a

> [  119.514206] Call trace:

> [  119.514224]  dump_backtrace+0x0/0x1a0

> [  119.514228]  show_stack+0x14/0x20

> [  119.514236]  sched_show_task+0x164/0x1a0

> [  119.514240]  dump_cpu_task+0x40/0x2e8

> [  119.514245]  rcu_dump_cpu_stacks+0xa0/0xe0

> [  119.514247]  rcu_sched_clock_irq+0x6d8/0xaa8

> [  119.514251]  update_process_times+0x2c/0x50

> [  119.514258]  tick_sched_handle.isra.14+0x30/0x50

> [  119.514261]  tick_sched_timer+0x48/0x98

> [  119.514264]  __hrtimer_run_queues+0x120/0x1b8

> [  119.514266]  hrtimer_interrupt+0xd4/0x250

> [  119.514277]  arch_timer_handler_phys+0x28/0x40

> [  119.514280]  handle_percpu_devid_irq+0x80/0x140

> [  119.514283]  generic_handle_irq+0x24/0x38

> [  119.514285]  __handle_domain_irq+0x5c/0xb0

> [  119.514299]  gic_handle_irq+0x5c/0x148

> [  119.514301]  el1_irq+0xb8/0x180

> [  119.514305]  load_balance+0x478/0xb98

> [  119.514308]  rebalance_domains+0x1cc/0x2f8

> [  119.514311]  run_rebalance_domains+0x78/0xe0

> [  119.514313]  efi_header_end+0x114/0x234

> [  119.514317]  run_ksoftirqd+0x38/0x48

> [  119.514322]  smpboot_thread_fn+0x16c/0x270

> [  119.514324]  kthread+0x118/0x120

> [  119.514326]  ret_from_fork+0x10/0x18

> john@ubuntu:~$ ./dump-io-irq-affinity

> kernel version:

> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec

> 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

> PCI name is 04:00.0: nvme0n1

> irq 56, cpu list 75, effective list 5

> irq 60, cpu list 24-28, effective list 10


The effective list is supposed to be a subset of the irq's affinity (24-28).

> irq 61, cpu list 29-33, effective list 7

> irq 62, cpu list 34-38, effective list 5

> irq 63, cpu list 39-43, effective list 6

> irq 64, cpu list 44-47, effective list 8

> irq 65, cpu list 48-51, effective list 9

> irq 66, cpu list 52-55, effective list 10

> irq 67, cpu list 56-59, effective list 11

> irq 68, cpu list 60-63, effective list 12

> irq 69, cpu list 64-67, effective list 13

> irq 70, cpu list 68-71, effective list 14

> irq 71, cpu list 72-75, effective list 15

> irq 72, cpu list 76-79, effective list 16

> irq 73, cpu list 80-83, effective list 17

> irq 74, cpu list 84-87, effective list 18

> irq 75, cpu list 88-91, effective list 19

> irq 76, cpu list 92-95, effective list 20


Same as above, so it looks like Marc's patch is wrong.

> irq 77, cpu list 0-3, effective list 3

> irq 78, cpu list 4-7, effective list 4

> irq 79, cpu list 8-11, effective list 8

> irq 80, cpu list 12-15, effective list 12

> irq 81, cpu list 16-19, effective list 16

> irq 82, cpu list 20-23, effective list 23

> PCI name is 81:00.0: nvme1n1

> irq 100, cpu list 0-3, effective list 0

> irq 101, cpu list 4-7, effective list 5

> irq 102, cpu list 8-11, effective list 9

> irq 103, cpu list 12-15, effective list 13

> irq 104, cpu list 16-19, effective list 17

> irq 105, cpu list 20-23, effective list 21

> irq 57, cpu list 63, effective list 7

> irq 83, cpu list 24-28, effective list 5

> irq 84, cpu list 29-33, effective list 6

> irq 85, cpu list 34-38, effective list 8

> irq 86, cpu list 39-43, effective list 9

> irq 87, cpu list 44-47, effective list 10

> irq 88, cpu list 48-51, effective list 11

> irq 89, cpu list 52-55, effective list 12

> irq 90, cpu list 56-59, effective list 13

> irq 91, cpu list 60-63, effective list 14

> irq 92, cpu list 64-67, effective list 15

> irq 93, cpu list 68-71, effective list 16

> irq 94, cpu list 72-75, effective list 17

> irq 95, cpu list 76-79, effective list 18

> irq 96, cpu list 80-83, effective list 19

> irq 97, cpu list 84-87, effective list 20

> irq 98, cpu list 88-91, effective list 21

> irq 99, cpu list 92-95, effective list 22


More are wrong.

> john@ubuntu:~$

> 

> but you can see that CPU16 is handling irq72, 81, and 93.


As I mentioned, the effective affinity has to be a subset of the irq's
affinity.
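
(A trivial way to assert that invariant in a debug build would be
something like the hedged sketch below - a made-up helper, not from any
posted patch:)

/* Hypothetical debug helper: warn if an irq's effective affinity ever
 * falls outside its affinity mask. */
static void check_effective_affinity(struct irq_data *d)
{
	const struct cpumask *aff = irq_data_get_affinity_mask(d);
	const struct cpumask *eff = irq_data_get_effective_affinity_mask(d);

	WARN_ONCE(!cpumask_subset(eff, aff),
		  "irq %u: effective affinity %*pbl outside affinity %*pbl\n",
		  d->irq, cpumask_pr_args(eff), cpumask_pr_args(aff));
}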

> 

> > If unique effective CPU is assigned to each hw queue's irq, and the RCU

> > stall can still be triggered, let's investigate further, given one single

> > ARM64 CPU core should be quick enough to handle IO completion from single

> > NVNe drive.

> 

> If I remove the code for bring the affinity within the ITS numa node mask -

> as Marc hinted - then I still get a lockup, but we still we have CPUs

> serving multiple interrupts:

> 

>   116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU

> [  116.181432] Task dump for CPU 4:

> [  116.181502] Task dump for CPU 8:

> john@ubuntu:~$ ./dump-io-irq-affinity

> kernel version:

> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri Dec

> 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

> PCI name is 04:00.0: nvme0n1

> irq 56, cpu list 75, effective list 75

> irq 60, cpu list 24-28, effective list 25

> irq 61, cpu list 29-33, effective list 29

> irq 62, cpu list 34-38, effective list 34

> irq 63, cpu list 39-43, effective list 39

> irq 64, cpu list 44-47, effective list 44

> irq 65, cpu list 48-51, effective list 49

> irq 66, cpu list 52-55, effective list 55

> irq 67, cpu list 56-59, effective list 56

> irq 68, cpu list 60-63, effective list 61

> irq 69, cpu list 64-67, effective list 64

> irq 70, cpu list 68-71, effective list 68

> irq 71, cpu list 72-75, effective list 73

> irq 72, cpu list 76-79, effective list 76

> irq 73, cpu list 80-83, effective list 80

> irq 74, cpu list 84-87, effective list 85

> irq 75, cpu list 88-91, effective list 88

> irq 76, cpu list 92-95, effective list 92

> irq 77, cpu list 0-3, effective list 1

> irq 78, cpu list 4-7, effective list 4

> irq 79, cpu list 8-11, effective list 8

> irq 80, cpu list 12-15, effective list 14

> irq 81, cpu list 16-19, effective list 16

> irq 82, cpu list 20-23, effective list 20

> PCI name is 81:00.0: nvme1n1

> irq 100, cpu list 0-3, effective list 0

> irq 101, cpu list 4-7, effective list 4

> irq 102, cpu list 8-11, effective list 8

> irq 103, cpu list 12-15, effective list 13

> irq 104, cpu list 16-19, effective list 16

> irq 105, cpu list 20-23, effective list 20

> irq 57, cpu list 63, effective list 63

> irq 83, cpu list 24-28, effective list 26

> irq 84, cpu list 29-33, effective list 31

> irq 85, cpu list 34-38, effective list 35

> irq 86, cpu list 39-43, effective list 40

> irq 87, cpu list 44-47, effective list 45

> irq 88, cpu list 48-51, effective list 50

> irq 89, cpu list 52-55, effective list 52

> irq 90, cpu list 56-59, effective list 57

> irq 91, cpu list 60-63, effective list 62

> irq 92, cpu list 64-67, effective list 65

> irq 93, cpu list 68-71, effective list 69

> irq 94, cpu list 72-75, effective list 74

> irq 95, cpu list 76-79, effective list 77

> irq 96, cpu list 80-83, effective list 81

> irq 97, cpu list 84-87, effective list 86

> irq 98, cpu list 88-91, effective list 89

> irq 99, cpu list 92-95, effective list 93

> john@ubuntu:~$

> 

> I'm now thinking that we should just attempt this intelligent CPU affinity

> assignment for managed interrupts.


Right, the rule is simple: distribute the effective CPUs evenly among all
CPUs, while selecting each irq's effective CPU from within its affinity mask.
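
Something along these lines, as a purely illustrative sketch (the
per-cpu counter and helper names are made up, this is not the actual
ITS code):

/* Hypothetical helper illustrating the rule above: pick the CPU with
 * the fewest interrupts routed to it so far, considering only CPUs in
 * the irq's affinity mask. */
static DEFINE_PER_CPU(unsigned int, cpu_irq_count);

static int pick_least_loaded_cpu(const struct cpumask *aff_mask)
{
	unsigned int min_count = UINT_MAX;
	int cpu, best = cpumask_first(aff_mask);

	for_each_cpu_and(cpu, aff_mask, cpu_online_mask) {
		unsigned int count = per_cpu(cpu_irq_count, cpu);

		if (count < min_count) {
			min_count = count;
			best = cpu;
		}
	}

	per_cpu(cpu_irq_count, best)++;	/* not atomic; fine for a sketch */
	return best;
}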


Thanks,
Ming
John Garry Dec. 13, 2019, 5:50 p.m. UTC | #29
On 13/12/2019 17:12, Ming Lei wrote:
>> pu list 80-83, effective list 81

>> irq 97, cpu list 84-87, effective list 86

>> irq 98, cpu list 88-91, effective list 89

>> irq 99, cpu list 92-95, effective list 93

>> john@ubuntu:~$

>>

>> I'm now thinking that we should just attempt this intelligent CPU affinity

>> assignment for managed interrupts.

> Right, the rule is simple: distribute effective list among CPUs evenly,

> meantime select the effective CPU from the irq's affinity mask.

> 


Even if we fix that, there is still the potential for a CPU to handle 
multiple nvme completion queues due to many factors, like cpu count, 
probe ordering, other PCI endpoints in the system, etc., so this lockup 
needs to be remedied.

Thanks,
John
Marc Zyngier Dec. 14, 2019, 10:59 a.m. UTC | #30
On Fri, 13 Dec 2019 12:08:54 +0000,
John Garry <john.garry@huawei.com> wrote:
> 

> Hi Marc,

> 

> >> JFYI, we're still testing this and the patch itself seems to work as

> >> intended.

> >> 

> >> Here's the kernel log if you just want to see how the interrupts are

> >> getting assigned:

> >> https://pastebin.com/hh3r810g

> > 

> > It is a bit hard to make sense of this dump, specially on such a wide

> > machine (I want one!) 

> 

> So do I :) That's the newer "D06CS" board.

> 

> without really knowing the topology of the system.

> 

> So it's 2x socket, each socket has 2x CPU dies, and each die has 6

> clusters of 4 CPUs, which gives 96 in total.

> 

> > 

> >> For me, I did get a performance boost for NVMe testing, but my

> >> colleague Xiang Chen saw a drop for our storage test of interest  -

> >> that's the HiSi SAS controller. We're trying to make sense of it now.

> > 

> > One of the difference is that with this patch, the initial affinity

> > is picked inside the NUMA node that matches the ITS. 

> 

> Is that even for managed interrupts? We're testing the storage

> controller which uses managed interrupts. I should have made that

> clearer.


The ITS driver doesn't care about whether an interrupt's affinity
is 'managed' or not. And I don't think a low-level driver should, as
it will just follow whatever interrupt affinity it is requested to
use. If a managed interrupt has some requirements, then these
requirements better be explicit in terms of CPU affinity.

> In your case,

> > that's either node 0 or 2. But it is unclear whether which CPUs these

> > map to.

> > 

> > Given that I see interrupts mapped to CPUs 0-23 on one side, and 48-71

> > on the other, it looks like half of your machine gets starved, 

> 

> Seems that way.

> 

> So this is a mystery to me:

> 

> [   23.584192] picked CPU62 IRQ147

> 

> 147:  0  0  0 ... 0   (all-zero counts across all 96 CPUs)   ITS-MSI 94404626 Edge      hisi_sas_v3_hw cq

> 

> 

> and

> 

> [   25.896728] picked CPU62 IRQ183

> 

> 183:  0  0  0 ... 0   (all-zero counts across all 96 CPUs)   ITS-MSI 94437398 Edge      hisi_sas_v3_hw cq

> 

> 

> But mpstat reports for CPU62:

> 

> 12:44:58 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 12:45:00 AM   62    6.54    0.00   42.99    0.00    6.54   12.15    0.00    0.00    6.54   25.23

> 

> I don't know what interrupts they are...


Clearly, they aren't your SAS interrupts. But the debug prints do not
mean that these are the only interrupts that are targeting
CPU62. Looking at the 62nd column of /proc/interrupts should tell you
what fires (and my bet is on something like the timer).

> It's the "hisi_sas_v3_hw cq" interrupts which we're interested in.


Clearly, they aren't firing.

> and that

> > may be because no ITS targets the NUMA nodes they are part of.

> 

> So both storage controllers (which we're interested in for this test)

> are on socket #0, node #0.

> 

>  It would

> > be interesting to see what happens if you manually set the affinity

> > of the interrupts outside of the NUMA node.

> > 

> 

> Again, managed, so I don't think it's possible.


OK, we need to get back to what the actual requirements of a 'managed'
interrupt are, because there is clearly something that hasn't made it
into the core code...

	M.

-- 
Jazz is not dead, it just smells funny.
Marc Zyngier Dec. 14, 2019, 1:56 p.m. UTC | #31
On Fri, 13 Dec 2019 15:43:07 +0000
John Garry <john.garry@huawei.com> wrote:

[...]

> john@ubuntu:~$ ./dump-io-irq-affinity

> kernel version:

> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

> PCI name is 04:00.0: nvme0n1

> irq 56, cpu list 75, effective list 5

> irq 60, cpu list 24-28, effective list 10


The NUMA selection code definitely gets in the way. And to be honest,
this NUMA thing is only there for the benefit of a terminally broken
implementation (Cavium ThunderX), which we should never have supported
in the first place.

Let's rework this and simply use the managed affinity whenever
available instead. It may well be that it will break TX1, but I care
about it just as much as Cavium/Marvell does...
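
In other words, something like the sketch below (illustrative only - it
is not the patch linked further down, and names such as its_numa_node
are stand-ins for the real driver state):

/*
 * Rough sketch: honour the managed affinity mask as-is, and only apply
 * the ITS NUMA node restriction to non-managed interrupts.
 */
static int its_pick_target_cpu(struct irq_data *d,
			       const struct cpumask *aff_mask,
			       int its_numa_node)
{
	cpumask_var_t tmpmask;
	int cpu;

	if (!alloc_cpumask_var(&tmpmask, GFP_ATOMIC))
		return cpumask_first(aff_mask);

	if (irqd_affinity_is_managed(d)) {
		/* Managed: trust the kernel-generated spread */
		cpumask_and(tmpmask, aff_mask, cpu_online_mask);
	} else {
		/* Non-managed: keep the interrupt close to the ITS */
		cpumask_and(tmpmask, aff_mask, cpumask_of_node(its_numa_node));
		if (cpumask_empty(tmpmask))
			cpumask_and(tmpmask, aff_mask, cpu_online_mask);
	}

	cpu = cpumask_empty(tmpmask) ? cpumask_first(aff_mask) :
	      cpumask_first(tmpmask);	/* or a least-loaded pick */
	free_cpumask_var(tmpmask);
	return cpu;
}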

Please give this new patch a shot on your system (my D05 doesn't have
any managed devices):

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a

Thanks,

	M.

-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 16, 2019, 10:47 a.m. UTC | #32
On 14/12/2019 13:56, Marc Zyngier wrote:
> On Fri, 13 Dec 2019 15:43:07 +0000

> John Garry <john.garry@huawei.com> wrote:

> 

> [...]

> 

>> john@ubuntu:~$ ./dump-io-irq-affinity

>> kernel version:

>> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

>> PCI name is 04:00.0: nvme0n1

>> irq 56, cpu list 75, effective list 5

>> irq 60, cpu list 24-28, effective list 10

> 

> The NUMA selection code definitely gets in the way. And to be honest,

> this NUMA thing is only there for the benefit of a terminally broken

> implementation (Cavium ThunderX), which we should have never supported

> the first place.

> 

> Let's rework this and simply use the managed affinity whenever

> available instead. It may well be that it will break TX1, but I care

> about it just as much as Cavium/Marvell does...


I'm just wondering if non-managed interrupts should be included in the 
load balancing calculation? Couldn't irqbalance (if active) start moving 
non-managed interrupts around anyway?

> 

> Please give this new patch a shot on your system (my D05 doesn't have

> any managed devices):


We could consider supporting platform msi managed interrupts, but I 
doubt the value.

> 

> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a

> 


I quickly tested that in my NVMe env, and I see a performance boost of 
1055K -> 1206K IOPS. Results at bottom.

Here's the irq mapping dump:

PCI name is 04:00.0: nvme0n1
irq 56, cpu list 75, effective list 75
irq 60, cpu list 24-28, effective list 26
irq 61, cpu list 29-33, effective list 30
irq 62, cpu list 34-38, effective list 35
irq 63, cpu list 39-43, effective list 40
irq 64, cpu list 44-47, effective list 45
irq 65, cpu list 48-51, effective list 49
irq 66, cpu list 52-55, effective list 55
irq 67, cpu list 56-59, effective list 57
irq 68, cpu list 60-63, effective list 61
irq 69, cpu list 64-67, effective list 65
irq 70, cpu list 68-71, effective list 69
irq 71, cpu list 72-75, effective list 73
irq 72, cpu list 76-79, effective list 77
irq 73, cpu list 80-83, effective list 81
irq 74, cpu list 84-87, effective list 85
irq 75, cpu list 88-91, effective list 89
irq 76, cpu list 92-95, effective list 93
irq 77, cpu list 0-3, effective list 1
irq 78, cpu list 4-7, effective list 6
irq 79, cpu list 8-11, effective list 9
irq 80, cpu list 12-15, effective list 13
irq 81, cpu list 16-19, effective list 17
irq 82, cpu list 20-23, effective list 21
PCI name is 81:00.0: nvme1n1
irq 100, cpu list 0-3, effective list 0
irq 101, cpu list 4-7, effective list 4
irq 102, cpu list 8-11, effective list 8
irq 103, cpu list 12-15, effective list 12
irq 104, cpu list 16-19, effective list 16
irq 105, cpu list 20-23, effective list 20
irq 57, cpu list 63, effective list 63
irq 83, cpu list 24-28, effective list 26
irq 84, cpu list 29-33, effective list 29
irq 85, cpu list 34-38, effective list 34
irq 86, cpu list 39-43, effective list 39
irq 87, cpu list 44-47, effective list 44
irq 88, cpu list 48-51, effective list 48
irq 89, cpu list 52-55, effective list 54
irq 90, cpu list 56-59, effective list 56
irq 91, cpu list 60-63, effective list 60
irq 92, cpu list 64-67, effective list 64
irq 93, cpu list 68-71, effective list 68
irq 94, cpu list 72-75, effective list 72
irq 95, cpu list 76-79, effective list 76
irq 96, cpu list 80-83, effective list 80
irq 97, cpu list 84-87, effective list 84
irq 98, cpu list 88-91, effective list 88
irq 99, cpu list 92-95, effective list 92

I'm still getting the CPU lockup (even on CPUs which have a single NVMe 
completion interrupt assigned), which taints these results. That lockup 
needs to be fixed.

We'll check on our SAS env also. I did already hack something up similar 
to your change and again we saw a boost there.

Thanks,
John


before

job1: (groupid=0, jobs=20): err= 0: pid=1328: Mon Dec 16 10:03:35 2019
    read: IOPS=1055k, BW=4121MiB/s (4322MB/s)(994GiB/246946msec)
     slat (usec): min=2, max=36747k, avg= 6.08, stdev=4018.85
     clat (usec): min=13, max=145774k, avg=369.87, stdev=50221.38
      lat (usec): min=22, max=145774k, avg=376.12, stdev=50387.08
     clat percentiles (usec):
      |  1.00th=[  105],  5.00th=[  128], 10.00th=[  149], 20.00th=[  178],
      | 30.00th=[  210], 40.00th=[  243], 50.00th=[  281], 60.00th=[  326],
      | 70.00th=[  396], 80.00th=[  486], 90.00th=[  570], 95.00th=[  619],
      | 99.00th=[  775], 99.50th=[  906], 99.90th=[ 1254], 99.95th=[ 1631],
      | 99.99th=[ 3884]
    bw (  KiB/s): min=    8, max=715687, per=5.65%, avg=238518.42, 
stdev=115795.80, samples=8726
    iops        : min=    2, max=178921, avg=59629.49, stdev=28948.95, 
samples=8726
   lat (usec)   : 20=0.01%, 50=0.01%, 100=0.60%, 250=41.66%, 500=39.19%
   lat (usec)   : 750=17.36%, 1000=0.95%
   lat (msec)   : 2=0.20%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec)   : 2000=0.01%, >=2000=0.01%
   cpu          : usr=8.26%, sys=33.56%, ctx=132171506, majf=0, minf=6774
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
 >=64=0.0%

      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%

      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%

      issued rwt: total=260541724,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=20

Run status group 0 (all jobs):
    READ: bw=4121MiB/s (4322MB/s), 4121MiB/s-4121MiB/s 
(4322MB/s-4322MB/s), io=994GiB (1067GB), run=246946-246946msec

Disk stats (read/write):
   nvme0n1: ios=136993553/0, merge=0/0, ticks=42019997/0, 
in_queue=14168, util=100.00%
   nvme1n1: ios=123408538/0, merge=0/0, ticks=37371364/0, 
in_queue=44672, util=100.00%
john@ubuntu:~$ dmesg | grep "Linux v"
[    0.000000] Linux version 5.5.0-rc1-dirty 
(john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 
[linaro-7.3-2018.05-rc1 revision 
38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] (Linaro GCC 7.3-2018.05-rc1)) 
#546 SMP PREEMPT Mon Dec 16 09:47:44 GMT 2019
john@ubuntu:~$


after

Creat 4k_read_depth20_fiotest file sucessfully_cpu_liuyifan_nvme.sh 4k 
read 20 1
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
fio-3.1
Starting 20 processes
[  318.569268] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 04m:30s]
[  318.575010] rcu: 26-....: (1 GPs behind) 
idle=b82/1/0x4000000000000004 softirq=842/843 fqs=2508
[  355.781759] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 03m:53s]
[  355.787499] rcu: 34-....: (1 GPs behind) 
idle=35a/1/0x4000000000000004 softirq=10395/11729 fqs=2623
[  407.805329] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 03m:01s]
[  407.811069] rcu: 0-....: (1 GPs behind) idle=0ba/0/0x3 
softirq=10830/14926 fqs=2625
Jobs: 20 (f=20): [R(20)][61.0%][r=4747MiB/s,w=0KiB/s][r=1215k,w=0[ 
470.817317] rcu: INFO: rcu_preempt self-detected stall on CPU
[  470.824912] rcu: 0-....: (2779 ticks this GP) idle=0ba/0/0x3 
softirq=14927/14927 fqs=10501
[  533.829618] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 00m:54s]
[  533.835360] rcu: 39-....: (1 GPs behind) 
idle=74e/1/0x4000000000000004 softirq=3422/3422 fqs=17226
Jobs: 20 (f=20): [R(20)][100.0%][r=4822MiB/s,w=0KiB/s][r=1234k,w=0 
IOPS][eta 00m:00s]
job1: (groupid=0, jobs=20): err= 0: pid=1273: Mon Dec 16 10:15:55 2019
    read: IOPS=1206k, BW=4712MiB/s (4941MB/s)(1381GiB/300002msec)
     slat (usec): min=2, max=165648k, avg= 7.26, stdev=10373.59
     clat (usec): min=12, max=191808k, avg=323.17, stdev=57005.77
      lat (usec): min=19, max=191808k, avg=330.59, stdev=58014.79
     clat percentiles (usec):
      |  1.00th=[  106],  5.00th=[  151], 10.00th=[  174], 20.00th=[  194],
      | 30.00th=[  212], 40.00th=[  231], 50.00th=[  247], 60.00th=[  262],
      | 70.00th=[  285], 80.00th=[  330], 90.00th=[  457], 95.00th=[  537],
      | 99.00th=[  676], 99.50th=[  807], 99.90th=[ 1647], 99.95th=[ 2376],
      | 99.99th=[ 6915]
    bw (  KiB/s): min=    8, max=648593, per=5.73%, avg=276597.82, 
stdev=98174.89, samples=10475
    iops        : min=    2, max=162148, avg=69149.31, stdev=24543.72, 
samples=10475
   lat (usec)   : 20=0.01%, 50=0.01%, 100=0.67%, 250=51.48%, 500=41.68%
   lat (usec)   : 750=5.54%, 1000=0.33%
   lat (msec)   : 2=0.23%, 4=0.05%, 10=0.02%, 20=0.01%, 50=0.01%
   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec)   : 2000=0.01%, >=2000=0.01%
   cpu          : usr=9.77%, sys=41.68%, ctx=218155976, majf=0, minf=6376
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
 >=64=0.0%

      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%

      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%

      issued rwt: total=361899317,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=20

Run status group 0 (all jobs):
    READ: bw=4712MiB/s (4941MB/s), 4712MiB/s-4712MiB/s 
(4941MB/s-4941MB/s), io=1381GiB (1482GB), run=300002-300002msec

Disk stats (read/write):
   nvme0n1: ios=188627578/0, merge=0/0, ticks=50365208/0, 
in_queue=55380, util=100.00%
   nvme1n1: ios=173066657/0, merge=0/0, ticks=38804419/0, 
in_queue=151212, util=100.00%
john@ubuntu:~$ dmesg | grep "Linux v"
[    0.000000] Linux version 5.5.0-rc1-00001-g1e987d83b8d8-dirty 
(john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 
[linaro-7.3-2018.05-rc1 revision 
38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] (Linaro GCC 7.3-2018.05-rc1)) 
#547 SMP PREEMPT Mon Dec 16 10:02:27 GMT 2019
Marc Zyngier Dec. 16, 2019, 11:40 a.m. UTC | #33
On 2019-12-16 10:47, John Garry wrote:
> On 14/12/2019 13:56, Marc Zyngier wrote:

>> On Fri, 13 Dec 2019 15:43:07 +0000

>> John Garry <john.garry@huawei.com> wrote:

>> [...]

>>

>>> john@ubuntu:~$ ./dump-io-irq-affinity

>>> kernel version:

>>> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT 

>>> Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

>>> PCI name is 04:00.0: nvme0n1

>>> irq 56, cpu list 75, effective list 5

>>> irq 60, cpu list 24-28, effective list 10

>> The NUMA selection code definitely gets in the way. And to be 

>> honest,

>> this NUMA thing is only there for the benefit of a terminally broken

>> implementation (Cavium ThunderX), which we should have never 

>> supported

>> the first place.

>> Let's rework this and simply use the managed affinity whenever

>> available instead. It may well be that it will break TX1, but I care

>> about it just as much as Cavium/Marvell does...

>

> I'm just wondering if non-managed interrupts should be included in

> the load balancing calculation? Couldn't irqbalance (if active) start

> moving non-managed interrupts around anyway?


But they are, aren't they? See what we do in irq_set_affinity:

+		atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));
+		atomic_dec(per_cpu_ptr(&cpu_lpi_count,
+				       its_dev->event_map.col_map[id]));

We don't try to "rebalance" anything based on that though, not that
I think we should.

>

>> Please give this new patch a shot on your system (my D05 doesn't 

>> have

>> any managed devices):

>

> We could consider supporting platform msi managed interrupts, but I

> doubt the value.


It shouldn't be hard to do, and most of the existing code could be
moved to the generic level. As for the value, I'm not convinced
either. For example D05 uses the MBIGEN as an intermediate interrupt
controller, so MSIs are from the PoV of MBIGEN, and not the SAS device
attached to it. Not the best design...

>> 

>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a

>>

>

> I quickly tested that in my NVMe env, and I see a performance boost

> of 1055K -> 1206K IOPS. Results at bottom.


OK, that's encouraging.

> Here's the irq mapping dump:


[...]

Looks good.

> I'm still getting the CPU lockup (even on CPUs which have a single

> NVMe completion interrupt assigned), which taints these results. That

> lockup needs to be fixed.


Is this interrupt screaming to the point where it prevents the 
completion
thread from making forward progress? What if you don't use threaded
interrupts?

> We'll check on our SAS env also. I did already hack something up

> similar to your change and again we saw a boost there.


OK. Please keep me posted. If the result is overall positive, I'll
push this into -next for some soaking.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 16, 2019, 2:17 p.m. UTC | #34
Hi Marc,

>>

>> I'm just wondering if non-managed interrupts should be included in

>> the load balancing calculation? Couldn't irqbalance (if active) start

>> moving non-managed interrupts around anyway?

> 

> But they are, aren't they? See what we do in irq_set_affinity:

> 

> +        atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));

> +        atomic_dec(per_cpu_ptr(&cpu_lpi_count,

> +                       its_dev->event_map.col_map[id]));

> 

> We don't try to "rebalance" anything based on that though, not that

> I think we should.


Ah sorry, I meant whether they should not be included. In 
its_irq_domain_activate(), we increment the per-cpu lpi count and also 
use its_pick_target_cpu() to find the least loaded cpu. I am asking 
whether we should just stick with the old policy for non-managed 
interrupts here.

After checking D05, I see a very significant performance hit for the SAS 
controller - throughput drops by ~40%.

With this patch, now we have effective affinity targeted at seemingly 
"random" CPUs, as opposed to all just using CPU0. This affects performance.

The difference is that when we use managed interrupts - like for NVMe or 
the D06 SAS controller - the irq cpu affinity mask matches the CPUs which 
enqueue requests to the queue associated with the interrupt. So there is 
an efficiency in enqueueing and dequeueing on the same CPU group - all 
related to blk multi-queue. And this is not the case for non-managed 
interrupts.

>>

>>> Please give this new patch a shot on your system (my D05 doesn't have

>>> any managed devices):

>>

>> We could consider supporting platform msi managed interrupts, but I

>> doubt the value.

> 

> It shouldn't be hard to do, and most of the existing code could be

> moved to the generic level. As for the value, I'm not convinced

> either. For example D05 uses the MBIGEN as an intermediate interrupt

> controller, so MSIs are from the PoV of MBIGEN, and not the SAS device

> attached to it. Not the best design...


JFYI, I did raise this following topic before, but that's as far as I got:

https://marc.info/?l=linux-block&m=150722088314310&w=2

> 

>>>

>>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a 

>>>

>>>

>>

>> I quickly tested that in my NVMe env, and I see a performance boost

>> of 1055K -> 1206K IOPS. Results at bottom.

> 

> OK, that's encouraging.

> 

>> Here's the irq mapping dump:

> 

> [...]

> 

> Looks good.

> 

>> I'm still getting the CPU lockup (even on CPUs which have a single

>> NVMe completion interrupt assigned), which taints these results. That

>> lockup needs to be fixed.

> 

> Is this interrupt screaming to the point where it prevents the completion

> thread from making forward progress? What if you don't use threaded

> interrupts?


Yeah, just switching to threaded interrupts solves it (nvme core has a 
switch for this). So there was a big discussion on this topic a while ago:

https://lkml.org/lkml/2019/8/20/45 (couldn't find this on lore)

The conclusion there was to switch to irq poll, but Ming Lei thought that 
it was another issue - see earlier mail:

https://lore.kernel.org/lkml/20191210014335.GA25022@ming.t460p/
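
(For reference, the irq_poll route would look roughly like the sketch
below in a driver's completion path - purely illustrative, with made-up
structure and helper names, not code from the nvme driver or from that
thread:)

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

struct my_queue {				/* hypothetical driver state */
	struct irq_poll iop;
	/* ... completion ring, etc ... */
};

static int my_process_completions(struct my_queue *q, int budget);	/* hypothetical */

static int my_queue_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = my_process_completions(q, budget);

	if (done < budget)
		irq_poll_complete(iop);	/* ring drained, allow rescheduling */
	return done;
}

static irqreturn_t my_queue_irq(int irq, void *data)
{
	struct my_queue *q = data;

	irq_poll_sched(&q->iop);	/* push the work out of hard irq context */
	return IRQ_HANDLED;
}

/* At init time: irq_poll_init(&q->iop, budget, my_queue_poll); */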

> 

>> We'll check on our SAS env also. I did already hack something up

>> similar to your change and again we saw a boost there.

> 

> OK. Please keep me posted. If the result is overall positive, I'll

> push this into -next for some soaking.

> 


ok, thanks

John
Marc Zyngier Dec. 16, 2019, 6 p.m. UTC | #35
Hi John,

On 2019-12-16 14:17, John Garry wrote:
> Hi Marc,

>

>>>

>>> I'm just wondering if non-managed interrupts should be included in

>>> the load balancing calculation? Couldn't irqbalance (if active) 

>>> start

>>> moving non-managed interrupts around anyway?

>> But they are, aren't they? See what we do in irq_set_affinity:

>> +        atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));

>> +        atomic_dec(per_cpu_ptr(&cpu_lpi_count,

>> +                       its_dev->event_map.col_map[id]));

>> We don't try to "rebalance" anything based on that though, not that

>> I think we should.

>

> Ah sorry, I meant whether they should not be included. In

> its_irq_domain_activate(), we increment the per-cpu lpi count and 

> also

> use its_pick_target_cpu() to find the least loaded cpu. I am asking

> whether we should just stick with the old policy for non-managed

> interrupts here.

>

> After checking D05, I see a very significant performance hit for SAS

> controller performance - ~40% throughout lowering.


-ETOOMANYMOVINGPARTS.

> With this patch, now we have effective affinity targeted at seemingly

> "random" CPUs, as opposed to all just using CPU0. This affects

> performance.


And piling all interrupts on the same CPU does help?

> The difference is that when we use managed interrupts - like for NVME

> or D06 SAS controller - the irq cpu affinity mask matches the CPUs

> which enqueue the requests to the queue associated with the 

> interrupt.

> So there is an efficiency is enqueuing and deqeueing on same CPU 

> group

> - all related to blk multi-queue. And this is not the case for

> non-managed interrupts.


So you enqueue requests from CPU0 only? It seems a bit odd...

>>>> Please give this new patch a shot on your system (my D05 doesn't 

>>>> have

>>>> any managed devices):

>>>

>>> We could consider supporting platform msi managed interrupts, but I

>>> doubt the value.

>> It shouldn't be hard to do, and most of the existing code could be

>> moved to the generic level. As for the value, I'm not convinced

>> either. For example D05 uses the MBIGEN as an intermediate interrupt

>> controller, so MSIs are from the PoV of MBIGEN, and not the SAS 

>> device

>> attached to it. Not the best design...

>

> JFYI, I did raise this following topic before, but that's as far as I 

> got:

>

> https://marc.info/?l=linux-block&m=150722088314310&w=2


Yes. And that's probably not very hard, but the problem in your case is
that the D05 HW is not using MSIs... You'd have to provide an 
abstraction
for wired interrupts (please don't).

You'd be better off directly setting the affinity of the interrupts 
from
the driver, but I somehow can't believe that you're only submitting 
requests
from the same CPU, always. There must be something I'm missing.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 16, 2019, 6:50 p.m. UTC | #36
Hi Marc,

>>

>>>>

>>>> I'm just wondering if non-managed interrupts should be included in

>>>> the load balancing calculation? Couldn't irqbalance (if active) start

>>>> moving non-managed interrupts around anyway?

>>> But they are, aren't they? See what we do in irq_set_affinity:

>>> +        atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));

>>> +        atomic_dec(per_cpu_ptr(&cpu_lpi_count,

>>> +                       its_dev->event_map.col_map[id]));

>>> We don't try to "rebalance" anything based on that though, not that

>>> I think we should.

>>

>> Ah sorry, I meant whether they should not be included. In

>> its_irq_domain_activate(), we increment the per-cpu lpi count and also

>> use its_pick_target_cpu() to find the least loaded cpu. I am asking

>> whether we should just stick with the old policy for non-managed

>> interrupts here.

>>

>> After checking D05, I see a very significant performance hit for SAS

>> controller performance - ~40% throughput lowering.

> 

> -ETOOMANYMOVINGPARTS.


Understood.

> 

>> With this patch, now we have effective affinity targeted at seemingly

>> "random" CPUs, as opposed to all just using CPU0. This affects

>> performance.

> 

> And piling all interrupts on the same CPU does help?


Apparently... I need to check this more.

> 

>> The difference is that when we use managed interrupts - like for NVME

>> or D06 SAS controller - the irq cpu affinity mask matches the CPUs

>> which enqueue the requests to the queue associated with the interrupt.

>> So there is an efficiency in enqueueing and dequeueing on the same CPU group

>> - all related to blk multi-queue. And this is not the case for

>> non-managed interrupts.

> 

> So you enqueue requests from CPU0 only? It seems a bit odd...


No, but maybe I wasn't clear enough. I'll give an overview:

For D06 SAS controller - which is a multi-queue PCI device - we use 
managed interrupts. The HW has 16 submission/completion queues, so for 
96 cores, we have an even spread of 6 CPUs assigned per queue; and this 
per-queue CPU mask is the interrupt affinity mask. So CPU0-5 would 
submit any IO on queue0, CPU6-11 on queue2, and so on. PCI NVMe is 
essentially the same.

These are the environments which we're trying to promote performance.

Then for D05 SAS controller - which is multi-queue platform device 
(mbigen) - we don't use managed interrupts. We still submit IO from any 
CPU, but we choose the queue to submit IO on a round-robin basis to 
promote some isolation, i.e. reduce inter-queue lock contention, so the 
queue chosen has nothing to do with the CPU.

And with your change we may submit on cpu4 but service the interrupt on 
cpu30, as an example. While previously we would always service on cpu0. 
The old way still isn't ideal, I'll admit.

For this env, we would just like to maintain the same performance. And 
it's here that we see the performance drop.
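
To make the D06 side concrete, the setup is roughly like the sketch below 
(illustrative names only, not our actual driver code): the PCI core spreads 
the managed vectors over the CPUs, and blk-mq then maps each hardware queue 
to the CPUs of its vector.

#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Illustrative sketch only - not the actual hisi_sas/nvme code. */
static int example_setup_managed_irqs(struct pci_dev *pdev, int nr_queues)
{
	struct irq_affinity affd = { };

	/*
	 * 16 vectors on a 96-core system gives 6 CPUs per vector; those
	 * per-vector masks are the managed irq affinity masks described
	 * above.
	 */
	return pci_alloc_irq_vectors_affinity(pdev, nr_queues, nr_queues,
					      PCI_IRQ_MSI | PCI_IRQ_AFFINITY,
					      &affd);
}

static int example_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
{
	/* hw queue N only sees requests submitted on the CPUs of vector N */
	return blk_mq_pci_map_queues(&set->map[HCTX_TYPE_DEFAULT], pdev, 0);
}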

> 

>>>>> Please give this new patch a shot on your system (my D05 doesn't have

>>>>> any managed devices):

>>>>

>>>> We could consider supporting platform msi managed interrupts, but I

>>>> doubt the value.

>>> It shouldn't be hard to do, and most of the existing code could be

>>> moved to the generic level. As for the value, I'm not convinced

>>> either. For example D05 uses the MBIGEN as an intermediate interrupt

>>> controller, so MSIs are from the PoV of MBIGEN, and not the SAS device

>>> attached to it. Not the best design...

>>

>> JFYI, I did raise this following topic before, but that's as far as I 

>> got:

>>

>> https://marc.info/?l=linux-block&m=150722088314310&w=2

> 

> Yes. And that's probably not very hard, but the problem in your case is

> that the D05 HW is not using MSIs...


Right

> You'd have to provide an abstraction
> for wired interrupts (please don't).

> 

> You'd be better off directly setting the affinity of the interrupts from

> the driver, but I somehow can't believe that you're only submitting 

> requests

> from the same CPU,


Maybe...

> always. There must be something I'm missing.
> 


Thanks,
John
John Garry Dec. 20, 2019, 11:30 a.m. UTC | #37
>> So you enqueue requests from CPU0 only? It seems a bit odd...

> 

> No, but maybe I wasn't clear enough. I'll give an overview:

> 

> For D06 SAS controller - which is a multi-queue PCI device - we use 

> managed interrupts. The HW has 16 submission/completion queues, so for 

> 96 cores, we have an even spread of 6 CPUs assigned per queue; and this 

> per-queue CPU mask is the interrupt affinity mask. So CPU0-5 would 

> submit any IO on queue0, CPU6-11 on queue2, and so on. PCI NVMe is 

> essentially the same.

> 

> These are the environments which we're trying to promote performance.

> 

> Then for D05 SAS controller - which is multi-queue platform device 

> (mbigen) - we don't use managed interrupts. We still submit IO from any 

> CPU, but we choose the queue to submit IO on a round-robin basis to 

> promote some isolation, i.e. reduce inter-queue lock contention, so the 

> queue chosen has nothing to do with the CPU.

> 

> And with your change we may submit on cpu4 but service the interrupt on 

> cpu30, as an example. While previously we would always service on cpu0. 

> The old way still isn't ideal, I'll admit.

> 

> For this env, we would just like to maintain the same performance. And 

> it's here that we see the performance drop.

> 


Hi Marc,

We've got some more results and it looks promising.

So with your patch we get a performance boost of 3180.1K -> 3294.9K IOPS 
in the D06 SAS env. Then when we change the driver to use threaded 
interrupt handler (mainline currently uses tasklet), we get a boost 
again up to 3415K IOPS.

Now this is essentially the same figure we had with using threaded 
handler + the gen irq change in spreading the handler CPU affinity. We 
did also test your patch + gen irq change and got a performance drop, to 
3347K IOPS.

So tentatively I'd say your patch may be all we need.

FYI, here is how the effective affinity is looking for both SAS 
controllers with your patch:

74:02.0
irq 81, cpu list 24-29, effective list 24 cq
irq 82, cpu list 30-35, effective list 30 cq
irq 83, cpu list 36-41, effective list 36 cq
irq 84, cpu list 42-47, effective list 42 cq
irq 85, cpu list 48-53, effective list 48 cq
irq 86, cpu list 54-59, effective list 56 cq
irq 87, cpu list 60-65, effective list 60 cq
irq 88, cpu list 66-71, effective list 66 cq
irq 89, cpu list 72-77, effective list 72 cq
irq 90, cpu list 78-83, effective list 78 cq
irq 91, cpu list 84-89, effective list 84 cq
irq 92, cpu list 90-95, effective list 90 cq
irq 93, cpu list 0-5, effective list 0 cq
irq 94, cpu list 6-11, effective list 6 cq
irq 95, cpu list 12-17, effective list 12 cq
irq 96, cpu list 18-23, effective list 18 cq

74:04.0
irq 113, cpu list 24-29, effective list 25 cq
irq 114, cpu list 30-35, effective list 31 cq
irq 115, cpu list 36-41, effective list 37 cq
irq 116, cpu list 42-47, effective list 43 cq
irq 117, cpu list 48-53, effective list 49 cq
irq 118, cpu list 54-59, effective list 57 cq
irq 119, cpu list 60-65, effective list 61 cq
irq 120, cpu list 66-71, effective list 67 cq
irq 121, cpu list 72-77, effective list 73 cq
irq 122, cpu list 78-83, effective list 79 cq
irq 123, cpu list 84-89, effective list 85 cq
irq 124, cpu list 90-95, effective list 91 cq
irq 125, cpu list 0-5, effective list 1 cq
irq 126, cpu list 6-11, effective list 7 cq
irq 127, cpu list 12-17, effective list 17 cq
irq 128, cpu list 18-23, effective list 19 cq

As for your patch itself, I'm still concerned about possible regressions if 
we don't apply this effective interrupt affinity spread policy only to 
managed interrupts.
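
Something like the below is what I have in mind - just a sketch, and I'm 
assuming its_pick_target_cpu() from your patch takes the affinity mask:

/*
 * Sketch only: restrict the "least loaded CPU" selection to managed
 * interrupts and keep the old first-CPU behaviour otherwise.
 */
static int example_select_cpu(struct irq_data *d,
			      const struct cpumask *aff_mask)
{
	if (irqd_affinity_is_managed(d))
		return its_pick_target_cpu(aff_mask);

	return cpumask_first_and(aff_mask, cpu_online_mask);
}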

JFYI, about the NVMe CPU lockup issue, there are 2 works ongoing here:
https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t
https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t

Cheers,
John

Ps. Thanks to Xiang Chen for all the work here in getting these results.

>>

>>>>>> Please give this new patch a shot on your system (my D05 doesn't have

>>>>>> any managed devices):

>>>>>

>>>>> We could consider supporting platform msi managed interrupts, but I
Marc Zyngier Dec. 20, 2019, 2:43 p.m. UTC | #38
Hi John,

On 2019-12-20 11:30, John Garry wrote:
>>> So you enqueue requests from CPU0 only? It seems a bit odd...

>> No, but maybe I wasn't clear enough. I'll give an overview:

>> For D06 SAS controller - which is a multi-queue PCI device - we use 

>> managed interrupts. The HW has 16 submission/completion queues, so for 

>> 96 cores, we have an even spread of 6 CPUs assigned per queue; and 

>> this per-queue CPU mask is the interrupt affinity mask. So CPU0-5 

>> would submit any IO on queue0, CPU6-11 on queue2, and so on. PCI NVMe 

>> is essentially the same.

>> These are the environments which we're trying to promote 

>> performance.

>> Then for D05 SAS controller - which is multi-queue platform device 

>> (mbigen) - we don't use managed interrupts. We still submit IO from 

>> any CPU, but we choose the queue to submit IO on a round-robin basis 

>> to promote some isolation, i.e. reduce inter-queue lock contention, so 

>> the queue chosen has nothing to do with the CPU.

>> And with your change we may submit on cpu4 but service the interrupt 

>> on cpu30, as an example. While previously we would always service on 

>> cpu0. The old way still isn't ideal, I'll admit.

>> For this env, we would just like to maintain the same performance. 

>> And it's here that we see the performance drop.

>>

>

> Hi Marc,

>

> We've got some more results and it looks promising.

>

> So with your patch we get a performance boost of 3180.1K -> 3294.9K

> IOPS in the D06 SAS env. Then when we change the driver to use

> threaded interrupt handler (mainline currently uses tasklet), we get 

> a

> boost again up to 3415K IOPS.

>

> Now this is essentially the same figure we had with using threaded

> handler + the gen irq change in spreading the handler CPU affinity. 

> We

> did also test your patch + gen irq change and got a performance drop,

> to 3347K IOPS.

>

> So tentatively I'd say your patch may be all we need.


OK.

> FYI, here is how the effective affinity is looking for both SAS

> controllers with your patch:

>

> 74:02.0

> irq 81, cpu list 24-29, effective list 24 cq

> irq 82, cpu list 30-35, effective list 30 cq


Cool.

[...]

> As for your patch itself, I'm still concerned of possible regressions

> if we don't apply this effective interrupt affinity spread policy to

> only managed interrupts.


I'll try and revise that as I post the patch, probably at some point
between now and Christmas. I still think we should find a way to
address this for the D05 SAS driver though, maybe by managing the
affinity yourself in the driver. But this requires experimentation.

> JFYI, about NVMe CPU lockup issue, there are 2 works on going here:

> 

> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t

> 

> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t


I've also managed to trigger some of them now that I have access to
a decent box with nvme storage. Out of curiosity, have you tried
with the SMMU disabled? I'm wondering whether we hit some livelock
condition on unmapping buffers...

> Cheers,

> John

>

> Ps. Thanks to Xiang Chen for all the work here in getting these 

> results.


Yup, much appreciated!

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 20, 2019, 3:38 p.m. UTC | #39
>> We've got some more results and it looks promising.

>>

>> So with your patch we get a performance boost of 3180.1K -> 3294.9K

>> IOPS in the D06 SAS env. Then when we change the driver to use

>> threaded interrupt handler (mainline currently uses tasklet), we get a

>> boost again up to 3415K IOPS.

>>

>> Now this is essentially the same figure we had with using threaded

>> handler + the gen irq change in spreading the handler CPU affinity. We

>> did also test your patch + gen irq change and got a performance drop,

>> to 3347K IOPS.

>>

>> So tentatively I'd say your patch may be all we need.

> 

> OK.

> 

>> FYI, here is how the effective affinity is looking for both SAS

>> controllers with your patch:

>>

>> 74:02.0

>> irq 81, cpu list 24-29, effective list 24 cq

>> irq 82, cpu list 30-35, effective list 30 cq

> 

> Cool.

> 

> [...]

> 

>> As for your patch itself, I'm still concerned of possible regressions

>> if we don't apply this effective interrupt affinity spread policy to

>> only managed interrupts.

> 

> I'll try and revise that as I post the patch, probably at some point

> between now and Christmas. I still think we should find a way to

> address this for the D05 SAS driver though, maybe by managing the

> affinity yourself in the driver. But this requires experimentation.


I've already done something experimental for the driver to manage the 
affinity, and performance is generally much better:

https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7

But I still think it's wise to only consider managed interrupts for now.
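
For reference, the experiment boils down to roughly this pattern (a 
simplified sketch with made-up names, not the code in the commit above): 
spread each completion queue's non-managed interrupt over its own chunk 
of CPUs via irq_set_affinity_hint().

#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/kernel.h>

struct example_queue {
	int irq;
	struct cpumask mask;
};

struct example_hba {
	unsigned int nr_queues;
	struct example_queue *queues;
};

static void example_spread_irq_affinity(struct example_hba *hba)
{
	unsigned int nr_cpus = num_online_cpus();
	unsigned int per_queue = max(1U, nr_cpus / hba->nr_queues);
	unsigned int i, cpu;

	for (i = 0; i < hba->nr_queues; i++) {
		cpumask_clear(&hba->queues[i].mask);
		/* give queue i its own contiguous chunk of CPUs */
		for (cpu = i * per_queue;
		     cpu < (i + 1) * per_queue && cpu < nr_cpus; cpu++)
			cpumask_set_cpu(cpu, &hba->queues[i].mask);
		/* also applies the mask as the initial affinity */
		irq_set_affinity_hint(hba->queues[i].irq, &hba->queues[i].mask);
	}
}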

> 

>> JFYI, about NVMe CPU lockup issue, there are 2 works on going here:

>>

>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t 

>>

>>

>> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t 

>>

> 

> I've also managed to trigger some of them now that I have access to

> a decent box with nvme storage. 


I only have 2x NVMe SSDs when this occurs - I should not be hitting this...

> Out of curiosity, have you tried
> with the SMMU disabled? I'm wondering whether we hit some livelock

> condition on unmapping buffers...


No, but I can give it a try. Doing that should lower the CPU usage, 
though, so it may mask the issue - probably not.

Much appreciated,
John
Marc Zyngier Dec. 20, 2019, 4:16 p.m. UTC | #40
On 2019-12-20 15:38, John Garry wrote:

> I've already done something experimental for the driver to manage the

> affinity, and performance is generally much better:

>

> 

> https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7

>

> But I still think it's wise to only consider managed interrupts for 

> now.


Sure. We've lived with it so far, we can make it last a bit longer... 
;-)

>>

>>> JFYI, about NVMe CPU lockup issue, there are 2 works on going here:

>>>

>>> 

>>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t

>>>

>>>

>>> 

>>> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t

>>>

>> I've also managed to trigger some of them now that I have access to

>> a decent box with nvme storage.

>

> I only have 2x NVMe SSDs when this occurs - I should not be hitting 

> this...


Same configuration here. And the number of interrupts is pretty
low (less than 20k/s per CPU), so I doubt this is interrupt related.

> Out of curiosity, have you tried

>> with the SMMU disabled? I'm wondering whether we hit some livelock

>> condition on unmapping buffers...

>

> No, but I can give it a try. Doing that should lower the CPU usage,

> though, so maybe masks the issue - probably not.


I wonder whether we could end up in some form of unmap storm on
completion, with a CPU being starved trying to insert its TLBI
command into the queue.

Anyway, more digging in perspective.

         M.
-- 
Jazz is not dead. It just smells funny...
Ming Lei Dec. 20, 2019, 11:31 p.m. UTC | #41
On Fri, Dec 20, 2019 at 03:38:24PM +0000, John Garry wrote:
> > > We've got some more results and it looks promising.

> > > 

> > > So with your patch we get a performance boost of 3180.1K -> 3294.9K

> > > IOPS in the D06 SAS env. Then when we change the driver to use

> > > threaded interrupt handler (mainline currently uses tasklet), we get a

> > > boost again up to 3415K IOPS.

> > > 

> > > Now this is essentially the same figure we had with using threaded

> > > handler + the gen irq change in spreading the handler CPU affinity. We

> > > did also test your patch + gen irq change and got a performance drop,

> > > to 3347K IOPS.

> > > 

> > > So tentatively I'd say your patch may be all we need.

> > 

> > OK.

> > 

> > > FYI, here is how the effective affinity is looking for both SAS

> > > controllers with your patch:

> > > 

> > > 74:02.0

> > > irq 81, cpu list 24-29, effective list 24 cq

> > > irq 82, cpu list 30-35, effective list 30 cq

> > 

> > Cool.

> > 

> > [...]

> > 

> > > As for your patch itself, I'm still concerned of possible regressions

> > > if we don't apply this effective interrupt affinity spread policy to

> > > only managed interrupts.

> > 

> > I'll try and revise that as I post the patch, probably at some point

> > between now and Christmas. I still think we should find a way to

> > address this for the D05 SAS driver though, maybe by managing the

> > affinity yourself in the driver. But this requires experimentation.

> 

> I've already done something experimental for the driver to manage the

> affinity, and performance is generally much better:

> 

> https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7

> 

> But I still think it's wise to only consider managed interrupts for now.

> 

> > 

> > > JFYI, about NVMe CPU lockup issue, there are 2 works on going here:

> > > 

> > > https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t

> > > 

> > > 

> > > https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t

> > > 

> > 

> > I've also managed to trigger some of them now that I have access to

> > a decent box with nvme storage.

> 

> I only have 2x NVMe SSDs when this occurs - I should not be hitting this...

> 

> Out of curiosity, have you tried

> > with the SMMU disabled? I'm wondering whether we hit some livelock

> > condition on unmapping buffers...

> 

> No, but I can give it a try. Doing that should lower the CPU usage, though,

> so maybe masks the issue - probably not.


Lots of CPU lockups can be a performance issue if there isn't an obvious bug.

I am wondering if you could explain a bit why enabling the SMMU may save
a bit of CPU?

Thanks,
Ming
Marc Zyngier Dec. 23, 2019, 9:07 a.m. UTC | #42
On 2019-12-20 23:31, Ming Lei wrote:
> On Fri, Dec 20, 2019 at 03:38:24PM +0000, John Garry wrote:

>> > > We've got some more results and it looks promising.

>> > >

>> > > So with your patch we get a performance boost of 3180.1K -> 

>> 3294.9K

>> > > IOPS in the D06 SAS env. Then when we change the driver to use

>> > > threaded interrupt handler (mainline currently uses tasklet), we 

>> get a

>> > > boost again up to 3415K IOPS.

>> > >

>> > > Now this is essentially the same figure we had with using 

>> threaded

>> > > handler + the gen irq change in spreading the handler CPU 

>> affinity. We

>> > > did also test your patch + gen irq change and got a performance 

>> drop,

>> > > to 3347K IOPS.

>> > >

>> > > So tentatively I'd say your patch may be all we need.

>> >

>> > OK.

>> >

>> > > FYI, here is how the effective affinity is looking for both SAS

>> > > controllers with your patch:

>> > >

>> > > 74:02.0

>> > > irq 81, cpu list 24-29, effective list 24 cq

>> > > irq 82, cpu list 30-35, effective list 30 cq

>> >

>> > Cool.

>> >

>> > [...]

>> >

>> > > As for your patch itself, I'm still concerned of possible 

>> regressions

>> > > if we don't apply this effective interrupt affinity spread 

>> policy to

>> > > only managed interrupts.

>> >

>> > I'll try and revise that as I post the patch, probably at some 

>> point

>> > between now and Christmas. I still think we should find a way to

>> > address this for the D05 SAS driver though, maybe by managing the

>> > affinity yourself in the driver. But this requires 

>> experimentation.

>>

>> I've already done something experimental for the driver to manage 

>> the

>> affinity, and performance is generally much better:

>>

>> 

>> https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7

>>

>> But I still think it's wise to only consider managed interrupts for 

>> now.

>>

>> >

>> > > JFYI, about NVMe CPU lockup issue, there are 2 works on going 

>> here:

>> > >

>> > > 

>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t

>> > >

>> > >

>> > > 

>> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t

>> > >

>> >

>> > I've also managed to trigger some of them now that I have access 

>> to

>> > a decent box with nvme storage.

>>

>> I only have 2x NVMe SSDs when this occurs - I should not be hitting 

>> this...

>>

>> Out of curiosity, have you tried

>> > with the SMMU disabled? I'm wondering whether we hit some livelock

>> > condition on unmapping buffers...

>>

>> No, but I can give it a try. Doing that should lower the CPU usage, 

>> though,

>> so maybe masks the issue - probably not.

>

> Lots of CPU lockup can is performance issue if there isn't obvious 

> bug.

>

> I am wondering if you may explain it a bit why enabling SMMU may save

> CPU a it?


The other way around: mapping/unmapping IOVAs doesn't come for free.
I'm trying to find out whether the NVMe map/unmap patterns trigger
something unexpected in the SMMU driver, but that's a very long shot.

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 23, 2019, 10:26 a.m. UTC | #43
>>> > I've also managed to trigger some of them now that I have access to

>>> > a decent box with nvme storage.

>>>

>>> I only have 2x NVMe SSDs when this occurs - I should not be hitting 

>>> this...

>>>

>>> Out of curiosity, have you tried

>>> > with the SMMU disabled? I'm wondering whether we hit some livelock

>>> > condition on unmapping buffers...

>>>

>>> No, but I can give it a try. Doing that should lower the CPU usage, 

>>> though,

>>> so maybe masks the issue - probably not.

>>

>> Lots of CPU lockup can is performance issue if there isn't obvious bug.

>>

>> I am wondering if you may explain it a bit why enabling SMMU may save

>> CPU a it?

> 

> The other way around. mapping/unmapping IOVAs doesn't comes for free.

> I'm trying to find out whether the NVMe map/unmap patterns trigger

> something unexpected in the SMMU driver, but that's a very long shot.


So I tested v5.5-rc3 with and without the SMMU enabled, and without the 
SMMU enabled I don't get the lockup.

fio summary SMMU enabled:

john@ubuntu:~$ dmesg | grep "Adding to iommu group"
[   10.550212] hisi_sas_v3_hw 0000:74:02.0: Adding to iommu group 0
[   14.773231] nvme 0000:04:00.0: Adding to iommu group 1
[   14.784000] nvme 0000:81:00.0: Adding to iommu group 2
[   14.794884] ahci 0000:74:03.0: Adding to iommu group 3

[snip]

sudo  sh create_fio_task_cpu_liuyifan_nvme.sh 4k read 20 1
Creat 4k_read_depth20_fiotest file sucessfully
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
fio-3.1
Starting 20 processes
[  110.155618] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 04m:11s]
[  110.161360] rcu: 4-....: (1 GPs behind) idle=00e/1/0x4000000000000004 
softirq=1284/4115 fqs=2625
[  173.167743] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 03m:08s]
[  173.173484] rcu: 29-....: (1 GPs behind) idle=e1e/0/0x3 
softirq=662/5436 fqs=10501
[  236.179623] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 02m:05s]
[  236.185362] rcu: 29-....: (1 GPs behind) idle=e1e/0/0x3 
softirq=662/5436 fqs=18220
[  271.735648] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 01m:30s]
[  271.741387] rcu: 16-....: (1 GPs behind) 
idle=fb6/1/0x4000000000000002 softirq=858/1168 fqs=2605
[  334.747590] rcu: INFO: rcu_preempt self-detected stall on CPU0 
IOPS][eta 00m:27s]
[  334.753328] rcu: 0-....: (1 GPs behind) idle=57a/1/0x4000000000000002 
softirq=1384/1384 fqs=10309
Jobs: 20 (f=20): [R(20)][100.0%][r=4230MiB/s,w=0KiB/s][r=1083k,w=0 
IOPS][eta 00m:00s]
job1: (groupid=0, jobs=20): err= 0: pid=1242: Mon Dec 23 09:45:12 2019
    read: IOPS=1183k, BW=4621MiB/s (4846MB/s)(1354GiB/300002msec)
     slat (usec): min=2, max=183172k, avg= 6.47, stdev=12846.53
     clat (usec): min=4, max=183173k, avg=330.40, stdev=63380.85
      lat (usec): min=20, max=183173k, avg=337.02, stdev=64670.18
     clat percentiles (usec):
      |  1.00th=[  104],  5.00th=[  112], 10.00th=[  137], 20.00th=[  182],
      | 30.00th=[  219], 40.00th=[  245], 50.00th=[  269], 60.00th=[  297],
      | 70.00th=[  338], 80.00th=[  379], 90.00th=[  429], 95.00th=[  482],
      | 99.00th=[  635], 99.50th=[  742], 99.90th=[ 1221], 99.95th=[ 1876],
      | 99.99th=[ 6194]
    bw (  KiB/s): min=   32, max=733328, per=5.75%, avg=272330.58, 
stdev=110721.72, samples=10435
    iops        : min=    8, max=183332, avg=68082.49, stdev=27680.43, 
samples=10435
   lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.46%, 250=41.97%
   lat (usec)   : 500=53.32%, 750=3.78%, 1000=0.31%
   lat (msec)   : 2=0.11%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec)   : 2000=0.01%, >=2000=0.01%
   cpu          : usr=8.38%, sys=33.43%, ctx=134950965, majf=0, minf=4371
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
 >=64=0.0%

      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%

      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%

      issued rwt: total=354924097,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=20

Run status group 0 (all jobs):
    READ: bw=4621MiB/s (4846MB/s), 4621MiB/s-4621MiB/s 
(4846MB/s-4846MB/s), io=1354GiB (1454GB), run=300002-300002msec

Disk stats (read/write):
   nvme0n1: ios=187325975/0, merge=0/0, ticks=49841664/0, 
in_queue=11620, util=100.00%
   nvme1n1: ios=167416192/0, merge=0/0, ticks=42280120/0, 
in_queue=194576, util=100.00%
john@ubuntu:~$


fio summary SMMU disabled:

john@ubuntu:~$ dmesg | grep "Adding to iommu group"
john@ubuntu:~$


sudo  sh create_fio_task_cpu_liuyifan_nvme.sh 4k read 20 1
Creat 4k_read_depth20_fiotest file sucessfully
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=20
...
fio-3.1
Starting 20 processes
Jobs: 20 (f=20): [R(20)][100.0%][r=6053MiB/s,w=0KiB/s][r=1550k,w=0 
IOPS][eta 00m:00s]
job1: (groupid=0, jobs=20): err= 0: pid=1221: Mon Dec 23 09:54:15 2019
    read: IOPS=1539k, BW=6011MiB/s (6303MB/s)(1761GiB/300001msec)
     slat (usec): min=2, max=224572, avg= 4.44, stdev=14.57
     clat (usec): min=11, max=238636, avg=254.59, stdev=140.45
      lat (usec): min=15, max=240025, avg=259.17, stdev=142.61
     clat percentiles (usec):
      |  1.00th=[   94],  5.00th=[  125], 10.00th=[  167], 20.00th=[  208],
      | 30.00th=[  221], 40.00th=[  227], 50.00th=[  237], 60.00th=[  247],
      | 70.00th=[  262], 80.00th=[  281], 90.00th=[  338], 95.00th=[  420],
      | 99.00th=[  701], 99.50th=[  857], 99.90th=[ 1270], 99.95th=[ 1483],
      | 99.99th=[ 2114]
    bw (  KiB/s): min= 2292, max=429480, per=5.01%, avg=308068.30, 
stdev=36800.42, samples=12000
    iops        : min=  573, max=107370, avg=77016.89, stdev=9200.10, 
samples=12000
   lat (usec)   : 20=0.01%, 50=0.04%, 100=1.56%, 250=61.54%, 500=33.86%
   lat (usec)   : 750=2.19%, 1000=0.53%
   lat (msec)   : 2=0.26%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
   lat (msec)   : 100=0.01%, 250=0.01%
   cpu          : usr=11.50%, sys=40.49%, ctx=198764008, majf=0, minf=30760
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
 >=64=0.0%

      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%

      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%

      issued rwt: total=461640046,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=20

Run status group 0 (all jobs):
    READ: bw=6011MiB/s (6303MB/s), 6011MiB/s-6011MiB/s 
(6303MB/s-6303MB/s), io=1761GiB (1891GB), run=300001-300001msec

Disk stats (read/write):
   nvme0n1: ios=229212121/0, merge=0/0, ticks=56349577/0, in_queue=2908, 
util=100.00%
   nvme1n1: ios=232165508/0, merge=0/0, ticks=56708137/0, in_queue=372, 
util=100.00%
john@ubuntu:~$

Obviously this is not conclusive, especially with such limited testing - 
5 minute runs each. The CPU load goes up when disabling the SMMU, but 
that could be attributed to the extra throughput (1183K -> 1539K IOPS) 
loading the CPUs more.

I do notice that since we complete the NVMe request in irq context, we 
also do the DMA unmap, i.e. talk to the SMMU, in the same context, which 
is less than ideal.
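
i.e. something like this, much simplified (illustrative names, not the 
actual nvme driver code):

#include <linux/blk-mq.h>
#include <linux/dma-mapping.h>
#include <linux/interrupt.h>
#include <linux/scatterlist.h>

struct example_cmd {
	struct request *rq;
	struct scatterlist *sgl;
	int nents;
};

struct example_queue {
	struct device *dev;
	/* ... */
};

static struct example_cmd *example_pop_completion(struct example_queue *q);

static irqreturn_t example_irq(int irq, void *data)
{
	struct example_queue *q = data;
	struct example_cmd *cmd;

	while ((cmd = example_pop_completion(q))) {
		/*
		 * The DMA unmap - and with the SMMU enabled, the IOVA free
		 * and TLB invalidation - happens right here, in hard irq
		 * context.
		 */
		dma_unmap_sg(q->dev, cmd->sgl, cmd->nents, DMA_FROM_DEVICE);
		blk_mq_complete_request(cmd->rq);
	}

	return IRQ_HANDLED;
}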

I need to finish for the Christmas break today, so can't check this much 
further ATM.

Thanks,
John
Marc Zyngier Dec. 23, 2019, 10:47 a.m. UTC | #44
On 2019-12-23 10:26, John Garry wrote:
>>>> > I've also managed to trigger some of them now that I have access 

>>>> to

>>>> > a decent box with nvme storage.

>>>>

>>>> I only have 2x NVMe SSDs when this occurs - I should not be 

>>>> hitting this...

>>>>

>>>> Out of curiosity, have you tried

>>>> > with the SMMU disabled? I'm wondering whether we hit some 

>>>> livelock

>>>> > condition on unmapping buffers...

>>>>

>>>> No, but I can give it a try. Doing that should lower the CPU 

>>>> usage, though,

>>>> so maybe masks the issue - probably not.

>>>

>>> Lots of CPU lockup can is performance issue if there isn't obvious 

>>> bug.

>>>

>>> I am wondering if you may explain it a bit why enabling SMMU may 

>>> save

>>> CPU a it?

>> The other way around. mapping/unmapping IOVAs doesn't comes for 

>> free.

>> I'm trying to find out whether the NVMe map/unmap patterns trigger

>> something unexpected in the SMMU driver, but that's a very long 

>> shot.

>

> So I tested v5.5-rc3 with and without the SMMU enabled, and without

> the SMMU enabled I don't get the lockup.


OK, so my hunch wasn't completely off... At least we have something
to look into.

[...]

> Obviously this is not conclusive, especially with such limited

> testing - 5 minute runs each. The CPU load goes up when disabling the

> SMMU, but that could be attributed to extra throughput (1183K ->

> 1539K) loading.

>

> I do notice that since we complete the NVMe request in irq context,

> we also do the DMA unmap, i.e. talk to the SMMU, in the same context,

> which is less than ideal.


It depends on how much overhead invalidating the TLB adds to the
equation, but we should be able to do some tracing and find out.

> I need to finish for the Christmas break today, so can't check this

> much further ATM.


No worries. May I suggest creating a new thread in the new year, maybe
involving Robin and Will as well?

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
John Garry Dec. 23, 2019, 11:35 a.m. UTC | #45
On 23/12/2019 10:47, Marc Zyngier wrote:
> On 2019-12-23 10:26, John Garry wrote:

>>>>> > I've also managed to trigger some of them now that I have access to

>>>>> > a decent box with nvme storage.

>>>>>

>>>>> I only have 2x NVMe SSDs when this occurs - I should not be hitting 

>>>>> this...

>>>>>

>>>>> Out of curiosity, have you tried

>>>>> > with the SMMU disabled? I'm wondering whether we hit some livelock

>>>>> > condition on unmapping buffers...

>>>>>

>>>>> No, but I can give it a try. Doing that should lower the CPU usage, 

>>>>> though,

>>>>> so maybe masks the issue - probably not.

>>>>

>>>> Lots of CPU lockup can is performance issue if there isn't obvious bug.

>>>>

>>>> I am wondering if you may explain it a bit why enabling SMMU may save

>>>> CPU a it?

>>> The other way around. mapping/unmapping IOVAs doesn't comes for free.

>>> I'm trying to find out whether the NVMe map/unmap patterns trigger

>>> something unexpected in the SMMU driver, but that's a very long shot.

>>

>> So I tested v5.5-rc3 with and without the SMMU enabled, and without

>> the SMMU enabled I don't get the lockup.

> 

> OK, so my hunch wasn't completely off... At least we have something

> to look into.

> 

> [...]

> 

>> Obviously this is not conclusive, especially with such limited

>> testing - 5 minute runs each. The CPU load goes up when disabling the

>> SMMU, but that could be attributed to extra throughput (1183K ->

>> 1539K) loading.

>>

>> I do notice that since we complete the NVMe request in irq context,

>> we also do the DMA unmap, i.e. talk to the SMMU, in the same context,

>> which is less than ideal.

> 

> It depends on how much overhead invalidating the TLB adds to the

> equation, but we should be able to do some tracing and find out.


ok, but let's remember that x86 iommu uses non-strict unmapping by 
default, and they also see this issue.

> 

>> I need to finish for the Christmas break today, so can't check this

>> much further ATM.

> 

> No worries. May I suggest creating a new thread in the new year, maybe

> involving Robin and Will as well?


Can do, but it would be good to know how x86 fares, and also the IOMMU 
config used for testing when the lockup occurs.

Cheers,
John
Ming Lei Dec. 24, 2019, 1:59 a.m. UTC | #46
On Mon, Dec 23, 2019 at 10:47:07AM +0000, Marc Zyngier wrote:
> On 2019-12-23 10:26, John Garry wrote:

> > > > > > I've also managed to trigger some of them now that I have

> > > > > access to

> > > > > > a decent box with nvme storage.

> > > > > 

> > > > > I only have 2x NVMe SSDs when this occurs - I should not be

> > > > > hitting this...

> > > > > 

> > > > > Out of curiosity, have you tried

> > > > > > with the SMMU disabled? I'm wondering whether we hit some

> > > > > livelock

> > > > > > condition on unmapping buffers...

> > > > > 

> > > > > No, but I can give it a try. Doing that should lower the CPU

> > > > > usage, though,

> > > > > so maybe masks the issue - probably not.

> > > > 

> > > > Lots of CPU lockup can is performance issue if there isn't

> > > > obvious bug.

> > > > 

> > > > I am wondering if you may explain it a bit why enabling SMMU may

> > > > save

> > > > CPU a it?

> > > The other way around. mapping/unmapping IOVAs doesn't comes for

> > > free.

> > > I'm trying to find out whether the NVMe map/unmap patterns trigger

> > > something unexpected in the SMMU driver, but that's a very long

> > > shot.

> > 

> > So I tested v5.5-rc3 with and without the SMMU enabled, and without

> > the SMMU enabled I don't get the lockup.

> 

> OK, so my hunch wasn't completely off... At least we have something

> to look into.

> 

> [...]

> 

> > Obviously this is not conclusive, especially with such limited

> > testing - 5 minute runs each. The CPU load goes up when disabling the

> > SMMU, but that could be attributed to extra throughput (1183K ->

> > 1539K) loading.

> > 

> > I do notice that since we complete the NVMe request in irq context,

> > we also do the DMA unmap, i.e. talk to the SMMU, in the same context,

> > which is less than ideal.

> 

> It depends on how much overhead invalidating the TLB adds to the

> equation, but we should be able to do some tracing and find out.

> 

> > I need to finish for the Christmas break today, so can't check this

> > much further ATM.

> 

> No worries. May I suggest creating a new thread in the new year, maybe

> involving Robin and Will as well?


Zhang Yi has observed the CPU lockup issue once when running heavy IO on
a single nvme drive, so please CC him if you have a new patch to try.

Then it looks like the DMA unmap cost is too big on aarch64 if the SMMU
is involved.


Thanks,
Ming
Marc Zyngier Dec. 24, 2019, 11:20 a.m. UTC | #47
On 2019-12-24 01:59, Ming Lei wrote:
> On Mon, Dec 23, 2019 at 10:47:07AM +0000, Marc Zyngier wrote:

>> On 2019-12-23 10:26, John Garry wrote:

>> > > > > > I've also managed to trigger some of them now that I have

>> > > > > access to

>> > > > > > a decent box with nvme storage.

>> > > > >

>> > > > > I only have 2x NVMe SSDs when this occurs - I should not be

>> > > > > hitting this...

>> > > > >

>> > > > > Out of curiosity, have you tried

>> > > > > > with the SMMU disabled? I'm wondering whether we hit some

>> > > > > livelock

>> > > > > > condition on unmapping buffers...

>> > > > >

>> > > > > No, but I can give it a try. Doing that should lower the CPU

>> > > > > usage, though,

>> > > > > so maybe masks the issue - probably not.

>> > > >

>> > > > Lots of CPU lockup can is performance issue if there isn't

>> > > > obvious bug.

>> > > >

>> > > > I am wondering if you may explain it a bit why enabling SMMU 

>> may

>> > > > save

>> > > > CPU a it?

>> > > The other way around. mapping/unmapping IOVAs doesn't comes for

>> > > free.

>> > > I'm trying to find out whether the NVMe map/unmap patterns 

>> trigger

>> > > something unexpected in the SMMU driver, but that's a very long

>> > > shot.

>> >

>> > So I tested v5.5-rc3 with and without the SMMU enabled, and 

>> without

>> > the SMMU enabled I don't get the lockup.

>>

>> OK, so my hunch wasn't completely off... At least we have something

>> to look into.

>>

>> [...]

>>

>> > Obviously this is not conclusive, especially with such limited

>> > testing - 5 minute runs each. The CPU load goes up when disabling 

>> the

>> > SMMU, but that could be attributed to extra throughput (1183K ->

>> > 1539K) loading.

>> >

>> > I do notice that since we complete the NVMe request in irq 

>> context,

>> > we also do the DMA unmap, i.e. talk to the SMMU, in the same 

>> context,

>> > which is less than ideal.

>>

>> It depends on how much overhead invalidating the TLB adds to the

>> equation, but we should be able to do some tracing and find out.

>>

>> > I need to finish for the Christmas break today, so can't check 

>> this

>> > much further ATM.

>>

>> No worries. May I suggest creating a new thread in the new year, 

>> maybe

>> involving Robin and Will as well?

>

> Zhang Yi has observed the CPU lockup issue once when running heavy IO 

> on

> single nvme drive, and please CC him if you have new patch to try.


On which architecture? John was indicating that this also happens on 
x86.

> Then looks the DMA unmap cost is too big on aarch64 if SMMU is 

> involved.


So far, we don't have any data suggesting that this is actually the 
case.
Also, other workloads (such as networking) do not exhibit this 
behaviour,
while being at least as unmap-heavy as NVMe is.

If the cross-architecture aspect is confirmed, this points more in
the direction of an interaction between the NVMe subsystem and the
DMA API than of an architecture-specific problem.

Given that we have so far very little data, I'd hold off any 
conclusion.

         M.
-- 
Jazz is not dead. It just smells funny...
Ming Lei Dec. 25, 2019, 12:48 a.m. UTC | #48
On Tue, Dec 24, 2019 at 11:20:25AM +0000, Marc Zyngier wrote:
> On 2019-12-24 01:59, Ming Lei wrote:

> > On Mon, Dec 23, 2019 at 10:47:07AM +0000, Marc Zyngier wrote:

> > > On 2019-12-23 10:26, John Garry wrote:

> > > > > > > > I've also managed to trigger some of them now that I have

> > > > > > > access to

> > > > > > > > a decent box with nvme storage.

> > > > > > >

> > > > > > > I only have 2x NVMe SSDs when this occurs - I should not be

> > > > > > > hitting this...

> > > > > > >

> > > > > > > Out of curiosity, have you tried

> > > > > > > > with the SMMU disabled? I'm wondering whether we hit some

> > > > > > > livelock

> > > > > > > > condition on unmapping buffers...

> > > > > > >

> > > > > > > No, but I can give it a try. Doing that should lower the CPU

> > > > > > > usage, though,

> > > > > > > so maybe masks the issue - probably not.

> > > > > >

> > > > > > Lots of CPU lockup can is performance issue if there isn't

> > > > > > obvious bug.

> > > > > >

> > > > > > I am wondering if you may explain it a bit why enabling SMMU

> > > may

> > > > > > save

> > > > > > CPU a it?

> > > > > The other way around. mapping/unmapping IOVAs doesn't comes for

> > > > > free.

> > > > > I'm trying to find out whether the NVMe map/unmap patterns

> > > trigger

> > > > > something unexpected in the SMMU driver, but that's a very long

> > > > > shot.

> > > >

> > > > So I tested v5.5-rc3 with and without the SMMU enabled, and

> > > without

> > > > the SMMU enabled I don't get the lockup.

> > > 

> > > OK, so my hunch wasn't completely off... At least we have something

> > > to look into.

> > > 

> > > [...]

> > > 

> > > > Obviously this is not conclusive, especially with such limited

> > > > testing - 5 minute runs each. The CPU load goes up when disabling

> > > the

> > > > SMMU, but that could be attributed to extra throughput (1183K ->

> > > > 1539K) loading.

> > > >

> > > > I do notice that since we complete the NVMe request in irq

> > > context,

> > > > we also do the DMA unmap, i.e. talk to the SMMU, in the same

> > > context,

> > > > which is less than ideal.

> > > 

> > > It depends on how much overhead invalidating the TLB adds to the

> > > equation, but we should be able to do some tracing and find out.

> > > 

> > > > I need to finish for the Christmas break today, so can't check

> > > this

> > > > much further ATM.

> > > 

> > > No worries. May I suggest creating a new thread in the new year,

> > > maybe

> > > involving Robin and Will as well?

> > 

> > Zhang Yi has observed the CPU lockup issue once when running heavy IO on

> > single nvme drive, and please CC him if you have new patch to try.

> 

> On which architecture? John was indicating that this also happen on x86.


ARM64.

To be honest, I have never seen such a CPU lockup issue on x86 when running
heavy IO on a single NVMe drive.

> 

> > Then looks the DMA unmap cost is too big on aarch64 if SMMU is involved.

> 

> So far, we don't have any data suggesting that this is actually the case.

> Also, other workloads (such as networking) do not exhibit this behaviour,

> while being at least as unmap-heavy as NVMe is.


Maybe it is because networking workloads usually complete IO in softirq
context, instead of hard interrupt context.
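
i.e. the usual NAPI pattern - a rough sketch with illustrative names: the
hard irq handler only schedules the poller, and the unmap/complete work
runs later in softirq context.

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct example_netq {
	struct napi_struct napi;
	/* ... */
};

static bool example_rx_one(struct example_netq *q); /* unmap + pass skb up */

static irqreturn_t example_net_irq(int irq, void *data)
{
	struct example_netq *q = data;

	/* no unmapping here - just kick softirq processing */
	napi_schedule(&q->napi);
	return IRQ_HANDLED;
}

static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_netq *q = container_of(napi, struct example_netq, napi);
	int done = 0;

	while (done < budget && example_rx_one(q))
		done++;

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}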

> 

> If the cross-architecture aspect is confirmed, this points more into

> the direction of an interaction between the NVMe subsystem and the

> DMA API more than an architecture-specific problem.

> 

> Given that we have so far very little data, I'd hold off any conclusion.


We can start to collect latency data of dma unmapping vs nvme_irq()
on both x86 and arm64.

I will see if I can get such a box for collecting the latency data.


Thanks,
Ming
diff mbox series

Patch

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1753486b440c..8e7f8e758a88 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -968,7 +968,11 @@  irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
 	if (cpumask_available(desc->irq_common_data.affinity)) {
 		const struct cpumask *m;
 
-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		if (irqd_affinity_is_managed(&desc->irq_data))
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(
+					&desc->irq_data);
 		cpumask_copy(mask, m);
 	} else {
 		valid = false;