Message ID | 1575642904-58295-2-git-send-email-john.garry@huawei.com
---|---
State | New
Series | Threaded handler uses irq affinity for when the interrupt is managed
Hi John, On 2019-12-06 14:35, John Garry wrote: > Currently the cpu allowed mask for the threaded part of a threaded > irq > handler will be set to the effective affinity of the hard irq. > > Typically the effective affinity of the hard irq will be for a single > cpu. As such, > the threaded handler would always run on the same cpu as the hard > irq. > > We have seen scenarios in high data-rate throughput testing that the > cpu > handling the interrupt can be totally saturated handling both the > hard > interrupt and threaded handler parts, limiting throughput. > > For when the interrupt is managed, allow the threaded part to run on > all > cpus in the irq affinity mask. > > Signed-off-by: John Garry <john.garry@huawei.com> > --- > kernel/irq/manage.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c > index 1753486b440c..8e7f8e758a88 100644 > --- a/kernel/irq/manage.c > +++ b/kernel/irq/manage.c > @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, > struct irqaction *action) > if (cpumask_available(desc->irq_common_data.affinity)) { > const struct cpumask *m; > > - m = irq_data_get_effective_affinity_mask(&desc->irq_data); > + if (irqd_affinity_is_managed(&desc->irq_data)) > + m = desc->irq_common_data.affinity; > + else > + m = irq_data_get_effective_affinity_mask( > + &desc->irq_data); > cpumask_copy(mask, m); > } else { > valid = false; Although I completely understand that there are cases where you really want to let your thread roam all CPUs, I feel like changing this based on a seemingly unrelated property is likely to trigger yet another whack-a-mole episode. I'd feel much more comfortable if there was a way to let the IRQ subsystem know about what is best. Shouldn't the endpoint driver know better about it? Note that I have no data supporting an approach or the other, hence playing the role of the village idiot here. Thanks, M. -- Jazz is not dead. It just smells funny...
On 06/12/2019 15:22, Marc Zyngier wrote: Hi Marc, > > On 2019-12-06 14:35, John Garry wrote: >> Currently the cpu allowed mask for the threaded part of a threaded irq >> handler will be set to the effective affinity of the hard irq. >> >> Typically the effective affinity of the hard irq will be for a single >> cpu. As such, >> the threaded handler would always run on the same cpu as the hard irq. >> >> We have seen scenarios in high data-rate throughput testing that the cpu >> handling the interrupt can be totally saturated handling both the hard >> interrupt and threaded handler parts, limiting throughput. >> >> For when the interrupt is managed, allow the threaded part to run on all >> cpus in the irq affinity mask. >> >> Signed-off-by: John Garry <john.garry@huawei.com> >> --- >> kernel/irq/manage.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c >> index 1753486b440c..8e7f8e758a88 100644 >> --- a/kernel/irq/manage.c >> +++ b/kernel/irq/manage.c >> @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, >> struct irqaction *action) >> if (cpumask_available(desc->irq_common_data.affinity)) { >> const struct cpumask *m; >> >> - m = irq_data_get_effective_affinity_mask(&desc->irq_data); >> + if (irqd_affinity_is_managed(&desc->irq_data)) >> + m = desc->irq_common_data.affinity; >> + else >> + m = irq_data_get_effective_affinity_mask( >> + &desc->irq_data); >> cpumask_copy(mask, m); >> } else { >> valid = false; > > Although I completely understand that there are cases where you > really want to let your thread roam all CPUs, I feel like changing > this based on a seemingly unrelated property is likely to trigger > yet another whack-a-mole episode. I'd feel much more comfortable > if there was a way to let the IRQ subsystem know about what is best. > > Shouldn't the endpoint driver know better about it? I did propose that same idea here: https://lore.kernel.org/lkml/fd7d6101-37f4-2d34-f2f7-cfeade610278@huawei.com/ And that fits my agenda to get best throughput figures, while not possibly affecting others. But it seems that we could do better to make this a common policy: allow the threaded part to roam when that CPU is overloaded, but how...? Note that > I have no data supporting an approach or the other, hence playing > the role of the village idiot here. > Understood. My data is that we get an ~11% throughput boost for our storage test with this change. > Thanks, > > M. Thanks, John
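As a rough sketch of that "driver knows best" direction, a per-irqaction opt-in flag could carry the hint into the genirq core. Note that IRQF_THREAD_SPREAD below is hypothetical - no such flag exists in mainline - and the function body is paraphrased around the irq_thread_check_affinity() hunk quoted in the patch, so surrounding details may differ by kernel version.

/* Hypothetical flag - NOT an existing IRQF_* flag in the kernel. */
#define IRQF_THREAD_SPREAD	0x00100000

/*
 * Sketch of kernel/irq/manage.c::irq_thread_check_affinity(), keyed off a
 * driver-supplied flag rather than irqd_affinity_is_managed().
 */
static void irq_thread_check_affinity(struct irq_desc *desc,
				      struct irqaction *action)
{
	cpumask_var_t mask;
	bool valid = true;

	if (!test_and_clear_bit(IRQTF_AFFINITY, &action->thread_flags))
		return;

	/* Out of memory: retry on the next wakeup of the irq thread */
	if (!alloc_cpumask_var(&mask, GFP_KERNEL)) {
		set_bit(IRQTF_AFFINITY, &action->thread_flags);
		return;
	}

	raw_spin_lock_irq(&desc->lock);
	if (cpumask_available(desc->irq_common_data.affinity)) {
		const struct cpumask *m;

		/* Driver asked for the thread to roam the full irq affinity */
		if (action->flags & IRQF_THREAD_SPREAD)
			m = desc->irq_common_data.affinity;
		else
			m = irq_data_get_effective_affinity_mask(&desc->irq_data);
		cpumask_copy(mask, m);
	} else {
		valid = false;
	}
	raw_spin_unlock_irq(&desc->lock);

	if (valid)
		set_cpus_allowed_ptr(current, mask);
	free_cpumask_var(mask);
}

A driver would then pass such a flag to request_threaded_irq() only for the queues where it wants the threaded part to spread.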
On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
> Currently the cpu allowed mask for the threaded part of a threaded irq
> handler will be set to the effective affinity of the hard irq.
>
> Typically the effective affinity of the hard irq will be for a single cpu. As such,
> the threaded handler would always run on the same cpu as the hard irq.
>
> We have seen scenarios in high data-rate throughput testing that the cpu
> handling the interrupt can be totally saturated handling both the hard
> interrupt and threaded handler parts, limiting throughput.

Frankly speaking, I never observed that single CPU is saturated by one storage
completion queue's interrupt load. Because CPU is still much quicker than
current storage device.

If there are more drives, one CPU won't handle more than one queue(drive)'s
interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

So could you describe your case in a bit detail? Then we can confirm
if this change is really needed.

>
> For when the interrupt is managed, allow the threaded part to run on all
> cpus in the irq affinity mask.

I remembered that performance drop is observed by this approach in some
test.

Thanks,
Ming
On 07/12/2019 08:03, Ming Lei wrote:
> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
>> Currently the cpu allowed mask for the threaded part of a threaded irq
>> handler will be set to the effective affinity of the hard irq.
>>
>> Typically the effective affinity of the hard irq will be for a single
>> cpu. As such, the threaded handler would always run on the same cpu as
>> the hard irq.
>>
>> We have seen scenarios in high data-rate throughput testing that the cpu
>> handling the interrupt can be totally saturated handling both the hard
>> interrupt and threaded handler parts, limiting throughput.

Hi Ming,

> Frankly speaking, I never observed that single CPU is saturated by one storage
> completion queue's interrupt load. Because CPU is still much quicker than
> current storage device.
>
> If there are more drives, one CPU won't handle more than one queue(drive)'s
> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores.

Are things this simple? I mean, can you guarantee that fio processes are
evenly distributed as such?

>
> So could you describe your case in a bit detail? Then we can confirm
> if this change is really needed.

The issue is that the CPU is saturated in servicing the hard and threaded
part of the interrupt together - here's the sort of thing which we saw
previously:

Before:
CPU    %usr   %sys   %irq   %soft  %idle
all     2.9   13.1    1.2     4.6   78.2
  0     0.0   29.3   10.1    58.6    2.0
  1    18.2   39.4    0.0     1.0   41.4
  2     0.0    2.0    0.0     0.0   98.0

CPU0 effectively has no idle time.

Then, by allowing the threaded part to roam:

After:
CPU    %usr   %sys   %irq   %soft  %idle
all     3.5   18.4    2.7     6.8   68.6
  0     0.0   20.6   29.9    29.9   19.6
  1     0.0   39.8    0.0    50.0   10.2

Note: I think that I may be able to reduce the irq hard part load in the
endpoint driver, but not by so much that we would no longer see this issue.

>
>>
>> For when the interrupt is managed, allow the threaded part to run on all
>> cpus in the irq affinity mask.
>
> I remembered that performance drop is observed by this approach in some
> test.

From checking the thread about the NVMe interrupt swamp, just switching to
threaded handler alone degrades performance. I didn't see any specific
results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128

Thanks,
John
On 12/9/19 3:30 PM, John Garry wrote: > On 07/12/2019 08:03, Ming Lei wrote: >> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote: >>> Currently the cpu allowed mask for the threaded part of a threaded irq >>> handler will be set to the effective affinity of the hard irq. >>> >>> Typically the effective affinity of the hard irq will be for a single >>> cpu. As such, >>> the threaded handler would always run on the same cpu as the hard irq. >>> >>> We have seen scenarios in high data-rate throughput testing that the cpu >>> handling the interrupt can be totally saturated handling both the hard >>> interrupt and threaded handler parts, limiting throughput. >> > > Hi Ming, > >> Frankly speaking, I never observed that single CPU is saturated by one >> storage >> completion queue's interrupt load. Because CPU is still much quicker than >> current storage device. >> >> If there are more drives, one CPU won't handle more than one >> queue(drive)'s >> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores. > > Are things this simple? I mean, can you guarantee that fio processes are > evenly distributed as such? > I would assume that it does, seeing that that was the primary goal of fio ... >> >> So could you describe your case in a bit detail? Then we can confirm >> if this change is really needed. > > The issue is that the CPU is saturated in servicing the hard and > threaded part of the interrupt together - here's the sort of thing which > we saw previously: > Before: > CPU %usr %sys %irq %soft %idle > all 2.9 13.1 1.2 4.6 78.2 > 0 0.0 29.3 10.1 58.6 2.0 > 1 18.2 39.4 0.0 1.0 41.4 > 2 0.0 2.0 0.0 0.0 98.0 > > CPU0 has no effectively no idle. > > Then, by allowing the threaded part to roam: > After: > CPU %usr %sys %irq %soft %idle > all 3.5 18.4 2.7 6.8 68.6 > 0 0.0 20.6 29.9 29.9 19.6 > 1 0.0 39.8 0.0 50.0 10.2 > > Note: I think that I may be able to reduce the irq hard part load in the > endpoint driver, but not that much such that we see still this issue. > Well ... to get a _real_ comparison you would need to specify the number of irqs handled (and the resulting IOPS) alongside the cpu load. It might well be that by spreading out the interrupts to other CPUs we're increasing the latency, thus trivially reducing the load ... My idea here is slightly different: can't we leverage SMT? Most modern CPUs do SMT (I guess even ARM does it nowadays) (Yes, I know about spectre and things. We're talking performance here :-) So for 2-way SMT one could move the submisson queue on one side, and the completion queue handling (ie the irq handling) on the other side. Due to SMT we shouldn't suffer from cache misses (keep fingers crossed), and might even get better performance. John, would such a scenario work on your boxes? IE can we tweak the interrupt and queue assignment? Initially I would love to test things out, just to see what'll be happening; might be that it doesn't bring any benefit at all, but it'd be interesting to test out anyway. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On 2019-12-09 15:09, Hannes Reinecke wrote:

[slight digression]

> My idea here is slightly different: can't we leverage SMT?
> Most modern CPUs do SMT (I guess even ARM does it nowadays)
> (Yes, I know about spectre and things. We're talking performance here
> :-)

I only know two of those: Cavium TX2 and ARM Neoverse-E1.
ARM SMT CPUs are the absolute minority (and I can't say I'm displeased).

M,
-- 
Jazz is not dead. It just smells funny...
On 12/9/19 4:17 PM, Marc Zyngier wrote: > On 2019-12-09 15:09, Hannes Reinecke wrote: > > [slight digression] > >> My idea here is slightly different: can't we leverage SMT? >> Most modern CPUs do SMT (I guess even ARM does it nowadays) >> (Yes, I know about spectre and things. We're talking performance here :-) > > I only know two of those: Cavium TX2 and ARM Neoverse-E1. > ARM SMT CPUs are the absolute minority (and I can't say I'm displeased). > Ach, too bad. Still a nice idea, putting SMT finally to some use ... Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On 2019-12-09 15:25, Hannes Reinecke wrote:
> On 12/9/19 4:17 PM, Marc Zyngier wrote:
>> On 2019-12-09 15:09, Hannes Reinecke wrote:
>> [slight digression]
>>
>>> My idea here is slightly different: can't we leverage SMT?
>>> Most modern CPUs do SMT (I guess even ARM does it nowadays)
>>> (Yes, I know about spectre and things. We're talking performance
>>> here :-)
>> I only know two of those: Cavium TX2 and ARM Neoverse-E1.
>> ARM SMT CPUs are the absolute minority (and I can't say I'm
>> displeased).
>
> Ach, too bad.
>
> Still a nice idea, putting SMT finally to some use ...

But isn't your SMT idea just a special case of providing an affinity for
the thread (and in this case relative to the affinity of the hard IRQ)?
You could apply the same principle to target any CPU affinity, and maybe
provide hints for the placement if you're really keen (same L3, for
example).

M.
-- 
Jazz is not dead. It just smells funny...
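As an illustration of such a placement hint, the existing topology helpers are enough to derive a thread mask relative to the hard interrupt's effective CPU - for example its SMT siblings. The helper below is only a sketch with a made-up name (irq_thread_smt_hint() is not an existing kernel function):

/*
 * Sketch only: build a thread affinity hint from the SMT siblings of the
 * CPU that the hard irq is effectively targeting.
 */
static void irq_thread_smt_hint(struct irq_desc *desc,
				struct cpumask *thread_mask)
{
	const struct cpumask *eff =
		irq_data_get_effective_affinity_mask(&desc->irq_data);
	unsigned int cpu = cpumask_first(eff);

	if (cpu >= nr_cpu_ids) {
		/* No effective target known; fall back to all online CPUs */
		cpumask_copy(thread_mask, cpu_online_mask);
		return;
	}

	/* topology_sibling_cpumask() is the SMT sibling set of @cpu */
	cpumask_and(thread_mask, topology_sibling_cpumask(cpu), cpu_online_mask);

	/* Prefer a sibling other than the CPU taking the hard interrupt */
	if (cpumask_weight(thread_mask) > 1)
		cpumask_clear_cpu(cpu, thread_mask);
}

The same shape would work for other placements (same L3, same NUMA node) by swapping in a different topology mask.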
On 12/09/19 15:17, Marc Zyngier wrote: > On 2019-12-09 15:09, Hannes Reinecke wrote: > > [slight digression] > > > My idea here is slightly different: can't we leverage SMT? > > Most modern CPUs do SMT (I guess even ARM does it nowadays) > > (Yes, I know about spectre and things. We're talking performance here > > :-) > > I only know two of those: Cavium TX2 and ARM Neoverse-E1. There's the Cortex-A65 too. -- Qais Yousef > ARM SMT CPUs are the absolute minority (and I can't say I'm displeased). > > M, > -- > Jazz is not dead. It just smells funny...
On 2019-12-09 15:49, Qais Yousef wrote: > On 12/09/19 15:17, Marc Zyngier wrote: >> On 2019-12-09 15:09, Hannes Reinecke wrote: >> >> [slight digression] >> >> > My idea here is slightly different: can't we leverage SMT? >> > Most modern CPUs do SMT (I guess even ARM does it nowadays) >> > (Yes, I know about spectre and things. We're talking performance >> here >> > :-) >> >> I only know two of those: Cavium TX2 and ARM Neoverse-E1. > > There's the Cortex-A65 too. Which is the exact same core as E1 (but don't tell anyone... ;-). M. -- Jazz is not dead. It just smells funny...
On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote: > On 07/12/2019 08:03, Ming Lei wrote: > > On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote: > > > Currently the cpu allowed mask for the threaded part of a threaded irq > > > handler will be set to the effective affinity of the hard irq. > > > > > > Typically the effective affinity of the hard irq will be for a single cpu. As such, > > > the threaded handler would always run on the same cpu as the hard irq. > > > > > > We have seen scenarios in high data-rate throughput testing that the cpu > > > handling the interrupt can be totally saturated handling both the hard > > > interrupt and threaded handler parts, limiting throughput. > > > > Hi Ming, > > > Frankly speaking, I never observed that single CPU is saturated by one storage > > completion queue's interrupt load. Because CPU is still much quicker than > > current storage device. > > > > If there are more drives, one CPU won't handle more than one queue(drive)'s > > interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores. > > Are things this simple? I mean, can you guarantee that fio processes are > evenly distributed as such? That is why I ask you for the details of your test. If you mean hisilicon SAS, the interrupt load should have been distributed well given the device has multiple reply queues for distributing interrupt load. > > > > > So could you describe your case in a bit detail? Then we can confirm > > if this change is really needed. > > The issue is that the CPU is saturated in servicing the hard and threaded > part of the interrupt together - here's the sort of thing which we saw > previously: > Before: > CPU %usr %sys %irq %soft %idle > all 2.9 13.1 1.2 4.6 78.2 > 0 0.0 29.3 10.1 58.6 2.0 > 1 18.2 39.4 0.0 1.0 41.4 > 2 0.0 2.0 0.0 0.0 98.0 > > CPU0 has no effectively no idle. The result just shows the saturation, we need to root cause it instead of workaround it via random changes. > > Then, by allowing the threaded part to roam: > After: > CPU %usr %sys %irq %soft %idle > all 3.5 18.4 2.7 6.8 68.6 > 0 0.0 20.6 29.9 29.9 19.6 > 1 0.0 39.8 0.0 50.0 10.2 > > Note: I think that I may be able to reduce the irq hard part load in the > endpoint driver, but not that much such that we see still this issue. > > > > > > > > > For when the interrupt is managed, allow the threaded part to run on all > > > cpus in the irq affinity mask. > > > > I remembered that performance drop is observed by this approach in some > > test. > > From checking the thread about the NVMe interrupt swamp, just switching to > threaded handler alone degrades performance. I didn't see any specific > results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128 I am pretty clear the reason for Azure, which is caused by aggressive interrupt coalescing, and this behavior shouldn't be very common, and it can be addressed by the following patch: http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html Then please share your lockup story, such as, which HBA/drivers, test steps, if you complete IOs from multiple disks(LUNs) on single CPU, if you have multiple queues, how many active LUNs involved in the test, ... Thanks, Ming
On 10/12/2019 01:43, Ming Lei wrote: > On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote: >> On 07/12/2019 08:03, Ming Lei wrote: >>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote: >>>> Currently the cpu allowed mask for the threaded part of a threaded irq >>>> handler will be set to the effective affinity of the hard irq. >>>> >>>> Typically the effective affinity of the hard irq will be for a single cpu. As such, >>>> the threaded handler would always run on the same cpu as the hard irq. >>>> >>>> We have seen scenarios in high data-rate throughput testing that the cpu >>>> handling the interrupt can be totally saturated handling both the hard >>>> interrupt and threaded handler parts, limiting throughput. >>> Hi Ming, >>> Frankly speaking, I never observed that single CPU is saturated by one storage >>> completion queue's interrupt load. Because CPU is still much quicker than >>> current storage device. >>> >>> If there are more drives, one CPU won't handle more than one queue(drive)'s >>> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores. >> >> Are things this simple? I mean, can you guarantee that fio processes are >> evenly distributed as such? > > That is why I ask you for the details of your test. > > If you mean hisilicon SAS, Yes, it is. the interrupt load should have been distributed > well given the device has multiple reply queues for distributing interrupt > load. > >> >>> >>> So could you describe your case in a bit detail? Then we can confirm >>> if this change is really needed. >> >> The issue is that the CPU is saturated in servicing the hard and threaded >> part of the interrupt together - here's the sort of thing which we saw >> previously: >> Before: >> CPU %usr %sys %irq %soft %idle >> all 2.9 13.1 1.2 4.6 78.2 >> 0 0.0 29.3 10.1 58.6 2.0 >> 1 18.2 39.4 0.0 1.0 41.4 >> 2 0.0 2.0 0.0 0.0 98.0 >> >> CPU0 has no effectively no idle. > > The result just shows the saturation, we need to root cause it instead > of workaround it via random changes. > >> >> Then, by allowing the threaded part to roam: >> After: >> CPU %usr %sys %irq %soft %idle >> all 3.5 18.4 2.7 6.8 68.6 >> 0 0.0 20.6 29.9 29.9 19.6 >> 1 0.0 39.8 0.0 50.0 10.2 >> >> Note: I think that I may be able to reduce the irq hard part load in the >> endpoint driver, but not that much such that we see still this issue. >> >>> >>>> >>>> For when the interrupt is managed, allow the threaded part to run on all >>>> cpus in the irq affinity mask. >>> >>> I remembered that performance drop is observed by this approach in some >>> test. >> >> From checking the thread about the NVMe interrupt swamp, just switching to >> threaded handler alone degrades performance. I didn't see any specific >> results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128 > > I am pretty clear the reason for Azure, which is caused by aggressive interrupt > coalescing, and this behavior shouldn't be very common, and it can be > addressed by the following patch: > > http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html > > Then please share your lockup story, such as, which HBA/drivers, test steps, > if you complete IOs from multiple disks(LUNs) on single CPU, if you have > multiple queues, how many active LUNs involved in the test, ... There is no lockup, just a potential performance boost in this change. My colleague Xiang Chen can provide specifics of the test, as he is the one running it. 
But one key bit of info - which I did not think most relevant before - is
that we have 2x SAS controllers running the throughput test on the same
host.

As such, the completion queue interrupts would be spread identically over
the CPUs for each controller. I notice that the ARM GICv3 ITS interrupt
controller (which we use) does not use the generic irq matrix allocator,
which I think would really help with this.

Hi Marc,

Is there any reason why we couldn't utilise the generic irq matrix
allocator for GICv3?

Thanks,
John
On Tue, Dec 10, 2019 at 09:45:45AM +0000, John Garry wrote: > On 10/12/2019 01:43, Ming Lei wrote: > > On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote: > > > On 07/12/2019 08:03, Ming Lei wrote: > > > > On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote: > > > > > Currently the cpu allowed mask for the threaded part of a threaded irq > > > > > handler will be set to the effective affinity of the hard irq. > > > > > > > > > > Typically the effective affinity of the hard irq will be for a single cpu. As such, > > > > > the threaded handler would always run on the same cpu as the hard irq. > > > > > > > > > > We have seen scenarios in high data-rate throughput testing that the cpu > > > > > handling the interrupt can be totally saturated handling both the hard > > > > > interrupt and threaded handler parts, limiting throughput. > > > > > > Hi Ming, > > > > > Frankly speaking, I never observed that single CPU is saturated by one storage > > > > completion queue's interrupt load. Because CPU is still much quicker than > > > > current storage device. > > > > > > > > If there are more drives, one CPU won't handle more than one queue(drive)'s > > > > interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores. > > > > > > Are things this simple? I mean, can you guarantee that fio processes are > > > evenly distributed as such? > > > > That is why I ask you for the details of your test. > > > > If you mean hisilicon SAS, > > Yes, it is. > > the interrupt load should have been distributed > > well given the device has multiple reply queues for distributing interrupt > > load. > > > > > > > > > > > > > So could you describe your case in a bit detail? Then we can confirm > > > > if this change is really needed. > > > > > > The issue is that the CPU is saturated in servicing the hard and threaded > > > part of the interrupt together - here's the sort of thing which we saw > > > previously: > > > Before: > > > CPU %usr %sys %irq %soft %idle > > > all 2.9 13.1 1.2 4.6 78.2 > > > 0 0.0 29.3 10.1 58.6 2.0 > > > 1 18.2 39.4 0.0 1.0 41.4 > > > 2 0.0 2.0 0.0 0.0 98.0 > > > > > > CPU0 has no effectively no idle. > > > > The result just shows the saturation, we need to root cause it instead > > of workaround it via random changes. > > > > > > > > Then, by allowing the threaded part to roam: > > > After: > > > CPU %usr %sys %irq %soft %idle > > > all 3.5 18.4 2.7 6.8 68.6 > > > 0 0.0 20.6 29.9 29.9 19.6 > > > 1 0.0 39.8 0.0 50.0 10.2 > > > > > > Note: I think that I may be able to reduce the irq hard part load in the > > > endpoint driver, but not that much such that we see still this issue. > > > > > > > > > > > > > > > > > For when the interrupt is managed, allow the threaded part to run on all > > > > > cpus in the irq affinity mask. > > > > > > > > I remembered that performance drop is observed by this approach in some > > > > test. > > > > > > From checking the thread about the NVMe interrupt swamp, just switching to > > > threaded handler alone degrades performance. 
I didn't see any specific > > > results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128 > > > > I am pretty clear the reason for Azure, which is caused by aggressive interrupt > > coalescing, and this behavior shouldn't be very common, and it can be > > addressed by the following patch: > > > > http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html > > > > Then please share your lockup story, such as, which HBA/drivers, test steps, > > if you complete IOs from multiple disks(LUNs) on single CPU, if you have > > multiple queues, how many active LUNs involved in the test, ... > > There is no lockup, just a potential performance boost in this change. > > My colleague Xiang Chen can provide specifics of the test, as he is the one > running it. > > But one key bit of info - which I did not think most relevant before - that > is we have 2x SAS controllers running the throughput test on the same host. > > As such, the completion queue interrupts would be spread identically over > the CPUs for each controller. I notice that ARM GICv3 ITS interrupt > controller (which we use) does not use the generic irq matrix allocator, > which I think would really help with this. Yeah, looks only x86 uses irq matrix which seems abstracted from x86 arch code, and multiple NVMe may perform worse on non-x86 server. Also when running IO against multiple LUNs in single HBA, there is chance to saturate the completion CPU given multiple disks may be quicker than the single CPU. IRQ matrix can't help this case. Thanks, Ming
On 2019-12-10 09:45, John Garry wrote: > On 10/12/2019 01:43, Ming Lei wrote: >> On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote: >>> On 07/12/2019 08:03, Ming Lei wrote: >>>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote: >>>>> Currently the cpu allowed mask for the threaded part of a >>>>> threaded irq >>>>> handler will be set to the effective affinity of the hard irq. >>>>> >>>>> Typically the effective affinity of the hard irq will be for a >>>>> single cpu. As such, >>>>> the threaded handler would always run on the same cpu as the hard >>>>> irq. >>>>> >>>>> We have seen scenarios in high data-rate throughput testing that >>>>> the cpu >>>>> handling the interrupt can be totally saturated handling both the >>>>> hard >>>>> interrupt and threaded handler parts, limiting throughput. >>>> > > Hi Ming, > >>>> Frankly speaking, I never observed that single CPU is saturated by >>>> one storage >>>> completion queue's interrupt load. Because CPU is still much >>>> quicker than >>>> current storage device. >>>> >>>> If there are more drives, one CPU won't handle more than one >>>> queue(drive)'s >>>> interrupt if (nr_drive * nr_hw_queues) < nr_cpu_cores. >>> >>> Are things this simple? I mean, can you guarantee that fio >>> processes are >>> evenly distributed as such? >> That is why I ask you for the details of your test. >> If you mean hisilicon SAS, > > Yes, it is. > > the interrupt load should have been distributed >> well given the device has multiple reply queues for distributing >> interrupt >> load. >> >>> >>>> >>>> So could you describe your case in a bit detail? Then we can >>>> confirm >>>> if this change is really needed. >>> >>> The issue is that the CPU is saturated in servicing the hard and >>> threaded >>> part of the interrupt together - here's the sort of thing which we >>> saw >>> previously: >>> Before: >>> CPU %usr %sys %irq %soft %idle >>> all 2.9 13.1 1.2 4.6 78.2 >>> 0 0.0 29.3 10.1 58.6 2.0 >>> 1 18.2 39.4 0.0 1.0 41.4 >>> 2 0.0 2.0 0.0 0.0 98.0 >>> >>> CPU0 has no effectively no idle. >> The result just shows the saturation, we need to root cause it >> instead >> of workaround it via random changes. >> >>> >>> Then, by allowing the threaded part to roam: >>> After: >>> CPU %usr %sys %irq %soft %idle >>> all 3.5 18.4 2.7 6.8 68.6 >>> 0 0.0 20.6 29.9 29.9 19.6 >>> 1 0.0 39.8 0.0 50.0 10.2 >>> >>> Note: I think that I may be able to reduce the irq hard part load >>> in the >>> endpoint driver, but not that much such that we see still this >>> issue. >>> >>>> >>>>> >>>>> For when the interrupt is managed, allow the threaded part to run >>>>> on all >>>>> cpus in the irq affinity mask. >>>> >>>> I remembered that performance drop is observed by this approach in >>>> some >>>> test. >>> >>> From checking the thread about the NVMe interrupt swamp, just >>> switching to >>> threaded handler alone degrades performance. I didn't see any >>> specific >>> results for this change from Long Li - >>> https://lkml.org/lkml/2019/8/21/128 >> I am pretty clear the reason for Azure, which is caused by >> aggressive interrupt >> coalescing, and this behavior shouldn't be very common, and it can >> be >> addressed by the following patch: >> >> http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html >> Then please share your lockup story, such as, which HBA/drivers, >> test steps, >> if you complete IOs from multiple disks(LUNs) on single CPU, if you >> have >> multiple queues, how many active LUNs involved in the test, ... 
> > There is no lockup, just a potential performance boost in this > change. > > My colleague Xiang Chen can provide specifics of the test, as he is > the one running it. > > But one key bit of info - which I did not think most relevant before > - that is we have 2x SAS controllers running the throughput test on > the same host. > > As such, the completion queue interrupts would be spread identically > over the CPUs for each controller. I notice that ARM GICv3 ITS > interrupt controller (which we use) does not use the generic irq > matrix allocator, which I think would really help with this. > > Hi Marc, > > Is there any reason for which we couldn't utilise of the generic irq > matrix allocator for GICv3? For a start, the ITS code predates the matrix allocator by about three years. Also, my understanding of this allocator is that it allows x86 to cope with a very small number of possible interrupt vectors per CPU. The ITS doesn't have such issue, as: 1) the namespace is global, and not per CPU 2) the namespace is *huge* Now, what property of the matrix allocator is the ITS code missing? I'd be more than happy to improve it. Thanks, M. -- Jazz is not dead. It just smells funny...
>>
>> There is no lockup, just a potential performance boost in this change.
>>
>> My colleague Xiang Chen can provide specifics of the test, as he is
>> the one running it.
>>
>> But one key bit of info - which I did not think most relevant before
>> - that is we have 2x SAS controllers running the throughput test on
>> the same host.
>>
>> As such, the completion queue interrupts would be spread identically
>> over the CPUs for each controller. I notice that ARM GICv3 ITS
>> interrupt controller (which we use) does not use the generic irq
>> matrix allocator, which I think would really help with this.
>>
>> Hi Marc,
>>
>> Is there any reason for which we couldn't utilise of the generic irq
>> matrix allocator for GICv3?
>

Hi Marc,

> For a start, the ITS code predates the matrix allocator by about three
> years. Also, my understanding of this allocator is that it allows
> x86 to cope with a very small number of possible interrupt vectors
> per CPU. The ITS doesn't have such issue, as:
>
> 1) the namespace is global, and not per CPU
> 2) the namespace is *huge*
>
> Now, what property of the matrix allocator is the ITS code missing?
> I'd be more than happy to improve it.

I think specifically the property that the matrix allocator will try
to find a CPU for irq affinity which "has the lowest number of managed
IRQs allocated" - I'm quoting the comment on
matrix_find_best_cpu_managed().

The ITS code will make the lowest online CPU in the affinity mask the
target CPU for the interrupt, which may result in some CPUs handling
so many interrupts.

Thanks,
John
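For reference, that selection policy boils down to a "fewest managed interrupts" scan over the candidate mask. The snippet below is a simplified illustration of the idea - it is not the actual matrix_find_best_cpu_managed() from kernel/irq/matrix.c, which walks the allocator's own per-CPU maps rather than a standalone counter:

/* Illustrative per-CPU count of managed interrupts targeting each CPU */
static DEFINE_PER_CPU(unsigned int, managed_irq_count);

/* Return the online CPU in @msk currently handling the fewest managed irqs */
static unsigned int pick_least_loaded_cpu(const struct cpumask *msk)
{
	unsigned int cpu, best_cpu = nr_cpu_ids;
	unsigned int best = UINT_MAX;

	for_each_cpu_and(cpu, msk, cpu_online_mask) {
		unsigned int cnt = per_cpu(managed_irq_count, cpu);

		if (cnt < best) {
			best = cnt;
			best_cpu = cpu;
		}
	}
	return best_cpu;
}

The ITS prototype later in this thread applies essentially the same idea, using a per-CPU counter of LPIs (cpu_lpi_count) updated on activation and on affinity changes.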
On 2019-12-10 10:59, John Garry wrote:
>>>
>>> There is no lockup, just a potential performance boost in this
>>> change.
>>>
>>> My colleague Xiang Chen can provide specifics of the test, as he is
>>> the one running it.
>>>
>>> But one key bit of info - which I did not think most relevant before
>>> - that is we have 2x SAS controllers running the throughput test on
>>> the same host.
>>>
>>> As such, the completion queue interrupts would be spread identically
>>> over the CPUs for each controller. I notice that ARM GICv3 ITS
>>> interrupt controller (which we use) does not use the generic irq
>>> matrix allocator, which I think would really help with this.
>>>
>>> Hi Marc,
>>>
>>> Is there any reason for which we couldn't utilise of the generic irq
>>> matrix allocator for GICv3?
>>
> Hi Marc,
>
>> For a start, the ITS code predates the matrix allocator by about three
>> years. Also, my understanding of this allocator is that it allows
>> x86 to cope with a very small number of possible interrupt vectors
>> per CPU. The ITS doesn't have such issue, as:
>> 1) the namespace is global, and not per CPU
>> 2) the namespace is *huge*
>> Now, what property of the matrix allocator is the ITS code missing?
>> I'd be more than happy to improve it.
>
> I think specifically the property that the matrix allocator will try
> to find a CPU for irq affinity which "has the lowest number of managed
> IRQs allocated" - I'm quoting the comment on
> matrix_find_best_cpu_managed().

But that decision is due to allocation constraints. You can have at most
256 interrupts per CPU, so the allocator tries to balance it.

On the contrary, the ITS doesn't care about how many interrupts target any
given CPU. The whole 2^24 interrupt namespace can be thrown at a single
CPU.

> The ITS code will make the lowest online CPU in the affinity mask the
> target CPU for the interrupt, which may result in some CPUs handling
> so many interrupts.

If what you want is for the *default* affinity to be spread around,
that should be achieved pretty easily. Let me have a think about how
to do that.

M.
-- 
Jazz is not dead. It just smells funny...
On 10/12/2019 11:36, Marc Zyngier wrote: > On 2019-12-10 10:59, John Garry wrote: >>>> >>>> There is no lockup, just a potential performance boost in this change. >>>> >>>> My colleague Xiang Chen can provide specifics of the test, as he is >>>> the one running it. >>>> >>>> But one key bit of info - which I did not think most relevant before >>>> - that is we have 2x SAS controllers running the throughput test on >>>> the same host. >>>> >>>> As such, the completion queue interrupts would be spread identically >>>> over the CPUs for each controller. I notice that ARM GICv3 ITS >>>> interrupt controller (which we use) does not use the generic irq >>>> matrix allocator, which I think would really help with this. >>>> >>>> Hi Marc, >>>> >>>> Is there any reason for which we couldn't utilise of the generic irq >>>> matrix allocator for GICv3? >>> >> >> Hi Marc, >> >>> For a start, the ITS code predates the matrix allocator by about three >>> years. Also, my understanding of this allocator is that it allows >>> x86 to cope with a very small number of possible interrupt vectors >>> per CPU. The ITS doesn't have such issue, as: >>> 1) the namespace is global, and not per CPU >>> 2) the namespace is *huge* >>> Now, what property of the matrix allocator is the ITS code missing? >>> I'd be more than happy to improve it. >> >> I think specifically the property that the matrix allocator will try >> to find a CPU for irq affinity which "has the lowest number of managed >> IRQs allocated" - I'm quoting the comment on >> matrix_find_best_cpu_managed(). > > But that decision is due to allocation constraints. You can have at most > 256 interrupts per CPU, so the allocator tries to balance it. > > On the contrary, the ITS does care about how many interrupt target any > given CPU. The whole 2^24 interrupt namespace can be thrown at a single > CPU. > >> The ITS code will make the lowest online CPU in the affinity mask the >> target CPU for the interrupt, which may result in some CPUs handling >> so many interrupts. > > If what you want is for the *default* affinity to be spread around, > that should be achieved pretty easily. Let me have a think about how > to do that. Cool, I anticipate that it should help my case. I can also seek out some NVMe cards to see how it would help a more "generic" scenario. Cheers, John
On 2019-12-10 12:05, John Garry wrote: > On 10/12/2019 11:36, Marc Zyngier wrote: >> On 2019-12-10 10:59, John Garry wrote: >>>>> >>>>> There is no lockup, just a potential performance boost in this >>>>> change. >>>>> >>>>> My colleague Xiang Chen can provide specifics of the test, as he >>>>> is >>>>> the one running it. >>>>> >>>>> But one key bit of info - which I did not think most relevant >>>>> before >>>>> - that is we have 2x SAS controllers running the throughput test >>>>> on >>>>> the same host. >>>>> >>>>> As such, the completion queue interrupts would be spread >>>>> identically >>>>> over the CPUs for each controller. I notice that ARM GICv3 ITS >>>>> interrupt controller (which we use) does not use the generic irq >>>>> matrix allocator, which I think would really help with this. >>>>> >>>>> Hi Marc, >>>>> >>>>> Is there any reason for which we couldn't utilise of the generic >>>>> irq >>>>> matrix allocator for GICv3? >>>> >>> >>> Hi Marc, >>> >>>> For a start, the ITS code predates the matrix allocator by about >>>> three >>>> years. Also, my understanding of this allocator is that it allows >>>> x86 to cope with a very small number of possible interrupt vectors >>>> per CPU. The ITS doesn't have such issue, as: >>>> 1) the namespace is global, and not per CPU >>>> 2) the namespace is *huge* >>>> Now, what property of the matrix allocator is the ITS code >>>> missing? >>>> I'd be more than happy to improve it. >>> >>> I think specifically the property that the matrix allocator will >>> try >>> to find a CPU for irq affinity which "has the lowest number of >>> managed >>> IRQs allocated" - I'm quoting the comment on >>> matrix_find_best_cpu_managed(). >> But that decision is due to allocation constraints. You can have at >> most >> 256 interrupts per CPU, so the allocator tries to balance it. >> On the contrary, the ITS does care about how many interrupt target >> any >> given CPU. The whole 2^24 interrupt namespace can be thrown at a >> single >> CPU. >> >>> The ITS code will make the lowest online CPU in the affinity mask >>> the >>> target CPU for the interrupt, which may result in some CPUs >>> handling >>> so many interrupts. >> If what you want is for the *default* affinity to be spread around, >> that should be achieved pretty easily. Let me have a think about how >> to do that. > > Cool, I anticipate that it should help my case. > > I can also seek out some NVMe cards to see how it would help a more > "generic" scenario. Can you give the following a go? It probably has all kind of warts on top of the quality debug information, but I managed to get my D05 and a couple of guests to boot with it. It will probably eat your data, so use caution! ;-) Thanks, M. 
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index e05673bcd52b..301ee3bc0602 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -177,6 +177,8 @@ static DEFINE_IDA(its_vpeid_ida); #define gic_data_rdist_rd_base() (gic_data_rdist()->rd_base) #define gic_data_rdist_vlpi_base() (gic_data_rdist_rd_base() + SZ_128K) +static DEFINE_PER_CPU(atomic_t, cpu_lpi_count); + static u16 get_its_list(struct its_vm *vm) { struct its_node *its; @@ -1287,42 +1289,76 @@ static void its_unmask_irq(struct irq_data *d) lpi_update_config(d, 0, LPI_PROP_ENABLED); } +static int its_pick_target_cpu(struct its_device *its_dev, const struct cpumask *cpu_mask) +{ + unsigned int cpu = nr_cpu_ids, tmp; + int count = S32_MAX; + + for_each_cpu_and(tmp, cpu_mask, cpu_online_mask) { + int this_count = per_cpu(cpu_lpi_count, tmp).counter; + if (this_count < count) { + cpu = tmp; + count = this_count; + } + } + + return cpu; +} + static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val, bool force) { - unsigned int cpu; - const struct cpumask *cpu_mask = cpu_online_mask; struct its_device *its_dev = irq_data_get_irq_chip_data(d); - struct its_collection *target_col; + int ret = IRQ_SET_MASK_OK_DONE; u32 id = its_get_event_id(d); + cpumask_var_t tmpmask; /* A forwarded interrupt should use irq_set_vcpu_affinity */ if (irqd_is_forwarded_to_vcpu(d)) return -EINVAL; - /* lpi cannot be routed to a redistributor that is on a foreign node */ - if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) { - if (its_dev->its->numa_node >= 0) { - cpu_mask = cpumask_of_node(its_dev->its->numa_node); - if (!cpumask_intersects(mask_val, cpu_mask)) - return -EINVAL; + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + + cpumask_and(tmpmask, mask_val, cpu_online_mask); + + if (its_dev->its->numa_node >= 0) + cpumask_and(tmpmask, tmpmask, cpumask_of_node(its_dev->its->numa_node)); + + if (cpumask_empty(tmpmask)) { + /* LPI cannot be routed to a redistributor that is on a foreign node */ + if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) { + ret = -EINVAL; + goto out; } + + cpumask_copy(tmpmask, cpu_online_mask); } - cpu = cpumask_any_and(mask_val, cpu_mask); + if (!cpumask_test_cpu(its_dev->event_map.col_map[id], tmpmask)) { + struct its_collection *target_col; + int cpu; - if (cpu >= nr_cpu_ids) - return -EINVAL; + cpu = its_pick_target_cpu(its_dev, tmpmask); + if (cpu >= nr_cpu_ids) { + ret = -EINVAL; + goto out; + } - /* don't set the affinity when the target cpu is same as current one */ - if (cpu != its_dev->event_map.col_map[id]) { + pr_info("IRQ%d CPU%d -> CPU%d\n", + d->irq, its_dev->event_map.col_map[id], cpu); + atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu)); + atomic_dec(per_cpu_ptr(&cpu_lpi_count, + its_dev->event_map.col_map[id])); target_col = &its_dev->its->collections[cpu]; its_send_movi(its_dev, target_col, id); its_dev->event_map.col_map[id] = cpu; irq_data_update_effective_affinity(d, cpumask_of(cpu)); } - return IRQ_SET_MASK_OK_DONE; +out: + free_cpumask_var(tmpmask); + return ret; } static u64 its_irq_get_msi_base(struct its_device *its_dev) @@ -2773,22 +2809,28 @@ static int its_irq_domain_activate(struct irq_domain *domain, { struct its_device *its_dev = irq_data_get_irq_chip_data(d); u32 event = its_get_event_id(d); - const struct cpumask *cpu_mask = cpu_online_mask; - int cpu; + int cpu = nr_cpu_ids; - /* get the cpu_mask of local node */ - if (its_dev->its->numa_node >= 0) - cpu_mask = 
cpumask_of_node(its_dev->its->numa_node); + /* Find the least loaded CPU on the local node */ + if (its_dev->its->numa_node >= 0) { + cpu = its_pick_target_cpu(its_dev, + cpumask_of_node(its_dev->its->numa_node)); + if (cpu < 0) + return cpu; - /* Bind the LPI to the first possible CPU */ - cpu = cpumask_first_and(cpu_mask, cpu_online_mask); - if (cpu >= nr_cpu_ids) { - if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) + if (cpu >= nr_cpu_ids && + (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144)) return -EINVAL; + } - cpu = cpumask_first(cpu_online_mask); + if (cpu >= nr_cpu_ids) { + cpu = its_pick_target_cpu(its_dev, cpu_online_mask); + if (cpu < 0) + return cpu; } + pr_info("picked CPU%d IRQ%d\n", cpu, d->irq); + atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu)); its_dev->event_map.col_map[event] = cpu; irq_data_update_effective_affinity(d, cpumask_of(cpu)); @@ -2803,6 +2845,8 @@ static void its_irq_domain_deactivate(struct irq_domain *domain, struct its_device *its_dev = irq_data_get_irq_chip_data(d); u32 event = its_get_event_id(d); + atomic_dec(per_cpu_ptr(&cpu_lpi_count, + its_dev->event_map.col_map[event])); /* Stop the delivery of interrupts */ its_send_discard(its_dev, event); } -- Jazz is not dead. It just smells funny...
On 10/12/2019 18:32, Marc Zyngier wrote: >>>> The ITS code will make the lowest online CPU in the affinity mask >>>> the >>>> target CPU for the interrupt, which may result in some CPUs >>>> handling >>>> so many interrupts. >>> If what you want is for the*default* affinity to be spread around, >>> that should be achieved pretty easily. Let me have a think about how >>> to do that. >> Cool, I anticipate that it should help my case. >> >> I can also seek out some NVMe cards to see how it would help a more >> "generic" scenario. > Can you give the following a go? It probably has all kind of warts on > top of the quality debug information, but I managed to get my D05 and > a couple of guests to boot with it. It will probably eat your data, > so use caution!;-) > Hi Marc, Ok, we'll give it a spin. Thanks, John > Thanks, > > M. > > diff --git a/drivers/irqchip/irq-gic-v3-its.c > b/drivers/irqchip/irq-gic-v3-its.c > index e05673bcd52b..301ee3bc0602 100644 > --- a/drivers/irqchip/irq-gic-v3-its.c > +++ b/drivers/irqchip/irq-gic-v3-its.c > @@ -177,6 +177,8 @@ static DEFINE_IDA(its_vpeid_ida);
On 10/12/2019 01:43, Ming Lei wrote: >>>> For when the interrupt is managed, allow the threaded part to run on all >>>> cpus in the irq affinity mask. >>> I remembered that performance drop is observed by this approach in some >>> test. >> From checking the thread about the NVMe interrupt swamp, just switching to >> threaded handler alone degrades performance. I didn't see any specific >> results for this change from Long Li -https://lkml.org/lkml/2019/8/21/128 Hi Ming, > I am pretty clear the reason for Azure, which is caused by aggressive interrupt > coalescing, and this behavior shouldn't be very common, and it can be > addressed by the following patch: I am running some NVMe perf tests with Marc's patch. I see this almost always eventually (with or without that patch): [ 66.018140] rcu: INFO: rcu_preempt self-detected stall on CPU2% done] [5058MB/0KB/0KB /s] [1295K/0/0 iops] [eta 01m:39s] [ 66.023885] rcu: 12-....: (5250 ticks this GP) idle=182/1/0x4000000000000004 softirq=517/517 fqs=2529 [ 66.033306] (t=5254 jiffies g=733 q=2241) [ 66.037394] Task dump for CPU 12: [ 66.040696] fio R running task 0 798 796 0x00000002 [ 66.047733] Call trace: [ 66.050173] dump_backtrace+0x0/0x1a0 [ 66.053823] show_stack+0x14/0x20 [ 66.057126] sched_show_task+0x164/0x1a0 [ 66.061036] dump_cpu_task+0x40/0x2e8 [ 66.064686] rcu_dump_cpu_stacks+0xa0/0xe0 [ 66.068769] rcu_sched_clock_irq+0x6d8/0xaa8 [ 66.073027] update_process_times+0x2c/0x50 [ 66.077198] tick_sched_handle.isra.14+0x30/0x50 [ 66.081802] tick_sched_timer+0x48/0x98 [ 66.085625] __hrtimer_run_queues+0x120/0x1b8 [ 66.089968] hrtimer_interrupt+0xd4/0x250 [ 66.093966] arch_timer_handler_phys+0x28/0x40 [ 66.098398] handle_percpu_devid_irq+0x80/0x140 [ 66.102915] generic_handle_irq+0x24/0x38 [ 66.106911] __handle_domain_irq+0x5c/0xb0 [ 66.110995] gic_handle_irq+0x5c/0x148 [ 66.114731] el1_irq+0xb8/0x180 [ 66.117858] efi_header_end+0x94/0x234 [ 66.121595] irq_exit+0xd0/0xd8 [ 66.124724] __handle_domain_irq+0x60/0xb0 [ 66.128806] gic_handle_irq+0x5c/0x148 [ 66.132542] el0_irq_naked+0x4c/0x54 [ 97.152870] rcu: INFO: rcu_preempt self-detected stall on CPU8% done] [4736MB/0KB/0KB /s] [1212K/0/0 iops] [eta 01m:08s] [ 97.158616] rcu: 8-....: (1 GPs behind) idle=08e/1/0x4000000000000002 softirq=462/505 fqs=2621 [ 97.167414] (t=5253 jiffies g=737 q=5507) [ 97.171498] Task dump for CPU 8: [pu_task+0x40/0x2e8 [ 97.198705] rcu_dump_cpu_stacks+0xa0/0xe0 [ 97.202788] rcu_sched_clock_irq+0x6d8/0xaa8 [ 97.207046] update_process_times+0x2c/0x50 [ 97.211217] tick_sched_handle.isra.14+0x30/0x50 [ 97.215820] tick_sched_timer+0x48/0x98 [ 97.219644] __hrtimer_run_queues+0x120/0x1b8 [ 97.223989] hrtimer_interrupt+0xd4/0x250 [ 97.227987] arch_timer_handler_phys+0x28/0x40 [ 97.232418] handle_percpu_devid_irq+0x80/0x140 [ 97.236935] generic_handle_irq+0x24/0x38 [ 97.240931] __handle_domain_irq+0x5c/0xb0 [ 97.245015] gic_handle_irq+0x5c/0x148 [ 97.248751] el1_irq+0xb8/0x180 [ 97.251880] find_busiest_group+0x18c/0x9e8 [ 97.256050] load_balance+0x154/0xb98 [ 97.259700] rebalance_domains+0x1cc/0x2f8 [ 97.263783] run_rebalance_domains+0x78/0xe0 [ 97.268040] efi_header_end+0x114/0x234 [ 97.271864] run_ksoftirqd+0x38/0x48 [ 97.275427] smpboot_thread_fn+0x16c/0x270 [ 97.279511] kthread+0x118/0x120 [ 97.282726] ret_from_fork+0x10/0x18 [ 97.286289] Task dump for CPU 12: [ 97.289591] kworker/12:1 R running task 0 570 2 0x0000002a [ 97.296634] Workqueue: 0x0 (mm_percpu_wq) [ 97.300718] Call trace: [ 97.303152] __switch_to+0xbc/0x218 [ 97.306632] page_wait_table+0x1500/0x1800 Would this 
be the same interrupt "swamp" issue? > > http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html > What is the status of these patches? I did not see them in mainline. > Then please share your lockup story, such as, which HBA/drivers, test steps, > if you complete IOs from multiple disks(LUNs) on single CPU, if you have > multiple queues, how many active LUNs involved in the test, ... > > Thanks, John
On Wed, Dec 11, 2019 at 05:09:18PM +0000, John Garry wrote: > On 10/12/2019 01:43, Ming Lei wrote: > > > > > For when the interrupt is managed, allow the threaded part to run on all > > > > > cpus in the irq affinity mask. > > > > I remembered that performance drop is observed by this approach in some > > > > test. > > > From checking the thread about the NVMe interrupt swamp, just switching to > > > threaded handler alone degrades performance. I didn't see any specific > > > results for this change from Long Li -https://lkml.org/lkml/2019/8/21/128 > > Hi Ming, > > > I am pretty clear the reason for Azure, which is caused by aggressive interrupt > > coalescing, and this behavior shouldn't be very common, and it can be > > addressed by the following patch: > > I am running some NVMe perf tests with Marc's patch. We need to confirm that if Marc's patch works as expected, could you collect log via the attached script? > > I see this almost always eventually (with or without that patch): > > [ 66.018140] rcu: INFO: rcu_preempt self-detected stall on CPU2% done] > [5058MB/0KB/0KB /s] [1295K/0/0 iops] [eta 01m:39s] > [ 66.023885] rcu: 12-....: (5250 ticks this GP) > idle=182/1/0x4000000000000004 softirq=517/517 fqs=2529 > [ 66.033306] (t=5254 jiffies g=733 q=2241) > [ 66.037394] Task dump for CPU 12: > [ 66.040696] fio R running task 0 798 796 > 0x00000002 > [ 66.047733] Call trace: > [ 66.050173] dump_backtrace+0x0/0x1a0 > [ 66.053823] show_stack+0x14/0x20 > [ 66.057126] sched_show_task+0x164/0x1a0 > [ 66.061036] dump_cpu_task+0x40/0x2e8 > [ 66.064686] rcu_dump_cpu_stacks+0xa0/0xe0 > [ 66.068769] rcu_sched_clock_irq+0x6d8/0xaa8 > [ 66.073027] update_process_times+0x2c/0x50 > [ 66.077198] tick_sched_handle.isra.14+0x30/0x50 > [ 66.081802] tick_sched_timer+0x48/0x98 > [ 66.085625] __hrtimer_run_queues+0x120/0x1b8 > [ 66.089968] hrtimer_interrupt+0xd4/0x250 > [ 66.093966] arch_timer_handler_phys+0x28/0x40 > [ 66.098398] handle_percpu_devid_irq+0x80/0x140 > [ 66.102915] generic_handle_irq+0x24/0x38 > [ 66.106911] __handle_domain_irq+0x5c/0xb0 > [ 66.110995] gic_handle_irq+0x5c/0x148 > [ 66.114731] el1_irq+0xb8/0x180 > [ 66.117858] efi_header_end+0x94/0x234 > [ 66.121595] irq_exit+0xd0/0xd8 > [ 66.124724] __handle_domain_irq+0x60/0xb0 > [ 66.128806] gic_handle_irq+0x5c/0x148 > [ 66.132542] el0_irq_naked+0x4c/0x54 > [ 97.152870] rcu: INFO: rcu_preempt self-detected stall on CPU8% done] > [4736MB/0KB/0KB /s] [1212K/0/0 iops] [eta 01m:08s] > [ 97.158616] rcu: 8-....: (1 GPs behind) idle=08e/1/0x4000000000000002 > softirq=462/505 fqs=2621 > [ 97.167414] (t=5253 jiffies g=737 q=5507) > [ 97.171498] Task dump for CPU 8: > [pu_task+0x40/0x2e8 > [ 97.198705] rcu_dump_cpu_stacks+0xa0/0xe0 > [ 97.202788] rcu_sched_clock_irq+0x6d8/0xaa8 > [ 97.207046] update_process_times+0x2c/0x50 > [ 97.211217] tick_sched_handle.isra.14+0x30/0x50 > [ 97.215820] tick_sched_timer+0x48/0x98 > [ 97.219644] __hrtimer_run_queues+0x120/0x1b8 > [ 97.223989] hrtimer_interrupt+0xd4/0x250 > [ 97.227987] arch_timer_handler_phys+0x28/0x40 > [ 97.232418] handle_percpu_devid_irq+0x80/0x140 > [ 97.236935] generic_handle_irq+0x24/0x38 > [ 97.240931] __handle_domain_irq+0x5c/0xb0 > [ 97.245015] gic_handle_irq+0x5c/0x148 > [ 97.248751] el1_irq+0xb8/0x180 > [ 97.251880] find_busiest_group+0x18c/0x9e8 > [ 97.256050] load_balance+0x154/0xb98 > [ 97.259700] rebalance_domains+0x1cc/0x2f8 > [ 97.263783] run_rebalance_domains+0x78/0xe0 > [ 97.268040] efi_header_end+0x114/0x234 > [ 97.271864] run_ksoftirqd+0x38/0x48 > [ 97.275427] 
smpboot_thread_fn+0x16c/0x270 > [ 97.279511] kthread+0x118/0x120 > [ 97.282726] ret_from_fork+0x10/0x18 > [ 97.286289] Task dump for CPU 12: > [ 97.289591] kworker/12:1 R running task 0 570 2 > 0x0000002a > [ 97.296634] Workqueue: 0x0 (mm_percpu_wq) > [ 97.300718] Call trace: > [ 97.303152] __switch_to+0xbc/0x218 > [ 97.306632] page_wait_table+0x1500/0x1800 > > Would this be the same interrupt "swamp" issue? It could be, but reason need to investigated. You never provide the test details(how many drives, how many disks attached to each drive) as I asked, so I can't comment on the reason, also no reason shows that the patch is a good fix. My theory is simple, so far, the CPU is still much quicker than current storage in case that IO aren't from multiple disks which are connected to same drive. Thanks, Ming #!/bin/sh get_disk_from_pcid() { PCID=$1 DISKS=`find /sys/block -name "*"` for DISK in $DISKS; do DISKP=`realpath $DISK/device` echo $DISKP | grep $PCID > /dev/null [ $? -eq 0 ] && echo `basename $DISK` && break done } dump_irq_affinity() { PCID=$1 PCIP=`find /sys/devices -name *$PCID | grep pci` [[ ! -d $PCIP/msi_irqs ]] && return IRQS=`ls $PCIP/msi_irqs` [ $? -ne 0 ] && return DISK=`get_disk_from_pcid $PCID` echo "PCI name is $PCID: $DISK" for IRQ in $IRQS; do [ -f /proc/irq/$IRQ/smp_affinity_list ] && CPUS=`cat /proc/irq/$IRQ/smp_affinity_list` [ -f /proc/irq/$IRQ/effective_affinity_list ] && ECPUS=`cat /proc/irq/$IRQ/effective_affinity_list` echo -e "\tirq $IRQ, cpu list $CPUS, effective list $ECPUS" done } if [ $# -ge 1 ]; then PCIDS=$1 else # PCID=`lspci | grep "Non-Volatile memory" | cut -c1-7` PCIDS=`lspci | grep "Non-Volatile memory controller" | awk '{print $1}'` fi echo "kernel version: " uname -a for PCID in $PCIDS; do dump_irq_affinity $PCID done
On 11/12/2019 09:41, John Garry wrote: > On 10/12/2019 18:32, Marc Zyngier wrote: >>>>> The ITS code will make the lowest online CPU in the affinity mask >>>>> the >>>>> target CPU for the interrupt, which may result in some CPUs >>>>> handling >>>>> so many interrupts. >>>> If what you want is for the*default* affinity to be spread around, >>>> that should be achieved pretty easily. Let me have a think about how >>>> to do that. >>> Cool, I anticipate that it should help my case. >>> >>> I can also seek out some NVMe cards to see how it would help a more >>> "generic" scenario. >> Can you give the following a go? It probably has all kind of warts on >> top of the quality debug information, but I managed to get my D05 and >> a couple of guests to boot with it. It will probably eat your data, >> so use caution!;-) >> > > Hi Marc, > > Ok, we'll give it a spin. > > Thanks, > John Hi Marc, JFYI, we're still testing this and the patch itself seems to work as intended. Here's the kernel log if you just want to see how the interrupts are getting assigned: https://pastebin.com/hh3r810g For me, I did get a performance boost for NVMe testing, but my colleague Xiang Chen saw a drop for our storage test of interest - that's the HiSi SAS controller. We're trying to make sense of it now. Thanks, John > >> Thanks, >> >> M. >> >> diff --git a/drivers/irqchip/irq-gic-v3-its.c >> b/drivers/irqchip/irq-gic-v3-its.c >> index e05673bcd52b..301ee3bc0602 100644 >> --- a/drivers/irqchip/irq-gic-v3-its.c >> +++ b/drivers/irqchip/irq-gic-v3-its.c >> @@ -177,6 +177,8 @@ static DEFINE_IDA(its_vpeid_ida); >
Hi John, On 2019-12-13 10:07, John Garry wrote: > On 11/12/2019 09:41, John Garry wrote: >> On 10/12/2019 18:32, Marc Zyngier wrote: >>>>>> The ITS code will make the lowest online CPU in the affinity >>>>>> mask >>>>>> the >>>>>> target CPU for the interrupt, which may result in some CPUs >>>>>> handling >>>>>> so many interrupts. >>>>> If what you want is for the*default* affinity to be spread >>>>> around, >>>>> that should be achieved pretty easily. Let me have a think about >>>>> how >>>>> to do that. >>>> Cool, I anticipate that it should help my case. >>>> >>>> I can also seek out some NVMe cards to see how it would help a >>>> more >>>> "generic" scenario. >>> Can you give the following a go? It probably has all kind of warts >>> on >>> top of the quality debug information, but I managed to get my D05 >>> and >>> a couple of guests to boot with it. It will probably eat your data, >>> so use caution!;-) >>> >> Hi Marc, >> Ok, we'll give it a spin. >> Thanks, >> John > > Hi Marc, > > JFYI, we're still testing this and the patch itself seems to work as > intended. > > Here's the kernel log if you just want to see how the interrupts are > getting assigned: > https://pastebin.com/hh3r810g It is a bit hard to make sense of this dump, specially on such a wide machine (I want one!) without really knowing the topology of the system. > For me, I did get a performance boost for NVMe testing, but my > colleague Xiang Chen saw a drop for our storage test of interest - > that's the HiSi SAS controller. We're trying to make sense of it now. One of the difference is that with this patch, the initial affinity is picked inside the NUMA node that matches the ITS. In your case, that's either node 0 or 2. But it is unclear whether which CPUs these map to. Given that I see interrupts mapped to CPUs 0-23 on one side, and 48-71 on the other, it looks like half of your machine gets starved, and that may be because no ITS targets the NUMA nodes they are part of. It would be interesting to see what happens if you manually set the affinity of the interrupts outside of the NUMA node. Thanks, M. -- Jazz is not dead. It just smells funny...
Hi Ming, >> I am running some NVMe perf tests with Marc's patch. > > We need to confirm that if Marc's patch works as expected, could you > collect log via the attached script? As immediately below, I see this on vanilla mainline, so let's see what the issue is without that patch. > > > You never provide the test details(how many drives, how many disks > attached to each drive) as I asked, so I can't comment on the reason, > also no reason shows that the patch is a good fix. So I have only 2x ES3000 V3s. This looks like the same one: https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf > > My theory is simple, so far, the CPU is still much quicker than > current storage in case that IO aren't from multiple disks which are > connected to same drive. Hopefully this is all the info you need: Last login: Fri Dec 13 10:41:55 GMT 2019 on ttyAMA0 Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 5.5.0-rc1-00001-g3779c27ad995-dirty aarch64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage Failed to connect to https://changelogs.ubuntu.com/meta-release-lts. Check your Internet connection or proxy settings john@ubuntu:~$ lstopo Machine (14GB total) Package L#0 NUMANode L#0 (P#0 14GB) L3 L#0 (32MB) L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0) L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1) L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2) L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3) L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4) L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5) L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6) L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7) L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8) L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9) L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10) L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11) L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12) L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13) L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14) L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15) L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16) L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17) L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18) L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19) L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20) L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21) L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23) HostBridge L#0 PCIBridge 2 x { PCI 8086:10fb } PCIBridge PCI 19e5:0123 PCIBridge PCI 19e5:1711 HostBridge L#4 PCIBridge PCI 19e5:a250 PCI 19e5:a230 PCI 19e5:a235 HostBridge L#6 PCIBridge PCI 19e5:a222 Net L#0 "eno1" PCI 19e5:a222 Net L#1 "eno2" PCI 19e5:a222 Net L#2 "eno3" PCI 19e5:a221 Net L#3 "eno4" NUMANode L#1 (P#1) + L3 L#1 (32MB) L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24) L2 L#25 (512KB) + L1d L#25 (64KB) + 
L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25) L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26) L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27) L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28) L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29) L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30) L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31) L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32) L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33) L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34) L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35) L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36) L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37) L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38) L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39) L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40) L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41) L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42) L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43) L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44) L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45) L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46) L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47) Package L#1 NUMANode L#2 (P#2) L3 L#2 (32MB) L2 L#48 (512KB) + L1d L#48 (64KB) + L1i L#48 (64KB) + Core L#48 + PU L#48 (P#48) L2 L#49 (512KB) + L1d L#49 (64KB) + L1i L#49 (64KB) + Core L#49 + PU L#49 (P#49) L2 L#50 (512KB) + L1d L#50 (64KB) + L1i L#50 (64KB) + Core L#50 + PU L#50 (P#50) L2 L#51 (512KB) + L1d L#51 (64KB) + L1i L#51 (64KB) + Core L#51 + PU L#51 (P#51) L2 L#52 (512KB) + L1d L#52 (64KB) + L1i L#52 (64KB) + Core L#52 + PU L#52 (P#52) L2 L#53 (512KB) + L1d L#53 (64KB) + L1i L#53 (64KB) + Core L#53 + PU L#53 (P#53) L2 L#54 (512KB) + L1d L#54 (64KB) + L1i L#54 (64KB) + Core L#54 + PU L#54 (P#54) L2 L#55 (512KB) + L1d L#55 (64KB) + L1i L#55 (64KB) + Core L#55 + PU L#55 (P#55) L2 L#56 (512KB) + L1d L#56 (64KB) + L1i L#56 (64KB) + Core L#56 + PU L#56 (P#56) L2 L#57 (512KB) + L1d L#57 (64KB) + L1i L#57 (64KB) + Core L#57 + PU L#57 (P#57) L2 L#58 (512KB) + L1d L#58 (64KB) + L1i L#58 (64KB) + Core L#58 + PU L#58 (P#58) L2 L#59 (512KB) + L1d L#59 (64KB) + L1i L#59 (64KB) + Core L#59 + PU L#59 (P#59) L2 L#60 (512KB) + L1d L#60 (64KB) + L1i L#60 (64KB) + Core L#60 + PU L#60 (P#60) L2 L#61 (512KB) + L1d L#61 (64KB) + L1i L#61 (64KB) + Core L#61 + PU L#61 (P#61) L2 L#62 (512KB) + L1d L#62 (64KB) + L1i L#62 (64KB) + Core L#62 + PU L#62 (P#62) L2 L#63 (512KB) + L1d L#63 (64KB) + L1i L#63 (64KB) + Core L#63 + PU L#63 (P#63) L2 L#64 (512KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 + PU L#64 (P#64) L2 L#65 (512KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 + PU L#65 (P#65) L2 L#66 (512KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 + PU L#66 (P#66) L2 L#67 (512KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 + PU L#67 (P#67) L2 L#68 (512KB) + L1d L#68 (64KB) + L1i L#68 (64KB) + Core 
L#68 + PU L#68 (P#68) L2 L#69 (512KB) + L1d L#69 (64KB) + L1i L#69 (64KB) + Core L#69 + PU L#69 (P#69) L2 L#70 (512KB) + L1d L#70 (64KB) + L1i L#70 (64KB) + Core L#70 + PU L#70 (P#70) L2 L#71 (512KB) + L1d L#71 (64KB) + L1i L#71 (64KB) + Core L#71 + PU L#71 (P#71) HostBridge L#8 PCIBridge PCI 19e5:0123 HostBridge L#10 PCIBridge PCI 19e5:a250 HostBridge L#12 PCIBridge PCI 19e5:a226 Net L#4 "eno5" NUMANode L#3 (P#3) + L3 L#3 (32MB) L2 L#72 (512KB) + L1d L#72 (64KB) + L1i L#72 (64KB) + Core L#72 + PU L#72 (P#72) L2 L#73 (512KB) + L1d L#73 (64KB) + L1i L#73 (64KB) + Core L#73 + PU L#73 (P#73) L2 L#74 (512KB) + L1d L#74 (64KB) + L1i L#74 (64KB) + Core L#74 + PU L#74 (P#74) L2 L#75 (512KB) + L1d L#75 (64KB) + L1i L#75 (64KB) + Core L#75 + PU L#75 (P#75) L2 L#76 (512KB) + L1d L#76 (64KB) + L1i L#76 (64KB) + Core L#76 + PU L#76 (P#76) L2 L#77 (512KB) + L1d L#77 (64KB) + L1i L#77 (64KB) + Core L#77 + PU L#77 (P#77) L2 L#78 (512KB) + L1d L#78 (64KB) + L1i L#78 (64KB) + Core L#78 + PU L#78 (P#78) L2 L#79 (512KB) + L1d L#79 (64KB) + L1i L#79 (64KB) + Core L#79 + PU L#79 (P#79) L2 L#80 (512KB) + L1d L#80 (64KB) + L1i L#80 (64KB) + Core L#80 + PU L#80 (P#80) L2 L#81 (512KB) + L1d L#81 (64KB) + L1i L#81 (64KB) + Core L#81 + PU L#81 (P#81) L2 L#82 (512KB) + L1d L#82 (64KB) + L1i L#82 (64KB) + Core L#82 + PU L#82 (P#82) L2 L#83 (512KB) + L1d L#83 (64KB) + L1i L#83 (64KB) + Core L#83 + PU L#83 (P#83) L2 L#84 (512KB) + L1d L#84 (64KB) + L1i L#84 (64KB) + Core L#84 + PU L#84 (P#84) L2 L#85 (512KB) + L1d L#85 (64KB) + L1i L#85 (64KB) + Core L#85 + PU L#85 (P#85) L2 L#86 (512KB) + L1d L#86 (64KB) + L1i L#86 (64KB) + Core L#86 + PU L#86 (P#86) L2 L#87 (512KB) + L1d L#87 (64KB) + L1i L#87 (64KB) + Core L#87 + PU L#87 (P#87) L2 L#88 (512KB) + L1d L#88 (64KB) + L1i L#88 (64KB) + Core L#88 + PU L#88 (P#88) L2 L#89 (512KB) + L1d L#89 (64KB) + L1i L#89 (64KB) + Core L#89 + PU L#89 (P#89) L2 L#90 (512KB) + L1d L#90 (64KB) + L1i L#90 (64KB) + Core L#90 + PU L#90 (P#90) L2 L#91 (512KB) + L1d L#91 (64KB) + L1i L#91 (64KB) + Core L#91 + PU L#91 (P#91) L2 L#92 (512KB) + L1d L#92 (64KB) + L1i L#92 (64KB) + Core L#92 + PU L#92 (P#92) L2 L#93 (512KB) + L1d L#93 (64KB) + L1i L#93 (64KB) + Core L#93 + PU L#93 (P#93) L2 L#94 (512KB) + L1d L#94 (64KB) + L1i L#94 (64KB) + Core L#94 + PU L#94 (P#94) L2 L#95 (512KB) + L1d L#95 (64KB) + L1i L#95 (64KB) + Core L#95 + PU L#95 (P#95) john@ubuntu:~$ lscpu Architecture: aarch64 Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 1 Core(s) per socket: 48 Socket(s): 2 NUMA node(s): 4 Vendor ID: 0x48 Model: 0 Stepping: 0x0 BogoMIPS: 200.00 L1d cache: 64K L1i cache: 64K L2 cache: 512K L3 cache: 32768K NUMA node0 CPU(s): 0-23 NUMA node1 CPU(s): 24-47 NUMA node2 CPU(s): 48-71 NUMA node3 CPU(s): 72-95 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm dcpop john@ubuntu:~$ dmesg | grep "Linux v" [ 0.000000] Linux version 5.5.0-rc1-00001-g3779c27ad995-dirty (john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05-rc1 revision 38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] (Linaro GCC 7.3-2018.05-rc1)) #1436 SMP PREEMPT Fri Dec 13 10:51:46 GMT 2019 john@ubuntu:~$ john@ubuntu:~$ lspci 00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 00:04.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 00:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 00:0c.0 PCI bridge: Huawei Technologies Co., Ltd. 
HiSilicon PCIe Root Port with Gen4 (rev 45) 00:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 00:12.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 04:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. ES3000 V3 NVMe PCIe SSD (rev 45) 05:00.0 VGA compatible controller: Huawei Technologies Co., Ltd. Hi1710 [iBMC Intelligent Management system chip w/VGA support] (rev 01) 74:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge (rev 20) 74:02.0 Serial Attached SCSI controller: Huawei Technologies Co., Ltd. HiSilicon SAS 3.0 HBA (rev 20) 74:03.0 SATA controller: Huawei Technologies Co., Ltd. HiSilicon AHCI HBA (rev 20) 75:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon ZIP Engine (rev 20) 78:00.0 Network and computing encryption device: Huawei Technologies Co., Ltd. HiSilicon HPRE Engine (rev 20) 7a:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 2-port Host Controller (rev 20) 7a:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 2-port Host Controller (rev 20) 7a:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 Host Controller (rev 20) 7c:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge (rev 20) 7d:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE RDMA Network Controller (rev 20) 7d:00.1 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE RDMA Network Controller (rev 20) 7d:00.2 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE RDMA Network Controller (rev 20) 7d:00.3 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE Network Controller (rev 20) 80:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 80:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 80:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 80:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port with Gen4 (rev 45) 81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. ES3000 V3 NVMe PCIe SSD (rev 45) b4:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge (rev 20) b5:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon ZIP Engine (rev 20) b8:00.0 Network and computing encryption device: Huawei Technologies Co., Ltd. HiSilicon HPRE Engine (rev 20) ba:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 2-port Host Controller (rev 20) ba:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 2-port Host Controller (rev 20) ba:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 Host Controller (rev 20) bc:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge (rev 20) bd:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE/50GE/100GE RDMA Network Controller (rev 20) john@ubuntu:~$ sudo /bin/bash create_fio_task_cpu_liuyifan_nvme.sh 4k read 20 1 Creat 4k_read_depth20_fiotest file sucessfully job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... 
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... fio-3.1 Starting 40 processes [ 175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU IOPS][eta 00m:18s] [ 175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004 softirq=1589/1589 fqs=2322 Jobs: 40 (f=40): [R(40)][100.0%][r=4270MiB/s,w=0KiB/s][r=1093k,w=0 IOPS][eta 00m:00s] job1: (groupid=0, jobs=40): err= 0: pid=1227: Fri Dec 13 10:57:49 2019 read: IOPS=952k, BW=3719MiB/s (3900MB/s)(145GiB/40003msec) slat (usec): min=2, max=20126k, avg=10.66, stdev=9637.70 clat (usec): min=13, max=20156k, avg=517.95, stdev=31017.58 lat (usec): min=21, max=20156k, avg=528.77, stdev=32487.76 clat percentiles (usec): | 1.00th=[ 103], 5.00th=[ 113], 10.00th=[ 147], 20.00th=[ 200], | 30.00th=[ 260], 40.00th=[ 318], 50.00th=[ 375], 60.00th=[ 429], | 70.00th=[ 486], 80.00th=[ 578], 90.00th=[ 799], 95.00th=[ 996], | 99.00th=[ 1958], 99.50th=[ 2114], 99.90th=[ 2311], 99.95th=[ 2474], | 99.99th=[ 7767] bw ( KiB/s): min= 112, max=745026, per=4.60%, avg=175285.03, stdev=117592.37, samples=1740 iops : min= 28, max=186256, avg=43821.06, stdev=29398.12, samples=1740 lat (usec) : 20=0.01%, 50=0.01%, 100=0.14%, 250=28.38%, 500=43.76% lat (usec) : 750=16.17%, 1000=6.65% lat (msec) : 2=4.02%, 4=0.86%, 10=0.01%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 2000=0.01% lat (msec) : >=2000=0.01% cpu : usr=3.67%, sys=15.82%, ctx=20799355, majf=0, minf=4275 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwt: total=38086812,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=20 Run status group 0 (all jobs): READ: bw=3719MiB/s (3900MB/s), 3719MiB/s-3719MiB/s (3900MB/s-3900MB/s), io=145GiB (156GB), run=40003-40003msec Disk stats (read0/0, ticks=5002739/0, in_queue=540, util=99.83% john@ubuntu:~$ dmesg | tail -n 100 [ 20.380611] Key type dns_resolver registered [ 20.385000] registered taskstats version 1 [ 20.389092] Loading compiled-in X.509 certificates [ 20.394494] pcieport 0000:00:00.0: Adding to iommu group 9 [ 20.401556] pcieport 0000:00:04.0: Adding to iommu group 10 [ 20.408695] pcieport 0000:00:08.0: Adding to iommu group 11 [ 20.415767] pcieport 0000:00:0c.0: Adding to iommu group 12 [ 20.422842] pcieport 0000:00:10.0: Adding to iommu group 13 [ 20.429932] pcieport 0000:00:12.0: Adding to iommu group 14 [ 20.437077] pcieport 0000:7c:00.0: Adding to iommu group 15 [ 20.443397] pcieport 0000:74:00.0: Adding to iommu group 16 [ 20.449790] pcieport 0000:80:00.0: Adding to iommu group 17 [ 20.453983] usb 1-2: new high-speed USB device number 3 using ehci-pci [ 20.457553] pcieport 0000:80:08.0: Adding to iommu group 18 [ 20.469455] pcieport 0000:80:0c.0: Adding to iommu group 19 [ 20.477037] pcieport 0000:80:10.0: Adding to iommu group 20 [ 20.484712] pcieport 0000:bc:00.0: Adding to iommu group 21 [ 20.491155] pcieport 0000:b4:00.0: Adding to iommu group 22 [ 20.517723] rtc-efi rtc-efi: setting system clock to 2019-12-13T10:54:56 UTC (1576234496) [ 20.525913] ALSA device list: [ 20.528878] No soundcards found. [ 20.618601] hub 1-2:1.0: USB hub found [ 20.622440] hub 1-2:1.0: 4 ports detected [ 20.744970] EXT4-fs (sdd1): recovery complete [ 20.759425] EXT4-fs (sdd1): mounted filesystem with ordered data mode. 
Opts: (null) [ 20.767090] VFS: Mounted root (ext4 filesystem) on device 8:49. [ 20.788837] devtmpfs: mounted [ 20.793124] Freeing unused kernel memory: 5184K [ 20.797817] Run /sbin/init as init process [ 20.913986] usb 1-2.1: new full-speed USB device number 4 using ehci-pci [ 21.379891] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) [ 21.401921] systemd[1]: Detected architecture arm64. [ 21.459107] systemd[1]: Set hostname to <ubuntu>. [ 21.474734] systemd[1]: Couldn't move remaining userspace processes, ignoring: Input/output error [ 21.947303] systemd[1]: File /lib/systemd/system/systemd-journald.service:36 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling. [ 21.964340] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.) [ 22.268240] random: systemd: uninitialized urandom read (16 bytes read) [ 22.274946] systemd[1]: Started Forward Password Requests to Wall Directory Watch. [ 22.298022] random: systemd: uninitialized urandom read (16 bytes read) [ 22.304894] systemd[1]: Created slice User and Session Slice. [ 22.322032] random: systemd: uninitialized urandom read (16 bytes read) [ 22.328850] systemd[1]: Created slice System Slice. [ 22.346109] systemd[1]: Listening on Syslog Socket. [ 22.552644] random: crng init done [ 22.558740] random: 7 urandom warning(s) missed due to ratelimiting [ 23.370478] EXT4-fs (sdd1): re-mounted. Opts: errors=remount-ro [ 23.547390] systemd-journald[806]: Received request to flush runtime journal from PID 1 [ 23.633956] systemd-journald[806]: File /var/log/journal/f0ef8dc5ede84b5eb7431c01908d3558/system.journal corrupted or uncleanly shut down, renaming and replacing. [ 23.814035] Adding 2097148k swap on /swapfile. 
Priority:-2 extents:6 across:2260988k [ 25.012707] hns3 0000:7d:00.2 eno3: renamed from eth2 [ 25.054228] hns3 0000:7d:00.3 eno4: renamed from eth3 [ 25.086971] hns3 0000:7d:00.1 eno2: renamed from eth1 [ 25.118154] hns3 0000:7d:00.0 eno1: renamed from eth0 [ 25.154467] hns3 0000:bd:00.0 eno5: renamed from eth4 [ 26.130742] input: Keyboard/Mouse KVM 1.1.0 as /devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.0/0003:12D1:0003.0001/input/input1 [ 26.190189] hid-generic 0003:12D1:0003.0001: input: USB HID v1.10 Keyboard [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input0 [ 26.191049] input: Keyboard/Mouse KVM 1.1.0 as /devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.1/0003:12D1:0003.0002/input/input2 [ 26.191090] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1 [ 175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU [ 175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004 softirq=1589/1589 fqs=2322 [ 175.657102] (t=5253 jiffies g=2893 q=3123) [ 175.657105] Task dump for CPU 0: [ 175.657108] fio R running task 0 1254 1224 0x00000002 [ 175.657112] Call trace: [ 175.657122] dump_backtrace+0x0/0x1a0 [ 175.657126] show_stack+0x14/0x20 [ 175.657130] sched_show_task+0x164/0x1a0 [ 175.657133] dump_cpu_task+0x40/0x2e8 [ 175.657137] rcu_dump_cpu_stacks+0xa0/0xe0 [ 175.657139] rcu_sched_clock_irq+0x6d8/0xaa8 [ 175.657143] update_process_times+0x2c/0x50 [ 175.657147] tick_sched_handle.isra.14+0x30/0x50 [ 175.657149] tick_sched_timer+0x48/0x98 [ 175.657152] __hrtimer_run_queues+0x120/0x1b8 [ 175.657154] hrtimer_interrupt+0xd4/0x250 [ 175.657159] arch_timer_handler_phys+0x28/0x40 [ 175.657162] handle_percpu_devid_irq+0x80/0x140 [ 175.657165] generic_handle_irq+0x24/0x38 [ 175.657167] __handle_domain_irq+0x5c/0xb0 [ 175.657170] gic_handle_irq+0x5c/0x148 [ 175.657172] el1_irq+0xb8/0x180 [ 175.657175] efi_header_end+0x94/0x234 [ 175.657178] irq_exit+0xd0/0xd8 [ 175.657180] __handle_domain_irq+0x60/0xb0 [ 175.657182] gic_handle_irq+0x5c/0x148 [ 175.657184] el1_irq+0xb8/0x180 [ 175.657194] nvme_open+0x80/0xc8 [ 175.657199] __blkdev_get+0x3f8/0x4f0 [ 175.657201] blkdev_get+0x110/0x180 [ 175.657204] blkdev_open+0x8c/0xa0 [ 175.657207] do_dentry_open+0x1c4/0x3d8 [ 175.657210] vfs_open+0x28/0x30 [ 175.657212] path_openat+0x2a8/0x12a0 [ 175.657214] do_filp_open+0x78/0xf8 [ 175.657217] do_sys_open+0x19c/0x258 [ 175.657219] __arm64_sys_openat+0x20/0x28 [ 175.657222] el0_svc_common.constprop.2+0x64/0x160 [ 175.657225] el0_svc_handler+0x20/0x80 [ 175.657227] el0_sync_handler+0xe4/0x188 [ 175.657229] el0_sync+0x140/0x180 john@ubuntu:~$ ./dump-io-irq-affinity kernel version: Linux ubuntu 5.5.0-rc1-00001-g3779c27ad995-dirty #1436 SMP PREEMPT Fri Dec 13 10:51:46 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux PCI name is 04:00.0: nvme0n1 irq 56, cpu list 75, effective list 75 irq 60, cpu list 24-28, effective list 24 irq 61, cpu list 29-33, effective list 29 irq 62, cpu list 34-38, effective list 34 irq 63, cpu list 39-43, effective list 39 irq 64, cpu list 44-47, effective list 44 irq 65, cpu list 48-51, effective list 48 irq 66, cpu list 52-55, effective list 52 irq 67, cpu list 56-59, effective list 56 irq 68, cpu list 60-63, effective list 60 irq 69, cpu list 64-67, effective list 64 irq 70, cpu list 68-71, effective list 68 irq 71, cpu list 72-75, effective list 72 irq 72, cpu list 76-79, effective list 76 irq 73, cpu list 80-83, effective list 80 irq 74, cpu list 84-87, effective list 84 irq 75, cpu list 88-91, 
effective list 88 irq 76, cpu list 92-95, effective list 92 irq 77, cpu list 0-3, effective list 0 irq 78, cpu list 4-7, effective list 4 irq 79, cpu list 8-11, effective list 8 irq 80, cpu list 12-15, effective list 12 irq 81, cpu list 16-19, effective list 16 irq 82, cpu list 20-23, effective list 20 PCI name is 81:00.0: nvme1n1 irq 100, cpu list 0-3, effective list 0 irq 101, cpu list 4-7, effective list 4 irq 102, cpu list 8-11, effective list 8 irq 103, cpu list 12-15, effective list 12 irq 104, cpu list 16-19, effective list 16 irq 105, cpu list 20-23, effective list 20 irq 57, cpu list 63, effective list 63 irq 83, cpu list 24-28, effective list 24 irq 84, cpu list 29-33, effective list 29 irq 85, cpu list 34-38, effective list 34 irq 86, cpu list 39-43, effective list 39 irq 87, cpu list 44-47, effective list 44 irq 88, cpu list 48-51, effective list 48 irq 89, cpu list 52-55, effective list 52 irq 90, cpu list 56-59, effective list 56 irq 91, cpu list 60-63, effective list 60 irq 92, cpu list 64-67, effective list 64 irq 93, cpu list 68-71, effective list 68 irq 94, cpu list 72-75, effective list 72 irq 95, cpu list 76-79, effective list 76 irq 96, cpu list 80-83, effective list 80 irq 97, cpu list 84-87, effective list 84 irq 98, cpu list 88-91, effective list 88 irq 99, cpu list 92-95, effective list 92 john@ubuntu:~$ more create_fio_task_cpu_liuyifan_nvme.sh #s!/bin/bash # # #echo "$1_$2_$3_test" > $filename echo " [global] rw=$2 direct=1 ioengine=libaio iodepth=$3 numjobs=20 bs=$1 ;size=10240000m ;zero_buffers=1 group_reporting=1 group_reporting=1 ;ioscheduler=noop ;cpumask=0xfe ;cpus_allowed=0-3 ;gtod_reduce=1 ;iodepth_batch=2 ;iodepth_batch_complete=2 runtime=40 ;thread loops = 10000 " > $1_$2_depth$3_fiotest declare -i new_count=1 declare -i disk_count=0 #fdisk -l |grep "Disk /dev/sd" > fdiskinfo #cat fdiskinfo |awk '{print $2}' |awk -F ":" '{print $1}' > devinfo ls /dev/nvme0n1 /dev/nvme1n1 > devinfo new_num=`sed -n '$=' devinfo` while [ $new_count -le $new_num ] do if [ "$diskcount" != "$4" ]; then new_disk=`sed -n "$new_count"p devinfo` disk_list=${new_disk}":"${disk_list} ((new_count++)) ((diskcount++)) if [ $new_count -gt $new_num ]; then echo "[job1]" >> $1_$2_depth$3_fiotest echo "filename=$disk_list" >> $1_$2_depth$3_fiotest fi continue fi # if [ "$new_disk" = "/dev/sda" ]; then # continue # fi echo "[job1]" >> $1_$2_depth$3_fiotest echo "filename=$disk_list" >> $1_$2_depth$3_fiotest diskcount=0 disk_list="" done echo "Creat $1_$2_depth$3_fiotest file sucessfully" fio $1_$2_depth$3_fiotest john@ubuntu:~$ Thanks, John
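For what it's worth, the job file generated by the script above for the "4k read 20 1" invocation boils down to roughly the following (a reconstruction, not the verbatim file; the two [job1] sections with numjobs=20 each account for the 40 fio processes, and the trailing ':' in the filenames is what the script emits):

#!/bin/bash
# Sketch of what create_fio_task_cpu_liuyifan_nvme.sh ends up running for
# "4k read 20 1": one job section per NVMe drive, 20 jobs each, libaio,
# direct I/O, 40s runtime.
cat > 4k_read_depth20_fiotest <<'EOF'
[global]
rw=read
direct=1
ioengine=libaio
iodepth=20
numjobs=20
bs=4k
group_reporting=1
runtime=40

[job1]
filename=/dev/nvme0n1:

[job1]
filename=/dev/nvme1n1:
EOF
fio 4k_read_depth20_fiotest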
Hi Marc, >> JFYI, we're still testing this and the patch itself seems to work as >> intended. >> >> Here's the kernel log if you just want to see how the interrupts are >> getting assigned: >> https://pastebin.com/hh3r810g > > It is a bit hard to make sense of this dump, specially on such a wide > machine (I want one!) So do I :) That's the newer "D06CS" board. without really knowing the topology of the system. So it's 2x socket, each socket has 2x CPU dies, and each die has 6 clusters of 4 CPUs, which gives 96 in total. > >> For me, I did get a performance boost for NVMe testing, but my >> colleague Xiang Chen saw a drop for our storage test of interest - >> that's the HiSi SAS controller. We're trying to make sense of it now. > > One of the difference is that with this patch, the initial affinity > is picked inside the NUMA node that matches the ITS. Is that even for managed interrupts? We're testing the storage controller which uses managed interrupts. I should have made that clearer. In your case, > that's either node 0 or 2. But it is unclear whether which CPUs these > map to. > > Given that I see interrupts mapped to CPUs 0-23 on one side, and 48-71 > on the other, it looks like half of your machine gets starved, Seems that way. So this is a mystery to me: [ 23.584192] picked CPU62 IRQ147 147: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ITS-MSI 94404626 Edge hisi_sas_v3_hw cq and [ 25.896728] picked CPU62 IRQ183 183: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ITS-MSI 94437398 Edge hisi_sas_v3_hw cq But mpstat reports for CPU62: 12:44:58 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle 12:45:00 AM 62 6.54 0.00 42.99 0.00 6.54 12.15 0.00 0.00 6.54 25.23 I don't know what interrupts they are... It's the "hisi_sas_v3_hw cq" interrupts which we're interested in. and that > may be because no ITS targets the NUMA nodes they are part of. So both storage controllers (which we're interested in for this test) are on socket #0, node #0. It would > be interesting to see what happens if you manually set the affinity > of the interrupts outside of the NUMA node. > Again, managed, so I don't think it's possible. Thanks, John
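On "I don't know what interrupts they are": one rough way to answer that is to diff the per-CPU column of /proc/interrupts while the test runs. Untested sketch; it assumes field 1 is the "<irq>:" label and the per-CPU counters follow in CPU order, so CPU62's counter is field 64:

#!/bin/bash
# Untested sketch: show which IRQs actually fired on a given CPU (default
# CPU62) over one second, by diffing that CPU's column in /proc/interrupts.
cpu=${1:-62}
col=$((cpu + 2))
snap() { awk -v c="$col" 'NR > 1 && $c ~ /^[0-9]+$/ { print $1, $c }' /proc/interrupts; }
snap > /tmp/irq.before
sleep 1
snap > /tmp/irq.after
awk 'NR == FNR { before[$1] = $2; next }
     ($1 in before) && $2 > before[$1] { print $1, $2 - before[$1] }' \
        /tmp/irq.before /tmp/irq.after | sort -k2 -rn | head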
Hi John, On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote: > Hi Ming, > > > > I am running some NVMe perf tests with Marc's patch. > > > > We need to confirm that if Marc's patch works as expected, could you > > collect log via the attached script? > > As immediately below, I see this on vanilla mainline, so let's see what the > issue is without that patch. IMO, the interrupt load needs to be distributed as what X86 IRQ matrix does. If the ARM64 server doesn't do that, the 1st step should align to that. Also do you pass 'use_threaded_interrupts=1' in your test? > > > > > > You never provide the test details(how many drives, how many disks > > attached to each drive) as I asked, so I can't comment on the reason, > > also no reason shows that the patch is a good fix. > > So I have only 2x ES3000 V3s. This looks like the same one: > https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf > > > > > My theory is simple, so far, the CPU is still much quicker than > > current storage in case that IO aren't from multiple disks which are > > connected to same drive. > > Hopefully this is all the info you need: > > Last login: Fri Dec 13 10:41:55 GMT 2019 on ttyAMA0 > Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 5.5.0-rc1-00001-g3779c27ad995-dirty > aarch64) > > * Documentation: https://help.ubuntu.com > * Management: https://landscape.canonical.com > * Support: https://ubuntu.com/advantage > > Failed to connect to https://changelogs.ubuntu.com/meta-release-lts. Check > your Internet connection or proxy settings > > john@ubuntu:~$ lstopo > Machine (14GB total) > Package L#0 > NUMANode L#0 (P#0 14GB) > L3 L#0 (32MB) > L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 > (P#0) > L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 > (P#1) > L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 > (P#2) > L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 > (P#3) > L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 > (P#4) > L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 > (P#5) > L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 > (P#6) > L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 > (P#7) > L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 > (P#8) > L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 > (P#9) > L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU > L#10 (P#10) > L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU > L#11 (P#11) > L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU > L#12 (P#12) > L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU > L#13 (P#13) > L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU > L#14 (P#14) > L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU > L#15 (P#15) > L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU > L#16 (P#16) > L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU > L#17 (P#17) > L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU > L#18 (P#18) > L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU > L#19 (P#19) > L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU > L#20 (P#20) > L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU > L#21 (P#21) > L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23) > 
HostBridge L#0 > PCIBridge > 2 x { PCI 8086:10fb } > PCIBridge > PCI 19e5:0123 > PCIBridge > PCI 19e5:1711 > HostBridge L#4 > PCIBridge > PCI 19e5:a250 > PCI 19e5:a230 > PCI 19e5:a235 > HostBridge L#6 > PCIBridge > PCI 19e5:a222 > Net L#0 "eno1" > PCI 19e5:a222 > Net L#1 "eno2" > PCI 19e5:a222 > Net L#2 "eno3" > PCI 19e5:a221 > Net L#3 "eno4" > NUMANode L#1 (P#1) + L3 L#1 (32MB) > L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU > L#24 (P#24) > L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU > L#25 (P#25) > L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU > L#26 (P#26) > L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU > L#27 (P#27) > L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU > L#28 (P#28) > L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU > L#29 (P#29) > L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU > L#30 (P#30) > L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU > L#31 (P#31) > L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU > L#32 (P#32) > L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU > L#33 (P#33) > L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU > L#34 (P#34) > L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU > L#35 (P#35) > L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU > L#36 (P#36) > L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU > L#37 (P#37) > L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU > L#38 (P#38) > L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU > L#39 (P#39) > L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU > L#40 (P#40) > L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU > L#41 (P#41) > L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU > L#42 (P#42) > L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU > L#43 (P#43) > L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU > L#44 (P#44) > L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU > L#45 (P#45) > L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU > L#46 (P#46) > L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU > L#47 (P#47) > Package L#1 > NUMANode L#2 (P#2) > L3 L#2 (32MB) > L2 L#48 (512KB) + L1d L#48 (64KB) + L1i L#48 (64KB) + Core L#48 + PU > L#48 (P#48) > L2 L#49 (512KB) + L1d L#49 (64KB) + L1i L#49 (64KB) + Core L#49 + PU > L#49 (P#49) > L2 L#50 (512KB) + L1d L#50 (64KB) + L1i L#50 (64KB) + Core L#50 + PU > L#50 (P#50) > L2 L#51 (512KB) + L1d L#51 (64KB) + L1i L#51 (64KB) + Core L#51 + PU > L#51 (P#51) > L2 L#52 (512KB) + L1d L#52 (64KB) + L1i L#52 (64KB) + Core L#52 + PU > L#52 (P#52) > L2 L#53 (512KB) + L1d L#53 (64KB) + L1i L#53 (64KB) + Core L#53 + PU > L#53 (P#53) > L2 L#54 (512KB) + L1d L#54 (64KB) + L1i L#54 (64KB) + Core L#54 + PU > L#54 (P#54) > L2 L#55 (512KB) + L1d L#55 (64KB) + L1i L#55 (64KB) + Core L#55 + PU > L#55 (P#55) > L2 L#56 (512KB) + L1d L#56 (64KB) + L1i L#56 (64KB) + Core L#56 + PU > L#56 (P#56) > L2 L#57 (512KB) + L1d L#57 (64KB) + L1i L#57 (64KB) + Core L#57 + PU > L#57 (P#57) > L2 L#58 (512KB) + L1d L#58 (64KB) + L1i L#58 (64KB) + Core L#58 + PU > L#58 (P#58) > L2 L#59 (512KB) + L1d L#59 (64KB) + L1i L#59 (64KB) + Core L#59 + PU > L#59 (P#59) > L2 L#60 (512KB) + L1d L#60 (64KB) + L1i L#60 (64KB) + Core 
L#60 + PU > L#60 (P#60) > L2 L#61 (512KB) + L1d L#61 (64KB) + L1i L#61 (64KB) + Core L#61 + PU > L#61 (P#61) > L2 L#62 (512KB) + L1d L#62 (64KB) + L1i L#62 (64KB) + Core L#62 + PU > L#62 (P#62) > L2 L#63 (512KB) + L1d L#63 (64KB) + L1i L#63 (64KB) + Core L#63 + PU > L#63 (P#63) > L2 L#64 (512KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 + PU > L#64 (P#64) > L2 L#65 (512KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 + PU > L#65 (P#65) > L2 L#66 (512KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 + PU > L#66 (P#66) > L2 L#67 (512KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 + PU > L#67 (P#67) > L2 L#68 (512KB) + L1d L#68 (64KB) + L1i L#68 (64KB) + Core L#68 + PU > L#68 (P#68) > L2 L#69 (512KB) + L1d L#69 (64KB) + L1i L#69 (64KB) + Core L#69 + PU > L#69 (P#69) > L2 L#70 (512KB) + L1d L#70 (64KB) + L1i L#70 (64KB) + Core L#70 + PU > L#70 (P#70) > L2 L#71 (512KB) + L1d L#71 (64KB) + L1i L#71 (64KB) + Core L#71 + PU > L#71 (P#71) > HostBridge L#8 > PCIBridge > PCI 19e5:0123 > HostBridge L#10 > PCIBridge > PCI 19e5:a250 > HostBridge L#12 > PCIBridge > PCI 19e5:a226 > Net L#4 "eno5" > NUMANode L#3 (P#3) + L3 L#3 (32MB) > L2 L#72 (512KB) + L1d L#72 (64KB) + L1i L#72 (64KB) + Core L#72 + PU > L#72 (P#72) > L2 L#73 (512KB) + L1d L#73 (64KB) + L1i L#73 (64KB) + Core L#73 + PU > L#73 (P#73) > L2 L#74 (512KB) + L1d L#74 (64KB) + L1i L#74 (64KB) + Core L#74 + PU > L#74 (P#74) > L2 L#75 (512KB) + L1d L#75 (64KB) + L1i L#75 (64KB) + Core L#75 + PU > L#75 (P#75) > L2 L#76 (512KB) + L1d L#76 (64KB) + L1i L#76 (64KB) + Core L#76 + PU > L#76 (P#76) > L2 L#77 (512KB) + L1d L#77 (64KB) + L1i L#77 (64KB) + Core L#77 + PU > L#77 (P#77) > L2 L#78 (512KB) + L1d L#78 (64KB) + L1i L#78 (64KB) + Core L#78 + PU > L#78 (P#78) > L2 L#79 (512KB) + L1d L#79 (64KB) + L1i L#79 (64KB) + Core L#79 + PU > L#79 (P#79) > L2 L#80 (512KB) + L1d L#80 (64KB) + L1i L#80 (64KB) + Core L#80 + PU > L#80 (P#80) > L2 L#81 (512KB) + L1d L#81 (64KB) + L1i L#81 (64KB) + Core L#81 + PU > L#81 (P#81) > L2 L#82 (512KB) + L1d L#82 (64KB) + L1i L#82 (64KB) + Core L#82 + PU > L#82 (P#82) > L2 L#83 (512KB) + L1d L#83 (64KB) + L1i L#83 (64KB) + Core L#83 + PU > L#83 (P#83) > L2 L#84 (512KB) + L1d L#84 (64KB) + L1i L#84 (64KB) + Core L#84 + PU > L#84 (P#84) > L2 L#85 (512KB) + L1d L#85 (64KB) + L1i L#85 (64KB) + Core L#85 + PU > L#85 (P#85) > L2 L#86 (512KB) + L1d L#86 (64KB) + L1i L#86 (64KB) + Core L#86 + PU > L#86 (P#86) > L2 L#87 (512KB) + L1d L#87 (64KB) + L1i L#87 (64KB) + Core L#87 + PU > L#87 (P#87) > L2 L#88 (512KB) + L1d L#88 (64KB) + L1i L#88 (64KB) + Core L#88 + PU > L#88 (P#88) > L2 L#89 (512KB) + L1d L#89 (64KB) + L1i L#89 (64KB) + Core L#89 + PU > L#89 (P#89) > L2 L#90 (512KB) + L1d L#90 (64KB) + L1i L#90 (64KB) + Core L#90 + PU > L#90 (P#90) > L2 L#91 (512KB) + L1d L#91 (64KB) + L1i L#91 (64KB) + Core L#91 + PU > L#91 (P#91) > L2 L#92 (512KB) + L1d L#92 (64KB) + L1i L#92 (64KB) + Core L#92 + PU > L#92 (P#92) > L2 L#93 (512KB) + L1d L#93 (64KB) + L1i L#93 (64KB) + Core L#93 + PU > L#93 (P#93) > L2 L#94 (512KB) + L1d L#94 (64KB) + L1i L#94 (64KB) + Core L#94 + PU > L#94 (P#94) > L2 L#95 (512KB) + L1d L#95 (64KB) + L1i L#95 (64KB) + Core L#95 + PU > L#95 (P#95) > john@ubuntu:~$ lscpu > Architecture: aarch64 > Byte Order: Little Endian > CPU(s): 96 > On-line CPU(s) list: 0-95 > Thread(s) per core: 1 > Core(s) per socket: 48 > Socket(s): 2 > NUMA node(s): 4 > Vendor ID: 0x48 > Model: 0 > Stepping: 0x0 > BogoMIPS: 200.00 > L1d cache: 64K > L1i cache: 64K > L2 cache: 512K > L3 cache: 32768K > NUMA node0 CPU(s): 0-23 > NUMA 
node1 CPU(s): 24-47 > NUMA node2 CPU(s): 48-71 > NUMA node3 CPU(s): 72-95 > Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics > cpuid asimdrdm dcpop > john@ubuntu:~$ dmesg | grep "Linux v" > [ 0.000000] Linux version 5.5.0-rc1-00001-g3779c27ad995-dirty > (john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 > [linaro-7.3-2018.05-rc1 revision 38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] > (Linaro GCC 7.3-2018.05-rc1)) #1436 SMP PREEMPT Fri Dec 13 10:51:46 GMT 2019 > john@ubuntu:~$ > john@ubuntu:~$ lspci > 00:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 00:04.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 00:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 00:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 00:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 00:12.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ > Network Connection (rev 01) > 01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ > Network Connection (rev 01) > 04:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. ES3000 > V3 NVMe PCIe SSD (rev 45) > 05:00.0 VGA compatible controller: Huawei Technologies Co., Ltd. Hi1710 > [iBMC Intelligent Management system chip w/VGA support] (rev 01) > 74:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge > (rev 20) > 74:02.0 Serial Attached SCSI controller: Huawei Technologies Co., Ltd. > HiSilicon SAS 3.0 HBA (rev 20) > 74:03.0 SATA controller: Huawei Technologies Co., Ltd. HiSilicon AHCI HBA > (rev 20) > 75:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon ZIP > Engine (rev 20) > 78:00.0 Network and computing encryption device: Huawei Technologies Co., > Ltd. HiSilicon HPRE Engine (rev 20) > 7a:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 > 2-port Host Controller (rev 20) > 7a:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 > 2-port Host Controller (rev 20) > 7a:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 Host > Controller (rev 20) > 7c:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge > (rev 20) > 7d:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE > RDMA Network Controller (rev 20) > 7d:00.1 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE > RDMA Network Controller (rev 20) > 7d:00.2 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE > RDMA Network Controller (rev 20) > 7d:00.3 Ethernet controller: Huawei Technologies Co., Ltd. HNS GE/10GE/25GE > Network Controller (rev 20) > 80:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 80:08.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 80:0c.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 80:10.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCIe Root Port > with Gen4 (rev 45) > 81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. ES3000 > V3 NVMe PCIe SSD (rev 45) > b4:00.0 PCI bridge: Huawei Technologies Co., Ltd. 
HiSilicon PCI-PCI Bridge > (rev 20) > b5:00.0 Processing accelerators: Huawei Technologies Co., Ltd. HiSilicon ZIP > Engine (rev 20) > b8:00.0 Network and computing encryption device: Huawei Technologies Co., > Ltd. HiSilicon HPRE Engine (rev 20) > ba:00.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 > 2-port Host Controller (rev 20) > ba:01.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 2.0 > 2-port Host Controller (rev 20) > ba:02.0 USB controller: Huawei Technologies Co., Ltd. HiSilicon USB 3.0 Host > Controller (rev 20) > bc:00.0 PCI bridge: Huawei Technologies Co., Ltd. HiSilicon PCI-PCI Bridge > (rev 20) > bd:00.0 Ethernet controller: Huawei Technologies Co., Ltd. HNS > GE/10GE/25GE/50GE/100GE RDMA Network Controller (rev 20) > john@ubuntu:~$ sudo /bin/bash create_fio_task_cpu_liuyifan_nvme.sh 4k read > 20 1 > Creat 4k_read_depth20_fiotest file sucessfully > job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, > ioengine=libaio, iodepth=20 > ... > job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, > ioengine=libaio, iodepth=20 > ... > fio-3.1 > Starting 40 processes > [ 175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU IOPS][eta > 00m:18s] > [ 175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004 > softirq=1589/1589 fqs=2322 > Jobs: 40 (f=40): [R(40)][100.0%][r=4270MiB/s,w=0KiB/s][r=1093k,w=0 IOPS][eta > 00m:00s] > job1: (groupid=0, jobs=40): err= 0: pid=1227: Fri Dec 13 10:57:49 2019 > read: IOPS=952k, BW=3719MiB/s (3900MB/s)(145GiB/40003msec) > slat (usec): min=2, max=20126k, avg=10.66, stdev=9637.70 > clat (usec): min=13, max=20156k, avg=517.95, stdev=31017.58 > lat (usec): min=21, max=20156k, avg=528.77, stdev=32487.76 > clat percentiles (usec): > | 1.00th=[ 103], 5.00th=[ 113], 10.00th=[ 147], 20.00th=[ 200], > | 30.00th=[ 260], 40.00th=[ 318], 50.00th=[ 375], 60.00th=[ 429], > | 70.00th=[ 486], 80.00th=[ 578], 90.00th=[ 799], 95.00th=[ 996], > | 99.00th=[ 1958], 99.50th=[ 2114], 99.90th=[ 2311], 99.95th=[ 2474], > | 99.99th=[ 7767] > bw ( KiB/s): min= 112, max=745026, per=4.60%, avg=175285.03, > stdev=117592.37, samples=1740 > iops : min= 28, max=186256, avg=43821.06, stdev=29398.12, > samples=1740 > lat (usec) : 20=0.01%, 50=0.01%, 100=0.14%, 250=28.38%, 500=43.76% > lat (usec) : 750=16.17%, 1000=6.65% > lat (msec) : 2=4.02%, 4=0.86%, 10=0.01%, 20=0.01%, 50=0.01% > lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 2000=0.01% > lat (msec) : >=2000=0.01% > cpu : usr=3.67%, sys=15.82%, ctx=20799355, majf=0, minf=4275 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, > >=64=0.0% > issued rwt: total=38086812,0,0, short=0,0,0, dropped=0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=20 > > Run status group 0 (all jobs): > READ: bw=3719MiB/s (3900MB/s), 3719MiB/s-3719MiB/s (3900MB/s-3900MB/s), > io=145GiB (156GB), run=40003-40003msec > > Disk stats (read0/0, ticks=5002739/0, in_queue=540, util=99.83% > john@ubuntu:~$ dmesg | tail -n 100 > [ 20.380611] Key type dns_resolver registered > [ 20.385000] registered taskstats version 1 > [ 20.389092] Loading compiled-in X.509 certificates > [ 20.394494] pcieport 0000:00:00.0: Adding to iommu group 9 > [ 20.401556] pcieport 0000:00:04.0: Adding to iommu group 10 > [ 20.408695] pcieport 0000:00:08.0: Adding to iommu group 11 > [ 20.415767] 
pcieport 0000:00:0c.0: Adding to iommu group 12 > [ 20.422842] pcieport 0000:00:10.0: Adding to iommu group 13 > [ 20.429932] pcieport 0000:00:12.0: Adding to iommu group 14 > [ 20.437077] pcieport 0000:7c:00.0: Adding to iommu group 15 > [ 20.443397] pcieport 0000:74:00.0: Adding to iommu group 16 > [ 20.449790] pcieport 0000:80:00.0: Adding to iommu group 17 > [ 20.453983] usb 1-2: new high-speed USB device number 3 using ehci-pci > [ 20.457553] pcieport 0000:80:08.0: Adding to iommu group 18 > [ 20.469455] pcieport 0000:80:0c.0: Adding to iommu group 19 > [ 20.477037] pcieport 0000:80:10.0: Adding to iommu group 20 > [ 20.484712] pcieport 0000:bc:00.0: Adding to iommu group 21 > [ 20.491155] pcieport 0000:b4:00.0: Adding to iommu group 22 > [ 20.517723] rtc-efi rtc-efi: setting system clock to 2019-12-13T10:54:56 > UTC (1576234496) > [ 20.525913] ALSA device list: > [ 20.528878] No soundcards found. > [ 20.618601] hub 1-2:1.0: USB hub found > [ 20.622440] hub 1-2:1.0: 4 ports detected > [ 20.744970] EXT4-fs (sdd1): recovery complete > [ 20.759425] EXT4-fs (sdd1): mounted filesystem with ordered data mode. > Opts: (null) > [ 20.767090] VFS: Mounted root (ext4 filesystem) on device 8:49. > [ 20.788837] devtmpfs: mounted > [ 20.793124] Freeing unused kernel memory: 5184K > [ 20.797817] Run /sbin/init as init process > [ 20.913986] usb 1-2.1: new full-speed USB device number 4 using ehci-pci > [ 21.379891] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT > +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT > +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 > default-hierarchy=hybrid) > [ 21.401921] systemd[1]: Detected architecture arm64. > [ 21.459107] systemd[1]: Set hostname to <ubuntu>. > [ 21.474734] systemd[1]: Couldn't move remaining userspace processes, > ignoring: Input/output error > [ 21.947303] systemd[1]: File > /lib/systemd/system/systemd-journald.service:36 configures an IP firewall > (IPAddressDeny=any), but the local system does not support BPF/cgroup based > firewalling. > [ 21.964340] systemd[1]: Proceeding WITHOUT firewalling in effect! (This > warning is only shown for the first loaded unit using IP firewalling.) > [ 22.268240] random: systemd: uninitialized urandom read (16 bytes read) > [ 22.274946] systemd[1]: Started Forward Password Requests to Wall > Directory Watch. > [ 22.298022] random: systemd: uninitialized urandom read (16 bytes read) > [ 22.304894] systemd[1]: Created slice User and Session Slice. > [ 22.322032] random: systemd: uninitialized urandom read (16 bytes read) > [ 22.328850] systemd[1]: Created slice System Slice. > [ 22.346109] systemd[1]: Listening on Syslog Socket. > [ 22.552644] random: crng init done > [ 22.558740] random: 7 urandom warning(s) missed due to ratelimiting > [ 23.370478] EXT4-fs (sdd1): re-mounted. Opts: errors=remount-ro > [ 23.547390] systemd-journald[806]: Received request to flush runtime > journal from PID 1 > [ 23.633956] systemd-journald[806]: File > /var/log/journal/f0ef8dc5ede84b5eb7431c01908d3558/system.journal corrupted > or uncleanly shut down, renaming and replacing. > [ 23.814035] Adding 2097148k swap on /swapfile. 
Priority:-2 extents:6 > across:2260988k > [ 25.012707] hns3 0000:7d:00.2 eno3: renamed from eth2 > [ 25.054228] hns3 0000:7d:00.3 eno4: renamed from eth3 > [ 25.086971] hns3 0000:7d:00.1 eno2: renamed from eth1 > [ 25.118154] hns3 0000:7d:00.0 eno1: renamed from eth0 > [ 25.154467] hns3 0000:bd:00.0 eno5: renamed from eth4 > [ 26.130742] input: Keyboard/Mouse KVM 1.1.0 as /devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.0/0003:12D1:0003.0001/input/input1 > [ 26.190189] hid-generic 0003:12D1:0003.0001: input: USB HID v1.10 > Keyboard [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input0 > [ 26.191049] input: Keyboard/Mouse KVM 1.1.0 as /devices/pci0000:7a/0000:7a:01.0/usb1/1-2/1-2.1/1-2.1:1.1/0003:12D1:0003.0002/input/input2 > [ 26.191090] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse > [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1 > [ 175.642410] rcu: INFO: rcu_preempt self-detected stall on CPU > [ 175.648150] rcu: 0-....: (1 GPs behind) idle=3ae/1/0x4000000000000004 > softirq=1589/1589 fqs=2322 > [ 175.657102] (t=5253 jiffies g=2893 q=3123) > [ 175.657105] Task dump for CPU 0: > [ 175.657108] fio R running task 0 1254 1224 > 0x00000002 > [ 175.657112] Call trace: > [ 175.657122] dump_backtrace+0x0/0x1a0 > [ 175.657126] show_stack+0x14/0x20 > [ 175.657130] sched_show_task+0x164/0x1a0 > [ 175.657133] dump_cpu_task+0x40/0x2e8 > [ 175.657137] rcu_dump_cpu_stacks+0xa0/0xe0 > [ 175.657139] rcu_sched_clock_irq+0x6d8/0xaa8 > [ 175.657143] update_process_times+0x2c/0x50 > [ 175.657147] tick_sched_handle.isra.14+0x30/0x50 > [ 175.657149] tick_sched_timer+0x48/0x98 > [ 175.657152] __hrtimer_run_queues+0x120/0x1b8 > [ 175.657154] hrtimer_interrupt+0xd4/0x250 > [ 175.657159] arch_timer_handler_phys+0x28/0x40 > [ 175.657162] handle_percpu_devid_irq+0x80/0x140 > [ 175.657165] generic_handle_irq+0x24/0x38 > [ 175.657167] __handle_domain_irq+0x5c/0xb0 > [ 175.657170] gic_handle_irq+0x5c/0x148 > [ 175.657172] el1_irq+0xb8/0x180 > [ 175.657175] efi_header_end+0x94/0x234 > [ 175.657178] irq_exit+0xd0/0xd8 > [ 175.657180] __handle_domain_irq+0x60/0xb0 > [ 175.657182] gic_handle_irq+0x5c/0x148 > [ 175.657184] el1_irq+0xb8/0x180 > [ 175.657194] nvme_open+0x80/0xc8 > [ 175.657199] __blkdev_get+0x3f8/0x4f0 > [ 175.657201] blkdev_get+0x110/0x180 > [ 175.657204] blkdev_open+0x8c/0xa0 > [ 175.657207] do_dentry_open+0x1c4/0x3d8 > [ 175.657210] vfs_open+0x28/0x30 > [ 175.657212] path_openat+0x2a8/0x12a0 > [ 175.657214] do_filp_open+0x78/0xf8 > [ 175.657217] do_sys_open+0x19c/0x258 > [ 175.657219] __arm64_sys_openat+0x20/0x28 > [ 175.657222] el0_svc_common.constprop.2+0x64/0x160 > [ 175.657225] el0_svc_handler+0x20/0x80 > [ 175.657227] el0_sync_handler+0xe4/0x188 > [ 175.657229] el0_sync+0x140/0x180 > > john@ubuntu:~$ ./dump-io-irq-affinity > kernel version: > Linux ubuntu 5.5.0-rc1-00001-g3779c27ad995-dirty #1436 SMP PREEMPT Fri Dec > 13 10:51:46 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux > PCI name is 04:00.0: nvme0n1 > irq 56, cpu list 75, effective list 75 > irq 60, cpu list 24-28, effective list 24 > irq 61, cpu list 29-33, effective list 29 > irq 62, cpu list 34-38, effective list 34 > irq 63, cpu list 39-43, effective list 39 > irq 64, cpu list 44-47, effective list 44 > irq 65, cpu list 48-51, effective list 48 > irq 66, cpu list 52-55, effective list 52 > irq 67, cpu list 56-59, effective list 56 > irq 68, cpu list 60-63, effective list 60 > irq 69, cpu list 64-67, effective list 64 > irq 70, cpu list 68-71, effective list 68 > irq 71, cpu list 72-75, effective list 72 > 
> irq 72, cpu list 76-79, effective list 76
> irq 73, cpu list 80-83, effective list 80
> irq 74, cpu list 84-87, effective list 84
> irq 75, cpu list 88-91, effective list 88
> irq 76, cpu list 92-95, effective list 92
> irq 77, cpu list 0-3, effective list 0
> irq 78, cpu list 4-7, effective list 4
> irq 79, cpu list 8-11, effective list 8
> irq 80, cpu list 12-15, effective list 12
> irq 81, cpu list 16-19, effective list 16
> irq 82, cpu list 20-23, effective list 20
> PCI name is 81:00.0: nvme1n1
> irq 100, cpu list 0-3, effective list 0
> irq 101, cpu list 4-7, effective list 4
> irq 102, cpu list 8-11, effective list 8
> irq 103, cpu list 12-15, effective list 12
> irq 104, cpu list 16-19, effective list 16
> irq 105, cpu list 20-23, effective list 20
> irq 57, cpu list 63, effective list 63
> irq 83, cpu list 24-28, effective list 24
> irq 84, cpu list 29-33, effective list 29
> irq 85, cpu list 34-38, effective list 34
> irq 86, cpu list 39-43, effective list 39
> irq 87, cpu list 44-47, effective list 44
> irq 88, cpu list 48-51, effective list 48
> irq 89, cpu list 52-55, effective list 52
> irq 90, cpu list 56-59, effective list 56
> irq 91, cpu list 60-63, effective list 60
> irq 92, cpu list 64-67, effective list 64
> irq 93, cpu list 68-71, effective list 68
> irq 94, cpu list 72-75, effective list 72
> irq 95, cpu list 76-79, effective list 76
> irq 96, cpu list 80-83, effective list 80
> irq 97, cpu list 84-87, effective list 84
> irq 98, cpu list 88-91, effective list 88
> irq 99, cpu list 92-95, effective list 92

The above log shows there are two nvme drives, each drive has 24 hw
queues.

Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,
each hw queue can be assigned one unique effective CPU for handling
the queue's interrupt.

Because arm64's gic driver doesn't distribute irq's effective cpu affinity,
each hw queue is assigned the same CPU to handle its interrupt.

As you saw, the detected RCU stall is on CPU0, which is handling both
irq 77 and irq 100.

Please apply Marc's patch and observe if a unique effective CPU is
assigned to each hw queue's irq. If a unique effective CPU is assigned
to each hw queue's irq and the RCU stall can still be triggered, let's
investigate further, given that a single ARM64 CPU core should be quick
enough to handle IO completion from a single NVMe drive.

Thanks,
Ming
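That check can be scripted. The following untested sketch flags effective CPUs that end up serving more than one interrupt of a given device; it assumes the device name shows up in /proc/interrupts and that /proc/irq/<n>/effective_affinity_list is available:

#!/bin/bash
# Untested sketch: list effective CPUs that serve more than one interrupt
# of a given device (default "nvme"), e.g. CPU0 handling both irq 77 and
# irq 100 in the dump above.
dev=${1:-nvme}
awk -v d="$dev" '$0 ~ d { sub(":", "", $1); print $1 }' /proc/interrupts |
while read -r irq; do
        echo "$(cat /proc/irq/$irq/effective_affinity_list 2>/dev/null) $irq"
done |
awk '{ irqs[$1] = irqs[$1] " " $2; n[$1]++ }
     END { for (cpu in n) if (n[cpu] > 1) print "CPU " cpu " <- irqs" irqs[cpu] }'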
On 13/12/2019 13:18, Ming Lei wrote:

Hi Ming,

>
> On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote:
>> Hi Ming,
>>
>>>> I am running some NVMe perf tests with Marc's patch.
>>>
>>> We need to confirm that if Marc's patch works as expected, could you
>>> collect log via the attached script?
>>
>> As immediately below, I see this on vanilla mainline, so let's see what
>> the issue is without that patch.
>
> IMO, the interrupt load needs to be distributed as what X86 IRQ matrix
> does. If the ARM64 server doesn't do that, the 1st step should align to
> that.

That would make sense. But still, I would like to think that a CPU could
sink the interrupts from 2x queues.

>
> Also do you pass 'use_threaded_interrupts=1' in your test?

When I set this then, as I anticipated, there is no lockup. But IOPS
drops from ~1M to ~800K.

>
>>
>>>
>
>>> You never provide the test details(how many drives, how many disks
>>> attached to each drive) as I asked, so I can't comment on the reason,
>>> also no reason shows that the patch is a good fix.
>>
>> So I have only 2x ES3000 V3s. This looks like the same one:
>> https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf
>>
>>>
>>> My theory is simple, so far, the CPU is still much quicker than
>>> current storage in case that IO aren't from multiple disks which are
>>> connected to same drive.
>> [...]
>> irq 98, cpu list 88-91, effective list 88
>> irq 99, cpu list 92-95, effective list 92
>
> The above log shows there are two nvme drives, each drive has 24 hw
> queues.
>
> Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine,
> each hw queue can be assigned one unique effective CPU for handling
> the queue's interrupt.
>
> Because arm64's gic driver doesn't distribute irq's effective cpu affinity,
> each hw queue is assigned the same CPU to handle its interrupt.
>
> As you saw, the detected RCU stall is on CPU0, which is handling both
> irq 77 and irq 100.
>
> Please apply Marc's patch and observe if a unique effective CPU is
> assigned to each hw queue's irq.
> Same issue: 979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1 [ 38.772536] IRQ25 CPU14 -> CPU3 [ 38.777138] IRQ58 CPU8 -> CPU17 [ 119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU [ 119.505202] rcu: 16-....: (1 GPs behind) idle=a8a/1/0x4000000000000002 softirq=952/1211 fqs=2625 [ 119.514188] (t=5253 jiffies g=2613 q=4573) [ 119.514193] Task dump for CPU 16: [ 119.514197] ksoftirqd/16 R running task 0 91 2 0x0000002a [ 119.514206] Call trace: [ 119.514224] dump_backtrace+0x0/0x1a0 [ 119.514228] show_stack+0x14/0x20 [ 119.514236] sched_show_task+0x164/0x1a0 [ 119.514240] dump_cpu_task+0x40/0x2e8 [ 119.514245] rcu_dump_cpu_stacks+0xa0/0xe0 [ 119.514247] rcu_sched_clock_irq+0x6d8/0xaa8 [ 119.514251] update_process_times+0x2c/0x50 [ 119.514258] tick_sched_handle.isra.14+0x30/0x50 [ 119.514261] tick_sched_timer+0x48/0x98 [ 119.514264] __hrtimer_run_queues+0x120/0x1b8 [ 119.514266] hrtimer_interrupt+0xd4/0x250 [ 119.514277] arch_timer_handler_phys+0x28/0x40 [ 119.514280] handle_percpu_devid_irq+0x80/0x140 [ 119.514283] generic_handle_irq+0x24/0x38 [ 119.514285] __handle_domain_irq+0x5c/0xb0 [ 119.514299] gic_handle_irq+0x5c/0x148 [ 119.514301] el1_irq+0xb8/0x180 [ 119.514305] load_balance+0x478/0xb98 [ 119.514308] rebalance_domains+0x1cc/0x2f8 [ 119.514311] run_rebalance_domains+0x78/0xe0 [ 119.514313] efi_header_end+0x114/0x234 [ 119.514317] run_ksoftirqd+0x38/0x48 [ 119.514322] smpboot_thread_fn+0x16c/0x270 [ 119.514324] kthread+0x118/0x120 [ 119.514326] ret_from_fork+0x10/0x18 john@ubuntu:~$ ./dump-io-irq-affinity kernel version: Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux PCI name is 04:00.0: nvme0n1 irq 56, cpu list 75, effective list 5 irq 60, cpu list 24-28, effective list 10 irq 61, cpu list 29-33, effective list 7 irq 62, cpu list 34-38, effective list 5 irq 63, cpu list 39-43, effective list 6 irq 64, cpu list 44-47, effective list 8 irq 65, cpu list 48-51, effective list 9 irq 66, cpu list 52-55, effective list 10 irq 67, cpu list 56-59, effective list 11 irq 68, cpu list 60-63, effective list 12 irq 69, cpu list 64-67, effective list 13 irq 70, cpu list 68-71, effective list 14 irq 71, cpu list 72-75, effective list 15 irq 72, cpu list 76-79, effective list 16 irq 73, cpu list 80-83, effective list 17 irq 74, cpu list 84-87, effective list 18 irq 75, cpu list 88-91, effective list 19 irq 76, cpu list 92-95, effective list 20 irq 77, cpu list 0-3, effective list 3 irq 78, cpu list 4-7, effective list 4 irq 79, cpu list 8-11, effective list 8 irq 80, cpu list 12-15, effective list 12 irq 81, cpu list 16-19, effective list 16 irq 82, cpu list 20-23, effective list 23 PCI name is 81:00.0: nvme1n1 irq 100, cpu list 0-3, effective list 0 irq 101, cpu list 4-7, effective list 5 irq 102, cpu list 8-11, effective list 9 irq 103, cpu list 12-15, effective list 13 irq 104, cpu list 16-19, effective list 17 irq 105, cpu list 20-23, effective list 21 irq 57, cpu list 63, effective list 7 irq 83, cpu list 24-28, effective list 5 irq 84, cpu list 29-33, effective list 6 irq 85, cpu list 34-38, effective list 8 irq 86, cpu list 39-43, effective list 9 irq 87, cpu list 44-47, effective list 10 irq 88, cpu list 48-51, effective list 11 irq 89, cpu list 52-55, effective list 12 irq 90, cpu list 56-59, effective list 13 irq 91, cpu list 60-63, effective list 14 irq 92, cpu list 64-67, effective list 15 irq 93, cpu list 68-71, 
effective list 16 irq 94, cpu list 72-75, effective list 17 irq 95, cpu list 76-79, effective list 18 irq 96, cpu list 80-83, effective list 19 irq 97, cpu list 84-87, effective list 20 irq 98, cpu list 88-91, effective list 21 irq 99, cpu list 92-95, effective list 22 john@ubuntu:~$ but you can see that CPU16 is handling irq72, 81, and 93. > If unique effective CPU is assigned to each hw queue's irq, and the RCU > stall can still be triggered, let's investigate further, given one single > ARM64 CPU core should be quick enough to handle IO completion from single > NVNe drive. If I remove the code for bring the affinity within the ITS numa node mask - as Marc hinted - then I still get a lockup, but we still we have CPUs serving multiple interrupts: 116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU [ 116.181432] Task dump for CPU 4: [ 116.181502] Task dump for CPU 8: john@ubuntu:~$ ./dump-io-irq-affinity kernel version: Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri Dec 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux PCI name is 04:00.0: nvme0n1 irq 56, cpu list 75, effective list 75 irq 60, cpu list 24-28, effective list 25 irq 61, cpu list 29-33, effective list 29 irq 62, cpu list 34-38, effective list 34 irq 63, cpu list 39-43, effective list 39 irq 64, cpu list 44-47, effective list 44 irq 65, cpu list 48-51, effective list 49 irq 66, cpu list 52-55, effective list 55 irq 67, cpu list 56-59, effective list 56 irq 68, cpu list 60-63, effective list 61 irq 69, cpu list 64-67, effective list 64 irq 70, cpu list 68-71, effective list 68 irq 71, cpu list 72-75, effective list 73 irq 72, cpu list 76-79, effective list 76 irq 73, cpu list 80-83, effective list 80 irq 74, cpu list 84-87, effective list 85 irq 75, cpu list 88-91, effective list 88 irq 76, cpu list 92-95, effective list 92 irq 77, cpu list 0-3, effective list 1 irq 78, cpu list 4-7, effective list 4 irq 79, cpu list 8-11, effective list 8 irq 80, cpu list 12-15, effective list 14 irq 81, cpu list 16-19, effective list 16 irq 82, cpu list 20-23, effective list 20 PCI name is 81:00.0: nvme1n1 irq 100, cpu list 0-3, effective list 0 irq 101, cpu list 4-7, effective list 4 irq 102, cpu list 8-11, effective list 8 irq 103, cpu list 12-15, effective list 13 irq 104, cpu list 16-19, effective list 16 irq 105, cpu list 20-23, effective list 20 irq 57, cpu list 63, effective list 63 irq 83, cpu list 24-28, effective list 26 irq 84, cpu list 29-33, effective list 31 irq 85, cpu list 34-38, effective list 35 irq 86, cpu list 39-43, effective list 40 irq 87, cpu list 44-47, effective list 45 irq 88, cpu list 48-51, effective list 50 irq 89, cpu list 52-55, effective list 52 irq 90, cpu list 56-59, effective list 57 irq 91, cpu list 60-63, effective list 62 irq 92, cpu list 64-67, effective list 65 irq 93, cpu list 68-71, effective list 69 irq 94, cpu list 72-75, effective list 74 irq 95, cpu list 76-79, effective list 77 irq 96, cpu list 80-83, effective list 81 irq 97, cpu list 84-87, effective list 86 irq 98, cpu list 88-91, effective list 89 irq 99, cpu list 92-95, effective list 93 john@ubuntu:~$ I'm now thinking that we should just attempt this intelligent CPU affinity assignment for managed interrupts. Thanks, John
On Fri, Dec 13, 2019 at 03:43:07PM +0000, John Garry wrote: > On 13/12/2019 13:18, Ming Lei wrote: > > Hi Ming, > > > > > On Fri, Dec 13, 2019 at 11:12:49AM +0000, John Garry wrote: > > > Hi Ming, > > > > > > > > I am running some NVMe perf tests with Marc's patch. > > > > > > > > We need to confirm that if Marc's patch works as expected, could you > > > > collect log via the attached script? > > > > > > As immediately below, I see this on vanilla mainline, so let's see what the > > > issue is without that patch. > > > > IMO, the interrupt load needs to be distributed as what X86 IRQ matrix > > does. If the ARM64 server doesn't do that, the 1st step should align to > > that. > > That would make sense. But still, I would like to think that a CPU could > sink the interrupts from 2x queues. > > > > > Also do you pass 'use_threaded_interrupts=1' in your test? > > When I set this, then, as I anticipated, no lockup. But IOPS drops from ~ 1M > IOPS->800K. > > > > > > > > > > > > > > > You never provide the test details(how many drives, how many disks > > > > attached to each drive) as I asked, so I can't comment on the reason, > > > > also no reason shows that the patch is a good fix. > > > > > > So I have only 2x ES3000 V3s. This looks like the same one: > > > https://actfornet.com/HUAWEI_SERVER_DOCS/PCIeSSD/Huawei%20ES3000%20V3%20NVMe%20PCIe%20SSD%20Data%20Sheet.pdf > > > > > > > > > > > My theory is simple, so far, the CPU is still much quicker than > > > > current storage in case that IO aren't from multiple disks which are > > > > connected to same drive. > > > > > [...] > > > > irq 98, cpu list 88-91, effective list 88 > > > irq 99, cpu list 92-95, effective list 92 > > The above log shows there are two nvme drives, each drive has 24 hw > > queues. > > > > Also the system has 96 cores, and 96 > 24 * 2, so if everything is fine, > > each hw queue can be assigned one unique effective CPU for handling > > the queue's interrupt. > > > > Because arm64's gic driver doesn't distribute irq's effective cpu affinity, > > each hw queue is assigned same CPU to handle its interrupt. > > > > As you saw, the detected RCU stall is on CPU0, which is for handling > > both irq 77 and irq 100. > > > > Please apply Marc's patch and observe if unique effective CPU is > > assigned to each hw queue's irq. 
> > > > Same issue: > > 979826] hid-generic 0003:12D1:0003.0002: input: USB HID v1.10 Mouse > [Keyboard/Mouse KVM 1.1.0] on usb-0000:7a:01.0-2.1/input1 > [ 38.772536] IRQ25 CPU14 -> CPU3 > [ 38.777138] IRQ58 CPU8 -> CPU17 > [ 119.499459] rcu: INFO: rcu_preempt self-detected stall on CPU > [ 119.505202] rcu: 16-....: (1 GPs behind) idle=a8a/1/0x4000000000000002 > softirq=952/1211 fqs=2625 > [ 119.514188] (t=5253 jiffies g=2613 q=4573) > [ 119.514193] Task dump for CPU 16: > [ 119.514197] ksoftirqd/16 R running task 0 91 2 > 0x0000002a > [ 119.514206] Call trace: > [ 119.514224] dump_backtrace+0x0/0x1a0 > [ 119.514228] show_stack+0x14/0x20 > [ 119.514236] sched_show_task+0x164/0x1a0 > [ 119.514240] dump_cpu_task+0x40/0x2e8 > [ 119.514245] rcu_dump_cpu_stacks+0xa0/0xe0 > [ 119.514247] rcu_sched_clock_irq+0x6d8/0xaa8 > [ 119.514251] update_process_times+0x2c/0x50 > [ 119.514258] tick_sched_handle.isra.14+0x30/0x50 > [ 119.514261] tick_sched_timer+0x48/0x98 > [ 119.514264] __hrtimer_run_queues+0x120/0x1b8 > [ 119.514266] hrtimer_interrupt+0xd4/0x250 > [ 119.514277] arch_timer_handler_phys+0x28/0x40 > [ 119.514280] handle_percpu_devid_irq+0x80/0x140 > [ 119.514283] generic_handle_irq+0x24/0x38 > [ 119.514285] __handle_domain_irq+0x5c/0xb0 > [ 119.514299] gic_handle_irq+0x5c/0x148 > [ 119.514301] el1_irq+0xb8/0x180 > [ 119.514305] load_balance+0x478/0xb98 > [ 119.514308] rebalance_domains+0x1cc/0x2f8 > [ 119.514311] run_rebalance_domains+0x78/0xe0 > [ 119.514313] efi_header_end+0x114/0x234 > [ 119.514317] run_ksoftirqd+0x38/0x48 > [ 119.514322] smpboot_thread_fn+0x16c/0x270 > [ 119.514324] kthread+0x118/0x120 > [ 119.514326] ret_from_fork+0x10/0x18 > john@ubuntu:~$ ./dump-io-irq-affinity > kernel version: > Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec > 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux > PCI name is 04:00.0: nvme0n1 > irq 56, cpu list 75, effective list 5 > irq 60, cpu list 24-28, effective list 10 The effect list supposes to be subset of irq's affinity(24-28). > irq 61, cpu list 29-33, effective list 7 > irq 62, cpu list 34-38, effective list 5 > irq 63, cpu list 39-43, effective list 6 > irq 64, cpu list 44-47, effective list 8 > irq 65, cpu list 48-51, effective list 9 > irq 66, cpu list 52-55, effective list 10 > irq 67, cpu list 56-59, effective list 11 > irq 68, cpu list 60-63, effective list 12 > irq 69, cpu list 64-67, effective list 13 > irq 70, cpu list 68-71, effective list 14 > irq 71, cpu list 72-75, effective list 15 > irq 72, cpu list 76-79, effective list 16 > irq 73, cpu list 80-83, effective list 17 > irq 74, cpu list 84-87, effective list 18 > irq 75, cpu list 88-91, effective list 19 > irq 76, cpu list 92-95, effective list 20 Same with above, so looks Marc's patch is wrong. 
> irq 77, cpu list 0-3, effective list 3 > irq 78, cpu list 4-7, effective list 4 > irq 79, cpu list 8-11, effective list 8 > irq 80, cpu list 12-15, effective list 12 > irq 81, cpu list 16-19, effective list 16 > irq 82, cpu list 20-23, effective list 23 > PCI name is 81:00.0: nvme1n1 > irq 100, cpu list 0-3, effective list 0 > irq 101, cpu list 4-7, effective list 5 > irq 102, cpu list 8-11, effective list 9 > irq 103, cpu list 12-15, effective list 13 > irq 104, cpu list 16-19, effective list 17 > irq 105, cpu list 20-23, effective list 21 > irq 57, cpu list 63, effective list 7 > irq 83, cpu list 24-28, effective list 5 > irq 84, cpu list 29-33, effective list 6 > irq 85, cpu list 34-38, effective list 8 > irq 86, cpu list 39-43, effective list 9 > irq 87, cpu list 44-47, effective list 10 > irq 88, cpu list 48-51, effective list 11 > irq 89, cpu list 52-55, effective list 12 > irq 90, cpu list 56-59, effective list 13 > irq 91, cpu list 60-63, effective list 14 > irq 92, cpu list 64-67, effective list 15 > irq 93, cpu list 68-71, effective list 16 > irq 94, cpu list 72-75, effective list 17 > irq 95, cpu list 76-79, effective list 18 > irq 96, cpu list 80-83, effective list 19 > irq 97, cpu list 84-87, effective list 20 > irq 98, cpu list 88-91, effective list 21 > irq 99, cpu list 92-95, effective list 22 More are wrong. > john@ubuntu:~$ > > but you can see that CPU16 is handling irq72, 81, and 93. As I mentioned, the effective affinity has to be subset of the irq's affinity. > > > If unique effective CPU is assigned to each hw queue's irq, and the RCU > > stall can still be triggered, let's investigate further, given one single > > ARM64 CPU core should be quick enough to handle IO completion from single > > NVNe drive. > > If I remove the code for bring the affinity within the ITS numa node mask - > as Marc hinted - then I still get a lockup, but we still we have CPUs > serving multiple interrupts: > > 116.166881] rcu: INFO: rcu_preempt self-detected stall on CPU > [ 116.181432] Task dump for CPU 4: > [ 116.181502] Task dump for CPU 8: > john@ubuntu:~$ ./dump-io-irq-affinity > kernel version: > Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1443 SMP PREEMPT Fri Dec > 13 15:29:55 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux > PCI name is 04:00.0: nvme0n1 > irq 56, cpu list 75, effective list 75 > irq 60, cpu list 24-28, effective list 25 > irq 61, cpu list 29-33, effective list 29 > irq 62, cpu list 34-38, effective list 34 > irq 63, cpu list 39-43, effective list 39 > irq 64, cpu list 44-47, effective list 44 > irq 65, cpu list 48-51, effective list 49 > irq 66, cpu list 52-55, effective list 55 > irq 67, cpu list 56-59, effective list 56 > irq 68, cpu list 60-63, effective list 61 > irq 69, cpu list 64-67, effective list 64 > irq 70, cpu list 68-71, effective list 68 > irq 71, cpu list 72-75, effective list 73 > irq 72, cpu list 76-79, effective list 76 > irq 73, cpu list 80-83, effective list 80 > irq 74, cpu list 84-87, effective list 85 > irq 75, cpu list 88-91, effective list 88 > irq 76, cpu list 92-95, effective list 92 > irq 77, cpu list 0-3, effective list 1 > irq 78, cpu list 4-7, effective list 4 > irq 79, cpu list 8-11, effective list 8 > irq 80, cpu list 12-15, effective list 14 > irq 81, cpu list 16-19, effective list 16 > irq 82, cpu list 20-23, effective list 20 > PCI name is 81:00.0: nvme1n1 > irq 100, cpu list 0-3, effective list 0 > irq 101, cpu list 4-7, effective list 4 > irq 102, cpu list 8-11, effective list 8 > irq 103, cpu list 12-15, effective list 13 > 
irq 104, cpu list 16-19, effective list 16 > irq 105, cpu list 20-23, effective list 20 > irq 57, cpu list 63, effective list 63 > irq 83, cpu list 24-28, effective list 26 > irq 84, cpu list 29-33, effective list 31 > irq 85, cpu list 34-38, effective list 35 > irq 86, cpu list 39-43, effective list 40 > irq 87, cpu list 44-47, effective list 45 > irq 88, cpu list 48-51, effective list 50 > irq 89, cpu list 52-55, effective list 52 > irq 90, cpu list 56-59, effective list 57 > irq 91, cpu list 60-63, effective list 62 > irq 92, cpu list 64-67, effective list 65 > irq 93, cpu list 68-71, effective list 69 > irq 94, cpu list 72-75, effective list 74 > irq 95, cpu list 76-79, effective list 77 > irq 96, cpu list 80-83, effective list 81 > irq 97, cpu list 84-87, effective list 86 > irq 98, cpu list 88-91, effective list 89 > irq 99, cpu list 92-95, effective list 93 > john@ubuntu:~$ > > I'm now thinking that we should just attempt this intelligent CPU affinity > assignment for managed interrupts. Right, the rule is simple: distribute effective list among CPUs evenly, meantime select the effective CPU from the irq's affinity mask. Thanks, Ming
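For illustration, the balancing rule Ming describes could be sketched along these lines: keep a per-CPU count of interrupts targeted at each CPU and pick the least-loaded online CPU inside the irq's affinity mask. The names (cpu_irq_count, pick_effective_cpu) are illustrative only, not actual kernel symbols, and this is a sketch of the rule rather than a patch.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/percpu.h>

/* Per-CPU count of interrupts currently targeted at each CPU (illustrative). */
static DEFINE_PER_CPU(atomic_t, cpu_irq_count);

/* Pick the least-loaded online CPU from the irq's affinity mask. */
static unsigned int pick_effective_cpu(const struct cpumask *aff)
{
        unsigned int cpu, best = cpumask_first(cpu_online_mask);
        int best_count = INT_MAX;

        for_each_cpu_and(cpu, aff, cpu_online_mask) {
                int count = atomic_read(per_cpu_ptr(&cpu_irq_count, cpu));

                if (count < best_count) {
                        best_count = count;
                        best = cpu;
                }
        }

        /* Account for the new target so later picks spread out evenly. */
        atomic_inc(per_cpu_ptr(&cpu_irq_count, best));
        return best;
}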
On 13/12/2019 17:12, Ming Lei wrote:
>> irq 96, cpu list 80-83, effective list 81
>> irq 97, cpu list 84-87, effective list 86
>> irq 98, cpu list 88-91, effective list 89
>> irq 99, cpu list 92-95, effective list 93
>> john@ubuntu:~$
>>
>> I'm now thinking that we should just attempt this intelligent CPU affinity
>> assignment for managed interrupts.
> Right, the rule is simple: distribute effective list among CPUs evenly,
> meantime select the effective CPU from the irq's affinity mask.
>

Even if we fix that, there is still the potential for a CPU to end up handling
multiple nvme completion queues due to many factors, like CPU count, probe
ordering, other PCI endpoints in the system, etc., so this lockup needs to be
remedied.

Thanks,
John
On Fri, 13 Dec 2019 12:08:54 +0000, John Garry <john.garry@huawei.com> wrote: > > Hi Marc, > > >> JFYI, we're still testing this and the patch itself seems to work as > >> intended. > >> > >> Here's the kernel log if you just want to see how the interrupts are > >> getting assigned: > >> https://pastebin.com/hh3r810g > > > > It is a bit hard to make sense of this dump, specially on such a wide > > machine (I want one!) > > So do I :) That's the newer "D06CS" board. > > without really knowing the topology of the system. > > So it's 2x socket, each socket has 2x CPU dies, and each die has 6 > clusters of 4 CPUs, which gives 96 in total. > > > > >> For me, I did get a performance boost for NVMe testing, but my > >> colleague Xiang Chen saw a drop for our storage test of interest - > >> that's the HiSi SAS controller. We're trying to make sense of it now. > > > > One of the difference is that with this patch, the initial affinity > > is picked inside the NUMA node that matches the ITS. > > Is that even for managed interrupts? We're testing the storage > controller which uses managed interrupts. I should have made that > clearer. The ITS driver doesn't care about the fact that an interrupt affinity is 'managed' or not. And I don't think a low-level driver should, as it will just follow whatever interrupt affinity it is requested to use. If a managed interrupt has some requirements, then these requirements better be explicit in terms of CPU affinity. > In your case, > > that's either node 0 or 2. But it is unclear whether which CPUs these > > map to. > > > > Given that I see interrupts mapped to CPUs 0-23 on one side, and 48-71 > > on the other, it looks like half of your machine gets starved, > > Seems that way. > > So this is a mystery to me: > > [ 23.584192] picked CPU62 IRQ147 > > 147: 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 ITS-MSI 94404626 Edge hisi_sas_v3_hw cq > > > and > > [ 25.896728] picked CPU62 IRQ183 > > 183: 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > 0 0 ITS-MSI 94437398 Edge hisi_sas_v3_hw cq > > > But mpstat reports for CPU62: > > 12:44:58 AM CPU %usr %nice %sys %iowait %irq %soft > %steal %guest %gnice %idle > 12:45:00 AM 62 6.54 0.00 42.99 0.00 6.54 12.15 > 0.00 0.00 6.54 25.23 > > I don't know what interrupts they are... Clearly, they aren't your SAS interrupts. But the debug print do not mean that these are the only interrupts that are targeting CPU62. Looking at the 62nd column of /proc/interrupts should tell you what fires (and my bet is on something like the timer). > It's the "hisi_sas_v3_hw cq" interrupts which we're interested in. Clearly, they aren't firing. > and that > > may be because no ITS targets the NUMA nodes they are part of. > > So both storage controllers (which we're interested in for this test) > are on socket #0, node #0. > > It would > > be interesting to see what happens if you manually set the affinity > > of the interrupts outside of the NUMA node. > > > > Again, managed, so I don't think it's possible. OK, we need to get back to what the actual requirements of a 'managed' interrupt are, because there is clearly something that hasn't made it into the core code... M. -- Jazz is not dead, it just smells funny.
On Fri, 13 Dec 2019 15:43:07 +0000 John Garry <john.garry@huawei.com> wrote: [...] > john@ubuntu:~$ ./dump-io-irq-affinity > kernel version: > Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux > PCI name is 04:00.0: nvme0n1 > irq 56, cpu list 75, effective list 5 > irq 60, cpu list 24-28, effective list 10 The NUMA selection code definitely gets in the way. And to be honest, this NUMA thing is only there for the benefit of a terminally broken implementation (Cavium ThunderX), which we should have never supported the first place. Let's rework this and simply use the managed affinity whenever available instead. It may well be that it will break TX1, but I care about it just as much as Cavium/Marvell does... Please give this new patch a shot on your system (my D05 doesn't have any managed devices): https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a Thanks, M. -- Jazz is not dead. It just smells funny...
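The shape of the idea, very roughly (a sketch to illustrate the distinction, not the patch linked above): a managed interrupt takes its target CPU straight from the managed affinity mask, while a non-managed one keeps the old preference for CPUs local to the ITS' NUMA node.

#include <linux/cpumask.h>
#include <linux/irq.h>

static unsigned int select_target_cpu(struct irq_data *d,
                                      const struct cpumask *aff,
                                      const struct cpumask *its_node_mask)
{
        unsigned int cpu;

        /* Managed: the mask already encodes the queue<->CPU mapping, honour it. */
        if (irqd_affinity_is_managed(d))
                return cpumask_first_and(aff, cpu_online_mask);

        /* Non-managed: prefer the ITS' node, fall back to the full mask. */
        cpu = cpumask_first_and(aff, its_node_mask);
        if (cpu >= nr_cpu_ids)
                cpu = cpumask_first_and(aff, cpu_online_mask);

        return cpu;
}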
On 14/12/2019 13:56, Marc Zyngier wrote: > On Fri, 13 Dec 2019 15:43:07 +0000 > John Garry <john.garry@huawei.com> wrote: > > [...] > >> john@ubuntu:~$ ./dump-io-irq-affinity >> kernel version: >> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux >> PCI name is 04:00.0: nvme0n1 >> irq 56, cpu list 75, effective list 5 >> irq 60, cpu list 24-28, effective list 10 > > The NUMA selection code definitely gets in the way. And to be honest, > this NUMA thing is only there for the benefit of a terminally broken > implementation (Cavium ThunderX), which we should have never supported > the first place. > > Let's rework this and simply use the managed affinity whenever > available instead. It may well be that it will break TX1, but I care > about it just as much as Cavium/Marvell does... I'm just wondering if non-managed interrupts should be included in the load balancing calculation? Couldn't irqbalance (if active) start moving non-managed interrupts around anyway? > > Please give this new patch a shot on your system (my D05 doesn't have > any managed devices): We could consider supporting platform msi managed interrupts, but I doubt the value. > > https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a > I quickly tested that in my NVMe env, and I see a performance boost of 1055K -> 1206K IOPS. Results at bottom. Here's the irq mapping dump: PCI name is 04:00.0: nvme0n1 irq 56, cpu list 75, effective list 75 irq 60, cpu list 24-28, effective list 26 irq 61, cpu list 29-33, effective list 30 irq 62, cpu list 34-38, effective list 35 irq 63, cpu list 39-43, effective list 40 irq 64, cpu list 44-47, effective list 45 irq 65, cpu list 48-51, effective list 49 irq 66, cpu list 52-55, effective list 55 irq 67, cpu list 56-59, effective list 57 irq 68, cpu list 60-63, effective list 61 irq 69, cpu list 64-67, effective list 65 irq 70, cpu list 68-71, effective list 69 irq 71, cpu list 72-75, effective list 73 irq 72, cpu list 76-79, effective list 77 irq 73, cpu list 80-83, effective list 81 irq 74, cpu list 84-87, effective list 85 irq 75, cpu list 88-91, effective list 89 irq 76, cpu list 92-95, effective list 93 irq 77, cpu list 0-3, effective list 1 irq 78, cpu list 4-7, effective list 6 irq 79, cpu list 8-11, effective list 9 irq 80, cpu list 12-15, effective list 13 irq 81, cpu list 16-19, effective list 17 irq 82, cpu list 20-23, effective list 21 PCI name is 81:00.0: nvme1n1 irq 100, cpu list 0-3, effective list 0 irq 101, cpu list 4-7, effective list 4 irq 102, cpu list 8-11, effective list 8 irq 103, cpu list 12-15, effective list 12 irq 104, cpu list 16-19, effective list 16 irq 105, cpu list 20-23, effective list 20 irq 57, cpu list 63, effective list 63 irq 83, cpu list 24-28, effective list 26 irq 84, cpu list 29-33, effective list 29 irq 85, cpu list 34-38, effective list 34 irq 86, cpu list 39-43, effective list 39 irq 87, cpu list 44-47, effective list 44 irq 88, cpu list 48-51, effective list 48 irq 89, cpu list 52-55, effective list 54 irq 90, cpu list 56-59, effective list 56 irq 91, cpu list 60-63, effective list 60 irq 92, cpu list 64-67, effective list 64 irq 93, cpu list 68-71, effective list 68 irq 94, cpu list 72-75, effective list 72 irq 95, cpu list 76-79, effective list 76 irq 96, cpu list 80-83, effective list 80 irq 97, cpu list 84-87, effective list 84 irq 98, cpu list 88-91, effective list 88 irq 99, 
cpu list 92-95, effective list 92 I'm still getting the CPU lockup (even on CPUs which have a single NVMe completion interrupt assigned), which taints these results. That lockup needs to be fixed. We'll check on our SAS env also. I did already hack something up similar to your change and again we saw a boost there. Thanks, John before job1: (groupid=0, jobs=20): err= 0: pid=1328: Mon Dec 16 10:03:35 2019 read: IOPS=1055k, BW=4121MiB/s (4322MB/s)(994GiB/246946msec) slat (usec): min=2, max=36747k, avg= 6.08, stdev=4018.85 clat (usec): min=13, max=145774k, avg=369.87, stdev=50221.38 lat (usec): min=22, max=145774k, avg=376.12, stdev=50387.08 clat percentiles (usec): | 1.00th=[ 105], 5.00th=[ 128], 10.00th=[ 149], 20.00th=[ 178], | 30.00th=[ 210], 40.00th=[ 243], 50.00th=[ 281], 60.00th=[ 326], | 70.00th=[ 396], 80.00th=[ 486], 90.00th=[ 570], 95.00th=[ 619], | 99.00th=[ 775], 99.50th=[ 906], 99.90th=[ 1254], 99.95th=[ 1631], | 99.99th=[ 3884] bw ( KiB/s): min= 8, max=715687, per=5.65%, avg=238518.42, stdev=115795.80, samples=8726 iops : min= 2, max=178921, avg=59629.49, stdev=28948.95, samples=8726 lat (usec) : 20=0.01%, 50=0.01%, 100=0.60%, 250=41.66%, 500=39.19% lat (usec) : 750=17.36%, 1000=0.95% lat (msec) : 2=0.20%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01%, >=2000=0.01% cpu : usr=8.26%, sys=33.56%, ctx=132171506, majf=0, minf=6774 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwt: total=260541724,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=20 Run status group 0 (all jobs): READ: bw=4121MiB/s (4322MB/s), 4121MiB/s-4121MiB/s (4322MB/s-4322MB/s), io=994GiB (1067GB), run=246946-246946msec Disk stats (read/write): nvme0n1: ios=136993553/0, merge=0/0, ticks=42019997/0, in_queue=14168, util=100.00% nvme1n1: ios=123408538/0, merge=0/0, ticks=37371364/0, in_queue=44672, util=100.00% john@ubuntu:~$ dmesg | grep "Linux v" [ 0.000000] Linux version 5.5.0-rc1-dirty (john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05-rc1 revision 38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] (Linaro GCC 7.3-2018.05-rc1)) #546 SMP PREEMPT Mon Dec 16 09:47:44 GMT 2019 john@ubuntu:~$ after Creat 4k_read_depth20_fiotest file sucessfully_cpu_liuyifan_nvme.sh 4k read 20 1 job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... 
fio-3.1 Starting 20 processes [ 318.569268] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 04m:30s] [ 318.575010] rcu: 26-....: (1 GPs behind) idle=b82/1/0x4000000000000004 softirq=842/843 fqs=2508 [ 355.781759] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 03m:53s] [ 355.787499] rcu: 34-....: (1 GPs behind) idle=35a/1/0x4000000000000004 softirq=10395/11729 fqs=2623 [ 407.805329] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 03m:01s] [ 407.811069] rcu: 0-....: (1 GPs behind) idle=0ba/0/0x3 softirq=10830/14926 fqs=2625 Jobs: 20 (f=20): [R(20)][61.0%][r=4747MiB/s,w=0KiB/s][r=1215k,w=0[ 470.817317] rcu: INFO: rcu_preempt self-detected stall on CPU [ 470.824912] rcu: 0-....: (2779 ticks this GP) idle=0ba/0/0x3 softirq=14927/14927 fqs=10501 [ 533.829618] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 00m:54s] [ 533.835360] rcu: 39-....: (1 GPs behind) idle=74e/1/0x4000000000000004 softirq=3422/3422 fqs=17226 Jobs: 20 (f=20): [R(20)][100.0%][r=4822MiB/s,w=0KiB/s][r=1234k,w=0 IOPS][eta 00m:00s] job1: (groupid=0, jobs=20): err= 0: pid=1273: Mon Dec 16 10:15:55 2019 read: IOPS=1206k, BW=4712MiB/s (4941MB/s)(1381GiB/300002msec) slat (usec): min=2, max=165648k, avg= 7.26, stdev=10373.59 clat (usec): min=12, max=191808k, avg=323.17, stdev=57005.77 lat (usec): min=19, max=191808k, avg=330.59, stdev=58014.79 clat percentiles (usec): | 1.00th=[ 106], 5.00th=[ 151], 10.00th=[ 174], 20.00th=[ 194], | 30.00th=[ 212], 40.00th=[ 231], 50.00th=[ 247], 60.00th=[ 262], | 70.00th=[ 285], 80.00th=[ 330], 90.00th=[ 457], 95.00th=[ 537], | 99.00th=[ 676], 99.50th=[ 807], 99.90th=[ 1647], 99.95th=[ 2376], | 99.99th=[ 6915] bw ( KiB/s): min= 8, max=648593, per=5.73%, avg=276597.82, stdev=98174.89, samples=10475 iops : min= 2, max=162148, avg=69149.31, stdev=24543.72, samples=10475 lat (usec) : 20=0.01%, 50=0.01%, 100=0.67%, 250=51.48%, 500=41.68% lat (usec) : 750=5.54%, 1000=0.33% lat (msec) : 2=0.23%, 4=0.05%, 10=0.02%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01%, >=2000=0.01% cpu : usr=9.77%, sys=41.68%, ctx=218155976, majf=0, minf=6376 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwt: total=361899317,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=20 Run status group 0 (all jobs): READ: bw=4712MiB/s (4941MB/s), 4712MiB/s-4712MiB/s (4941MB/s-4941MB/s), io=1381GiB (1482GB), run=300002-300002msec Disk stats (read/write): nvme0n1: ios=188627578/0, merge=0/0, ticks=50365208/0, in_queue=55380, util=100.00% nvme1n1: ios=173066657/0, merge=0/0, ticks=38804419/0, in_queue=151212, util=100.00% john@ubuntu:~$ dmesg | grep "Linux v" [ 0.000000] Linux version 5.5.0-rc1-00001-g1e987d83b8d8-dirty (john@john-ThinkCentre-M93p) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05-rc1 revision 38aec9a676236eaa42ca03ccb3a6c1dd0182c29f] (Linaro GCC 7.3-2018.05-rc1)) #547 SMP PREEMPT Mon Dec 16 10:02:27 GMT 2019
On 2019-12-16 10:47, John Garry wrote: > On 14/12/2019 13:56, Marc Zyngier wrote: >> On Fri, 13 Dec 2019 15:43:07 +0000 >> John Garry <john.garry@huawei.com> wrote: >> [...] >> >>> john@ubuntu:~$ ./dump-io-irq-affinity >>> kernel version: >>> Linux ubuntu 5.5.0-rc1-00003-g7adc5d7ec1ca-dirty #1440 SMP PREEMPT >>> Fri Dec 13 14:53:19 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux >>> PCI name is 04:00.0: nvme0n1 >>> irq 56, cpu list 75, effective list 5 >>> irq 60, cpu list 24-28, effective list 10 >> The NUMA selection code definitely gets in the way. And to be >> honest, >> this NUMA thing is only there for the benefit of a terminally broken >> implementation (Cavium ThunderX), which we should have never >> supported >> the first place. >> Let's rework this and simply use the managed affinity whenever >> available instead. It may well be that it will break TX1, but I care >> about it just as much as Cavium/Marvell does... > > I'm just wondering if non-managed interrupts should be included in > the load balancing calculation? Couldn't irqbalance (if active) start > moving non-managed interrupts around anyway? But they are, aren't they? See what we do in irq_set_affinity: + atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu)); + atomic_dec(per_cpu_ptr(&cpu_lpi_count, + its_dev->event_map.col_map[id])); We don't try to "rebalance" anything based on that though, not that I think we should. > >> Please give this new patch a shot on your system (my D05 doesn't >> have >> any managed devices): > > We could consider supporting platform msi managed interrupts, but I > doubt the value. It shouldn't be hard to do, and most of the existing code could be moved to the generic level. As for the value, I'm not convinced either. For example D05 uses the MBIGEN as an intermediate interrupt controller, so MSIs are from the PoV of MBIGEN, and not the SAS device attached to it. Not the best design... >> >> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a >> > > I quickly tested that in my NVMe env, and I see a performance boost > of 1055K -> 1206K IOPS. Results at bottom. OK, that's encouraging. > Here's the irq mapping dump: [...] Looks good. > I'm still getting the CPU lockup (even on CPUs which have a single > NVMe completion interrupt assigned), which taints these results. That > lockup needs to be fixed. Is this interrupt screaming to the point where it prevents the completion thread from making forward progress? What if you don't use threaded interrupts? > We'll check on our SAS env also. I did already hack something up > similar to your change and again we saw a boost there. OK. Please keep me posted. If the result is overall positive, I'll push this into -next for some soaking. Thanks, M. -- Jazz is not dead. It just smells funny...
Hi Marc, >> >> I'm just wondering if non-managed interrupts should be included in >> the load balancing calculation? Couldn't irqbalance (if active) start >> moving non-managed interrupts around anyway? > > But they are, aren't they? See what we do in irq_set_affinity: > > + atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu)); > + atomic_dec(per_cpu_ptr(&cpu_lpi_count, > + its_dev->event_map.col_map[id])); > > We don't try to "rebalance" anything based on that though, not that > I think we should. Ah sorry, I meant whether they should not be included. In its_irq_domain_activate(), we increment the per-cpu lpi count and also use its_pick_target_cpu() to find the least loaded cpu. I am asking whether we should just stick with the old policy for non-managed interrupts here. After checking D05, I see a very significant performance hit for SAS controller performance - ~40% throughout lowering. With this patch, now we have effective affinity targeted at seemingly "random" CPUs, as opposed to all just using CPU0. This affects performance. The difference is that when we use managed interrupts - like for NVME or D06 SAS controller - the irq cpu affinity mask matches the CPUs which enqueue the requests to the queue associated with the interrupt. So there is an efficiency is enqueuing and deqeueing on same CPU group - all related to blk multi-queue. And this is not the case for non-managed interrupts. >> >>> Please give this new patch a shot on your system (my D05 doesn't have >>> any managed devices): >> >> We could consider supporting platform msi managed interrupts, but I >> doubt the value. > > It shouldn't be hard to do, and most of the existing code could be > moved to the generic level. As for the value, I'm not convinced > either. For example D05 uses the MBIGEN as an intermediate interrupt > controller, so MSIs are from the PoV of MBIGEN, and not the SAS device > attached to it. Not the best design... JFYI, I did raise this following topic before, but that's as far as I got: https://marc.info/?l=linux-block&m=150722088314310&w=2 > >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/its-balance-mappings&id=1e987d83b8d880d56c9a2d8a86289631da94e55a >>> >>> >> >> I quickly tested that in my NVMe env, and I see a performance boost >> of 1055K -> 1206K IOPS. Results at bottom. > > OK, that's encouraging. > >> Here's the irq mapping dump: > > [...] > > Looks good. > >> I'm still getting the CPU lockup (even on CPUs which have a single >> NVMe completion interrupt assigned), which taints these results. That >> lockup needs to be fixed. > > Is this interrupt screaming to the point where it prevents the completion > thread from making forward progress? What if you don't use threaded > interrupts? Yeah, just switching to threaded interrupts solves it (nvme core has a switch for this). So there was a big discussion on this topic a while ago: https://lkml.org/lkml/2019/8/20/45 (couldn't find this on lore) The conclusion there was to switch to irq poll, but leiming though that it was another issue - see earlier mail: https://lore.kernel.org/lkml/20191210014335.GA25022@ming.t460p/ > >> We'll check on our SAS env also. I did already hack something up >> similar to your change and again we saw a boost there. > > OK. Please keep me posted. If the result is overall positive, I'll > push this into -next for some soaking. > ok, thanks John
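For reference, the irq_poll approach mentioned above looks roughly like this in outline (illustrative names, not the actual NVMe or SAS code): the hard handler only schedules the poller, and completions are drained with a budget so a flood of completions cannot monopolise one CPU indefinitely.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/kernel.h>

struct example_cq {
        struct irq_poll iop;
        /* ... completion queue state ... */
};

static int example_cq_poll(struct irq_poll *iop, int budget)
{
        struct example_cq *cq = container_of(iop, struct example_cq, iop);
        int done = 0;

        /* process up to 'budget' completions from cq here, counting 'done' */

        if (done < budget)
                irq_poll_complete(&cq->iop);    /* queue drained, re-enable */

        return done;
}

static irqreturn_t example_cq_isr(int irq, void *data)
{
        struct example_cq *cq = data;

        irq_poll_sched(&cq->iop);
        return IRQ_HANDLED;
}

/* At init time: irq_poll_init(&cq->iop, 32, example_cq_poll); */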
Hi John, On 2019-12-16 14:17, John Garry wrote: > Hi Marc, > >>> >>> I'm just wondering if non-managed interrupts should be included in >>> the load balancing calculation? Couldn't irqbalance (if active) >>> start >>> moving non-managed interrupts around anyway? >> But they are, aren't they? See what we do in irq_set_affinity: >> + atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu)); >> + atomic_dec(per_cpu_ptr(&cpu_lpi_count, >> + its_dev->event_map.col_map[id])); >> We don't try to "rebalance" anything based on that though, not that >> I think we should. > > Ah sorry, I meant whether they should not be included. In > its_irq_domain_activate(), we increment the per-cpu lpi count and > also > use its_pick_target_cpu() to find the least loaded cpu. I am asking > whether we should just stick with the old policy for non-managed > interrupts here. > > After checking D05, I see a very significant performance hit for SAS > controller performance - ~40% throughout lowering. -ETOOMANYMOVINGPARTS. > With this patch, now we have effective affinity targeted at seemingly > "random" CPUs, as opposed to all just using CPU0. This affects > performance. And piling all interrupts on the same CPU does help? > The difference is that when we use managed interrupts - like for NVME > or D06 SAS controller - the irq cpu affinity mask matches the CPUs > which enqueue the requests to the queue associated with the > interrupt. > So there is an efficiency is enqueuing and deqeueing on same CPU > group > - all related to blk multi-queue. And this is not the case for > non-managed interrupts. So you enqueue requests from CPU0 only? It seems a bit odd... >>>> Please give this new patch a shot on your system (my D05 doesn't >>>> have >>>> any managed devices): >>> >>> We could consider supporting platform msi managed interrupts, but I >>> doubt the value. >> It shouldn't be hard to do, and most of the existing code could be >> moved to the generic level. As for the value, I'm not convinced >> either. For example D05 uses the MBIGEN as an intermediate interrupt >> controller, so MSIs are from the PoV of MBIGEN, and not the SAS >> device >> attached to it. Not the best design... > > JFYI, I did raise this following topic before, but that's as far as I > got: > > https://marc.info/?l=linux-block&m=150722088314310&w=2 Yes. And that's probably not very hard, but the problem in your case is that the D05 HW is not using MSIs... You'd have to provide an abstraction for wired interrupts (please don't). You'd be better off directly setting the affinity of the interrupts from the driver, but I somehow can't believe that you're only submitting requests from the same CPU, always. There must be something I'm missing. Thanks, M. -- Jazz is not dead. It just smells funny...
Hi Marc, >> >>>> >>>> I'm just wondering if non-managed interrupts should be included in >>>> the load balancing calculation? Couldn't irqbalance (if active) start >>>> moving non-managed interrupts around anyway? >>> But they are, aren't they? See what we do in irq_set_affinity: >>> + atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu)); >>> + atomic_dec(per_cpu_ptr(&cpu_lpi_count, >>> + its_dev->event_map.col_map[id])); >>> We don't try to "rebalance" anything based on that though, not that >>> I think we should. >> >> Ah sorry, I meant whether they should not be included. In >> its_irq_domain_activate(), we increment the per-cpu lpi count and also >> use its_pick_target_cpu() to find the least loaded cpu. I am asking >> whether we should just stick with the old policy for non-managed >> interrupts here. >> >> After checking D05, I see a very significant performance hit for SAS >> controller performance - ~40% throughout lowering. > > -ETOOMANYMOVINGPARTS. Understood. > >> With this patch, now we have effective affinity targeted at seemingly >> "random" CPUs, as opposed to all just using CPU0. This affects >> performance. > > And piling all interrupts on the same CPU does help? Apparently... I need to check this more. > >> The difference is that when we use managed interrupts - like for NVME >> or D06 SAS controller - the irq cpu affinity mask matches the CPUs >> which enqueue the requests to the queue associated with the interrupt. >> So there is an efficiency is enqueuing and deqeueing on same CPU group >> - all related to blk multi-queue. And this is not the case for >> non-managed interrupts. > > So you enqueue requests from CPU0 only? It seems a bit odd... No, but maybe I wasn't clear enough. I'll give an overview: For D06 SAS controller - which is a multi-queue PCI device - we use managed interrupts. The HW has 16 submission/completion queues, so for 96 cores, we have an even spread of 6 CPUs assigned per queue; and this per-queue CPU mask is the interrupt affinity mask. So CPU0-5 would submit any IO on queue0, CPU6-11 on queue2, and so on. PCI NVMe is essentially the same. These are the environments which we're trying to promote performance. Then for D05 SAS controller - which is multi-queue platform device (mbigen) - we don't use managed interrupts. We still submit IO from any CPU, but we choose the queue to submit IO on a round-robin basis to promote some isolation, i.e. reduce inter-queue lock contention, so the queue chosen has nothing to do with the CPU. And with your change we may submit on cpu4 but service the interrupt on cpu30, as an example. While previously we would always service on cpu0. The old way still isn't ideal, I'll admit. For this env, we would just like to maintain the same performance. And it's here that we see the performance drop. > >>>>> Please give this new patch a shot on your system (my D05 doesn't have >>>>> any managed devices): >>>> >>>> We could consider supporting platform msi managed interrupts, but I >>>> doubt the value. >>> It shouldn't be hard to do, and most of the existing code could be >>> moved to the generic level. As for the value, I'm not convinced >>> either. For example D05 uses the MBIGEN as an intermediate interrupt >>> controller, so MSIs are from the PoV of MBIGEN, and not the SAS device >>> attached to it. Not the best design... >> >> JFYI, I did raise this following topic before, but that's as far as I >> got: >> >> https://marc.info/?l=linux-block&m=150722088314310&w=2 > > Yes. 
> And that's probably not very hard, but the problem in your case is
> that the D05 HW is not using MSIs...

Right

> You'd have to provide an abstraction
> for wired interrupts (please don't).
>
> You'd be better off directly setting the affinity of the interrupts from
> the driver, but I somehow can't believe that you're only submitting
> requests
> from the same CPU,

Maybe...

> always. There must be something I'm missing.
>

Thanks,
John
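For context, the managed spread John describes for D06 SAS and NVMe typically comes from the PCI core itself. A sketch of how a multi-queue PCI driver requests it (illustrative, not the real hisi_sas_v3_hw or nvme probe code): PCI_IRQ_AFFINITY makes the vectors managed and pre-spreads all CPUs over the queue vectors (96 CPUs / 16 queues = 6 CPUs per queue), and blk-mq then maps submitting CPUs to queues using the same masks.

#include <linux/interrupt.h>
#include <linux/pci.h>

#define EXAMPLE_NR_CQ_VECTORS   16      /* e.g. the 16 completion queues on D06 */

static int example_setup_cq_irqs(struct pci_dev *pdev)
{
        struct irq_affinity affd = {
                .pre_vectors = 1,       /* one non-queue vector, e.g. for events */
        };
        int nvec;

        /* Managed, pre-spread queue vectors plus one unmanaged vector. */
        nvec = pci_alloc_irq_vectors_affinity(pdev, 2,
                                              EXAMPLE_NR_CQ_VECTORS + 1,
                                              PCI_IRQ_MSI | PCI_IRQ_AFFINITY,
                                              &affd);

        return nvec < 0 ? nvec : 0;
}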
>> So you enqueue requests from CPU0 only? It seems a bit odd... > > No, but maybe I wasn't clear enough. I'll give an overview: > > For D06 SAS controller - which is a multi-queue PCI device - we use > managed interrupts. The HW has 16 submission/completion queues, so for > 96 cores, we have an even spread of 6 CPUs assigned per queue; and this > per-queue CPU mask is the interrupt affinity mask. So CPU0-5 would > submit any IO on queue0, CPU6-11 on queue2, and so on. PCI NVMe is > essentially the same. > > These are the environments which we're trying to promote performance. > > Then for D05 SAS controller - which is multi-queue platform device > (mbigen) - we don't use managed interrupts. We still submit IO from any > CPU, but we choose the queue to submit IO on a round-robin basis to > promote some isolation, i.e. reduce inter-queue lock contention, so the > queue chosen has nothing to do with the CPU. > > And with your change we may submit on cpu4 but service the interrupt on > cpu30, as an example. While previously we would always service on cpu0. > The old way still isn't ideal, I'll admit. > > For this env, we would just like to maintain the same performance. And > it's here that we see the performance drop. > Hi Marc, We've got some more results and it looks promising. So with your patch we get a performance boost of 3180.1K -> 3294.9K IOPS in the D06 SAS env. Then when we change the driver to use threaded interrupt handler (mainline currently uses tasklet), we get a boost again up to 3415K IOPS. Now this is essentially the same figure we had with using threaded handler + the gen irq change in spreading the handler CPU affinity. We did also test your patch + gen irq change and got a performance drop, to 3347K IOPS. So tentatively I'd say your patch may be all we need. FYI, here is how the effective affinity is looking for both SAS controllers with your patch: 74:02.0 irq 81, cpu list 24-29, effective list 24 cq irq 82, cpu list 30-35, effective list 30 cq irq 83, cpu list 36-41, effective list 36 cq irq 84, cpu list 42-47, effective list 42 cq irq 85, cpu list 48-53, effective list 48 cq irq 86, cpu list 54-59, effective list 56 cq irq 87, cpu list 60-65, effective list 60 cq irq 88, cpu list 66-71, effective list 66 cq irq 89, cpu list 72-77, effective list 72 cq irq 90, cpu list 78-83, effective list 78 cq irq 91, cpu list 84-89, effective list 84 cq irq 92, cpu list 90-95, effective list 90 cq irq 93, cpu list 0-5, effective list 0 cq irq 94, cpu list 6-11, effective list 6 cq irq 95, cpu list 12-17, effective list 12 cq irq 96, cpu list 18-23, effective list 18 cq 74:04.0 irq 113, cpu list 24-29, effective list 25 cq irq 114, cpu list 30-35, effective list 31 cq irq 115, cpu list 36-41, effective list 37 cq irq 116, cpu list 42-47, effective list 43 cq irq 117, cpu list 48-53, effective list 49 cq irq 118, cpu list 54-59, effective list 57 cq irq 119, cpu list 60-65, effective list 61 cq irq 120, cpu list 66-71, effective list 67 cq irq 121, cpu list 72-77, effective list 73 cq irq 122, cpu list 78-83, effective list 79 cq irq 123, cpu list 84-89, effective list 85 cq irq 124, cpu list 90-95, effective list 91 cq irq 125, cpu list 0-5, effective list 1 cq irq 126, cpu list 6-11, effective list 7 cq irq 127, cpu list 12-17, effective list 17 cq irq 128, cpu list 18-23, effective list 19 cq As for your patch itself, I'm still concerned of possible regressions if we don't apply this effective interrupt affinity spread policy to only managed interrupts. 
JFYI, about NVMe CPU lockup issue, there are 2 works on going here:

https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t

https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t

Cheers,
John

Ps. Thanks to Xiang Chen for all the work here in getting these results.
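In outline, the tasklet-to-threaded-handler conversion mentioned above amounts to something like the following (hypothetical names, not the actual hisi_sas change): the hard handler only wakes the irq thread, and the completion queue is drained in the thread, whose allowed CPUs follow the irq affinity when the interrupt is managed.

#include <linux/device.h>
#include <linux/interrupt.h>

static irqreturn_t example_cq_hard_irq(int irq, void *data)
{
        /* acknowledge/mask the completion-queue interrupt here if required */
        return IRQ_WAKE_THREAD;
}

static irqreturn_t example_cq_thread_fn(int irq, void *data)
{
        /* drain the completion queue identified by 'data' */
        return IRQ_HANDLED;
}

static int example_request_cq_irq(struct device *dev, int irq, void *cq)
{
        return devm_request_threaded_irq(dev, irq, example_cq_hard_irq,
                                         example_cq_thread_fn, IRQF_ONESHOT,
                                         "example-cq", cq);
}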
Hi John, On 2019-12-20 11:30, John Garry wrote: >>> So you enqueue requests from CPU0 only? It seems a bit odd... >> No, but maybe I wasn't clear enough. I'll give an overview: >> For D06 SAS controller - which is a multi-queue PCI device - we use >> managed interrupts. The HW has 16 submission/completion queues, so for >> 96 cores, we have an even spread of 6 CPUs assigned per queue; and >> this per-queue CPU mask is the interrupt affinity mask. So CPU0-5 >> would submit any IO on queue0, CPU6-11 on queue2, and so on. PCI NVMe >> is essentially the same. >> These are the environments which we're trying to promote >> performance. >> Then for D05 SAS controller - which is multi-queue platform device >> (mbigen) - we don't use managed interrupts. We still submit IO from >> any CPU, but we choose the queue to submit IO on a round-robin basis >> to promote some isolation, i.e. reduce inter-queue lock contention, so >> the queue chosen has nothing to do with the CPU. >> And with your change we may submit on cpu4 but service the interrupt >> on cpu30, as an example. While previously we would always service on >> cpu0. The old way still isn't ideal, I'll admit. >> For this env, we would just like to maintain the same performance. >> And it's here that we see the performance drop. >> > > Hi Marc, > > We've got some more results and it looks promising. > > So with your patch we get a performance boost of 3180.1K -> 3294.9K > IOPS in the D06 SAS env. Then when we change the driver to use > threaded interrupt handler (mainline currently uses tasklet), we get > a > boost again up to 3415K IOPS. > > Now this is essentially the same figure we had with using threaded > handler + the gen irq change in spreading the handler CPU affinity. > We > did also test your patch + gen irq change and got a performance drop, > to 3347K IOPS. > > So tentatively I'd say your patch may be all we need. OK. > FYI, here is how the effective affinity is looking for both SAS > controllers with your patch: > > 74:02.0 > irq 81, cpu list 24-29, effective list 24 cq > irq 82, cpu list 30-35, effective list 30 cq Cool. [...] > As for your patch itself, I'm still concerned of possible regressions > if we don't apply this effective interrupt affinity spread policy to > only managed interrupts. I'll try and revise that as I post the patch, probably at some point between now and Christmas. I still think we should find a way to address this for the D05 SAS driver though, maybe by managing the affinity yourself in the driver. But this requires experimentation. > JFYI, about NVMe CPU lockup issue, there are 2 works on going here: > > https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t > > https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t I've also managed to trigger some of them now that I have access to a decent box with nvme storage. Out of curiosity, have you tried with the SMMU disabled? I'm wondering whether we hit some livelock condition on unmapping buffers... > Cheers, > John > > Ps. Thanks to Xiang Chen for all the work here in getting these > results. Yup, much appreciated! Thanks, M. -- Jazz is not dead. It just smells funny...
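One way a driver could "manage the affinity yourself" for non-managed interrupts, sketched with illustrative names (John's actual experiment is the github commit linked in the next message): spread the queue interrupts round-robin over the online CPUs via affinity hints instead of leaving them all on CPU0.

#include <linux/cpumask.h>
#include <linux/interrupt.h>

static void example_spread_cq_irqs(const int *irqs, int nr_irqs)
{
        unsigned int cpu = cpumask_first(cpu_online_mask);
        int i;

        for (i = 0; i < nr_irqs; i++) {
                /* Hint one CPU per queue interrupt, wrapping around as needed. */
                irq_set_affinity_hint(irqs[i], cpumask_of(cpu));

                cpu = cpumask_next(cpu, cpu_online_mask);
                if (cpu >= nr_cpu_ids)
                        cpu = cpumask_first(cpu_online_mask);
        }
}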
>> We've got some more results and it looks promising.
>>
>> So with your patch we get a performance boost of 3180.1K -> 3294.9K
>> IOPS in the D06 SAS env. Then when we change the driver to use
>> threaded interrupt handler (mainline currently uses tasklet), we get a
>> boost again up to 3415K IOPS.
>>
>> Now this is essentially the same figure we had with using threaded
>> handler + the gen irq change in spreading the handler CPU affinity. We
>> did also test your patch + gen irq change and got a performance drop,
>> to 3347K IOPS.
>>
>> So tentatively I'd say your patch may be all we need.
>
> OK.
>
>> FYI, here is how the effective affinity is looking for both SAS
>> controllers with your patch:
>>
>> 74:02.0
>> irq 81, cpu list 24-29, effective list 24 cq
>> irq 82, cpu list 30-35, effective list 30 cq
>
> Cool.
>
> [...]
>
>> As for your patch itself, I'm still concerned of possible regressions
>> if we don't apply this effective interrupt affinity spread policy to
>> only managed interrupts.
>
> I'll try and revise that as I post the patch, probably at some point
> between now and Christmas. I still think we should find a way to
> address this for the D05 SAS driver though, maybe by managing the
> affinity yourself in the driver. But this requires experimentation.

I've already done something experimental for the driver to manage the
affinity, and performance is generally much better:

https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7

But I still think it's wise to only consider managed interrupts for now.

>
>> JFYI, about NVMe CPU lockup issue, there are 2 works on going here:
>>
>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t
>>
>> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t
>>
>
> I've also managed to trigger some of them now that I have access to
> a decent box with nvme storage.

I only have 2x NVMe SSDs when this occurs - I should not be hitting this...

> Out of curiosity, have you tried
> with the SMMU disabled? I'm wondering whether we hit some livelock
> condition on unmapping buffers...

No, but I can give it a try. Doing that should lower the CPU usage, though,
so maybe masks the issue - probably not.

Much appreciated,
John
On 2019-12-20 15:38, John Garry wrote: > I've already done something experimental for the driver to manage the > affinity, and performance is generally much better: > > > https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7 > > But I still think it's wise to only consider managed interrupts for > now. Sure. We've lived with it so far, we can make it last a bit longer... ;-) >> >>> JFYI, about NVMe CPU lockup issue, there are 2 works on going here: >>> >>> >>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t >>> >>> >>> >>> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t >>> >> I've also managed to trigger some of them now that I have access to >> a decent box with nvme storage. > > I only have 2x NVMe SSDs when this occurs - I should not be hitting > this... Same configuration here. And the number of interrupts is pretty low (less that 20k/s per CPU), so I doubt this is interrupt related. > Out of curiosity, have you tried >> with the SMMU disabled? I'm wondering whether we hit some livelock >> condition on unmapping buffers... > > No, but I can give it a try. Doing that should lower the CPU usage, > though, so maybe masks the issue - probably not. I wonder whether we could end-up in some form of unmap storm on completion, with a CPU being starved trying to insert its TLBI command into the queue. Anyway, more digging in perspective. M. -- Jazz is not dead. It just smells funny...
On Fri, Dec 20, 2019 at 03:38:24PM +0000, John Garry wrote: > > > We've got some more results and it looks promising. > > > > > > So with your patch we get a performance boost of 3180.1K -> 3294.9K > > > IOPS in the D06 SAS env. Then when we change the driver to use > > > threaded interrupt handler (mainline currently uses tasklet), we get a > > > boost again up to 3415K IOPS. > > > > > > Now this is essentially the same figure we had with using threaded > > > handler + the gen irq change in spreading the handler CPU affinity. We > > > did also test your patch + gen irq change and got a performance drop, > > > to 3347K IOPS. > > > > > > So tentatively I'd say your patch may be all we need. > > > > OK. > > > > > FYI, here is how the effective affinity is looking for both SAS > > > controllers with your patch: > > > > > > 74:02.0 > > > irq 81, cpu list 24-29, effective list 24 cq > > > irq 82, cpu list 30-35, effective list 30 cq > > > > Cool. > > > > [...] > > > > > As for your patch itself, I'm still concerned of possible regressions > > > if we don't apply this effective interrupt affinity spread policy to > > > only managed interrupts. > > > > I'll try and revise that as I post the patch, probably at some point > > between now and Christmas. I still think we should find a way to > > address this for the D05 SAS driver though, maybe by managing the > > affinity yourself in the driver. But this requires experimentation. > > I've already done something experimental for the driver to manage the > affinity, and performance is generally much better: > > https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7 > > But I still think it's wise to only consider managed interrupts for now. > > > > > > JFYI, about NVMe CPU lockup issue, there are 2 works on going here: > > > > > > https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t > > > > > > > > > https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t > > > > > > > I've also managed to trigger some of them now that I have access to > > a decent box with nvme storage. > > I only have 2x NVMe SSDs when this occurs - I should not be hitting this... > > Out of curiosity, have you tried > > with the SMMU disabled? I'm wondering whether we hit some livelock > > condition on unmapping buffers... > > No, but I can give it a try. Doing that should lower the CPU usage, though, > so maybe masks the issue - probably not. Lots of CPU lockup can is performance issue if there isn't obvious bug. I am wondering if you may explain it a bit why enabling SMMU may save CPU a it? Thanks, Ming
On 2019-12-20 23:31, Ming Lei wrote: > On Fri, Dec 20, 2019 at 03:38:24PM +0000, John Garry wrote: >> > > We've got some more results and it looks promising. >> > > >> > > So with your patch we get a performance boost of 3180.1K -> >> 3294.9K >> > > IOPS in the D06 SAS env. Then when we change the driver to use >> > > threaded interrupt handler (mainline currently uses tasklet), we >> get a >> > > boost again up to 3415K IOPS. >> > > >> > > Now this is essentially the same figure we had with using >> threaded >> > > handler + the gen irq change in spreading the handler CPU >> affinity. We >> > > did also test your patch + gen irq change and got a performance >> drop, >> > > to 3347K IOPS. >> > > >> > > So tentatively I'd say your patch may be all we need. >> > >> > OK. >> > >> > > FYI, here is how the effective affinity is looking for both SAS >> > > controllers with your patch: >> > > >> > > 74:02.0 >> > > irq 81, cpu list 24-29, effective list 24 cq >> > > irq 82, cpu list 30-35, effective list 30 cq >> > >> > Cool. >> > >> > [...] >> > >> > > As for your patch itself, I'm still concerned of possible >> regressions >> > > if we don't apply this effective interrupt affinity spread >> policy to >> > > only managed interrupts. >> > >> > I'll try and revise that as I post the patch, probably at some >> point >> > between now and Christmas. I still think we should find a way to >> > address this for the D05 SAS driver though, maybe by managing the >> > affinity yourself in the driver. But this requires >> experimentation. >> >> I've already done something experimental for the driver to manage >> the >> affinity, and performance is generally much better: >> >> >> https://github.com/hisilicon/kernel-dev/commit/e15bd404ed1086fed44da34ed3bd37a8433688a7 >> >> But I still think it's wise to only consider managed interrupts for >> now. >> >> > >> > > JFYI, about NVMe CPU lockup issue, there are 2 works on going >> here: >> > > >> > > >> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/T/#t >> > > >> > > >> > > >> https://lore.kernel.org/linux-block/20191218071942.22336-1-ming.lei@redhat.com/T/#t >> > > >> > >> > I've also managed to trigger some of them now that I have access >> to >> > a decent box with nvme storage. >> >> I only have 2x NVMe SSDs when this occurs - I should not be hitting >> this... >> >> Out of curiosity, have you tried >> > with the SMMU disabled? I'm wondering whether we hit some livelock >> > condition on unmapping buffers... >> >> No, but I can give it a try. Doing that should lower the CPU usage, >> though, >> so maybe masks the issue - probably not. > > Lots of CPU lockup can is performance issue if there isn't obvious > bug. > > I am wondering if you may explain it a bit why enabling SMMU may save > CPU a it? The other way around. mapping/unmapping IOVAs doesn't comes for free. I'm trying to find out whether the NVMe map/unmap patterns trigger something unexpected in the SMMU driver, but that's a very long shot. M. -- Jazz is not dead. It just smells funny...
>>> > I've also managed to trigger some of them now that I have access to >>> > a decent box with nvme storage. >>> >>> I only have 2x NVMe SSDs when this occurs - I should not be hitting >>> this... >>> >>> Out of curiosity, have you tried >>> > with the SMMU disabled? I'm wondering whether we hit some livelock >>> > condition on unmapping buffers... >>> >>> No, but I can give it a try. Doing that should lower the CPU usage, >>> though, >>> so maybe masks the issue - probably not. >> >> Lots of CPU lockup can is performance issue if there isn't obvious bug. >> >> I am wondering if you may explain it a bit why enabling SMMU may save >> CPU a it? > > The other way around. mapping/unmapping IOVAs doesn't comes for free. > I'm trying to find out whether the NVMe map/unmap patterns trigger > something unexpected in the SMMU driver, but that's a very long shot. So I tested v5.5-rc3 with and without the SMMU enabled, and without the SMMU enabled I don't get the lockup. fio summary SMMU enabled: john@ubuntu:~$ dmesg | grep "Adding to iommu group" [ 10.550212] hisi_sas_v3_hw 0000:74:02.0: Adding to iommu group 0 [ 14.773231] nvme 0000:04:00.0: Adding to iommu group 1 [ 14.784000] nvme 0000:81:00.0: Adding to iommu group 2 [ 14.794884] ahci 0000:74:03.0: Adding to iommu group 3 [snip] sudo sh create_fio_task_cpu_liuyifan_nvme.sh 4k read 20 1 Creat 4k_read_depth20_fiotest file sucessfully job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... fio-3.1 Starting 20 processes [ 110.155618] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 04m:11s] [ 110.161360] rcu: 4-....: (1 GPs behind) idle=00e/1/0x4000000000000004 softirq=1284/4115 fqs=2625 [ 173.167743] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 03m:08s] [ 173.173484] rcu: 29-....: (1 GPs behind) idle=e1e/0/0x3 softirq=662/5436 fqs=10501 [ 236.179623] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 02m:05s] [ 236.185362] rcu: 29-....: (1 GPs behind) idle=e1e/0/0x3 softirq=662/5436 fqs=18220 [ 271.735648] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 01m:30s] [ 271.741387] rcu: 16-....: (1 GPs behind) idle=fb6/1/0x4000000000000002 softirq=858/1168 fqs=2605 [ 334.747590] rcu: INFO: rcu_preempt self-detected stall on CPU0 IOPS][eta 00m:27s] [ 334.753328] rcu: 0-....: (1 GPs behind) idle=57a/1/0x4000000000000002 softirq=1384/1384 fqs=10309 Jobs: 20 (f=20): [R(20)][100.0%][r=4230MiB/s,w=0KiB/s][r=1083k,w=0 IOPS][eta 00m:00s] job1: (groupid=0, jobs=20): err= 0: pid=1242: Mon Dec 23 09:45:12 2019 read: IOPS=1183k, BW=4621MiB/s (4846MB/s)(1354GiB/300002msec) slat (usec): min=2, max=183172k, avg= 6.47, stdev=12846.53 clat (usec): min=4, max=183173k, avg=330.40, stdev=63380.85 lat (usec): min=20, max=183173k, avg=337.02, stdev=64670.18 clat percentiles (usec): | 1.00th=[ 104], 5.00th=[ 112], 10.00th=[ 137], 20.00th=[ 182], | 30.00th=[ 219], 40.00th=[ 245], 50.00th=[ 269], 60.00th=[ 297], | 70.00th=[ 338], 80.00th=[ 379], 90.00th=[ 429], 95.00th=[ 482], | 99.00th=[ 635], 99.50th=[ 742], 99.90th=[ 1221], 99.95th=[ 1876], | 99.99th=[ 6194] bw ( KiB/s): min= 32, max=733328, per=5.75%, avg=272330.58, stdev=110721.72, samples=10435 iops : min= 8, max=183332, avg=68082.49, stdev=27680.43, samples=10435 lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.46%, 250=41.97% lat (usec) : 500=53.32%, 750=3.78%, 1000=0.31% lat (msec) : 2=0.11%, 4=0.03%, 
10=0.01%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01%, >=2000=0.01% cpu : usr=8.38%, sys=33.43%, ctx=134950965, majf=0, minf=4371 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwt: total=354924097,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=20 Run status group 0 (all jobs): READ: bw=4621MiB/s (4846MB/s), 4621MiB/s-4621MiB/s (4846MB/s-4846MB/s), io=1354GiB (1454GB), run=300002-300002msec Disk stats (read/write): nvme0n1: ios=187325975/0, merge=0/0, ticks=49841664/0, in_queue=11620, util=100.00% nvme1n1: ios=167416192/0, merge=0/0, ticks=42280120/0, in_queue=194576, util=100.00% john@ubuntu:~$ fio summary SMMU disabled: john@ubuntu:~$ dmesg | grep "Adding to iommu group" john@ubuntu:~$ sudo sh create_fio_task_cpu_liuyifan_nvme.sh 4k read 20 1 Creat 4k_read_depth20_fiotest file sucessfully job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20 ... fio-3.1 Starting 20 processes Jobs: 20 (f=20): [R(20)][100.0%][r=6053MiB/s,w=0KiB/s][r=1550k,w=0 IOPS][eta 00m:00s] job1: (groupid=0, jobs=20): err= 0: pid=1221: Mon Dec 23 09:54:15 2019 read: IOPS=1539k, BW=6011MiB/s (6303MB/s)(1761GiB/300001msec) slat (usec): min=2, max=224572, avg= 4.44, stdev=14.57 clat (usec): min=11, max=238636, avg=254.59, stdev=140.45 lat (usec): min=15, max=240025, avg=259.17, stdev=142.61 clat percentiles (usec): | 1.00th=[ 94], 5.00th=[ 125], 10.00th=[ 167], 20.00th=[ 208], | 30.00th=[ 221], 40.00th=[ 227], 50.00th=[ 237], 60.00th=[ 247], | 70.00th=[ 262], 80.00th=[ 281], 90.00th=[ 338], 95.00th=[ 420], | 99.00th=[ 701], 99.50th=[ 857], 99.90th=[ 1270], 99.95th=[ 1483], | 99.99th=[ 2114] bw ( KiB/s): min= 2292, max=429480, per=5.01%, avg=308068.30, stdev=36800.42, samples=12000 iops : min= 573, max=107370, avg=77016.89, stdev=9200.10, samples=12000 lat (usec) : 20=0.01%, 50=0.04%, 100=1.56%, 250=61.54%, 500=33.86% lat (usec) : 750=2.19%, 1000=0.53% lat (msec) : 2=0.26%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=11.50%, sys=40.49%, ctx=198764008, majf=0, minf=30760 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0% issued rwt: total=461640046,0,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=20 Run status group 0 (all jobs): READ: bw=6011MiB/s (6303MB/s), 6011MiB/s-6011MiB/s (6303MB/s-6303MB/s), io=1761GiB (1891GB), run=300001-300001msec Disk stats (read/write): nvme0n1: ios=229212121/0, merge=0/0, ticks=56349577/0, in_queue=2908, util=100.00% nvme1n1: ios=232165508/0, merge=0/0, ticks=56708137/0, in_queue=372, util=100.00% john@ubuntu:~$ Obviously this is not conclusive, especially with such limited testing - 5 minute runs each. The CPU load goes up when disabling the SMMU, but that could be attributed to extra throughput (1183K -> 1539K) loading. I do notice that since we complete the NVMe request in irq context, we also do the DMA unmap, i.e. talk to the SMMU, in the same context, which is less than ideal. 
I need to finish for the Christmas break today, so can't check this much further ATM. Thanks, John
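[Editor's note: the point above about doing the DMA unmap, and hence the SMMU work, in hard interrupt context ties back to the topic of this thread: a threaded handler lets that work run in process context, and with the change proposed in this thread it could also run on a different CPU from the one taking the hard interrupt. Below is a minimal sketch of that split. It is not code from nvme or hisi_sas; struct my_dev, struct my_cmd and the my_*() helpers are hypothetical, and whether this trade-off is right for NVMe is exactly what the linked threads are debating.]

#include <linux/interrupt.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical types and helpers, for illustration only. */
struct my_cmd {
	struct scatterlist *sgl;
	int nents;
	enum dma_data_direction dir;
};

struct my_dev {
	struct device *dma_dev;
};

static bool my_irq_pending(struct my_dev *mdev);
static void my_ack_irq(struct my_dev *mdev);
static struct my_cmd *my_fetch_completed(struct my_dev *mdev);
static void my_complete_cmd(struct my_cmd *cmd);

/* Hard handler: quiesce the interrupt source and hand off quickly. */
static irqreturn_t my_hard_irq(int irq, void *data)
{
	struct my_dev *mdev = data;

	if (!my_irq_pending(mdev))
		return IRQ_NONE;

	my_ack_irq(mdev);
	return IRQ_WAKE_THREAD;
}

/*
 * Threaded handler: the dma_unmap_sg() (i.e. the SMMU/TLBI work) and the
 * completion run in process context here, and - with the change being
 * discussed - potentially on a different CPU from the hard handler when
 * the interrupt is managed.
 */
static irqreturn_t my_irq_thread(int irq, void *data)
{
	struct my_dev *mdev = data;
	struct my_cmd *cmd;

	while ((cmd = my_fetch_completed(mdev))) {
		dma_unmap_sg(mdev->dma_dev, cmd->sgl, cmd->nents, cmd->dir);
		my_complete_cmd(cmd);
	}

	return IRQ_HANDLED;
}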
On 2019-12-23 10:26, John Garry wrote: >>>> > I've also managed to trigger some of them now that I have access >>>> to >>>> > a decent box with nvme storage. >>>> >>>> I only have 2x NVMe SSDs when this occurs - I should not be >>>> hitting this... >>>> >>>> Out of curiosity, have you tried >>>> > with the SMMU disabled? I'm wondering whether we hit some >>>> livelock >>>> > condition on unmapping buffers... >>>> >>>> No, but I can give it a try. Doing that should lower the CPU >>>> usage, though, >>>> so maybe masks the issue - probably not. >>> >>> Lots of CPU lockup can is performance issue if there isn't obvious >>> bug. >>> >>> I am wondering if you may explain it a bit why enabling SMMU may >>> save >>> CPU a it? >> The other way around. mapping/unmapping IOVAs doesn't comes for >> free. >> I'm trying to find out whether the NVMe map/unmap patterns trigger >> something unexpected in the SMMU driver, but that's a very long >> shot. > > So I tested v5.5-rc3 with and without the SMMU enabled, and without > the SMMU enabled I don't get the lockup. OK, so my hunch wasn't completely off... At least we have something to look into. [...] > Obviously this is not conclusive, especially with such limited > testing - 5 minute runs each. The CPU load goes up when disabling the > SMMU, but that could be attributed to extra throughput (1183K -> > 1539K) loading. > > I do notice that since we complete the NVMe request in irq context, > we also do the DMA unmap, i.e. talk to the SMMU, in the same context, > which is less than ideal. It depends on how much overhead invalidating the TLB adds to the equation, but we should be able to do some tracing and find out. > I need to finish for the Christmas break today, so can't check this > much further ATM. No worries. May I suggest creating a new thread in the new year, maybe involving Robin and Will as well? Thanks, M. -- Jazz is not dead. It just smells funny...
On 23/12/2019 10:47, Marc Zyngier wrote: > On 2019-12-23 10:26, John Garry wrote: >>>>> > I've also managed to trigger some of them now that I have access to >>>>> > a decent box with nvme storage. >>>>> >>>>> I only have 2x NVMe SSDs when this occurs - I should not be hitting >>>>> this... >>>>> >>>>> Out of curiosity, have you tried >>>>> > with the SMMU disabled? I'm wondering whether we hit some livelock >>>>> > condition on unmapping buffers... >>>>> >>>>> No, but I can give it a try. Doing that should lower the CPU usage, >>>>> though, >>>>> so maybe masks the issue - probably not. >>>> >>>> Lots of CPU lockup can is performance issue if there isn't obvious bug. >>>> >>>> I am wondering if you may explain it a bit why enabling SMMU may save >>>> CPU a it? >>> The other way around. mapping/unmapping IOVAs doesn't comes for free. >>> I'm trying to find out whether the NVMe map/unmap patterns trigger >>> something unexpected in the SMMU driver, but that's a very long shot. >> >> So I tested v5.5-rc3 with and without the SMMU enabled, and without >> the SMMU enabled I don't get the lockup. > > OK, so my hunch wasn't completely off... At least we have something > to look into. > > [...] > >> Obviously this is not conclusive, especially with such limited >> testing - 5 minute runs each. The CPU load goes up when disabling the >> SMMU, but that could be attributed to extra throughput (1183K -> >> 1539K) loading. >> >> I do notice that since we complete the NVMe request in irq context, >> we also do the DMA unmap, i.e. talk to the SMMU, in the same context, >> which is less than ideal. > > It depends on how much overhead invalidating the TLB adds to the > equation, but we should be able to do some tracing and find out. ok, but let's remember that x86 iommu uses non-strict unmapping by default, and they also see this issue. > >> I need to finish for the Christmas break today, so can't check this >> much further ATM. > > No worries. May I suggest creating a new thread in the new year, maybe > involving Robin and Will as well? Can do, but would be good to know how x86 fairs and the IOMMU config used for testing also when the lockup occurs. Cheers, John
On Mon, Dec 23, 2019 at 10:47:07AM +0000, Marc Zyngier wrote: > On 2019-12-23 10:26, John Garry wrote: > > > > > > I've also managed to trigger some of them now that I have > > > > > access to > > > > > > a decent box with nvme storage. > > > > > > > > > > I only have 2x NVMe SSDs when this occurs - I should not be > > > > > hitting this... > > > > > > > > > > Out of curiosity, have you tried > > > > > > with the SMMU disabled? I'm wondering whether we hit some > > > > > livelock > > > > > > condition on unmapping buffers... > > > > > > > > > > No, but I can give it a try. Doing that should lower the CPU > > > > > usage, though, > > > > > so maybe masks the issue - probably not. > > > > > > > > Lots of CPU lockup can is performance issue if there isn't > > > > obvious bug. > > > > > > > > I am wondering if you may explain it a bit why enabling SMMU may > > > > save > > > > CPU a it? > > > The other way around. mapping/unmapping IOVAs doesn't comes for > > > free. > > > I'm trying to find out whether the NVMe map/unmap patterns trigger > > > something unexpected in the SMMU driver, but that's a very long > > > shot. > > > > So I tested v5.5-rc3 with and without the SMMU enabled, and without > > the SMMU enabled I don't get the lockup. > > OK, so my hunch wasn't completely off... At least we have something > to look into. > > [...] > > > Obviously this is not conclusive, especially with such limited > > testing - 5 minute runs each. The CPU load goes up when disabling the > > SMMU, but that could be attributed to extra throughput (1183K -> > > 1539K) loading. > > > > I do notice that since we complete the NVMe request in irq context, > > we also do the DMA unmap, i.e. talk to the SMMU, in the same context, > > which is less than ideal. > > It depends on how much overhead invalidating the TLB adds to the > equation, but we should be able to do some tracing and find out. > > > I need to finish for the Christmas break today, so can't check this > > much further ATM. > > No worries. May I suggest creating a new thread in the new year, maybe > involving Robin and Will as well? Zhang Yi has observed the CPU lockup issue once when running heavy IO on single nvme drive, and please CC him if you have new patch to try. Then looks the DMA unmap cost is too big on aarch64 if SMMU is involved. Thanks, Ming
On 2019-12-24 01:59, Ming Lei wrote: > On Mon, Dec 23, 2019 at 10:47:07AM +0000, Marc Zyngier wrote: >> On 2019-12-23 10:26, John Garry wrote: >> > > > > > I've also managed to trigger some of them now that I have >> > > > > access to >> > > > > > a decent box with nvme storage. >> > > > > >> > > > > I only have 2x NVMe SSDs when this occurs - I should not be >> > > > > hitting this... >> > > > > >> > > > > Out of curiosity, have you tried >> > > > > > with the SMMU disabled? I'm wondering whether we hit some >> > > > > livelock >> > > > > > condition on unmapping buffers... >> > > > > >> > > > > No, but I can give it a try. Doing that should lower the CPU >> > > > > usage, though, >> > > > > so maybe masks the issue - probably not. >> > > > >> > > > Lots of CPU lockup can is performance issue if there isn't >> > > > obvious bug. >> > > > >> > > > I am wondering if you may explain it a bit why enabling SMMU >> may >> > > > save >> > > > CPU a it? >> > > The other way around. mapping/unmapping IOVAs doesn't comes for >> > > free. >> > > I'm trying to find out whether the NVMe map/unmap patterns >> trigger >> > > something unexpected in the SMMU driver, but that's a very long >> > > shot. >> > >> > So I tested v5.5-rc3 with and without the SMMU enabled, and >> without >> > the SMMU enabled I don't get the lockup. >> >> OK, so my hunch wasn't completely off... At least we have something >> to look into. >> >> [...] >> >> > Obviously this is not conclusive, especially with such limited >> > testing - 5 minute runs each. The CPU load goes up when disabling >> the >> > SMMU, but that could be attributed to extra throughput (1183K -> >> > 1539K) loading. >> > >> > I do notice that since we complete the NVMe request in irq >> context, >> > we also do the DMA unmap, i.e. talk to the SMMU, in the same >> context, >> > which is less than ideal. >> >> It depends on how much overhead invalidating the TLB adds to the >> equation, but we should be able to do some tracing and find out. >> >> > I need to finish for the Christmas break today, so can't check >> this >> > much further ATM. >> >> No worries. May I suggest creating a new thread in the new year, >> maybe >> involving Robin and Will as well? > > Zhang Yi has observed the CPU lockup issue once when running heavy IO > on > single nvme drive, and please CC him if you have new patch to try. On which architecture? John was indicating that this also happen on x86. > Then looks the DMA unmap cost is too big on aarch64 if SMMU is > involved. So far, we don't have any data suggesting that this is actually the case. Also, other workloads (such as networking) do not exhibit this behaviour, while being least as unmap-heavy as NVMe is. If the cross-architecture aspect is confirmed, this points more into the direction of an interaction between the NVMe subsystem and the DMA API more than an architecture-specific problem. Given that we have so far very little data, I'd hold off any conclusion. M. -- Jazz is not dead. It just smells funny...
On Tue, Dec 24, 2019 at 11:20:25AM +0000, Marc Zyngier wrote: > On 2019-12-24 01:59, Ming Lei wrote: > > On Mon, Dec 23, 2019 at 10:47:07AM +0000, Marc Zyngier wrote: > > > On 2019-12-23 10:26, John Garry wrote: > > > > > > > > I've also managed to trigger some of them now that I have > > > > > > > access to > > > > > > > > a decent box with nvme storage. > > > > > > > > > > > > > > I only have 2x NVMe SSDs when this occurs - I should not be > > > > > > > hitting this... > > > > > > > > > > > > > > Out of curiosity, have you tried > > > > > > > > with the SMMU disabled? I'm wondering whether we hit some > > > > > > > livelock > > > > > > > > condition on unmapping buffers... > > > > > > > > > > > > > > No, but I can give it a try. Doing that should lower the CPU > > > > > > > usage, though, > > > > > > > so maybe masks the issue - probably not. > > > > > > > > > > > > Lots of CPU lockup can is performance issue if there isn't > > > > > > obvious bug. > > > > > > > > > > > > I am wondering if you may explain it a bit why enabling SMMU > > > may > > > > > > save > > > > > > CPU a it? > > > > > The other way around. mapping/unmapping IOVAs doesn't comes for > > > > > free. > > > > > I'm trying to find out whether the NVMe map/unmap patterns > > > trigger > > > > > something unexpected in the SMMU driver, but that's a very long > > > > > shot. > > > > > > > > So I tested v5.5-rc3 with and without the SMMU enabled, and > > > without > > > > the SMMU enabled I don't get the lockup. > > > > > > OK, so my hunch wasn't completely off... At least we have something > > > to look into. > > > > > > [...] > > > > > > > Obviously this is not conclusive, especially with such limited > > > > testing - 5 minute runs each. The CPU load goes up when disabling > > > the > > > > SMMU, but that could be attributed to extra throughput (1183K -> > > > > 1539K) loading. > > > > > > > > I do notice that since we complete the NVMe request in irq > > > context, > > > > we also do the DMA unmap, i.e. talk to the SMMU, in the same > > > context, > > > > which is less than ideal. > > > > > > It depends on how much overhead invalidating the TLB adds to the > > > equation, but we should be able to do some tracing and find out. > > > > > > > I need to finish for the Christmas break today, so can't check > > > this > > > > much further ATM. > > > > > > No worries. May I suggest creating a new thread in the new year, > > > maybe > > > involving Robin and Will as well? > > > > Zhang Yi has observed the CPU lockup issue once when running heavy IO on > > single nvme drive, and please CC him if you have new patch to try. > > On which architecture? John was indicating that this also happen on x86. ARM64. To be honest, I never see such CPU lockup issue on x86 in case of running heavy IO on single NVMe drive. > > > Then looks the DMA unmap cost is too big on aarch64 if SMMU is involved. > > So far, we don't have any data suggesting that this is actually the case. > Also, other workloads (such as networking) do not exhibit this behaviour, > while being least as unmap-heavy as NVMe is. Maybe it is because networking workloads usually completes IO in softirq context, instead of hard interrupt context. > > If the cross-architecture aspect is confirmed, this points more into > the direction of an interaction between the NVMe subsystem and the > DMA API more than an architecture-specific problem. > > Given that we have so far very little data, I'd hold off any conclusion. 
We can start by collecting latency data for dma unmapping vs nvme_irq() on both x86 and arm64. I will see if I can get such a box for collecting the latency data. Thanks, Ming
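[Editor's note: one low-tech way to collect the numbers Ming describes, sketched under the assumption of a local debug patch to whichever driver is being tested; nothing below exists upstream. Wrap the unmap with ktime and dump the deltas through the trace buffer, then compare the distributions with the SMMU enabled and disabled (or strict vs. lazy invalidation). The same pattern around the hard handler, e.g. around nvme_irq(), gives the denominator.]

#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Debug-only wrapper around dma_unmap_sg(): emits one trace line per unmap
 * so its cost can be compared with the total time spent in the interrupt
 * handler across the different IOMMU configurations.
 */
static void timed_dma_unmap_sg(struct device *dev, struct scatterlist *sgl,
			       int nents, enum dma_data_direction dir)
{
	u64 t0 = ktime_get_ns();

	dma_unmap_sg(dev, sgl, nents, dir);

	trace_printk("dma_unmap_sg: %d nents, %llu ns\n",
		     nents, ktime_get_ns() - t0);
}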
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 1753486b440c..8e7f8e758a88 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) if (cpumask_available(desc->irq_common_data.affinity)) { const struct cpumask *m; - m = irq_data_get_effective_affinity_mask(&desc->irq_data); + if (irqd_affinity_is_managed(&desc->irq_data)) + m = desc->irq_common_data.affinity; + else + m = irq_data_get_effective_affinity_mask( + &desc->irq_data); cpumask_copy(mask, m); } else { valid = false;
Currently the cpu allowed mask for the threaded part of a threaded irq handler will be set to the effective affinity of the hard irq. Typically the effective affinity of the hard irq will be for a single cpu. As such, the threaded handler would always run on the same cpu as the hard irq. We have seen scenarios in high data-rate throughput testing that the cpu handling the interrupt can be totally saturated handling both the hard interrupt and threaded handler parts, limiting throughput. For when the interrupt is managed, allow the threaded part to run on all cpus in the irq affinity mask. Signed-off-by: John Garry <john.garry@huawei.com> --- kernel/irq/manage.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) -- 2.17.1
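[Editor's note: to make the effect of the patch concrete, here is a hedged sketch of the consumer side - a driver requesting a threaded handler on a managed interrupt. The handler and device names are hypothetical; the CPU numbers in the comment come from the D06 SAS listing quoted earlier in the thread.]

#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t my_hard_irq(int irq, void *data);
static irqreturn_t my_irq_thread(int irq, void *data);

static int my_setup_queue_irq(struct pci_dev *pdev, int qid, void *queue)
{
	/*
	 * Vectors allocated with PCI_IRQ_AFFINITY are managed: the core
	 * spreads their affinity masks across the CPUs at allocation time.
	 */
	int irq = pci_irq_vector(pdev, qid);

	return request_threaded_irq(irq, my_hard_irq, my_irq_thread, 0,
				    "my_dev_queue", queue);
}

/*
 * Taking the listing quoted above:
 *
 *     irq 81, cpu list 24-29, effective list 24
 *
 * Without this patch, the irq thread copies the effective affinity and is
 * confined to CPU 24, the same CPU that services the hard interrupt.  With
 * the patch, because irq 81 is managed, the thread's allowed mask becomes
 * the full affinity mask (CPUs 24-29), so the scheduler can move the thread
 * away from a CPU saturated by hard interrupt handling.
 */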