diff mbox series

[2/2] qla2xxx: Fix missed DMA unmap for aborted cmds

Message ID AS8PR10MB4952D545F84B6B1DFD39EC1E9DEE9@AS8PR10MB4952.EURPRD10.PROD.OUTLOOK.COM
State New
Headers show
Series [1/2] qla2xxx: Remove free_sg command flag | expand

Commit Message

Chesnokov Gleb April 15, 2022, 12:42 p.m. UTC
Aborting commands that have already been sent to the firmware can
cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped)

For instance:
- command passes rdx_to_xfer state, maps sgl, sends to the
  fimware
- reset occurs, qla2xxx performs ISP error recovery,
  aborts the command
- target stack calls qlt_abort_cmd() and then qlt_free_cmd()
- BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not
  unmapped

Thus, unmap sgl in qlt_abort_cmd() for commands with the aborted flag
set.

Signed-off-by: Gleb Chesnokov <Chesnokov.G@raidix.com>
---
 drivers/scsi/qla2xxx/qla_target.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Himanshu Madhani April 19, 2022, 7:41 p.m. UTC | #1
> On Apr 15, 2022, at 5:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote:
> 
> Aborting commands that have already been sent to the firmware can
> cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped)
> 
> For instance:
> - command passes rdx_to_xfer state, maps sgl, sends to the
>  fimware
> - reset occurs, qla2xxx performs ISP error recovery,
>  aborts the command
> - target stack calls qlt_abort_cmd() and then qlt_free_cmd()
> - BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not
>  unmapped
> 
> Thus, unmap sgl in qlt_abort_cmd() for commands with the aborted flag
> set.
> 

Do you have a log showing this error sequence? Can you share more details?

> Signed-off-by: Gleb Chesnokov <Chesnokov.G@raidix.com>
> ---
> drivers/scsi/qla2xxx/qla_target.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
> index 2d30578aebcf..a02235a6a8e9 100644
> --- a/drivers/scsi/qla2xxx/qla_target.c
> +++ b/drivers/scsi/qla2xxx/qla_target.c
> @@ -3826,6 +3826,9 @@ int qlt_abort_cmd(struct qla_tgt_cmd *cmd)
> 
> 	spin_lock_irqsave(&cmd->cmd_lock, flags);
> 	if (cmd->aborted) {
> +		if (cmd->sg_mapped)
> +			qlt_unmap_sg(vha, cmd);
> +
> 		spin_unlock_irqrestore(&cmd->cmd_lock, flags);
> 		/*
> 		 * It's normal to see 2 calls in this path:
> -- 
> 2.35.1
Chesnokov Gleb April 20, 2022, 2:42 p.m. UTC | #2
> Do you have a log showing this error sequence?

Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only,
except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped).
Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list.

> Can you share more details?

What I am observing:

1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware
2) Qlogic adapter reset occurs

qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h.
qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer
qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000.

3) Somehow the command is being aborted, so that means the command's abort flag has already been set.
I think it may happens something like this:
qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds()

4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort.

5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended with qlt_free_cmd()

I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware.
Himanshu Madhani April 20, 2022, 7:09 p.m. UTC | #3
> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote:
> 
>> Do you have a log showing this error sequence?
> 
> Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only,
> except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped).
> Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list.

Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. 

> 
>> Can you share more details?
> 
> What I am observing:
> 
> 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware
> 2) Qlogic adapter reset occurs
> 
> qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h.

This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. 

> qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer

> qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000.
> 

> 3) Somehow the command is being aborted, so that means the command's abort flag has already been set.
> I think it may happens something like this:
> qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds()
> 

I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. 

> 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort.
> 
> 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended with qlt_free_cmd()
> 
> I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware.


Himanshu Madhani	Oracle Linux Engineering
Chesnokov Gleb April 25, 2022, 6:49 p.m. UTC | #4
>> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote:
>> 
>>> Do you have a log showing this error sequence?
>> 
>> Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only,
>> except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped).
>> Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list.
>
> Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. 
>>
>> 
>>> Can you share more details?
>> 
>> What I am observing:
>> 
>> 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware
>> 2) Qlogic adapter reset occurs
>> 
>> qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h.
>
> This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. 
>
>> qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer
>
>> qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000.
>> 
>
>> 3) Somehow the command is being aborted, so that means the command's abort flag has already been set.
>> I think it may happens something like this:
>> qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds()
>> 
>
> I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. 
>
>> 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort.
>> 
>> 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended > with qlt_free_cmd()
>> 
>> I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware.
>
>
> Himanshu Madhani        Oracle Linux Engineering

I seem to know the cause of the firmware crash. This is an abnormal sg list that is generated by my backend driver and passed to the Qlogic driver via target stack. The abnormal state of the sg list in my case means that it contains more than a thousand nents. So apparently Qlogic adapter does not know how to work with such buffers.

In any case, I think that the main thing is not to find the cause of the firmware crash or fix it (because it actually comes from my side), but to fix the crash during recovery the Qlogic driver after a firmware crash.

I have special case that allows me to reproduce the problem, but perhaps it can be reproduced in other cases that cause a firmware crash. Maybe there is a way to manually cause the firmware crash and it will allow to artificially reproduce the problem?
Himanshu Madhani April 27, 2022, 4:52 p.m. UTC | #5
> On Apr 25, 2022, at 11:49 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote:
> 
>>> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote:
>>> 
>>>> Do you have a log showing this error sequence?
>>> 
>>> Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only,
>>> except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped).
>>> Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list.
>> 
>> Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. 
>>> 
>>> 
>>>> Can you share more details?
>>> 
>>> What I am observing:
>>> 
>>> 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware
>>> 2) Qlogic adapter reset occurs
>>> 
>>> qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h.
>> 
>> This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. 
>> 
>>> qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer
>> 
>>> qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000.
>>> 
>> 
>>> 3) Somehow the command is being aborted, so that means the command's abort flag has already been set.
>>> I think it may happens something like this:
>>> qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds()
>>> 
>> 
>> I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. 
>> 
>>> 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort.
>>> 
>>> 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended > with qlt_free_cmd()
>>> 
>>> I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware.
>> 
>> 
>> Himanshu Madhani        Oracle Linux Engineering
> 
> I seem to know the cause of the firmware crash. This is an abnormal sg list that is generated by my backend driver and passed to the Qlogic driver via target stack. The abnormal state of the sg list in my case means that it contains more than a thousand nents. So apparently Qlogic adapter does not know how to work with such buffers.
> 
> In any case, I think that the main thing is not to find the cause of the firmware crash or fix it (because it actually comes from my side), but to fix the crash during recovery the Qlogic driver after a firmware crash.
> 

Okay. Now that I understand what was triggering the firmware crash, I agree with your patch. 

> I have special case that allows me to reproduce the problem, but perhaps it can be reproduced in other cases that cause a firmware crash. Maybe there is a way to manually cause the firmware crash and it will allow to artificially reproduce the problem?

Sure. It is an unusual case but you are correct this error condition which leads to firmware can be reproduced easily, the driver should handle such a scenario. I would leave it to Marvell folks if they want to investigate the firmware crash and see if there is a fix that can be added to FW. 

For the patch in question, you can add my 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

--
Himanshu Madhani	Oracle Linux Engineering
Martin K. Petersen May 3, 2022, 12:50 a.m. UTC | #6
On Fri, 15 Apr 2022 12:42:29 +0000, Chesnokov Gleb wrote:

> Aborting commands that have already been sent to the firmware can
> cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped)
> 
> For instance:
> - command passes rdx_to_xfer state, maps sgl, sends to the
>   fimware
> - reset occurs, qla2xxx performs ISP error recovery,
>   aborts the command
> - target stack calls qlt_abort_cmd() and then qlt_free_cmd()
> - BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not
>   unmapped
> 
> [...]

Applied to 5.18/scsi-fixes, thanks!

[2/2] qla2xxx: Fix missed DMA unmap for aborted cmds
      https://git.kernel.org/mkp/scsi/c/26f9ce53817a
diff mbox series

Patch

diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
index 2d30578aebcf..a02235a6a8e9 100644
--- a/drivers/scsi/qla2xxx/qla_target.c
+++ b/drivers/scsi/qla2xxx/qla_target.c
@@ -3826,6 +3826,9 @@  int qlt_abort_cmd(struct qla_tgt_cmd *cmd)
 
 	spin_lock_irqsave(&cmd->cmd_lock, flags);
 	if (cmd->aborted) {
+		if (cmd->sg_mapped)
+			qlt_unmap_sg(vha, cmd);
+
 		spin_unlock_irqrestore(&cmd->cmd_lock, flags);
 		/*
 		 * It's normal to see 2 calls in this path: