Message ID | AS8PR10MB4952D545F84B6B1DFD39EC1E9DEE9@AS8PR10MB4952.EURPRD10.PROD.OUTLOOK.COM |
---|---|
State | New |
Headers | show |
Series | [1/2] qla2xxx: Remove free_sg command flag | expand |
> On Apr 15, 2022, at 5:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote: > > Aborting commands that have already been sent to the firmware can > cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped) > > For instance: > - command passes rdx_to_xfer state, maps sgl, sends to the > fimware > - reset occurs, qla2xxx performs ISP error recovery, > aborts the command > - target stack calls qlt_abort_cmd() and then qlt_free_cmd() > - BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not > unmapped > > Thus, unmap sgl in qlt_abort_cmd() for commands with the aborted flag > set. > Do you have a log showing this error sequence? Can you share more details? > Signed-off-by: Gleb Chesnokov <Chesnokov.G@raidix.com> > --- > drivers/scsi/qla2xxx/qla_target.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c > index 2d30578aebcf..a02235a6a8e9 100644 > --- a/drivers/scsi/qla2xxx/qla_target.c > +++ b/drivers/scsi/qla2xxx/qla_target.c > @@ -3826,6 +3826,9 @@ int qlt_abort_cmd(struct qla_tgt_cmd *cmd) > > spin_lock_irqsave(&cmd->cmd_lock, flags); > if (cmd->aborted) { > + if (cmd->sg_mapped) > + qlt_unmap_sg(vha, cmd); > + > spin_unlock_irqrestore(&cmd->cmd_lock, flags); > /* > * It's normal to see 2 calls in this path: > -- > 2.35.1
> Do you have a log showing this error sequence? Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only, except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped). Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list. > Can you share more details? What I am observing: 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware 2) Qlogic adapter reset occurs qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h. qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000. 3) Somehow the command is being aborted, so that means the command's abort flag has already been set. I think it may happens something like this: qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds() 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort. 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended with qlt_free_cmd() I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware.
> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote: > >> Do you have a log showing this error sequence? > > Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only, > except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped). > Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list. Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. > >> Can you share more details? > > What I am observing: > > 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware > 2) Qlogic adapter reset occurs > > qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h. This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. > qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer > qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000. > > 3) Somehow the command is being aborted, so that means the command's abort flag has already been set. > I think it may happens something like this: > qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds() > I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. > 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort. > > 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended with qlt_free_cmd() > > I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware. Himanshu Madhani Oracle Linux Engineering
>> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote: >> >>> Do you have a log showing this error sequence? >> >> Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only, >> except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped). >> Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list. > > Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. >> >> >>> Can you share more details? >> >> What I am observing: >> >> 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware >> 2) Qlogic adapter reset occurs >> >> qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h. > > This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. > >> qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer > >> qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000. >> > >> 3) Somehow the command is being aborted, so that means the command's abort flag has already been set. >> I think it may happens something like this: >> qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds() >> > > I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. > >> 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort. >> >> 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended > with qlt_free_cmd() >> >> I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware. > > > Himanshu Madhani Oracle Linux Engineering I seem to know the cause of the firmware crash. This is an abnormal sg list that is generated by my backend driver and passed to the Qlogic driver via target stack. The abnormal state of the sg list in my case means that it contains more than a thousand nents. So apparently Qlogic adapter does not know how to work with such buffers. In any case, I think that the main thing is not to find the cause of the firmware crash or fix it (because it actually comes from my side), but to fix the crash during recovery the Qlogic driver after a firmware crash. I have special case that allows me to reproduce the problem, but perhaps it can be reproduced in other cases that cause a firmware crash. Maybe there is a way to manually cause the firmware crash and it will allow to artificially reproduce the problem?
> On Apr 25, 2022, at 11:49 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote: > >>> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote: >>> >>>> Do you have a log showing this error sequence? >>> >>> Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only, >>> except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped). >>> Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list. >> >> Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. >>> >>> >>>> Can you share more details? >>> >>> What I am observing: >>> >>> 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware >>> 2) Qlogic adapter reset occurs >>> >>> qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h. >> >> This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. >> >>> qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer >> >>> qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000. >>> >> >>> 3) Somehow the command is being aborted, so that means the command's abort flag has already been set. >>> I think it may happens something like this: >>> qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds() >>> >> >> I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. >> >>> 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort. >>> >>> 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended > with qlt_free_cmd() >>> >>> I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware. >> >> >> Himanshu Madhani Oracle Linux Engineering > > I seem to know the cause of the firmware crash. This is an abnormal sg list that is generated by my backend driver and passed to the Qlogic driver via target stack. The abnormal state of the sg list in my case means that it contains more than a thousand nents. So apparently Qlogic adapter does not know how to work with such buffers. > > In any case, I think that the main thing is not to find the cause of the firmware crash or fix it (because it actually comes from my side), but to fix the crash during recovery the Qlogic driver after a firmware crash. > Okay. Now that I understand what was triggering the firmware crash, I agree with your patch. > I have special case that allows me to reproduce the problem, but perhaps it can be reproduced in other cases that cause a firmware crash. Maybe there is a way to manually cause the firmware crash and it will allow to artificially reproduce the problem? Sure. It is an unusual case but you are correct this error condition which leads to firmware can be reproduced easily, the driver should handle such a scenario. I would leave it to Marvell folks if they want to investigate the firmware crash and see if there is a fix that can be added to FW. For the patch in question, you can add my Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> -- Himanshu Madhani Oracle Linux Engineering
On Fri, 15 Apr 2022 12:42:29 +0000, Chesnokov Gleb wrote: > Aborting commands that have already been sent to the firmware can > cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped) > > For instance: > - command passes rdx_to_xfer state, maps sgl, sends to the > fimware > - reset occurs, qla2xxx performs ISP error recovery, > aborts the command > - target stack calls qlt_abort_cmd() and then qlt_free_cmd() > - BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not > unmapped > > [...] Applied to 5.18/scsi-fixes, thanks! [2/2] qla2xxx: Fix missed DMA unmap for aborted cmds https://git.kernel.org/mkp/scsi/c/26f9ce53817a
diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c index 2d30578aebcf..a02235a6a8e9 100644 --- a/drivers/scsi/qla2xxx/qla_target.c +++ b/drivers/scsi/qla2xxx/qla_target.c @@ -3826,6 +3826,9 @@ int qlt_abort_cmd(struct qla_tgt_cmd *cmd) spin_lock_irqsave(&cmd->cmd_lock, flags); if (cmd->aborted) { + if (cmd->sg_mapped) + qlt_unmap_sg(vha, cmd); + spin_unlock_irqrestore(&cmd->cmd_lock, flags); /* * It's normal to see 2 calls in this path:
Aborting commands that have already been sent to the firmware can cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped) For instance: - command passes rdx_to_xfer state, maps sgl, sends to the fimware - reset occurs, qla2xxx performs ISP error recovery, aborts the command - target stack calls qlt_abort_cmd() and then qlt_free_cmd() - BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not unmapped Thus, unmap sgl in qlt_abort_cmd() for commands with the aborted flag set. Signed-off-by: Gleb Chesnokov <Chesnokov.G@raidix.com> --- drivers/scsi/qla2xxx/qla_target.c | 3 +++ 1 file changed, 3 insertions(+)