[v10,2/2] ufs: core: requeue aborted request

Message ID	20241001091917.6917-3-peter.wang@mediatek.com
State	New
Headers
	Received: from mailgw01.mediatek.com (unknown [60.244.123.138]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 105C019D884; Tue, 1 Oct 2024 09:19:27 +0000 (UTC)
	From: <peter.wang@mediatek.com>
	To: <linux-scsi@vger.kernel.org>, <martin.petersen@oracle.com>, <avri.altman@wdc.com>, <alim.akhtar@samsung.com>, <jejb@linux.ibm.com>
	CC: <wsd_upstream@mediatek.com>, <linux-mediatek@lists.infradead.org>, <peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>, <alice.chao@mediatek.com>, <cc.chou@mediatek.com>, <chaotian.jing@mediatek.com>, <jiajie.hao@mediatek.com>, <powen.kao@mediatek.com>, <qilin.tan@mediatek.com>, <lin.gui@mediatek.com>, <tun-yu.yu@mediatek.com>, <eddie.huang@mediatek.com>, <naomi.chu@mediatek.com>, <ed.tsai@mediatek.com>, <bvanassche@acm.org>, <quic_nguyenb@quicinc.com>, <stable@vger.kernel.org>
	Subject: [PATCH v10 2/2] ufs: core: requeue aborted request
	Date: Tue, 1 Oct 2024 17:19:17 +0800
	Message-ID: <20241001091917.6917-3-peter.wang@mediatek.com>
	In-Reply-To: <20241001091917.6917-1-peter.wang@mediatek.com>
	References: <20241001091917.6917-1-peter.wang@mediatek.com>
	Precedence: bulk
	MIME-Version: 1.0
	Content-Type: text/plain
Series	fix abort defect
	[v10,0/2] fix abort defect
	[v10,1/2] ufs: core: fix the issue of ICU failure
	[v10,2/2] ufs: core: requeue aborted request

Peter Wang (王信友) Oct. 1, 2024, 9:19 a.m. UTC

From: Peter Wang <peter.wang@mediatek.com>

After the SQ cleanup fix, the CQ will receive a response with
the corresponding tag marked as OCS: ABORTED. To align with
the behavior of Legacy SDB mode, the handling of OCS: ABORTED
has been changed to match that of OCS_INVALID_COMMAND_STATUS
(SDB), with both returning a SCSI result of DID_REQUEUE.

Furthermore, the workaround implemented before the SQ cleanup
fix can be removed.

Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode")
Cc: stable@vger.kernel.org
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)
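
The diff itself is not reproduced on this page. As a rough illustration of the behavior the commit message describes, the following user-space sketch maps an overall command status (OCS) to a SCSI host byte so that an aborted command and an invalid command status both end up as a requeue. All `demo_*` names and enum values are hypothetical stand-ins, not the driver's real definitions, and the default branch is an assumption made only to keep the example complete.

```c
#include <assert.h>

/* Hypothetical stand-ins for the driver's OCS codes; values are illustrative. */
enum demo_ocs {
	DEMO_OCS_SUCCESS,
	DEMO_OCS_ABORTED,
	DEMO_OCS_INVALID_COMMAND_STATUS,
	DEMO_OCS_FATAL_ERROR,
};

/* Hypothetical stand-ins for SCSI host byte codes. */
enum demo_host_byte {
	DEMO_DID_OK,
	DEMO_DID_REQUEUE,
	DEMO_DID_ERROR,
};

/* Map an OCS value to a host byte, following the commit message. */
static enum demo_host_byte demo_ocs_to_host_byte(enum demo_ocs ocs)
{
	switch (ocs) {
	case DEMO_OCS_SUCCESS:
		return DEMO_DID_OK;
	case DEMO_OCS_ABORTED:                /* MCQ: CQ entry for an aborted tag */
	case DEMO_OCS_INVALID_COMMAND_STATUS: /* SDB: the case MCQ is aligned with */
		return DEMO_DID_REQUEUE;
	default:
		return DEMO_DID_ERROR;        /* assumption: everything else is an error */
	}
}

/* Pack a host byte into bits 16..23 of a SCSI-style result word. */
static unsigned int demo_result(enum demo_host_byte hb, unsigned int status)
{
	assert(status <= 0xff);
	return ((unsigned int)hb << 16) | status;
}
```

Sharing one case label between the two statuses is what keeps MCQ and SDB behavior aligned: a change to the requeue handling cannot drift between the two modes.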

Bart Van Assche Oct. 3, 2024, 8:02 p.m. UTC | #1

On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
> This patch merely aligns with the approach of SDB mode
> and does not involve the flow of scsi_done. Besides,
> I don't see any issue with concurrency between
> ufshcd_abort_one() calling ufshcd_try_to_abort_task()
> and scsi_done(). Can you point out the specific flow where
> the problem occurs? If there is one, shouldn't SDB mode
> have the same issue?

Hi Peter,

Correct, my comment applies to both legacy mode and MCQ mode. From the 
section in the UFS standard about ABORT TASK: "A response of FUNCTION
COMPLETE shall indicate that the command was aborted or was not in the 
task set." In other words, if a command completes just before 
ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
that has already completed. In legacy mode, this call will succeed.
Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to be
decremented twice instead of only once. Do you agree that this can
happen and also that it should be prevented that this happens?

Thanks,

Bart.
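
The concern raised in the message above can be modeled in a few lines. This is a minimal user-space sketch, not driver code: `demo_gating`, `demo_hold()` and `demo_release()` are hypothetical stand-ins for the clock-gating reference count, and the sequence simply assumes that both the completion path and the abort path release the same command.

```c
/* Hypothetical model of a clock-gating reference counter. */
struct demo_gating {
	int active_reqs;
};

static void demo_hold(struct demo_gating *g)    { g->active_reqs++; }
static void demo_release(struct demo_gating *g) { g->active_reqs--; }

/*
 * One command is issued, so the counter is raised once. If both the
 * completion path and the abort path release that command, the counter
 * ends up one below its starting value.
 */
static int demo_double_release(void)
{
	struct demo_gating g = { .active_reqs = 0 };

	demo_hold(&g);     /* command queued */
	demo_release(&g);  /* completion path releases it */
	demo_release(&g);  /* abort path releases the same command again */
	return g.active_reqs;
}
```

Whether the real driver can actually reach this sequence in legacy mode is exactly what the rest of the thread debates; the sketch only shows what the counter would look like if it did.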

Peter Wang (王信友) Oct. 7, 2024, 7:20 a.m. UTC | #2

On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote:
>  On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
> > This patch merely aligns with the approach of SDB mode
> > and does not involve the flow of scsi_done. Besides,
> > I don't see any issue with concurrency between
> > ufshcd_abort_one() calling ufshcd_try_to_abort_task()
> > and scsi_done(). Can you point out the specific flow where
> > the problem occurs? If there is one, shouldn't SDB mode
> > have the same issue?
> 
> Hi Peter,
> 
> Correct, my comment applies to both legacy mode and MCQ mode. From
> the 
> section in the UFS standard about ABORT TASK: "A response of FUNCTION
> COMPLETE shall indicate that the command was aborted or was not in
> the 
> task set." In other words, if a command completes just before 
> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
> that has already completed. In legacy mode, this call will succeed.
> 

Hi Bart,

Yes, the legacy SDB mode is protected by the outstanding_lock.


> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to
> be
> decremented twice instead of only once. Do you agree that this can
> happen and also that it should be prevented that this happens?
> 
> Thanks,
> 
> Bart.

Sorry, I still don't understand why both ufshcd_compl_one_cqe() 
and ufshcd_abort_all() will call ufshcd_release(hba)? 
Because I have already removed the ufshcd_release_scsi_cmd from 
ufshcd_abort_one, so the command won't be released immediately 
after ufshcd_try_to_abort_task succeeds. Instead, it will wait 
until the CQ Entry comes in before releasing. And since it is 
protected by the cq_lock, it should only release once, right?

Thanks.
Peter
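
Peter's argument is that a lock makes the release happen only once. A minimal user-space sketch of that pattern follows, assuming a per-tag "outstanding" bit that is tested and cleared under a mutex. The `demo_hba` structure and `demo_release_tag()` are hypothetical; the mutex only plays the role that outstanding_lock or cq_lock plays in the discussion.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: one bit per outstanding tag, guarded by a lock. */
struct demo_hba {
	pthread_mutex_t lock;       /* stands in for outstanding_lock / cq_lock */
	unsigned long outstanding;  /* bit n set => tag n not yet released */
	int active_reqs;
};

/* Returns true only for the caller that actually owned the tag. */
static bool demo_release_tag(struct demo_hba *hba, unsigned int tag)
{
	bool owned;

	assert(tag < 8 * sizeof(hba->outstanding));

	pthread_mutex_lock(&hba->lock);
	owned = hba->outstanding & (1UL << tag);
	hba->outstanding &= ~(1UL << tag);
	if (owned)
		hba->active_reqs--;  /* dropped at most once per tag */
	pthread_mutex_unlock(&hba->lock);

	return owned;
}
```

The pattern holds only if every path that can release a command goes through the same test-and-clear under the same lock; a path that releases without checking the bit would bypass the protection.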

Bart Van Assche Oct. 8, 2024, 6:29 p.m. UTC | #3

On 10/7/24 12:20 AM, Peter Wang (王信友) wrote:
> On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote:
>>   On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
>>> This patch merely aligns with the approach of SDB mode
>>> and does not involve the flow of scsi_done. Besides,
>>> I don't see any issue with concurrency between
>>> ufshcd_abort_one() calling ufshcd_try_to_abort_task()
>>> and scsi_done(). Can you point out the specific flow where
>>> the problem occurs? If there is one, shouldn't SDB mode
>>> have the same issue?
>>
>> Hi Peter,
>>
>> Correct, my comment applies to both legacy mode and MCQ mode. From
>> the
>> section in the UFS standard about ABORT TASK: "A response of FUNCTION
>> COMPLETE shall indicate that the command was aborted or was not in
>> the
>> task set." In other words, if a command completes just before
>> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
>> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
>> that has already completed. In legacy mode, this call will succeed.
>>
> 
> Hi Bart,
> 
> Yes, the legacy SDB mode is protected by the outstanding_lock.
> 
> 
>> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
>> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to
>> be
>> decremented twice instead of only once. Do you agree that this can
>> happen and also that it should be prevented that this happens?
>>
>> Thanks,
>>
>> Bart.
> 
> Sorry, I still don't understand why both ufshcd_compl_one_cqe()
> and ufshcd_abort_all() will call ufshcd_release(hba)?
> Because I have already removed the ufshcd_release_scsi_cmd from
> ufshcd_abort_one, so the command won't be released immediately
> after ufshcd_try_to_abort_task succeeds. Instead, it will wait
> until the CQ Entry comes in before releasing. And since it is
> protected by the cq_lock, it should only release once, right?

Hi Peter,

I think what you wrote applies to MCQ mode only. In my previous email
I clearly referred to "legacy mode" (SDB mode). Summarizing my previous
email, I think that in legacy mode it is possible that ufshcd_release()
is called twice while it only should be called once. Here are the
possible solutions I see:
* Add a function to the SCSI core for setting SCMD_STATE_COMPLETE. This
   may be controversial since no other SCSI LLD needs this functionality.
* Changing the error handling approach in the UFS driver to the same
   approach other SCSI LLDs use: instead of using queue_work() to
   activate the error handler, call scsi_schedule_eh(). This will cause
   the error handler to be activated later, namely after all pending
   commands have timed out instead of aborting any pending commands
   first.
* Add a variant of scsi_schedule_eh() to the SCSI core that accelerates
   error handling by calling scsi_timeout() on all pending commands.

Thanks,

Bart.
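
The first option in the list above relies on a per-command "already completed" flag. As a hedged illustration of that general idea, and not of any existing SCSI core interface, the sketch below uses a C11 atomic flag so that only the first caller performs the release. `demo_cmd` and `demo_complete_once()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical command carrying a "completed" flag. */
struct demo_cmd {
	atomic_flag completed;
	int release_count;
};

/*
 * Returns true for the first caller only. Later callers find the flag
 * already set and back off without releasing anything.
 */
static bool demo_complete_once(struct demo_cmd *cmd)
{
	if (atomic_flag_test_and_set(&cmd->completed))
		return false;      /* another path already completed it */
	cmd->release_count++;      /* resources released exactly once */
	return true;
}
```

With such a flag, the completion path and the abort path can race freely: whichever sets the flag first owns the release, and the other becomes a no-op.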

Peter Wang (王信友) Oct. 9, 2024, 2:17 a.m. UTC | #4

On Tue, 2024-10-08 at 11:29 -0700, Bart Van Assche wrote:
> Hi Peter,
> 
> I think what you wrote applies to MCQ mode only. In my previous email
> I clearly referred to "legacy mode" (SDB mode). Summarizing my
> previous
> email, I think that in legacy mode it is possible that
> ufshcd_release()
> is called twice while it only should be called once. Here are the
> possible solutions I see:
> * Add a function to the SCSI core for setting SCMD_STATE_COMPLETE.
> This
>    may be controversial since no other SCSI LLD needs this
> functionality.
> * Changing the error handling approach in the UFS driver to the same
>    approach other SCSI LLDs use: instead of using queue_work() to
>    activate the error handler, call scsi_schedule_eh(). This will
> cause
>    the error handler to be activated later, namely after all pending
>    commands have timed out instead of aborting any pending commands
>    first.
> * Add a variant of scsi_schedule_eh() to the SCSI core that
> accelerates
>    error handling by calling scsi_timeout() on all pending commands.
> 
> Thanks,
> 
> Bart.
> 

Hi Bart,

Yes, this patch is only for MCQ mode, because only MCQ mode 
receives OCS: ABORTED, right? This patch doesn't modify 
any of the Legacy mode flows, does it?

Additionally, I still don't understand why you say there would 
be an issue with legacy mode having duplicate ufshcd_release(hba) 
calls. As I mentioned before, it is protected by the 
outstanding_lock. Could you please clarify the detailed 
error flow?

Furthermore, even if there is an issue with Legacy mode, it 
should be addressed by a separate patch, not by this one, which is 
intended to resolve the MCQ mode issue. We shouldn't mix two 
different issues together, don't you agree?

Thanks
Peter

Bart Van Assche Oct. 9, 2024, 5:59 p.m. UTC | #5

On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> After the SQ cleanup fix, the CQ will receive a response with
> the corresponding tag marked as OCS: ABORTED. To align with
> the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> has been changed to match that of OCS_INVALID_COMMAND_STATUS
> (SDB), with both returning a SCSI result of DID_REQUEUE.
> 
> Furthermore, the workaround implemented before the SQ cleanup
> fix can be removed.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Bart Van Assche Oct. 9, 2024, 6:06 p.m. UTC | #6

On 10/8/24 7:17 PM, Peter Wang (王信友) wrote:
> Yes, this patch is only for MCQ mode, because only MCQ mode
> receives OCS: ABORTED, right? This patch doesn't modify
> any of the Legacy mode flows, does it?

Agreed. What I mentioned in my email is an existing bug in the legacy
flow for ufshcd_abort_all().

> Furthermore, even if there is an issue with Legacy mode, it
> should be addressed by a separate patch, not by this one, which is
> intended to resolve the MCQ mode issue. We shouldn't mix two
> different issues together, don't you agree?

Let's proceed with this patch series and let's address what I brought
up in my email separately.

With the current approach for error handling in the UFS driver, anyone
who wants to verify or modify ufshcd_try_to_abort_task() has to consider
all possible interleavings of ufshcd_try_to_abort_task() and the
completion path (ufshcd_compl_one_cqe()). That's an unnecessary burden
on UFS driver contributors. Additionally, this is error-prone. This
applies to both modes (legacy and MCQ). I know of reports of sporadic
crashes in legacy mode related to UFS error handling. I'm wondering
whether these are perhaps the result of the issue I mentioned in a
previous email. Anyway, I will look further into this myself as soon as
I have the time.

Thanks,

Bart.

Peter Wang (王信友) Oct. 11, 2024, 5:44 a.m. UTC | #7

On Wed, 2024-10-09 at 11:06 -0700, Bart Van Assche wrote:
> On 10/8/24 7:17 PM, Peter Wang (王信友) wrote:
> > Yes, this patch is only for MCQ mode, because only MCQ mode
> > receives OCS: ABORTED, right? This patch doesn't modify
> > any of the Legacy mode flows, does it?
> 
> Agreed. What I mentioned in my email is an existing bug in the legacy
> flow for ufshcd_abort_all().
> 
> > Furthermore, even if there is an issue with Legacy mode, it
> > should be addressed by a separate patch, not by this one, which is
> > intended to resolve the MCQ mode issue. We shouldn't mix two
> > different issues together, don't you agree?
> 
> Let's proceed with this patch series and let's address what I brought
> up in my email separately.
> 
> With the current approach for error handling in the UFS driver,
> anyone
> who wants to verify or modify ufshcd_try_to_abort_task() has to
> consider
> all possible interleavings of ufshcd_try_to_abort_task() and the
> completion path (ufshcd_compl_one_cqe()). That's an unnecessary
> burden
> on UFS driver contributors. Additionally, this is error-prone. This
> applies to both modes (legacy and MCQ). I know of reports of sporadic
> crashes in legacy mode related to UFS error handling. I'm wondering
> whether these are perhaps the result of the issue I mentioned in a
> previous email. Anyway, I will look further into this myself as soon
> as
> I have the time.
> 
> Thanks,
> 
> Bart.

Hi Bart,

Thank you for your review.
 
I currently cannot see the issue of duplicate releases in 
legacy SDB mode. ufshcd_try_to_abort_task() will directly 
reset if it fails. It is only in the case of success that 
we need to consider the possibility of ufshcd_compl_one_cqe. 
I believe the original design flow has already taken this 
into account, which is why there is protection with 
outstanding_lock/cq_lock. Perhaps we can wait for an actual 
example to occur before making corrections. Even if there 
is an issue, I think the probability should be very low, 
because the flow for legacy SDB mode has been in use 
for several years.

Thank you again for your review.

Thanks
Peter
