scsi: core: ufs: Fix a hang in the error handler

Message ID	20250523201409.1676055-1-bvanassche@acm.org
State	New
Headers	show Received: from 004.mia.mailroute.net (004.mia.mailroute.net [199.89.3.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 60D9A1A315A for <linux-scsi@vger.kernel.org>; Fri, 23 May 2025 20:14:35 +0000 (UTC) sender: bvanassche@acm.org) by 004.mia.mailroute.net (Postfix) with ESMTPSA id 4b3xFq0ndBzm0gbn; Fri, 23 May 2025 20:14:22 +0000 (UTC) From: Bart Van Assche <bvanassche@acm.org> To: "Martin K . Petersen" <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org, Sanjeev Yadav <sanjeev.y@mediatek.com>, Bart Van Assche <bvanassche@acm.org>, "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>, Matthias Brugger <matthias.bgg@gmail.com>, AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>, Peter Wang <peter.wang@mediatek.com>, Avri Altman <avri.altman@wdc.com>, Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>, "Bao D. Nguyen" <quic_nguyenb@quicinc.com>, "James E.J. Bottomley" <jejb@linux.ibm.com>, Santosh Y <santoshsy@gmail.com> Subject: [PATCH] scsi: core: ufs: Fix a hang in the error handler Date: Fri, 23 May 2025 13:14:01 -0700 Message-ID: <20250523201409.1676055-1-bvanassche@acm.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable
Series	scsi: core: ufs: Fix a hang in the error handler \| expand scsi: core: ufs: Fix a hang in the error handler

Bart Van Assche May 23, 2025, 8:14 p.m. UTC

From: Sanjeev Yadav <sanjeev.y@mediatek.com>

ufshcd_err_handling_prepare() calls ufshcd_rpm_get_sync(). The latter
function can only succeed if UFSHCD_EH_IN_PROGRESS is not set because
resuming involves submitting a SCSI command and ufshcd_queuecommand()
returns SCSI_MLQUEUE_HOST_BUSY if UFSHCD_EH_IN_PROGRESS is set. Fix
this hang by setting UFSHCD_EH_IN_PROGRESS after
ufshcd_rpm_get_sync() has been called instead of before.

Backtrace:
__switch_to+0x174/0x338
__schedule+0x600/0x9e4
schedule+0x7c/0xe8
schedule_timeout+0xa4/0x1c8
io_schedule_timeout+0x48/0x70
wait_for_common_io+0xa8/0x160 //waiting on START_STOP
wait_for_completion_io_timeout+0x10/0x20
blk_execute_rq+0xe4/0x1e4
scsi_execute_cmd+0x108/0x244
ufshcd_set_dev_pwr_mode+0xe8/0x250
__ufshcd_wl_resume+0x94/0x354
ufshcd_wl_runtime_resume+0x3c/0x174
scsi_runtime_resume+0x64/0xa4
rpm_resume+0x15c/0xa1c
__pm_runtime_resume+0x4c/0x90 // Runtime resume ongoing
ufshcd_err_handler+0x1a0/0xd08
process_one_work+0x174/0x808
worker_thread+0x15c/0x490
kthread+0xf4/0x1ec
ret_from_fork+0x10/0x20

Signed-off-by: Sanjeev Yadav <sanjeev.y@mediatek.com>
[ bvanassche: rewrote patch description ]
Fixes: 62694735ca95 ("[SCSI] ufs: Add runtime PM support for UFS host controller driver")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/ufs/core/ufshcd.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Bart Van Assche May 27, 2025, 10:01 p.m. UTC | #1

On 5/26/25 6:21 AM, Peter Wang (王信友) wrote:
> On Fri, 2025-05-23 at 13:14 -0700, Bart Van Assche wrote:
>>
>> From: Sanjeev Yadav <sanjeev.y@mediatek.com>
>>
>> ufshcd_err_handling_prepare() calls ufshcd_rpm_get_sync(). The latter
>> function can only succeed if UFSHCD_EH_IN_PROGRESS is not set because
>> resuming involves submitting a SCSI command and ufshcd_queuecommand()
>> returns SCSI_MLQUEUE_HOST_BUSY if UFSHCD_EH_IN_PROGRESS is set. Fix
>> this hang by setting UFSHCD_EH_IN_PROGRESS after
>> ufshcd_rpm_get_sync() has been called instead of before.
>>
>> Backtrace:
>> __switch_to+0x174/0x338
>> __schedule+0x600/0x9e4
>> schedule+0x7c/0xe8
>> schedule_timeout+0xa4/0x1c8
>> io_schedule_timeout+0x48/0x70
>> wait_for_common_io+0xa8/0x160 //waiting on START_STOP
>> wait_for_completion_io_timeout+0x10/0x20
>> blk_execute_rq+0xe4/0x1e4
>> scsi_execute_cmd+0x108/0x244
>> ufshcd_set_dev_pwr_mode+0xe8/0x250
>> __ufshcd_wl_resume+0x94/0x354
>> ufshcd_wl_runtime_resume+0x3c/0x174
>> scsi_runtime_resume+0x64/0xa4
>> rpm_resume+0x15c/0xa1c
>> __pm_runtime_resume+0x4c/0x90 // Runtime resume ongoing
>> ufshcd_err_handler+0x1a0/0xd08
>> process_one_work+0x174/0x808
>> worker_thread+0x15c/0x490
>> kthread+0xf4/0x1ec
>> ret_from_fork+0x10/0x20
>>
>> Signed-off-by: Sanjeev Yadav <sanjeev.y@mediatek.com>
>> [ bvanassche: rewrote patch description ]
>> Fixes: 62694735ca95 ("[SCSI] ufs: Add runtime PM support for UFS host
>> controller driver")
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> 
> I'm a bit curious why the ufshcd_err_handler was triggered?
> Because during the runtime suspend/resume process, if there
> are any errors, it should use ufshcd_link_recovery directly?

Hi Peter,

I have not yet tried to perform a full root-cause analysis. I decided
to share this patch since this issue has been observed not only by the
author of the above patch but also by my colleagues.

Regarding your question, are you referring to the ufshcd_link_recovery()
call in ufshcd_eh_host_reset_handler()? ufshcd_eh_host_reset_handler()
is not the only function that can trigger the UFS error handler. There
are several ufshcd_schedule_eh_work() calls outside the
ufshcd_eh_host_reset_handler() function.

Bart.

Peter Wang (王信友) May 28, 2025, 8:32 a.m. UTC | #2

On Tue, 2025-05-27 at 15:01 -0700, Bart Van Assche wrote:
> 
> 
> Hi Peter,
> 
> I have not yet tried to perform a full root-cause analysis. I decided
> to share this patch since this issue has been observed not only by
> the
> author of the above patch but also by my colleagues.
> 
> Regarding your question, are you referring to the
> ufshcd_link_recovery()
> call in ufshcd_eh_host_reset_handler()?
> ufshcd_eh_host_reset_handler()
> is not the only function that can trigger the UFS error handler.
> There
> are several ufshcd_schedule_eh_work() calls outside the
> ufshcd_eh_host_reset_handler() function.
> 
> Bart.

Hi Bart,

Yes, in my opinion, if an error occurs during runtime suspend,
it can trigger ufshcd_err_handler and then directly return busy 
to break this suspend attempt.
But for a runtime resume error, can only use ufshcd_link_recovery; 
otherwise, ufshcd_err_handler will still get stuck waiting 
for runtime PM to become active.
The previous patch was specifically for this resume case.
https://patchwork.kernel.org/project/linux-scsi/patch/20230927033557.13801-1-peter.wang@mediatek.com/

And for this patch, I feel that it's like what you mentioned, other 
functions triggered the UFS error handler, such as a UIC error. 
Since the SSU can be sent normally, it means it's an error that 
the hardware can recover from automatically. After the hardware 
automatically recovers, there's no issue during the runtime 
suspend process, and the suspend isn't break.
Or, if there's no issue during the runtime resume process, 
ufshcd_link_recovery isn't used directly either.

But this kind of patch assumes that the hardware can recover 
automatically. If the hardware can't recover, ufshcd_err_handler 
might still get stuck waiting for PM resume.
In other words, this patch solves part of the problem, but it may 
not be able to address all issues that occur during suspend/resume.
However, before we can identify the root cause, it seems there 
isn't a better solution for now. What do you think?

Thanks.
Peter

Bart Van Assche May 28, 2025, 3:43 p.m. UTC | #3

On 5/28/25 1:32 AM, Peter Wang (王信友) wrote:
> However, before we can identify the root cause, it seems there
> isn't a better solution for now. What do you think?

Hi Peter,

It seems to me that this patch doesn't have any adverse side effects and
also that it fixes a bug, namely that ufshcd_set_eh_in_progress() should
be called after ufshcd_err_handling_prepare() instead of before. Do you
agree with this and do you also agree that it would be good to have this
patch upstream?

Thank you,

Bart.

Peter Wang (王信友) May 29, 2025, 9:43 a.m. UTC | #4

On Wed, 2025-05-28 at 08:43 -0700, Bart Van Assche wrote:
> Hi Peter,
> 
> It seems to me that this patch doesn't have any adverse side effects
> and
> also that it fixes a bug, namely that ufshcd_set_eh_in_progress()
> should
> be called after ufshcd_err_handling_prepare() instead of before. Do
> you
> agree with this and do you also agree that it would be good to have
> this
> patch upstream?
> 
> Thank you,
> 
> Bart.

Hi Bart,

Yes, this patch should be able to solve part of the problem
and shouldn't cause any side effects.

Reviewed-by: Peter Wang <peter.wang@mediatek.com>

Martin K. Petersen June 4, 2025, 2 a.m. UTC | #5

Bart,

> ufshcd_err_handling_prepare() calls ufshcd_rpm_get_sync(). The latter
> function can only succeed if UFSHCD_EH_IN_PROGRESS is not set because
> resuming involves submitting a SCSI command and ufshcd_queuecommand()
> returns SCSI_MLQUEUE_HOST_BUSY if UFSHCD_EH_IN_PROGRESS is set. Fix
> this hang by setting UFSHCD_EH_IN_PROGRESS after ufshcd_rpm_get_sync()
> has been called instead of before.

Applied to 6.16/scsi-staging, thanks!

scsi: core: ufs: Fix a hang in the error handler

Commit Message

Comments

Patch