Message ID | 20250523201409.1676055-1-bvanassche@acm.org |
---|---|
State | New |
Headers | show |
Series | scsi: core: ufs: Fix a hang in the error handler | expand |
On 5/26/25 6:21 AM, Peter Wang (王信友) wrote: > On Fri, 2025-05-23 at 13:14 -0700, Bart Van Assche wrote: >> >> From: Sanjeev Yadav <sanjeev.y@mediatek.com> >> >> ufshcd_err_handling_prepare() calls ufshcd_rpm_get_sync(). The latter >> function can only succeed if UFSHCD_EH_IN_PROGRESS is not set because >> resuming involves submitting a SCSI command and ufshcd_queuecommand() >> returns SCSI_MLQUEUE_HOST_BUSY if UFSHCD_EH_IN_PROGRESS is set. Fix >> this hang by setting UFSHCD_EH_IN_PROGRESS after >> ufshcd_rpm_get_sync() has been called instead of before. >> >> Backtrace: >> __switch_to+0x174/0x338 >> __schedule+0x600/0x9e4 >> schedule+0x7c/0xe8 >> schedule_timeout+0xa4/0x1c8 >> io_schedule_timeout+0x48/0x70 >> wait_for_common_io+0xa8/0x160 //waiting on START_STOP >> wait_for_completion_io_timeout+0x10/0x20 >> blk_execute_rq+0xe4/0x1e4 >> scsi_execute_cmd+0x108/0x244 >> ufshcd_set_dev_pwr_mode+0xe8/0x250 >> __ufshcd_wl_resume+0x94/0x354 >> ufshcd_wl_runtime_resume+0x3c/0x174 >> scsi_runtime_resume+0x64/0xa4 >> rpm_resume+0x15c/0xa1c >> __pm_runtime_resume+0x4c/0x90 // Runtime resume ongoing >> ufshcd_err_handler+0x1a0/0xd08 >> process_one_work+0x174/0x808 >> worker_thread+0x15c/0x490 >> kthread+0xf4/0x1ec >> ret_from_fork+0x10/0x20 >> >> Signed-off-by: Sanjeev Yadav <sanjeev.y@mediatek.com> >> [ bvanassche: rewrote patch description ] >> Fixes: 62694735ca95 ("[SCSI] ufs: Add runtime PM support for UFS host >> controller driver") >> Signed-off-by: Bart Van Assche <bvanassche@acm.org> > > I'm a bit curious why the ufshcd_err_handler was triggered? > Because during the runtime suspend/resume process, if there > are any errors, it should use ufshcd_link_recovery directly? Hi Peter, I have not yet tried to perform a full root-cause analysis. I decided to share this patch since this issue has been observed not only by the author of the above patch but also by my colleagues. Regarding your question, are you referring to the ufshcd_link_recovery() call in ufshcd_eh_host_reset_handler()? ufshcd_eh_host_reset_handler() is not the only function that can trigger the UFS error handler. There are several ufshcd_schedule_eh_work() calls outside the ufshcd_eh_host_reset_handler() function. Bart.
On Tue, 2025-05-27 at 15:01 -0700, Bart Van Assche wrote: > > > Hi Peter, > > I have not yet tried to perform a full root-cause analysis. I decided > to share this patch since this issue has been observed not only by > the > author of the above patch but also by my colleagues. > > Regarding your question, are you referring to the > ufshcd_link_recovery() > call in ufshcd_eh_host_reset_handler()? > ufshcd_eh_host_reset_handler() > is not the only function that can trigger the UFS error handler. > There > are several ufshcd_schedule_eh_work() calls outside the > ufshcd_eh_host_reset_handler() function. > > Bart. Hi Bart, Yes, in my opinion, if an error occurs during runtime suspend, it can trigger ufshcd_err_handler and then directly return busy to break this suspend attempt. But for a runtime resume error, can only use ufshcd_link_recovery; otherwise, ufshcd_err_handler will still get stuck waiting for runtime PM to become active. The previous patch was specifically for this resume case. https://patchwork.kernel.org/project/linux-scsi/patch/20230927033557.13801-1-peter.wang@mediatek.com/ And for this patch, I feel that it's like what you mentioned, other functions triggered the UFS error handler, such as a UIC error. Since the SSU can be sent normally, it means it's an error that the hardware can recover from automatically. After the hardware automatically recovers, there's no issue during the runtime suspend process, and the suspend isn't break. Or, if there's no issue during the runtime resume process, ufshcd_link_recovery isn't used directly either. But this kind of patch assumes that the hardware can recover automatically. If the hardware can't recover, ufshcd_err_handler might still get stuck waiting for PM resume. In other words, this patch solves part of the problem, but it may not be able to address all issues that occur during suspend/resume. However, before we can identify the root cause, it seems there isn't a better solution for now. What do you think? Thanks. Peter
On 5/28/25 1:32 AM, Peter Wang (王信友) wrote: > However, before we can identify the root cause, it seems there > isn't a better solution for now. What do you think? Hi Peter, It seems to me that this patch doesn't have any adverse side effects and also that it fixes a bug, namely that ufshcd_set_eh_in_progress() should be called after ufshcd_err_handling_prepare() instead of before. Do you agree with this and do you also agree that it would be good to have this patch upstream? Thank you, Bart.
On Wed, 2025-05-28 at 08:43 -0700, Bart Van Assche wrote: > Hi Peter, > > It seems to me that this patch doesn't have any adverse side effects > and > also that it fixes a bug, namely that ufshcd_set_eh_in_progress() > should > be called after ufshcd_err_handling_prepare() instead of before. Do > you > agree with this and do you also agree that it would be good to have > this > patch upstream? > > Thank you, > > Bart. Hi Bart, Yes, this patch should be able to solve part of the problem and shouldn't cause any side effects. Reviewed-by: Peter Wang <peter.wang@mediatek.com>
Bart, > ufshcd_err_handling_prepare() calls ufshcd_rpm_get_sync(). The latter > function can only succeed if UFSHCD_EH_IN_PROGRESS is not set because > resuming involves submitting a SCSI command and ufshcd_queuecommand() > returns SCSI_MLQUEUE_HOST_BUSY if UFSHCD_EH_IN_PROGRESS is set. Fix > this hang by setting UFSHCD_EH_IN_PROGRESS after ufshcd_rpm_get_sync() > has been called instead of before. Applied to 6.16/scsi-staging, thanks!
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index 34a499b19480..b43dd1f88fbc 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -6530,9 +6530,14 @@ static void ufshcd_err_handler(struct ufs_hba *hba) up(&hba->host_sem); return; } - ufshcd_set_eh_in_progress(hba); spin_unlock_irqrestore(hba->host->host_lock, flags); + ufshcd_err_handling_prepare(hba); + + spin_lock_irqsave(hba->host->host_lock, flags); + ufshcd_set_eh_in_progress(hba); + spin_unlock_irqrestore(hba->host->host_lock, flags); + /* Complete requests that have door-bell cleared by h/w */ ufshcd_complete_requests(hba, false); spin_lock_irqsave(hba->host->host_lock, flags);