Message ID | 20250306173226.857335-1-jeff.hugo@oss.qualcomm.com |
---|---|
State | New |
Headers | show |
Series | bus: mhi: host: Address conflict between power_up and syserr | expand |
On 3/6/25 12:32 PM, Jeff Hugo wrote: > From: Jeffrey Hugo <quic_jhugo@quicinc.com> > > mhi_async_power_up() enables IRQs, at which point we can receive a syserr > notification from the device. The syserr notification queues a work item > that cannot execute until the pm_mutex is released. > > If we receive a syserr notification at the right time during > mhi_async_power_up(), we will fail to initialize the device. > > The syserr work item will be pending. If mhi_async_power_up() detects the > syserr, it will handle it. If the device is in PBL, then the PBL state > transition event will be queued, resulting in a work item after the > pending syserr work item. Once mhi_async_power_up() releases the pm_mutex > the syserr work item can run. It will blindly attempt to reset the MHI > state machine, which is the recovery action for syserr. PBL/SBL are not > interrupt driven and will ignore the MHI Reset unless syserr is actively > advertised. This will cause the syserr work item to timeout waiting for > Reset to be cleared, and will leave the host state in syserr processing. > The PBL transition work item will then run, and immediately fail because > syserr processing is not a valid state for PBL transition. > > This leaves the device uninitialized. > > This issue has a fairly unique signature in the kernel log: > > [ 909.803598] mhi mhi3: Requested to power ON > [ 909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device. Attempting to recover > [ 909.803945] mhi mhi3: Power on setup success > [ 911.808444] mhi mhi3: Device failed to exit MHI Reset state > [ 911.808448] mhi mhi3: Device MHI is not in valid state > > We cannot remove the syserr handling from mhi_async_power_up() because the > device may be in the syserr state, but we missed the notification as the > irq was fired before irqs were enabled. We also can't queue the syserr > work item from mhi_async_power_up() if syserr is detected because that may > result in a duplicate work item, and cause the same issue since the > duplicate item will blindly issue MHI Reset even if syserr is no longer > active. > > Instead, add a check in the syserr work item to make sure that the device > is in the syserr state if the device is in the PBL or SBL EEs. > > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Troy Hanson <quic_thanson@quicinc.com>
diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c index 11c0e751f223..3dff0f932726 100644 --- a/drivers/bus/mhi/host/pm.c +++ b/drivers/bus/mhi/host/pm.c @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) struct mhi_cmd *mhi_cmd; struct mhi_event_ctxt *er_ctxt; struct device *dev = &mhi_cntrl->mhi_dev->dev; + bool reset_device = false; int ret, i; dev_dbg(dev, "Transitioning from PM state: %s to: %s\n", @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) /* Wake up threads waiting for state transition */ wake_up_all(&mhi_cntrl->state_event); - /* Trigger MHI RESET so that the device will not access host memory */ + /* + * Trigger MHI RESET so that the device will not access host memory. + * If the device is in PBL or SBL, it will only respond to RESET if + * the device is in SYSERR state. SYSERR might already be cleared + * at this point. + */ if (MHI_REG_ACCESS_VALID(prev_state)) { + enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl); + enum mhi_ee_type cur_ee = mhi_get_exec_env(mhi_cntrl); + + if (cur_statemachine_state == MHI_STATE_SYS_ERR) + reset_device = true; + else if (cur_ee != MHI_EE_PBL && cur_ee != MHI_EE_SBL) + reset_device = true; + } + + if (reset_device) { u32 in_reset = -1; unsigned long timeout = msecs_to_jiffies(mhi_cntrl->timeout_ms);