Message ID | 20250517043942.372315-1-royluo@google.com |
---|---|
State | Superseded |
Headers | show |
Series | [v1] Revert "usb: xhci: Implement xhci_handshake_check_state() helper" | expand |
On 17.5.2025 7.39, Roy Luo wrote: > This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c. > > Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() > helper") was introduced to workaround watchdog timeout issues on some > platforms, allowing xhci_reset() to bail out early without waiting > for the reset to complete. > > Skipping the xhci handshake during a reset is a dangerous move. The > xhci specification explicitly states that certain registers cannot > be accessed during reset in section 5.4.1 USB Command Register (USBCMD), > Host Controller Reset (HCRST) field: > "This bit is cleared to '0' by the Host Controller when the reset > process is complete. Software cannot terminate the reset process > early by writinga '0' to this bit and shall not write any xHC > Operational or Runtime registers until while HCRST is '1'." > > This behavior causes a regression on SNPS DWC3 USB controller with > dual-role capability. When the DWC3 controller exits host mode and > removes xhci while a reset is still in progress, and then tries to > configure its hardware for device mode, the ongoing reset leads to > register access issues; specifically, all register reads returns 0. > These issues extend beyond the xhci register space (which is expected > during a reset) and affect the entire DWC3 IP block, causing the DWC3 > device mode to malfunction. I agree with you and Thinh that waiting for the HCRST bit to clear during reset is the right thing to do, especially now when we know skipping it causes issues for SNPS DWC3, even if it's only during remove phase. But reverting this patch will re-introduce the issue originally worked around by Udipto Goswami, causing regression. Best thing to do would be to wait for HCRST to clear for all other platforms except the one with the issue. Udipto Goswami, can you recall the platforms that needed this workaroud? and do we have an easy way to detect those? Thanks Mathias
On Mon, May 19, 2025 at 6:23 PM Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > > On 17.5.2025 7.39, Roy Luo wrote: > > This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c. > > > > Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() > > helper") was introduced to workaround watchdog timeout issues on some > > platforms, allowing xhci_reset() to bail out early without waiting > > for the reset to complete. > > > > Skipping the xhci handshake during a reset is a dangerous move. The > > xhci specification explicitly states that certain registers cannot > > be accessed during reset in section 5.4.1 USB Command Register (USBCMD), > > Host Controller Reset (HCRST) field: > > "This bit is cleared to '0' by the Host Controller when the reset > > process is complete. Software cannot terminate the reset process > > early by writinga '0' to this bit and shall not write any xHC > > Operational or Runtime registers until while HCRST is '1'." > > > > This behavior causes a regression on SNPS DWC3 USB controller with > > dual-role capability. When the DWC3 controller exits host mode and > > removes xhci while a reset is still in progress, and then tries to > > configure its hardware for device mode, the ongoing reset leads to > > register access issues; specifically, all register reads returns 0. > > These issues extend beyond the xhci register space (which is expected > > during a reset) and affect the entire DWC3 IP block, causing the DWC3 > > device mode to malfunction. > > I agree with you and Thinh that waiting for the HCRST bit to clear during > reset is the right thing to do, especially now when we know skipping it > causes issues for SNPS DWC3, even if it's only during remove phase. > > But reverting this patch will re-introduce the issue originally worked > around by Udipto Goswami, causing regression. > > Best thing to do would be to wait for HCRST to clear for all other platforms > except the one with the issue. > > Udipto Goswami, can you recall the platforms that needed this workaroud? > and do we have an easy way to detect those? Hi Mathias,
On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote: > Hi Mathias, > > From what I recall, we saw this issue coming up on our QCOM mobile > platforms but it was not consistent. It was only reported in long runs > i believe. The most recent instance when I pushed this patch was with > platform SM8650, it was a watchdog timeout issue where xhci_reset() -> > xhci_handshake() polling read timeout upon xhci remove. Was it some system-wide watchdog, i.e. unrelated tasks were locking up? It looks similar to that command abort freeze: xhci_resume() calls xhci_reset() under xhci->lock, and xhci_handshake() spins for a few seconds with the spinlock held. Anything else (workers, IRQs) trying to grab the lock will also spin and delay unrelated things. Not sure why your commit message says "Use this helper in places where xhci_handshake is called unlocked and has a long timeout", because you end up calling it from two places where the lock is (incorrectly) held. That's why adding the early bailout helped, I guess. > Unfortunately I was not able to simulate the scenario for more > granular testing and had validated it with long hours stress testing. Looking at xhci_resume(), it will call xhci_reset() if the controller has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming. I guess you could simulate this case by forcing the quirk with a module parameter and adding some extra delay to xhci_handshake(), so you are not dependent on the hardware actually failing in any manner. > Full call stack on core6: > -000|readl([X19] addr = 0xFFFFFFC03CC08020) > -001|xhci_handshake(inline) > -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) > -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) > -003|xhci_plat_runtime_resume([locdesc] dev = ?) > -004|pm_generic_runtime_resume([locdesc] dev = ?) > -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = > 0xFFFFFF890F619C10) > -006|rpm_callback(inline) > -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, > [NSD:0xFFFFFFC041453AD4] rpmflags = 4) > -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) > -008|pm_runtime_get_sync(inline) > -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) > -009|platform_remove([X19] _dev = 0xFFFFFF890F619C10) > -010|device_remove(inline)
On Tue, May 20, 2025 at 4:02 AM Michał Pecio <michal.pecio@gmail.com> wrote: > > On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote: > > Hi Mathias, > > > > From what I recall, we saw this issue coming up on our QCOM mobile > > platforms but it was not consistent. It was only reported in long runs > > i believe. The most recent instance when I pushed this patch was with > > platform SM8650, it was a watchdog timeout issue where xhci_reset() -> > > xhci_handshake() polling read timeout upon xhci remove. > > Was it some system-wide watchdog, i.e. unrelated tasks were locking up? Hi Michal, Not exactly, I could see the other tasks were not stuck, only the readl which we do as part of xhci_handshake with a 10 sec timer. Our watchdog barks out at 10 sec and we saw in that timeframe it didn't respond i.e the readl_poll_timeout_atomic was still polling. Since the timer is exactly aligned to the Watchdog timer here therefore it crashed. > It looks similar to that command abort freeze: xhci_resume() calls > xhci_reset() under xhci->lock, and xhci_handshake() spins for a few > seconds with the spinlock held. Anything else (workers, IRQs) trying > to grab the lock will also spin and delay unrelated things. > > Not sure why your commit message says "Use this helper in places where > xhci_handshake is called unlocked and has a long timeout", because you > end up calling it from two places where the lock is (incorrectly) held. > That's why adding the early bailout helped, I guess. > I think we had re-worded the patch a little, this was my original commit: "In some situations where xhci removal happens parallel to xhci_handshake, we enoughter a scenario where the xhci_handshake will fails because the status does not change the entire duration of polling. This causes the xhci_handshake to timeout resulting in long wait which might lead to watchdog timeout." The API handles command timeout which may happen upon XHCI stack removal. Check for xhci state and exit the handshake if xhci is removed. https://lore.kernel.org/all/20230919085847.8210-1-quic_ugoswami@quicinc.com/ But yeah the main motive was to bail out handshake to get around this. So you could say my main problem was that the CMD_RESET was stuck for a long time. In other cases the reset passes in very short amount of time. It was unusual for this case. > > Unfortunately I was not able to simulate the scenario for more > > granular testing and had validated it with long hours stress testing. > > Looking at xhci_resume(), it will call xhci_reset() if the controller > has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming. > > I guess you could simulate this case by forcing the quirk with a module > parameter and adding some extra delay to xhci_handshake(), so you are > not dependent on the hardware actually failing in any manner. This would definitely bark, but like I mentioned my case was that the CMD_RESET didn't exit the poll in the 10 sec. I'm not sure if that points to the controller stuck at processing something else? Please correct me if i'm wrong here. Thanks, -Udipto
On 19.5.2025 21.13, Udipto Goswami wrote: > On Mon, May 19, 2025 at 6:23 PM Mathias Nyman > <mathias.nyman@linux.intel.com> wrote: >> >> On 17.5.2025 7.39, Roy Luo wrote: >>> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c. >>> >>> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() >>> helper") was introduced to workaround watchdog timeout issues on some >>> platforms, allowing xhci_reset() to bail out early without waiting >>> for the reset to complete. >>> >>> Skipping the xhci handshake during a reset is a dangerous move. The >>> xhci specification explicitly states that certain registers cannot >>> be accessed during reset in section 5.4.1 USB Command Register (USBCMD), >>> Host Controller Reset (HCRST) field: >>> "This bit is cleared to '0' by the Host Controller when the reset >>> process is complete. Software cannot terminate the reset process >>> early by writinga '0' to this bit and shall not write any xHC >>> Operational or Runtime registers until while HCRST is '1'." >>> >>> This behavior causes a regression on SNPS DWC3 USB controller with >>> dual-role capability. When the DWC3 controller exits host mode and >>> removes xhci while a reset is still in progress, and then tries to >>> configure its hardware for device mode, the ongoing reset leads to >>> register access issues; specifically, all register reads returns 0. >>> These issues extend beyond the xhci register space (which is expected >>> during a reset) and affect the entire DWC3 IP block, causing the DWC3 >>> device mode to malfunction. >> >> I agree with you and Thinh that waiting for the HCRST bit to clear during >> reset is the right thing to do, especially now when we know skipping it >> causes issues for SNPS DWC3, even if it's only during remove phase. >> >> But reverting this patch will re-introduce the issue originally worked >> around by Udipto Goswami, causing regression. >> >> Best thing to do would be to wait for HCRST to clear for all other platforms >> except the one with the issue. >> >> Udipto Goswami, can you recall the platforms that needed this workaroud? >> and do we have an easy way to detect those? > > Hi Mathias, > > From what I recall, we saw this issue coming up on our QCOM mobile > platforms but it was not consistent. It was only reported in long runs > i believe. The most recent instance when I pushed this patch was with > platform SM8650, it was a watchdog timeout issue where xhci_reset() -> > xhci_handshake() polling read timeout upon xhci remove. Unfortunately > I was not able to simulate the scenario for more granular testing and > had validated it with long hours stress testing. > The callstack was like so: > > Full call stack on core6: > -000|readl([X19] addr = 0xFFFFFFC03CC08020) > -001|xhci_handshake(inline) > -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) > -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) > -003|xhci_plat_runtime_resume([locdesc] dev = ?) > -004|pm_generic_runtime_resume([locdesc] dev = ?) > -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = > 0xFFFFFF890F619C10) > -006|rpm_callback(inline) > -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, > [NSD:0xFFFFFFC041453AD4] rpmflags = 4) > -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) > -008|pm_runtime_get_sync(inline) > -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) Thank you for clarifying this. So patch avoids the long timeout by always cutting xhci reinit path short in xhci_resume() if resume was caused by pm_runtime_get_sync() call in xhci_plat_remove() void xhci_plat_remove(struct platform_device *dev) { xhci->xhc_state |= XHCI_STATE_REMOVING; pm_runtime_get_sync(&dev->dev); ... } I think we can revert this patch, and just make sure that we don't reset the host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set. Just return immediately instead. xhci_reset() will be called with a shorter timeout later in the remove path Not entirely sure remove path needs to call pm_runtime_get_sync(). I think it just tries to prevent runtime suspend/resume from racing with remove. PCI code seems to call pm_runtime_get_noresume() in remove path instead. Thanks Mathias
On Tue, May 20, 2025 at 9:18 AM Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > > On 19.5.2025 21.13, Udipto Goswami wrote: > > On Mon, May 19, 2025 at 6:23 PM Mathias Nyman > > <mathias.nyman@linux.intel.com> wrote: > >> > >> On 17.5.2025 7.39, Roy Luo wrote: > >>> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c. > >>> > >>> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() > >>> helper") was introduced to workaround watchdog timeout issues on some > >>> platforms, allowing xhci_reset() to bail out early without waiting > >>> for the reset to complete. > >>> > >>> Skipping the xhci handshake during a reset is a dangerous move. The > >>> xhci specification explicitly states that certain registers cannot > >>> be accessed during reset in section 5.4.1 USB Command Register (USBCMD), > >>> Host Controller Reset (HCRST) field: > >>> "This bit is cleared to '0' by the Host Controller when the reset > >>> process is complete. Software cannot terminate the reset process > >>> early by writinga '0' to this bit and shall not write any xHC > >>> Operational or Runtime registers until while HCRST is '1'." > >>> > >>> This behavior causes a regression on SNPS DWC3 USB controller with > >>> dual-role capability. When the DWC3 controller exits host mode and > >>> removes xhci while a reset is still in progress, and then tries to > >>> configure its hardware for device mode, the ongoing reset leads to > >>> register access issues; specifically, all register reads returns 0. > >>> These issues extend beyond the xhci register space (which is expected > >>> during a reset) and affect the entire DWC3 IP block, causing the DWC3 > >>> device mode to malfunction. > >> > >> I agree with you and Thinh that waiting for the HCRST bit to clear during > >> reset is the right thing to do, especially now when we know skipping it > >> causes issues for SNPS DWC3, even if it's only during remove phase. > >> > >> But reverting this patch will re-introduce the issue originally worked > >> around by Udipto Goswami, causing regression. > >> > >> Best thing to do would be to wait for HCRST to clear for all other platforms > >> except the one with the issue. > >> > >> Udipto Goswami, can you recall the platforms that needed this workaroud? > >> and do we have an easy way to detect those? > > > > Hi Mathias, > > > > From what I recall, we saw this issue coming up on our QCOM mobile > > platforms but it was not consistent. It was only reported in long runs > > i believe. The most recent instance when I pushed this patch was with > > platform SM8650, it was a watchdog timeout issue where xhci_reset() -> > > xhci_handshake() polling read timeout upon xhci remove. Unfortunately > > I was not able to simulate the scenario for more granular testing and > > had validated it with long hours stress testing. > > The callstack was like so: > > > > Full call stack on core6: > > -000|readl([X19] addr = 0xFFFFFFC03CC08020) > > -001|xhci_handshake(inline) > > -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) > > -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) > > -003|xhci_plat_runtime_resume([locdesc] dev = ?) > > -004|pm_generic_runtime_resume([locdesc] dev = ?) > > -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = > > 0xFFFFFF890F619C10) > > -006|rpm_callback(inline) > > -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, > > [NSD:0xFFFFFFC041453AD4] rpmflags = 4) > > -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) > > -008|pm_runtime_get_sync(inline) > > -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) > > Thank you for clarifying this. > > So patch avoids the long timeout by always cutting xhci reinit path short in > xhci_resume() if resume was caused by pm_runtime_get_sync() call in > xhci_plat_remove() > > void xhci_plat_remove(struct platform_device *dev) > { > xhci->xhc_state |= XHCI_STATE_REMOVING; > pm_runtime_get_sync(&dev->dev); > ... > } > > I think we can revert this patch, and just make sure that we don't reset the > host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set. > Just return immediately instead. > Just to be sure, are you proposing that we skip xhci_reset() within the reinit path of xhci_resume()? If we do that, could that lead to issues with subsequent operations in the reinit sequence, such as xhci_init() or xhci_run()? Do you prefer to group the change to skip xhci_reset() within the reinit path together with this revert? or do you want it to be sent and reviewed separately? Thanks, Roy
On 22.5.2025 5.21, Roy Luo wrote: >>>> Udipto Goswami, can you recall the platforms that needed this workaroud? >>>> and do we have an easy way to detect those? >>> >>> Hi Mathias, >>> >>> From what I recall, we saw this issue coming up on our QCOM mobile >>> platforms but it was not consistent. It was only reported in long runs >>> i believe. The most recent instance when I pushed this patch was with >>> platform SM8650, it was a watchdog timeout issue where xhci_reset() -> >>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately >>> I was not able to simulate the scenario for more granular testing and >>> had validated it with long hours stress testing. >>> The callstack was like so: >>> >>> Full call stack on core6: >>> -000|readl([X19] addr = 0xFFFFFFC03CC08020) >>> -001|xhci_handshake(inline) >>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) >>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) >>> -003|xhci_plat_runtime_resume([locdesc] dev = ?) >>> -004|pm_generic_runtime_resume([locdesc] dev = ?) >>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = >>> 0xFFFFFF890F619C10) >>> -006|rpm_callback(inline) >>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, >>> [NSD:0xFFFFFFC041453AD4] rpmflags = 4) >>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) >>> -008|pm_runtime_get_sync(inline) >>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) >> >> Thank you for clarifying this. >> >> So patch avoids the long timeout by always cutting xhci reinit path short in >> xhci_resume() if resume was caused by pm_runtime_get_sync() call in >> xhci_plat_remove() >> >> void xhci_plat_remove(struct platform_device *dev) >> { >> xhci->xhc_state |= XHCI_STATE_REMOVING; >> pm_runtime_get_sync(&dev->dev); >> ... >> } >> >> I think we can revert this patch, and just make sure that we don't reset the >> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set. >> Just return immediately instead. >> > > Just to be sure, are you proposing that we skip xhci_reset() within > the reinit path > of xhci_resume()? If we do that, could that lead to issues with > subsequent operations > in the reinit sequence, such as xhci_init() or xhci_run()? I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set. This should be similar to what is going on already. xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset completes extremely fast. xhci_resume() bails out if xhci_reset() returns error: xhci_resume() ... if (power_lost) { ... retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC); spin_unlock_irq(&xhci->lock); if (retval) return retval; > > Do you prefer to group the change to skip xhci_reset() within the > reinit path together > with this revert? or do you want it to be sent and reviewed separately? First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set and we are in the reinit (power_lost) path about to call xhci_reset(); Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() Does this sound reasonable? should avoid the QCOM 10sec watchdog issue as next xhci_rest() called in xhci_remove() path has a short 250ms timeout, and ensure the SNPS DWC3 USB regression won't trigger. Thanks Mathias
On Thu, May 22, 2025 at 5:24 AM Mathias Nyman <mathias.nyman@intel.com> wrote: > > On 22.5.2025 5.21, Roy Luo wrote: > >>>> Udipto Goswami, can you recall the platforms that needed this workaroud? > >>>> and do we have an easy way to detect those? > >>> > >>> Hi Mathias, > >>> > >>> From what I recall, we saw this issue coming up on our QCOM mobile > >>> platforms but it was not consistent. It was only reported in long runs > >>> i believe. The most recent instance when I pushed this patch was with > >>> platform SM8650, it was a watchdog timeout issue where xhci_reset() -> > >>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately > >>> I was not able to simulate the scenario for more granular testing and > >>> had validated it with long hours stress testing. > >>> The callstack was like so: > >>> > >>> Full call stack on core6: > >>> -000|readl([X19] addr = 0xFFFFFFC03CC08020) > >>> -001|xhci_handshake(inline) > >>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000) > >>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?) > >>> -003|xhci_plat_runtime_resume([locdesc] dev = ?) > >>> -004|pm_generic_runtime_resume([locdesc] dev = ?) > >>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = > >>> 0xFFFFFF890F619C10) > >>> -006|rpm_callback(inline) > >>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, > >>> [NSD:0xFFFFFFC041453AD4] rpmflags = 4) > >>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4) > >>> -008|pm_runtime_get_sync(inline) > >>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00) > >> > >> Thank you for clarifying this. > >> > >> So patch avoids the long timeout by always cutting xhci reinit path short in > >> xhci_resume() if resume was caused by pm_runtime_get_sync() call in > >> xhci_plat_remove() > >> > >> void xhci_plat_remove(struct platform_device *dev) > >> { > >> xhci->xhc_state |= XHCI_STATE_REMOVING; > >> pm_runtime_get_sync(&dev->dev); > >> ... > >> } > >> > >> I think we can revert this patch, and just make sure that we don't reset the > >> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set. > >> Just return immediately instead. > >> > > > > Just to be sure, are you proposing that we skip xhci_reset() within > > the reinit path > > of xhci_resume()? If we do that, could that lead to issues with > > subsequent operations > > in the reinit sequence, such as xhci_init() or xhci_run()? > > I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set. > > This should be similar to what is going on already. > > xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset > completes extremely fast. xhci_resume() bails out if xhci_reset() returns error: > > xhci_resume() > ... > if (power_lost) { > ... > retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC); > spin_unlock_irq(&xhci->lock); > if (retval) > return retval; > > > > Do you prefer to group the change to skip xhci_reset() within the > > reinit path together > > with this revert? or do you want it to be sent and reviewed separately? > > First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set > and we are in the reinit (power_lost) path about to call xhci_reset(); > > Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement > xhci_handshake_check_state() > > Does this sound reasonable? > > should avoid the QCOM 10sec watchdog issue as next xhci_rest() called > in xhci_remove() path has a short 250ms timeout, and ensure the > SNPS DWC3 USB regression won't trigger. > > Thanks > Mathias > Thanks for the clarification! SGTM. I've sent out a new patchset accordingly https://lore.kernel.org/linux-usb/20250522190912.457583-1-royluo@google.com/ Thanks, Roy
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 423bf3649570..b720e04ce7d8 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -518,9 +518,8 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags) * In the future we should distinguish between -ENODEV and -ETIMEDOUT * and try to recover a -ETIMEDOUT with a host controller reset. */ - ret = xhci_handshake_check_state(xhci, &xhci->op_regs->cmd_ring, - CMD_RING_RUNNING, 0, 5 * 1000 * 1000, - XHCI_STATE_REMOVING); + ret = xhci_handshake(&xhci->op_regs->cmd_ring, + CMD_RING_RUNNING, 0, 5 * 1000 * 1000); if (ret < 0) { xhci_err(xhci, "Abort failed to stop command ring: %d\n", ret); xhci_halt(xhci); diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index 90eb491267b5..472c4b6ae59e 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -83,29 +83,6 @@ int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us) return ret; } -/* - * xhci_handshake_check_state - same as xhci_handshake but takes an additional - * exit_state parameter, and bails out with an error immediately when xhc_state - * has exit_state flag set. - */ -int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr, - u32 mask, u32 done, int usec, unsigned int exit_state) -{ - u32 result; - int ret; - - ret = readl_poll_timeout_atomic(ptr, result, - (result & mask) == done || - result == U32_MAX || - xhci->xhc_state & exit_state, - 1, usec); - - if (result == U32_MAX || xhci->xhc_state & exit_state) - return -ENODEV; - - return ret; -} - /* * Disable interrupts and begin the xHCI halting process. */ @@ -226,8 +203,7 @@ int xhci_reset(struct xhci_hcd *xhci, u64 timeout_us) if (xhci->quirks & XHCI_INTEL_HOST) udelay(1000); - ret = xhci_handshake_check_state(xhci, &xhci->op_regs->command, - CMD_RESET, 0, timeout_us, XHCI_STATE_REMOVING); + ret = xhci_handshake(&xhci->op_regs->command, CMD_RESET, 0, timeout_us); if (ret) return ret; diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 242ab9fbc8ae..5e698561b96d 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1855,8 +1855,6 @@ void xhci_remove_secondary_interrupter(struct usb_hcd /* xHCI host controller glue */ typedef void (*xhci_get_quirks_t)(struct device *, struct xhci_hcd *); int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us); -int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr, - u32 mask, u32 done, int usec, unsigned int exit_state); void xhci_quiesce(struct xhci_hcd *xhci); int xhci_halt(struct xhci_hcd *xhci); int xhci_start(struct xhci_hcd *xhci);
This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c. Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper") was introduced to workaround watchdog timeout issues on some platforms, allowing xhci_reset() to bail out early without waiting for the reset to complete. Skipping the xhci handshake during a reset is a dangerous move. The xhci specification explicitly states that certain registers cannot be accessed during reset in section 5.4.1 USB Command Register (USBCMD), Host Controller Reset (HCRST) field: "This bit is cleared to '0' by the Host Controller when the reset process is complete. Software cannot terminate the reset process early by writinga '0' to this bit and shall not write any xHC Operational or Runtime registers until while HCRST is '1'." This behavior causes a regression on SNPS DWC3 USB controller with dual-role capability. When the DWC3 controller exits host mode and removes xhci while a reset is still in progress, and then tries to configure its hardware for device mode, the ongoing reset leads to register access issues; specifically, all register reads returns 0. These issues extend beyond the xhci register space (which is expected during a reset) and affect the entire DWC3 IP block, causing the DWC3 device mode to malfunction. Cc: stable@vger.kernel.org Fixes: 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper") Signed-off-by: Roy Luo <royluo@google.com> --- Changes in v1: - Link to previous patchset: https://lore.kernel.org/r/20250515185227.1507363-1-royluo@google.com/ --- drivers/usb/host/xhci-ring.c | 5 ++--- drivers/usb/host/xhci.c | 26 +------------------------- drivers/usb/host/xhci.h | 2 -- 3 files changed, 3 insertions(+), 30 deletions(-) base-commit: 172a9d94339cea832d89630b89d314e41d622bd8