diff mbox series

[v1] Revert "usb: xhci: Implement xhci_handshake_check_state() helper"

Message ID 20250517043942.372315-1-royluo@google.com
State Superseded
Headers show
Series [v1] Revert "usb: xhci: Implement xhci_handshake_check_state() helper" | expand

Commit Message

Roy Luo May 17, 2025, 4:39 a.m. UTC
This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.

Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
helper") was introduced to workaround watchdog timeout issues on some
platforms, allowing xhci_reset() to bail out early without waiting
for the reset to complete.

Skipping the xhci handshake during a reset is a dangerous move. The
xhci specification explicitly states that certain registers cannot
be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
Host Controller Reset (HCRST) field:
"This bit is cleared to '0' by the Host Controller when the reset
process is complete. Software cannot terminate the reset process
early by writinga '0' to this bit and shall not write any xHC
Operational or Runtime registers until while HCRST is '1'."

This behavior causes a regression on SNPS DWC3 USB controller with
dual-role capability. When the DWC3 controller exits host mode and
removes xhci while a reset is still in progress, and then tries to
configure its hardware for device mode, the ongoing reset leads to
register access issues; specifically, all register reads returns 0.
These issues extend beyond the xhci register space (which is expected
during a reset) and affect the entire DWC3 IP block, causing the DWC3
device mode to malfunction.

Cc: stable@vger.kernel.org
Fixes: 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper")
Signed-off-by: Roy Luo <royluo@google.com>
---
Changes in v1:
- Link to previous patchset: https://lore.kernel.org/r/20250515185227.1507363-1-royluo@google.com/ 
---
 drivers/usb/host/xhci-ring.c |  5 ++---
 drivers/usb/host/xhci.c      | 26 +-------------------------
 drivers/usb/host/xhci.h      |  2 --
 3 files changed, 3 insertions(+), 30 deletions(-)


base-commit: 172a9d94339cea832d89630b89d314e41d622bd8

Comments

Mathias Nyman May 19, 2025, 12:52 p.m. UTC | #1
On 17.5.2025 7.39, Roy Luo wrote:
> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
> 
> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
> helper") was introduced to workaround watchdog timeout issues on some
> platforms, allowing xhci_reset() to bail out early without waiting
> for the reset to complete.
> 
> Skipping the xhci handshake during a reset is a dangerous move. The
> xhci specification explicitly states that certain registers cannot
> be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
> Host Controller Reset (HCRST) field:
> "This bit is cleared to '0' by the Host Controller when the reset
> process is complete. Software cannot terminate the reset process
> early by writinga '0' to this bit and shall not write any xHC
> Operational or Runtime registers until while HCRST is '1'."
> 
> This behavior causes a regression on SNPS DWC3 USB controller with
> dual-role capability. When the DWC3 controller exits host mode and
> removes xhci while a reset is still in progress, and then tries to
> configure its hardware for device mode, the ongoing reset leads to
> register access issues; specifically, all register reads returns 0.
> These issues extend beyond the xhci register space (which is expected
> during a reset) and affect the entire DWC3 IP block, causing the DWC3
> device mode to malfunction.

I agree with you and Thinh that waiting for the HCRST bit to clear during
reset is the right thing to do, especially now when we know skipping it
causes issues for SNPS DWC3, even if it's only during remove phase.

But reverting this patch will re-introduce the issue originally worked
around by Udipto Goswami, causing regression.

Best thing to do would be to wait for HCRST to clear for all other platforms
except the one with the issue.

Udipto Goswami, can you recall the platforms that needed this workaroud?
and do we have an easy way to detect those?

Thanks
Mathias
Udipto Goswami May 19, 2025, 6:13 p.m. UTC | #2
On Mon, May 19, 2025 at 6:23 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 17.5.2025 7.39, Roy Luo wrote:
> > This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
> >
> > Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
> > helper") was introduced to workaround watchdog timeout issues on some
> > platforms, allowing xhci_reset() to bail out early without waiting
> > for the reset to complete.
> >
> > Skipping the xhci handshake during a reset is a dangerous move. The
> > xhci specification explicitly states that certain registers cannot
> > be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
> > Host Controller Reset (HCRST) field:
> > "This bit is cleared to '0' by the Host Controller when the reset
> > process is complete. Software cannot terminate the reset process
> > early by writinga '0' to this bit and shall not write any xHC
> > Operational or Runtime registers until while HCRST is '1'."
> >
> > This behavior causes a regression on SNPS DWC3 USB controller with
> > dual-role capability. When the DWC3 controller exits host mode and
> > removes xhci while a reset is still in progress, and then tries to
> > configure its hardware for device mode, the ongoing reset leads to
> > register access issues; specifically, all register reads returns 0.
> > These issues extend beyond the xhci register space (which is expected
> > during a reset) and affect the entire DWC3 IP block, causing the DWC3
> > device mode to malfunction.
>
> I agree with you and Thinh that waiting for the HCRST bit to clear during
> reset is the right thing to do, especially now when we know skipping it
> causes issues for SNPS DWC3, even if it's only during remove phase.
>
> But reverting this patch will re-introduce the issue originally worked
> around by Udipto Goswami, causing regression.
>
> Best thing to do would be to wait for HCRST to clear for all other platforms
> except the one with the issue.
>
> Udipto Goswami, can you recall the platforms that needed this workaroud?
> and do we have an easy way to detect those?

Hi Mathias,
Michał Pecio May 19, 2025, 10:32 p.m. UTC | #3
On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote:
> Hi Mathias,
> 
> From what I recall, we saw this issue coming up on our QCOM mobile
> platforms but it was not consistent. It was only reported in long runs
> i believe. The most recent instance when I pushed this patch was with
> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> xhci_handshake() polling read timeout upon xhci remove.

Was it some system-wide watchdog, i.e. unrelated tasks were locking up?

It looks similar to that command abort freeze: xhci_resume() calls
xhci_reset() under xhci->lock, and xhci_handshake() spins for a few
seconds with the spinlock held. Anything else (workers, IRQs) trying
to grab the lock will also spin and delay unrelated things.

Not sure why your commit message says "Use this helper in places where
xhci_handshake is called unlocked and has a long timeout", because you
end up calling it from two places where the lock is (incorrectly) held.
That's why adding the early bailout helped, I guess.

> Unfortunately I was not able to simulate the scenario for more
> granular testing and had validated it with long hours stress testing.

Looking at xhci_resume(), it will call xhci_reset() if the controller
has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming.

I guess you could simulate this case by forcing the quirk with a module
parameter and adding some extra delay to xhci_handshake(), so you are
not dependent on the hardware actually failing in any manner.

> Full call stack on core6:
> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> -001|xhci_handshake(inline)
> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> 0xFFFFFF890F619C10)
> -006|rpm_callback(inline)
> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> -008|pm_runtime_get_sync(inline)
> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
> -009|platform_remove([X19] _dev = 0xFFFFFF890F619C10)
> -010|device_remove(inline)
Udipto Goswami May 20, 2025, 12:30 p.m. UTC | #4
On Tue, May 20, 2025 at 4:02 AM Michał Pecio <michal.pecio@gmail.com> wrote:
>
> On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote:
> > Hi Mathias,
> >
> > From what I recall, we saw this issue coming up on our QCOM mobile
> > platforms but it was not consistent. It was only reported in long runs
> > i believe. The most recent instance when I pushed this patch was with
> > platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> > xhci_handshake() polling read timeout upon xhci remove.
>
> Was it some system-wide watchdog, i.e. unrelated tasks were locking up?
Hi Michal,
Not exactly, I could see the other tasks were not stuck, only the
readl which we do as part of xhci_handshake with a 10 sec timer.
Our watchdog barks out at 10 sec and we saw in that timeframe it
didn't respond i.e the readl_poll_timeout_atomic  was still polling.
Since the timer is exactly aligned to the Watchdog timer here
therefore it crashed.


> It looks similar to that command abort freeze: xhci_resume() calls
> xhci_reset() under xhci->lock, and xhci_handshake() spins for a few
> seconds with the spinlock held. Anything else (workers, IRQs) trying
> to grab the lock will also spin and delay unrelated things.
>
> Not sure why your commit message says "Use this helper in places where
> xhci_handshake is called unlocked and has a long timeout", because you
> end up calling it from two places where the lock is (incorrectly) held.
> That's why adding the early bailout helped, I guess.
>
I think we had re-worded the patch a little, this was my original commit:

"In some situations where xhci removal happens parallel to
xhci_handshake, we enoughter a scenario where the xhci_handshake will
fails because the status does not change the entire duration of
polling. This causes the xhci_handshake to timeout resulting in long
wait which might lead to watchdog timeout." The API  handles command
timeout which may happen upon XHCI stack removal. Check for xhci state
and exit the handshake if xhci is removed.
https://lore.kernel.org/all/20230919085847.8210-1-quic_ugoswami@quicinc.com/

But yeah the main motive was to bail out handshake to get around this.

So you could say my main problem was that the CMD_RESET was stuck for
a long time.
In other cases the reset passes in very short amount of time. It was
unusual for this case.

> > Unfortunately I was not able to simulate the scenario for more
> > granular testing and had validated it with long hours stress testing.
>
> Looking at xhci_resume(), it will call xhci_reset() if the controller
> has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming.
>
> I guess you could simulate this case by forcing the quirk with a module
> parameter and adding some extra delay to xhci_handshake(), so you are
> not dependent on the hardware actually failing in any manner.

This would definitely bark, but like I mentioned my case was that the
CMD_RESET didn't exit the poll in the 10 sec.
I'm not sure if that points to the controller stuck at processing
something else?
Please correct me if i'm wrong here.

Thanks,
-Udipto
Mathias Nyman May 20, 2025, 4:18 p.m. UTC | #5
On 19.5.2025 21.13, Udipto Goswami wrote:
> On Mon, May 19, 2025 at 6:23 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> On 17.5.2025 7.39, Roy Luo wrote:
>>> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
>>>
>>> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
>>> helper") was introduced to workaround watchdog timeout issues on some
>>> platforms, allowing xhci_reset() to bail out early without waiting
>>> for the reset to complete.
>>>
>>> Skipping the xhci handshake during a reset is a dangerous move. The
>>> xhci specification explicitly states that certain registers cannot
>>> be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
>>> Host Controller Reset (HCRST) field:
>>> "This bit is cleared to '0' by the Host Controller when the reset
>>> process is complete. Software cannot terminate the reset process
>>> early by writinga '0' to this bit and shall not write any xHC
>>> Operational or Runtime registers until while HCRST is '1'."
>>>
>>> This behavior causes a regression on SNPS DWC3 USB controller with
>>> dual-role capability. When the DWC3 controller exits host mode and
>>> removes xhci while a reset is still in progress, and then tries to
>>> configure its hardware for device mode, the ongoing reset leads to
>>> register access issues; specifically, all register reads returns 0.
>>> These issues extend beyond the xhci register space (which is expected
>>> during a reset) and affect the entire DWC3 IP block, causing the DWC3
>>> device mode to malfunction.
>>
>> I agree with you and Thinh that waiting for the HCRST bit to clear during
>> reset is the right thing to do, especially now when we know skipping it
>> causes issues for SNPS DWC3, even if it's only during remove phase.
>>
>> But reverting this patch will re-introduce the issue originally worked
>> around by Udipto Goswami, causing regression.
>>
>> Best thing to do would be to wait for HCRST to clear for all other platforms
>> except the one with the issue.
>>
>> Udipto Goswami, can you recall the platforms that needed this workaroud?
>> and do we have an easy way to detect those?
> 
> Hi Mathias,
> 
>  From what I recall, we saw this issue coming up on our QCOM mobile
> platforms but it was not consistent. It was only reported in long runs
> i believe. The most recent instance when I pushed this patch was with
> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> I was not able to simulate the scenario for more granular testing and
> had validated it with long hours stress testing.
> The callstack was like so:
> 
> Full call stack on core6:
> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> -001|xhci_handshake(inline)
> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> 0xFFFFFF890F619C10)
> -006|rpm_callback(inline)
> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> -008|pm_runtime_get_sync(inline)
> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)

Thank you for clarifying this.

So patch avoids the long timeout by always cutting xhci reinit path short in
xhci_resume() if resume was caused by pm_runtime_get_sync() call in
xhci_plat_remove()

void xhci_plat_remove(struct platform_device *dev)
{
	xhci->xhc_state |= XHCI_STATE_REMOVING;
	pm_runtime_get_sync(&dev->dev);
	...
}

I think we can revert this patch, and just make sure that we don't reset the
host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
Just return immediately instead.

xhci_reset() will be called with a shorter timeout later in the remove path

Not entirely sure remove path needs to call pm_runtime_get_sync().
I think it just tries to prevent runtime suspend/resume from racing with remove.
PCI code seems to call pm_runtime_get_noresume() in remove path instead.

Thanks
Mathias
Roy Luo May 22, 2025, 2:21 a.m. UTC | #6
On Tue, May 20, 2025 at 9:18 AM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 19.5.2025 21.13, Udipto Goswami wrote:
> > On Mon, May 19, 2025 at 6:23 PM Mathias Nyman
> > <mathias.nyman@linux.intel.com> wrote:
> >>
> >> On 17.5.2025 7.39, Roy Luo wrote:
> >>> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
> >>>
> >>> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
> >>> helper") was introduced to workaround watchdog timeout issues on some
> >>> platforms, allowing xhci_reset() to bail out early without waiting
> >>> for the reset to complete.
> >>>
> >>> Skipping the xhci handshake during a reset is a dangerous move. The
> >>> xhci specification explicitly states that certain registers cannot
> >>> be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
> >>> Host Controller Reset (HCRST) field:
> >>> "This bit is cleared to '0' by the Host Controller when the reset
> >>> process is complete. Software cannot terminate the reset process
> >>> early by writinga '0' to this bit and shall not write any xHC
> >>> Operational or Runtime registers until while HCRST is '1'."
> >>>
> >>> This behavior causes a regression on SNPS DWC3 USB controller with
> >>> dual-role capability. When the DWC3 controller exits host mode and
> >>> removes xhci while a reset is still in progress, and then tries to
> >>> configure its hardware for device mode, the ongoing reset leads to
> >>> register access issues; specifically, all register reads returns 0.
> >>> These issues extend beyond the xhci register space (which is expected
> >>> during a reset) and affect the entire DWC3 IP block, causing the DWC3
> >>> device mode to malfunction.
> >>
> >> I agree with you and Thinh that waiting for the HCRST bit to clear during
> >> reset is the right thing to do, especially now when we know skipping it
> >> causes issues for SNPS DWC3, even if it's only during remove phase.
> >>
> >> But reverting this patch will re-introduce the issue originally worked
> >> around by Udipto Goswami, causing regression.
> >>
> >> Best thing to do would be to wait for HCRST to clear for all other platforms
> >> except the one with the issue.
> >>
> >> Udipto Goswami, can you recall the platforms that needed this workaroud?
> >> and do we have an easy way to detect those?
> >
> > Hi Mathias,
> >
> >  From what I recall, we saw this issue coming up on our QCOM mobile
> > platforms but it was not consistent. It was only reported in long runs
> > i believe. The most recent instance when I pushed this patch was with
> > platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> > xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> > I was not able to simulate the scenario for more granular testing and
> > had validated it with long hours stress testing.
> > The callstack was like so:
> >
> > Full call stack on core6:
> > -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> > -001|xhci_handshake(inline)
> > -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> > -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> > -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> > -004|pm_generic_runtime_resume([locdesc] dev = ?)
> > -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> > 0xFFFFFF890F619C10)
> > -006|rpm_callback(inline)
> > -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> > [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> > -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> > -008|pm_runtime_get_sync(inline)
> > -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
>
> Thank you for clarifying this.
>
> So patch avoids the long timeout by always cutting xhci reinit path short in
> xhci_resume() if resume was caused by pm_runtime_get_sync() call in
> xhci_plat_remove()
>
> void xhci_plat_remove(struct platform_device *dev)
> {
>         xhci->xhc_state |= XHCI_STATE_REMOVING;
>         pm_runtime_get_sync(&dev->dev);
>         ...
> }
>
> I think we can revert this patch, and just make sure that we don't reset the
> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
> Just return immediately instead.
>

Just to be sure, are you proposing that we skip xhci_reset() within
the reinit path
of xhci_resume()? If we do that, could that lead to issues with
subsequent operations
in the reinit sequence, such as xhci_init() or xhci_run()?

Do you prefer to group the change to skip xhci_reset() within the
reinit path together
with this revert? or do you want it to be sent and reviewed separately?

Thanks,
Roy
Mathias Nyman May 22, 2025, 12:24 p.m. UTC | #7
On 22.5.2025 5.21, Roy Luo wrote:
>>>> Udipto Goswami, can you recall the platforms that needed this workaroud?
>>>> and do we have an easy way to detect those?
>>>
>>> Hi Mathias,
>>>
>>>   From what I recall, we saw this issue coming up on our QCOM mobile
>>> platforms but it was not consistent. It was only reported in long runs
>>> i believe. The most recent instance when I pushed this patch was with
>>> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
>>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
>>> I was not able to simulate the scenario for more granular testing and
>>> had validated it with long hours stress testing.
>>> The callstack was like so:
>>>
>>> Full call stack on core6:
>>> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
>>> -001|xhci_handshake(inline)
>>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
>>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
>>> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
>>> -004|pm_generic_runtime_resume([locdesc] dev = ?)
>>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
>>> 0xFFFFFF890F619C10)
>>> -006|rpm_callback(inline)
>>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
>>> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
>>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
>>> -008|pm_runtime_get_sync(inline)
>>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
>>
>> Thank you for clarifying this.
>>
>> So patch avoids the long timeout by always cutting xhci reinit path short in
>> xhci_resume() if resume was caused by pm_runtime_get_sync() call in
>> xhci_plat_remove()
>>
>> void xhci_plat_remove(struct platform_device *dev)
>> {
>>          xhci->xhc_state |= XHCI_STATE_REMOVING;
>>          pm_runtime_get_sync(&dev->dev);
>>          ...
>> }
>>
>> I think we can revert this patch, and just make sure that we don't reset the
>> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
>> Just return immediately instead.
>>
> 
> Just to be sure, are you proposing that we skip xhci_reset() within
> the reinit path
> of xhci_resume()? If we do that, could that lead to issues with
> subsequent operations
> in the reinit sequence, such as xhci_init() or xhci_run()?

I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set.

This should be similar to what is going on already.

xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset
completes extremely fast. xhci_resume() bails out if xhci_reset() returns error:

xhci_resume()
   ...
   if (power_lost) {
     ...
     retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
     spin_unlock_irq(&xhci->lock);
     if (retval)
       return retval;
> 
> Do you prefer to group the change to skip xhci_reset() within the
> reinit path together
> with this revert? or do you want it to be sent and reviewed separately?

First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set
and we are in the reinit (power_lost) path about to call xhci_reset();

Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement
xhci_handshake_check_state()

Does this sound reasonable?

should avoid the QCOM 10sec watchdog issue as next xhci_rest() called
in xhci_remove() path has a short 250ms timeout, and ensure the
SNPS DWC3 USB regression won't trigger.

Thanks
Mathias
Roy Luo May 22, 2025, 7:19 p.m. UTC | #8
On Thu, May 22, 2025 at 5:24 AM Mathias Nyman <mathias.nyman@intel.com> wrote:
>
> On 22.5.2025 5.21, Roy Luo wrote:
> >>>> Udipto Goswami, can you recall the platforms that needed this workaroud?
> >>>> and do we have an easy way to detect those?
> >>>
> >>> Hi Mathias,
> >>>
> >>>   From what I recall, we saw this issue coming up on our QCOM mobile
> >>> platforms but it was not consistent. It was only reported in long runs
> >>> i believe. The most recent instance when I pushed this patch was with
> >>> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> >>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> >>> I was not able to simulate the scenario for more granular testing and
> >>> had validated it with long hours stress testing.
> >>> The callstack was like so:
> >>>
> >>> Full call stack on core6:
> >>> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> >>> -001|xhci_handshake(inline)
> >>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> >>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> >>> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> >>> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> >>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> >>> 0xFFFFFF890F619C10)
> >>> -006|rpm_callback(inline)
> >>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> >>> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> >>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> >>> -008|pm_runtime_get_sync(inline)
> >>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
> >>
> >> Thank you for clarifying this.
> >>
> >> So patch avoids the long timeout by always cutting xhci reinit path short in
> >> xhci_resume() if resume was caused by pm_runtime_get_sync() call in
> >> xhci_plat_remove()
> >>
> >> void xhci_plat_remove(struct platform_device *dev)
> >> {
> >>          xhci->xhc_state |= XHCI_STATE_REMOVING;
> >>          pm_runtime_get_sync(&dev->dev);
> >>          ...
> >> }
> >>
> >> I think we can revert this patch, and just make sure that we don't reset the
> >> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
> >> Just return immediately instead.
> >>
> >
> > Just to be sure, are you proposing that we skip xhci_reset() within
> > the reinit path
> > of xhci_resume()? If we do that, could that lead to issues with
> > subsequent operations
> > in the reinit sequence, such as xhci_init() or xhci_run()?
>
> I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set.
>
> This should be similar to what is going on already.
>
> xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset
> completes extremely fast. xhci_resume() bails out if xhci_reset() returns error:
>
> xhci_resume()
>    ...
>    if (power_lost) {
>      ...
>      retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
>      spin_unlock_irq(&xhci->lock);
>      if (retval)
>        return retval;
> >
> > Do you prefer to group the change to skip xhci_reset() within the
> > reinit path together
> > with this revert? or do you want it to be sent and reviewed separately?
>
> First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set
> and we are in the reinit (power_lost) path about to call xhci_reset();
>
> Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement
> xhci_handshake_check_state()
>
> Does this sound reasonable?
>
> should avoid the QCOM 10sec watchdog issue as next xhci_rest() called
> in xhci_remove() path has a short 250ms timeout, and ensure the
> SNPS DWC3 USB regression won't trigger.
>
> Thanks
> Mathias
>

Thanks for the clarification! SGTM.
I've sent out a new patchset accordingly
https://lore.kernel.org/linux-usb/20250522190912.457583-1-royluo@google.com/

Thanks,
Roy
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 423bf3649570..b720e04ce7d8 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -518,9 +518,8 @@  static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags)
 	 * In the future we should distinguish between -ENODEV and -ETIMEDOUT
 	 * and try to recover a -ETIMEDOUT with a host controller reset.
 	 */
-	ret = xhci_handshake_check_state(xhci, &xhci->op_regs->cmd_ring,
-			CMD_RING_RUNNING, 0, 5 * 1000 * 1000,
-			XHCI_STATE_REMOVING);
+	ret = xhci_handshake(&xhci->op_regs->cmd_ring,
+			CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
 	if (ret < 0) {
 		xhci_err(xhci, "Abort failed to stop command ring: %d\n", ret);
 		xhci_halt(xhci);
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 90eb491267b5..472c4b6ae59e 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -83,29 +83,6 @@  int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us)
 	return ret;
 }
 
-/*
- * xhci_handshake_check_state - same as xhci_handshake but takes an additional
- * exit_state parameter, and bails out with an error immediately when xhc_state
- * has exit_state flag set.
- */
-int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr,
-		u32 mask, u32 done, int usec, unsigned int exit_state)
-{
-	u32	result;
-	int	ret;
-
-	ret = readl_poll_timeout_atomic(ptr, result,
-				(result & mask) == done ||
-				result == U32_MAX ||
-				xhci->xhc_state & exit_state,
-				1, usec);
-
-	if (result == U32_MAX || xhci->xhc_state & exit_state)
-		return -ENODEV;
-
-	return ret;
-}
-
 /*
  * Disable interrupts and begin the xHCI halting process.
  */
@@ -226,8 +203,7 @@  int xhci_reset(struct xhci_hcd *xhci, u64 timeout_us)
 	if (xhci->quirks & XHCI_INTEL_HOST)
 		udelay(1000);
 
-	ret = xhci_handshake_check_state(xhci, &xhci->op_regs->command,
-				CMD_RESET, 0, timeout_us, XHCI_STATE_REMOVING);
+	ret = xhci_handshake(&xhci->op_regs->command, CMD_RESET, 0, timeout_us);
 	if (ret)
 		return ret;
 
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 242ab9fbc8ae..5e698561b96d 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1855,8 +1855,6 @@  void xhci_remove_secondary_interrupter(struct usb_hcd
 /* xHCI host controller glue */
 typedef void (*xhci_get_quirks_t)(struct device *, struct xhci_hcd *);
 int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us);
-int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr,
-		u32 mask, u32 done, int usec, unsigned int exit_state);
 void xhci_quiesce(struct xhci_hcd *xhci);
 int xhci_halt(struct xhci_hcd *xhci);
 int xhci_start(struct xhci_hcd *xhci);