diff mbox series

[2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs

Message ID 20241014211122.7cb5b133@foxbook
State New
Headers show
Series Fix the NEC stop bug workaround | expand

Commit Message

Michal Pecio Oct. 14, 2024, 7:11 p.m. UTC
NEC controllers have a bug, where stopping an endpoint soon after it
has been restarted doesn't quite work as expected. This forces us to
track whether each Stop Endpoint command is expected to fail or not.

Reuse this infrastracture to warn about similar bugs on other chips,
if any are found.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Greg KH Oct. 15, 2024, 10:40 a.m. UTC | #1
On Mon, Oct 14, 2024 at 09:11:22PM +0200, Michal Pecio wrote:
> NEC controllers have a bug, where stopping an endpoint soon after it
> has been restarted doesn't quite work as expected. This forces us to
> track whether each Stop Endpoint command is expected to fail or not.
> 
> Reuse this infrastracture to warn about similar bugs on other chips,
> if any are found.
> 
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>  drivers/usb/host/xhci-ring.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index c0efb4d34ab9..c326b86d713c 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1186,8 +1186,11 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
>  			 * So keep retrying until the command clearly succeeds.
>  			 * Not clear what to do if other HCs have similar bugs.
>  			 */
> -			if (!(xhci->quirks & XHCI_NEC_HOST))
> +			if (!(xhci->quirks & XHCI_NEC_HOST)) {
> +				xhci_warn(xhci, "Unhandled Stop Endpoint failure on slot %d ep_index %d\n",
> +						slot_id, ep_index);

If a user sees this, what are they supposed to do?  This is a hardware
bug, but with this we are going to get reports of "something broke in
the kernel", right?  Why not make it just more informative, like:
	xhci_info(xhci, "hardware can not deal with...

or something like that so that people know we know about the bug, and
are working around it, but that it's not our issue, but rather the
hardware that is at fault?

thanks,

greg k-h
Michal Pecio Oct. 15, 2024, 6:52 p.m. UTC | #2
Hi,

> +			if (!(xhci->quirks & XHCI_NEC_HOST)) {
> > +				xhci_warn(xhci, "Unhandled Stop Endpoint failure on slot %d ep_index %d\n",
> > +						slot_id, ep_index);  
> 
> If a user sees this, what are they supposed to do?  This is a hardware
> bug, but with this we are going to get reports of "something broke in
> the kernel", right?  Why not make it just more informative, like:
> 	xhci_info(xhci, "hardware can not deal with...
> 
> or something like that so that people know we know about the bug, and
> are working around it, but that it's not our issue, but rather the
> hardware that is at fault?

Yes, the point is that ideally some users would report when they see it.
This is not a warning for a bug we know about and work around, it's for
hypothetical bugs that aren't known. If I'm adding code to handle a bug
of my HC, I might as well use it to sanity check other HCs for free.

Ideally they are OK and no one will ever see it. I tested several HCs
in January and found no similar issues. Maybe I will test them again.

I specifically worded it as something the kernel could do better rather
than the kernel just whining about something incomprehensibly. For the
latter, we already have the absolute classic "ERROR Transfer event TRB
DMA ptr not part of current TD", which would often follow this one, but
offer little clue about the root cause. It's kinda too late by then.

Regards,
Michal
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index c0efb4d34ab9..c326b86d713c 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1186,8 +1186,11 @@  static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			 * So keep retrying until the command clearly succeeds.
 			 * Not clear what to do if other HCs have similar bugs.
 			 */
-			if (!(xhci->quirks & XHCI_NEC_HOST))
+			if (!(xhci->quirks & XHCI_NEC_HOST)) {
+				xhci_warn(xhci, "Unhandled Stop Endpoint failure on slot %d ep_index %d\n",
+						slot_id, ep_index);
 				break;
+			}
 			fallthrough;
 
 		case EP_STATE_RUNNING: