diff mbox series

[02/14] mailbox: pcc: Always clear the platform ack interrupt first

Message ID 20250303-pcc_fixes_updates-v1-2-3b44f3d134b1@arm.com
State Superseded
Headers show
Series mailbox: pcc: Fixes and cleanup/refactoring | expand

Commit Message

Sudeep Holla March 3, 2025, 10:51 a.m. UTC
The PCC mailbox interrupt handler (pcc_mbox_irq()) currently checks
for command completion flags and any error status before clearing the
interrupt.

The below sequence highlights an issue in the handling of PCC mailbox
interrupts, specifically when dealing with doorbell notifications and
acknowledgment between the OSPM and the platform where type3 and type4
channels are sharing the interrupt.

        Platform Firmware              OSPM/Linux PCC driver
------------------------------------------------------------------------
                                     build message in shmem
                                     ring type3 channel doorbell
receives the doorbell interrupt
  process the message from OSPM
  build response for the message
ring the platform ack interrupt to OSPM
				--->
build notification in type4 channel
                                     start processing in pcc_mbox_irq()
                                      enter pcc handler for type4 chan
                                         command complete cleared
			        	 read the notification
                                <---     clear platform ack irq
  		* no effect from above as platform ack irq *
		* not yet triggered on this channel *
ring the platform ack irq on type4 channel
				--->
                                      leave pcc handler for type4 chan
                                      enter pcc handler for type3 chan
                                         command complete set
					 read the response
                                <---     clear platform ack irq
                                      leave pcc handler for type3 chan
                                     leave pcc_mbox_irq() handler
                                     start processing in pcc_mbox_irq()
                                      enter pcc handler for type4 chan
                                      leave pcc handler for type4 chan
                                      enter pcc handler for type3 chan
                                      leave pcc handler for type3 chan
                                     leave pcc_mbox_irq() handler

The key issue occurs when OSPM tries to acknowledge platform ack
interrupt for a notification which is ready to be read and processed
but the interrupt itself is not yet triggered by the platform.

This ineffective acknowledgment leads to an issue later in time where
the interrupt remains pending as we exit the interrupt handler without
clearing the platform ack interrupt as there is no pending response or
notification. The interrupt acknowledgment order is incorrect.

To resolve this issue, the platform acknowledgment interrupt should
always be cleared before processing the interrupt for any notifications
or response.

Reported-by: Robbie King <robbiek@xsightlabs.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/mailbox/pcc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

lihuisong (C) March 5, 2025, 3:45 a.m. UTC | #1
在 2025/3/3 18:51, Sudeep Holla 写道:
> The PCC mailbox interrupt handler (pcc_mbox_irq()) currently checks
> for command completion flags and any error status before clearing the
> interrupt.
>
> The below sequence highlights an issue in the handling of PCC mailbox
> interrupts, specifically when dealing with doorbell notifications and
> acknowledgment between the OSPM and the platform where type3 and type4
> channels are sharing the interrupt.
>
>          Platform Firmware              OSPM/Linux PCC driver
> ------------------------------------------------------------------------
>                                       build message in shmem
>                                       ring type3 channel doorbell
> receives the doorbell interrupt
>    process the message from OSPM
>    build response for the message
> ring the platform ack interrupt to OSPM
> 				--->
> build notification in type4 channel
>                                       start processing in pcc_mbox_irq()
>                                        enter pcc handler for type4 chan
>                                           command complete cleared
> 			        	 read the notification
>                                  <---     clear platform ack irq
>    		* no effect from above as platform ack irq *
> 		* not yet triggered on this channel *
> ring the platform ack irq on type4 channel
> 				--->
>                                        leave pcc handler for type4 chan
>                                        enter pcc handler for type3 chan
>                                           command complete set
> 					 read the response
>                                  <---     clear platform ack irq
>                                        leave pcc handler for type3 chan
>                                       leave pcc_mbox_irq() handler
>                                       start processing in pcc_mbox_irq()
>                                        enter pcc handler for type4 chan
>                                        leave pcc handler for type4 chan
>                                        enter pcc handler for type3 chan
>                                        leave pcc handler for type3 chan
>                                       leave pcc_mbox_irq() handler
This is not easy to understand to me.
The issue as below described is already very clear to me.
So suggest remove above flow graph.
> The key issue occurs when OSPM tries to acknowledge platform ack
> interrupt for a notification which is ready to be read and processed
> but the interrupt itself is not yet triggered by the platform.
>
> This ineffective acknowledgment leads to an issue later in time where
> the interrupt remains pending as we exit the interrupt handler without
> clearing the platform ack interrupt as there is no pending response or
> notification. The interrupt acknowledgment order is incorrect.
Has this issue been confired? It's more better if has the log.😁
But it seems a valid issue.
>
> To resolve this issue, the platform acknowledgment interrupt should
> always be cleared before processing the interrupt for any notifications
> or response.
AFAIC,always clearing the platform ack interrupt first which is also the 
communication flow as ACPI spec described.
I am not sure if it is ok when triggering interrupt and clearing 
interrupt occur concurrently.
But this scenario is always possible. I think It doesn't matter with 
this patch. It's just my confusion.
>
> Reported-by: Robbie King <robbiek@xsightlabs.com>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Lgtm,
Reviewed-by: Huisong Li <lihuisong@huawei.com>
> ---
>   drivers/mailbox/pcc.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
> index f2e4087281c70eeb5b9b33371596613a371dff4f..4c582fa2b8bf4c9a9368dba8220f567555dba963 100644
> --- a/drivers/mailbox/pcc.c
> +++ b/drivers/mailbox/pcc.c
> @@ -313,6 +313,10 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
>   	int ret;
>   
>   	pchan = chan->con_priv;
> +
> +	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
> +		return IRQ_NONE;
> +
>   	if (pchan->type == ACPI_PCCT_TYPE_EXT_PCC_MASTER_SUBSPACE &&
>   	    !pchan->chan_in_use)
>   		return IRQ_NONE;
> @@ -330,9 +334,6 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
>   		return IRQ_NONE;
>   	}
>   
> -	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
> -		return IRQ_NONE;
> -
>   	/*
>   	 * Clear this flag immediately after updating interrupt ack register
>   	 * to avoid possible race in updatation of the flag from
>
Sudeep Holla March 5, 2025, 2:29 p.m. UTC | #2
On Wed, Mar 05, 2025 at 11:45:35AM +0800, lihuisong (C) wrote:
> 
> 在 2025/3/3 18:51, Sudeep Holla 写道:
> > The PCC mailbox interrupt handler (pcc_mbox_irq()) currently checks
> > for command completion flags and any error status before clearing the
> > interrupt.
> > 
> > The below sequence highlights an issue in the handling of PCC mailbox
> > interrupts, specifically when dealing with doorbell notifications and
> > acknowledgment between the OSPM and the platform where type3 and type4
> > channels are sharing the interrupt.
> > 
> >          Platform Firmware              OSPM/Linux PCC driver
> > ------------------------------------------------------------------------
> >                                       build message in shmem
> >                                       ring type3 channel doorbell
> > receives the doorbell interrupt
> >    process the message from OSPM
> >    build response for the message
> > ring the platform ack interrupt to OSPM
> > 				--->
> > build notification in type4 channel
> >                                       start processing in pcc_mbox_irq()
> >                                        enter pcc handler for type4 chan
> >                                           command complete cleared
> > 			        	 read the notification
> >                                  <---     clear platform ack irq
> >    		* no effect from above as platform ack irq *
> > 		* not yet triggered on this channel *
> > ring the platform ack irq on type4 channel
> > 				--->
> >                                        leave pcc handler for type4 chan
> >                                        enter pcc handler for type3 chan
> >                                           command complete set
> > 					 read the response
> >                                  <---     clear platform ack irq
> >                                        leave pcc handler for type3 chan
> >                                       leave pcc_mbox_irq() handler
> >                                       start processing in pcc_mbox_irq()
> >                                        enter pcc handler for type4 chan
> >                                        leave pcc handler for type4 chan
> >                                        enter pcc handler for type3 chan
> >                                        leave pcc handler for type3 chan
> >                                       leave pcc_mbox_irq() handler
> This is not easy to understand to me.
> The issue as below described is already very clear to me.
> So suggest remove above flow graph.

I understood it with the graph similar to the one above, though I simplified
it in terms of PCC rather than specific IP reference.

> > The key issue occurs when OSPM tries to acknowledge platform ack
> > interrupt for a notification which is ready to be read and processed
> > but the interrupt itself is not yet triggered by the platform.
> >
> > This ineffective acknowledgment leads to an issue later in time where
> > the interrupt remains pending as we exit the interrupt handler without
> > clearing the platform ack interrupt as there is no pending response or
> > notification. The interrupt acknowledgment order is incorrect.
> >

> Has this issue been confired? It's more better if has the log.😁
> But it seems a valid issue.

Yes Robbie reported this. He is away and can't test or respond until next
week. The log just says there was loads of spurious interrupts and nobody
cared log as you got in the first patch of yours fixing similar race.

> >
> > To resolve this issue, the platform acknowledgment interrupt should
> > always be cleared before processing the interrupt for any notifications
> > or response.
> >
> AFAIC,always clearing the platform ack interrupt first which is also the
> communication flow as ACPI spec described.

Indeed, not sure how we missed it so far.

> I am not sure if it is ok when triggering interrupt and clearing interrupt
> occur concurrently.

Should be OK as we start clearing all the channels that share, if the
handler doesn't clear any source, the interrupt must remain asserted.

> But this scenario is always possible. I think It doesn't matter with this
> patch. It's just my confusion.

Indeed, it can happen any time as you mentioned. No worries better to ask
and clarify than assume. Thanks for your time and review.

--
Regards,
Sudeep
lihuisong (C) March 6, 2025, 3:44 a.m. UTC | #3
在 2025/3/5 22:29, Sudeep Holla 写道:
> On Wed, Mar 05, 2025 at 11:45:35AM +0800, lihuisong (C) wrote:
>> 在 2025/3/3 18:51, Sudeep Holla 写道:
>>> The PCC mailbox interrupt handler (pcc_mbox_irq()) currently checks
>>> for command completion flags and any error status before clearing the
>>> interrupt.
>>>
>>> The below sequence highlights an issue in the handling of PCC mailbox
>>> interrupts, specifically when dealing with doorbell notifications and
>>> acknowledgment between the OSPM and the platform where type3 and type4
>>> channels are sharing the interrupt.
>>>
>>>           Platform Firmware              OSPM/Linux PCC driver
>>> ------------------------------------------------------------------------
>>>                                        build message in shmem
>>>                                        ring type3 channel doorbell
>>> receives the doorbell interrupt
>>>     process the message from OSPM
>>>     build response for the message
>>> ring the platform ack interrupt to OSPM
>>> 				--->
>>> build notification in type4 channel
>>>                                        start processing in pcc_mbox_irq()
>>>                                         enter pcc handler for type4 chan
>>>                                            command complete cleared
>>> 			        	 read the notification
>>>                                   <---     clear platform ack irq
>>>     		* no effect from above as platform ack irq *
>>> 		* not yet triggered on this channel *
>>> ring the platform ack irq on type4 channel
>>> 				--->
>>>                                         leave pcc handler for type4 chan
>>>                                         enter pcc handler for type3 chan
>>>                                            command complete set
>>> 					 read the response
>>>                                   <---     clear platform ack irq
>>>                                         leave pcc handler for type3 chan
>>>                                        leave pcc_mbox_irq() handler
>>>                                        start processing in pcc_mbox_irq()
>>>                                         enter pcc handler for type4 chan
>>>                                         leave pcc handler for type4 chan
>>>                                         enter pcc handler for type3 chan
>>>                                         leave pcc handler for type3 chan
>>>                                        leave pcc_mbox_irq() handler
>> This is not easy to understand to me.
>> The issue as below described is already very clear to me.
>> So suggest remove above flow graph.
> I understood it with the graph similar to the one above, though I simplified
> it in terms of PCC rather than specific IP reference.
>
>>> The key issue occurs when OSPM tries to acknowledge platform ack
>>> interrupt for a notification which is ready to be read and processed
>>> but the interrupt itself is not yet triggered by the platform.
>>>
>>> This ineffective acknowledgment leads to an issue later in time where
>>> the interrupt remains pending as we exit the interrupt handler without
>>> clearing the platform ack interrupt as there is no pending response or
>>> notification. The interrupt acknowledgment order is incorrect.
>>>
>> Has this issue been confired? It's more better if has the log.😁
>> But it seems a valid issue.
> Yes Robbie reported this. He is away and can't test or respond until next
> week. The log just says there was loads of spurious interrupts and nobody
> cared log as you got in the first patch of yours fixing similar race.
Yeah
>
>>> To resolve this issue, the platform acknowledgment interrupt should
>>> always be cleared before processing the interrupt for any notifications
>>> or response.
>>>
>> AFAIC,always clearing the platform ack interrupt first which is also the
>> communication flow as ACPI spec described.
> Indeed, not sure how we missed it so far.
>
>> I am not sure if it is ok when triggering interrupt and clearing interrupt
>> occur concurrently.
> Should be OK as we start clearing all the channels that share, if the
> handler doesn't clear any source, the interrupt must remain asserted.
ok, thank you for clarifying to me.
>
>> But this scenario is always possible. I think It doesn't matter with this
>> patch. It's just my confusion.
> Indeed, it can happen any time as you mentioned. No worries better to ask
> and clarify than assume. Thanks for your time and review.
>
> --
> Regards,
> Sudeep
>
>
> .
diff mbox series

Patch

diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index f2e4087281c70eeb5b9b33371596613a371dff4f..4c582fa2b8bf4c9a9368dba8220f567555dba963 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -313,6 +313,10 @@  static irqreturn_t pcc_mbox_irq(int irq, void *p)
 	int ret;
 
 	pchan = chan->con_priv;
+
+	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
+		return IRQ_NONE;
+
 	if (pchan->type == ACPI_PCCT_TYPE_EXT_PCC_MASTER_SUBSPACE &&
 	    !pchan->chan_in_use)
 		return IRQ_NONE;
@@ -330,9 +334,6 @@  static irqreturn_t pcc_mbox_irq(int irq, void *p)
 		return IRQ_NONE;
 	}
 
-	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
-		return IRQ_NONE;
-
 	/*
 	 * Clear this flag immediately after updating interrupt ack register
 	 * to avoid possible race in updatation of the flag from