[v6,0/7] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault


Message ID

20250515-msm-gpu-fault-fixes-next-v6-0-4fe2a583a878@gmail.com

Headers

From: Connor Abbott <cwabbott0@gmail.com>
Subject: [PATCH v6 0/7] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault
Date: Thu, 15 May 2025 15:58:42 -0400
Message-Id: <20250515-msm-gpu-fault-fixes-next-v6-0-4fe2a583a878@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: Rob Clark <robdclark@gmail.com>, Will Deacon <will@kernel.org>,
 Robin Murphy <robin.murphy@arm.com>, Joerg Roedel <joro@8bytes.org>,
 Sean Paul <sean@poorly.run>, Konrad Dybcio <konradybcio@kernel.org>,
 Abhinav Kumar <quic_abhinavk@quicinc.com>,
 Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>,
 Marijn Suijten <marijn.suijten@somainline.org>
Cc: iommu@lists.linux.dev, linux-arm-msm@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, freedreno@lists.freedesktop.org,
 Connor Abbott <cwabbott0@gmail.com>

Message

Connor Abbott May 15, 2025, 7:58 p.m. UTC

drm/msm uses the stall-on-fault model to record the GPU state on the
first GPU page fault to help debugging. On systems where the GPU is
paired with an MMU-500, there were two problems:

1. The MMU-500 doesn't de-assert its interrupt line until the fault is
   resumed, which led to a storm of interrupts until the fault handler
   was called. If we got unlucky and the fault handler was on the same
   CPU as the interrupt, there was a deadlock.
2. The GPU is capable of generating page faults much faster than we can
   resume them. GMU (GPU Management Unit) shares the same context bank
   as the GPU, so if there was a sudden spurt of page faults it would be
   effectively starved and would trigger a watchdog reset, made even
   worse because the GPU cannot be reset while there's a pending
   transaction, leaving the GPU permanently wedged.

Patches 1-2 and 4 fix the first problem by switching the IRQ to be a
threaded IRQ and then making drm/msm do its devcoredump work
synchronously in the threaded IRQ. Patch 4 is dependent on patches 1-2.
Patch 6 fixes the second problem and is dependent on patch 3. Patch 5 is
a cleanup for patch 4 and patch 7 is a subsequent further cleanup to get
rid of the resume_translation() callback once we switch resuming to being done
by the SMMU's fault handler.
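
As an illustration only (this is not code from the series), the first fix can be modelled in user-space C. The sketch assumes a level-triggered line that stays asserted while a fault is stalled, and a oneshot-style mask that holds the line off while the threaded half runs. All `model_*` names are hypothetical and do not correspond to driver symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one SMMU context bank; not the driver's layout. */
struct model_cb {
    bool stalled;        /* a faulting transaction is held by the SMMU */
    bool irq_masked;     /* line masked while the threaded half runs */
    unsigned snapshots;  /* number of GPU state captures taken */
};

/* Level-triggered: asserted for as long as the fault stays stalled. */
static bool model_irq_asserted(const struct model_cb *cb)
{
    return cb->stalled;
}

/* Would the CPU take the interrupt right now? */
static bool model_irq_deliverable(const struct model_cb *cb)
{
    return model_irq_asserted(cb) && !cb->irq_masked;
}

/* Hard-IRQ half: only mask the line and defer to the thread. */
static void model_hard_irq(struct model_cb *cb)
{
    assert(model_irq_deliverable(cb));
    cb->irq_masked = true;
}

/* Threaded half: capture state synchronously, then resume the fault. */
static void model_irq_thread(struct model_cb *cb)
{
    cb->snapshots++;        /* stands in for the devcoredump capture */
    cb->stalled = false;    /* stands in for writing RESUME */
    cb->irq_masked = false; /* unmask only after the fault is resumed */
}

/* Service the line until it is quiet; returns the hard-IRQ count. */
static unsigned model_handle_fault(struct model_cb *cb)
{
    unsigned hard_irqs = 0;

    while (model_irq_deliverable(cb)) {
        model_hard_irq(cb);
        hard_irqs++;
        model_irq_thread(cb);
    }
    return hard_irqs;
}
```

In this model a single stalled fault produces exactly one hard interrupt and one snapshot, because the line is never unmasked while the fault is still pending.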

I've organized the series in the order that it should be picked up:

- Patches 1-3 need to be applied to the iommu tree first.
- Patches 4-6, which depend on 1-3, should be taken by drm/msm. We will
  probably want to create an immutable tag and merge it into drm/msm to
  be able to take them in the same cycle and avoid the temporary
  regression noted in patch 2.
- Patch 7 can be applied to the iommu tree later; it's just a smaller
  cleanup dependent on the changes landing in drm/msm.

Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
---
Changes in v6:
- Rewrite to use a threaded IRQ instead in iommu/arm-smmu (Will). As a
  result we can drop most of the previous changes and instead move
  writing RESUME to the fault handler.
- Link to v5: https://lore.kernel.org/r/20250319-msm-gpu-fault-fixes-next-v5-0-97561209dd8c@gmail.com

Changes in v5:
- Don't read CONTEXTIDR for stage 2 domains.
- Clarify that we don't need TLB invalidation when changing
  SMMU_CBn_SCTLR.CFCFG.
- Link to v4: https://lore.kernel.org/r/20250304-msm-gpu-fault-fixes-next-v4-0-be14be37f4c3@gmail.com

Changes in v4:
- Add patches 1-2, which fix reading registers in drm/msm when
  acknowledging the fault early. This was Robin's preferred solution
  compared to making drm/msm's fault handler tell arm-smmu to resume the
  fault.
- Link to v3: https://lore.kernel.org/r/20250122-msm-gpu-fault-fixes-next-v3-0-0afa00158521@gmail.com

Changes in v3:
- Acknowledge the fault before resuming the transaction in patch 1.
- Add suggested extra context to commit messages.
- Link to v2: https://lore.kernel.org/r/20250120-msm-gpu-fault-fixes-next-v2-0-d636c4027042@gmail.com

Changes in v2:
- Remove unnecessary _irqsave when locking in IRQ handler (Robin)
- Reuse existing spinlock for CFIE manipulation (Robin)
- Lock CFCFG manipulation against concurrent CFIE manipulation
- Don't use timer to re-enable stall-on-fault. (Rob)
- Use more descriptive name for the function that re-enables
  stall-on-fault if the cooldown period has ended. (Rob)
- Link to v1: https://lore.kernel.org/r/20250117-msm-gpu-fault-fixes-next-v1-0-bc9b332b5d0b@gmail.com
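
For readers unfamiliar with the timer-free approach mentioned in the v2 notes, here is a minimal user-space sketch of the idea, not the driver's implementation: a fault disables stalling and records a deadline, and a later check re-enables stalling once the deadline has passed. The `model_*` names and the 500 ms cooldown are assumptions made for illustration, as is the idea that the check runs from some regularly hit path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cooldown length; the real value is not stated here. */
#define MODEL_COOLDOWN_MS 500

/* Hypothetical per-GPU stall-on-fault bookkeeping. */
struct model_stall_state {
    bool stall_enabled;       /* is stall-on-fault currently on? */
    uint64_t reenable_at_ms;  /* earliest time it may be turned back on */
};

/* On a page fault: turn stalling off and start the cooldown. */
static void model_on_fault(struct model_stall_state *s, uint64_t now_ms)
{
    assert(s);
    s->stall_enabled = false;
    s->reenable_at_ms = now_ms + MODEL_COOLDOWN_MS;
}

/*
 * Called opportunistically (no timer): re-enable stalling only if the
 * cooldown has ended. Returns the resulting stall_enabled state.
 */
static bool model_reenable_if_cooled_down(struct model_stall_state *s,
                                          uint64_t now_ms)
{
    assert(s);
    if (!s->stall_enabled && now_ms >= s->reenable_at_ms)
        s->stall_enabled = true;
    return s->stall_enabled;
}
```

Checking lazily avoids arming and cancelling a timer, at the cost that stalling is only restored the next time the check happens to run.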

---
Connor Abbott (7):
      iommu/arm-smmu-qcom: Enable threaded IRQ for Adreno SMMUv2/MMU500
      iommu/arm-smmu: Move handing of RESUME to the context fault handler
      iommu/arm-smmu-qcom: Make set_stall work when the device is on
      drm/msm: Don't use a worker to capture fault devcoredump
      drm/msm: Delete resume_translation()
      drm/msm: Temporarily disable stall-on-fault after a page fault
      iommu/smmu-arm-qcom: Delete resume_translation()

 drivers/gpu/drm/msm/adreno/a2xx_gpummu.c         |  5 ---
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c            |  2 +
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c            |  4 ++
 drivers/gpu/drm/msm/adreno/adreno_gpu.c          | 56 +++++++++++++++++++-----
 drivers/gpu/drm/msm/adreno/adreno_gpu.h          | 26 +++++++++++
 drivers/gpu/drm/msm/msm_gpu.c                    | 20 ++++-----
 drivers/gpu/drm/msm/msm_gpu.h                    |  8 +---
 drivers/gpu/drm/msm/msm_iommu.c                  | 12 ++---
 drivers/gpu/drm/msm/msm_mmu.h                    |  2 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c |  9 ++++
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c       | 43 ++++++++++++------
 drivers/iommu/arm/arm-smmu/arm-smmu.c            |  6 +++
 include/linux/adreno-smmu-priv.h                 |  8 ++--
 13 files changed, 140 insertions(+), 61 deletions(-)
---
base-commit: 866e43b945bf98f8e807dfa45eca92f931f3a032
change-id: 20250117-msm-gpu-fault-fixes-next-96e3098023e1

Best regards,

Comments

Will Deacon May 20, 2025, 2:18 p.m. UTC | #1

Hi Connor,

On Thu, May 15, 2025 at 03:58:42PM -0400, Connor Abbott wrote:
> drm/msm uses the stall-on-fault model to record the GPU state on the
> first GPU page fault to help debugging. On systems where the GPU is
> paired with an MMU-500, there were two problems:
> 
> 1. The MMU-500 doesn't de-assert its interrupt line until the fault is
>    resumed, which led to a storm of interrupts until the fault handler
>    was called. If we got unlucky and the fault handler was on the same
>    CPU as the interrupt, there was a deadlock.
> 2. The GPU is capable of generating page faults much faster than we can
>    resume them. GMU (GPU Management Unit) shares the same context bank
>    as the GPU, so if there was a sudden spurt of page faults it would be
>    effectively starved and would trigger a watchdog reset, made even
>    worse because the GPU cannot be reset while there's a pending
>    transaction, leaving the GPU permanently wedged.
> 
> Patches 1-2 and 4 fix the first problem by switching the IRQ to be a
> threaded IRQ and then making drm/msm do its devcoredump work
> synchronously in the threaded IRQ. Patch 4 is dependent on patches 1-2.
> Patch 6 fixes the second problem and is dependent on patch 3. Patch 5 is
> a cleanup for patch 4 and patch 7 is a subsequent further cleanup to get
> rid of the resume_translation() callback once we switch resuming to being done
> by the SMMU's fault handler.

Thanks for reworking this; I think it looks much better now from the
SMMU standpoint.

> I've organized the series in the order that it should be picked up:
> 
> - Patches 1-3 need to be applied to the iommu tree first.

Which kernel version did you base these on? I can't seem to apply the
second patch, as you seem to have a stale copy of arm-smmu-qcom.c?

Will

Connor Abbott May 20, 2025, 2:42 p.m. UTC | #2

On Tue, May 20, 2025 at 10:19 AM Will Deacon <will@kernel.org> wrote:
>
> Hi Connor,
>
> On Thu, May 15, 2025 at 03:58:42PM -0400, Connor Abbott wrote:
> > drm/msm uses the stall-on-fault model to record the GPU state on the
> > first GPU page fault to help debugging. On systems where the GPU is
> > paired with an MMU-500, there were two problems:
> >
> > 1. The MMU-500 doesn't de-assert its interrupt line until the fault is
> >    resumed, which led to a storm of interrupts until the fault handler
> >    was called. If we got unlucky and the fault handler was on the same
> >    CPU as the interrupt, there was a deadlock.
> > 2. The GPU is capable of generating page faults much faster than we can
> >    resume them. GMU (GPU Management Unit) shares the same context bank
> >    as the GPU, so if there was a sudden spurt of page faults it would be
> >    effectively starved and would trigger a watchdog reset, made even
> >    worse because the GPU cannot be reset while there's a pending
> >    transaction, leaving the GPU permanently wedged.
> >
> > Patches 1-2 and 4 fix the first problem by switching the IRQ to be a
> > threaded IRQ and then making drm/msm do its devcoredump work
> > synchronously in the threaded IRQ. Patch 4 is dependent on patches 1-2.
> > Patch 6 fixes the second problem and is dependent on patch 3. Patch 5 is
> > a cleanup for patch 4 and patch 7 is a subsequent further cleanup to get
> > rid of the resume_translation() callback once we switch resuming to being done
> > by the SMMU's fault handler.
>
> Thanks for reworking this; I think it looks much better now from the
> SMMU standpoint.
>
> > I've organized the series in the order that it should be picked up:
> >
> > - Patches 1-3 need to be applied to the iommu tree first.
>
> Which kernel version did you base these on? I can't seem to apply the
> second patch, as you seem to have a stale copy of arm-smmu-qcom.c?
>
> Will

Sorry about that; for the next version I'll rebase on linux-next. I
have been using an older version of msm-next for a while now.

Connor

Will Deacon May 20, 2025, 3:38 p.m. UTC | #3

On Tue, May 20, 2025 at 10:42:49AM -0400, Connor Abbott wrote:
> On Tue, May 20, 2025 at 10:19 AM Will Deacon <will@kernel.org> wrote:
> > On Thu, May 15, 2025 at 03:58:42PM -0400, Connor Abbott wrote:
> > > drm/msm uses the stall-on-fault model to record the GPU state on the
> > > first GPU page fault to help debugging. On systems where the GPU is
> > > paired with an MMU-500, there were two problems:
> > >
> > > 1. The MMU-500 doesn't de-assert its interrupt line until the fault is
> > >    resumed, which led to a storm of interrupts until the fault handler
> > >    was called. If we got unlucky and the fault handler was on the same
> > >    CPU as the interrupt, there was a deadlock.
> > > 2. The GPU is capable of generating page faults much faster than we can
> > >    resume them. GMU (GPU Management Unit) shares the same context bank
> > >    as the GPU, so if there was a sudden spurt of page faults it would be
> > >    effectively starved and would trigger a watchdog reset, made even
> > >    worse because the GPU cannot be reset while there's a pending
> > >    transaction, leaving the GPU permanently wedged.
> > >
> > > Patches 1-2 and 4 fix the first problem by switching the IRQ to be a
> > > threaded IRQ and then making drm/msm do its devcoredump work
> > > synchronously in the threaded IRQ. Patch 4 is dependent on patches 1-2.
> > > Patch 6 fixes the second problem and is dependent on patch 3. Patch 5 is
> > > a cleanup for patch 4 and patch 7 is a subsequent further cleanup to get
> > > rid of the resume_translation() callback once we switch resuming to being done
> > > by the SMMU's fault handler.
> >
> > Thanks for reworking this; I think it looks much better now from the
> > SMMU standpoint.
> >
> > > I've organized the series in the order that it should be picked up:
> > >
> > > - Patches 1-3 need to be applied to the iommu tree first.
> >
> > Which kernel version did you base these on? I can't seem to apply the
> > second patch, as you seem to have a stale copy of arm-smmu-qcom.c?
> >
> Sorry about that; for the next version I'll rebase on linux-next. I
> have been using an older version of msm-next for a while now.

Can you base on v6.15-rc2 instead, please? linux-next is a moving
target so it's not massively helpful to use that.

Cheers,

Will