
[RFCv2,00/13] iommu: Add MSI mapping support with nested SMMU

Message ID cover.1736550979.git.nicolinc@nvidia.com

Message

Nicolin Chen Jan. 11, 2025, 3:32 a.m. UTC
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When
the IOMMU is disabled, the MSI address is programmed to the physical
location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled,
the ITS page is behind the IOMMU, so the MSI address is programmed to an
allocated IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be
mapped to the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When 2-stage translation is enabled, an IOVA is still used to program the
MSI address, though the mapping is now in two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
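
For reference, this is roughly how an MSI controller driver consumes the
existing dma-iommu pair today (a sketch modeled on the GIC ITS driver,
with error handling trimmed; not the exact driver code):

#include <linux/iommu.h>
#include <linux/msi.h>

/*
 * Sketch: the "prepare" step maps the physical ITS page into the device's
 * IOVA space; the "compose" step rewrites the MSI message so the device
 * writes to the IOVA instead of the PA.
 */
static int example_msi_prepare(struct msi_desc *desc, phys_addr_t its_page_pa)
{
        /* Map the physical ITS page and stash the resulting IOVA */
        return iommu_dma_prepare_msi(desc, its_page_pa);
}

static void example_msi_compose(struct msi_desc *desc, struct msi_msg *msg)
{
        /* msg initially carries the PA of the ITS page; replace it with
         * the IOVA recorded by the prepare step */
        iommu_dma_compose_msi_msg(desc, msg);
}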

If the device that generates the MSI is attached to an IOMMU_DOMAIN_DMA,
the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
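
As a sketch of where that fixed window comes from, this is roughly how the
ARM SMMU drivers report it today (names and constants as in the existing
drivers; not the exact code):

/*
 * Sketch: an ARM IOMMU driver reserving the fixed SW_MSI window, roughly
 * what arm_smmu_get_resv_regions() does today.
 */
#define MSI_IOVA_BASE   0x8000000
#define MSI_IOVA_LENGTH 0x100000

static void example_get_resv_regions(struct device *dev,
                                     struct list_head *head)
{
        int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
        struct iommu_resv_region *region;

        /* A fixed IOVA window that the kernel uses for MSI doorbells */
        region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
                                         prot, IOMMU_RESV_SW_MSI,
                                         GFP_KERNEL);
        if (region)
                list_add_tail(&region->list, head);
}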

So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely in
charge of the IOMMU translation (1-stage translation), since the IOVA for
the ITS page is fixed and known by the kernel. However, with a virtual
machine enabling nested IOMMU translation (2-stage), the guest kernel
directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space (e.g.
0xEEEE0000). Then, the host kernel cannot know the guest-level IOVA to
program into the MSI address.

There have been two approaches to solve this problem:
1. Create an identity mapping in stage 1. The VMM could insert a few RMRs
   (Reserved Memory Regions) in the guest's IORT. The guest kernel would
   then fetch these RMR entries from the IORT and create an
   IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
   Eventually, the mappings would look like:
     IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
   This requires an IOMMUFD ioctl for the kernel and VMM to agree on the
   IPA.
2. Forward the guest-level MSI IOVA captured by the VMM to the host-level
   GIC driver, to program the correct MSI IOVA. Also forward the
   VMM-defined vITS page location (IPA) to the kernel for the stage-2
   mapping. Eventually:
     IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl (for the
   IPA).

It is worth mentioning that when Eric Auger was working on the same topic
with the VFIO iommu uAPI, he took approach (2) first and then switched to
approach (1), as suggested by Jean-Philippe to reduce complexity.

Approach (1) basically resembles the existing VFIO passthrough case, which
has a 1-stage mapping for the unmanaged domain, only shifting the MSI
mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-iommu
case). So it could reuse the existing IOMMU_RESV_SW_MSI piece, sharing the
same idea of "the VMM leaving everything to the kernel".

Approach (2) is the ideal solution, yet it requires additional effort for
the kernel to be aware of the stage-1 gIOVA(s) and the stage-2 IPA(s) of
the vITS page(s), which demands close cooperation from the VMM.
 * It also brings some complicated use cases to the table, where the host
   and/or guest system(s) has/have multiple ITS pages.

[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip
drivers to use directly. There could be different versions of these
functions from different domain owners: for the existing VFIO passthrough
cases and in-kernel DMA domain cases, reuse the existing dma-iommu version
of the sw_msi functions; for nested translation use cases, there can be
another version of sw_msi functions to handle the mapping and msi_msg(s)
differently.
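
Roughly, this turns the dma-iommu entry point into a per-domain op, along
these lines (a sketch of the idea only; see the "Make
iommu_dma_prepare_msi() into a generic operation" patch for the real
definition):

/*
 * Sketch: each domain owner supplies its own sw_msi implementation
 * (dma-iommu for DMA/unmanaged domains, iommufd for nesting), and the
 * generic prepare path simply dispatches to it.
 */
struct iommu_domain {
        /* ... existing members ... */
        int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
                      phys_addr_t msi_addr);
};

int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
{
        struct device *dev = msi_desc_to_dev(desc);
        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

        if (!domain || !domain->sw_msi)
                return 0;       /* no IOMMU translation of MSIs */
        return domain->sw_msi(domain, desc, msi_addr);
}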

To support both approaches, in this series:
 - Get rid of the duplication in the "compose" function
 - Introduce a function pointer for what was previously the "prepare"
   function
 - Allow different domain owners to set their own "sw_msi" implementations
 - Implement an iommufd_sw_msi function to additionally support a nested
   translation use case using approach (1), i.e. the RMR solution
 - Add a pair of IOMMUFD options for a SW_MSI window for the kernel and
   VMM to agree on (for approach 1; see the userspace sketch below)
 - Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
   to update the msi_desc structure accordingly (for approach 2)
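
A rough userspace view of the new options (a sketch only;
IOMMU_OPTION_SW_MSI_START/SIZE and their exact semantics, including the
unit of the size value and the option scope, are proposed by this series
and may change):

#include <sys/ioctl.h>
#include <linux/iommufd.h>

/*
 * Sketch (approach 1): the VMM picks the SW_MSI window so that it lines
 * up with the RMR range described to the guest. The option IDs below are
 * the ones proposed by this series.
 */
static int set_sw_msi_window(int iommufd, __u64 start, __u64 size)
{
        struct iommu_option opt = {
                .size = sizeof(opt),
                .op = IOMMU_OPTION_OP_SET,
                .object_id = 0,         /* option scope per this RFC */
        };

        opt.option_id = IOMMU_OPTION_SW_MSI_START;
        opt.val64 = start;
        if (ioctl(iommufd, IOMMU_OPTION, &opt))
                return -1;

        opt.option_id = IOMMU_OPTION_SW_MSI_SIZE;
        opt.val64 = size;               /* size unit per this RFC */
        return ioctl(iommufd, IOMMU_OPTION, &opt);
}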

A missing piece:
 - Potentially another IOMMUFD_CMD_IOAS_MAP_MSI ioctl for the VMM to map
   the IPAs of the vITS page(s) in the stage-2 IO page table (for approach
   2). (In this RFC, the new IOMMUFD SW_MSI options are conveniently
   reused to set the vITS page's IPA, which works fine in a
   single-vITS-page case.)

This is a joint effort that includes Jason's rework at the
irq/iommu/iommufd base level and my additional patches on top of that for
the new uAPIs.

This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi-rfcv2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi-rfcv2-rmr
Pairing QEMU branch for testing (approach 2):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi-rfcv2-vits

Changelog
v2
 * Rebase on v6.13-rc6
 * Drop all the irq/pci patches and rework the compose function instead
 * Add a new sw_msi op to iommu_domain for a per-type implementation and
   let the iommufd core have its own implementation to support both
   approaches
 * Add RMR-solution (approach 1) support since it is straightforward and
   has been widely used in some out-of-tree projects
v1
 https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Jason Gunthorpe (5):
  genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
    iommu_cookie
  genirq/msi: Rename iommu_dma_compose_msi_msg() to
    msi_msg_set_msi_addr()
  iommu: Make iommu_dma_prepare_msi() into a generic operation
  irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that
    need it
  iommufd: Implement sw_msi support natively

Nicolin Chen (8):
  iommu: Turn fault_data to iommufd private pointer
  iommufd: Make attach_handle generic
  iommu: Turn iova_cookie to dma-iommu private pointer
  iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
  iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
  iommufd/device: Allow setting IOVAs for MSI(x) vectors
  vfio-iommufd: Provide another layer of msi_iova helpers
  vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE

 drivers/iommu/Kconfig                         |   1 -
 drivers/irqchip/Kconfig                       |   4 +
 kernel/irq/Kconfig                            |   1 +
 drivers/iommu/iommufd/iommufd_private.h       |  69 ++--
 include/linux/iommu.h                         |  58 ++--
 include/linux/iommufd.h                       |   6 +
 include/linux/msi.h                           |  43 ++-
 include/linux/vfio.h                          |  25 ++
 include/uapi/linux/iommufd.h                  |  18 +-
 include/uapi/linux/vfio.h                     |   8 +-
 drivers/iommu/dma-iommu.c                     |  63 ++--
 drivers/iommu/iommu.c                         |  29 ++
 drivers/iommu/iommufd/device.c                | 312 ++++++++++++++++--
 drivers/iommu/iommufd/fault.c                 | 122 +------
 drivers/iommu/iommufd/hw_pagetable.c          |   5 +-
 drivers/iommu/iommufd/io_pagetable.c          |   4 +-
 drivers/iommu/iommufd/ioas.c                  |  34 ++
 drivers/iommu/iommufd/main.c                  |  15 +
 drivers/irqchip/irq-gic-v2m.c                 |   5 +-
 drivers/irqchip/irq-gic-v3-its.c              |  13 +-
 drivers/irqchip/irq-gic-v3-mbi.c              |  12 +-
 drivers/irqchip/irq-ls-scfg-msi.c             |   5 +-
 drivers/vfio/iommufd.c                        |  27 ++
 drivers/vfio/pci/vfio_pci_intrs.c             |  46 +++
 drivers/vfio/vfio_main.c                      |   3 +
 tools/testing/selftests/iommu/iommufd.c       |  53 +++
 .../selftests/iommu/iommufd_fail_nth.c        |  14 +
 27 files changed, 712 insertions(+), 283 deletions(-)

Comments

Eric Auger Jan. 29, 2025, 5:46 p.m. UTC | #1
On 1/29/25 4:04 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
>>>> or you are just mentioning it here because
>>>> it is still possible to make use of that. I think from previous discussions the
>>>> argument was to adopt a more dedicated MSI pass-through model which I
>>>> think is  approach-2 here.  
>>> The basic flow of the pass through model is shown in the last two
>>> patches, it is not fully complete but is testable. It assumes a single
>>> ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
>>> ITS page at the correct S2 location and then describe it in the ACPI
>>> as an ITS page not a RMR.
>> This is a nice to have feature but not mandated in the first place,
>> is it?
> Not mandated. It just sort of happens because of the design. IMHO
> nothing should use it because there is no way for userspace to
> discover how many ITS pages there may be.
>
>>> This missing piece is cleaning up the ITS mapping to allow for
>>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
>>> a FD that holds the specific ITS pages instead of the
>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>> That's what I don't get: at the moment you only pass the gIOVA. With
>> technique 2, how can you build the nested mapping, ie.
>>
>>          S1           S2
>> gIOVA    ->    gDB    ->    hDB
>>
>> without passing the full gIOVA/gDB S1 mapping to the host?
> The nested S2 mapping is already setup before the VM boots:
>
>  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
Ah OK. Your gDB has nothing to do with the actual S1 guest gDB, right?
It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
Is that correct? In
https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
I was passing both the gIOVA and the "true" gDB.

Eric
>  - The ACPI tells the VM that the GIC has an ITS page at the S2's
>    address (hDB)
>  - The VM sets up its S1 with a gIOVA that points to the S2's ITS 
>    page (gDB). The S2 already has gDB -> hDB.
>  - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
>    S2 are populated at this moment.
>
> If you have multiple ITS pages then the ACPI has to tell the guest GIC
> about them, what their gDB address is, and what devices use which ITS.
>
> Jason
>
Jason Gunthorpe Jan. 29, 2025, 8:13 p.m. UTC | #2
On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
> >>> This missing piece is cleaning up the ITS mapping to allow for
> >>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
> >>> a FD that holds the specific ITS pages instead of the
> >>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
> >> That's what I don't get: at the moment you only pass the gIOVA. With
> >> technique 2, how can you build the nested mapping, ie.
> >>
> >>          S1           S2
> >> gIOVA    ->    gDB    ->    hDB
> >>
> >> without passing the full gIOVA/gDB S1 mapping to the host?
> > The nested S2 mapping is already setup before the VM boots:
> >
> >  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
> right?

I'm not totally sure what you mean by gDB? The above diagram suggests
it is the ITS page address in the S2? I.e. the guest physical address of
the ITS.

Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
provide the gDB address as the phys_addr_t msi_addr.

This happens because the GIC driver will have been informed of the ITS
page at the gDB address, and it will use
iommu_dma_prepare_msi(). Exactly the same as bare metal.

> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
> Is that correct?

Yes, for a single ITS page it will reliably be put at sw_msi_start.
Since the VMM can provide sw_msi_start through the OPTION, the VMM can
place the ITS page where it wants and then program the ACPI to tell
the VM to call iommu_dma_prepare_msi(). (Don't use this flow; it doesn't
work for multiple ITS pages and is for testing only.)

> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
> I was passing both the gIOVA and the "true" gDB Eric

If I understand this right, it still had the hypervisor dynamically
setting up the S2, whereas here it is pre-set and static?

Jason
Jacob Pan Feb. 5, 2025, 10:49 p.m. UTC | #3
Hi Nicolin,

On Fri, 10 Jan 2025 19:32:16 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:

> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage
> translation is enabled, IOVA will be still used to program the MSI
> address, though the mappings will be in two stages: IOVA (0xFFFF0000)
> ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for
> Intermediate Physical Address).
> 
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the IOVA is dynamically allocated from the top of the IOVA space. If
> attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough
> device), the IOVA is fixed to an MSI window reported by the IOMMU
> driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE
> (IOVA==0x8000000) for ARM IOMMUs.
> 
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in
> charge of the IOMMU translation (1-stage translation), since the IOVA
> for the ITS page is fixed and known by kernel. However, with virtual
> machine enabling a nested IOMMU translation (2-stage), a guest kernel
> directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
> mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space
> (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level
> IOVA to program the MSI address.
> 
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few
> RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> would fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> and VMM to agree on the IPA.

Should this RMR be in a separate range from MSI_IOVA_BASE? The guest
will have MSI_IOVA_BASE in a reserved region already, no?
e.g. # cat
/sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
0x0000000008000000 0x00000000080fffff msi
Nicolin Chen Feb. 5, 2025, 10:56 p.m. UTC | #4
On Wed, Feb 05, 2025 at 02:49:04PM -0800, Jacob Pan wrote:
> > There have been two approaches to solve this problem:
> > 1. Create an identity mapping in the stage-1. VMM could insert a few
> > RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> > would fetch these RMR entries from the IORT and create an
> > IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> > Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> > (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> > and VMM to agree on the IPA.
> 
> Should this RMR be in a separate range than MSI_IOVA_BASE? The guest
> will have MSI_IOVA_BASE in a reserved region already, no?
> e.g. # cat
> /sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
> 0x0000000008000000 0x00000000080fffff msi

No. In Patch-9, the driver-defined MSI_IOVA_BASE will be ignored if
userspace has assigned IOMMU_OPTION_SW_MSI_START/SIZE, even if they
might have the same values as the MSI_IOVA_BASE window.

The idea of MSI_IOVA_BASE in this series is a kernel default that is only
effective when userspace doesn't care to set anything.

Thanks
Nicolin