[v2,0/7] iommu: Add MSI mapping support with nested SMMU (Part-1 core)

Message ID cover.1740014950.git.nicolinc@nvidia.com

Message

Nicolin Chen Feb. 20, 2025, 1:31 a.m. UTC
[ Background ]
On ARM GIC systems and others, the target address of an MSI is translated
by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When
the IOMMU is disabled, the MSI address is programmed to the physical
location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled,
the ITS page is behind the IOMMU, so the MSI address is programmed to an
allocated IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be
mapped to the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When 2-stage translation is enabled, an IOVA is still used to program the
MSI address, though the mapping is now in two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
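
For illustration, here is a simplified sketch of how an irqchip driver ends
up programming that address (loosely based on the GICv3 ITS driver as it
looks before this series; details are trimmed for brevity):

	static void its_irq_compose_msi_msg(struct irq_data *d, struct msi_msg *msg)
	{
		struct its_device *its_dev = irq_data_get_irq_chip_data(d);
		u64 addr = its_dev->its->get_msi_base(its_dev); /* physical ITS page */

		msg->address_lo = lower_32_bits(addr);
		msg->address_hi = upper_32_bits(addr);
		msg->data       = its_get_event_id(d);

		/*
		 * When an IOMMU translates MSIs, this replaces the physical
		 * address above with the IOVA that iommu_dma_prepare_msi()
		 * mapped for the ITS page.
		 */
		iommu_dma_compose_msi_msg(irq_data_get_msi_desc(d), msg);
	}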

If the device that generates MSIs is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If it is
attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the
IOVA is fixed to an MSI window reported by the IOMMU driver via
IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE (IOVA == 0x8000000)
for ARM IOMMUs.
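
For reference, this is roughly how the Arm SMMU drivers report that window
today (a trimmed sketch of the upstream get_resv_regions callback, shown
only for context and not part of this series):

	#define MSI_IOVA_BASE		0x8000000
	#define MSI_IOVA_LENGTH		0x100000

	static void arm_smmu_get_resv_regions(struct device *dev,
					      struct list_head *head)
	{
		struct iommu_resv_region *region;
		int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

		/* Advertise the software MSI window as a reserved region */
		region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
						 prot, IOMMU_RESV_SW_MSI, GFP_KERNEL);
		if (!region)
			return;

		list_add_tail(&region->list, head);
		iommu_dma_get_resv_regions(dev, head);
	}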

So far, this IOMMU_RESV_SW_MSI works well as the kernel is entirely in
charge of the IOMMU translation (1-stage translation), since the IOVA for
the ITS page is fixed and known by the kernel. However, with a virtual
machine enabling nested IOMMU translation (2-stage), the guest kernel
directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
mapping a vITS page (at IPA 0x80900000) onto its own IOVA space (e.g.
0xEEEE0000). The host kernel then has no way to know that guest-level IOVA
when programming the MSI address.

There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. The VMM could insert a few
   RMRs (Reserved Memory Regions) in the guest's IORT. The guest kernel
   would then fetch these RMR entries from the IORT and create an
   IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
   Eventually, the mappings would look like:
   IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
   This requires an IOMMUFD ioctl for the kernel and the VMM to agree on
   the IPA.
2. Forward the guest-level MSI IOVA captured by the VMM to the host-level
   GIC driver, to program the correct MSI IOVA. Forward the VMM-defined
   vITS page location (IPA) to the kernel for the stage-2 mapping.
   Eventually:
   IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl (for the
   IPA).

Worth mentioning that when Eric Auger was working on the same topic with
the VFIO iommu uAPI, he had a solution for approach (2) first, and then
switched to approach (1), as suggested by Jean-Philippe to reduce the
complexity.

Approach (1) basically feels like the existing VFIO passthrough that has a
1-stage mapping for the unmanaged domain, only shifting the MSI mapping
from stage 1 (no-viommu case) to stage 2 (has-viommu case). So it can
reuse the existing IOMMU_RESV_SW_MSI piece, sharing the same idea of "the
VMM leaving everything to the kernel".

Approach (2) is an ideal solution, yet it requires additional effort for
the kernel to be aware of the stage-1 gIOVAs and the stage-2 IPAs for the
vITS page(s), which demands close cooperation from the VMM.
 * It also brings some complicated use cases to the table, where the host
   and/or guest system(s) have multiple ITS pages.

[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip
drivers to use directly. There could be different versions of these
functions from different domain owners: the existing VFIO passthrough
cases and in-kernel DMA domain cases reuse dma-iommu's version of the
sw_msi functions; nested translation use cases can have another version of
the sw_msi functions that handles the mapping and the msi_msg(s)
differently.
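
To make the shape concrete, the core change (quoted in full further down
this thread) adds a per-domain op plus a setter along these lines, so that
dma-iommu installs its handler from the cookie paths and iommufd can
install its own when it owns the domain:

	/* in struct iommu_domain */
	#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
		int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
			      phys_addr_t msi_addr);
	#endif

	static inline void iommu_domain_set_sw_msi(
		struct iommu_domain *domain,
		int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
			      phys_addr_t msi_addr))
	{
	#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
		domain->sw_msi = sw_msi;
	#endif
	}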

As a part-1 series, this refactors the core infrastructure:
 - Get rid of the duplication in the "compose" function
 - Introduce a function pointer for what was previously the "prepare" function
 - Allow different domain owners to set their own "sw_msi" implementations
 - Implement an iommufd_sw_msi function to additionally support non-nested
   use cases and to prepare for a nested translation use case using
   approach (1); a sketch of the resulting generic dispatch follows this
   list
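
With those pieces in place, the irqchip-facing entry point becomes a thin
dispatcher in the core code. Sketched from the patch quoted later in this
thread, iommu_dma_prepare_msi() calls whichever sw_msi handler the domain
owner installed, under the group mutex:

	int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
	{
		struct device *dev = msi_desc_to_dev(desc);
		struct iommu_group *group = dev->iommu_group;
		int ret = 0;

		if (!group)
			return 0;

		mutex_lock(&group->mutex);
		if (group->domain && group->domain->sw_msi)
			ret = group->domain->sw_msi(group->domain, desc, msi_addr);
		mutex_unlock(&group->mutex);
		return ret;
	}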

[ Future Plan ]
Part-2 will add support for approach (1), i.e. the RMR solution:
 - Add a pair of IOMMUFD options for a SW_MSI window for the kernel and the
   VMM to agree on (for approach 1)
Part-3 and beyond will continue the effort of supporting approach (2), i.e.
a complete vITS-to-pITS mapping:
 - Map the physical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
 - Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)

---

This is a joint effort: it includes Jason's rework at the irq/iommu/iommufd
base level and my additional patches on top of that for the new uAPIs.

This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p1-v2

For testing with nested SMMU (approach 1):
https://github.com/nicolinc/iommufd/commits/wip/iommufd_msi_p2-v2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p2-v2-rmr

Changelog
v2
 * Split the iommufd ioctl for approach (1) out of this part-1
 * Rebase on Jason's for-next tree (6.14-rc2) for two iommufd patches
 * Update commit logs in two irqchip patches to make narrative clearer
 * Keep iommu_dma_compose_msi_msg() in PATCH-1 as a small cleaner step
 * Improve with some coding style changes: kdoc and 100-char wrapping
v1
 https://lore.kernel.org/kvm/cover.1739005085.git.nicolinc@nvidia.com/
 * Rebase on v6.14-rc1 and iommufd_attach_handle-v1 series
   https://lore.kernel.org/all/cover.1738645017.git.nicolinc@nvidia.com/
 * Correct typos
 * Replace set_bit with __set_bit
 * Use a common helper to get iommufd_handle
 * Add kdoc for iommu_msi_iova/iommu_msi_page_shift
 * Rename msi_msg_set_msi_addr() to msi_msg_set_addr()
 * Update selftest for a better coverage for the new options
 * Change IOMMU_OPTION_SW_MSI_START/SIZE to be per-idev and properly
   check against device's reserved region list
RFCv2
 https://lore.kernel.org/kvm/cover.1736550979.git.nicolinc@nvidia.com/
 * Rebase on v6.13-rc6
 * Drop all the irq/pci patches and rework the compose function instead
 * Add a new sw_msi op to iommu_domain for a per-type implementation and
   let the iommufd core have its own implementation to support both
   approaches
 * Add RMR-solution (approach 1) support since it is straightforward and
   has been widely used in some out-of-tree projects
RFCv1
 https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Jason Gunthorpe (5):
  genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
    iommu_cookie
  genirq/msi: Refactor iommu_dma_compose_msi_msg()
  iommu: Make iommu_dma_prepare_msi() into a generic operation
  irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by irqchips that need
    it
  iommufd: Implement sw_msi support natively

Nicolin Chen (2):
  iommu: Turn fault_data to iommufd private pointer
  iommu: Turn iova_cookie to dma-iommu private pointer

 drivers/iommu/Kconfig                   |   1 -
 drivers/irqchip/Kconfig                 |   4 +
 kernel/irq/Kconfig                      |   1 +
 drivers/iommu/iommufd/iommufd_private.h |  23 +++-
 include/linux/iommu.h                   |  58 +++++----
 include/linux/msi.h                     |  55 +++++---
 drivers/iommu/dma-iommu.c               |  63 +++-------
 drivers/iommu/iommu.c                   |  29 +++++
 drivers/iommu/iommufd/device.c          | 160 ++++++++++++++++++++----
 drivers/iommu/iommufd/fault.c           |   2 +-
 drivers/iommu/iommufd/hw_pagetable.c    |   5 +-
 drivers/iommu/iommufd/main.c            |   9 ++
 drivers/irqchip/irq-gic-v2m.c           |   5 +-
 drivers/irqchip/irq-gic-v3-its.c        |  13 +-
 drivers/irqchip/irq-gic-v3-mbi.c        |  12 +-
 drivers/irqchip/irq-ls-scfg-msi.c       |   5 +-
 16 files changed, 309 insertions(+), 136 deletions(-)


base-commit: dc10ba25d43f433ad5d9e8e6be4f4d2bb3cd9ddb
prerequisite-patch-id: 0000000000000000000000000000000000000000

Comments

Jason Gunthorpe Feb. 20, 2025, 5:50 p.m. UTC | #1
On Wed, Feb 19, 2025 at 05:31:42PM -0800, Nicolin Chen wrote:
> Now that iommufd does not rely on dma-iommu.c for any purpose, we can
> combine the dma-iommu.c iova_cookie and the iommufd_hwpt under the same
> union. This union is effectively 'owner data' and can be used by the
> entity that allocated the domain. Note that legacy vfio type1 flows
> continue to use dma-iommu.c for sw_msi and still need iova_cookie.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  include/linux/iommu.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Thomas Gleixner Feb. 21, 2025, 9:28 a.m. UTC | #2
On Wed, Feb 19 2025 at 17:31, Nicolin Chen wrote:
> Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA
> address is already known during iommu_dma_prepare_msi() and cannot change.
> Thus, it can simply be stored as an integer in the MSI descriptor.
>
> A following patch will fix the other UAF in iommu_get_domain_for_dev(), by
> using the IOMMU group mutex.

"A following patch" has no meaning once the current one is
applied. Simply say:

  The other UAF in iommu_get_domain_for_dev() will be addressed
  separately, by ....

> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

With that fixed:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Jason Gunthorpe Feb. 21, 2025, 2:05 p.m. UTC | #3
On Fri, Feb 21, 2025 at 10:28:20AM +0100, Thomas Gleixner wrote:
> On Wed, Feb 19 2025 at 17:31, Nicolin Chen wrote:
> > Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA
> > address is already known during iommu_dma_prepare_msi() and cannot change.
> > Thus, it can simply be stored as an integer in the MSI descriptor.
> >
> > A following patch will fix the other UAF in iommu_get_domain_for_dev(), by
> > using the IOMMU group mutex.
> 
> "A following patch" has no meaning once the current one is
> applied. Simply say:
> 
>   The other UAF in iommu_get_domain_for_dev() will be addressed
>   separately, by ....

I used this paragraph: 

The other UAF related to iommu_get_domain_for_dev() will be addressed in
patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by
using the IOMMU group mutex.

Thanks,
Jason
Jason Gunthorpe Feb. 21, 2025, 2:59 p.m. UTC | #4
On Wed, Feb 19, 2025 at 05:31:35PM -0800, Nicolin Chen wrote:
> 
> Jason Gunthorpe (5):
>   genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
>     iommu_cookie
>   genirq/msi: Refactor iommu_dma_compose_msi_msg()
>   iommu: Make iommu_dma_prepare_msi() into a generic operation
>   irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by irqchips that need
>     it
>   iommufd: Implement sw_msi support natively
> 
> Nicolin Chen (2):
>   iommu: Turn fault_data to iommufd private pointer

I dropped this patch:

>   iommu: Turn iova_cookie to dma-iommu private pointer

And fixed up the two compilation issues found by building on my x86
config, plus Thomas's language update.

It is headed toward linux-next, give it till monday for a PR to Joerg
just incase there are more randconfig issues.

Thanks,
Jason
Robin Murphy Feb. 21, 2025, 3:39 p.m. UTC | #5
On 2025-02-20 1:31 am, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> SW_MSI supports IOMMU to translate an MSI message before the MSI message
> is delivered to the interrupt controller. On such systems, an iommu_domain
> must have a translation for the MSI message for interrupts to work.
> 
> The IRQ subsystem will call into IOMMU to request that a physical page be
> set up to receive MSI messages, and the IOMMU then sets an IOVA that maps
> to that physical page. Ultimately the IOVA is programmed into the device
> via the msi_msg.
> 
> Generalize this by allowing iommu_domain owners to provide implementations
> of this mapping. Add a function pointer in struct iommu_domain to allow a
> domain owner to provide its own implementation.
> 
> Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
> the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
> VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
> the iommu_get_msi_cookie() path.
> 
> Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
> doesn't change or become freed while running. Races with IRQ operations
> from VFIO and domain changes from iommufd are possible here.
> 
> Replace the msi_prepare_lock with a lockdep assertion for the group mutex
> as documentation. For dma-iommu.c, each iommu_domain is unique to a
> group.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/linux/iommu.h     | 44 ++++++++++++++++++++++++++-------------
>   drivers/iommu/dma-iommu.c | 33 +++++++++++++----------------
>   drivers/iommu/iommu.c     | 29 ++++++++++++++++++++++++++
>   3 files changed, 73 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index caee952febd4..761c5e186de9 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -44,6 +44,8 @@ struct iommu_dma_cookie;
>   struct iommu_fault_param;
>   struct iommufd_ctx;
>   struct iommufd_viommu;
> +struct msi_desc;
> +struct msi_msg;
>   
>   #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
>   #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
> @@ -216,6 +218,12 @@ struct iommu_domain {
>   	struct iommu_domain_geometry geometry;
>   	struct iommu_dma_cookie *iova_cookie;
>   	int (*iopf_handler)(struct iopf_group *group);
> +
> +#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> +	int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
> +		      phys_addr_t msi_addr);
> +#endif
> +
>   	void *fault_data;
>   	union {
>   		struct {
> @@ -234,6 +242,16 @@ struct iommu_domain {
>   	};
>   };
>   
> +static inline void iommu_domain_set_sw_msi(
> +	struct iommu_domain *domain,
> +	int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
> +		      phys_addr_t msi_addr))
> +{
> +#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> +	domain->sw_msi = sw_msi;
> +#endif
> +}

Yuck. Realistically we are going to have no more than two different 
implementations of this; a fiddly callback interface seems overkill. All 
we should need in the domain is a simple indicator of *which* MSI 
translation scheme is in use (if it can't be squeezed into the domain 
type itself), then iommu_dma_prepare_msi() can simply dispatch between 
iommu-dma and IOMMUFD based on that, and then it's easy to solve all the 
other fragility issues too.

Thanks,
Robin.

> +
>   static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
>   {
>   	return domain->type & __IOMMU_DOMAIN_DMA_API;
> @@ -1470,6 +1488,18 @@ static inline ioasid_t iommu_alloc_global_pasid(struct device *dev)
>   static inline void iommu_free_global_pasid(ioasid_t pasid) {}
>   #endif /* CONFIG_IOMMU_API */
>   
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> +#ifdef CONFIG_IOMMU_API
> +int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
> +#else
> +static inline int iommu_dma_prepare_msi(struct msi_desc *desc,
> +					phys_addr_t msi_addr)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +#endif /* CONFIG_IRQ_MSI_IOMMU */
> +
>   #if IS_ENABLED(CONFIG_LOCKDEP) && IS_ENABLED(CONFIG_IOMMU_API)
>   void iommu_group_mutex_assert(struct device *dev);
>   #else
> @@ -1503,26 +1533,12 @@ static inline void iommu_debugfs_setup(void) {}
>   #endif
>   
>   #ifdef CONFIG_IOMMU_DMA
> -#include <linux/msi.h>
> -
>   int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
> -
> -int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
> -
>   #else /* CONFIG_IOMMU_DMA */
> -
> -struct msi_desc;
> -struct msi_msg;
> -
>   static inline int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
>   {
>   	return -ENODEV;
>   }
> -
> -static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> -{
> -	return 0;
> -}
>   #endif	/* CONFIG_IOMMU_DMA */
>   
>   /*
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index bf91e014d179..3b58244e6344 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -24,6 +24,7 @@
>   #include <linux/memremap.h>
>   #include <linux/mm.h>
>   #include <linux/mutex.h>
> +#include <linux/msi.h>
>   #include <linux/of_iommu.h>
>   #include <linux/pci.h>
>   #include <linux/scatterlist.h>
> @@ -102,6 +103,9 @@ static int __init iommu_dma_forcedac_setup(char *str)
>   }
>   early_param("iommu.forcedac", iommu_dma_forcedac_setup);
>   
> +static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> +			    phys_addr_t msi_addr);
> +
>   /* Number of entries per flush queue */
>   #define IOVA_DEFAULT_FQ_SIZE	256
>   #define IOVA_SINGLE_FQ_SIZE	32768
> @@ -398,6 +402,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain)
>   		return -ENOMEM;
>   
>   	mutex_init(&domain->iova_cookie->mutex);
> +	iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
>   	return 0;
>   }
>   
> @@ -429,6 +434,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
>   
>   	cookie->msi_iova = base;
>   	domain->iova_cookie = cookie;
> +	iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
>   	return 0;
>   }
>   EXPORT_SYMBOL(iommu_get_msi_cookie);
> @@ -443,6 +449,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
>   	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>   	struct iommu_dma_msi_page *msi, *tmp;
>   
> +	if (domain->sw_msi != iommu_dma_sw_msi)
> +		return;
> +
>   	if (!cookie)
>   		return;
>   
> @@ -1800,33 +1809,19 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
>   	return NULL;
>   }
>   
> -/**
> - * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
> - * @desc: MSI descriptor, will store the MSI page
> - * @msi_addr: MSI target address to be mapped
> - *
> - * Return: 0 on success or negative error code if the mapping failed.
> - */
> -int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> +static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> +			    phys_addr_t msi_addr)
>   {
>   	struct device *dev = msi_desc_to_dev(desc);
> -	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> -	struct iommu_dma_msi_page *msi_page;
> -	static DEFINE_MUTEX(msi_prepare_lock); /* see below */
> +	const struct iommu_dma_msi_page *msi_page;
>   
> -	if (!domain || !domain->iova_cookie) {
> +	if (!domain->iova_cookie) {
>   		msi_desc_set_iommu_msi_iova(desc, 0, 0);
>   		return 0;
>   	}
>   
> -	/*
> -	 * In fact the whole prepare operation should already be serialised by
> -	 * irq_domain_mutex further up the callchain, but that's pretty subtle
> -	 * on its own, so consider this locking as failsafe documentation...
> -	 */
> -	mutex_lock(&msi_prepare_lock);
> +	iommu_group_mutex_assert(dev);
>   	msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
> -	mutex_unlock(&msi_prepare_lock);
>   	if (!msi_page)
>   		return -ENOMEM;
>   
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 870c3cdbd0f6..022bf96a18c5 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3596,3 +3596,32 @@ int iommu_replace_group_handle(struct iommu_group *group,
>   	return ret;
>   }
>   EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
> +
> +#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> +/**
> + * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
> + * @desc: MSI descriptor, will store the MSI page
> + * @msi_addr: MSI target address to be mapped
> + *
> + * The implementation of sw_msi() should take msi_addr and map it to
> + * an IOVA in the domain and call msi_desc_set_iommu_msi_iova() with the
> + * mapping information.
> + *
> + * Return: 0 on success or negative error code if the mapping failed.
> + */
> +int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> +{
> +	struct device *dev = msi_desc_to_dev(desc);
> +	struct iommu_group *group = dev->iommu_group;
> +	int ret = 0;
> +
> +	if (!group)
> +		return 0;
> +
> +	mutex_lock(&group->mutex);
> +	if (group->domain && group->domain->sw_msi)
> +		ret = group->domain->sw_msi(group->domain, desc, msi_addr);
> +	mutex_unlock(&group->mutex);
> +	return ret;
> +}
> +#endif /* CONFIG_IRQ_MSI_IOMMU */
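
Robin's suggestion above, sketched as purely hypothetical code (the enum,
the sw_msi_type field, and the direct iommufd_sw_msi() call are
illustrative only, not part of this series), would look roughly like:

	enum iommu_sw_msi_type {
		IOMMU_SW_MSI_NONE,
		IOMMU_SW_MSI_DMA_IOMMU,
		IOMMU_SW_MSI_IOMMUFD,
	};

	int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
	{
		struct iommu_domain *domain; /* resolved under group->mutex */
		...
		switch (domain->sw_msi_type) {
		case IOMMU_SW_MSI_DMA_IOMMU:
			return iommu_dma_sw_msi(domain, desc, msi_addr);
		case IOMMU_SW_MSI_IOMMUFD:
			return iommufd_sw_msi(domain, desc, msi_addr);
		default:
			return 0;
		}
	}
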
Jason Gunthorpe Feb. 21, 2025, 4:44 p.m. UTC | #6
On Fri, Feb 21, 2025 at 03:39:45PM +0000, Robin Murphy wrote:
> Yuck. Realistically we are going to have no more than two different
> implementations of this; a fiddly callback interface seems overkill. All we
> should need in the domain is a simple indicator of *which* MSI translation
> scheme is in use (if it can't be squeezed into the domain type itself), then
> iommu_dma_prepare_msi() can simply dispatch between iommu-dma and IOMMUFD
> based on that, and then it's easy to solve all the other fragility issues
> too.

That would make module dependency problems, we have so far avoided
having the core kernel hard depend on iommufd.

Jason