mbox series

[v7,00/13] iommufd: Add vIOMMU infrastructure (Part-1)

Message ID cover.1730836219.git.nicolinc@nvidia.com
Headers show
Series iommufd: Add vIOMMU infrastructure (Part-1) | expand

Message

Nicolin Chen Nov. 5, 2024, 8:04 p.m. UTC
This series introduces a new vIOMMU infrastructure and related ioctls.

IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available. And this container will be able to hold some advanced
feature too.

For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
  _______________________________________________________________________
 |                      iommufd (with vIOMMU)                            |
 |                        _____________                                  |
 |                       |             |                                 |
 |      |----------------|    vIOMMU   |                                 |
 |      |     ______     |             |     _____________     ________  |
 |      |    |      |    |             |    |             |   |        | |
 |      |    | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
 |      |    |______|    |_____________|    |_____________|   |________| |
 |      |        |              |                  |               |     |
 |______|________|______________|__________________|_______________|_____|
        |        |              |                  |               |
  ______v_____   |        ______v_____       ______v_____       ___v__
 |   struct   |  |  PFN  |  (paging)  |     |  (nested)  |     |struct|
 |iommu_device|  |------>|iommu_domain|<----|iommu_domain|<----|device|
 |____________|   storage|____________|     |____________|     |______|

The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
 - Security namespace for guest owned ID, e.g. guest-controlled cache tags
 - Non-device-affiliated event reporting, e.g. invalidation queue errors
 - Access to a sharable nesting parent pagetable across physical IOMMUs
 - Virtualization of various platforms IDs, e.g. RIDs and others
 - Delivery of paravirtualized invalidation
 - Direct assigned invalidation queues
 - Direct assigned interrupts

On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that have a slice passed to (via device) a guest VM,
while being able to hold the shareable parent HWPT. Each vIOMMU then just
needs to allocate its own individual ID to tag its own cache:
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         |      IDx       |
                     ----------------------------
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested1 |--->| viommu1 ------------------
 ----------------    |         |      IDy       |
                     ----------------------------

As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only.

More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/

This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v7
(QEMU branch for testing will be provided in Jason's nesting series)

Changelog
v7
 * Added "Reviewed-by" from Jason
 * Dropped "select IOMMUFD_DRIVER_CORE" in Kconfig
 * Decoupled IOMMUFD_DRIVER from IOMMUFD_DRIVER_CORE
 * Moved vIOMMU negative tests out of FIXTURE_SETUP to a TEST_F
 * Dropped the "flags" check in iommufd_viommu_alloc_hwpt_nested
 * Added the kdoc for "flags" in iommufd_viommu_alloc_hwpt_nested
v6
 https://lore.kernel.org/all/cover.1730313237.git.nicolinc@nvidia.com/
 * Improved comment lines
 * Added a TEST_F for IO page fault
 * Fixed indentations in iommufd.rst
 * Revised kdoc of the viommu_alloc op
 * Added "Reviewed-by" from Kevin and Jason
 * Passed in "flags" to ->alloc_domain_nested
 * Renamed "free" op to "destroy" in viommu_ops
 * Skipped SMMUv3 driver changes (to post in a separate series)
 * Fixed "flags" validation in iommufd_viommu_alloc_hwpt_nested
 * Added CONFIG_IOMMUFD_DRIVER_CORE for sharing between iommufd
   core and IOMMU dirvers
 * Replaced iommufd_verify_unfinalized_object with xa_cmpxchg in
   iommufd_object_finalize/abort functions
v5
 https://lore.kernel.org/all/cover.1729897352.git.nicolinc@nvidia.com/
 * Added "Reviewed-by" from Kevin
 * Reworked iommufd_viommu_alloc helper
 * Revised the uAPI kdoc for vIOMMU object
 * Revised comments for pluggable iommu_dev
 * Added a couple of cleanup patches for selftest
 * Renamed domain_alloc_nested op to alloc_domain_nested
 * Updated a few commit messages to reflect the latest series
 * Renamed iommufd_hwpt_nested_alloc_for_viommu to
   iommufd_viommu_alloc_hwpt_nested, and added flag validation
v4
 https://lore.kernel.org/all/cover.1729553811.git.nicolinc@nvidia.com/
 * Added "Reviewed-by" from Jason
 * Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
 * Dropped iommufd_object_alloc_elm renamings
 * Renamed iommufd's viommu_api.c to driver.c
 * Reworked iommufd_viommu_alloc helper
 * Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
   hwpt_nested allocations on a vIOMMU, and added comparison between
   viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
 * Replaced s2_parent with vsmmu in arm_smmu_nested_domain
 * Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
   viommu_ops
 * Replaced wait_queue_head_t with a completion, to delay the unplug of
   mock_iommu_dev
 * Corrected documentation graph that was missing struct iommu_device
 * Added an iommufd_verify_unfinalized_object helper to verify driver-
   allocated vIOMMU/vDEVICE objects
 * Added missing test cases for TEST_LENGTH and fail_nth
v3
 https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
 * Rebased on top of Jason's nesting v3 series
   https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvidia.com/
 * Split the series into smaller parts
 * Added Jason's Reviewed-by
 * Added back viommu->iommu_dev
 * Added support for driver-allocated vIOMMU v.s. core-allocated
 * Dropped arm_smmu_cache_invalidate_user
 * Added an iommufd_test_wait_for_users() in selftest
 * Reworked test code to make viommu an individual FIXTURE
 * Added missing TEST_LENGTH case for the new ioctl command
v2
 https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
 * Limited vdev_id to one per idev
 * Added a rw_sem to protect the vdev_id list
 * Reworked driver-level APIs with proper lockings
 * Added a new viommu_api file for IOMMUFD_DRIVER config
 * Dropped useless iommu_dev point from the viommu structure
 * Added missing index numnbers to new types in the uAPI header
 * Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
 * Reworked mock_viommu_cache_invalidate() using the new iommu helper
 * Reordered details of set/unset_vdev_id handlers for proper lockings
v1
 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Nicolin Chen (13):
  iommufd: Move struct iommufd_object to public iommufd header
  iommufd: Move _iommufd_object_alloc helper to a sharable file
  iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
  iommufd: Verify object in iommufd_object_finalize/abort()
  iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
  iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
  iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
  iommufd/selftest: Add container_of helpers
  iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
  iommufd/selftest: Add refcount to mock_iommu_device
  iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
  iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
  Documentation: userspace-api: iommufd: Update vIOMMU

 drivers/iommu/iommufd/Kconfig                 |   4 +
 drivers/iommu/iommufd/Makefile                |   4 +-
 drivers/iommu/iommufd/iommufd_private.h       |  33 +--
 drivers/iommu/iommufd/iommufd_test.h          |   2 +
 include/linux/iommu.h                         |  14 +
 include/linux/iommufd.h                       |  86 ++++++
 include/uapi/linux/iommufd.h                  |  54 +++-
 tools/testing/selftests/iommu/iommufd_utils.h |  28 ++
 drivers/iommu/iommufd/driver.c                |  40 +++
 drivers/iommu/iommufd/hw_pagetable.c          |  73 ++++-
 drivers/iommu/iommufd/main.c                  |  54 ++--
 drivers/iommu/iommufd/selftest.c              | 266 +++++++++++++-----
 drivers/iommu/iommufd/viommu.c                |  81 ++++++
 tools/testing/selftests/iommu/iommufd.c       | 137 +++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |  11 +
 Documentation/userspace-api/iommufd.rst       |  69 ++++-
 16 files changed, 804 insertions(+), 152 deletions(-)
 create mode 100644 drivers/iommu/iommufd/driver.c
 create mode 100644 drivers/iommu/iommufd/viommu.c


base-commit: 0bcceb1f51c77f6b98a7aab00847ed340bf36e35

Comments

Jason Gunthorpe Nov. 8, 2024, 5:40 p.m. UTC | #1
On Tue, Nov 05, 2024 at 12:04:16PM -0800, Nicolin Chen wrote:
> This series introduces a new vIOMMU infrastructure and related ioctls.
> 
> IOMMUFD has been using the HWPT infrastructure for all cases, including a
> nested IO page table support. Yet, there're limitations for an HWPT-based
> structure to support some advanced HW-accelerated features, such as CMDQV
> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> environment, it is not straightforward for nested HWPTs to share the same
> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
> parent HWPT typically hold one stage-2 IO pagetable and tag it with only
> one ID in the cache entries. When sharing one large stage-2 IO pagetable
> across physical IOMMU instances, that one ID may not always be available
> across all the IOMMU instances. In other word, it's ideal for SW to have
> a different container for the stage-2 IO pagetable so it can hold another
> ID that's available. And this container will be able to hold some advanced
> feature too.

Applied to iommufd for-next, thanks

Jason