
[v4,00/12] Initial support for SMMUv3 nested translation

Message ID 0-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com

Message

Jason Gunthorpe Oct. 31, 2024, 12:20 a.m. UTC
[This is now based on Nicolin's iommufd patches for vIOMMU and will need
to go through the iommufd tree, please ack]

This brings support for the IOMMUFD ioctls:

 - IOMMU_GET_HW_INFO
 - IOMMU_HWPT_ALLOC_NEST_PARENT
 - IOMMU_VIOMMU_ALLOC
 - IOMMU_DOMAIN_NESTED
 - IOMMU_HWPT_INVALIDATE
 - ops->enforce_cache_coherency()

This is quite straightforward as the nested STE can just be built in the
special NESTED domain op and fed through the generic update machinery.
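
As an illustrative aside (not part of the cover letter itself), a minimal
userspace sketch of how these pieces chain together might look as follows. It
assumes an open iommufd, a VFIO-cdev-bound device (dev_id), an IOAS (ioas_id)
already populated with guest memory, and a guest STE fragment; the struct
layouts follow this series' include/uapi/linux/iommufd.h and should be checked
against the merged headers.

    /* Rough sketch only; error handling omitted */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    struct iommu_hw_info_arm_smmuv3 smmu_info = {};
    struct iommu_hw_info info = {
            .size = sizeof(info),
            .dev_id = dev_id,
            .data_len = sizeof(smmu_info),
            .data_uptr = (uintptr_t)&smmu_info,  /* idr/iidr/aidr filled in */
    };
    ioctl(iommufd, IOMMU_GET_HW_INFO, &info);

    /* S2 "nesting parent" HWPT holding the GPA->PA mapping from the IOAS */
    struct iommu_hwpt_alloc s2 = {
            .size = sizeof(s2),
            .flags = IOMMU_HWPT_ALLOC_NEST_PARENT,
            .dev_id = dev_id,
            .pt_id = ioas_id,
    };
    ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2);

    /* vIOMMU object scoping the vSID->pSID space on top of the S2 */
    struct iommu_viommu_alloc viommu = {
            .size = sizeof(viommu),
            .type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
            .dev_id = dev_id,
            .hwpt_id = s2.out_hwpt_id,
    };
    ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &viommu);

    /* Nested domain carrying the guest's STE fragment (the vSTE) */
    struct iommu_hwpt_arm_smmuv3 vste = {
            .ste = { guest_ste_lo, guest_ste_hi },
    };
    struct iommu_hwpt_alloc nested = {
            .size = sizeof(nested),
            .dev_id = dev_id,
            .pt_id = viommu.out_viommu_id,
            .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
            .data_len = sizeof(vste),
            .data_uptr = (uintptr_t)&vste,
    };
    ioctl(iommufd, IOMMU_HWPT_ALLOC, &nested);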

The design allows the user provided STE fragment to control several
aspects of the translation, including putting the STE into a "virtual
bypass" or a aborting state. This duplicates functionality available by
other means, but it allows trivially preserving the VMID in the STE as we
eventually move towards the vIOMMU owning the VMID.

Nesting support requires the system to either support S2FWB or the
stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
cache and view incoherent data; currently VFIO lacks any cache flushing
that would make this safe.

Yan has a series to add some of the needed infrastructure for VFIO cache
flushing here:

 https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/

That series may someday allow relaxing this further.
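
Conceptually the coherency gate is something like the sketch below. This is
not the literal driver code from the series; arm_smmu_master_canwbs() is
named in the v3 changelog, while the ARM_SMMU_FEAT_S2FWB feature-bit name is
an assumption here.

    /* Conceptual only: the nesting coherency requirement described above */
    static bool nesting_coherency_ok(struct arm_smmu_device *smmu,
                                     struct arm_smmu_master *master)
    {
            /* CANWBS: FW guarantees the device's DMA is always WB cacheable */
            if (arm_smmu_master_canwbs(master))
                    return true;
            /* S2FWB: the S2 forces write-back on the nested S1+S2 walk output */
            return smmu->features & ARM_SMMU_FEAT_S2FWB;
    }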

The VIOMMU object provides the framework to allow the invalidation path to
translate the vSID to a pSID and then issue the correct physical
invalidation. This is all done in the kernel as the pSID is not exposed to
userspace. Future patches will extend VIOMMU to handle specific HW features
like vMPAM and NVIDIA's vCMDQ.
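
On the VMM side the forwarding could look roughly like the sketch below; the
data_type and struct names are per this series' UAPI proposal and should be
verified against the merged headers. The vSID-to-pSID rewrite itself happens
entirely inside the kernel.

    /* Sketch: forward one guest CMDQ entry (e.g. CMD_ATC_INV with a vSID) */
    struct iommu_viommu_arm_smmuv3_invalidate cmd;
    memcpy(cmd.cmd, guest_cmdq_entry, sizeof(cmd.cmd)); /* raw 128-bit command */

    struct iommu_hwpt_invalidate inv = {
            .size = sizeof(inv),
            .hwpt_id = viommu_id,       /* the vIOMMU object, not a plain HWPT */
            .data_uptr = (uintptr_t)&cmd,
            .data_type = IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
            .entry_len = sizeof(cmd),
            .entry_num = 1,
    };
    ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &inv);
    /* on return, entry_num reports how many commands were actually consumed */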

Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and is superseded by
this.

This is the first series in what will be several to complete nesting
support. At least:
 - IOMMU_RESV_SW_MSI related fixups
    https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
 - vCMDQ hypervisor support for direct invalidation queue assignment
    https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
 - KVM pinned VMID using vIOMMU for vBTM
    https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
 - Cross instance S2 sharing
 - Virtual Machine Structure using vIOMMU (for vMPAM?)
 - Fault forwarding support through IOMMUFD's fault fd for vSVA

The vIOMMU series is essential to allow the invalidations to be processed
for the CD as well.

It is enough to allow qemu work to progress.

This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting

v4:
 - Rebase on Nicolin's patches
 - Add user_pasid_table=1 to support fault reporting on NESTED domains
 - Reorder STRTAB constants
 - Fix whitespace
 - Roll in the patches Nicolin had and merge together into a logical order
   Includes vIOMMU, ATS and invalidation patches
v3: https://patch.msgid.link/r/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvidia.com
 - Rebase on v6.12-rc2
 - Revise commit messages
 - Consolidate CANWBS checks into arm_smmu_master_canwbs()
 - Add CONFIG_ARM_SMMU_V3_IOMMUFD to compile out iommufd only features
   like nesting
 - Shift code into arm-smmu-v3-iommufd.c
 - Add missed IS_ERR check
 - Add S2FWB to arm_smmu_get_ste_used()
 - Fixup quirks checks
 - Drop ARM_SMMU_FEAT_COHERENCY checks for S2FWB
 - Limit S2FWB to S2 Nesting Parent domains "just in case"
v2: https://patch.msgid.link/r/0-v2-621370057090+91fec-smmuv3_nesting_jgg@nvidia.com
 - Revise commit messages
 - Guard S2FWB support with ARM_SMMU_FEAT_COHERENCY, since it doesn't make
   sense to use S2FWB to enforce coherency on inherently non-coherent hardware.
 - Add missing IO_PGTABLE_QUIRK_ARM_S2FWB validation
 - Include formal ACPICA commit for IORT built using
   generate/linux/gen-patch.sh
 - Use FEAT_NESTING to block creating a NESTING_PARENT
 - Use an abort STE instead of non-valid if the user requests a non-valid
   vSTE
 - Consistently use 'nest_parent' for naming variables
 - Use the right domain for arm_smmu_remove_master_domain() when it
   removes the master
 - Join bitfields together
 - Drop arm_smmu_cache_invalidate_user patch, invalidation will
   exclusively go via viommu
v1: https://patch.msgid.link/r/0-v1-54e734311a7f+14f72-smmuv3_nesting_jgg@nvidia.com

Jason Gunthorpe (7):
  vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
  iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Use S2FWB for NESTED domains
  iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED

Nicolin Chen (5):
  ACPICA: IORT: Update for revision E.f
  ACPI/IORT: Support CANWBS memory access flag
  iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
    arm_smmu_hw_info
  iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
  iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object

 drivers/acpi/arm64/iort.c                     |  13 +
 drivers/iommu/Kconfig                         |   9 +
 drivers/iommu/arm/arm-smmu-v3/Makefile        |   1 +
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     | 393 ++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 139 +++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  92 +++-
 drivers/iommu/arm/arm-smmu/arm-smmu.c         |  16 -
 drivers/iommu/io-pgtable-arm.c                |  27 +-
 drivers/iommu/iommu.c                         |  10 -
 drivers/iommu/iommufd/vfio_compat.c           |   7 +-
 drivers/vfio/vfio_iommu_type1.c               |  12 +-
 include/acpi/actbl2.h                         |   3 +-
 include/linux/io-pgtable.h                    |   2 +
 include/linux/iommu.h                         |   5 +-
 include/uapi/linux/iommufd.h                  |  83 ++++
 include/uapi/linux/vfio.h                     |   2 +-
 16 files changed, 712 insertions(+), 102 deletions(-)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c


base-commit: 9ffbeb478d44c57b9b2e263750b1056e5faebc9b

Comments

Will Deacon Nov. 1, 2024, 12:18 p.m. UTC | #1
On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> [This is now based on Nicolin's iommufd patches for vIOMMU and will need
> to go through the iommufd tree, please ack]

Can't we separate out the SMMUv3 driver changes? They shouldn't depend on
Nicolin's work afaict.

Will
Jason Gunthorpe Nov. 1, 2024, 1:25 p.m. UTC | #2
On Fri, Nov 01, 2024 at 12:18:50PM +0000, Will Deacon wrote:
> On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> > [This is now based on Nicolin's iommufd patches for vIOMMU and will need
> > to go through the iommufd tree, please ack]
> 
> Can't we separate out the SMMUv3 driver changes? They shouldn't depend on
> Nicolin's work afaict.

We can put everything before "iommu/arm-smmu-v3: Support
IOMMU_VIOMMU_ALLOC" directly on a rc and share a branch with your tree.

That patch and following all depend on Nicolin's work, as they rely on
the VIOMMU due to how different ARM is from Intel.

How about you take these patches:

 [PATCH v4 01/12] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
 [PATCH v4 02/12] ACPICA: IORT: Update for revision E.f
 [PATCH v4 03/12] ACPI/IORT: Support CANWBS memory access flag
 [PATCH v4 04/12] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
 [PATCH v4 05/12] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
 [PATCH v4 06/12] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
 [PATCH v4 07/12] iommu/arm-smmu-v3: Expose the arm_smmu_attach interface

Onto a branch.

I'll take these patches after merging your branch and Nicolin's:

 [PATCH v4 08/12] iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
 [PATCH v4 09/12] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
 [PATCH v4 10/12] iommu/arm-smmu-v3: Use S2FWB for NESTED domains
 [PATCH v4 11/12] iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
 [PATCH v4 12/12] iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object

?

I can also probably push most of S2FWB and ATS into the first batch.

Please let me know; I would like this to be done this cycle, as Nicolin's
vIOMMU series are all reviewed now.

Jason
Will Deacon Nov. 4, 2024, 11:47 a.m. UTC | #3
On Wed, Oct 30, 2024 at 09:20:49PM -0300, Jason Gunthorpe wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
> instance need to be available to the VMM so it can construct an
> appropriate vSMMUv3 that reflects the correct HW capabilities.
> 
> For userspace page tables these values are required to constrain the valid
> values within the CD table and the IOPTEs.
> 
> The kernel does not sanitize these values. If building a VMM then
> userspace is required to only forward bits into a VM that it knows it can
> implement. Some bits will also require a VMM to detect if appropriate
> kernel support is available such as for ATS and BTM.
> 
> Start a new file and kconfig for the advanced iommufd support. This lets
> it be compiled out for kernels that are not intended to support
> virtualization, and allows distros to leave it disabled until they are
> shipping a matching qemu too.
> 
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
> Reviewed-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

--->8

> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index e266dfa6a38d9d..b227ac16333fe1 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -488,15 +488,50 @@ struct iommu_hw_info_vtd {
>  	__aligned_u64 ecap_reg;
>  };
>  
> +/**
> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> + *
> + * @flags: Must be set to 0
> + * @__reserved: Must be 0
> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> + * @iidr: Information about the implementation and implementer of ARM SMMU,
> + *        and architecture version supported
> + * @aidr: ARM SMMU architecture version
> + *
> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> + *
> + * User space should read the underlying ARM SMMUv3 hardware information for
> + * the list of supported features.
> + *
> + * Note that these values reflect the raw HW capability, without any insight if
> + * any required kernel driver support is present. Bits may be set indicating the
> + * HW has functionality that is lacking kernel software support, such as BTM. If
> + * a VMM is using this information to construct emulated copies of these
> + * registers it should only forward bits that it knows it can support.
> + *
> + * In future, presence of required kernel support will be indicated in flags.

What about the case where we _know_ that some functionality is broken in
the hardware? For example, we nobble BTM support on MMU 700 thanks to
erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here.
Similarly, HTTU can be overridden by IORT, so should we update the view
that we advertise for that as well?

Will
Jason Gunthorpe Nov. 4, 2024, 12:41 p.m. UTC | #4
On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
> > +/**
> > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> > + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> > + *
> > + * @flags: Must be set to 0
> > + * @__reserved: Must be 0
> > + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> > + * @iidr: Information about the implementation and implementer of ARM SMMU,
> > + *        and architecture version supported
> > + * @aidr: ARM SMMU architecture version
> > + *
> > + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> > + *
> > + * User space should read the underlying ARM SMMUv3 hardware information for
> > + * the list of supported features.
> > + *
> > + * Note that these values reflect the raw HW capability, without any insight if
> > + * any required kernel driver support is present. Bits may be set indicating the
> > + * HW has functionality that is lacking kernel software support, such as BTM. If
> > + * a VMM is using this information to construct emulated copies of these
> > + * registers it should only forward bits that it knows it can support.
> > + *
> > + * In future, presence of required kernel support will be indicated in flags.
> 
> What about the case where we _know_ that some functionality is broken in
> the hardware? For example, we nobble BTM support on MMU 700 thanks to
> erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here.
> Similarly, HTTU can be overridden by IORT, so should we update the view
> that we advertise for that as well?

My knee-jerk answer is no, these struct fields should just report the
raw HW register. A VMM should not copy these fields directly into a
VM. The principal purpose is to give the VMM the same details about the
HW as the kernel so it can apply errata/etc.

For instance, if we hide these fields how will the VMM/VM know to
apply the various flushing errata? With vCMDQ/etc the VM is directly
pushing flushes to HW, it must know the errata.

For BTM/HTTU/etc - those all require kernel SW support and per-device
permission in the kernel to turn on. For instance requesting a nested
vSTE that needs BTM will fail today during attach. Turning on HTTU on
the S2 already has an API that will fail if the IORT blocks it.

Incrementally dealing with expanding the support is part of the
"required kernel support will be indicated in flags."

Basically, exposing the information as-is doesn't do any harm.

Jason
Jason Gunthorpe Nov. 4, 2024, 5:19 p.m. UTC | #5
On Thu, Oct 31, 2024 at 02:21:11PM +0800, Zhangfei Gao wrote:

> > +static struct iommu_domain *
> > +arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
> > +                             const struct iommu_user_data *user_data)
> > +{
> > +       struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
> > +       struct arm_smmu_nested_domain *nested_domain;
> > +       struct iommu_hwpt_arm_smmuv3 arg;
> > +       int ret;
> > +
> > +       if (flags)
> > +               return ERR_PTR(-EOPNOTSUPP);
> 
> This check fails when using user page fault, with flags =
> IOMMU_HWPT_FAULT_ID_VALID (4)
> Strange, the check did not exist in the last version?
> 
> iommufd_viommu_alloc_hwpt_nested ->
> viommu->ops->alloc_domain_nested(viommu, flags, user_data) ->
> arm_vsmmu_alloc_domain_nested

It should permit IOMMU_HWPT_FAULT_ID_VALID, I'll add this hunk:

--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -178,12 +178,18 @@ arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
                              const struct iommu_user_data *user_data)
 {
        struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
+       const u32 SUPPORTED_FLAGS = IOMMU_HWPT_FAULT_ID_VALID;
        struct arm_smmu_nested_domain *nested_domain;
        struct iommu_hwpt_arm_smmuv3 arg;
        bool enable_ats = false;
        int ret;
 
-       if (flags)
+       /*
+        * Faults delivered to the nested domain are faults that originated by
+        * the S1 in the domain. The core code will match all PASIDs when
+        * delivering the fault due to user_pasid_table
+        */
+       if (flags & ~SUPPORTED_FLAGS)
                return ERR_PTR(-EOPNOTSUPP);
 
        ret = iommu_copy_struct_from_user(&arg, user_data,

Thanks,
Jason
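
To illustrate what the hunk above enables: a VMM can pair the nested domain
with an iommufd fault queue so that guest S1 faults are delivered to
userspace. A rough sketch, assuming the iommufd fault UAPI (struct and ioctl
names should be verified against the merged headers):

    /* Allocate a fault queue, then reference it while allocating the nested
     * HWPT with IOMMU_HWPT_FAULT_ID_VALID set */
    struct iommu_fault_alloc fault = { .size = sizeof(fault) };
    ioctl(iommufd, IOMMU_FAULT_QUEUE_ALLOC, &fault);

    struct iommu_hwpt_alloc nested = {
            .size = sizeof(nested),
            .flags = IOMMU_HWPT_FAULT_ID_VALID,
            .dev_id = dev_id,
            .pt_id = viommu_id,
            .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
            .data_len = sizeof(vste),
            .data_uptr = (uintptr_t)&vste,
            .fault_id = fault.out_fault_id,
    };
    ioctl(iommufd, IOMMU_HWPT_ALLOC, &nested);
    /* struct iommu_hwpt_pgfault records are then read from fault.out_fault_fd */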
Will Deacon Nov. 5, 2024, 4:48 p.m. UTC | #6
Hi Jason,

On Fri, Nov 01, 2024 at 10:25:03AM -0300, Jason Gunthorpe wrote:
> On Fri, Nov 01, 2024 at 12:18:50PM +0000, Will Deacon wrote:
> > On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> > > [This is now based on Nicolin's iommufd patches for vIOMMU and will need
> > > to go through the iommufd tree, please ack]
> > 
> > Can't we separate out the SMMUv3 driver changes? They shouldn't depend on
> > Nicolin's work afaict.
> 
> We can put everything before "iommu/arm-smmu-v3: Support
> IOMMU_VIOMMU_ALLOC" directly on a rc and share a branch with your tree.
> 
> That patch and following all depend on Nicolin's work, as they rely on
> the VIOMMU due to how different ARM is from Intel.
> 
> How about you take these patches:
> 
>  [PATCH v4 01/12] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
>  [PATCH v4 02/12] ACPICA: IORT: Update for revision E.f
>  [PATCH v4 03/12] ACPI/IORT: Support CANWBS memory access flag
>  [PATCH v4 04/12] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
>  [PATCH v4 05/12] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
>  [PATCH v4 06/12] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
>  [PATCH v4 07/12] iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
> 
> Onto a branch.

I've pushed these onto a new branch in the IOMMU tree:

	iommufd/arm-smmuv3-nested

However, please can you give it a day or two to get some exposure in
-next before you merge that into iommufd? I'll ping back here later in
the week.

Cheers,

Will
Jason Gunthorpe Nov. 5, 2024, 5:03 p.m. UTC | #7
On Tue, Nov 05, 2024 at 04:48:29PM +0000, Will Deacon wrote:
> Hi Jason,
> 
> On Fri, Nov 01, 2024 at 10:25:03AM -0300, Jason Gunthorpe wrote:
> > On Fri, Nov 01, 2024 at 12:18:50PM +0000, Will Deacon wrote:
> > > On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> > > > [This is now based on Nicolin's iommufd patches for vIOMMU and will need
> > > > to go through the iommufd tree, please ack]
> > > 
> > > Can't we separate out the SMMUv3 driver changes? They shouldn't depend on
> > > Nicolin's work afaict.
> > 
> > We can put everything before "iommu/arm-smmu-v3: Support
> > IOMMU_VIOMMU_ALLOC" directly on a rc and share a branch with your tree.
> > 
> > That patch and following all depend on Nicolin's work, as they rely on
> > the VIOMMU due to how different ARM is from Intel.
> > 
> > How about you take these patches:
> > 
> >  [PATCH v4 01/12] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
> >  [PATCH v4 02/12] ACPICA: IORT: Update for revision E.f
> >  [PATCH v4 03/12] ACPI/IORT: Support CANWBS memory access flag
> >  [PATCH v4 04/12] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
> >  [PATCH v4 05/12] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
> >  [PATCH v4 06/12] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
> >  [PATCH v4 07/12] iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
> > 
> > Onto a branch.
> 
> I've pushed these onto a new branch in the IOMMU tree:
> 
> 	iommufd/arm-smmuv3-nested
> 
> However, please can you give it a day or two to get some exposure in
> -next before you merge that into iommufd? I'll ping back here later in
> the week.

Thanks, I hope to put the other parts together on Friday then so
it can also get its time in -next, as we are running out of days
before the merge window

0-day is still complaining about some kconfig in Nicolin's patches so
we need to settle that anyhow.

Jason
Zhangfei Gao Nov. 6, 2024, 5:39 a.m. UTC | #8
On Mon, 4 Nov 2024 at 17:19, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 31, 2024 at 02:21:11PM +0800, Zhangfei Gao wrote:
>
> > > +static struct iommu_domain *
> > > +arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
> > > +                             const struct iommu_user_data *user_data)
> > > +{
> > > +       struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
> > > +       struct arm_smmu_nested_domain *nested_domain;
> > > +       struct iommu_hwpt_arm_smmuv3 arg;
> > > +       int ret;
> > > +
> > > +       if (flags)
> > > +               return ERR_PTR(-EOPNOTSUPP);
> >
> > This check fails when using user page fault, with flags =
> > IOMMU_HWPT_FAULT_ID_VALID (4)
> > Strange, the check did not exist in the last version?
> >
> > iommufd_viommu_alloc_hwpt_nested ->
> > viommu->ops->alloc_domain_nested(viommu, flags, user_data) ->
> > arm_vsmmu_alloc_domain_nested
>
> It should permit IOMMU_HWPT_FAULT_ID_VALID, I'll add this hunk:
>
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> @@ -178,12 +178,18 @@ arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
>                               const struct iommu_user_data *user_data)
>  {
>         struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
> +       const u32 SUPPORTED_FLAGS = IOMMU_HWPT_FAULT_ID_VALID;
>         struct arm_smmu_nested_domain *nested_domain;
>         struct iommu_hwpt_arm_smmuv3 arg;
>         bool enable_ats = false;
>         int ret;
>
> -       if (flags)
> +       /*
> +        * Faults delivered to the nested domain are faults that originated by
> +        * the S1 in the domain. The core code will match all PASIDs when
> +        * delivering the fault due to user_pasid_table
> +        */
> +       if (flags & ~SUPPORTED_FLAGS)
>                 return ERR_PTR(-EOPNOTSUPP);

Thanks Jason, this works
Robin Murphy Nov. 6, 2024, 4:37 p.m. UTC | #9
On 2024-11-04 12:41 pm, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
>>> +/**
>>> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
>>> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
>>> + *
>>> + * @flags: Must be set to 0
>>> + * @__reserved: Must be 0
>>> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
>>> + * @iidr: Information about the implementation and implementer of ARM SMMU,
>>> + *        and architecture version supported
>>> + * @aidr: ARM SMMU architecture version
>>> + *
>>> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
>>> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
>>> + *
>>> + * User space should read the underlying ARM SMMUv3 hardware information for
>>> + * the list of supported features.
>>> + *
>>> + * Note that these values reflect the raw HW capability, without any insight if
>>> + * any required kernel driver support is present. Bits may be set indicating the
>>> + * HW has functionality that is lacking kernel software support, such as BTM. If
>>> + * a VMM is using this information to construct emulated copies of these
>>> + * registers it should only forward bits that it knows it can support.

But how *is* a VMM supposed to know what it can support? Are they all 
expected to grovel the host devicetree/ACPI tables and maintain their 
own knowledge of implementation errata to understand what's actually usable?

>>> + *
>>> + * In future, presence of required kernel support will be indicated in flags.
>>
>> What about the case where we _know_ that some functionality is broken in
>> the hardware? For example, we nobble BTM support on MMU 700 thanks to
>> erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here.
>> Similarly, HTTU can be overridden by IORT, so should we update the view
>> that we advertise for that as well?
> 
> My knee-jerk answer is no, these struct fields should just report the
> raw HW register. A VMM should not copy these fields directly into a
> VM. The principal purpose is to give the VMM the same details about the
> HW as the kernel so it can apply errata/etc.
> 
> For instance, if we hide these fields how will the VMM/VM know to
> apply the various flushing errata? With vCMDQ/etc the VM is directly
> pushing flushes to HW, it must know the errata.

That doesn't seem like a valid argument. We obviously can't abstract 
SMMU_IIDR, that would indeed be an invitation for trouble, but 
otherwise, if an erratum affects S1 operation under conditions dependent 
on an optional feature, then not advertising that feature would make the 
workaround irrelevant anyway, since as far as the VM is concerned it 
would be wrong to expect a non-existent feature to work in the first place.

> For BTM/HTTU/etc - those all require kernel SW support and per-device
> permission in the kernel to turn on. For instance requesting a nested
> vSTE that needs BTM will fail today during attach. Turning on HTTU on
> the S2 already has an API that will fail if the IORT blocks it.

What does S2 HTTU have to do with the VM? How the host wants to maintain 
its S2 tables is its own business. AFAICS, unless the VMM wants to do 
some fiddly CD shadowing, it's going to be kinda hard to prevent the 
SMMU seeing a guest CD with CD.HA and/or CD.HD set if the guest expects 
S1 HTTU to work.

I'm not sure what "vSTE that needs BTM" means. Even if the system does 
support BTM, the only control is the global SMMU_CR2.PTM, and a vSMMU 
can't usefully emulate changing that either way. Either the host set 
PTM=0 before enabling the SMMU, so BTM can be advertised and expected to 
work, or it didn't, in which case there can be no BTM, full stop.

> Incrementally dealing with expanding the support is part of the
> "required kernel support will be indicated in flags."
> 
> Basically, exposing the information as-is doesn't do any harm.

I would say it does. Advertising a feature when we already know it's not 
usable at all puts a non-trivial and unnecessary burden on the VMM and 
VM to then have to somehow derive that information from other sources, 
at the risk of being confused by unexpected behaviour if they don't.

We sanitise CPU ID registers for userspace and KVM, so I see no 
compelling reason for SMMU ID registers to be different.

Thanks,
Robin.
Jason Gunthorpe Nov. 6, 2024, 6:05 p.m. UTC | #10
On Wed, Nov 06, 2024 at 04:37:53PM +0000, Robin Murphy wrote:
> On 2024-11-04 12:41 pm, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
> > > > +/**
> > > > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> > > > + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> > > > + *
> > > > + * @flags: Must be set to 0
> > > > + * @__reserved: Must be 0
> > > > + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> > > > + * @iidr: Information about the implementation and implementer of ARM SMMU,
> > > > + *        and architecture version supported
> > > > + * @aidr: ARM SMMU architecture version
> > > > + *
> > > > + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> > > > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> > > > + *
> > > > + * User space should read the underlying ARM SMMUv3 hardware information for
> > > > + * the list of supported features.
> > > > + *
> > > > + * Note that these values reflect the raw HW capability, without any insight if
> > > > + * any required kernel driver support is present. Bits may be set indicating the
> > > > + * HW has functionality that is lacking kernel software support, such as BTM. If
> > > > + * a VMM is using this information to construct emulated copies of these
> > > > + * registers it should only forward bits that it knows it can support.
> 
> But how *is* a VMM supposed to know what it can support?

I answered a related question to Mostafa with an example:

https://lore.kernel.org/linux-iommu/20240903235532.GJ3773488@nvidia.com/

"global" capabilities that are enabled directly from the CD entry
would follow the pattern.

> Are they all expected to grovel the host devicetree/ACPI tables and
> maintain their own knowledge of implementation errata to understand
> what's actually usable?

No, VMMs are expected to only implement baseline features we have
working today and not blindly add new features based only on the HW registers
reported here.

Each future capability we want to enable at the VMM needs an analysis:

 1) Does it require kernel SW changes, ie like BTM? Then it needs a
    kernel_capabilities bit to say the kernel SW exists
 2) Does it require data from ACPI/DT/etc? Then it needs a
    kernel_capabilities bit
 3) Does it need to be "turned on" per VM, ie with a VMS enablement?
    Then it needs a new request flag in ALLOC_VIOMMU
 4) Otherwise it can be read directly from the idr[] array

This is why the comment above is so stern that the VMM "should only
forward bits that it knows it can support".

> S2 tables it its own business. AFAICS, unless the VMM wants to do some
> fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing
> a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work.

If the VMM wrongly indicates HTTU support to the VM, because it
wrongly inspected those bits in the idr report, then it is just
broken.

> I would say it does. Advertising a feature when we already know it's not
> usable at all puts a non-trivial and unnecessary burden on the VMM and VM to
> then have to somehow derive that information from other sources, at the risk
> of being confused by unexpected behaviour if they don't.

That is not the purpose here, the register report is not to be used as
"advertising features". It describes details of the raw HW that the
VMM may need to use *some* of the fields.

There are quite a few fields that fit #4 today: OAS, VAX, GRAN, BBML,
CD2L, etc.

Basically we will pass most of the bits and mask a few. If we get the
masking wrong and pass something we shouldn't, then we've improved
nothing compared to this proposal. I think we are likely to get the
masking wrong :)

> We sanitise CPU ID registers for userspace and KVM, so I see no compelling
> reason for SMMU ID registers to be different.

We discussed this already:

https://lore.kernel.org/linux-iommu/20240904120103.GB3915968@nvidia.com

It is a false comparison, for KVM the kernel is responsible for controlling
the CPU ID registers. Reporting the registers the VM sees to the VMM
makes a lot of sense. For SMMU the VMM exclusively controls the VM's ID
registers.

If you still feel strongly about this please let me know by Friday and
I will drop the idr[] array from this cycle. We can continue to
discuss a solution for the next cycle.

Regards,
Jason
Robin Murphy Nov. 6, 2024, 9:05 p.m. UTC | #11
On 2024-11-06 6:05 pm, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2024 at 04:37:53PM +0000, Robin Murphy wrote:
>> On 2024-11-04 12:41 pm, Jason Gunthorpe wrote:
>>> On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
>>>>> +/**
>>>>> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
>>>>> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
>>>>> + *
>>>>> + * @flags: Must be set to 0
>>>>> + * @__reserved: Must be 0
>>>>> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
>>>>> + * @iidr: Information about the implementation and implementer of ARM SMMU,
>>>>> + *        and architecture version supported
>>>>> + * @aidr: ARM SMMU architecture version
>>>>> + *
>>>>> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
>>>>> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
>>>>> + *
>>>>> + * User space should read the underlying ARM SMMUv3 hardware information for
>>>>> + * the list of supported features.
>>>>> + *
>>>>> + * Note that these values reflect the raw HW capability, without any insight if
>>>>> + * any required kernel driver support is present. Bits may be set indicating the
>>>>> + * HW has functionality that is lacking kernel software support, such as BTM. If
>>>>> + * a VMM is using this information to construct emulated copies of these
>>>>> + * registers it should only forward bits that it knows it can support.
>>
>> But how *is* a VMM supposed to know what it can support?
> 
> I answered a related question to Mostafa with an example:
> 
> https://lore.kernel.org/linux-iommu/20240903235532.GJ3773488@nvidia.com/
> 
> "global" capabilities that are enabled directly from the CD entry
> would follow the pattern.
> 
>> Are they all expected to grovel the host devicetree/ACPI tables and
>> maintain their own knowledge of implementation errata to understand
>> what's actually usable?
> 
> No, VMMs are expected to only implement base line features we have
> working today and not blindly add new features based only HW registers
> reported here.
> 
> Each future capability we want to enable at the VMM needs an analysis:
> 
>   1) Does it require kernel SW changes, ie like BTM? Then it needs a
>      kernel_capabilities bit to say the kernel SW exists
>   2) Does it require data from ACPI/DT/etc? Then it needs a
>      kernel_capabilities bit
>   3) Does it need to be "turned on" per VM, ie with a VMS enablement?
>      Then it needs a new request flag in ALLOC_VIOMMU
>   4) Otherwise it can be read directly from the idr[] array
> 
> This is why the comment above is so stern that the VMM "should only
> forward bits that it knows it can support".

So... you're saying this patch is in fact broken, or at least uselessly 
incomplete, since VMMs aren't allowed to emulate a vSMMU at all without 
first consulting some other interface which does not exist? Great.

>> S2 tables it its own business. AFAICS, unless the VMM wants to do some
>> fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing
>> a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work.
> 
> If the VMM wrongly indicates HTTU support to the VM, because it
> wrongly inspected those bits in the idr report, then it is just
> broken.

What do you mean? We could have a system right now where the hardware is 
configured with SMMU_IDR0.HTTU=2, but it turned out that atomics were 
broken in the interconnect so firmware sets the IORT "HTTU override" 
field to 0. We know about that in the kernel, but all a VMM sees 
is iommu_hw_info_arm_smmuv3.idr[0] indicating HTTU=2. If it is "broken" 
to take the only information available at face value, assume HTTU is 
available, and reflect that in a vSMMU interface, then what is the 
correct thing to do, other than to not dare emulate a vSMMU at all, in 
fear of a sternly worded comment?

>> I would say it does. Advertising a feature when we already know it's not
>> usable at all puts a non-trivial and unnecessary burden on the VMM and VM to
>> then have to somehow derive that information from other sources, at the risk
>> of being confused by unexpected behaviour if they don't.
> 
> That is not the purpose here, the register report is not to be used as
> "advertising features". It describes details of the raw HW that the
> VMM may need to use *some* of the fields.
> 
> There are quite a few fields that fit #4 today: OAS, VAX, GRAN, BBML,
> CD2L, etc.
> 
> Basically we will pass most of the bits and mask a few. If we get the
> masking wrong and pass something we shouldn't, then we've improved
> nothing compared to this proposal. I think we are likely to get the
> masking wrong :)

Seriously? A simple inverse of the feature detection the kernel driver 
already does for its own needs, implemented once in the same place, is hard?

Compared to maintaining the exact same information within the driver but 
in some new different form, and also maintaining it in the UAPI, and 
having every VMM ever all do the same work to put the two together, and 
always be up to date with the right UAPI, and never ever let any field 
slip through as-is, especially not all the ones which were RES0 at time 
of writing, enforced by a sternly worded comment? Why yes, of course I 
can see how that's trivially easy and carries no risk whatsoever.

>> We sanitise CPU ID registers for userspace and KVM, so I see no compelling
>> reason for SMMU ID registers to be different.
> 
> We discussed this already:
> 
> https://lore.kernel.org/linux-iommu/20240904120103.GB3915968@nvidia.com
> 
> It is a false comparison, for KVM the kernel is responsible to control
> the CPU ID registers. Reporting the registers the VM sees to the VMM
> makes alot of sense. For SMMU the VMM exclusively controls the VM's ID
> registers.

Pointing out that two things are different is a false comparison because 
they are different, by virtue of your choice to make them different? 
Please try making sense.

Your tautology still does not offer any reasoning against doing the 
logical thing and following the same basic pattern: the kernel uses the 
ID register mechanism itself to advertise the set of features it's 
able/willing to support, by sanitising the values it offers to the VMM, 
combining the notions of hardware and kernel support where the 
distinction is irrelevant anyway. The VMM is then still free to take 
those values and hide more features, or potentially add any that it is 
capable of emulating without the kernel's help, and advertise that final 
set to the VM. Obviously there are significant *implementation* 
differences, most notably that the latter VMM->VM part doesn't need to 
involve IOMMUFD at all since MMIO register emulation can stay entirely 
in userspace, whereas for CPU system registers the final VM-visible 
values need to be plugged back in to KVM for it to handle the traps.

We are all asking you to explain why you think doing the kernel->VMM 
advertisement naturally and intuitively is somehow bad, and forcing VMMs 
to instead rely on a more complex, fragile, and crucially non-existent 
additional interface is better. You should take "We discussed this 
already" as more of a clue to yourself than to me - if 4 different 
people have all said the exact same thing in so many words, perhaps 
there's something in it...

And in case I need to spell it out with less sarcasm, "we'll get masking 
wrong in the kernel" only implies "we'll get kernel_capabilities wrong 
in the kernel (and elsewhere)", so it's clearly not a useful argument to 
keep repeating. Besides, as KVM + sysfs + MRS emulation shows, we're 
pretty experienced at masking ID registers in the kernel. It's not hard 
to do it right in a robust manner, where particularly with the nature of 
SMMU features, the only real risk might be forgetting to expose 
something new once we do actually support it.

> If you still feel strongly about this please let me know by Friday and
> I will drop the idr[] array from this cycle. We can continue to
> discuss a solution for the next cycle.

It already can't work as-is, I don't see how making it even more broken 
would help. IMO it doesn't seem like a good idea to be merging UAPI at 
all while it's still clearly incomplete and by its own definition unusable.

Thanks,
Robin.
Nicolin Chen Nov. 6, 2024, 9:53 p.m. UTC | #12
On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote:
> On 2024-11-06 6:05 pm, Jason Gunthorpe wrote:
> > If you still feel strongly about this please let me know by Friday and
> > I will drop the idr[] array from this cycle. We can continue to
> > discuss a solution for the next cycle.
> 
> It already can't work as-is, I don't see how making it even more broken
> would help. IMO it doesn't seem like a good idea to be merging UAPI at
> all while it's still clearly incomplete and by its own definition unusable.

Robin, would you please give a clear suggestion for the hw_info?

My takeaway is that you would want the unsupported features (per
firmware overrides and errata) to be stripped from the reported
IDR array. Alternatively, we could start with some basic nesting
features minus the advanced ones (HTTU/PRI or so), and then add
them later once we're comfortable advertising them.

Does this sound okay to you?

Thanks
Nicolin
Jason Gunthorpe Nov. 7, 2024, 2:35 a.m. UTC | #13
On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote:
> > Each future capability we want to enable at the VMM needs an analysis:
> > 
> >   1) Does it require kernel SW changes, ie like BTM? Then it needs a
> >      kernel_capabilities bit to say the kernel SW exists
> >   2) Does it require data from ACPI/DT/etc? Then it needs a
> >      kernel_capabilities bit
> >   3) Does it need to be "turned on" per VM, ie with a VMS enablement?
> >      Then it needs a new request flag in ALLOC_VIOMMU
> >   4) Otherwise it can be read directly from the idr[] array
> > 
> > This is why the comment above is so stern that the VMM "should only
> > forward bits that it knows it can support".
> 
> So... you're saying this patch is in fact broken, or at least uselessly
> incomplete, since VMMs aren't allowed to emulate a vSMMU at all without
> first consulting some other interface which does not exist? Great.

That is too far, there is nothing fundamentally broken here.

This does not enable every SMMU feature in the architecture. It
implements a basic baseline and no more. The VMMs are only permitted
to read bits in that baseline. VMMs that want to go beyond that have
to go through the 4 steps above.

Fundamentally all we are arguing about here is if bits the VMM
shouldn't even read should be forced to zero by the kernel.

If the bits are forced to zero then perhaps they could be used as a
future SW feature negotiation, and maybe that would be better, or
maybe not. See below about BTM for a counter example.

I agree the kdoc does not describe what the baseline actually is.

> We know about that in the kernel, but all a VMM sees is
> iommu_hw_info_arm_smmuv3.idr[0] indicating HTTU=2. If it is "broken" to take
> the only information available at face value, assume HTTU is available, and
> reflect that in a vSMMU interface, then what is the correct thing to do,
> other than to not dare emulate a vSMMU at all, in fear of a sternly worded
> comment?

These patches support a baseline feature set. They do not support the
entire SMMU architecture. They do not support vHTTU.

Things are going step by step because this is a huge project. HTTU is
not in the baseline, so it is definitely wrong for any VMM working on
these patches to advertise support to the VM! The VMM *MUST* ignore
such bits in the IDR report.

The correct thing is the 4 steps I outlined above. When a VMM author
wants to go beyond the baseline they have to determine what kernel and
VMM software is required and implement the missing parts.

> Your tautology still does not offer any reasoning against doing the logical
> thing and following the same basic pattern: the kernel uses the ID register
> mechanism itself to advertise the set of features it's able/willing to
> support, by sanitising the values it offers to the VMM, combining the
> notions of hardware and kernel support where the distinction is irrelevant
> anyway.

Even with your plan the VMM cannot pass the kernel's IDR register
into the VM unprotected. It must always sanitize it against the
features that the VMM supports.

Let's consider vBTM support. This requires kernel support to enable
the global BTM register *AND* a new VMM ioctl flow with iommufd to
link to KVM's VMID.

You are suggesting that the old kernel should wire IDR BTM to 0 and
the new kernel will set it to 1. However the VMM cannot just forward
this bit to the VM! A old VMM that does not do the KVM flow does NOT
support BTM.

All VMM's must mask the BTM bit from day 0, or we can never set it to
1 in a future kernel. Which is exactly the same plan as this patch!

> You should take "We discussed this already"
> as more of a clue to yourself than to me - if 4 different people have all
> said the exact same thing in so many words, perhaps there's something in
> it...

And all seemed to agree it was not a big deal after the discussion.

I think Mostafa was driving in a direction that we break up the IDR
into explicit fields and thus be explicit about what information the
VMM is able to access. This would effectively document and enforce
what the baseline is.

> And in case I need to spell it out with less sarcasm, "we'll get masking
> wrong in the kernel" only implies "we'll get kernel_capabilities wrong in
> the kernel (and elsewhere)",

Less sarcasm and hyperbole would be nice.

> > If you still feel strongly about this please let me know by Friday and
> > I will drop the idr[] array from this cycle. We can continue to
> > discuss a solution for the next cycle.
> 
> It already can't work as-is, 

As above, this is not true.

> I don't see how making it even more broken
> would help. IMO it doesn't seem like a good idea to be merging UAPI at all
> while it's still clearly incomplete and by its own definition unusable.

I agree that removing the idr would make it definitely unusable. There
is quite a bit of essential information in there the VMM
needs. However, the uAPI is extensible so adding a new field in the
next cycle is not breaking.

I'm trying to offer you an olive branch so we can have this discussion
with time instead of urgently at the 11th hour and derailing the
series. This exact topic has been discussed already two months ago,
and is, IMHO, mostly a style choice not an urgent technical matter.

If you have another option we can conclude in the next few days then
I'm happy to hear it.

If we focus only on the baseline functionality Nicolin can come up
with a list of fields and we can write it very explicitly in the kdoc
that the VMM can only read those fields.

I could also reluctantly agree to permanent masking to report only the
above kdoc bits in the kernel. Note that if you want this because you
don't trust the VMM authors to follow an explicit kdoc, then we likely
also cannot ever set some bits to 1. Eg BTM would be permanently 0.

Regards,
Jason
Jason Gunthorpe Nov. 7, 2024, 3:51 p.m. UTC | #14
On Wed, Nov 06, 2024 at 10:35:07PM -0400, Jason Gunthorpe wrote:

> I agree the kdoc does not describe what the baseline actually is.

Nicolin worked on this, here is a more detailed kdoc:

/**
 * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
 *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
 *
 * @flags: Must be set to 0
 * @__reserved: Must be 0
 * @idr: Implemented features for ARM SMMU Non-secure programming interface
 * @iidr: Information about the implementation and implementer of ARM SMMU,
 *        and architecture version supported
 * @aidr: ARM SMMU architecture version
 *
 * For the details of @idr, @iidr and @aidr, please refer to the chapters
 * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
 *
 * This reports the raw HW capability, and not all bits are meaningful to be
 * read by userspace. Only the following fields should be used:
 *
 * idr[0]: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
 * idr[1]: SIDSIZE, SSIDSIZE
 * idr[3]: BBML, RIL
 * idr[5]: VAX, GRAN64K, GRAN16K, GRAN4K
 *
 * - S1P should be assumed to be true if a NESTED HWPT can be created
 * - VFIO/iommufd only support platforms with COHACC, it should be assumed to be
 *   true.
 * - ATS is a per-device property. If the VMM describes any devices as ATS
 *   capable in ACPI/DT it should set the corresponding idr.
 *
 * This list may expand in future (eg E0PD, AIE, PBHA, D128, DS etc). It is
 * important that VMMs do not read bits outside the list to allow for
 * compatibility with future kernels. Several features in the SMMUv3
 * architecture are not currently supported by the kernel for nesting: HTTU,
 * BTM, MPAM and others.
 */

This focuses on stuff we can actually test and gives a path to test
and confirm a no-code update to future stuff.

The future list (E0PD/etc) reflects things the current kernel doesn't
use. Our naive read of the spec suggests they are probably fine to
just read the raw HW IDR. When someone implements guest kernel
support, does a detailed spec read, and crucially tests them - then we
can update the comment and have immediate support.

HTTU, BTM, and others like that will need additional bits outside the
IDR if someone wishes to enable them.

Jason
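
As an illustration of the "read only the listed fields" rule in the kdoc
above, a VMM could build its emulated ID registers by whitelisting rather
than passing the raw values through. The VSMMU_IDR*_ALLOWED masks below are
hypothetical and would need to be composed from the SMMUv3 spec's field
positions; nothing here is part of the series itself.

    /* Hypothetical whitelist masks built from the spec's IDR field positions */
    #define VSMMU_IDR0_ALLOWED (IDR0_ST_LEVEL | IDR0_TERM_MODEL | \
                                IDR0_STALL_MODEL | IDR0_TTENDIAN | \
                                IDR0_CD2L | IDR0_ASID16 | IDR0_TTF)
    #define VSMMU_IDR1_ALLOWED (IDR1_SIDSIZE | IDR1_SSIDSIZE)
    #define VSMMU_IDR3_ALLOWED (IDR3_BBML | IDR3_RIL)
    #define VSMMU_IDR5_ALLOWED (IDR5_VAX | IDR5_GRAN64K | IDR5_GRAN16K | \
                                IDR5_GRAN4K)

    vsmmu_idr[0] = smmu_info.idr[0] & VSMMU_IDR0_ALLOWED;
    vsmmu_idr[1] = smmu_info.idr[1] & VSMMU_IDR1_ALLOWED;
    vsmmu_idr[3] = smmu_info.idr[3] & VSMMU_IDR3_ALLOWED;
    vsmmu_idr[5] = smmu_info.idr[5] & VSMMU_IDR5_ALLOWED;
    /* everything else (HTTU, BTM, MPAM, ...) stays 0 until it is explicitly
     * supported and enumerated */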
Will Deacon Nov. 8, 2024, 2:53 p.m. UTC | #15
Hi guys,

On Wed, Nov 06, 2024 at 10:35:06PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote:
> > You should take "We discussed this already"
> > as more of a clue to yourself than to me - if 4 different people have all
> > said the exact same thing in so many words, perhaps there's something in
> > it...
> 
> And all seemed to agree it was not a big deal after the discussion.
> 
> I think Mostafa was driving in a direction that we break up the IDR
> into explicit fields and thus be explicit about what information the
> VMM is able to access. This would effectively document and enforce
> what the baseline is.

As one of the four people mentioned above, I figured I'd chime in with
my rationale for queuing this in case it's of any help or interest.

Initially, I was reasonably sure that we should be sanitising the ID
registers and being selective about what we advertise to userspace.
However, after Jason's reply to my comments, mulling it over in my head
and having lively conversations with Mostafa at lunchtime, I've come
full circle. Is it a great interface? Not at all. It's 8 register values
copied from the hardware to userspace. But is it good enough? I think it
is. It's also extremely simple (i.e. easy to explain what it does and
trivial to implement), which I think is a huge benefit given that the
IOMMUFD work around it is still evolving.

I'm firmly of the opinion that the VMM is going to need a tonne of help
from other sources to expose a virtual IOMMU successfully. For example,
anything relating to policy or configuration choices should be driven
from userspace rather than the kernel. If we start to expose policy in
the id registers for the cases where it happens to fit nicely, I fear
that this will backfire and the VMM will end up second-guessing the
kernel in cases where it decides it knows best. I'm not sold on the
analogy with CPU ID registers as (a) we don't have big/little idiocy to
deal with in the SMMU and (b) I don't think SMMU features always need
support code in the host, which is quite unlike many CPU features that
cannot be exposed safely without hypervisor context-switching support.

On top of all that, this interface can be extended if we change our
minds. If we decide we need to mask out fields, I think we could add
that after the fact. Hell, if we all decide that it's a disaster in a
few releases time, we can try something else. Ultimately, Jason is the
one maintaining IOMMUFD and he gets to deal with the problems if his
UAPI doesn't work out :)

So, given where we are in the cycle, I think the pragmatic thing to do
is to land this change now and enable the ongoing IOMMUFD work to
continue. We still have over two months to resolve any major problems
with the interface (and even unplug it entirely from the driver if we
really get stuck) but for now I think it's "fine".

Will
Will Deacon Nov. 8, 2024, 2:56 p.m. UTC | #16
On Tue, Nov 05, 2024 at 04:48:29PM +0000, Will Deacon wrote:
> On Fri, Nov 01, 2024 at 10:25:03AM -0300, Jason Gunthorpe wrote:
> > On Fri, Nov 01, 2024 at 12:18:50PM +0000, Will Deacon wrote:
> > > On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> > > > [This is now based on Nicolin's iommufd patches for vIOMMU and will need
> > > > to go through the iommufd tree, please ack]
> > > 
> > > Can't we separate out the SMMUv3 driver changes? They shouldn't depend on
> > > Nicolin's work afaict.
> > 
> > We can put everything before "iommu/arm-smmu-v3: Support
> > IOMMU_VIOMMU_ALLOC" directly on a rc and share a branch with your tree.
> > 
> > That patch and following all depend on Nicolin's work, as they rely on
> > the VIOMMU due to how different ARM is from Intel.
> > 
> > How about you take these patches:
> > 
> >  [PATCH v4 01/12] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
> >  [PATCH v4 02/12] ACPICA: IORT: Update for revision E.f
> >  [PATCH v4 03/12] ACPI/IORT: Support CANWBS memory access flag
> >  [PATCH v4 04/12] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
> >  [PATCH v4 05/12] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
> >  [PATCH v4 06/12] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
> >  [PATCH v4 07/12] iommu/arm-smmu-v3: Expose the arm_smmu_attach interface
> > 
> > Onto a branch.
> 
> I've pushed these onto a new branch in the IOMMU tree:
> 
> 	iommufd/arm-smmuv3-nested
> 
> However, please can you give it a day or two to get some exposure in
> -next before you merge that into iommufd? I'll ping back here later in
> the week.

Not seen anything go bang in -next, so I think we can consider this
branch stable now.

Will
Jason Gunthorpe Nov. 8, 2024, 3:42 p.m. UTC | #17
On Fri, Nov 08, 2024 at 02:53:22PM +0000, Will Deacon wrote:

> So, given where we are in the cycle, I think the pragmatic thing to do
> is to land this change now and enable the ongoing IOMMUFD work to
> continue. We still have over two months to resolve any major problems
> with the interface (and even unplug it entirely from the driver if we
> really get stuck) but for now I think it's "fine".

Thanks Will, this all eloquently captures my line of reasoning also.

I will add that after going through the exercise with Nicolin, to
write down all the fields the VMM is allowed to touch, it seems we do
have a list of future fields that likely do not require any host
support (E0PD, AIE, PBHA, D128, DS). We also have ones that will (BTM)
and then the fwspec weirdo of HTTU.

Given those ratios I feel pretty good about showing that data to the
userspace, expecting that down the road we will "turn it on" for
E0PD/etc without any kernel change, and BTM/HTTU will not use idr for
discovery.

The biggest risk is the VMMs badly mucking it up. I now think it was a
lazy shortcut to not enumerate all the fields in the kdoc. Now that
we've done that we see that the (unreviewed) RFC qemu patches did miss
some things, and that is being corrected.

Nicolin and I had some discussion if we should go ahead and include
E0PD/etc in the documentation right now, but we both felt some
uncertainty that we really understood those features deeply enough to
be confident, and have no way to test.

There is also the nanny option of zeroing everything not in the
approved list, but I feel like it is already so hard to write a VMM
vSMMUv3 that this is too nannyish. Hopefully time will not show that my
opinion of VMM authors is too high. :\

Jason
Jason Gunthorpe Nov. 12, 2024, 6:29 p.m. UTC | #18
On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> Jason Gunthorpe (7):
>   iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
>   iommu/arm-smmu-v3: Use S2FWB for NESTED domains
>   iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
> 
> Nicolin Chen (5):
>   iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
>   iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object

Applied to iommufd for-next along with all the dependencies and the
additional hunk Zhangfei pointed out.

Thanks,
Jason
Zhangfei Gao Nov. 13, 2024, 1:01 a.m. UTC | #19
On Wed, 13 Nov 2024 at 02:29, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> > Jason Gunthorpe (7):
> >   iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
> >   iommu/arm-smmu-v3: Use S2FWB for NESTED domains
> >   iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
> >
> > Nicolin Chen (5):
> >   iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
> >   iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
>
> Applied to iommufd for-next along with all the dependencies and the
> additional hunk Zhangfei pointed out.

Thanks Jason, I have verified on aarch64 based on your
jason/smmuv3_nesting branch

https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
https://github.com/Linaro/qemu/tree/6.12-wip

Still need this hack
https://github.com/Linaro/linux-kernel-uadk/commit/eaa194d954112cad4da7852e29343e546baf8683

One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
which you have patchset before.
The other is to temporarily ignore S2FWB or CANWBS.

Thanks
Jason Gunthorpe Nov. 13, 2024, 1:23 a.m. UTC | #20
On Wed, Nov 13, 2024 at 09:01:41AM +0800, Zhangfei Gao wrote:
> On Wed, 13 Nov 2024 at 02:29, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 09:20:44PM -0300, Jason Gunthorpe wrote:
> > > Jason Gunthorpe (7):
> > >   iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
> > >   iommu/arm-smmu-v3: Use S2FWB for NESTED domains
> > >   iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
> > >
> > > Nicolin Chen (5):
> > >   iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
> > >   iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
> >
> > Applied to iommufd for-next along with all the dependencies and the
> > additional hunk Zhangfei pointed out.
> 
> Thanks Jason, I have verified on aarch64 based on your
> jason/smmuv3_nesting branch

Great, thanks

> https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
> https://github.com/Linaro/qemu/tree/6.12-wip
> 
> Still need this hack
> https://github.com/Linaro/linux-kernel-uadk/commit/eaa194d954112cad4da7852e29343e546baf8683
>
> One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
> which you have patchset before.

Yes, I have a more complete version of that here someplace. Need some
help on vt-d but hope to get that done next cycle.

> The other is to temporarily ignore S2FWB or CANWBS.

Yes, ideally you should do that in the device FW, or perhaps via the
Linux ACPI patching?

Jason
Baolu Lu Nov. 13, 2024, 2:55 a.m. UTC | #21
On 11/13/24 09:23, Jason Gunthorpe wrote:
>> https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
>> https://github.com/Linaro/qemu/tree/6.12-wip
>>
>> Still need this hack
>> https://github.com/Linaro/linux-kernel-uadk/commit/ 
>> eaa194d954112cad4da7852e29343e546baf8683
>>
>> One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
>> which you have patchset before.
> Yes, I have a more complete version of that here someplace. Need some
> help on vt-d but hope to get that done next cycle.

Can you please elaborate on this a bit more? Are you talking about the
below change

+	ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_SVA);
+	if (ret)
+		return ret;

in iommufd_fault_iopf_enable()?

I have no idea about why SVA is affected when enabling iopf.

--
baolu
Zhangfei Gao Nov. 13, 2024, 6:28 a.m. UTC | #22
Hi, Baolu

On Wed, 13 Nov 2024 at 10:56, Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 11/13/24 09:23, Jason Gunthorpe wrote:
> >> https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
> >> https://github.com/Linaro/qemu/tree/6.12-wip
> >>
> >> Still need this hack
> >> https://github.com/Linaro/linux-kernel-uadk/commit/
> >> eaa194d954112cad4da7852e29343e546baf8683
> >>
> >> One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
> >> which you have patchset before.
> > Yes, I have a more complete version of that here someplace. Need some
> > help on vt-d but hope to get that done next cycle.
>
> Can you please elaborate this a bit more? Are you talking about below
> change
>
> +       ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_SVA);
> +       if (ret)
> +               return ret;
>
> in iommufd_fault_iopf_enable()?
>
> I have no idea about why SVA is affected when enabling iopf.

In drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c,
iommu_dev_enable_feature(IOMMU_DEV_FEAT_SVA) actually calls
iopf_queue_add_device(),
while iommu_dev_enable_feature(IOPF) only sets a flag.

arm_smmu_dev_enable_feature()
  case IOMMU_DEV_FEAT_SVA:
    arm_smmu_master_enable_sva(master)
      iopf_queue_add_device(master->smmu->evtq.iopf, dev);

Thanks
Jason Gunthorpe Nov. 13, 2024, 4:43 p.m. UTC | #23
On Wed, Nov 13, 2024 at 10:55:41AM +0800, Baolu Lu wrote:
> On 11/13/24 09:23, Jason Gunthorpe wrote:
> > > https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
> > > https://github.com/Linaro/qemu/tree/6.12-wip
> > > 
> > > Still need this hack
> > > https://github.com/Linaro/linux-kernel-uadk/commit/
> > > eaa194d954112cad4da7852e29343e546baf8683
> > > 
> > > One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
> > > which you have patchset before.
> > Yes, I have a more complete version of that here someplace. Need some
> > help on vt-d but hope to get that done next cycle.
> 
> Can you please elaborate this a bit more? Are you talking about below
> change

I need your help to remove IOMMU_DEV_FEAT_IOPF from the intel
driver. I have a patch series that eliminates it from all the other
drivers, and I wrote a patch to remove FEAT_SVA from intel..

> +	ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_SVA);
> +	if (ret)
> +		return ret;
> 
> in iommufd_fault_iopf_enable()?
> 
> I have no idea about why SVA is affected when enabling iopf.

It is ARM not implementing the API correctly. Only SVA turns on the
page fault reporting mechanism.

In the new world the page fault reporting should be managed during
domain attachment. If the domain is fault capable then faults should
be delivered to that domain. That is the correct time to setup the
iopf mechanism as well.
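
As a rough sketch (not the actual driver code; it glosses over locking,
detach, and the non-PCI cases), the ARM attach path would then look
something like:

/*
 * Illustrative only: enable the iopf machinery when a fault capable
 * domain is attached, rather than via IOMMU_DEV_FEAT_SVA.
 */
static int example_attach_dev(struct iommu_domain *domain, struct device *dev)
{
	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
	int ret;

	if (domain->iopf_handler) {
		/* Fault capable domain: hook the device to the evtq iopf queue */
		ret = iopf_queue_add_device(master->smmu->evtq.iopf, dev);
		if (ret)
			return ret;
	}

	ret = arm_smmu_attach_dev(domain, dev);	/* the existing attach logic */
	if (ret && domain->iopf_handler)
		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
	return ret;
}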

So I fixed that and now ARM and AMD both have no-op implementations of
IOMMU_DEV_FEAT_IOPF and IOMMU_DEV_FEAT_SVA. Thus I'd like to remove it
entirely.

Jason
Baolu Lu Nov. 14, 2024, 12:51 a.m. UTC | #24
On 11/14/24 00:43, Jason Gunthorpe wrote:
> On Wed, Nov 13, 2024 at 10:55:41AM +0800, Baolu Lu wrote:
>> On 11/13/24 09:23, Jason Gunthorpe wrote:
>>>> https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
>>>> https://github.com/Linaro/qemu/tree/6.12-wip
>>>>
>>>> Still need this hack
>>>> https://github.com/Linaro/linux-kernel-uadk/commit/
>>>> eaa194d954112cad4da7852e29343e546baf8683
>>>>
>>>> One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
>>>> which you have patchset before.
>>> Yes, I have a more complete version of that here someplace. Need some
>>> help on vt-d but hope to get that done next cycle.
>>
>> Can you please elaborate this a bit more? Are you talking about below
>> change
> 
> I need your help to remove IOMMU_DEV_FEAT_IOPF from the intel
> driver. I have a patch series that eliminates it from all the other
> drivers, and I wrote a patch to remove FEAT_SVA from intel..

Yes, sure. Let's make this happen in the next cycle.

FEAT_IOPF could be removed. IOPF manipulation can be handled in the
domain attachment path. A per-device refcount can be implemented. This
count increments with each iopf-capable domain attachment and decrements
with each detachment. PCI PRI is enabled for the first iopf-capable
domain and disabled when the last one is removed. Probably we can also
solve the PF/VF sharing PRI issue.

With iopf moved to the domain attach path and hardware capability checks
to the SVA domain allocation path, FEAT_SVA becomes essentially a no-op.
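
Something along these lines, purely as a sketch (the struct and helper
names are made up, the PRQ depth is a placeholder, and it ignores the
PF/VF sharing problem):

/*
 * Sketch only: per-device count of attached iopf-capable domains.
 * PRI is turned on for the first such domain and off after the last.
 */
struct dev_iopf_state {
	struct mutex lock;
	unsigned int nr_iopf_domains;
};

static int iopf_domain_attach(struct dev_iopf_state *st, struct pci_dev *pdev)
{
	int ret = 0;

	mutex_lock(&st->lock);
	if (st->nr_iopf_domains == 0)
		ret = pci_enable_pri(pdev, 32);	/* first user enables PRI */
	if (!ret)
		st->nr_iopf_domains++;
	mutex_unlock(&st->lock);
	return ret;
}

static void iopf_domain_detach(struct dev_iopf_state *st, struct pci_dev *pdev)
{
	mutex_lock(&st->lock);
	if (--st->nr_iopf_domains == 0)
		pci_disable_pri(pdev);		/* last user disables PRI */
	mutex_unlock(&st->lock);
}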

> 
>> +	ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_SVA);
>> +	if (ret)
>> +		return ret;
>>
>> in iommufd_fault_iopf_enable()?
>>
>> I have no idea about why SVA is affected when enabling iopf.
> 
> It is ARM not implementing the API correctly. Only SVA turns on the
> page fault reporting mechanism.
> 
> In the new world the page fault reporting should be managed during
> domain attachment. If the domain is fault capable then faults should
> be delivered to that domain. That is the correct time to setup the
> iopf mechanism as well.
> 
> So I fixed that and now ARM and AMD both have no-op implementations of
> IOMMU_DEV_FEAT_IOPF and IOMMU_DEV_FEAT_SVA. Thus I'd like to remove it
> entirely.

Thank you for the explanation.

--
baolu
Jason Gunthorpe Nov. 15, 2024, 5:55 p.m. UTC | #25
On Thu, Nov 14, 2024 at 08:51:47AM +0800, Baolu Lu wrote:
> On 11/14/24 00:43, Jason Gunthorpe wrote:
> > On Wed, Nov 13, 2024 at 10:55:41AM +0800, Baolu Lu wrote:
> > > On 11/13/24 09:23, Jason Gunthorpe wrote:
> > > > > https://github.com/Linaro/linux-kernel-uadk/tree/6.12-wip
> > > > > https://github.com/Linaro/qemu/tree/6.12-wip
> > > > > 
> > > > > Still need this hack
> > > > > https://github.com/Linaro/linux-kernel-uadk/commit/
> > > > > eaa194d954112cad4da7852e29343e546baf8683
> > > > > 
> > > > > One is adding iommu_dev_enable/disable_feature IOMMU_DEV_FEAT_SVA,
> > > > > which you have patchset before.
> > > > Yes, I have a more complete version of that here someplace. Need some
> > > > help on vt-d but hope to get that done next cycle.
> > > 
> > > Can you please elaborate this a bit more? Are you talking about below
> > > change
> > 
> > I need your help to remove IOMMU_DEV_FEAT_IOPF from the intel
> > driver. I have a patch series that eliminates it from all the other
> > drivers, and I wrote a patch to remove FEAT_SVA from intel..
> 
> Yes, sure. Let's make this happen in the next cycle.
>
> FEAT_IOPF could be removed. IOPF manipulation can be handled in the
> domain attachment path. A per-device refcount can be implemented. This
> count increments with each iopf-capable domain attachment and decrements
> with each detachment. PCI PRI is enabled for the first iopf-capable
> domain and disabled when the last one is removed. Probably we can also
> solve the PF/VF sharing PRI issue.

Here is what I have so far. If you send me a patch for vt-d to move
FEAT_IOPF into attach as you describe above (see what I did to arm for
an example), then I can send it next cycle:

https://github.com/jgunthorpe/linux/commits/iommu_no_feat/

Thanks,
Jason