| Message ID | cover.1736550979.git.nicolinc@nvidia.com |
|---|---|
| Series | iommu: Add MSI mapping support with nested SMMU |
On 1/29/25 4:04 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
>>>> or you are just mentioning it here because
>>>> it is still possible to make use of that. I think from previous discussions the
>>>> argument was to adopt a more dedicated MSI pass-through model which I
>>>> think is approach-2 here.
>>> The basic flow of the pass through model is shown in the last two
>>> patches, it is not fully complete but is testable. It assumes a single
>>> ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
>>> ITS page at the correct S2 location and then describe it in the ACPI
>>> as an ITS page, not a RMR.
>> This is a nice to have feature but not mandated in the first place,
>> is it?
> Not mandated. It just sort of happens because of the design. IMHO
> nothing should use it because there is no way for userspace to
> discover how many ITS pages there may be.
>
>>> This missing piece is cleaning up the ITS mapping to allow for
>>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
>>> a FD that holds the specific ITS pages instead of the
>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>> That's what I don't get: at the moment you only pass the gIOVA. With
>> technique 2, how can you build the nested mapping, ie.
>>
>>    S1            S2
>> gIOVA  ->  gDB  ->  hDB
>>
>> without passing the full gIOVA/gDB S1 mapping to the host?
> The nested S2 mapping is already setup before the VM boots:
>
> - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)

Ah OK. Your gDB has nothing to do with the actual S1 guest gDB, right?
It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
Is that correct?

In https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
I was passing both the gIOVA and the "true" gDB.

Eric

> - The ACPI tells the VM that the GIC has an ITS page at the S2's
>   address (hDB)
> - The VM sets up its S1 with a gIOVA that points to the S2's ITS
>   page (gDB). The S2 already has gDB -> hDB.
> - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
>   S2 are populated at this moment.
>
> If you have multiple ITS pages then the ACPI has to tell the guest GIC
> about them, what their gDB address is, and what devices use which ITS.
>
> Jason
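[For illustration, the single-ITS-page testing flow described above (the VMM picks the S2 address of the ITS page via IOMMU_OPTION_SW_MSI_START/SIZE, then advertises that address as the ITS in the guest's ACPI) could look roughly like the VMM-side sketch below. `struct iommu_option` and the IOMMU_OPTION ioctl are existing iommufd uAPI; the SW_MSI option IDs are introduced by this series, so their numeric values, the object_id scope, and the size unit used here are assumptions, not settled uAPI.]

```c
/*
 * Hypothetical VMM-side sketch: place the SW MSI (ITS) window at a fixed
 * S2 address (gDB) before attaching the device, using the IOMMU_OPTION
 * ioctl.  The SW_MSI option IDs are added by this series, so the fallback
 * values below are assumptions.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

#ifndef IOMMU_OPTION_SW_MSI_START
#define IOMMU_OPTION_SW_MSI_START	2	/* assumed ID from the posted series */
#define IOMMU_OPTION_SW_MSI_SIZE	3	/* assumed ID from the posted series */
#endif

static int set_sw_msi_window(int iommufd, uint64_t gdb, uint64_t size_mb)
{
	struct iommu_option opt = {
		.size = sizeof(opt),
		.op = IOMMU_OPTION_OP_SET,
		.object_id = 0,	/* global (per-iommufd) option, as assumed from the series */
	};

	opt.option_id = IOMMU_OPTION_SW_MSI_START;
	opt.val64 = gdb;	/* S2 address the guest will see as its ITS page */
	if (ioctl(iommufd, IOMMU_OPTION, &opt))
		return -1;

	opt.option_id = IOMMU_OPTION_SW_MSI_SIZE;
	opt.val64 = size_mb;	/* window size; unit (MB) assumed from the series */
	return ioctl(iommufd, IOMMU_OPTION, &opt);
}
```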
On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
> >>> This missing piece is cleaning up the ITS mapping to allow for
> >>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
> >>> a FD that holds the specific ITS pages instead of the
> >>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
> >> That's what I don't get: at the moment you only pass the gIOVA. With
> >> technique 2, how can you build the nested mapping, ie.
> >>
> >>    S1            S2
> >> gIOVA  ->  gDB  ->  hDB
> >>
> >> without passing the full gIOVA/gDB S1 mapping to the host?
> > The nested S2 mapping is already setup before the VM boots:
> >
> > - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
> right?

I'm not totally sure what you mean by gDB? The above diagram suggests
it is the ITS page address in the S2? Ie the guest physical address of
the ITS.

Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
provide the gDB address as the phys_addr_t msi_addr. This happens
because the GIC driver will have been informed of the ITS page at the
gDB address, and it will use iommu_dma_prepare_msi(). Exactly the same
as bare metal.

> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
> Is that correct?

Yes, for a single ITS page it will reliably be put at sw_msi_start.
Since the VMM can provide sw_msi_start through the OPTION, the VMM can
place the ITS page where it wants and then program the ACPI to tell
the VM to call iommu_dma_prepare_msi().

(don't use this flow, it doesn't work for multi ITS, for testing only)

> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
> I was passing both the gIOVA and the "true" gDB
>
> Eric

If I understand this right, it still had the hypervisor dynamically
setting up the S2, whereas here it is pre-set and static?

Jason
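[To make the in-VM side concrete: the flow relies on the same iommu_dma_prepare_msi() / iommu_dma_compose_msi_msg() pair that an ITS-style MSI controller driver already uses on bare metal. The sketch below is a simplified, hypothetical driver fragment, not the actual GIC ITS code; the function names and `doorbell_phys` are placeholders, and the header that declares these helpers has moved between kernel versions.]

```c
/*
 * Hypothetical MSI-controller driver fragment showing where the doorbell
 * physical address enters the IOMMU layer.  Inside a VM, doorbell_phys is
 * the gDB that the ACPI tables advertised; on bare metal it is the real
 * ITS page (hDB).
 */
#include <linux/iommu.h>	/* iommu_dma_prepare_msi(), iommu_dma_compose_msi_msg() */
#include <linux/kernel.h>
#include <linux/msi.h>

static int example_prepare_msi(struct msi_desc *desc, phys_addr_t doorbell_phys)
{
	/*
	 * Maps (or reserves) an IOVA for the doorbell page in the device's
	 * DMA domain - the gIOVA under nesting.  Effectively a no-op for
	 * identity/RMR setups.
	 */
	return iommu_dma_prepare_msi(desc, doorbell_phys);
}

static void example_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg,
				    phys_addr_t doorbell_phys, u32 event_id)
{
	/* The driver fills in the doorbell address/data it knows about ... */
	msg->address_hi = upper_32_bits(doorbell_phys);
	msg->address_lo = lower_32_bits(doorbell_phys);
	msg->data = event_id;

	/* ... and the IOMMU layer rewrites the address to the mapped IOVA. */
	iommu_dma_compose_msi_msg(desc, msg);
}
```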
Hi Nicolin,

On Fri, 10 Jan 2025 19:32:16 -0800 Nicolin Chen <nicolinc@nvidia.com> wrote:

> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called the
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a.
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page:
>   IOVA (0xFFFF0000) ===> PA (0x20200000)
> When a 2-stage translation is enabled, an IOVA is still used to
> program the MSI address, though the mappings will be in two stages
> (IPA stands for Intermediate Physical Address):
>   IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
>
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the IOVA is dynamically allocated from the top of the IOVA space. If
> attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough
> device), the IOVA is fixed to an MSI window reported by the IOMMU
> driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE
> (IOVA==0x8000000) for ARM IOMMUs.
>
> So far, this IOMMU_RESV_SW_MSI works well as the kernel is entirely in
> charge of the IOMMU translation (1-stage translation), since the IOVA
> for the ITS page is fixed and known by the kernel. However, with a
> virtual machine enabling a nested IOMMU translation (2-stage), a guest
> kernel directly controls the stage-1 translation with an
> IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA 0x80900000) onto its
> own IOVA space (e.g. 0xEEEE0000). Then, the host kernel can't know that
> guest-level IOVA to program the MSI address.
>
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. The VMM could insert a
> few RMRs (Reserved Memory Regions) in the guest's IORT. Then the guest
> kernel would fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> Eventually, the mappings would look like:
>   IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
> This requires an IOMMUFD ioctl for the kernel and the VMM to agree on
> the IPA.

Should this RMR be in a separate range from MSI_IOVA_BASE? The guest
will have MSI_IOVA_BASE in a reserved region already, no? e.g.

  # cat /sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
  0x0000000008000000 0x00000000080fffff msi
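[For context on that sysfs output: the `msi` line is the driver-reported IOMMU_RESV_SW_MSI window, which the ARM SMMU drivers hardwire to MSI_IOVA_BASE/MSI_IOVA_LENGTH. The sketch below is along the lines of what those drivers do, but it is not verbatim upstream code, and the iommu_alloc_resv_region() signature has changed slightly across kernel versions.]

```c
#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/slab.h>

/* These match the values the ARM SMMU drivers use for their SW MSI window */
#define MSI_IOVA_BASE	0x8000000
#define MSI_IOVA_LENGTH	0x100000

static void example_get_resv_regions(struct device *dev, struct list_head *head)
{
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
	struct iommu_resv_region *region;

	/*
	 * Report the software-managed MSI window; this is what shows up as
	 * "0x0000000008000000 0x00000000080fffff msi" in the sysfs
	 * reserved_regions file quoted above.
	 */
	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
					 prot, IOMMU_RESV_SW_MSI, GFP_KERNEL);
	if (!region)
		return;
	list_add_tail(&region->list, head);

	/* The real drivers also add firmware-described regions (RMRs, etc.) here. */
}
```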
On Wed, Feb 05, 2025 at 02:49:04PM -0800, Jacob Pan wrote:
> > There have been two approaches to solve this problem:
> > 1. Create an identity mapping in the stage-1. The VMM could insert a
> > few RMRs (Reserved Memory Regions) in the guest's IORT. Then the guest
> > kernel would fetch these RMR entries from the IORT and create an
> > IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> > Eventually, the mappings would look like:
> >   IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
> > This requires an IOMMUFD ioctl for the kernel and the VMM to agree on
> > the IPA.
>
> Should this RMR be in a separate range from MSI_IOVA_BASE? The guest
> will have MSI_IOVA_BASE in a reserved region already, no? e.g.
>
>   # cat /sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
>   0x0000000008000000 0x00000000080fffff msi

No. In Patch-9, the driver-defined MSI_IOVA_BASE will be ignored if
userspace has assigned IOMMU_OPTION_SW_MSI_START/SIZE, even if they
happen to have the same values as the MSI_IOVA_BASE window.

The idea of MSI_IOVA_BASE in this series is a kernel default that is
only effective when user space doesn't care to set anything.

Thanks
Nicolin
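[A purely hypothetical sketch of the precedence described here, not the actual Patch-9 code; the names and structure are invented for illustration. The point is simply that a userspace-set window takes priority over the driver-reported default.]

```c
#include <linux/types.h>

/* Invented structure, for illustration only */
struct sw_msi_window {
	unsigned long start;
	unsigned long size;
	bool user_set;		/* set once IOMMU_OPTION_SW_MSI_START/SIZE was used */
};

/*
 * Pick the SW MSI window: a userspace choice made through the OPTION
 * ioctl overrides the driver default (MSI_IOVA_BASE on ARM IOMMUs),
 * even if both happen to carry the same values.
 */
static void resolve_sw_msi_window(struct sw_msi_window *win,
				  unsigned long drv_start, unsigned long drv_size)
{
	if (win->user_set)
		return;			/* userspace wins */

	win->start = drv_start;		/* kernel default, e.g. MSI_IOVA_BASE */
	win->size = drv_size;		/* e.g. MSI_IOVA_LENGTH */
}
```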