Message ID | 20250220161320.518450-3-jean-philippe@linaro.org
---|---
State | New
Series | arm: Add DMA remapping for CCA
On 20.02.25 17:13, Jean-Philippe Brucker wrote:
> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
> notifiers and create e.g. VFIO mappings. The default VFIO discard
> notifier isn't sufficient for CCA because the DMA addresses need a
> translation (even without vIOMMU).
>
> At the moment:
> * guest_memfd_state_change() calls the populate() notifier
> * the populate() notifier calls IOMMU notifiers
> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
> * it calls ram_discard_manager_is_populated() which fails.
>
> guest_memfd_state_change() only changes the section's state after
> calling the populate() notifier. We can't easily invert the order of
> operations because it uses the old state bitmap to know which pages need
> the populate() notifier.

I assume we're talking about this code: [1]

[1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com

+static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+                                         uint64_t size, bool shared_to_private)
+{
+    int block_size = memory_attribute_manager_get_block_size(mgr);
+    int ret = 0;
+
+    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+
+    if (shared_to_private) {
+        memory_attribute_notify_discard(mgr, offset, size);
+    } else {
+        ret = memory_attribute_notify_populate(mgr, offset, size);
+    }
+
+    if (!ret) {
+        unsigned long first_bit = offset / block_size;
+        unsigned long nbits = size / block_size;
+
+        g_assert((first_bit + nbits) <= mgr->bitmap_size);
+
+        if (shared_to_private) {
+            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+        } else {
+            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+        }
+
+        return 0;
+    }
+
+    return ret;
+}

Then, in memory_attribute_notify_populate(), we walk the bitmap again.

Why?

We just checked that it's all in the expected state, no?

virtio-mem doesn't handle it that way, so I'm curious why we would have
to do it here?

>
> For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
> that we're aware of the RAM discard manager state.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>
> Definitely not the prettiest hack, any idea how to do this cleanly?
> ---
>  include/exec/memory.h | 5 +++++
>  system/memory.c       | 3 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 9f73b59867..6fcd98fe58 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -116,6 +116,11 @@ typedef enum {
>      IOMMU_RO = 1,
>      IOMMU_WO = 2,
>      IOMMU_RW = 3,
> +    /*
> +     * Allow mapping a discarded page, because we're in the process of
> +     * populating it.
> +     */
> +    IOMMU_POPULATING = 4,
>  } IOMMUAccessFlags;
>
>  #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
>
> diff --git a/system/memory.c b/system/memory.c
> index 4c829793a0..8e884f9c15 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>           * Disallow that. vmstate priorities make sure any RamDiscardManager
>           * were already restored before IOMMUs are restored.
>           */
> -        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
> +        if (!(iotlb->perm & IOMMU_POPULATING) &&
> +            !ram_discard_manager_is_populated(rdm, &tmp)) {
>              error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
>                         " via virtio-mem): %" HWADDR_PRIx "",
>                         iotlb->translated_addr);
On 2/21/2025 3:39 AM, David Hildenbrand wrote:
> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>> notifier isn't sufficient for CCA because the DMA addresses need a
>> translation (even without vIOMMU).
>>
>> At the moment:
>> * guest_memfd_state_change() calls the populate() notifier
>> * the populate() notifier calls IOMMU notifiers
>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>> * it calls ram_discard_manager_is_populated() which fails.
>>
>> guest_memfd_state_change() only changes the section's state after
>> calling the populate() notifier. We can't easily invert the order of
>> operations because it uses the old state bitmap to know which pages need
>> the populate() notifier.
>
> I assume we're talking about this code: [1]
>
> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
>
> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
> +                                         uint64_t size, bool shared_to_private)
> +{
> +    int block_size = memory_attribute_manager_get_block_size(mgr);
> +    int ret = 0;
> +
> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
> +                     __func__, offset, size);
> +        return -1;
> +    }
> +
> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
> +        return 0;
> +    }
> +
> +    if (shared_to_private) {
> +        memory_attribute_notify_discard(mgr, offset, size);
> +    } else {
> +        ret = memory_attribute_notify_populate(mgr, offset, size);
> +    }
> +
> +    if (!ret) {
> +        unsigned long first_bit = offset / block_size;
> +        unsigned long nbits = size / block_size;
> +
> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
> +
> +        if (shared_to_private) {
> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> +        } else {
> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
> +        }
> +
> +        return 0;
> +    }
> +
> +    return ret;
> +}
>
> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>
> Why?
>
> We just checked that it's all in the expected state, no?
>
> virtio-mem doesn't handle it that way, so I'm curious why we would have
> to do it here?

I was concerned about the case where the guest issues a request where
only part of the range is in the desired state.
I think the main problem is the policy for the guest conversion request.
My current handling is:

1. When a conversion request is made for a range already in the desired
state, the helper simply returns success.
2. For requests involving a range partially in the desired state, only
the necessary segments are converted, so that the entire range ends up
complying with the request.
3. In scenarios where a conversion request is declined by other systems,
such as a failure from VFIO during notify_populate(), the helper will
roll back the request, maintaining consistency.

And the policy of virtio-mem is to refuse the state change if not all
blocks are in the opposite state.

Actually, this part is still uncertain to me.

BTW, per the status/bitmap tracking, virtio-mem also changes the bitmap
after the plug/unplug notifier. This is the same, correct?

>
>>
>> For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
>> that we're aware of the RAM discard manager state.
>>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> ---
>>
>> Definitely not the prettiest hack, any idea how to do this cleanly?
>> ---
>>   include/exec/memory.h | 5 +++++
>>   system/memory.c       | 3 ++-
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 9f73b59867..6fcd98fe58 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -116,6 +116,11 @@ typedef enum {
>>       IOMMU_RO = 1,
>>       IOMMU_WO = 2,
>>       IOMMU_RW = 3,
>> +    /*
>> +     * Allow mapping a discarded page, because we're in the process of
>> +     * populating it.
>> +     */
>> +    IOMMU_POPULATING = 4,
>>   } IOMMUAccessFlags;
>>
>>   #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
>>
>> diff --git a/system/memory.c b/system/memory.c
>> index 4c829793a0..8e884f9c15 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>>            * Disallow that. vmstate priorities make sure any RamDiscardManager
>>            * were already restored before IOMMUs are restored.
>>            */
>> -        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
>> +        if (!(iotlb->perm & IOMMU_POPULATING) &&
>> +            !ram_discard_manager_is_populated(rdm, &tmp)) {
>>              error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
>>                         " via virtio-mem): %" HWADDR_PRIx "",
>>                         iotlb->translated_addr);
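A rough sketch of what points 2 and 3 above imply for the populate path is
below. The per-block walk is also one answer to the "why walk the bitmap
again" question: in a partially shared range, the notifier must skip the
blocks that are already shared, and on failure it must know which blocks to
undo. populate_one_block() and discard_one_block() are hypothetical
stand-ins for invoking the RamDiscardListener hooks, not names from the
posted patch:

/*
 * Sketch only: a populate path that handles partially shared ranges
 * and rolls back on failure, assuming the bitmap is updated by the
 * caller afterwards (as in memory_attribute_state_change() above).
 * populate_one_block()/discard_one_block() are hypothetical helpers
 * that would invoke the RamDiscardListener notifiers for one block.
 */
static int populate_range_with_rollback(MemoryAttributeManager *mgr,
                                        uint64_t offset, uint64_t size)
{
    int block_size = memory_attribute_manager_get_block_size(mgr);
    uint64_t cur;
    int ret = 0;

    for (cur = offset; cur < offset + size; cur += block_size) {
        /* Blocks already marked shared were populated earlier; skip them. */
        if (memory_attribute_is_range_populated(mgr, cur, block_size)) {
            continue;
        }
        ret = populate_one_block(mgr, cur, block_size);
        if (ret) {
            break;
        }
    }

    if (ret) {
        /* Undo only the blocks this request populated (point 3). */
        for (uint64_t undo = offset; undo < cur; undo += block_size) {
            if (!memory_attribute_is_range_populated(mgr, undo, block_size)) {
                discard_one_block(mgr, undo, block_size);
            }
        }
    }
    return ret;
}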
On 21.02.25 03:25, Chenyi Qiang wrote:
>
> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>> translation (even without vIOMMU).
>>>
>>> At the moment:
>>> * guest_memfd_state_change() calls the populate() notifier
>>> * the populate() notifier calls IOMMU notifiers
>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>> * it calls ram_discard_manager_is_populated() which fails.
>>>
>>> guest_memfd_state_change() only changes the section's state after
>>> calling the populate() notifier. We can't easily invert the order of
>>> operations because it uses the old state bitmap to know which pages need
>>> the populate() notifier.
>>
>> I assume we're talking about this code: [1]
>>
>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
>>
>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
>> +                                         uint64_t size, bool shared_to_private)
>> +{
>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>> +    int ret = 0;
>> +
>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -1;
>> +    }
>> +
>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
>> +        return 0;
>> +    }
>> +
>> +    if (shared_to_private) {
>> +        memory_attribute_notify_discard(mgr, offset, size);
>> +    } else {
>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>> +    }
>> +
>> +    if (!ret) {
>> +        unsigned long first_bit = offset / block_size;
>> +        unsigned long nbits = size / block_size;
>> +
>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>> +
>> +        if (shared_to_private) {
>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>> +        } else {
>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>> +        }
>> +
>> +        return 0;
>> +    }
>> +
>> +    return ret;
>> +}
>>
>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>
>> Why?
>>
>> We just checked that it's all in the expected state, no?
>>
>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>> to do it here?
>
> I was concerned about the case where the guest issues a request where
> only part of the range is in the desired state.
> I think the main problem is the policy for the guest conversion request.
> My current handling is:
>
> 1. When a conversion request is made for a range already in the desired
> state, the helper simply returns success.

Yes.

> 2. For requests involving a range partially in the desired state, only
> the necessary segments are converted, so that the entire range ends up
> complying with the request.

Ah, now I get:

+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+

We're not failing if it might already partially be in the other state.

> 3. In scenarios where a conversion request is declined by other systems,
> such as a failure from VFIO during notify_populate(), the helper will
> roll back the request, maintaining consistency.
>
> And the policy of virtio-mem is to refuse the state change if not all
> blocks are in the opposite state.

Yes.

>
> Actually, this part is still uncertain to me.
>

IIUC, the problem does not exist if we only convert a single page at a time.

Is there a known use case where such partial conversions could happen?

> BTW, per the status/bitmap tracking, virtio-mem also changes the bitmap
> after the plug/unplug notifier. This is the same, correct?

Right. But because we reject these partial requests, we don't have to
traverse the bitmap and could just adjust the bitmap operations.
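As a concrete sketch of that alternative, reusing the helper names from the
quoted patch, the state change could refuse mixed ranges up front and then
update the bitmap with a single call; the -EBUSY policy for partial
requests is an assumption here, since how the spec treats such requests is
discussed further below:

/*
 * Sketch only: virtio-mem-style all-or-nothing state change. A request
 * is accepted only if the whole range is in the opposite state, so the
 * notifiers never see mixed ranges and no extra bitmap walk is needed.
 */
static int memory_attribute_state_change_strict(MemoryAttributeManager *mgr,
                                                uint64_t offset, uint64_t size,
                                                bool shared_to_private)
{
    int block_size = memory_attribute_manager_get_block_size(mgr);
    unsigned long first_bit = offset / block_size;
    unsigned long nbits = size / block_size;
    int ret = 0;

    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
        return -1;
    }

    /* Already fully in the requested state: nothing to do. */
    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
        return 0;
    }

    /* Mixed range: refuse instead of converting piecemeal. */
    if ((shared_to_private && !memory_attribute_is_range_populated(mgr, offset, size)) ||
        (!shared_to_private && !memory_attribute_is_range_discarded(mgr, offset, size))) {
        return -EBUSY; /* assumed policy: treat as a guest error */
    }

    if (shared_to_private) {
        memory_attribute_notify_discard(mgr, offset, size);
        bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
    } else {
        ret = memory_attribute_notify_populate(mgr, offset, size);
        if (!ret) {
            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
        }
    }
    return ret;
}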
On 2/21/2025 4:09 PM, David Hildenbrand wrote:
> On 21.02.25 03:25, Chenyi Qiang wrote:
>>
>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>> translation (even without vIOMMU).
>>>>
>>>> At the moment:
>>>> * guest_memfd_state_change() calls the populate() notifier
>>>> * the populate() notifier calls IOMMU notifiers
>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>
>>>> guest_memfd_state_change() only changes the section's state after
>>>> calling the populate() notifier. We can't easily invert the order of
>>>> operations because it uses the old state bitmap to know which pages need
>>>> the populate() notifier.
>>>
>>> I assume we're talking about this code: [1]
>>>
>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
>>>
>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
>>> +                                         uint64_t size, bool shared_to_private)
>>> +{
>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>> +    int ret = 0;
>>> +
>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> +                     __func__, offset, size);
>>> +        return -1;
>>> +    }
>>> +
>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (shared_to_private) {
>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>> +    } else {
>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>> +    }
>>> +
>>> +    if (!ret) {
>>> +        unsigned long first_bit = offset / block_size;
>>> +        unsigned long nbits = size / block_size;
>>> +
>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>> +
>>> +        if (shared_to_private) {
>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>> +        } else {
>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>> +        }
>>> +
>>> +        return 0;
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>>
>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>
>>> Why?
>>>
>>> We just checked that it's all in the expected state, no?
>>>
>>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>>> to do it here?
>>
>> I was concerned about the case where the guest issues a request where
>> only part of the range is in the desired state.
>> I think the main problem is the policy for the guest conversion request.
>> My current handling is:
>>
>> 1. When a conversion request is made for a range already in the desired
>> state, the helper simply returns success.
>
> Yes.
>
>> 2. For requests involving a range partially in the desired state, only
>> the necessary segments are converted, so that the entire range ends up
>> complying with the request.
>
> Ah, now I get:
>
> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
> +        return 0;
> +    }
> +
>
> We're not failing if it might already partially be in the other state.
>
>> 3. In scenarios where a conversion request is declined by other systems,
>> such as a failure from VFIO during notify_populate(), the helper will
>> roll back the request, maintaining consistency.
>>
>> And the policy of virtio-mem is to refuse the state change if not all
>> blocks are in the opposite state.
>
> Yes.
>
>>
>> Actually, this part is still uncertain to me.
>>
>
> IIUC, the problem does not exist if we only convert a single page at a time.
>
> Is there a known use case where such partial conversions could happen?

I don't see such a case yet. Actually, I'm trying to follow the behavior
of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM,
it doesn't reject the request if the whole range isn't in the opposite
state. It just uses xa_store() to update it. Also, I don't see the spec
say how to handle such a case. To be robust, I just allow this special
case.

>
>> BTW, per the status/bitmap tracking, virtio-mem also changes the bitmap
>> after the plug/unplug notifier. This is the same, correct?
>
> Right. But because we reject these partial requests, we don't have to
> traverse the bitmap and could just adjust the bitmap operations.

Yes, if we treat it as a guest error/bug, we can adjust it.
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9f73b59867..6fcd98fe58 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -116,6 +116,11 @@ typedef enum {
     IOMMU_RO = 1,
     IOMMU_WO = 2,
     IOMMU_RW = 3,
+    /*
+     * Allow mapping a discarded page, because we're in the process of
+     * populating it.
+     */
+    IOMMU_POPULATING = 4,
 } IOMMUAccessFlags;
 
 #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
diff --git a/system/memory.c b/system/memory.c
index 4c829793a0..8e884f9c15 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
          * Disallow that. vmstate priorities make sure any RamDiscardManager
          * were already restored before IOMMUs are restored.
          */
-        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
+        if (!(iotlb->perm & IOMMU_POPULATING) &&
+            !ram_discard_manager_is_populated(rdm, &tmp)) {
             error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
                        " via virtio-mem): %" HWADDR_PRIx "",
                        iotlb->translated_addr);
For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
notifiers and create e.g. VFIO mappings. The default VFIO discard
notifier isn't sufficient for CCA because the DMA addresses need a
translation (even without vIOMMU).

At the moment:
* guest_memfd_state_change() calls the populate() notifier
* the populate() notifier calls IOMMU notifiers
* the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
* it calls ram_discard_manager_is_populated() which fails.

guest_memfd_state_change() only changes the section's state after
calling the populate() notifier. We can't easily invert the order of
operations because it uses the old state bitmap to know which pages need
the populate() notifier.

For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
that we're aware of the RAM discard manager state.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---

Definitely not the prettiest hack, any idea how to do this cleanly?
---
 include/exec/memory.h | 5 +++++
 system/memory.c       | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)
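For context on how the flag would be consumed, a populate handler in the
CCA discard listener might tag the entries it maps along these lines. This
is only a sketch against the patch above: cca_dma_translate() and
cca_notify_iommu_map() are made-up placeholders for the CCA-specific
address translation and the IOMMU notifier invocation; only
IOMMU_POPULATING and the memory_get_xlat_addr() change come from the patch.

/*
 * Sketch only: a guest_memfd populate notifier that maps the newly
 * shared section for DMA. iotlb->perm carries IOMMU_POPULATING so
 * memory_get_xlat_addr() skips ram_discard_manager_is_populated(),
 * which would still report "discarded" here because the state bitmap
 * is only updated after this notifier returns.
 */
static int cca_rdm_notify_populate(RamDiscardListener *rdl,
                                   MemoryRegionSection *section)
{
    IOMMUTLBEntry entry = {
        .iova            = section->offset_within_address_space,
        .translated_addr = cca_dma_translate(section),  /* placeholder */
        .addr_mask       = int128_get64(section->size) - 1,
        .perm            = IOMMU_RW | IOMMU_POPULATING,
    };

    /* Placeholder for feeding the entry to the IOMMU notifiers
     * (e.g. VFIO's), which end up in memory_get_xlat_addr(). */
    return cca_notify_iommu_map(rdl, &entry);
}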