mbox series

[RFC,0/5] KVM: guest_memfd: support for uffd missing

Message ID 20250303133011.44095-1-kalyazin@amazon.com
Headers show
Series KVM: guest_memfd: support for uffd missing | expand

Message

Nikita Kalyazin March 3, 2025, 1:30 p.m. UTC
This series is built on top of the v3 write syscall support [1].

With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace.  However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables.  In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.

This series proposes a limited support for userfaultfd in guest_memfd:
 - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
   (as is fault support in general)
 - Only `page missing` event is currently supported
 - Userspace is supposed to respond to the event with the `write`
   syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
   process.   Note that we can't use `UFFDIO_COPY` here because
   userfaulfd code does not know how to prepare guest_memfd pages, eg
   remove them from direct map [4].

Not included in this series:
 - Proper interface for userfaultfd to recognise guest_memfd mappings
 - Proper handling of truncation cases after locking the page

Request for comments:
 - Is it a sensible workflow for guest_memfd to resolve a userfault
   `page missing` event with `write` syscall + `UFFDIO_CONTINUE`?  One
   of the alternatives is teaching `UFFDIO_COPY` how to deal with
   guest_memfd pages.
 - What is a way forward to make userfaultfd code aware of guest_memfd?
   I saw that Patrick hit a somewhat similar problem in [5] when trying
   to use direct map manipulation functions in KVM and was pointed by
   David at Elliot's guestmem library [6] that might include a shim for that.
   Would the library be the right place to expose required interfaces like
   `vma_is_gmem`?

Nikita

[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#ma130b29c130dbdc894aa08d8d56c16ec383f36dd
[5] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@quicinc.com/T/

Nikita Kalyazin (5):
  KVM: guest_memfd: add kvm_gmem_vma_is_gmem
  KVM: guest_memfd: add support for uffd missing
  mm: userfaultfd: allow to register userfaultfd for guest_memfd
  mm: userfaultfd: support continue for guest_memfd
  KVM: selftests: add uffd missing test for guest_memfd

 include/linux/userfaultfd_k.h                 |  9 ++
 mm/userfaultfd.c                              | 23 ++++-
 .../testing/selftests/kvm/guest_memfd_test.c  | 88 +++++++++++++++++++
 virt/kvm/guest_memfd.c                        | 17 +++-
 virt/kvm/kvm_mm.h                             |  1 +
 5 files changed, 136 insertions(+), 2 deletions(-)


base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223

Comments

Peter Xu March 5, 2025, 8:29 p.m. UTC | #1
On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
> I think it might be useful to implement an fs-generic MINOR mode. The
> fault handler is already easy enough to do generically (though it
> would become more difficult to determine if the "MINOR" fault is
> actually a MISSING fault, but at least for my userspace, the
> distinction isn't important. :)) So the question becomes: what should
> UFFDIO_CONTINUE look like?
> 
> And I think it would be nice if UFFDIO_CONTINUE just called
> vm_ops->fault() to get the page we want to map and then mapped it,
> instead of having shmem-specific and hugetlb-specific versions (though
> maybe we need to keep the hugetlb specialization...). That would avoid
> putting kvm/gmem/etc. symbols in mm/userfaultfd code.
> 
> I've actually wanted to do this for a while but haven't had a good
> reason to pursue it. I wonder if it can be done in a
> backwards-compatible fashion...

Yes I also thought about that. :)

When Axel added minor fault, it's not a major concern as it's the only fs
that will consume the feature anyway in the do_fault() path - hugetlbfs has
its own path to take care of.. even until now.

And there's some valid points too if someone would argue to put it there
especially on folio lock - do that in shmem.c can avoid taking folio lock
when generating minor fault message.  It might make some difference when
the faults are heavy and when folio lock is frequently taken elsewhere too.

It might boil down to how many more FSes would support minor fault, and
whether we would care about such difference at last to shmem users. If gmem
is the only one after existing ones, IIUC there's still option we implement
it in gmem code.  After all, I expect the change should be very under
control (<20 LOCs?)..