diff mbox series

[v7,3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private

Message ID 20250328153133.3504118-4-tabba@google.com
State New
Headers show
Series KVM: Restricted mapping of guest_memfd at the host and arm64 support | expand

Commit Message

Fuad Tabba March 28, 2025, 3:31 p.m. UTC
From: Ackerley Tng <ackerleytng@google.com>

Track guest_memfd folio sharing state within the inode, since it is a
property of the guest_memfd's memory contents.

The guest_memfd PRIVATE memory attribute is not used for two reasons. It
reflects the userspace expectation for the memory state, and therefore
can be toggled by userspace. Also, although each guest_memfd file has a
1:1 binding with a KVM instance, the plan is to allow multiple files per
inode, e.g. to allow intra-host migration to a new KVM instance, without
destroying guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 58 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 5 deletions(-)

Comments

Sean Christopherson April 2, 2025, 11:56 p.m. UTC | #1
On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index ac6b8853699d..cde16ed3b230 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -17,6 +17,18 @@ struct kvm_gmem {
> >  	struct list_head entry;
> >  };
> >  
> > +struct kvm_gmem_inode_private {
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +	struct xarray shared_offsets;
> > +	rwlock_t offsets_lock;
> 
> This lock doesn't work, either that or this lock can't be held while
> faulting, because holding this lock means we can't sleep, and we need to
> sleep to allocate.

rwlock_t is a variant of a spinlock, which can't be held when sleeping.

What exactly does offsets_lock protect, and what are the rules for holding it?
At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
for a private gfn that is being converted to shared.  KVM doesn't hold references
to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
shared_offsets let alone take offsets_lock.

guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
the shared/private conversion code.

In general, this series is unreviewable as many of the APIs have no callers,
which makes it impossible to review the safety/correctness of locking.  Ditto
for all the stubs that are guarded by CONFIG_KVM_GMEM_SHARED_MEM.  

Lack of uAPI also makes this series unreviewable.  The tracking is done on the
guest_memfd inode, but it looks like the uAPI to toggle private/shared is going
to be attached to the VM.  Why?  That just adds extra locks and complexity to
ensure the memslots are stable.

Lastly, this series is at v7, but to be very blunt, it looks RFC quality to my
eyes.  On top of the fact that it's practically impossible to review, it doesn't
even compile on x86 when CONFIG_KVM=m.

  mm/swap.c:110:(.text+0x18ba): undefined reference to `kvm_gmem_handle_folio_put'
  ERROR: modpost: "security_inode_init_security_anon" [arch/x86/kvm/kvm.ko] undefined!

The latter can be solved with an EXPORT, but calling module code from core mm/
is a complete non-starter.

Maybe much of this has been discussed elsewhere and there's a grand
plan for how all of this will shake out.  I haven't been closely following
guest_memfd development due to lack of cycles, and unfortunately I don't expect
that to change in the near future.  I am more than happy to let y'all do the heavy
lifting, but when you submit something for inclusion (i.e. not RFC), then it needs
to actually be ready for inclusion.

I would much, much prefer one large series that shows the full picture than a
mish mash of partial series that I can't actually review, even if the big series
is 100+ patches (hopefully not).
Sean Christopherson April 3, 2025, 2:51 p.m. UTC | #2
On Thu, Apr 03, 2025, Fuad Tabba wrote:
> On Thu, 3 Apr 2025 at 00:56, Sean Christopherson <seanjc@google.com> wrote:
> > On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > > index ac6b8853699d..cde16ed3b230 100644
> > > > --- a/virt/kvm/guest_memfd.c
> > > > +++ b/virt/kvm/guest_memfd.c
> > > > @@ -17,6 +17,18 @@ struct kvm_gmem {
> > > >     struct list_head entry;
> > > >  };
> > > >
> > > > +struct kvm_gmem_inode_private {
> > > > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > +   struct xarray shared_offsets;
> > > > +   rwlock_t offsets_lock;
> > >
> > > This lock doesn't work, either that or this lock can't be held while
> > > faulting, because holding this lock means we can't sleep, and we need to
> > > sleep to allocate.
> >
> > rwlock_t is a variant of a spinlock, which can't be held when sleeping.
> >
> > What exactly does offsets_lock protect, and what are the rules for holding it?
> > At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
> > for a private gfn that is being converted to shared.  KVM doesn't hold references
> > to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
> > shared_offsets let alone take offsets_lock.
> 
> You're right about the rwlock_t. The goal of the offsets_lock is to
> protect the shared offsets -- i.e., it's just meant to protect the
> SHARED/PRIVATE status of a folio, not more, hence why it's not checked
> in kvm_gmem_get_pfn(). It used to be protected by the
> filemap_invalidate_lock, but the problem is that it would be called
> from an interrupt context.
> 
> However, this is wrong, as you've pointed out. The purpose of the
> locking is to ensure that no two conversions of the same folio happen
> at the same time. An alternative I had written up is to rely on having
> exclusive access to the folio, since the sharing state is tied to the
> folio. That could be achieved either by acquiring the folio lock, or by
> ensuring that the folio has no outstanding references, which would
> indicate that we have exclusive access to it. This would avoid the
> whole locking issue.
> 
> > ... Something needs to prevent KVM from installing a mapping
> > for a private gfn that is being converted to shared.  ...
> 
> > guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
> > kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
> > the shared/private conversion code.
> 
> For in-place sharing, KVM can install a mapping for a SHARED gfn. What
> it cannot do is install a mapping for a transient (i.e., NONE) gfn. We
> don't rely on kvm_gmem_get_pfn() for that, but on the individual KVM
> MMU fault handlers. That said...

Consumption of shared/private physical pages _must_ be enforced by guest_memfd.
The private vs. shared state in the MMU handlers is the VM's view of the world
and its desired state.  The guest_memfd inode is the single source of truth for
the state of the _physical_ page.

E.g. on TDX, if KVM installs a private SPTE for a PFN that is in actuality shared,
there will be machine checks and the host will likely crash.

> > I would much, much prefer one large series that shows the full picture than a
> > mish mash of partial series that I can't actually review, even if the big series
> > is 100+ patches (hopefully not).
> 
> Dropping the RFC from the second series was not intentional, the first
> series is the one where I intended to drop the RFC. I apologize for
> that.  Especially since I obviously don't know how to handle modules
> and wanted some input on how to do that :)

In this case, the rules for modules are pretty simple.  Code in mm/ can't call
into KVM.  Either avoid callbacks entirely, or implement via a layer of
indirection, e.g. function pointer or ops table, so that KVM can provide its
implementation at runtime.
Fuad Tabba April 4, 2025, 6:43 a.m. UTC | #3
Hi Sean,

On Thu, 3 Apr 2025 at 15:51, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 03, 2025, Fuad Tabba wrote:
> > On Thu, 3 Apr 2025 at 00:56, Sean Christopherson <seanjc@google.com> wrote:
> > > On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > > > index ac6b8853699d..cde16ed3b230 100644
> > > > > --- a/virt/kvm/guest_memfd.c
> > > > > +++ b/virt/kvm/guest_memfd.c
> > > > > @@ -17,6 +17,18 @@ struct kvm_gmem {
> > > > >     struct list_head entry;
> > > > >  };
> > > > >
> > > > > +struct kvm_gmem_inode_private {
> > > > > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > > +   struct xarray shared_offsets;
> > > > > +   rwlock_t offsets_lock;
> > > >
> > > > This lock doesn't work, either that or this lock can't be held while
> > > > faulting, because holding this lock means we can't sleep, and we need to
> > > > sleep to allocate.
> > >
> > > rwlock_t is a variant of a spinlock, which can't be held when sleeping.
> > >
> > > What exactly does offsets_lock protect, and what are the rules for holding it?
> > > At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
> > > for a private gfn that is being converted to shared.  KVM doesn't hold references
> > > to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
> > > shared_offsets let alone take offsets_lock.
> >
> > You're right about the rwlock_t. The goal of the offsets_lock is to
> > protect the shared offsets -- i.e., it's just meant to protect the
> > SHARED/PRIVATE status of a folio, not more, hence why it's not checked
> > in kvm_gmem_get_pfn(). It used to be protected by the
> > filemap_invalidate_lock, but the problem is that it would be called
> > from an interrupt context.
> >
> > However, this is wrong, as you've pointed out. The purpose of the
> > locking is to ensure that no two conversions of the same folio happen
> > at the same time. An alternative I had written up is to rely on having
> > exclusive access to the folio, since the sharing state is tied to the
> > folio. That could be achieved either by acquiring the folio lock, or by
> > ensuring that the folio has no outstanding references, which would
> > indicate that we have exclusive access to it. This would avoid the
> > whole locking issue.
> >
> > > ... Something needs to prevent KVM from installing a mapping
> > > for a private gfn that is being converted to shared.  ...
> >
> > > guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
> > > kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
> > > the shared/private conversion code.
> >
> > For in-place sharing, KVM can install a mapping for a SHARED gfn. What
> > it cannot do is install a mapping for a transient (i.e., NONE) gfn. We
> > don't rely on kvm_gmem_get_pfn() for that, but on the individual KVM
> > MMU fault handlers. That said...
>
> Consumption of shared/private physical pages _must_ be enforced by guest_memfd.
> The private vs. shared state in the MMU handlers is the VM's view of the world
> and its desired state.  The guest_memfd inode is the single source of truth for
> the state of the _physical_ page.
>
> E.g. on TDX, if KVM installs a private SPTE for a PFN that is in actuality shared,
> there will be machine checks and the host will likely crash.

I agree. As a plus, I've made that change, and it actually simplifies the logic.

> > > I would much, much prefer one large series that shows the full picture than a
> > > mish mash of partial series that I can't actually review, even if the big series
> > > is 100+ patches (hopefully not).
> >
> > Dropping the RFC from the second series was not intentional, the first
> > series is the one where I intended to drop the RFC. I apologize for
> > that.  Especially since I obviously don't know how to handle modules
> > and wanted some input on how to do that :)
>
> In this case, the rules for modules are pretty simple.  Code in mm/ can't call
> into KVM.  Either avoid callbacks entirely, or implement via a layer of
> indirection, e.g. function pointer or ops table, so that KVM can provide its
> implementation at runtime.

Ack.

Thanks again!
/fuad

Patch

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index ac6b8853699d..cde16ed3b230 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -17,6 +17,18 @@  struct kvm_gmem {
 	struct list_head entry;
 };
 
+struct kvm_gmem_inode_private {
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	struct xarray shared_offsets;
+	rwlock_t offsets_lock;
+#endif
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
+{
+	return inode->i_mapping->i_private_data;
+}
+
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
 void kvm_gmem_handle_folio_put(struct folio *folio)
 {
@@ -324,8 +336,28 @@  static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	/*
+	 * .evict_inode can be called before private data is set up if there are
+	 * issues during inode creation.
+	 */
+	if (private)
+		xa_destroy(&private->shared_offsets);
+#endif
+
+	truncate_inode_pages_final(inode->i_mapping);
+
+	kfree(private);
+	clear_inode(inode);
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
-	.statfs		= simple_statfs,
+	.statfs         = simple_statfs,
+	.evict_inode	= kvm_gmem_evict_inode,
 };
 
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -553,6 +585,7 @@  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
 	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct kvm_gmem_inode_private *private;
 	struct inode *inode;
 	int err;
 
@@ -561,10 +594,20 @@  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 		return inode;
 
 	err = security_inode_init_security_anon(inode, &qname, NULL);
-	if (err) {
-		iput(inode);
-		return ERR_PTR(err);
-	}
+	if (err)
+		goto out;
+
+	err = -ENOMEM;
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		goto out;
+
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	xa_init(&private->shared_offsets);
+	rwlock_init(&private->offsets_lock);
+#endif
+
+	inode->i_mapping->i_private_data = private;
 
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
@@ -577,6 +620,11 @@  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
 	return inode;
+
+out:
+	iput(inode);
+
+	return ERR_PTR(err);
 }
 
 static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,