[v7,0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support

Message ID 20250328153133.3504118-1-tabba@google.com

Message

Fuad Tabba March 28, 2025, 3:31 p.m. UTC
This series adds restricted mmap() support to guest_memfd, as well as
support for guest_memfd on arm64. Please see v3 for the context [1].

Main change since v6 [2]:
Protected the shared_offsets array with a rwlock instead of hopping on
the invalidate_lock. The main reason for this is that the final put
callback (kvm_gmem_handle_folio_put()) could be called from an atomic
context, and the invalidate_lock is an rw_semaphore (Vishal).

The other reason is, in hindsight, it didn't make sense to use the
invalidate_lock since they're not quite protecting the same thing.
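
To make the difference concrete, here is a minimal sketch of the
arrangement; the struct layout and helper names below are illustrative
assumptions for this cover letter, not the actual code in the patches:

#include <linux/bitops.h>
#include <linux/pagemap.h>
#include <linux/spinlock.h>

/* Illustrative only; the real layout lives in the patches. */
struct kvm_gmem_inode_private {
	unsigned long *shared_offsets;	/* one bit per page offset */
	rwlock_t offsets_lock;		/* set up with rwlock_init() */
};

static bool kvm_gmem_offset_is_shared(struct kvm_gmem_inode_private *priv,
				      pgoff_t index)
{
	bool shared;

	read_lock(&priv->offsets_lock);
	shared = test_bit(index, priv->shared_offsets);
	read_unlock(&priv->offsets_lock);

	return shared;
}

/*
 * The final put callback may run in atomic context, so a sleeping
 * rw_semaphore (such as the mapping's invalidate_lock) must not be
 * taken here; a spinning rwlock_t is safe.
 */
static void kvm_gmem_set_offset_shared(struct kvm_gmem_inode_private *priv,
				       pgoff_t index, bool shared)
{
	write_lock(&priv->offsets_lock);
	if (shared)
		__set_bit(index, priv->shared_offsets);
	else
		__clear_bit(index, priv->shared_offsets);
	write_unlock(&priv->offsets_lock);
}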

I did consider using the folio lock to implicitly protect the array, and
even have another series that does that. The result was more
complicated, and not obviously race free. One of the difficulties with
relying on the folio lock is that there are cases (e.g., on
initialization and teardown) when we want to toggle the state of an
offset that doesn't have a folio in the cache. We could special case
these, but the result was more complicated (and to me not obviously
better) than adding a separate lock for the shared_offsets array.
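
For comparison, a sketch (again with assumed names) of what toggling an
offset's state under the folio lock would look like, showing the case
that would have needed special-casing:

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int kvm_gmem_toggle_shared(struct address_space *mapping,
				  pgoff_t index)
{
	struct folio *folio = filemap_lock_folio(mapping, index);

	/*
	 * On initialization/teardown the offset may have no folio in
	 * the page cache, so there is nothing to lock and the state
	 * change would need to be special-cased here.
	 */
	if (IS_ERR(folio))
		return PTR_ERR(folio);

	/* ... toggle the offset's state under the folio lock ... */

	folio_unlock(folio);
	folio_put(folio);
	return 0;
}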

Based on the `KVM: Mapping guest_memfd backed memory at the host for
software protected VMs` series [3], which in turn is based on Linux
6.14-rc7.

The state diagram showing the new states in this patch series,
and how they interact with sharing/unsharing in pKVM, can be found at [4].

Cheers,
/fuad

[1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
[2] https://lore.kernel.org/all/20250318162046.4016367-1-tabba@google.com/
[3] https://lore.kernel.org/all/20250318161823.4005529-1-tabba@google.com/
[4] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf

Ackerley Tng (2):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private

Fuad Tabba (5):
  KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
    the folio lock
  KVM: guest_memfd: Folio sharing states and functions that manage their
    transition
  KVM: guest_memfd: Restore folio state after final folio_put()
  KVM: guest_memfd: Handle invalidation of shared memory
  KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared

 Documentation/virt/kvm/api.rst                |   4 +
 include/linux/kvm_host.h                      |  56 +-
 include/uapi/linux/kvm.h                      |   1 +
 include/uapi/linux/magic.h                    |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  |   7 +-
 virt/kvm/guest_memfd.c                        | 613 +++++++++++++++++-
 virt/kvm/kvm_main.c                           |  62 ++
 7 files changed, 706 insertions(+), 38 deletions(-)


base-commit: 62aff24816ac463bf3f754a15b2e7cdff59976ea

Comments

Liam R. Howlett April 4, 2025, 6:05 p.m. UTC | #1
* Fuad Tabba <tabba@google.com> [250328 11:32]:
> This series adds restricted mmap() support to guest_memfd, as well as
> support for guest_memfd on arm64. Please see v3 for the context [1].

As I'm sure you are aware, we have issues getting people to review
patches.  The lower the barrier to entry, the better for everyone.

Sorry if I go into too much detail about the process; I am not sure
where you are starting from.  The linux-mm maintainer (Andrew, aka
akpm) usually takes this cover letter and puts it on patch 1 so that
the git history captures all the details necessary for the entire patch
set to make sense.  What you have done here has made a lot of work for
the maintainer.  I'm not sure what information below is or is not
captured in the v3 context.

But then again, maybe you are not going through linux-mm for upstream?


> 
> Main change since v6 [2]:
> Protected the shared_offsets array with a rwlock instead of hopping on
> the invalidate_lock. The main reason for this is that the final put
> callback (kvm_gmem_handle_folio_put()) could be called from an atomic
> context, and the invalidate_lock is an rw_semaphore (Vishal).
> 
> The other reason is, in hindsight, 

Can you please try to be more brief and to the point?

> it didn't make sense to use the
> invalidate_lock since they're not quite protecting the same thing.
> 
> I did consider using the folio lock to implicitly protect the array, and
> even have another series that does that. The result was more
> complicated, and not obviously race free. One of the difficulties with
> relying on the folio lock is that there are cases (e.g., on
> initialization and teardown) when we want to toggle the state of an
> offset that doesn't have a folio in the cache. We could special case
> these, but the result was more complicated (and to me not obviously
> better) than adding a separate lock for the shared_offsets array.

This must be captured elsewhere and doesn't need to be here?

> 
> Based on the `KVM: Mapping guest_memfd backed memory at the host for
> software protected VMs` series [3], which in turn is based on Linux
> 6.14-rc7.

Wait, what?  You have another in-flight series that I need for this
series?

So this means I need 6.14-rc7 + patches from March 18th to review your
series?

Oh my, and I just responded to another patch set built off this patch
set?  So we have 3 in-flight patch sets that I need for the last patch set?
What is going on with guest_memfd that everything needs to be in-flight
at once?

I was actually thinking that maybe it would be better to split *this*
set into 2 logical parts: 1. mmap() support and 2. guest_memfd arm64
support.  But now I have so many lore emails open in my browser trying
to figure this out that I don't want any more.

There's also "mm: Consolidate freeing of typed folios on final
folio_put()" and I don't know where it fits.

Is this because the upstream path differs?  It's very difficult to
follow what's going on.

> 
> The state diagram showing the new states in this patch series,
> and how they interact with sharing/unsharing in pKVM, can be found at [4].

This cover letter is very difficult to follow.  Where do the main
changes since v6 end?  I was going to suggest bullets, but v3 has
bullets and is already clearer.  I'd much prefer it had remained like
v3; any reason you changed?

I am not sure what upstream repository you are working with, but
if you are sending this for the mm stream, please base your work on
mm-new and AT LEAST wait until the patches are in mm-new, but ideally
wait for mm-unstable and mm-stable for wider test coverage.  This might
vary based on your upstream path though.

I did respond to one of the other patch sets that's based off this one
that the lockdep issue in patch 3 makes testing a concern.  Considering
there are patches on patches, I don't think you are going to get a whole
lot of people reviewing or testing it until it falls over once it hits
linux-next.

The number of patches in-flight, the ordering, and the base are so
confusing.  Are you sure none of these should be RFC?  The flood of
changes pretty much guarantees things will be missed, more work will be
created, and people (like me) will get frustrated.

It looks like a small team across companies are collaborating on this,
and that's awesome.  I think you need to change how you are doing things
and let the rest of us in on the code earlier.  Otherwise you will be
forced to rebase on changed patch series every time something is
accepted upstream.  That's all to say, you don't have a solid base of
development for the newer patches and I don't know what interactions
will happen when all the in-flight things meet.

I'm sorry, but you have made the barrier to entry for reviewing this too
high.

Thanks,
Liam

> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> [2] https://lore.kernel.org/all/20250318162046.4016367-1-tabba@google.com/
> [3] https://lore.kernel.org/all/20250318161823.4005529-1-tabba@google.com/
> [4] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
David Hildenbrand April 4, 2025, 8:10 p.m. UTC | #2
On 04.04.25 20:05, Liam R. Howlett wrote:
> * Fuad Tabba <tabba@google.com> [250328 11:32]:
>> This series adds restricted mmap() support to guest_memfd, as well as
>> support for guest_memfd on arm64. Please see v3 for the context [1].
> 
> As I'm sure you are aware, we have issues getting people to review
> patches.  The lower the barrier to entry, the better for everyone.
> 
> Sorry if I go into too much detail about the process; I am not sure
> where you are starting from.  The linux-mm maintainer (Andrew, aka
> akpm) usually takes this cover letter and puts it on patch 1 so that
> the git history captures all the details necessary for the entire patch
> set to make sense.  What you have done here has made a lot of work for
> the maintainer.  I'm not sure what information below is or is not
> captured in the v3 context.
> 
> But then again, maybe you are not going through linux-mm for upstream?

[replying to some bits]

As all the patches and this subject are prefixed "KVM:", I would assume
this goes through the kvm tree once ready.

[...]

> 
> So this means I need 6.14-rc7 + patches from March 18th to review your
> series?
> 
> Oh my, and I just responded to another patch set built off this patch
> set?  So we have 3 in-flight patch sets that I need for the last patch set?
> What is going on with guest_memfd that everything needs to be in-flight
> at once?

Yeah, we're still all trying to connect the pieces to make it all fly 
together. Looks like we're getting there.

> 
> I was actually thinking that maybe it would be better to split *this*
> set into 2 logical parts: 1. mmap() support and 2. guest_memfd arm64
> support.  But now I have so many lore emails open in my browser trying
> to figure this out that I don't want any more.
> 
> There's also "mm: Consolidate freeing of typed folios on final
> folio_put()" and I don't know where it fits.

That will likely be moved into a different patch set after some 
discussions we had yesterday.

> 
> Is this because the upstream path differs?  It's very difficult to
> follow what's going on.
> 
>>
>> The state diagram showing the new states in this patch series,
>> and how they interact with sharing/unsharing in pKVM, can be found at [4].
> 
> This cover letter is very difficult to follow.  Where do the main
> changes since v6 end?  I was going to suggest bullets, but v3 has
> bullets and is already clearer.  I'd much prefer it had remained like
> v3; any reason you changed?
> 
> I am not sure what upstream repository you are working with, but
> if you are sending this for the mm stream, please base your work on
> mm-new and AT LEAST wait until the patches are in mm-new, but ideally
> wait for mm-unstable and mm-stable for wider test coverage.  This might
> vary based on your upstream path though.
> 
> I did respond to one of the other patch sets that's based off this one
> that the lockdep issue in patch 3 makes testing a concern.  Considering
> there are patches on patches, I don't think you are going to get a whole
> lot of people reviewing or testing it until it falls over once it hits
> linux-next.

I think for now it's mostly KVM + guest_memfd people reviewing this. 
linux-mm is only CCed for awareness (clearly MM related stuff).

> 
> The number of patches in-flight, the ordering, and the base are so
> confusing.  Are you sure none of these should be RFC?  The flood of
> changes pretty much guarantees things will be missed, more work will be
> created, and people (like me) will get frustrated.

Yeah, I think everything based on this series should be RFC, or
otherwise clearly tagged as depending on this work.

> It looks like a small team across companies are collaborating on this,
> and that's awesome.  I think you need to change how you are doing things
> and let the rest of us in on the code earlier. 

I think the approach taken to share the different pieces early makes 
sense; it just has to be clearer what the dependencies are and what is 
actually the first thing that should go in, so people can focus review on 
that.

> Otherwise you will be
> forced to rebase on changed patch series every time something is
> accepted upstream. 

I don't think that's a problem; the work built on top of this is mostly 
shared for early review/feedback -- though I agree that tagging it 
differently makes a lot of sense.
Sean Christopherson April 4, 2025, 9:48 p.m. UTC | #3
On Fri, Apr 04, 2025, David Hildenbrand wrote:
> On 04.04.25 20:05, Liam R. Howlett wrote:
> > But then again, maybe you are not going through linux-mm for upstream?
> 
> [replying to some bits]
> 
> As all the patches and this subject are prefixed "KVM:", I would assume this
> goes through the kvm tree once ready.

Yeah, that's a safe assumption.

> > It looks like a small team across companies are collaborating on this,
> > and that's awesome.  I think you need to change how you are doing things
> > and let the rest of us in on the code earlier.
> 
> I think the approach taken to share the different pieces early makes sense;
> it just has to be clearer what the dependencies are and what is actually the
> first thing that should go in, so people can focus review on that.

100% agreed that sharing early makes sense, but I also 100% agree with Liam that
having multiple series flying around with multiple dependencies makes them all
unreviewable.  I simply don't have the bandwidth to track down who's doing what
and where.

I don't see those two sides as conflicting.  Someone "just" needs to take point
on collecting, squashing, and posting the various inter-related works as a single
cohesive series.

As you said, things are getting there, but I recommend prioritizing that above
the actual code, otherwise reviewers are going to continue ignoring the individual
series.

FWIW, if necessary, I would much prefer a single massive series over a bunch of
smaller series all pointing at each other, at least for the initial review.
Liam R. Howlett April 5, 2025, 2:45 a.m. UTC | #4
* Sean Christopherson <seanjc@google.com> [250404 17:48]:
> On Fri, Apr 04, 2025, David Hildenbrand wrote:
> > On 04.04.25 20:05, Liam R. Howlett wrote:
> > > But then again, maybe you are not going through linux-mm for upstream?
> > 
> > [replying to some bits]
> > 
> > As all the patches and this subject are prefixed "KVM:", I would assume this
> > goes through the kvm tree once ready.
> 
> Yeah, that's a safe assumption.

Okay, thanks.

It still seems strange to have such vast differences in the cover letter
between versions, requiring others to hunt down the information, versus an
updated cover letter with revision history and links.

I did get lost on where the changes since v6 stopped and the new
comments started in the update.

But maybe it's that I'm set in my grumpy old ways.

I really like having the cover letter information in the first patch
(patch 1 of 7 here) so that it ends up in the git log.

> 
> > > It looks like a small team across companies are collaborating on this,
> > > and that's awesome.  I think you need to change how you are doing things
> > > and let the rest of us in on the code earlier.
> > 
> > I think the approach taken to share the different pieces early makes sense;
> > it just has to be clearer what the dependencies are and what is actually the
> > first thing that should go in, so people can focus review on that.
> 
> 100% agreed that sharing early makes sense, but I also 100% agree with Liam that
> having multiple series flying around with multiple dependencies makes them all
> unreviewable.  I simply don't have the bandwidth to track down who's doing what
> and where.

Yes, sharing early is crucial, but we lack the quilt file to stitch them
together in the proper sequence so we can do a proper review.

My main issue is the barrier to entry; I have no idea how I can help this
effort as it is today.

> 
> I don't see those two sides as conflicting.  Someone "just" needs to take point
> on collecting, squashing, and posting the various inter-related works as a single
> cohesive series.

It *looks* like all these patches need to go in now (no RFC tags, for
instance), but when you start reading through the cover letters there are
many levels, and the effort quickly grows: which branch do I need, in what
order, and which of these have landed?  This is like SMR, but with code.

> 
> As you said, things are getting there, but I recommend prioritizing that above
> the actual code, otherwise reviewers are going to continue ignoring the individual
> series.
> 
> FWIW, if necessary, I would much prefer a single massive series over a bunch of
> smaller series all pointing at each other, at least for the initial review.
> 

Yes, at least then the dependency order and versions would not be such
an effort to get correct.  If it's really big maybe a git repo would be
a good idea along with the large patch set?  You'd probably be using
that repo to combine/squash and generate the patches anyway.  I still
don't know what patch I need to start with and which ones have landed.

If each part is worth doing on its own, then send one at a time and wait
for them to land.  This might result in wasted time on a plan that needs
to be changed for upstreaming, so I think the big patch bomb/git branch
is the way to go.

Both of these methods will provide better end-to-end testing and code
review than the individual parts being sent out in short succession with
references to each other.

Thanks,
Liam