Message ID | 20231230161954.569267-12-michael.roth@amd.com |
---|---|
State | New |
Headers | show |
Series | Add AMD Secure Nested Paging (SEV-SNP) Initialization Support | expand |
On Sat, Dec 30, 2023 at 10:19:39AM -0600, Michael Roth wrote: > static int rmpupdate(u64 pfn, struct rmp_state *state) > { > unsigned long paddr = pfn << PAGE_SHIFT; > - int ret; > + int ret, level, npages; > > if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP)) > return -ENODEV; > > + level = RMP_TO_PG_LEVEL(state->pagesize); > + npages = page_level_size(level) / PAGE_SIZE; > + > + /* > + * If the kernel uses a 2MB directmap mapping to write to an address, > + * and that 2MB range happens to contain a 4KB page that set to private > + * in the RMP table, an RMP #PF will trigger and cause a host crash. > + * > + * Prevent this by removing pages from the directmap prior to setting > + * them as private in the RMP table. > + */ > + if (state->assigned && invalidate_direct_map(pfn, npages)) > + return -EFAULT; Well, the comment says one thing but the code does not do quite what the text says: * where is the check that pfn is part of the kernel direct map? * where is the check that this address is part of a 2M PMD in the direct map so that the situation is even given that you can have a 4K private PTE there? What this does is simply invalidate @npages unconditionally. Then, __set_pages_np() already takes a number of pages and a start address. Why is this thing iterating instead of sending *all* npages in one go? Yes, you'd need two new helpers, something like: set_pages_present(struct page *page, int numpages) set_pages_non_present(struct page *page, int numpages) which call the lowlevel counterparts. For some reason I remember asking this a while ago...
On 12/30/23 08:19, Michael Roth wrote: > If the kernel uses a 2MB directmap mapping to write to an address, and > that 2MB range happens to contain a 4KB page that set to private in the > RMP table, that will also lead to a page-fault exception. I thought we agreed long ago to just demote the whole direct map to 4k on kernels that might need to act as SEV-SNP hosts. That should be step one and this can be discussed as an optimization later.
On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote: > On 12/30/23 08:19, Michael Roth wrote: > > If the kernel uses a 2MB directmap mapping to write to an address, and > > that 2MB range happens to contain a 4KB page that set to private in the > > RMP table, that will also lead to a page-fault exception. > > I thought we agreed long ago to just demote the whole direct map to 4k > on kernels that might need to act as SEV-SNP hosts. That should be step > one and this can be discussed as an optimization later. What would be the disadvantage here? Higher TLB pressure when running kernel code I guess...
On 1/12/24 21:07, Borislav Petkov wrote: > On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote: >> On 12/30/23 08:19, Michael Roth wrote: >> > If the kernel uses a 2MB directmap mapping to write to an address, and >> > that 2MB range happens to contain a 4KB page that set to private in the >> > RMP table, that will also lead to a page-fault exception. >> >> I thought we agreed long ago to just demote the whole direct map to 4k >> on kernels that might need to act as SEV-SNP hosts. That should be step >> one and this can be discussed as an optimization later. > > What would be the disadvantage here? Higher TLB pressure when running > kernel code I guess... Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we previously thought https://lwn.net/Articles/931406/
On 1/12/24 14:07, Borislav Petkov wrote: > On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote: >> On 12/30/23 08:19, Michael Roth wrote: >>> If the kernel uses a 2MB directmap mapping to write to an address, and >>> that 2MB range happens to contain a 4KB page that set to private in the >>> RMP table, that will also lead to a page-fault exception. I thought there was also a desire to remove the direct map for any pages assigned to a guest as private, not just the case that the comment says. So updating the comment would probably the best action. >> >> I thought we agreed long ago to just demote the whole direct map to 4k >> on kernels that might need to act as SEV-SNP hosts. That should be step >> one and this can be discussed as an optimization later. Won't this accomplish that without actually demoting a lot of long-live kernel related mappings that would never be demoted? I don't think we need to demote the whole mapping to 4K. Thanks, Tom > > What would be the disadvantage here? Higher TLB pressure when running > kernel code I guess... >
On 1/12/24 12:28, Tom Lendacky wrote: > I thought there was also a desire to remove the direct map for any pages > assigned to a guest as private, not just the case that the comment says. > So updating the comment would probably the best action. I'm not sure who desires that. It's sloooooooow to remove things from the direct map. There's almost certainly a frequency cutoff where running the whole direct mapping as 4k is better than the cost of mapping/unmapping. Actually, where _is_ the TLB flushing here?
On Sat, Dec 30, 2023 at 10:19:39AM -0600, Michael Roth wrote: > + /* > + * If the kernel uses a 2MB directmap mapping to write to an address, > + * and that 2MB range happens to contain a 4KB page that set to private > + * in the RMP table, an RMP #PF will trigger and cause a host crash. Also: diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c index 7d294d1a620b..2ad83e7fb2da 100644 --- a/arch/x86/virt/svm/sev.c +++ b/arch/x86/virt/svm/sev.c @@ -415,8 +415,9 @@ static int rmpupdate(u64 pfn, struct rmp_state *state) /* * If the kernel uses a 2MB directmap mapping to write to an address, - * and that 2MB range happens to contain a 4KB page that set to private - * in the RMP table, an RMP #PF will trigger and cause a host crash. + * and that 2MB range happens to contain a 4KB page that has been set + * to private in the RMP table, an RMP #PF will trigger and cause a + * host crash. * * Prevent this by removing pages from the directmap prior to setting * them as private in the RMP table.
On Fri, Jan 12, 2024 at 09:27:45PM +0100, Vlastimil Babka wrote: > Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we > previously thought https://lwn.net/Articles/931406/ How nice, thanks for that! Do you have some refs to Mike's tests so that we could run them here too with SNP guests to see how big - if any - the fragmentation has. Thx.
On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote: > I thought we agreed long ago to just demote the whole direct map to 4k > on kernels that might need to act as SEV-SNP hosts. That should be step > one and this can be discussed as an optimization later. Do we have a link to that agreement somewhere? I'd like to read why we agreed. And looking at Mike's talk: https://lwn.net/Articles/931406/ - what do we wanna do in general with the direct map granularity? Thx.
On 1/15/24 10:06, Borislav Petkov wrote: > On Fri, Jan 12, 2024 at 09:27:45PM +0100, Vlastimil Babka wrote: >> Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we >> previously thought https://lwn.net/Articles/931406/ > > How nice, thanks for that! > > Do you have some refs to Mike's tests so that we could run them here too > with SNP guests to see how big - if any - the fragmentation has. Let me Cc him... > Thx. >
On Mon, Jan 15, 2024 at 10:06:39AM +0100, Borislav Petkov wrote: > On Fri, Jan 12, 2024 at 09:27:45PM +0100, Vlastimil Babka wrote: > > Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we > > previously thought https://lwn.net/Articles/931406/ > > How nice, thanks for that! > > Do you have some refs to Mike's tests so that we could run them here too > with SNP guests to see how big - if any - the fragmentation has. I ran several tests from mmtests suite. The tests results are here: https://rppt.io/gfp-unmapped-bench/ > Thx. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette >
On Mon, Jan 15, 2024 at 11:16:19AM +0200, Mike Rapoport wrote: > I ran several tests from mmtests suite. > > The tests results are here: > https://rppt.io/gfp-unmapped-bench/ Cool, thanks Mike! :-)
On 1/12/24 21:37, Dave Hansen wrote: > On 1/12/24 12:28, Tom Lendacky wrote: >> I thought there was also a desire to remove the direct map for any pages >> assigned to a guest as private, not just the case that the comment says. >> So updating the comment would probably the best action. > > I'm not sure who desires that. > > It's sloooooooow to remove things from the direct map. There's almost > certainly a frequency cutoff where running the whole direct mapping as > 4k is better than the cost of mapping/unmapping. > > Actually, where _is_ the TLB flushing here? Hm yeah it seems to be using the _noflush version? Maybe the RMP issues this avoids are only triggered with actual page tables and a stray outdated TLB hit doesn't trigger it? Needs documenting though if that's the case.
On Fri, Jan 12, 2024 at 12:37:31PM -0800, Dave Hansen wrote: > On 1/12/24 12:28, Tom Lendacky wrote: > > I thought there was also a desire to remove the direct map for any pages > > assigned to a guest as private, not just the case that the comment says. > > So updating the comment would probably the best action. > > I'm not sure who desires that. > > It's sloooooooow to remove things from the direct map. There's almost > certainly a frequency cutoff where running the whole direct mapping as > 4k is better than the cost of mapping/unmapping. One area we'd been looking to optimize[1] is lots of vCPUs/guests faulting in private memory on-demand via kernels with lazy acceptance support. The lazy acceptance / #NPF path for each 4K/2M guest page involves updating the corresponding RMP table entry to set it to Guest-owned state, and as part of that we remove the PFNs from the directmap. There is indeed potential for scalability issues due to how directmap updates are currently handled, since they involve holding a global cpa_lock for every update. At the time, there were investigations on how to remove the cpa_lock, and there's an RFC patch[2] that implements this change, but I don't know where this stands today. There was also some work[3] on being able to restore 2MB entries in the directmap (currently entries are only re-added as 4K) the Mike Rapoport pointed out to me a while back. With that, since the bulk of private guest pages 2MB, we'd be able to avoid splitting the directmap to 4K over time. There was also some indication[4] that UPM/guest_memfd would eventually manage directmap invalidations internally, so it made sense to have SNP continue to handle this up until the point that mangement was moved to gmem. Those 3 things, paired with a platform-independent way of catching unexpected kernel accesses to private guest memory, are what I think nudged us all toward the current implementation. But AFAIK none of these 3 things are being actively upstreamed today, so it makes sense to re-consider how we handle the directmap in the interim. I did some performance tests which do seem to indicate that pre-splitting the directmap to 4K can be substantially improve certain SNP guest workloads. This test involves running a single 1TB SNP guest with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to rapidly fault in all of its memory via lazy acceptance, and then measuring the rate that gmem pages are being allocated on the host by monitoring "FileHugePages" from /proc/meminfo to get some rough gauge of how quickly a guest can fault in it's initial working set prior to reaching steady state. The data is a bit noisy but seems to indicate significant improvement by taking the directmap updates out of the lazy acceptance path, and I would only expect that to become more significant as you scale up the number of guests / vCPUs. # Average fault-in rate across 3 runs, measured in GB/s unpinned | pinned to NUMA node 0 DirectMap4K 12.9 | 12.1 stddev 2.2 | 1.3 DirectMap2M+split 8.0 | 8.9 stddev 1.3 | 0.8 The downside of course is potential impact for non-SNP workloads resulting from splitting the directmap. Mike Rapoport's numbers make me feel a little better about it, but I don't think they apply directly to the notion of splitting the entire directmap. It's Even he LWN article summarizes: "The conclusion from all of this, Rapoport continued, was that direct-map fragmentation just does not matter — for data access, at least. Using huge-page mappings does still appear to make a difference for memory containing the kernel code, so allocator changes should focus on code allocations — improving the layout of allocations for loadable modules, for example, or allowing vmalloc() to allocate huge pages for code. But, for kernel-data allocations, direct-map fragmentation simply appears to not be worth worrying about." So at the very least, if we went down this path, we would be worth investigating the following areas in addition to general perf testing: 1) Only splitting directmap regions corresponding to kernel-allocatable *data* (hopefully that's even feasible...) 2) Potentially deferring the split until an SNP guest is actually run, so there isn't any impact just from having SNP enabled (though you still take a hit from RMP checks in that case so maybe it's not worthwhile, but that itself has been noted as a concern for users so it would be nice to not make things even worse). [1] https://lore.kernel.org/linux-mm/20231103000105.m3z4eijcxlxciyzd@amd.com/ [2] https://lore.kernel.org/lkml/Y7f9ZuPcIMk37KnN@gmail.com/T/#m15b74841f5319c0d1177f118470e9714d4ea96c8 [3] https://lore.kernel.org/linux-kernel/20200416213229.19174-1-kirill.shutemov@linux.intel.com/ [4] https://lore.kernel.org/all/YyGLXXkFCmxBfu5U@google.com/ > > Actually, where _is_ the TLB flushing here? Boris pointed that out in v6, and we implemented it in v7, but it completely cratered performance: https://lore.kernel.org/linux-mm/20221219150026.bltiyk72pmdc2ic3@amd.com/ After further discussion I think we'd concluded it wasn't necessary. Maybe that's worth revisiting though. If it is necessary, then that would be another reason to just pre-split the directmap because the above-mentioned lazy acceptance workload/bottleneck would likely get substantially worse. -Mike
On 1/15/24 01:09, Borislav Petkov wrote: > On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote: >> I thought we agreed long ago to just demote the whole direct map to 4k >> on kernels that might need to act as SEV-SNP hosts. That should be step >> one and this can be discussed as an optimization later. > > Do we have a link to that agreement somewhere? I seem to be remembering it wrong. I have some memory of disabling 2M mappings in the same spots that debug_pagealloc_enabled() does, but I must have hallucinated it. > https://lore.kernel.org/lkml/535400b4-0593-a7ca-1548-532ee1fefbd7@intel.com/ The original approach tried to fix up the page faults by doing the demotion there and didn't seem to work with hugetlbfs at all. > I'd like to read why we agreed. And looking at Mike's talk: > https://lwn.net/Articles/931406/ - what do we wanna do in general with > the direct map granularity? I think fracturing it is much less of a problem than we once thought it was. There are _definitely_ people that will pay the cost for added security. The problem will be the first time someone sees a regression when their direct-map-fracturing kernel feature (secret pages, SNP host, etc...) collides with some workload that's sensitive to kernel TLB pressure.
On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote: > I did some performance tests which do seem to indicate that > pre-splitting the directmap to 4K can be substantially improve certain > SNP guest workloads. This test involves running a single 1TB SNP guest > with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to > rapidly fault in all of its memory via lazy acceptance, and then > measuring the rate that gmem pages are being allocated on the host by > monitoring "FileHugePages" from /proc/meminfo to get some rough gauge > of how quickly a guest can fault in it's initial working set prior to > reaching steady state. The data is a bit noisy but seems to indicate > significant improvement by taking the directmap updates out of the > lazy acceptance path, and I would only expect that to become more > significant as you scale up the number of guests / vCPUs. > > # Average fault-in rate across 3 runs, measured in GB/s > unpinned | pinned to NUMA node 0 > DirectMap4K 12.9 | 12.1 > stddev 2.2 | 1.3 > DirectMap2M+split 8.0 | 8.9 > stddev 1.3 | 0.8 > > The downside of course is potential impact for non-SNP workloads > resulting from splitting the directmap. Mike Rapoport's numbers make > me feel a little better about it, but I don't think they apply directly > to the notion of splitting the entire directmap. It's Even he LWN article > summarizes: > > "The conclusion from all of this, Rapoport continued, was that > direct-map fragmentation just does not matter — for data access, at > least. Using huge-page mappings does still appear to make a difference > for memory containing the kernel code, so allocator changes should > focus on code allocations — improving the layout of allocations for > loadable modules, for example, or allowing vmalloc() to allocate huge > pages for code. But, for kernel-data allocations, direct-map > fragmentation simply appears to not be worth worrying about." > > So at the very least, if we went down this path, we would be worth > investigating the following areas in addition to general perf testing: > > 1) Only splitting directmap regions corresponding to kernel-allocatable > *data* (hopefully that's even feasible...) > 2) Potentially deferring the split until an SNP guest is actually > run, so there isn't any impact just from having SNP enabled (though > you still take a hit from RMP checks in that case so maybe it's not > worthwhile, but that itself has been noted as a concern for users > so it would be nice to not make things even worse). There's another potential area of investigation I forgot to mention that doesn't involve pre-splitting the directmap. It makes use of the fact that the kernel should never be accessing a 2MB mapping that overlaps with private guest memory if the backing PFN for the guest memory is a 2MB page. Since there's no chance for overlap (well, maybe via a 1GB directmap entry, but not as dramatic a change to force those to 2M), there's no need to actually split the directmap entry in these cases since they won't result in unexpected RMP faults. So if pre-splitting the directmap ends up having too many downsides, then there may still some potential for optimizing the current approach to a fair degree. -Mike
On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote: > So at the very least, if we went down this path, we would be worth > investigating the following areas in addition to general perf testing: > > 1) Only splitting directmap regions corresponding to kernel-allocatable > *data* (hopefully that's even feasible...) > 2) Potentially deferring the split until an SNP guest is actually > run, so there isn't any impact just from having SNP enabled (though > you still take a hit from RMP checks in that case so maybe it's not > worthwhile, but that itself has been noted as a concern for users > so it would be nice to not make things even worse). So the gist of this whole explanation why we end up doing what we end up doing eventually should be in the commit message so that it is clear *why* we did it. > After further discussion I think we'd concluded it wasn't necessary. Maybe > that's worth revisiting though. If it is necessary, then that would be > another reason to just pre-split the directmap because the above-mentioned > lazy acceptance workload/bottleneck would likely get substantially worse. The reason for that should also be in the commit message. And to answer: https://lore.kernel.org/linux-mm/20221219150026.bltiyk72pmdc2ic3@amd.com/ yes, you should add a @npages variant. See if you could use/extend this, for example: https://lore.kernel.org/r/20240116022008.1023398-3-mhklinux@outlook.com Thx.
On Tue, Jan 16, 2024 at 10:50:25AM -0600, Michael Roth wrote: > On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote: > > > > The downside of course is potential impact for non-SNP workloads > > resulting from splitting the directmap. Mike Rapoport's numbers make > > me feel a little better about it, but I don't think they apply directly > > to the notion of splitting the entire directmap. It's Even he LWN article > > summarizes: When I ran the tests, I had the entire direct map forced to 4k or 2M pages. There is indeed some impact and some tests suffer more than others but there were also runs that benefit from smaller page size in the direct map, at least if I remember correctly the results Intel folks posted a while ago. > > "The conclusion from all of this, Rapoport continued, was that > > direct-map fragmentation just does not matter — for data access, at > > least. Using huge-page mappings does still appear to make a difference > > for memory containing the kernel code, so allocator changes should > > focus on code allocations — improving the layout of allocations for > > loadable modules, for example, or allowing vmalloc() to allocate huge > > pages for code. But, for kernel-data allocations, direct-map > > fragmentation simply appears to not be worth worrying about." > > > > So at the very least, if we went down this path, we would be worth > > investigating the following areas in addition to general perf testing: > > > > 1) Only splitting directmap regions corresponding to kernel-allocatable > > *data* (hopefully that's even feasible...) Forcing the direct map to 4k pages does not affect the kernel text mappings, they are created separately and they are not the part of the kernel mapping of the physical memory. Well, except the modules, but they are mapped with 4k pages anyway. > -Mike
On 1/16/24 08:19, Michael Roth wrote: > > So at the very least, if we went down this path, we would be worth > investigating the following areas in addition to general perf testing: > > 1) Only splitting directmap regions corresponding to kernel-allocatable > *data* (hopefully that's even feasible...) Take a look at the 64-bit memory map in here: https://www.kernel.org/doc/Documentation/x86/x86_64/mm.rst We already have separate mappings for kernel data and (normal) kernel text. > 2) Potentially deferring the split until an SNP guest is actually > run, so there isn't any impact just from having SNP enabled (though > you still take a hit from RMP checks in that case so maybe it's not > worthwhile, but that itself has been noted as a concern for users > so it would be nice to not make things even worse). Yes, this would be nice too. >> Actually, where _is_ the TLB flushing here? > Boris pointed that out in v6, and we implemented it in v7, but it > completely cratered performance: That *desperately* needs to be documented. How can it be safe to skip the TLB flush? It this akin to a page permission promotion where you go from RO->RW but can skip the TLB flush? In that case, the CPU will see the RO TLB entry violation, drop it, and re-walk the page tables, discovering the RW entry. Does something similar happen here where the CPU sees the 2M/4k conflict in the TLB, drops the 2M entry, does a re-walk then picks up the newly-split 2M->4k entries? I can see how something like that would work, but it's _awfully_ subtle to go unmentioned.
On Tue, Jan 16, 2024 at 08:21:09AM -0800, Dave Hansen wrote: > The problem will be the first time someone sees a regression when their > direct-map-fracturing kernel feature (secret pages, SNP host, etc...) > collides with some workload that's sensitive to kernel TLB pressure. Yeah, and that "workload" will be some microbenchmark crap which people would pay too much attention to, without it having any relevance to real-world scenarios. Completely hypothetically speaking, ofc. :-)
On Tue, Jan 16, 2024 at 12:22:39PM -0800, Dave Hansen wrote: > On 1/16/24 08:19, Michael Roth wrote: > > > > So at the very least, if we went down this path, we would be worth > > investigating the following areas in addition to general perf testing: > > > > 1) Only splitting directmap regions corresponding to kernel-allocatable > > *data* (hopefully that's even feasible...) > > Take a look at the 64-bit memory map in here: > > https://www.kernel.org/doc/Documentation/x86/x86_64/mm.rst > > We already have separate mappings for kernel data and (normal) kernel text. > > > 2) Potentially deferring the split until an SNP guest is actually > > run, so there isn't any impact just from having SNP enabled (though > > you still take a hit from RMP checks in that case so maybe it's not > > worthwhile, but that itself has been noted as a concern for users > > so it would be nice to not make things even worse). > > Yes, this would be nice too. > > >> Actually, where _is_ the TLB flushing here? > > Boris pointed that out in v6, and we implemented it in v7, but it > > completely cratered performance: > > That *desperately* needs to be documented. > > How can it be safe to skip the TLB flush? It this akin to a page > permission promotion where you go from RO->RW but can skip the TLB > flush? In that case, the CPU will see the RO TLB entry violation, drop > it, and re-walk the page tables, discovering the RW entry. > > Does something similar happen here where the CPU sees the 2M/4k conflict > in the TLB, drops the 2M entry, does a re-walk then picks up the > newly-split 2M->4k entries? > > I can see how something like that would work, but it's _awfully_ subtle > to go unmentioned. I think the current logic relies on the fact that we don't actually have to remove entries from the RMP table to avoid unexpected RMP #PFs, because the owners of private PFNs are aware of what access restrictions would be active, and non-owners of private PFNs shouldn't be trying to write to them. It's only cases where a non-owner does a write via a 2MB+ mapping to that happens to overlap a private PFN that needs to be guarded against, and when a private PFN is removed from the directmap it will end up splitting the 2MB mapping, and in that cases there *is* a TLB flush that would wipe out the 2MB TLB entry. After that point, there may still be stale 4KB TLB entries that accrue, but by the time there is any sensible reason to use them to perform a write, the PFN would have been transitioned back to shared state and re-added to the directmap. But yes, it's ugly, and we've ended up re-working this to simply split the directmap via set_memory_4k(), which also does the expected TLB flushing of 2MB+ mappings, and we only call it when absolutely necessary, benefitting from the following optimizations: - if lookup_address() tells us the mapping is already split, no work splitting is needed - when the rmpupdate() is for a 2MB private page/RMP entry, there's no need to split the directmap, because there's no risk of a non-owner's writes overlapping with the private PFNs (RMP checks are only done on 4KB/2MB granularity, so even a write via a 1GB directmap mapping would not trigger an RMP fault since the non-owner's writes would be to a different 2MB PFN range. Since KVM/gmem generally allocate 2MB backing pages for private guest memory, the above optimizations result in the directmap only needing to be split, and only needing to acquire the cpa_lock, a handful of times for each guest. This also sort of gives us what we were looking for earlier: a way to defer the directmap split until it's actually needed. But rather than in bulk, it's amortized over time, and the more it's done, the more likely it is that the host is being used primarily to run SNP guests, and so the less likely it is that splitting the directmap is going to cause some sort of noticeable regression for non-SNP/non-guest workloads. I did some basic testing to compare this against pre-splitting the directmap, and I think it's pretty close performance-wise, so it seems like a good, safe default. It's also fairly trivial to add a kernel cmdline params or something later if someone wants to optimize specifically for SNP guests. System: EPYC, 512 cores, 1.5TB memory == Page-in/acceptance rate for 1 guests, each with 128 vCPUs/512M == directmap pre-split to 4K: 13.77GB/s 11.88GB/s 15.37GB/s directmap split on-demand: 14.77 GB/s 16.57 GB/s 15.53 GB/s == Page-in/acceptance rate for 4 guests, each with 128 vCPUs/256GB == directmap pre-split to 4K: 37.35 GB/s 33.37 GB/s 29.55 GB/s directmap split on-demand: 28.88 GB/s 25.13 GB/s 29.28 GB/s == Page-in/acceptance rate for 8 guests, each with 128 vCPUs/160GB == directmap pre-split to 4K: 24.39 GB/s 27.16 GB/s 24.70 GB/s direct split on-demand: 26.12 GB/s 24.44 GB/s 22.66 GB/s So for v2 we're using this on-demand approach, and I've tried to capture some of this rationale and in the commit log / comments for the relevant patch. Thanks, Mike
On Tue, Jan 16, 2024 at 10:12:26PM +0200, Mike Rapoport wrote: > On Tue, Jan 16, 2024 at 10:50:25AM -0600, Michael Roth wrote: > > On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote: > > > > > > The downside of course is potential impact for non-SNP workloads > > > resulting from splitting the directmap. Mike Rapoport's numbers make > > > me feel a little better about it, but I don't think they apply directly > > > to the notion of splitting the entire directmap. It's Even he LWN article > > > summarizes: > > When I ran the tests, I had the entire direct map forced to 4k or 2M pages. > There is indeed some impact and some tests suffer more than others but > there were also runs that benefit from smaller page size in the direct map, > at least if I remember correctly the results Intel folks posted a while > ago. I see, thanks for clarifying. Certainly helps to have this data if someone ends up wanting to investigate pre-splitting directmap to optimize SNP use-cases in the future. Still feel more comfortable introducing this as an optional tuneable though; can't help but worry about that *1* workload that somehow suffers perf regression and has us frantically re-working SNP implementation if we were to start off with making 4k directmap a requirement. > > > > "The conclusion from all of this, Rapoport continued, was that > > > direct-map fragmentation just does not matter — for data access, at > > > least. Using huge-page mappings does still appear to make a difference > > > for memory containing the kernel code, so allocator changes should > > > focus on code allocations — improving the layout of allocations for > > > loadable modules, for example, or allowing vmalloc() to allocate huge > > > pages for code. But, for kernel-data allocations, direct-map > > > fragmentation simply appears to not be worth worrying about." > > > > > > So at the very least, if we went down this path, we would be worth > > > investigating the following areas in addition to general perf testing: > > > > > > 1) Only splitting directmap regions corresponding to kernel-allocatable > > > *data* (hopefully that's even feasible...) > > Forcing the direct map to 4k pages does not affect the kernel text > mappings, they are created separately and they are not the part of the > kernel mapping of the physical memory. > Well, except the modules, but they are mapped with 4k pages anyway. Thanks! -Mike > > > -Mike > > -- > Sincerely yours, > Mike.
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c index ff9fa0a85a7f..ee182351d93a 100644 --- a/arch/x86/virt/svm/sev.c +++ b/arch/x86/virt/svm/sev.c @@ -369,14 +369,64 @@ int psmash(u64 pfn) } EXPORT_SYMBOL_GPL(psmash); +static int restore_direct_map(u64 pfn, int npages) +{ + int i, ret = 0; + + for (i = 0; i < npages; i++) { + ret = set_direct_map_default_noflush(pfn_to_page(pfn + i)); + if (ret) + break; + } + + if (ret) + pr_warn("Failed to restore direct map for pfn 0x%llx, ret: %d\n", + pfn + i, ret); + + return ret; +} + +static int invalidate_direct_map(u64 pfn, int npages) +{ + int i, ret = 0; + + for (i = 0; i < npages; i++) { + ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i)); + if (ret) + break; + } + + if (ret) { + pr_warn("Failed to invalidate direct map for pfn 0x%llx, ret: %d\n", + pfn + i, ret); + restore_direct_map(pfn, i); + } + + return ret; +} + static int rmpupdate(u64 pfn, struct rmp_state *state) { unsigned long paddr = pfn << PAGE_SHIFT; - int ret; + int ret, level, npages; if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP)) return -ENODEV; + level = RMP_TO_PG_LEVEL(state->pagesize); + npages = page_level_size(level) / PAGE_SIZE; + + /* + * If the kernel uses a 2MB directmap mapping to write to an address, + * and that 2MB range happens to contain a 4KB page that set to private + * in the RMP table, an RMP #PF will trigger and cause a host crash. + * + * Prevent this by removing pages from the directmap prior to setting + * them as private in the RMP table. + */ + if (state->assigned && invalidate_direct_map(pfn, npages)) + return -EFAULT; + do { /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */ asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE" @@ -386,12 +436,16 @@ static int rmpupdate(u64 pfn, struct rmp_state *state) } while (ret == RMPUPDATE_FAIL_OVERLAP); if (ret) { - pr_err("RMPUPDATE failed for PFN %llx, ret: %d\n", pfn, ret); + pr_err("RMPUPDATE failed for PFN %llx, pg_level: %d, ret: %d\n", + pfn, level, ret); dump_rmpentry(pfn); dump_stack(); return -EFAULT; } + if (!state->assigned && restore_direct_map(pfn, npages)) + return -EFAULT; + return 0; }