Message ID | 521d99c08b975fb06a1e7201e971cc24d68196d1.1740139449.git.lorenzo.stoakes@oracle.com |
---|---|
State | Accepted |
Commit | 8e2f2aeb8b48aceef6e6c07b2d9bede4eaa50c06 |
Headers | show |
Series | [1/2] fs/proc/task_mmu: add guard region bit to pagemap | expand |
On Mon, Feb 24, 2025 at 11:37:24AM +0100, David Hildenbrand wrote: > On 24.02.25 11:18, Lorenzo Stoakes wrote: > > On Mon, Feb 24, 2025 at 10:27:28AM +0100, David Hildenbrand wrote: > > > On 21.02.25 13:05, Lorenzo Stoakes wrote: > > > > Currently there is no means by which users can determine whether a given > > > > page in memory is in fact a guard region, that is having had the > > > > MADV_GUARD_INSTALL madvise() flag applied to it. > > > > > > > > This is intentional, as to provide this information in VMA metadata would > > > > contradict the intent of the feature (providing a means to change fault > > > > behaviour at a page table level rather than a VMA level), and would require > > > > VMA metadata operations to scan page tables, which is unacceptable. > > > > > > > > In many cases, users have no need to reflect and determine what regions > > > > have been designated guard regions, as it is the user who has established > > > > them in the first place. > > > > > > > > But in some instances, such as monitoring software, or software that relies > > > > upon being able to ascertain the nature of mappings within a remote process > > > > for instance, it becomes useful to be able to determine which pages have > > > > the guard region marker applied. > > > > > > > > This patch makes use of an unused pagemap bit (58) to provide this > > > > information. > > > > > > > > This patch updates the documentation at the same time as making the change > > > > such that the implementation of the feature and the documentation of it are > > > > tied together. > > > > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > --- > > > > > > > > > Acked-by: David Hildenbrand <david@redhat.com> > > > > Thanks! :) > > > > > > Something that might be interesting is also extending the PAGEMAP_SCAN > > > ioctl. > > > > Yeah, funny you should mention that, I did see that, but on reading the man > > page it struck me that it requires the region to be uffd afaict? All the > > tests seem to establish uffd, and the man page implies it: > > > > To start tracking the written state (flag) of a page or range of > > memory, the UFFD_FEATURE_WP_ASYNC must be enabled by UFFDIO_API > > ioctl(2) on userfaultfd and memory range must be registered with > > UFFDIO_REGISTER ioctl(2) in UFFDIO_REGISTER_MODE_WP mode. > > > > It would be a bit of a weird edge case to add support there. I was excited > > when I first saw this ioctl, then disappointed afterwards... but maybe I > > got it wrong? > > > > I never managed to review that fully, but I thing that UFFD_FEATURE_WP_ASYNC > thingy is only required for PM_SCAN_CHECK_WPASYNC and PM_SCAN_WP_MATCHING. > > See pagemap_scan_test_walk(). > > I do recall that it works on any VMA. Oh ok well that's handy then! > > Ah yes, tools/testing/selftests/mm/vm_util.c ends up using it for > pagemap_is_swapped() and friends via page_entry_is() to sanity check that > what pagemap gives us is consistent with what pagemap_scan gives us. > > So it should work independent of the uffd magic. > I might be wrong, though ... No a quick glance makes me think you're right actually. > > > > > > > > > > See do_pagemap_scan(). > > > > > > The benefit here might be that one could effectively search/filter for guard > > > regions without copying 64bit per base-page to user space. > > > > > > But the idea would be to indicate something like PAGE_IS_GUARD_REGION as a > > > category when we hit a guard region entry in pagemap_page_category(). > > > > > > (the code is a bit complicated, and I am not sure why we indicate > > > PAGE_IS_SWAPPED for non-swap entries, likely wrong ...) > > > > Yeah, I could go on here about how much I hate how uffd does a 'parallel > > implementation' of a ton of stuff and then chucks in if (uffd) { go do > > something weird + wonderful } but I'll resist the urge :P :)) > > > > Do you think, if it were uffd-specific, this would be useful? > > If it really is completely uffd-specific for now, I agree that we should > rather leave it alone. Yeah agreed. > > > > > At any rate, I'm not sure it's _hugely_ beneficial in this form as pagemap > > is binary in any case so you're not having to deal with overhead of parsing > > a text file at least! > > My thinking was, that if you have a large VMA, with ordinary pagemap you > have to copy 8byte per entry (and have room for that somewhere in user > space). In theory, with the scanning feature, you can leave that ... > scanning to the kernel and don't have to do any copying/allocate space for > it in user space etc. That makes perfect sense! I think this one will go a little lower on priorities + I'll come back to it but I"ll put it on the one reliable todo list I have, the whiteboard in my home office :) everything on that list at least eventually gets looked at, majority get done. > > -- > Cheers, > > David / dhildenb > Great minds think alike though ;) as soon as I saw this I did think about extending it, but seems I mistakenly dismissed for uffd reasons. Cheers, Lorenzo
On Mon, Feb 24, 2025 at 2:39 AM David Hildenbrand <david@redhat.com> wrote: > > On 24.02.25 11:18, Lorenzo Stoakes wrote: > > On Mon, Feb 24, 2025 at 10:27:28AM +0100, David Hildenbrand wrote: > >> On 21.02.25 13:05, Lorenzo Stoakes wrote: > >>> Currently there is no means by which users can determine whether a given > >>> page in memory is in fact a guard region, that is having had the > >>> MADV_GUARD_INSTALL madvise() flag applied to it. > >>> > >>> This is intentional, as to provide this information in VMA metadata would > >>> contradict the intent of the feature (providing a means to change fault > >>> behaviour at a page table level rather than a VMA level), and would require > >>> VMA metadata operations to scan page tables, which is unacceptable. > >>> > >>> In many cases, users have no need to reflect and determine what regions > >>> have been designated guard regions, as it is the user who has established > >>> them in the first place. > >>> > >>> But in some instances, such as monitoring software, or software that relies > >>> upon being able to ascertain the nature of mappings within a remote process > >>> for instance, it becomes useful to be able to determine which pages have > >>> the guard region marker applied. > >>> > >>> This patch makes use of an unused pagemap bit (58) to provide this > >>> information. > >>> > >>> This patch updates the documentation at the same time as making the change > >>> such that the implementation of the feature and the documentation of it are > >>> tied together. > >>> > >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >>> --- > >> > >> > >> Acked-by: David Hildenbrand <david@redhat.com> > > > > Thanks! :) > >> > >> Something that might be interesting is also extending the PAGEMAP_SCAN > >> ioctl. > > > > Yeah, funny you should mention that, I did see that, but on reading the man > > page it struck me that it requires the region to be uffd afaict? All the > > tests seem to establish uffd, and the man page implies it: > > > > To start tracking the written state (flag) of a page or range of > > memory, the UFFD_FEATURE_WP_ASYNC must be enabled by UFFDIO_API > > ioctl(2) on userfaultfd and memory range must be registered with > > UFFDIO_REGISTER ioctl(2) in UFFDIO_REGISTER_MODE_WP mode. > > > > It would be a bit of a weird edge case to add support there. I was excited > > when I first saw this ioctl, then disappointed afterwards... but maybe I > > got it wrong? > > > > I never managed to review that fully, but I thing that > UFFD_FEATURE_WP_ASYNC thingy is only required for PM_SCAN_CHECK_WPASYNC > and PM_SCAN_WP_MATCHING. > > See pagemap_scan_test_walk(). > > I do recall that it works on any VMA. > > Ah yes, tools/testing/selftests/mm/vm_util.c ends up using it for > pagemap_is_swapped() and friends via page_entry_is() to sanity check > that what pagemap gives us is consistent with what pagemap_scan gives us. > > So it should work independent of the uffd magic. > I might be wrong, though ... PAGEMAP_SCAN can work without the UFFD magic. CRIU utilizes PAGEMAP_SCAN as a more efficient alternative to /proc/pid/pagemap: https://github.com/checkpoint-restore/criu/blob/d18912fc88f3dc7bde5fdfa3575691977eb21753/criu/pagemap-cache.c#L178 For CRIU, obtaining information about guard regions is critical. Without this functionality in the kernel, CRIU is broken. We probably should consider backporting these changes to the 6.13 and 6.14 stable branches. > > >> > >> > >> See do_pagemap_scan(). > >> > >> The benefit here might be that one could effectively search/filter for guard > >> regions without copying 64bit per base-page to user space. > >> > >> But the idea would be to indicate something like PAGE_IS_GUARD_REGION as a > >> category when we hit a guard region entry in pagemap_page_category(). > >> > >> (the code is a bit complicated, and I am not sure why we indicate > >> PAGE_IS_SWAPPED for non-swap entries, likely wrong ...) > > > > Yeah, I could go on here about how much I hate how uffd does a 'parallel > > implementation' of a ton of stuff and then chucks in if (uffd) { go do > > something weird + wonderful } but I'll resist the urge :P :)) > > > > Do you think, if it were uffd-specific, this would be useful? > > If it really is completely uffd-specific for now, I agree that we should > rather leave it alone. > > > > > At any rate, I'm not sure it's _hugely_ beneficial in this form as pagemap > > is binary in any case so you're not having to deal with overhead of parsing > > a text file at least! > > My thinking was, that if you have a large VMA, with ordinary pagemap you > have to copy 8byte per entry (and have room for that somewhere in user > space). In theory, with the scanning feature, you can leave that ... > scanning to the kernel and don't have to do any copying/allocate space > for it in user space etc. PAGEMAP_SCAN doesn't have this issue and it was one of the reasons to implement it. Thanks, Andrei
+cc Greg for stable question On Wed, Mar 19, 2025 at 11:22:40AM -0700, Andrei Vagin wrote: > On Mon, Feb 24, 2025 at 2:39 AM David Hildenbrand <david@redhat.com> wrote: > > > > On 24.02.25 11:18, Lorenzo Stoakes wrote: [snip] > > >> > > >> Acked-by: David Hildenbrand <david@redhat.com> > > > > > > Thanks! :) > > >> > > >> Something that might be interesting is also extending the PAGEMAP_SCAN > > >> ioctl. > > > > > > Yeah, funny you should mention that, I did see that, but on reading the man > > > page it struck me that it requires the region to be uffd afaict? All the > > > tests seem to establish uffd, and the man page implies it: > > > > > > To start tracking the written state (flag) of a page or range of > > > memory, the UFFD_FEATURE_WP_ASYNC must be enabled by UFFDIO_API > > > ioctl(2) on userfaultfd and memory range must be registered with > > > UFFDIO_REGISTER ioctl(2) in UFFDIO_REGISTER_MODE_WP mode. > > > > > > It would be a bit of a weird edge case to add support there. I was excited > > > when I first saw this ioctl, then disappointed afterwards... but maybe I > > > got it wrong? > > > > > > > > I never managed to review that fully, but I thing that > > UFFD_FEATURE_WP_ASYNC thingy is only required for PM_SCAN_CHECK_WPASYNC > > and PM_SCAN_WP_MATCHING. > > > > See pagemap_scan_test_walk(). > > > > I do recall that it works on any VMA. > > > > Ah yes, tools/testing/selftests/mm/vm_util.c ends up using it for > > pagemap_is_swapped() and friends via page_entry_is() to sanity check > > that what pagemap gives us is consistent with what pagemap_scan gives us. > > > > So it should work independent of the uffd magic. > > I might be wrong, though ... > > > PAGEMAP_SCAN can work without the UFFD magic. CRIU utilizes PAGEMAP_SCAN > as a more efficient alternative to /proc/pid/pagemap: > https://github.com/checkpoint-restore/criu/blob/d18912fc88f3dc7bde5fdfa3575691977eb21753/criu/pagemap-cache.c#L178 > Yeah we ascertained that - is on my list, LSF coming up next week means we aren't great on timing here, but I'll prioritise this. When I'm back. > For CRIU, obtaining information about guard regions is critical. > Without this functionality in the kernel, CRIU is broken. We probably should > consider backporting these changes to the 6.13 and 6.14 stable branches. > I'm not sure on precedent for backporting a feature like this - Greg? Am happy to do it though. As a stop gap we can backport the pagemap feature if Greg feels this is appropriate? [snip] > > My thinking was, that if you have a large VMA, with ordinary pagemap you > > have to copy 8byte per entry (and have room for that somewhere in user > > space). In theory, with the scanning feature, you can leave that ... > > scanning to the kernel and don't have to do any copying/allocate space > > for it in user space etc. > > PAGEMAP_SCAN doesn't have this issue and it was one of the reasons to > implement it. Ack. > > Thanks, > Andrei
On Wed, Mar 19, 2025 at 07:12:45PM +0000, Lorenzo Stoakes wrote: > +cc Greg for stable question > > On Wed, Mar 19, 2025 at 11:22:40AM -0700, Andrei Vagin wrote: > > On Mon, Feb 24, 2025 at 2:39 AM David Hildenbrand <david@redhat.com> wrote: > > > > > > On 24.02.25 11:18, Lorenzo Stoakes wrote: > > [snip] > > > >> > > > >> Acked-by: David Hildenbrand <david@redhat.com> > > > > > > > > Thanks! :) > > > >> > > > >> Something that might be interesting is also extending the PAGEMAP_SCAN > > > >> ioctl. > > > > > > > > Yeah, funny you should mention that, I did see that, but on reading the man > > > > page it struck me that it requires the region to be uffd afaict? All the > > > > tests seem to establish uffd, and the man page implies it: > > > > > > > > To start tracking the written state (flag) of a page or range of > > > > memory, the UFFD_FEATURE_WP_ASYNC must be enabled by UFFDIO_API > > > > ioctl(2) on userfaultfd and memory range must be registered with > > > > UFFDIO_REGISTER ioctl(2) in UFFDIO_REGISTER_MODE_WP mode. > > > > > > > > It would be a bit of a weird edge case to add support there. I was excited > > > > when I first saw this ioctl, then disappointed afterwards... but maybe I > > > > got it wrong? > > > > > > > > > > > > I never managed to review that fully, but I thing that > > > UFFD_FEATURE_WP_ASYNC thingy is only required for PM_SCAN_CHECK_WPASYNC > > > and PM_SCAN_WP_MATCHING. > > > > > > See pagemap_scan_test_walk(). > > > > > > I do recall that it works on any VMA. > > > > > > Ah yes, tools/testing/selftests/mm/vm_util.c ends up using it for > > > pagemap_is_swapped() and friends via page_entry_is() to sanity check > > > that what pagemap gives us is consistent with what pagemap_scan gives us. > > > > > > So it should work independent of the uffd magic. > > > I might be wrong, though ... > > > > > > PAGEMAP_SCAN can work without the UFFD magic. CRIU utilizes PAGEMAP_SCAN > > as a more efficient alternative to /proc/pid/pagemap: > > https://github.com/checkpoint-restore/criu/blob/d18912fc88f3dc7bde5fdfa3575691977eb21753/criu/pagemap-cache.c#L178 > > > > Yeah we ascertained that - is on my list, LSF coming up next week means we > aren't great on timing here, but I'll prioritise this. When I'm back. > > > For CRIU, obtaining information about guard regions is critical. > > Without this functionality in the kernel, CRIU is broken. We probably should > > consider backporting these changes to the 6.13 and 6.14 stable branches. > > > > I'm not sure on precedent for backporting a feature like this - Greg? Am > happy to do it though. If it's a regression, sure, we can take it for stable. > As a stop gap we can backport the pagemap feature if Greg feels this is > appropriate? Which do the maintainers of the code feel is appropriate? I'll defer to them for making that call :) thanks, greg k-h
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index caba0f52dd36..a297e824f990 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,7 +21,8 @@ There are four components to pagemap: * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see Documentation/admin-guide/mm/userfaultfd.rst) - * Bits 58-60 zero + * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f02cd362309a..c17615e21a5d 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1632,6 +1632,7 @@ struct pagemapread { #define PM_SOFT_DIRTY BIT_ULL(55) #define PM_MMAP_EXCLUSIVE BIT_ULL(56) #define PM_UFFD_WP BIT_ULL(57) +#define PM_GUARD_REGION BIT_ULL(58) #define PM_FILE BIT_ULL(61) #define PM_SWAP BIT_ULL(62) #define PM_PRESENT BIT_ULL(63) @@ -1732,6 +1733,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, page = pfn_swap_entry_to_page(entry); if (pte_marker_entry_uffd_wp(entry)) flags |= PM_UFFD_WP; + if (is_guard_swp_entry(entry)) + flags |= PM_GUARD_REGION; } if (page) { @@ -1931,7 +1934,8 @@ static const struct mm_walk_ops pagemap_ops = { * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped * Bit 57 pte is uffd-wp write-protected - * Bits 58-60 zero + * Bit 58 pte is a guard region + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon * Bit 62 page swapped * Bit 63 page present
Currently there is no means by which users can determine whether a given page in memory is in fact a guard region, that is having had the MADV_GUARD_INSTALL madvise() flag applied to it. This is intentional, as to provide this information in VMA metadata would contradict the intent of the feature (providing a means to change fault behaviour at a page table level rather than a VMA level), and would require VMA metadata operations to scan page tables, which is unacceptable. In many cases, users have no need to reflect and determine what regions have been designated guard regions, as it is the user who has established them in the first place. But in some instances, such as monitoring software, or software that relies upon being able to ascertain the nature of mappings within a remote process for instance, it becomes useful to be able to determine which pages have the guard region marker applied. This patch makes use of an unused pagemap bit (58) to provide this information. This patch updates the documentation at the same time as making the change such that the implementation of the feature and the documentation of it are tied together. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- Documentation/admin-guide/mm/pagemap.rst | 3 ++- fs/proc/task_mmu.c | 6 +++++- 2 files changed, 7 insertions(+), 2 deletions(-)