mbox series

[v1,0/1] update mseal.rst

Message ID 20240927185211.729207-1-jeffxu@chromium.org
Headers show
Series update mseal.rst | expand

Message

Jeff Xu Sept. 27, 2024, 6:52 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces
the can_modify_mm() function with an in-loop check, necessitates an update
to the mseal.rst documentation to reflect this change.

Furthermore, the document has received offline comments regarding the code
sample and suggestions for sentence clarification to enhance reader
comprehension.

[1] https://lore.kernel.org/linux-mm/20240817-mseal-depessimize-v3-0-d8d2e037df30@gmail.com/

Jeff Xu (1):
  mseal: update mseal.rst

 Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
 1 file changed, 136 insertions(+), 154 deletions(-)

Comments

Andrew Morton Sept. 27, 2024, 7:58 p.m. UTC | #1
On Fri, 27 Sep 2024 18:52:09 +0000 jeffxu@chromium.org wrote:

> From: Jeff Xu <jeffxu@chromium.org>
> 
> Update doc after in-loop change: mprotect/madvise can have
> partially updated and munmap is atomic.

Fixes:what?

I think 4a2dd02b0916 ("mm/mprotect: replace can_modify_mm with
can_modify_vma")?
Pedro Falcato Sept. 27, 2024, 10:59 p.m. UTC | #2
On Fri, Sep 27, 2024 at 06:52:09PM GMT, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Update doc after in-loop change: mprotect/madvise can have
> partially updated and munmap is atomic.
> 
> Fix indentation and clarify some sections to improve readability.
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
>  1 file changed, 136 insertions(+), 154 deletions(-)
> 
> diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> index 4132eec995a3..68986084e191 100644
> --- a/Documentation/userspace-api/mseal.rst
> +++ b/Documentation/userspace-api/mseal.rst
> @@ -23,177 +23,159 @@ applications can additionally seal security critical data at runtime.
>  A similar feature already exists in the XNU kernel with the
>  VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
>  
> -User API
> -========
> -mseal()
> ------------
> -The mseal() syscall has the following signature:
> -
> -``int mseal(void addr, size_t len, unsigned long flags)``
> -
> -**addr/len**: virtual memory address range.
> -
> -The address range set by ``addr``/``len`` must meet:
> -   - The start address must be in an allocated VMA.
> -   - The start address must be page aligned.
> -   - The end address (``addr`` + ``len``) must be in an allocated VMA.
> -   - no gap (unallocated memory) between start and end address.
> -
> -The ``len`` will be paged aligned implicitly by the kernel.
> -
> -**flags**: reserved for future use.
> -
> -**return values**:
> -
> -- ``0``: Success.
> -
> -- ``-EINVAL``:
> -    - Invalid input ``flags``.
> -    - The start address (``addr``) is not page aligned.
> -    - Address range (``addr`` + ``len``) overflow.
> -
> -- ``-ENOMEM``:
> -    - The start address (``addr``) is not allocated.
> -    - The end address (``addr`` + ``len``) is not allocated.
> -    - A gap (unallocated memory) between start and end address.
> -
> -- ``-EPERM``:
> -    - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> -
> -- For above error cases, users can expect the given memory range is
> -  unmodified, i.e. no partial update.
> -
> -- There might be other internal errors/cases not listed here, e.g.
> -  error during merging/splitting VMAs, or the process reaching the max
> -  number of supported VMAs. In those cases, partial updates to the given
> -  memory range could happen. However, those cases should be rare.
> -
> -**Blocked operations after sealing**:
> -    Unmapping, moving to another location, and shrinking the size,
> -    via munmap() and mremap(), can leave an empty space, therefore
> -    can be replaced with a VMA with a new set of attributes.
> -
> -    Moving or expanding a different VMA into the current location,
> -    via mremap().
> -
> -    Modifying a VMA via mmap(MAP_FIXED).
> -
> -    Size expansion, via mremap(), does not appear to pose any
> -    specific risks to sealed VMAs. It is included anyway because
> -    the use case is unclear. In any case, users can rely on
> -    merging to expand a sealed VMA.
> -
> -    mprotect() and pkey_mprotect().
> -
> -    Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
> -    for anonymous memory, when users don't have write permission to the
> -    memory. Those behaviors can alter region contents by discarding pages,
> -    effectively a memset(0) for anonymous memory.
> -
> -    Kernel will return -EPERM for blocked operations.
> -
> -    For blocked operations, one can expect the given address is unmodified,
> -    i.e. no partial update. Note, this is different from existing mm
> -    system call behaviors, where partial updates are made till an error is
> -    found and returned to userspace. To give an example:
> -
> -    Assume following code sequence:
> -
> -    - ptr = mmap(null, 8192, PROT_NONE);
> -    - munmap(ptr + 4096, 4096);
> -    - ret1 = mprotect(ptr, 8192, PROT_READ);
> -    - mseal(ptr, 4096);
> -    - ret2 = mprotect(ptr, 8192, PROT_NONE);
> -
> -    ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
> -
> -    ret2 will be -EPERM, the page remains to be PROT_READ.
> -
> -**Note**:
> -
> -- mseal() only works on 64-bit CPUs, not 32-bit CPU.
> -
> -- users can call mseal() multiple times, mseal() on an already sealed memory
> -  is a no-action (not error).
> -
> -- munseal() is not supported.
> +SYSCALL
> +=======
> +mseal syscall signature
> +-----------------------
> +   **int** mseal(**void \*** addr, **size_t** len, **unsigned long** flags)
> +
> +   **addr**/**len**: virtual memory address range.
> +      The address range set by **addr**/**len** must meet:
> +         - The start address must be in an allocated VMA.
> +         - The start address must be page aligned.
> +         - The end address (**addr** + **len**) must be in an allocated VMA.
> +         - no gap (unallocated memory) between start and end address.
> +
> +      The ``len`` will be paged aligned implicitly by the kernel.
> +
> +   **flags**: reserved for future use.
> +
> +   **Return values**:
> +      - **0**: Success.
> +      - **-EINVAL**:
> +         * Invalid input ``flags``.
> +         * The start address (``addr``) is not page aligned.
> +         * Address range (``addr`` + ``len``) overflow.
> +      - **-ENOMEM**:
> +         * The start address (``addr``) is not allocated.
> +         * The end address (``addr`` + ``len``) is not allocated.
> +         * A gap (unallocated memory) between start and end address.
> +      - **-EPERM**:
> +         * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> +
> +   **Note about error return**:
> +      - For above error cases, users can expect the given memory range is
> +        unmodified, i.e. no partial update.
> +      - There might be other internal errors/cases not listed here, e.g.
> +        error during merging/splitting VMAs, or the process reaching the max
> +        number of supported VMAs. In those cases, partial updates to the given
> +        memory range could happen. However, those cases should be rare.

How about turning the above into a man page?

> +   **Architecture support**:
> +      mseal only works on 64-bit CPUs, not 32-bit CPU.
> +
> +   **Idempotent**:
> +      users can call mseal multiple times, mseal on an already sealed memory
> +      is a no-action (not error).
> +
> +   **no munseal**
> +      Once mapping is sealed, it can't be unsealed. kernel should never
> +      have munseal, this is consistent with other sealing feature, e.g.
> +      F_SEAL_SEAL for file.
> +
> +Blocked mm syscall for sealed mapping
> +-------------------------------------
> +   It might be imporant to note: **once the mapping is sealed, it will
> +   stay in the process's memory till the process terminates**.
> +
> +   Example::
> +
> +         *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
> +         rc = mseal(ptr, 4096, 0);
> +         /* munmap will fail */
> +         rc = munmap(ptr, 4096);
> +         assert(rc < 0);
> +
> +   Blocked mm syscall:
> +      - munmap
> +      - mmap
> +      - mremap
> +      - mprotect and pkey_mprotect
> +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> +
> +   The first set of syscall to block is munmap, mremap, mmap. They can
> +   either leave an empty space in the address space, therefore allow
> +   replacement with a new mapping with new set of attributes, or can
> +   overwrite the existing mapping with another mapping.
> +
> +   mprotect and pkey_mprotect are blocked because they changes the
                                                          change
> +   protection bits (rwx) of the mapping.
> +
> +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> +   for anonymous memory, when users don't have write permission to the
> +   memory. Those behaviors can alter region contents by discarding pages,
> +   effectively a memset(0) for anonymous memory.

What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?

The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
on a read-only file mapping. e.g .text. This is obviously wrong?

> +
> +   Kernel will return -EPERM for blocked syscalls.
> +
> +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> +      - munmap: munmap is atomic. If one of VMAs in the given range is
> +        sealed, none of VMAs are updated.
> +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> +        when mprotect over multiple VMAs, mprotect might update the beginning
> +        VMAs before reaching the sealed VMA and return -EPERM.
> +      - mmap and mremap: undefined behavior.

mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)

>  
>  Use cases:
>  ==========
>  - glibc:
>    The dynamic linker, during loading ELF executables, can apply sealing to
> -  non-writable memory segments.
> +  mapping segments.
>  
>  - Chrome browser: protect some security sensitive data-structures.
>  
> -Notes on which memory to seal:
> -==============================
> -
> -It might be important to note that sealing changes the lifetime of a mapping,
> -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> -exec system call is invoked. Applications can apply sealing to any virtual
> -memory region from userspace, but it is crucial to thoroughly analyze the
> -mapping's lifetime prior to apply the sealing.
> +Don't use mseal on:
> +===================
> +Applications can apply sealing to any virtual memory region from userspace,
> +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> +till the process terminates or the exec system call is invoked.

There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
At least in its current form.

<snip>
> -
> -
> -Additional notes:
> -=================
>  As Jann Horn pointed out in [3], there are still a few ways to write
> -to RO memory, which is, in a way, by design. Those cases are not covered
> -by mseal(). If applications want to block such cases, sandbox tools (such as
> -seccomp, LSM, etc) might be considered.
> +to RO memory, which is, in a way, by design. And those could be blocked
> +by different security measures.
>  
>  Those cases are:
> -
> -- Write to read-only memory through /proc/self/mem interface.
> -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> -- userfaultfd.
> +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> +   - userfaultfd.

I don't understand how this is not a problem, but MADV_DONTNEED is.
To me it seems that what we have now is completely useless, because you can trivially
bypass it using /proc/self/mem, which is enabled on most Linux systems.

Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
for Chrome. They need to work with every other distro and application as well.

It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
/proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
as the traditional UNIX permission model allows.
Jeff Xu Sept. 28, 2024, 1:29 a.m. UTC | #3
Hi Pedro,

On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Sep 27, 2024 at 06:52:09PM GMT, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Update doc after in-loop change: mprotect/madvise can have
> > partially updated and munmap is atomic.
> >
> > Fix indentation and clarify some sections to improve readability.
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
> >  1 file changed, 136 insertions(+), 154 deletions(-)
> >
> > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > index 4132eec995a3..68986084e191 100644
> > --- a/Documentation/userspace-api/mseal.rst
> > +++ b/Documentation/userspace-api/mseal.rst
> > @@ -23,177 +23,159 @@ applications can additionally seal security critical data at runtime.
> >  A similar feature already exists in the XNU kernel with the
> >  VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
> >
> > -User API
> > -========
> > -mseal()
> > ------------
> > -The mseal() syscall has the following signature:
> > -
> > -``int mseal(void addr, size_t len, unsigned long flags)``
> > -
> > -**addr/len**: virtual memory address range.
> > -
> > -The address range set by ``addr``/``len`` must meet:
> > -   - The start address must be in an allocated VMA.
> > -   - The start address must be page aligned.
> > -   - The end address (``addr`` + ``len``) must be in an allocated VMA.
> > -   - no gap (unallocated memory) between start and end address.
> > -
> > -The ``len`` will be paged aligned implicitly by the kernel.
> > -
> > -**flags**: reserved for future use.
> > -
> > -**return values**:
> > -
> > -- ``0``: Success.
> > -
> > -- ``-EINVAL``:
> > -    - Invalid input ``flags``.
> > -    - The start address (``addr``) is not page aligned.
> > -    - Address range (``addr`` + ``len``) overflow.
> > -
> > -- ``-ENOMEM``:
> > -    - The start address (``addr``) is not allocated.
> > -    - The end address (``addr`` + ``len``) is not allocated.
> > -    - A gap (unallocated memory) between start and end address.
> > -
> > -- ``-EPERM``:
> > -    - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> > -
> > -- For above error cases, users can expect the given memory range is
> > -  unmodified, i.e. no partial update.
> > -
> > -- There might be other internal errors/cases not listed here, e.g.
> > -  error during merging/splitting VMAs, or the process reaching the max
> > -  number of supported VMAs. In those cases, partial updates to the given
> > -  memory range could happen. However, those cases should be rare.
> > -
> > -**Blocked operations after sealing**:
> > -    Unmapping, moving to another location, and shrinking the size,
> > -    via munmap() and mremap(), can leave an empty space, therefore
> > -    can be replaced with a VMA with a new set of attributes.
> > -
> > -    Moving or expanding a different VMA into the current location,
> > -    via mremap().
> > -
> > -    Modifying a VMA via mmap(MAP_FIXED).
> > -
> > -    Size expansion, via mremap(), does not appear to pose any
> > -    specific risks to sealed VMAs. It is included anyway because
> > -    the use case is unclear. In any case, users can rely on
> > -    merging to expand a sealed VMA.
> > -
> > -    mprotect() and pkey_mprotect().
> > -
> > -    Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
> > -    for anonymous memory, when users don't have write permission to the
> > -    memory. Those behaviors can alter region contents by discarding pages,
> > -    effectively a memset(0) for anonymous memory.
> > -
> > -    Kernel will return -EPERM for blocked operations.
> > -
> > -    For blocked operations, one can expect the given address is unmodified,
> > -    i.e. no partial update. Note, this is different from existing mm
> > -    system call behaviors, where partial updates are made till an error is
> > -    found and returned to userspace. To give an example:
> > -
> > -    Assume following code sequence:
> > -
> > -    - ptr = mmap(null, 8192, PROT_NONE);
> > -    - munmap(ptr + 4096, 4096);
> > -    - ret1 = mprotect(ptr, 8192, PROT_READ);
> > -    - mseal(ptr, 4096);
> > -    - ret2 = mprotect(ptr, 8192, PROT_NONE);
> > -
> > -    ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
> > -
> > -    ret2 will be -EPERM, the page remains to be PROT_READ.
> > -
> > -**Note**:
> > -
> > -- mseal() only works on 64-bit CPUs, not 32-bit CPU.
> > -
> > -- users can call mseal() multiple times, mseal() on an already sealed memory
> > -  is a no-action (not error).
> > -
> > -- munseal() is not supported.
> > +SYSCALL
> > +=======
> > +mseal syscall signature
> > +-----------------------
> > +   **int** mseal(**void \*** addr, **size_t** len, **unsigned long** flags)
> > +
> > +   **addr**/**len**: virtual memory address range.
> > +      The address range set by **addr**/**len** must meet:
> > +         - The start address must be in an allocated VMA.
> > +         - The start address must be page aligned.
> > +         - The end address (**addr** + **len**) must be in an allocated VMA.
> > +         - no gap (unallocated memory) between start and end address.
> > +
> > +      The ``len`` will be paged aligned implicitly by the kernel.
> > +
> > +   **flags**: reserved for future use.
> > +
> > +   **Return values**:
> > +      - **0**: Success.
> > +      - **-EINVAL**:
> > +         * Invalid input ``flags``.
> > +         * The start address (``addr``) is not page aligned.
> > +         * Address range (``addr`` + ``len``) overflow.
> > +      - **-ENOMEM**:
> > +         * The start address (``addr``) is not allocated.
> > +         * The end address (``addr`` + ``len``) is not allocated.
> > +         * A gap (unallocated memory) between start and end address.
> > +      - **-EPERM**:
> > +         * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> > +
> > +   **Note about error return**:
> > +      - For above error cases, users can expect the given memory range is
> > +        unmodified, i.e. no partial update.
> > +      - There might be other internal errors/cases not listed here, e.g.
> > +        error during merging/splitting VMAs, or the process reaching the max
> > +        number of supported VMAs. In those cases, partial updates to the given
> > +        memory range could happen. However, those cases should be rare.
>
> How about turning the above into a man page?
>
yes. I have a TODO to add a man page :-)

> > +   **Architecture support**:
> > +      mseal only works on 64-bit CPUs, not 32-bit CPU.
> > +
> > +   **Idempotent**:
> > +      users can call mseal multiple times, mseal on an already sealed memory
> > +      is a no-action (not error).
> > +
> > +   **no munseal**
> > +      Once mapping is sealed, it can't be unsealed. kernel should never
> > +      have munseal, this is consistent with other sealing feature, e.g.
> > +      F_SEAL_SEAL for file.
> > +
> > +Blocked mm syscall for sealed mapping
> > +-------------------------------------
> > +   It might be imporant to note: **once the mapping is sealed, it will
> > +   stay in the process's memory till the process terminates**.
> > +
> > +   Example::
> > +
> > +         *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
> > +         rc = mseal(ptr, 4096, 0);
> > +         /* munmap will fail */
> > +         rc = munmap(ptr, 4096);
> > +         assert(rc < 0);
> > +
> > +   Blocked mm syscall:
> > +      - munmap
> > +      - mmap
> > +      - mremap
> > +      - mprotect and pkey_mprotect
> > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > +
> > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > +   either leave an empty space in the address space, therefore allow
> > +   replacement with a new mapping with new set of attributes, or can
> > +   overwrite the existing mapping with another mapping.
> > +
> > +   mprotect and pkey_mprotect are blocked because they changes the
>                                                           change
> > +   protection bits (rwx) of the mapping.
> > +
> > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > +   for anonymous memory, when users don't have write permission to the
> > +   memory. Those behaviors can alter region contents by discarding pages,
> > +   effectively a memset(0) for anonymous memory.
>
> What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
>
MAP_FILE seems not used ?
anonymous mapping is the mapping that is not backed by a file.

> The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> on a read-only file mapping. e.g .text. This is obviously wrong?
>
When a MADV_DONTNEED is called, pages will be freed, on file-backed
mapping,  if the process reads from the mapping again, the content
will be retrieved from the file.

For anonymous mapping, since  there is no file backup, if process
reads from the mapping, 0 is filled, hence equivalent to memset(0)

> > +
> > +   Kernel will return -EPERM for blocked syscalls.
> > +
> > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > +        sealed, none of VMAs are updated.
> > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > +        VMAs before reaching the sealed VMA and return -EPERM.
> > +      - mmap and mremap: undefined behavior.
>
> mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
>
I'm not sure if I can declare mmap/mremap as atomic.

Although, it might be possible to achieve this due to munmap being
atomic. I'm not sure  as I didn't test this. Would you like to find
out ?

> >
> >  Use cases:
> >  ==========
> >  - glibc:
> >    The dynamic linker, during loading ELF executables, can apply sealing to
> > -  non-writable memory segments.
> > +  mapping segments.
> >
> >  - Chrome browser: protect some security sensitive data-structures.
> >
> > -Notes on which memory to seal:
> > -==============================
> > -
> > -It might be important to note that sealing changes the lifetime of a mapping,
> > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > -exec system call is invoked. Applications can apply sealing to any virtual
> > -memory region from userspace, but it is crucial to thoroughly analyze the
> > -mapping's lifetime prior to apply the sealing.
> > +Don't use mseal on:
> > +===================
> > +Applications can apply sealing to any virtual memory region from userspace,
> > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > +till the process terminates or the exec system call is invoked.
>
> There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> At least in its current form.
>
Ya, the mseal is not for most apps. I mention the malloc example to stress that.

> <snip>
> > -
> > -
> > -Additional notes:
> > -=================
> >  As Jann Horn pointed out in [3], there are still a few ways to write
> > -to RO memory, which is, in a way, by design. Those cases are not covered
> > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > -seccomp, LSM, etc) might be considered.
> > +to RO memory, which is, in a way, by design. And those could be blocked
> > +by different security measures.
> >
> >  Those cases are:
> > -
> > -- Write to read-only memory through /proc/self/mem interface.
> > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > -- userfaultfd.
> > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > +   - userfaultfd.
>
> I don't understand how this is not a problem, but MADV_DONTNEED is.
> To me it seems that what we have now is completely useless, because you can trivially
> bypass it using /proc/self/mem, which is enabled on most Linux systems.
>
> Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> for Chrome. They need to work with every other distro and application as well.
>
> It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> as the traditional UNIX permission model allows.
>
IMO, it is a matter of  Divide and Conquer.  In a nutshell, mseal only
prevents VMA's certain attributes (such as prot bits) from changing.
It doesn't mean to say that sealed RO memory is immutable. To achieve
that, the system needs to apply multiple security measures.

For writing to /proc/pid/mem, it can be disabled via [1].  SELINUX and
Landlock can achieve the same protection too.

[1] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.com/

-Jeff

> --
> Pedro
Pedro Falcato Sept. 28, 2024, 1:43 p.m. UTC | #4
On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> Hi Pedro,
> 
> On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
<snip>
> > > +
> > > +   Blocked mm syscall:
> > > +      - munmap
> > > +      - mmap
> > > +      - mremap
> > > +      - mprotect and pkey_mprotect
> > > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > +
> > > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > > +   either leave an empty space in the address space, therefore allow
> > > +   replacement with a new mapping with new set of attributes, or can
> > > +   overwrite the existing mapping with another mapping.
> > > +
> > > +   mprotect and pkey_mprotect are blocked because they changes the
> >                                                           change
> > > +   protection bits (rwx) of the mapping.
> > > +
> > > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > +   for anonymous memory, when users don't have write permission to the
> > > +   memory. Those behaviors can alter region contents by discarding pages,
> > > +   effectively a memset(0) for anonymous memory.
> >
> > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> >
> MAP_FILE seems not used ?
> anonymous mapping is the mapping that is not backed by a file.

MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.

> > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > on a read-only file mapping. e.g .text. This is obviously wrong?
> >
> When a MADV_DONTNEED is called, pages will be freed, on file-backed
> mapping,  if the process reads from the mapping again, the content
> will be retrieved from the file.
> 

Sorry, it was late and I gave you a crap example. Consider this:
a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.

The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.

The big problem here is with anon _pages_, not anon vmas.

> For anonymous mapping, since  there is no file backup, if process
> reads from the mapping, 0 is filled, hence equivalent to memset(0)
> 
> > > +
> > > +   Kernel will return -EPERM for blocked syscalls.
> > > +
> > > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > > +        sealed, none of VMAs are updated.
> > > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > > +        VMAs before reaching the sealed VMA and return -EPERM.
> > > +      - mmap and mremap: undefined behavior.
> >
> > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> >
> I'm not sure if I can declare mmap/mremap as atomic.
> 
> Although, it might be possible to achieve this due to munmap being
> atomic. I'm not sure  as I didn't test this. Would you like to find
> out ?

I just told you they use munmap under the hood. It's just that the requirement isn't actually
written down anywhere.

> 
> > >
> > >  Use cases:
> > >  ==========
> > >  - glibc:
> > >    The dynamic linker, during loading ELF executables, can apply sealing to
> > > -  non-writable memory segments.
> > > +  mapping segments.
> > >
> > >  - Chrome browser: protect some security sensitive data-structures.
> > >
> > > -Notes on which memory to seal:
> > > -==============================
> > > -
> > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > -mapping's lifetime prior to apply the sealing.
> > > +Don't use mseal on:
> > > +===================
> > > +Applications can apply sealing to any virtual memory region from userspace,
> > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > +till the process terminates or the exec system call is invoked.
> >
> > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > At least in its current form.
> >
> Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> 
> > <snip>
> > > -
> > > -
> > > -Additional notes:
> > > -=================
> > >  As Jann Horn pointed out in [3], there are still a few ways to write
> > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > -seccomp, LSM, etc) might be considered.
> > > +to RO memory, which is, in a way, by design. And those could be blocked
> > > +by different security measures.
> > >
> > >  Those cases are:
> > > -
> > > -- Write to read-only memory through /proc/self/mem interface.
> > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > -- userfaultfd.
> > > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > +   - userfaultfd.
> >
> > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > To me it seems that what we have now is completely useless, because you can trivially
> > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> >
> > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > for Chrome. They need to work with every other distro and application as well.
> >
> > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > as the traditional UNIX permission model allows.
> >
> IMO, it is a matter of  Divide and Conquer.  In a nutshell, mseal only
> prevents VMA's certain attributes (such as prot bits) from changing.
> It doesn't mean to say that sealed RO memory is immutable. To achieve
> that, the system needs to apply multiple security measures.

No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".

You haven't been able to quite specify what semantics are desirable out of this whole thing. Making
prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.

If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
or two bits of vma flags for their own uses :)

> 
> For writing to /proc/pid/mem, it can be disabled via [1].  SELINUX and
> Landlock can achieve the same protection too.

I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
Randy Dunlap Sept. 28, 2024, 6:28 p.m. UTC | #5
On 9/27/24 11:52 AM, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Update doc after in-loop change: mprotect/madvise can have
> partially updated and munmap is atomic.
> 
> Fix indentation and clarify some sections to improve readability.
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
>  1 file changed, 136 insertions(+), 154 deletions(-)
> 
> diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> index 4132eec995a3..68986084e191 100644
> --- a/Documentation/userspace-api/mseal.rst
> +++ b/Documentation/userspace-api/mseal.rst
> @@ -23,177 +23,159 @@ applications can additionally seal security critical data at runtime.
>  A similar feature already exists in the XNU kernel with the
>  VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
>  
> -User API
> -========
> -mseal()
> ------------
> -The mseal() syscall has the following signature:
> -
> -``int mseal(void addr, size_t len, unsigned long flags)``
> -
> -**addr/len**: virtual memory address range.
> -
> -The address range set by ``addr``/``len`` must meet:
> -   - The start address must be in an allocated VMA.
> -   - The start address must be page aligned.
> -   - The end address (``addr`` + ``len``) must be in an allocated VMA.
> -   - no gap (unallocated memory) between start and end address.
> -
> -The ``len`` will be paged aligned implicitly by the kernel.
> -
> -**flags**: reserved for future use.
> -
> -**return values**:
> -
> -- ``0``: Success.
> -
> -- ``-EINVAL``:
> -    - Invalid input ``flags``.
> -    - The start address (``addr``) is not page aligned.
> -    - Address range (``addr`` + ``len``) overflow.
> -
> -- ``-ENOMEM``:
> -    - The start address (``addr``) is not allocated.
> -    - The end address (``addr`` + ``len``) is not allocated.
> -    - A gap (unallocated memory) between start and end address.
> -
> -- ``-EPERM``:
> -    - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> -
> -- For above error cases, users can expect the given memory range is
> -  unmodified, i.e. no partial update.
> -
> -- There might be other internal errors/cases not listed here, e.g.
> -  error during merging/splitting VMAs, or the process reaching the max
> -  number of supported VMAs. In those cases, partial updates to the given
> -  memory range could happen. However, those cases should be rare.
> -
> -**Blocked operations after sealing**:
> -    Unmapping, moving to another location, and shrinking the size,
> -    via munmap() and mremap(), can leave an empty space, therefore
> -    can be replaced with a VMA with a new set of attributes.
> -
> -    Moving or expanding a different VMA into the current location,
> -    via mremap().
> -
> -    Modifying a VMA via mmap(MAP_FIXED).
> -
> -    Size expansion, via mremap(), does not appear to pose any
> -    specific risks to sealed VMAs. It is included anyway because
> -    the use case is unclear. In any case, users can rely on
> -    merging to expand a sealed VMA.
> -
> -    mprotect() and pkey_mprotect().
> -
> -    Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
> -    for anonymous memory, when users don't have write permission to the
> -    memory. Those behaviors can alter region contents by discarding pages,
> -    effectively a memset(0) for anonymous memory.
> -
> -    Kernel will return -EPERM for blocked operations.
> -
> -    For blocked operations, one can expect the given address is unmodified,
> -    i.e. no partial update. Note, this is different from existing mm
> -    system call behaviors, where partial updates are made till an error is
> -    found and returned to userspace. To give an example:
> -
> -    Assume following code sequence:
> -
> -    - ptr = mmap(null, 8192, PROT_NONE);
> -    - munmap(ptr + 4096, 4096);
> -    - ret1 = mprotect(ptr, 8192, PROT_READ);
> -    - mseal(ptr, 4096);
> -    - ret2 = mprotect(ptr, 8192, PROT_NONE);
> -
> -    ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
> -
> -    ret2 will be -EPERM, the page remains to be PROT_READ.
> -
> -**Note**:
> -
> -- mseal() only works on 64-bit CPUs, not 32-bit CPU.
> -
> -- users can call mseal() multiple times, mseal() on an already sealed memory
> -  is a no-action (not error).
> -
> -- munseal() is not supported.
> +SYSCALL
> +=======
> +mseal syscall signature
> +-----------------------
> +   **int** mseal(**void \*** addr, **size_t** len, **unsigned long** flags)

ugh. totally unreadable for people who just look at .rst files.

Does other documentation go to this extreme?

> +
> +   **addr**/**len**: virtual memory address range.
> +      The address range set by **addr**/**len** must meet:
> +         - The start address must be in an allocated VMA.
> +         - The start address must be page aligned.
> +         - The end address (**addr** + **len**) must be in an allocated VMA.
> +         - no gap (unallocated memory) between start and end address.
> +
> +      The ``len`` will be paged aligned implicitly by the kernel.
> +
> +   **flags**: reserved for future use.
> +
> +   **Return values**:
> +      - **0**: Success.
> +      - **-EINVAL**:
> +         * Invalid input ``flags``.
> +         * The start address (``addr``) is not page aligned.
> +         * Address range (``addr`` + ``len``) overflow.
> +      - **-ENOMEM**:
> +         * The start address (``addr``) is not allocated.
> +         * The end address (``addr`` + ``len``) is not allocated.
> +         * A gap (unallocated memory) between start and end address.
> +      - **-EPERM**:
> +         * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> +
> +   **Note about error return**:
> +      - For above error cases, users can expect the given memory range is
> +        unmodified, i.e. no partial update.
> +      - There might be other internal errors/cases not listed here, e.g.
> +        error during merging/splitting VMAs, or the process reaching the max
> +        number of supported VMAs. In those cases, partial updates to the given
> +        memory range could happen. However, those cases should be rare.
> +
> +   **Architecture support**:
> +      mseal only works on 64-bit CPUs, not 32-bit CPU.

	                                             CPUs.

> +
> +   **Idempotent**:
> +      users can call mseal multiple times, mseal on an already sealed memory

	                               times. mseal

> +      is a no-action (not error).
> +
> +   **no munseal**
> +      Once mapping is sealed, it can't be unsealed. kernel should never
> +      have munseal, this is consistent with other sealing feature, e.g.
> +      F_SEAL_SEAL for file.
> +
> +Blocked mm syscall for sealed mapping
> +-------------------------------------
> +   It might be imporant to note: **once the mapping is sealed, it will

                  important

> +   stay in the process's memory till the process terminates**.
> +
> +   Example::
> +
> +         *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
> +         rc = mseal(ptr, 4096, 0);
> +         /* munmap will fail */
> +         rc = munmap(ptr, 4096);
> +         assert(rc < 0);
> +
> +   Blocked mm syscall:
> +      - munmap
> +      - mmap
> +      - mremap
> +      - mprotect and pkey_mprotect
> +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> +
> +   The first set of syscall to block is munmap, mremap, mmap. They can
> +   either leave an empty space in the address space, therefore allow
> +   replacement with a new mapping with new set of attributes, or can
> +   overwrite the existing mapping with another mapping.
> +
> +   mprotect and pkey_mprotect are blocked because they changes the
> +   protection bits (rwx) of the mapping.

preferably            (RWX)

> +
> +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,

                       madvise

> +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> +   for anonymous memory, when users don't have write permission to the
> +   memory. Those behaviors can alter region contents by discarding pages,
> +   effectively a memset(0) for anonymous memory.
> +
> +   Kernel will return -EPERM for blocked syscalls.
> +
> +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> +      - munmap: munmap is atomic. If one of VMAs in the given range is
> +        sealed, none of VMAs are updated.
> +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> +        when mprotect over multiple VMAs, mprotect might update the beginning
> +        VMAs before reaching the sealed VMA and return -EPERM.
> +      - mmap and mremap: undefined behavior.
>  
>  Use cases:
>  ==========
>  - glibc:
>    The dynamic linker, during loading ELF executables, can apply sealing to
> -  non-writable memory segments.
> +  mapping segments.
>  
>  - Chrome browser: protect some security sensitive data-structures.

                                                     data structures.

>  
> -Notes on which memory to seal:
> -==============================
> -
> -It might be important to note that sealing changes the lifetime of a mapping,
> -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> -exec system call is invoked. Applications can apply sealing to any virtual
> -memory region from userspace, but it is crucial to thoroughly analyze the
> -mapping's lifetime prior to apply the sealing.
> +Don't use mseal on:

Drop the ':'. Headings should not uses trailing colons.
(throughout this file)

Maybe change the heading to "When not to use mseal".

> +===================
> +Applications can apply sealing to any virtual memory region from userspace,
> +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> +till the process terminates or the exec system call is invoked.

s/till/until/ preferably.

>  
>  For example:
> +   - aio/shm
> +     aio/shm can call mmap and  munmap on behalf of userspace, e.g.
> +     ksys_shmdt() in shm.c. The lifetime of those mapping are not tied to
> +     the lifetime of the process. If those memories are sealed from userspace,
> +     then munmap will fail, causing leaks in VMA address space during the
> +     lifetime of the process.
> +
> +   - ptr allocated by malloc (heap)
> +     Don't use mseal on the memory ptr return from malloc().
> +     malloc() is implemented by allocator, e.g. by glibc. Heap manager might
> +     allocate a ptr from brk or mapping created by mmap.
> +     If app calls mseal on ptr returned from malloc(), this can affect the heap

	If an app calls mseal on a ptr


> +     manager's ability to manage the mappings, the outcome is non-deterministic.

	                                mappings; the outcome

> +     Example::
> +
> +        ptr = malloc(size);
> +        /* don't call mseal on ptr return from malloc. */
> +        mseal(ptr, size);
> +        /* free will success, allocator can't shrink heap lower than ptr */
> +        free(ptr);
> +
> +mseal doesn't block:
> +====================
> +In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
> +attributes, such as protection bits (rwx). Sealed mappings doesn't mean the

preferably                             (RWX).

> +memory is immutable.
>  
> -- aio/shm
> -
> -  aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
> -  shm.c. The lifetime of those mapping are not tied to the lifetime of the

                lifetimes
?

> -  process. If those memories are sealed from userspace, then munmap() will fail,
> -  causing leaks in VMA address space during the lifetime of the process.
> -
> -- Brk (heap)
> -
> -  Currently, userspace applications can seal parts of the heap by calling
> -  malloc() and mseal().
> -  let's assume following calls from user space:
> -
> -  - ptr = malloc(size);
> -  - mprotect(ptr, size, RO);
> -  - mseal(ptr, size);
> -  - free(ptr);
> -
> -  Technically, before mseal() is added, the user can change the protection of
> -  the heap by calling mprotect(RO). As long as the user changes the protection
> -  back to RW before free(), the memory range can be reused.
> -
> -  Adding mseal() into the picture, however, the heap is then sealed partially,
> -  the user can still free it, but the memory remains to be RO. If the address
> -  is re-used by the heap manager for another malloc, the process might crash
> -  soon after. Therefore, it is important not to apply sealing to any memory
> -  that might get recycled.
> -
> -  Furthermore, even if the application never calls the free() for the ptr,
> -  the heap manager may invoke the brk system call to shrink the size of the
> -  heap. In the kernel, the brk-shrink will call munmap(). Consequently,
> -  depending on the location of the ptr, the outcome of brk-shrink is
> -  nondeterministic.
> -
> -
> -Additional notes:
> -=================
>  As Jann Horn pointed out in [3], there are still a few ways to write
> -to RO memory, which is, in a way, by design. Those cases are not covered
> -by mseal(). If applications want to block such cases, sandbox tools (such as
> -seccomp, LSM, etc) might be considered.
> +to RO memory, which is, in a way, by design. And those could be blocked
> +by different security measures.
>  
>  Those cases are:
> -
> -- Write to read-only memory through /proc/self/mem interface.
> -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> -- userfaultfd.
> +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> +   - userfaultfd.
>  
>  The idea that inspired this patch comes from Stephen Röttger’s work in V8
>  CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
>  
>  Reference:
>  ==========
> -[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> -
> -[2] https://man.openbsd.org/mimmutable.2
> -
> -[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
> -
> -[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
> +- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> +- [2] https://man.openbsd.org/mimmutable.2
> +- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
> +- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
Jeff Xu Oct. 1, 2024, 12:10 a.m. UTC | #6
Hi Randy,

On Sat, Sep 28, 2024 at 11:28 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 9/27/24 11:52 AM, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Update doc after in-loop change: mprotect/madvise can have
> > partially updated and munmap is atomic.
> >
> > Fix indentation and clarify some sections to improve readability.
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
> >  1 file changed, 136 insertions(+), 154 deletions(-)
> >
> > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > index 4132eec995a3..68986084e191 100644
> > --- a/Documentation/userspace-api/mseal.rst
> > +++ b/Documentation/userspace-api/mseal.rst
> > @@ -23,177 +23,159 @@ applications can additionally seal security critical data at runtime.
> >  A similar feature already exists in the XNU kernel with the
> >  VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
> >
> > -User API
> > -========
> > -mseal()
> > ------------
> > -The mseal() syscall has the following signature:
> > -
> > -``int mseal(void addr, size_t len, unsigned long flags)``
> > -
> > -**addr/len**: virtual memory address range.
> > -
> > -The address range set by ``addr``/``len`` must meet:
> > -   - The start address must be in an allocated VMA.
> > -   - The start address must be page aligned.
> > -   - The end address (``addr`` + ``len``) must be in an allocated VMA.
> > -   - no gap (unallocated memory) between start and end address.
> > -
> > -The ``len`` will be paged aligned implicitly by the kernel.
> > -
> > -**flags**: reserved for future use.
> > -
> > -**return values**:
> > -
> > -- ``0``: Success.
> > -
> > -- ``-EINVAL``:
> > -    - Invalid input ``flags``.
> > -    - The start address (``addr``) is not page aligned.
> > -    - Address range (``addr`` + ``len``) overflow.
> > -
> > -- ``-ENOMEM``:
> > -    - The start address (``addr``) is not allocated.
> > -    - The end address (``addr`` + ``len``) is not allocated.
> > -    - A gap (unallocated memory) between start and end address.
> > -
> > -- ``-EPERM``:
> > -    - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> > -
> > -- For above error cases, users can expect the given memory range is
> > -  unmodified, i.e. no partial update.
> > -
> > -- There might be other internal errors/cases not listed here, e.g.
> > -  error during merging/splitting VMAs, or the process reaching the max
> > -  number of supported VMAs. In those cases, partial updates to the given
> > -  memory range could happen. However, those cases should be rare.
> > -
> > -**Blocked operations after sealing**:
> > -    Unmapping, moving to another location, and shrinking the size,
> > -    via munmap() and mremap(), can leave an empty space, therefore
> > -    can be replaced with a VMA with a new set of attributes.
> > -
> > -    Moving or expanding a different VMA into the current location,
> > -    via mremap().
> > -
> > -    Modifying a VMA via mmap(MAP_FIXED).
> > -
> > -    Size expansion, via mremap(), does not appear to pose any
> > -    specific risks to sealed VMAs. It is included anyway because
> > -    the use case is unclear. In any case, users can rely on
> > -    merging to expand a sealed VMA.
> > -
> > -    mprotect() and pkey_mprotect().
> > -
> > -    Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
> > -    for anonymous memory, when users don't have write permission to the
> > -    memory. Those behaviors can alter region contents by discarding pages,
> > -    effectively a memset(0) for anonymous memory.
> > -
> > -    Kernel will return -EPERM for blocked operations.
> > -
> > -    For blocked operations, one can expect the given address is unmodified,
> > -    i.e. no partial update. Note, this is different from existing mm
> > -    system call behaviors, where partial updates are made till an error is
> > -    found and returned to userspace. To give an example:
> > -
> > -    Assume following code sequence:
> > -
> > -    - ptr = mmap(null, 8192, PROT_NONE);
> > -    - munmap(ptr + 4096, 4096);
> > -    - ret1 = mprotect(ptr, 8192, PROT_READ);
> > -    - mseal(ptr, 4096);
> > -    - ret2 = mprotect(ptr, 8192, PROT_NONE);
> > -
> > -    ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
> > -
> > -    ret2 will be -EPERM, the page remains to be PROT_READ.
> > -
> > -**Note**:
> > -
> > -- mseal() only works on 64-bit CPUs, not 32-bit CPU.
> > -
> > -- users can call mseal() multiple times, mseal() on an already sealed memory
> > -  is a no-action (not error).
> > -
> > -- munseal() is not supported.
> > +SYSCALL
> > +=======
> > +mseal syscall signature
> > +-----------------------
> > +   **int** mseal(**void \*** addr, **size_t** len, **unsigned long** flags)
>
> ugh. totally unreadable for people who just look at .rst files.
>
Right!
The ** (bold) in rst collides with a pointer in c. it does make it
unreadable. I will  remove them.

> Does other documentation go to this extreme?
>
> > +
> > +   **addr**/**len**: virtual memory address range.
> > +      The address range set by **addr**/**len** must meet:
> > +         - The start address must be in an allocated VMA.
> > +         - The start address must be page aligned.
> > +         - The end address (**addr** + **len**) must be in an allocated VMA.
> > +         - no gap (unallocated memory) between start and end address.
> > +
> > +      The ``len`` will be paged aligned implicitly by the kernel.
> > +
> > +   **flags**: reserved for future use.
> > +
> > +   **Return values**:
> > +      - **0**: Success.
> > +      - **-EINVAL**:
> > +         * Invalid input ``flags``.
> > +         * The start address (``addr``) is not page aligned.
> > +         * Address range (``addr`` + ``len``) overflow.
> > +      - **-ENOMEM**:
> > +         * The start address (``addr``) is not allocated.
> > +         * The end address (``addr`` + ``len``) is not allocated.
> > +         * A gap (unallocated memory) between start and end address.
> > +      - **-EPERM**:
> > +         * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
> > +
> > +   **Note about error return**:
> > +      - For above error cases, users can expect the given memory range is
> > +        unmodified, i.e. no partial update.
> > +      - There might be other internal errors/cases not listed here, e.g.
> > +        error during merging/splitting VMAs, or the process reaching the max
> > +        number of supported VMAs. In those cases, partial updates to the given
> > +        memory range could happen. However, those cases should be rare.
> > +
> > +   **Architecture support**:
> > +      mseal only works on 64-bit CPUs, not 32-bit CPU.
>
>                                                      CPUs.
>
Done, thanks.

> > +
> > +   **Idempotent**:
> > +      users can call mseal multiple times, mseal on an already sealed memory
>
>                                        times. mseal
>
Done, thanks.

> > +      is a no-action (not error).
> > +
> > +   **no munseal**
> > +      Once mapping is sealed, it can't be unsealed. kernel should never
> > +      have munseal, this is consistent with other sealing feature, e.g.
> > +      F_SEAL_SEAL for file.
> > +
> > +Blocked mm syscall for sealed mapping
> > +-------------------------------------
> > +   It might be imporant to note: **once the mapping is sealed, it will
>
>                   important
>
Done, thanks.

> > +   stay in the process's memory till the process terminates**.
> > +
> > +   Example::
> > +
> > +         *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
> > +         rc = mseal(ptr, 4096, 0);
> > +         /* munmap will fail */
> > +         rc = munmap(ptr, 4096);
> > +         assert(rc < 0);
> > +
> > +   Blocked mm syscall:
> > +      - munmap
> > +      - mmap
> > +      - mremap
> > +      - mprotect and pkey_mprotect
> > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > +
> > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > +   either leave an empty space in the address space, therefore allow
> > +   replacement with a new mapping with new set of attributes, or can
> > +   overwrite the existing mapping with another mapping.
> > +
> > +   mprotect and pkey_mprotect are blocked because they changes the
> > +   protection bits (rwx) of the mapping.
>
> preferably            (RWX)
>
Done.

> > +
> > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
>
>                        madvise
>
Done.

> > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > +   for anonymous memory, when users don't have write permission to the
> > +   memory. Those behaviors can alter region contents by discarding pages,
> > +   effectively a memset(0) for anonymous memory.
> > +
> > +   Kernel will return -EPERM for blocked syscalls.
> > +
> > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > +        sealed, none of VMAs are updated.
> > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > +        VMAs before reaching the sealed VMA and return -EPERM.
> > +      - mmap and mremap: undefined behavior.
> >
> >  Use cases:
> >  ==========
> >  - glibc:
> >    The dynamic linker, during loading ELF executables, can apply sealing to
> > -  non-writable memory segments.
> > +  mapping segments.
> >
> >  - Chrome browser: protect some security sensitive data-structures.
>
>                                                      data structures.
>
Done. Thanks.

> >
> > -Notes on which memory to seal:
> > -==============================
> > -
> > -It might be important to note that sealing changes the lifetime of a mapping,
> > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > -exec system call is invoked. Applications can apply sealing to any virtual
> > -memory region from userspace, but it is crucial to thoroughly analyze the
> > -mapping's lifetime prior to apply the sealing.
> > +Don't use mseal on:
>
> Drop the ':'. Headings should not uses trailing colons.
> (throughout this file)
>
Done everywhere. Thanks.

> Maybe change the heading to "When not to use mseal".
>
> > +===================
> > +Applications can apply sealing to any virtual memory region from userspace,
> > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > +till the process terminates or the exec system call is invoked.
>
> s/till/until/ preferably.
>
ok.

> >
> >  For example:
> > +   - aio/shm
> > +     aio/shm can call mmap and  munmap on behalf of userspace, e.g.
> > +     ksys_shmdt() in shm.c. The lifetime of those mapping are not tied to
> > +     the lifetime of the process. If those memories are sealed from userspace,
> > +     then munmap will fail, causing leaks in VMA address space during the
> > +     lifetime of the process.
> > +
> > +   - ptr allocated by malloc (heap)
> > +     Don't use mseal on the memory ptr return from malloc().
> > +     malloc() is implemented by allocator, e.g. by glibc. Heap manager might
> > +     allocate a ptr from brk or mapping created by mmap.
> > +     If app calls mseal on ptr returned from malloc(), this can affect the heap
>
>         If an app calls mseal on a ptr
>
Done. thanks

>
> > +     manager's ability to manage the mappings, the outcome is non-deterministic.
>
>                                         mappings; the outcome
>
Done, thanks.

> > +     Example::
> > +
> > +        ptr = malloc(size);
> > +        /* don't call mseal on ptr return from malloc. */
> > +        mseal(ptr, size);
> > +        /* free will success, allocator can't shrink heap lower than ptr */
> > +        free(ptr);
> > +
> > +mseal doesn't block:
> > +====================
> > +In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
> > +attributes, such as protection bits (rwx). Sealed mappings doesn't mean the
>
> preferably                             (RWX).
>
Done, thanks.

> > +memory is immutable.
> >
> > -- aio/shm
> > -
> > -  aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
> > -  shm.c. The lifetime of those mapping are not tied to the lifetime of the
>
>                 lifetimes
OK.
> ?
>
> > -  process. If those memories are sealed from userspace, then munmap() will fail,
> > -  causing leaks in VMA address space during the lifetime of the process.
> > -
> > -- Brk (heap)
> > -
> > -  Currently, userspace applications can seal parts of the heap by calling
> > -  malloc() and mseal().
> > -  let's assume following calls from user space:
> > -
> > -  - ptr = malloc(size);
> > -  - mprotect(ptr, size, RO);
> > -  - mseal(ptr, size);
> > -  - free(ptr);
> > -
> > -  Technically, before mseal() is added, the user can change the protection of
> > -  the heap by calling mprotect(RO). As long as the user changes the protection
> > -  back to RW before free(), the memory range can be reused.
> > -
> > -  Adding mseal() into the picture, however, the heap is then sealed partially,
> > -  the user can still free it, but the memory remains to be RO. If the address
> > -  is re-used by the heap manager for another malloc, the process might crash
> > -  soon after. Therefore, it is important not to apply sealing to any memory
> > -  that might get recycled.
> > -
> > -  Furthermore, even if the application never calls the free() for the ptr,
> > -  the heap manager may invoke the brk system call to shrink the size of the
> > -  heap. In the kernel, the brk-shrink will call munmap(). Consequently,
> > -  depending on the location of the ptr, the outcome of brk-shrink is
> > -  nondeterministic.
> > -
> > -
> > -Additional notes:
> > -=================
> >  As Jann Horn pointed out in [3], there are still a few ways to write
> > -to RO memory, which is, in a way, by design. Those cases are not covered
> > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > -seccomp, LSM, etc) might be considered.
> > +to RO memory, which is, in a way, by design. And those could be blocked
> > +by different security measures.
> >
> >  Those cases are:
> > -
> > -- Write to read-only memory through /proc/self/mem interface.
> > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > -- userfaultfd.
> > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > +   - userfaultfd.
> >
> >  The idea that inspired this patch comes from Stephen Röttger’s work in V8
> >  CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
> >
> >  Reference:
> >  ==========
> > -[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> > -
> > -[2] https://man.openbsd.org/mimmutable.2
> > -
> > -[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
> > -
> > -[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
> > +- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> > +- [2] https://man.openbsd.org/mimmutable.2
> > +- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
> > +- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc

I will update and send v2.

Appreciate the review comments.
-Jeff
Jeff Xu Oct. 1, 2024, 12:24 a.m. UTC | #7
Hi Pedro

On Sat, Sep 28, 2024 at 6:43 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> > Hi Pedro,
> >
> > On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> <snip>
> > > > +
> > > > +   Blocked mm syscall:
> > > > +      - munmap
> > > > +      - mmap
> > > > +      - mremap
> > > > +      - mprotect and pkey_mprotect
> > > > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > > +
> > > > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > > > +   either leave an empty space in the address space, therefore allow
> > > > +   replacement with a new mapping with new set of attributes, or can
> > > > +   overwrite the existing mapping with another mapping.
> > > > +
> > > > +   mprotect and pkey_mprotect are blocked because they changes the
> > >                                                           change
> > > > +   protection bits (rwx) of the mapping.
> > > > +
> > > > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > > +   for anonymous memory, when users don't have write permission to the
> > > > +   memory. Those behaviors can alter region contents by discarding pages,
> > > > +   effectively a memset(0) for anonymous memory.
> > >
> > > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> > >
> > MAP_FILE seems not used ?
> > anonymous mapping is the mapping that is not backed by a file.
>
> MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.
>
OK, we are on the same page for this.

> > > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > > on a read-only file mapping. e.g .text. This is obviously wrong?
> > >
> > When a MADV_DONTNEED is called, pages will be freed, on file-backed
> > mapping,  if the process reads from the mapping again, the content
> > will be retrieved from the file.
> >
>
> Sorry, it was late and I gave you a crap example. Consider this:
> a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.
>
> The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.
>
> The big problem here is with anon _pages_, not anon vmas.
>
That depends on the app's threat-model. What you described seems to be
a case below
1. The file is rw
2. The process opens the file as rw
3. the process mmap the fd as rw
4 The process writes the memory, and the change isn't flushed to the
file on disk.
5 The process  changes the mapping to RO
6. The process seals the mapping
7. The process is called MADV_DONTNEED , and because the change isn't
flush to file on disk, so it loses the change, (retrieve the old data
from disk when read from the mapped address later)

I'm not sure this is a valid use case, the problem here seems to be
that the app needs to flush the change from memory to disk if the
expectation is writing is permanent.

In any case, the mseal currently just blocks a subset of madvise, those
we know with a security implication.  If there is something mseal needs
to block additionally, one can always extend it by using the "flags" field.
I do think the bar is high though, e.g. a valid use case to support that.

> > For anonymous mapping, since  there is no file backup, if process
> > reads from the mapping, 0 is filled, hence equivalent to memset(0)
> >
> > > > +
> > > > +   Kernel will return -EPERM for blocked syscalls.
> > > > +
> > > > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > > > +        sealed, none of VMAs are updated.
> > > > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > > > +        VMAs before reaching the sealed VMA and return -EPERM.
> > > > +      - mmap and mremap: undefined behavior.
> > >
> > > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> > >
> > I'm not sure if I can declare mmap/mremap as atomic.
> >
> > Although, it might be possible to achieve this due to munmap being
> > atomic. I'm not sure  as I didn't test this. Would you like to find
> > out ?
>
> I just told you they use munmap under the hood. It's just that the requirement isn't actually
> written down anywhere.
>
I knew about mmap/mremap calling munmap. I don't know what exactly you
are asking though. In your patch and its discussion, you did not mention
the mmap/mremap (for sealing) is or should be atomic.

My point is: since there isn't a clear statement from your patch description
or POSIX, that mremap/mmap is atomic,  and I haven't tested it myself with
regards to sealing, let's  leave them as "undefined" for now. (I could get back
to this later after the merging window)

> >
> > > >
> > > >  Use cases:
> > > >  ==========
> > > >  - glibc:
> > > >    The dynamic linker, during loading ELF executables, can apply sealing to
> > > > -  non-writable memory segments.
> > > > +  mapping segments.
> > > >
> > > >  - Chrome browser: protect some security sensitive data-structures.
> > > >
> > > > -Notes on which memory to seal:
> > > > -==============================
> > > > -
> > > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > > -mapping's lifetime prior to apply the sealing.
> > > > +Don't use mseal on:
> > > > +===================
> > > > +Applications can apply sealing to any virtual memory region from userspace,
> > > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > > +till the process terminates or the exec system call is invoked.
> > >
> > > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > > At least in its current form.
> > >
> > Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> >
> > > <snip>
> > > > -
> > > > -
> > > > -Additional notes:
> > > > -=================
> > > >  As Jann Horn pointed out in [3], there are still a few ways to write
> > > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > > -seccomp, LSM, etc) might be considered.
> > > > +to RO memory, which is, in a way, by design. And those could be blocked
> > > > +by different security measures.
> > > >
> > > >  Those cases are:
> > > > -
> > > > -- Write to read-only memory through /proc/self/mem interface.
> > > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > -- userfaultfd.
> > > > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > +   - userfaultfd.
> > >
> > > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > > To me it seems that what we have now is completely useless, because you can trivially
> > > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> > >
> > > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > > for Chrome. They need to work with every other distro and application as well.
> > >
> > > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > > /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > > as the traditional UNIX permission model allows.
> > >
> > IMO, it is a matter of  Divide and Conquer.  In a nutshell, mseal only
> > prevents VMA's certain attributes (such as prot bits) from changing.
> > It doesn't mean to say that sealed RO memory is immutable. To achieve
> > that, the system needs to apply multiple security measures.
>
> No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
> immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
> If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".
>
> You haven't been able to quite specify what semantics are desirable out of this whole thing. Making
> prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
> have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
> mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
> if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
> make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.
>
> If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
> implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
> or two bits of vma flags for their own uses :)
>
> >
> > For writing to /proc/pid/mem, it can be disabled via [1].  SELINUX and
> > Landlock can achieve the same protection too.
>
> I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
>
It is a choice you can make :-)

Linux is diverse, from desktop to mobile to cloud hosting to embedded systems.
For a safe-by-default system, some of them might like those security
enhancements.

Thanks
-Jeff


-Jeff

> --
> Pedro
Pedro Falcato Oct. 4, 2024, 5:02 p.m. UTC | #8
On Mon, Sep 30, 2024 at 05:24:39PM -0700, Jeff Xu wrote:
> Hi Pedro
> 
> On Sat, Sep 28, 2024 at 6:43 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> > > Hi Pedro,
> > >
> > > On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > <snip>
> > > > > +
> > > > > +   Blocked mm syscall:
> > > > > +      - munmap
> > > > > +      - mmap
> > > > > +      - mremap
> > > > > +      - mprotect and pkey_mprotect
> > > > > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > > > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > > > +
> > > > > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > > > > +   either leave an empty space in the address space, therefore allow
> > > > > +   replacement with a new mapping with new set of attributes, or can
> > > > > +   overwrite the existing mapping with another mapping.
> > > > > +
> > > > > +   mprotect and pkey_mprotect are blocked because they changes the
> > > >                                                           change
> > > > > +   protection bits (rwx) of the mapping.
> > > > > +
> > > > > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > > > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > > > +   for anonymous memory, when users don't have write permission to the
> > > > > +   memory. Those behaviors can alter region contents by discarding pages,
> > > > > +   effectively a memset(0) for anonymous memory.
> > > >
> > > > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> > > >
> > > MAP_FILE seems not used ?
> > > anonymous mapping is the mapping that is not backed by a file.
> >
> > MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.
> >
> OK, we are on the same page for this.
> 
> > > > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > > > on a read-only file mapping. e.g .text. This is obviously wrong?
> > > >
> > > When a MADV_DONTNEED is called, pages will be freed, on file-backed
> > > mapping,  if the process reads from the mapping again, the content
> > > will be retrieved from the file.
> > >
> >
> > Sorry, it was late and I gave you a crap example. Consider this:
> > a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.
> >
> > The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.
> >
> > The big problem here is with anon _pages_, not anon vmas.
> >
> That depends on the app's threat-model. What you described seems to be
> a case below
> 1. The file is rw
> 2. The process opens the file as rw
> 3. the process mmap the fd as rw
> 4 The process writes the memory, and the change isn't flushed to the
> file on disk.
> 5 The process  changes the mapping to RO
> 6. The process seals the mapping
> 7. The process is called MADV_DONTNEED , and because the change isn't
> flush to file on disk, so it loses the change, (retrieve the old data
> from disk when read from the mapped address later)
> 
> I'm not sure this is a valid use case, the problem here seems to be
> that the app needs to flush the change from memory to disk if the
> expectation is writing is permanent.
>

MAP_PRIVATE never does writeback. That's not what this is about.
I can trivially discard anonymous pages for private "file VMAs", which aren't
refilled with the exact same contents. This is a problem.

> In any case, the mseal currently just blocks a subset of madvise, those
> we know with a security implication.  If there is something mseal needs
> to block additionally, one can always extend it by using the "flags" field.
> I do think the bar is high though, e.g. a valid use case to support that.

No, this has nothing to do with a flag. It's about providing sane semantics.

> 
> > > For anonymous mapping, since  there is no file backup, if process
> > > reads from the mapping, 0 is filled, hence equivalent to memset(0)
> > >
> > > > > +
> > > > > +   Kernel will return -EPERM for blocked syscalls.
> > > > > +
> > > > > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > > > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > > > > +        sealed, none of VMAs are updated.
> > > > > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > > > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > > > > +        VMAs before reaching the sealed VMA and return -EPERM.
> > > > > +      - mmap and mremap: undefined behavior.
> > > >
> > > > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > > > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> > > >
> > > I'm not sure if I can declare mmap/mremap as atomic.
> > >
> > > Although, it might be possible to achieve this due to munmap being
> > > atomic. I'm not sure  as I didn't test this. Would you like to find
> > > out ?
> >
> > I just told you they use munmap under the hood. It's just that the requirement isn't actually
> > written down anywhere.
> >
> I knew about mmap/mremap calling munmap. I don't know what exactly you
> are asking though. In your patch and its discussion, you did not mention
> the mmap/mremap (for sealing) is or should be atomic.
> 
> My point is: since there isn't a clear statement from your patch description
> or POSIX, that mremap/mmap is atomic,  and I haven't tested it myself with
> regards to sealing, let's  leave them as "undefined" for now. (I could get back
> to this later after the merging window)
> 
> > >
> > > > >
> > > > >  Use cases:
> > > > >  ==========
> > > > >  - glibc:
> > > > >    The dynamic linker, during loading ELF executables, can apply sealing to
> > > > > -  non-writable memory segments.
> > > > > +  mapping segments.
> > > > >
> > > > >  - Chrome browser: protect some security sensitive data-structures.
> > > > >
> > > > > -Notes on which memory to seal:
> > > > > -==============================
> > > > > -
> > > > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > > > -mapping's lifetime prior to apply the sealing.
> > > > > +Don't use mseal on:
> > > > > +===================
> > > > > +Applications can apply sealing to any virtual memory region from userspace,
> > > > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > > > +till the process terminates or the exec system call is invoked.
> > > >
> > > > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > > > At least in its current form.
> > > >
> > > Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> > >
> > > > <snip>
> > > > > -
> > > > > -
> > > > > -Additional notes:
> > > > > -=================
> > > > >  As Jann Horn pointed out in [3], there are still a few ways to write
> > > > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > > > -seccomp, LSM, etc) might be considered.
> > > > > +to RO memory, which is, in a way, by design. And those could be blocked
> > > > > +by different security measures.
> > > > >
> > > > >  Those cases are:
> > > > > -
> > > > > -- Write to read-only memory through /proc/self/mem interface.
> > > > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > -- userfaultfd.
> > > > > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > > > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > +   - userfaultfd.
> > > >
> > > > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > > > To me it seems that what we have now is completely useless, because you can trivially
> > > > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> > > >
> > > > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > > > for Chrome. They need to work with every other distro and application as well.
> > > >
> > > > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > > > /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > > > as the traditional UNIX permission model allows.
> > > >
> > > IMO, it is a matter of  Divide and Conquer.  In a nutshell, mseal only
> > > prevents VMA's certain attributes (such as prot bits) from changing.
> > > It doesn't mean to say that sealed RO memory is immutable. To achieve
> > > that, the system needs to apply multiple security measures.
> >
> > No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
> > immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
> > If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".
> >
> > You haven't been able to quite specify what semantics are desirable out of this whole thing. Making
> > prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
> > have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
> > mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
> > if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
> > make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.
> >
> > If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
> > implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
> > or two bits of vma flags for their own uses :)
> >

Please reply to the above. We're struggling to understand exactly what semantics you want from this.
*That* is what we want to document and get set in stone, and we'll move from there.

> > >
> > > For writing to /proc/pid/mem, it can be disabled via [1].  SELINUX and
> > > Landlock can achieve the same protection too.
> >
> > I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
> >
> It is a choice you can make :-)

Your feature needs to work without "extra choices".
Jeff Xu Oct. 14, 2024, 4:49 a.m. UTC | #9
Hi Pedro

On Fri, Oct 4, 2024 at 10:02 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Mon, Sep 30, 2024 at 05:24:39PM -0700, Jeff Xu wrote:
> > Hi Pedro
> >
> > On Sat, Sep 28, 2024 at 6:43 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> > > > Hi Pedro,
> > > >
> > > > On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > > <snip>
> > > > > > +
> > > > > > +   Blocked mm syscall:
> > > > > > +      - munmap
> > > > > > +      - mmap
> > > > > > +      - mremap
> > > > > > +      - mprotect and pkey_mprotect
> > > > > > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > > > > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > > > > +
> > > > > > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > > > > > +   either leave an empty space in the address space, therefore allow
> > > > > > +   replacement with a new mapping with new set of attributes, or can
> > > > > > +   overwrite the existing mapping with another mapping.
> > > > > > +
> > > > > > +   mprotect and pkey_mprotect are blocked because they changes the
> > > > >                                                           change
> > > > > > +   protection bits (rwx) of the mapping.
> > > > > > +
> > > > > > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > > > > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > > > > +   for anonymous memory, when users don't have write permission to the
> > > > > > +   memory. Those behaviors can alter region contents by discarding pages,
> > > > > > +   effectively a memset(0) for anonymous memory.
> > > > >
> > > > > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> > > > >
> > > > MAP_FILE seems not used ?
> > > > anonymous mapping is the mapping that is not backed by a file.
> > >
> > > MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.
> > >
> > OK, we are on the same page for this.
> >
> > > > > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > > > > on a read-only file mapping. e.g .text. This is obviously wrong?
> > > > >
> > > > When a MADV_DONTNEED is called, pages will be freed, on file-backed
> > > > mapping,  if the process reads from the mapping again, the content
> > > > will be retrieved from the file.
> > > >
> > >
> > > Sorry, it was late and I gave you a crap example. Consider this:
> > > a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.
> > >
> > > The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.
> > >
> > > The big problem here is with anon _pages_, not anon vmas.
> > >
> > That depends on the app's threat-model. What you described seems to be
> > a case below
> > 1. The file is rw
> > 2. The process opens the file as rw
> > 3. the process mmap the fd as rw
> > 4 The process writes the memory, and the change isn't flushed to the
> > file on disk.
> > 5 The process  changes the mapping to RO
> > 6. The process seals the mapping
> > 7. The process is called MADV_DONTNEED , and because the change isn't
> > flush to file on disk, so it loses the change, (retrieve the old data
> > from disk when read from the mapped address later)
> >
> > I'm not sure this is a valid use case, the problem here seems to be
> > that the app needs to flush the change from memory to disk if the
> > expectation is writing is permanent.
> >
>
> MAP_PRIVATE never does writeback. That's not what this is about.
> I can trivially discard anonymous pages for private "file VMAs", which aren't
> refilled with the exact same contents. This is a problem.
>
That is fair, I appreciate you providing the use case.

I think mseal should support this . In addition, after reviewing
MADV_DONTNEED, I believe we should also allow madvise(DONTNEED) when
PROT is PROT_NONE.

I will work on fixes for those, but before that,  I like to make sure
some of the existing fixes are backported to 6.10, which makes it easy
to backport future fixes.

> > In any case, the mseal currently just blocks a subset of madvise, those
> > we know with a security implication.  If there is something mseal needs
> > to block additionally, one can always extend it by using the "flags" field.
> > I do think the bar is high though, e.g. a valid use case to support that.
>
> No, this has nothing to do with a flag. It's about providing sane semantics.
>
> >
> > > > For anonymous mapping, since  there is no file backup, if process
> > > > reads from the mapping, 0 is filled, hence equivalent to memset(0)
> > > >
> > > > > > +
> > > > > > +   Kernel will return -EPERM for blocked syscalls.
> > > > > > +
> > > > > > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > > > > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > > > > > +        sealed, none of VMAs are updated.
> > > > > > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > > > > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > > > > > +        VMAs before reaching the sealed VMA and return -EPERM.
> > > > > > +      - mmap and mremap: undefined behavior.
> > > > >
> > > > > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > > > > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> > > > >
> > > > I'm not sure if I can declare mmap/mremap as atomic.
> > > >
> > > > Although, it might be possible to achieve this due to munmap being
> > > > atomic. I'm not sure  as I didn't test this. Would you like to find
> > > > out ?
> > >
> > > I just told you they use munmap under the hood. It's just that the requirement isn't actually
> > > written down anywhere.
> > >
> > I knew about mmap/mremap calling munmap. I don't know what exactly you
> > are asking though. In your patch and its discussion, you did not mention
> > the mmap/mremap (for sealing) is or should be atomic.
> >
> > My point is: since there isn't a clear statement from your patch description
> > or POSIX, that mremap/mmap is atomic,  and I haven't tested it myself with
> > regards to sealing, let's  leave them as "undefined" for now. (I could get back
> > to this later after the merging window)
> >
> > > >
> > > > > >
> > > > > >  Use cases:
> > > > > >  ==========
> > > > > >  - glibc:
> > > > > >    The dynamic linker, during loading ELF executables, can apply sealing to
> > > > > > -  non-writable memory segments.
> > > > > > +  mapping segments.
> > > > > >
> > > > > >  - Chrome browser: protect some security sensitive data-structures.
> > > > > >
> > > > > > -Notes on which memory to seal:
> > > > > > -==============================
> > > > > > -
> > > > > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > > > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > > > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > > > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > > > > -mapping's lifetime prior to apply the sealing.
> > > > > > +Don't use mseal on:
> > > > > > +===================
> > > > > > +Applications can apply sealing to any virtual memory region from userspace,
> > > > > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > > > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > > > > +till the process terminates or the exec system call is invoked.
> > > > >
> > > > > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > > > > At least in its current form.
> > > > >
> > > > Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> > > >
> > > > > <snip>
> > > > > > -
> > > > > > -
> > > > > > -Additional notes:
> > > > > > -=================
> > > > > >  As Jann Horn pointed out in [3], there are still a few ways to write
> > > > > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > > > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > > > > -seccomp, LSM, etc) might be considered.
> > > > > > +to RO memory, which is, in a way, by design. And those could be blocked
> > > > > > +by different security measures.
> > > > > >
> > > > > >  Those cases are:
> > > > > > -
> > > > > > -- Write to read-only memory through /proc/self/mem interface.
> > > > > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > > -- userfaultfd.
> > > > > > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > > > > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > > +   - userfaultfd.
> > > > >
> > > > > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > > > > To me it seems that what we have now is completely useless, because you can trivially
> > > > > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> > > > >
> > > > > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > > > > for Chrome. They need to work with every other distro and application as well.
> > > > >
> > > > > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > > > > /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > > > > as the traditional UNIX permission model allows.
> > > > >
> > > > IMO, it is a matter of  Divide and Conquer.  In a nutshell, mseal only
> > > > prevents VMA's certain attributes (such as prot bits) from changing.
> > > > It doesn't mean to say that sealed RO memory is immutable. To achieve
> > > > that, the system needs to apply multiple security measures.
> > >
> > > No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
> > > immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
> > > If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".
> > >
> > > You haven't been able to quite specify what semantics are desirable out of this whole thing. Making
> > > prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
> > > have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
> > > mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
> > > if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
> > > make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.
> > >
> > > If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
> > > implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
> > > or two bits of vma flags for their own uses :)
> > >
>
> Please reply to the above. We're struggling to understand exactly what semantics you want from this.
> *That* is what we want to document and get set in stone, and we'll move from there.
>
If you meant to make mseal to support blocking /proc/self/mem or
ptrace or other cases, I welcome that.  Please go ahead to implement
it with a flag, the mseal already has a field for such future
extension.

The current  mseal's semantic doesn't come out from vacuum, for system
such as ChromeOS and Android, they are already relying heavily on
security mechanisms to block /proc/self/mem or ptrace, as I pointed
out previously, SELINUX/YAMA/Landlock/seccomp, and most recently
CONFIG_PROC_MEM_RESTRICT_WRITE_ALL all contributes to that, therefore
I don't have a use case or needs to make mseal to additionally block
those. But I understand others might have different choices on which
security mechanism to use, and I don't want to argue about which
requirements are correct. Security isn't a true/false binary state,
each system is different and no single security feature can achieve
the absolute "secure" state.

Thanks
-Jeff



> > > >
> > > > For writing to /proc/pid/mem, it can be disabled via [1].  SELINUX and
> > > > Landlock can achieve the same protection too.
> > >
> > > I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
> > >
> > It is a choice you can make :-)
>
> Your feature needs to work without "extra choices".
>
> --
> Pedro