[0/3] selftests/mm: virtual_address_range: Two bugfixes and a cleanup

Message ID	20250107-virtual_address_range-tests-v1-0-3834a2fb47fe@linutronix.de
Headers	show Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C72741F3D28; Tue, 7 Jan 2025 15:14:56 +0000 (UTC) From: =?utf-8?q?Thomas_Wei=C3=9Fschuh?= <thomas.weissschuh@linutronix.de> Subject: [PATCH 0/3] selftests/mm: virtual_address_range: Two bugfixes and a cleanup Date: Tue, 07 Jan 2025 16:14:44 +0100 Message-Id: <20250107-virtual_address_range-tests-v1-0-3834a2fb47fe@linutronix.de> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit To: Andrew Morton <akpm@linux-foundation.org>, Shuah Khan <shuah@kernel.org>, Dev Jain <dev.jain@arm.com>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org, =?utf-8?q?Thomas_Wei=C3=9Fschuh?= <thomas.weissschuh@linutronix.de>, stable@vger.kernel.org, kernel test robot <oliver.sang@intel.com>
Series	selftests/mm: virtual_address_range: Two bugfixes and a cleanup \| expand [0/3] selftests/mm: virtual_address_range: Two bugfixes and a cleanup [1/3] selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB [2/3] selftests/mm: virtual_address_range: Avoid reading VVAR mappings [3/3] selftests/mm: virtual_address_range: Dump to /dev/null

Thomas Weißschuh Jan. 7, 2025, 3:14 p.m. UTC

The selftest started failing since commit e93d2521b27f
("x86/vdso: Split virtual clock pages into dedicated mapping")
was merged. While debugging I stumbled upon another bug and potential
cleanup.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Thomas Weißschuh (3):
      selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB
      selftests/mm: virtual_address_range: Avoid reading VVAR mappings
      selftests/mm: virtual_address_range: Dump to /dev/null

 tools/testing/selftests/mm/virtual_address_range.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)
---
base-commit: fbfd64d25c7af3b8695201ebc85efe90be28c5a3
change-id: 20250107-virtual_address_range-tests-95843766fa97

Best regards,

David Hildenbrand Jan. 8, 2025, 1:30 p.m. UTC | #1

On 08.01.25 07:09, Dev Jain wrote:
> 
> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>> During the execution of validate_complete_va_space() a lot of memory is
>> on the VM subsystem. When running on a low memory subsystem an OOM may
>> be triggered, when writing to the dump file as the filesystem may also
>> require memory.
>>
>> On my test system with 1100MiB physical memory:
>>
>> 	Tasks state (memory values in pages):
>> 	[  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
>> 	[     57]     0    57 34359215953      695      256        0       439 1064390656        0             0 virtual_address
>>
>> 	Out of memory: Killed process 57 (virtual_address) total-vm:137436863812kB, anon-rss:1024kB, file-rss:0kB, shmem-rss:1756kB, UID:0 pgtables:1039444kB oom_score_adj:0
>> 	<snip>
>> 	fault_in_iov_iter_readable+0x4a/0xd0
>> 	generic_perform_write+0x9c/0x280
>> 	shmem_file_write_iter+0x86/0x90
>> 	vfs_write+0x29c/0x480
>> 	ksys_write+0x6c/0xe0
>> 	do_syscall_64+0x9e/0x1a0
>> 	entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>
>> Write the dumped data into /dev/null instead which does not require
>> additional memory during write(), making the code simpler as a
>> side-effect.
>>
>> Signed-off-by: Thomas Weißschuh<thomas.weissschuh@linutronix.de>
>> ---
>>   tools/testing/selftests/mm/virtual_address_range.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>> index 484f82c7b7c871f82a7d9ec6d6c649f2ab1eb0cd..4042fd878acd702d23da2c3293292de33bd48143 100644
>> --- a/tools/testing/selftests/mm/virtual_address_range.c
>> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>> @@ -103,10 +103,9 @@ static int validate_complete_va_space(void)
>>   	FILE *file;
>>   	int fd;
>>   
>> -	fd = open("va_dump", O_CREAT | O_WRONLY, 0600);
>> -	unlink("va_dump");
>> +	fd = open("/dev/null", O_WRONLY);
>>   	if (fd < 0) {
>> -		ksft_test_result_skip("cannot create or open dump file\n");
>> +		ksft_test_result_skip("cannot create or open /dev/null\n");
>>   		ksft_finished();
>>   	}
 >>   >> @@ -152,7 +151,6 @@ static int validate_complete_va_space(void)
>>   		while (start_addr + hop < end_addr) {
>>   			if (write(fd, (void *)(start_addr + hop), 1) != 1)
>>   				return 1;
>> -			lseek(fd, 0, SEEK_SET);
>>   
>>   			hop += MAP_CHUNK_SIZE;
>>   		}
>>
> 
> The reason I had not used /dev/null was that write() was succeeding to /dev/null
> even from an address not in my VA space. I was puzzled about this behaviour of
> /dev/null and I chose to ignore it and just use a real file.
> 
> To test this behaviour, run the following program:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> intmain()
> {
> intfd;
> fd = open("va_dump", O_CREAT| O_WRONLY, 0600);
> unlink("va_dump");
> // fd = open("/dev/null", O_WRONLY);
> intret = munmap((void*)(1UL<< 30), 100);
> if(!ret)
> printf("munmap succeeded\n");
> intres = write(fd, (void*)(1UL<< 30), 1);
> if(res == 1)
> printf("write succeeded\n");
> return0;
> }
> The write will fail as expected, but if you comment out the va_dump
> lines and use /dev/null, the write will succeed.

What exactly do we want to achieve with the write? Verify that the 
output of /proc/self/map is reasonable and we can actually resolve a 
fault / map a page?

Why not access the memory directly+signal handler or using 
/proc/self/mem, so you can avoid the temp file completely?

David Hildenbrand Jan. 8, 2025, 4:46 p.m. UTC | #2

On 08.01.25 17:13, Thomas Weißschuh wrote:
> On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
>> On 08.01.25 09:05, Thomas Weißschuh wrote:
>>> On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
>>>>
>>>> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>>>>> If not enough physical memory is available the kernel may fail mmap();
>>>>> see __vm_enough_memory() and vm_commit_limit().
>>>>> In that case the logic in validate_complete_va_space() does not make
>>>>> sense and will even incorrectly fail.
>>>>> Instead skip the test if no mmap() succeeded.
>>>>>
>>>>> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
>>>>> Cc: stable@vger.kernel.org
>>
>> CC stable on tests is ... odd.
> 
> I thought it was fairly common, but it isn't.
> Will drop it.

As it's not really a "kernel BUG", it's rather uncommon.

>>
>> Note that with MAP_NORESRVE, most setups we care about will allow mapping as
>> much as you want, but on access OOM will fire.
> 
> Thanks for the hint.
> 
>> So one could require that /proc/sys/vm/overcommit_memory is setup properly
>> and use MAP_NORESRVE.
> 
> Isn't the check for lchunks == 0 essentially exactly this?

I assume paired with MAP_NORESERVE?

Maybe, but it could be better to have something that says "if 
overcommit_memory is not setup properly I will SKIP this test", but 
otherwise I expect this to work and will FAIL if it doesn't".

Or would you expect to run into lchunks == 0 even if overcommit_memory 
is setup properly and MAP_NORESERVE is used? (very very low memory that 
we cannot even create all the VMAs?)

> 
>> Reading from anonymous memory will populate the shared zeropage. To mitigate
>> OOM from "too many page tables", one could simply unmap the pieces as they
>> are verified (or MAP_FIXED over them, to free page tables).
> 
> The code has to figure out if a verified region was created by mmap(),
> otherwise an munmap() could crash the process.
> As the entries from /proc/self/maps may have been merged and (I assume)

Yes, and partial unmap (in chunk granularity?) would split them again.

> the ordering of mappings is not guaranteed, some bespoke logic to establish
> the link will be needed.


My thinking was that you simply process one /proc/self/maps entry in 
some chunks. After processing a chunk, you munmap() it.

So you would process + munmap in chunks.

> 
> Is it fine to rely on CONFIG_ANON_VMA_NAME?
> That would make it much easier to implement.

Can you elaborate how you would do it?

> 
> Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
> in very low physical memory conditions.

Cool.

David Hildenbrand Jan. 9, 2025, 1:05 p.m. UTC | #3

>
> That is clear. The issue would be to figure which chunks are valid to
> unmap. If something critical like the executable file is unmapped,
> the process crashes. But see below.

Ah, now I see what you mean. Yes, also the stack etc. will be 
problematic. So IIUC, you want to limit the munmap optimization only to 
the manually mmap()ed parts.

> 
>>> Is it fine to rely on CONFIG_ANON_VMA_NAME?
>>> That would make it much easier to implement.
>>
>> Can you elaborate how you would do it?
> 
> First set the VMA name after mmap():
> 
> for (i = 0; i < NR_CHUNKS_LOW; i++) {
> 	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
> 		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 	if (ptr[i] == MAP_FAILED) {
> 		if (validate_lower_address_hint())
> 			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
> 		break;
> 	}
> 
> 	validate_addr(ptr[i], 0);
> 	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
> 		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));

Likely this would prevent merging of VMAs.

With a 1 GiB chunk size, and NR_CHUNKS_LOW == 128TiB, you'd already 
require 128k VMAs. The default limit is frequently 64k.

We could just scan the ptr / hptr array to see if this is a manual mmap 
area or not. If this takes too long, one could sort the arrays by 
address and perform a binary search.

Not the most efficient way of doing it, but maybe good enough for this test?

Alternatively, store the pointer in a xarray-like tree instead of two 
arrays. Requires a bit more memory ... and we'd have to find a simple 
implementation we could just reuse in this test. So maybe there is a 
simpler way to get it done.

David Hildenbrand Jan. 9, 2025, 1:19 p.m. UTC | #4

On 09.01.25 14:05, David Hildenbrand wrote:
>   >
>> That is clear. The issue would be to figure which chunks are valid to
>> unmap. If something critical like the executable file is unmapped,
>> the process crashes. But see below.
> 
> Ah, now I see what you mean. Yes, also the stack etc. will be
> problematic. So IIUC, you want to limit the munmap optimization only to
> the manually mmap()ed parts.
> 
>>
>>>> Is it fine to rely on CONFIG_ANON_VMA_NAME?
>>>> That would make it much easier to implement.
>>>
>>> Can you elaborate how you would do it?
>>
>> First set the VMA name after mmap():

I took a look at the implementation, and VMA merging seems to be able to 
merge such VMAs that share the same name (even when set separately).

So assuming you use the same name for all, that should indeed also work.

Thomas Weißschuh Jan. 9, 2025, 1:38 p.m. UTC | #5

On Thu, Jan 09, 2025 at 02:05:43PM +0100, David Hildenbrand wrote:
> >
> > That is clear. The issue would be to figure which chunks are valid to
> > unmap. If something critical like the executable file is unmapped,
> > the process crashes. But see below.
> 
> Ah, now I see what you mean. Yes, also the stack etc. will be problematic.
> So IIUC, you want to limit the munmap optimization only to the manually
> mmap()ed parts.

Correct.

> > > > Is it fine to rely on CONFIG_ANON_VMA_NAME?
> > > > That would make it much easier to implement.
> > > 
> > > Can you elaborate how you would do it?
> > 
> > First set the VMA name after mmap():
> > 
> > for (i = 0; i < NR_CHUNKS_LOW; i++) {
> > 	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
> > 		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > 
> > 	if (ptr[i] == MAP_FAILED) {
> > 		if (validate_lower_address_hint())
> > 			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
> > 		break;
> > 	}
> > 
> > 	validate_addr(ptr[i], 0);
> > 	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
> > 		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));
> 
> Likely this would prevent merging of VMAs.
>
> With a 1 GiB chunk size, and NR_CHUNKS_LOW == 128TiB, you'd already require
> 128k VMAs. The default limit is frequently 64k.

They are merged for me, as they all share the same name.

PR_SET_VMA(2const) even mentions merging:

	Note that assigning an attribute to a virtual memory area might
	prevent it from being merged with adjacent virtual memory areas
	due to the difference in that attribute's value.

is_mergeable_vma() has an explicit check using anon_vma_name_eq().

> We could just scan the ptr / hptr array to see if this is a manual mmap area
> or not. If this takes too long, one could sort the arrays by address and
> perform a binary search.
>
> Not the most efficient way of doing it, but maybe good enough for this test?

A naive loop is what I tried first, but it took forever.

> Alternatively, store the pointer in a xarray-like tree instead of two
> arrays. Requires a bit more memory ... and we'd have to find a simple
> implementation we could just reuse in this test. So maybe there is a simpler
> way to get it done.

IMO the prctl() is that simpler way.
The only real drawback is the dependency on CONFIG_ANON_VMA_NAME.
We can add an entry to tools/testing/selftests/mm/config for it.


Thomas

[0/3] selftests/mm: virtual_address_range: Two bugfixes and a cleanup

Message

Comments