Message ID: 20250407234223.1059191-5-nphamcs@gmail.com
State: New
Series: Virtual Swap Space
On Mon, Apr 07, 2025 at 04:42:05PM -0700, Nhat Pham wrote:
> Currently, the swap cache code assumes that the swap space is of a fixed
> size. The virtual swap space is dynamically sized, so the existing
> partitioning code cannot be easily reused. A dynamic partitioning is
> planned, but for now keep the design simple and just use a flat
> swapcache for vswap.
>
> Since the vswap's implementation has begun to diverge from the old
> implementation, we also introduce a new build config
> (CONFIG_VIRTUAL_SWAP). Users who do not select this config will get the
> old implementation, with no behavioral change.
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
> mm/Kconfig | 13 ++++++++++
> mm/swap.h | 22 ++++++++++------
> mm/swap_state.c | 68 +++++++++++++++++++++++++++++++++++++++++--------
> 3 files changed, 85 insertions(+), 18 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1b501db06417..1a6acdb64333 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -22,6 +22,19 @@ menuconfig SWAP
> used to provide more virtual memory than the actual RAM present
> in your computer. If unsure say Y.
>
> +config VIRTUAL_SWAP
> + bool "Swap space virtualization"
> + depends on SWAP
> + default n
> + help
> + When this is selected, the kernel is built with the new swap
> + design. This will allow us to decouple the swap backends
> + (zswap, on-disk swapfile, etc.), and save disk space when we
> + use zswap (or the zero-filled swap page optimization).
> +
> + There might be more lock contentions with heavy swap use, since
> + the swap cache is no longer range partitioned.
> +
> config ZSWAP
> bool "Compressed cache for swap pages"
> depends on SWAP
> diff --git a/mm/swap.h b/mm/swap.h
> index d5f8effa8015..06e20b1d79c4 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio);
> int swap_writepage(struct page *page, struct writeback_control *wbc);
> void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
>
> -/* linux/mm/swap_state.c */
> -/* One swap address space for each 64M swap space */
> +/* Return the swap device position of the swap slot. */
> +static inline loff_t swap_slot_pos(swp_slot_t slot)
> +{
> + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
> +}

In the same vein as the previous email, please avoid mixing moves,
renames and new code as much as possible. This makes it quite hard to
follow what's going on.

I think it would be better if you structure the series as follows:

1. Prep patches. Separate patches for moves, renames, new code.

3. mm: vswap
- config VIRTUAL_SWAP
- mm/vswap.c with skeleton data structures, init/exit, Makefile hookup

4. (temporarily) flatten existing address spaces

IMO you can do the swapcache and zswap in one patch

5+. conversion patches

Grow mm/vswap.c as you add discrete components like the descriptor
allocator, swapoff locking, the swap_cgroup tracker etc.

You're mostly doing this part already. But try to order them by
complexity and on a "core to periphery" gradient. I.e. swapoff
locking should probably come before cgroup stuff.

Insert move and rename patches at points where they make the most
sense. I.e. if they can be understood in the current upstream code
already, put them with step 1 prep patches. If you find a move or a
rename can only be understood in the context of one of the components,
put them in a prep patch right before that one.

> @@ -260,6 +269,28 @@ void delete_from_swap_cache(struct folio *folio)
> folio_ref_sub(folio, folio_nr_pages(folio));
> }
>
> +#ifdef CONFIG_VIRTUAL_SWAP
> +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> + unsigned long end)
> +{
> + swp_slot_t slot = swp_slot(type, begin);
> + swp_entry_t entry = swp_slot_to_swp_entry(slot);
> + unsigned long index = swap_cache_index(entry);
> + struct address_space *address_space = swap_address_space(entry);
> + void *old;
> + XA_STATE(xas, &address_space->i_pages, index);
> +
> + xas_set_update(&xas, workingset_update_node);
> +
> + xa_lock_irq(&address_space->i_pages);
> + xas_for_each(&xas, old, entry.val + end - begin) {
> + if (!xa_is_value(old))
> + continue;
> + xas_store(&xas, NULL);
> + }
> + xa_unlock_irq(&address_space->i_pages);

I don't think you need separate functions for this, init, exit etc. if
you tweak the macros to resolve to one tree. The current code already
works if swapfiles are smaller than SWAP_ADDRESS_SPACE_PAGES and there
is only one tree, after all.

This would save a lot of duplication and keep ifdefs more confined.
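To illustrate the "resolve to one tree" suggestion, here is a minimal
sketch, not taken from the posted series, of how the swap.h helpers
could collapse onto a single xarray when CONFIG_VIRTUAL_SWAP is set;
the vswap_space instance and its name are assumptions of the sketch:

/*
 * Minimal sketch of the "one tree" idea -- illustrative only, not the
 * posted patch. The vswap_space name is assumed for the sketch.
 */
#ifdef CONFIG_VIRTUAL_SWAP
extern struct address_space vswap_space;	/* single, flat swap cache */
#define swap_address_space(entry)	(&vswap_space)
#define swap_cache_index(entry)		((entry).val)
#else
/* existing per-64M swapper_spaces[] definitions remain unchanged */
#endif

With helpers shaped like this, the generic swap-cache code that is
written against swap_address_space()/swap_cache_index() could keep
operating on that one tree, which is the duplication saving being
pointed at here.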
On Tue, Apr 8, 2025 at 8:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Apr 07, 2025 at 04:42:05PM -0700, Nhat Pham wrote:
> > Currently, the swap cache code assumes that the swap space is of a fixed
> > size. The virtual swap space is dynamically sized, so the existing
> > partitioning code cannot be easily reused. A dynamic partitioning is
> > planned, but for now keep the design simple and just use a flat
> > swapcache for vswap.
> >
> > Since the vswap's implementation has begun to diverge from the old
> > implementation, we also introduce a new build config
> > (CONFIG_VIRTUAL_SWAP). Users who do not select this config will get the
> > old implementation, with no behavioral change.
> >
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> > mm/Kconfig | 13 ++++++++++
> > mm/swap.h | 22 ++++++++++------
> > mm/swap_state.c | 68 +++++++++++++++++++++++++++++++++++++++++--------
> > 3 files changed, 85 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 1b501db06417..1a6acdb64333 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -22,6 +22,19 @@ menuconfig SWAP
> > used to provide more virtual memory than the actual RAM present
> > in your computer. If unsure say Y.
> >
> > +config VIRTUAL_SWAP
> > + bool "Swap space virtualization"
> > + depends on SWAP
> > + default n
> > + help
> > + When this is selected, the kernel is built with the new swap
> > + design. This will allow us to decouple the swap backends
> > + (zswap, on-disk swapfile, etc.), and save disk space when we
> > + use zswap (or the zero-filled swap page optimization).
> > +
> > + There might be more lock contentions with heavy swap use, since
> > + the swap cache is no longer range partitioned.
> > +
> > config ZSWAP
> > bool "Compressed cache for swap pages"
> > depends on SWAP
> > diff --git a/mm/swap.h b/mm/swap.h
> > index d5f8effa8015..06e20b1d79c4 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio);
> > int swap_writepage(struct page *page, struct writeback_control *wbc);
> > void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
> >
> > -/* linux/mm/swap_state.c */
> > -/* One swap address space for each 64M swap space */
> > +/* Return the swap device position of the swap slot. */
> > +static inline loff_t swap_slot_pos(swp_slot_t slot)
> > +{
> > + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
> > +}
>
> In the same vein as the previous email, please avoid mixing moves,
> renames and new code as much as possible. This makes it quite hard to
> follow what's going on.
>
> I think it would be better if you structure the series as follows:
>
> 1. Prep patches. Separate patches for moves, renames, new code.
>
> 3. mm: vswap
> - config VIRTUAL_SWAP
> - mm/vswap.c with skeleton data structures, init/exit, Makefile hookup
>
> 4. (temporarily) flatten existing address spaces
>
> IMO you can do the swapcache and zswap in one patch
>
> 5+. conversion patches
>
> Grow mm/vswap.c as you add discrete components like the descriptor
> allocator, swapoff locking, the swap_cgroup tracker etc.
>
> You're mostly doing this part already. But try to order them by
> complexity and on a "core to periphery" gradient. I.e. swapoff
> locking should probably come before cgroup stuff.
>
> Insert move and rename patches at points where they make the most
> sense. I.e. if they can be understood in the current upstream code
> already, put them with step 1 prep patches. If you find a move or a
> rename can only be understood in the context of one of the components,
> put them in a prep patch right before that one.

Makes sense, yeah! I'll try to avoid mixing moves/renames/new code as
much as I can.

>
> > @@ -260,6 +269,28 @@ void delete_from_swap_cache(struct folio *folio)
> > folio_ref_sub(folio, folio_nr_pages(folio));
> > }
> >
> > +#ifdef CONFIG_VIRTUAL_SWAP
> > +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > + unsigned long end)
> > +{
> > + swp_slot_t slot = swp_slot(type, begin);
> > + swp_entry_t entry = swp_slot_to_swp_entry(slot);
> > + unsigned long index = swap_cache_index(entry);
> > + struct address_space *address_space = swap_address_space(entry);
> > + void *old;
> > + XA_STATE(xas, &address_space->i_pages, index);
> > +
> > + xas_set_update(&xas, workingset_update_node);
> > +
> > + xa_lock_irq(&address_space->i_pages);
> > + xas_for_each(&xas, old, entry.val + end - begin) {
> > + if (!xa_is_value(old))
> > + continue;
> > + xas_store(&xas, NULL);
> > + }
> > + xa_unlock_irq(&address_space->i_pages);
>
> I don't think you need separate functions for this, init, exit etc. if
> you tweak the macros to resolve to one tree. The current code already
> works if swapfiles are smaller than SWAP_ADDRESS_SPACE_PAGES and there
> is only one tree, after all.

For clear_shadow_from_swap_cache(), I think I understand what you want -
keep clear_shadow_from_swap_cache() the same for the two implementations,
but at caller sites, have the callers themselves determine the range in
the swap cache (i.e., (begin, end)). I'm a bit confused with init and
exit, but I assume there is a way to do it for them as well.

I will note, though, that it might increase the number of #ifdef
sections (or, alternatively, IS_ENABLED() checks), because these
functions are called in different contexts for the two implementations:

1. init and exit are called in swapon/swapoff in the old implementation.
   They are called in swap initialization in the virtual swap
   implementation.

2. Similarly, we clear swap cache shadows when we free physical swap
   slots in the old implementation, and when we free virtual swap slots
   in the new implementation.

I think it is good actually, because it makes the difference explicit
rather than implicit. Also, it helps us know exactly which code block to
target when we unify the two implementations :) Just putting it out
there.

>
> This would save a lot of duplication and keep ifdefs more confined.
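One way to picture the "callers determine the range" reading above,
sketched with a hypothetical helper name rather than code from the
series: a single range-based routine built from the same xarray calls
the patch already uses, with each call site translating its freed slot
range into swap-cache indices for whichever layout is compiled in.

/*
 * Hypothetical sketch, not from the posted series: one shared helper that
 * erases shadow (value) entries over an index range of a single mapping.
 * Callers compute first/last for their own layout before calling it.
 */
static void swap_cache_clear_shadows(struct address_space *mapping,
				     pgoff_t first, pgoff_t last)
{
	void *old;
	XA_STATE(xas, &mapping->i_pages, first);

	xas_set_update(&xas, workingset_update_node);

	xa_lock_irq(&mapping->i_pages);
	xas_for_each(&xas, old, last) {
		if (xa_is_value(old))
			xas_store(&xas, NULL);
	}
	xa_unlock_irq(&mapping->i_pages);
}

A virtual-swap caller would pass indices derived from the virtual entry
values, while the classic layout would pass swap_cache_index() values
within one 64M space; the #ifdef/IS_ENABLED() question raised above is
then confined to those call sites.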
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..1a6acdb64333 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -22,6 +22,19 @@ menuconfig SWAP
used to provide more virtual memory than the actual RAM present
in your computer. If unsure say Y.

+config VIRTUAL_SWAP
+ bool "Swap space virtualization"
+ depends on SWAP
+ default n
+ help
+ When this is selected, the kernel is built with the new swap
+ design. This will allow us to decouple the swap backends
+ (zswap, on-disk swapfile, etc.), and save disk space when we
+ use zswap (or the zero-filled swap page optimization).
+
+ There might be more lock contentions with heavy swap use, since
+ the swap cache is no longer range partitioned.
+
config ZSWAP
bool "Compressed cache for swap pages"
depends on SWAP
diff --git a/mm/swap.h b/mm/swap.h
index d5f8effa8015..06e20b1d79c4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio);
int swap_writepage(struct page *page, struct writeback_control *wbc);
void __swap_writepage(struct folio *folio, struct writeback_control *wbc);

-/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
+/* Return the swap device position of the swap slot. */
+static inline loff_t swap_slot_pos(swp_slot_t slot)
+{
+ return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
+}
+
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
+
+/* linux/mm/swap_state.c */
+#ifdef CONFIG_VIRTUAL_SWAP
+extern struct address_space *swap_address_space(swp_entry_t entry);
+#define swap_cache_index(entry) entry.val
+#else
+/* One swap address space for each 64M swap space */
extern struct address_space *swapper_spaces[];
#define swap_address_space(entry) \
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])

-/* Return the swap device position of the swap slot. */
-static inline loff_t swap_slot_pos(swp_slot_t slot)
-{
- return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
-}
-
/*
 * Return the swap cache index of the swap entry.
 */
@@ -46,6 +51,7 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
}
+#endif

void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 055e555d3382..268338a0ea57 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -38,10 +38,19 @@ static const struct address_space_operations swap_aops = {
#endif
};

+#ifdef CONFIG_VIRTUAL_SWAP
+static struct address_space swapper_space __read_mostly;
+
+struct address_space *swap_address_space(swp_entry_t entry)
+{
+ return &swapper_space;
+}
+#else
struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
-static bool enable_vma_readahead __read_mostly = true;
+#endif

+static bool enable_vma_readahead __read_mostly = true;
#define SWAP_RA_ORDER_CEILING 5

#define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2)
@@ -260,6 +269,28 @@ void delete_from_swap_cache(struct folio *folio)
folio_ref_sub(folio, folio_nr_pages(folio));
}

+#ifdef CONFIG_VIRTUAL_SWAP
+void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end)
+{
+ swp_slot_t slot = swp_slot(type, begin);
+ swp_entry_t entry = swp_slot_to_swp_entry(slot);
+ unsigned long index = swap_cache_index(entry);
+ struct address_space *address_space = swap_address_space(entry);
+ void *old;
+ XA_STATE(xas, &address_space->i_pages, index);
+
+ xas_set_update(&xas, workingset_update_node);
+
+ xa_lock_irq(&address_space->i_pages);
+ xas_for_each(&xas, old, entry.val + end - begin) {
+ if (!xa_is_value(old))
+ continue;
+ xas_store(&xas, NULL);
+ }
+ xa_unlock_irq(&address_space->i_pages);
+}
+#else
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end)
{
@@ -290,6 +321,7 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin,
break;
}
}
+#endif

/*
 * If we are the only user, then try to free up the swap cache.
@@ -718,23 +750,34 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
return folio;
}

+static void init_swapper_space(struct address_space *space)
+{
+ xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+ atomic_set(&space->i_mmap_writable, 0);
+ space->a_ops = &swap_aops;
+ /* swap cache doesn't use writeback related tags */
+ mapping_set_no_writeback_tags(space);
+}
+
+#ifdef CONFIG_VIRTUAL_SWAP
int init_swap_address_space(unsigned int type, unsigned long nr_pages)
{
- struct address_space *spaces, *space;
+ return 0;
+}
+
+void exit_swap_address_space(unsigned int type) {}
+#else
+int init_swap_address_space(unsigned int type, unsigned long nr_pages)
+{
+ struct address_space *spaces;
unsigned int i, nr;

nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
if (!spaces)
return -ENOMEM;
- for (i = 0; i < nr; i++) {
- space = spaces + i;
- xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
- atomic_set(&space->i_mmap_writable, 0);
- space->a_ops = &swap_aops;
- /* swap cache doesn't use writeback related tags */
- mapping_set_no_writeback_tags(space);
- }
+ for (i = 0; i < nr; i++)
+ init_swapper_space(spaces + i);
nr_swapper_spaces[type] = nr;
swapper_spaces[type] = spaces;

@@ -752,6 +795,7 @@ void exit_swap_address_space(unsigned int type)
nr_swapper_spaces[type] = 0;
swapper_spaces[type] = NULL;
}
+#endif

static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
unsigned long *end)
@@ -930,6 +974,10 @@ static int __init swap_init_sysfs(void)
int err;
struct kobject *swap_kobj;

+#ifdef CONFIG_VIRTUAL_SWAP
+ init_swapper_space(&swapper_space);
+#endif
+
swap_kobj = kobject_create_and_add("swap", mm_kobj);
if (!swap_kobj) {
pr_err("failed to create swap kobject\n");
Currently, the swap cache code assumes that the swap space is of a fixed
size. The virtual swap space is dynamically sized, so the existing
partitioning code cannot be easily reused. A dynamic partitioning is
planned, but for now keep the design simple and just use a flat
swapcache for vswap.

Since the vswap's implementation has begun to diverge from the old
implementation, we also introduce a new build config
(CONFIG_VIRTUAL_SWAP). Users who do not select this config will get the
old implementation, with no behavioral change.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/Kconfig | 13 ++++++++++
mm/swap.h | 22 ++++++++++------
mm/swap_state.c | 68 +++++++++++++++++++++++++++++++++++++++++--------
3 files changed, 85 insertions(+), 18 deletions(-)
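As a closing illustration of what the flat layout means for lookups, a
sketch that is not part of the patch, using only the helpers visible in
the diff above plus filemap_get_folio(); the vswap_cache_lookup() name
is hypothetical, and under CONFIG_VIRTUAL_SWAP the index is simply
entry.val:

/* Illustrative only: look up a folio in the flat vswap swap cache. */
static struct folio *vswap_cache_lookup(swp_entry_t entry)
{
	struct address_space *mapping = swap_address_space(entry);
	struct folio *folio;

	/* swap_cache_index(entry) == entry.val when CONFIG_VIRTUAL_SWAP=y */
	folio = filemap_get_folio(mapping, swap_cache_index(entry));
	return IS_ERR(folio) ? NULL : folio;
}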