Message ID | 20230403201839.4097845-6-zi.yan@sent.com |
---|---|
State | New |
Headers | show |
Series | None | expand |
On Mon, 3 Apr 2023, Zi Yan wrote: > From: Zi Yan <ziy@nvidia.com> > > To split a THP to any lower order pages, we need to reform THPs on > subpages at given order and add page refcount based on the new page > order. Also we need to reinitialize page_deferred_list after removing > the page from the split_queue, otherwise a subsequent split will see > list corruption when checking the page_deferred_list again. > > It has many uses, like minimizing the number of pages after > truncating a huge pagecache page. For anonymous THPs, we can only split > them to order-0 like before until we add support for any size anonymous > THPs. > > Signed-off-by: Zi Yan <ziy@nvidia.com> > --- ... > @@ -2754,14 +2798,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) > if (folio_test_swapbacked(folio)) { > __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, > -nr); > - } else { > + } else if (!new_order) { > + /* > + * Decrease THP stats only if split to normal > + * pages > + */ > __lruvec_stat_mod_folio(folio, NR_FILE_THPS, > -nr); > filemap_nr_thps_dec(mapping); > } > } This part is wrong. The problem I've had is /proc/sys/vm/stat_refresh warning of negative nr_shmem_hugepages (which then gets shown as 0 in vmstat or meminfo, even though there actually are shmem hugepages). At first I thought that the fix needed (which I'm running with) is: --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2797,17 +2797,16 @@ int split_huge_page_to_list_to_order(str int nr = folio_nr_pages(folio); xas_split(&xas, folio, folio_order(folio)); - if (folio_test_swapbacked(folio)) { - __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, - -nr); - } else if (!new_order) { - /* - * Decrease THP stats only if split to normal - * pages - */ - __lruvec_stat_mod_folio(folio, NR_FILE_THPS, - -nr); - filemap_nr_thps_dec(mapping); + if (folio_test_pmd_mappable(folio) && + new_order < HPAGE_PMD_ORDER) { + if (folio_test_swapbacked(folio)) { + __lruvec_stat_mod_folio(folio, + NR_SHMEM_THPS, -nr); + } else { + __lruvec_stat_mod_folio(folio, + NR_FILE_THPS, -nr); + filemap_nr_thps_dec(mapping); + } } } because elsewhere the maintenance of NR_SHMEM_THPS or NR_FILE_THPS is rightly careful to be dependent on folio_test_pmd_mappable() (and, so far as I know, we shall not be seeing folios of order higher than HPAGE_PMD_ORDER yet in mm/huge_memory.c - those would need more thought). But it may be more complicated than that, given that patch 7/7 appears (I haven't tried) to allow splitting to other orders on a file opened for reading - that might be a bug. The complication here is that we now have four kinds of large folio in mm/huge_memory.c, and the rules are a bit different for each. Anonymous THPs: okay, I think I've seen you exclude those with -EINVAL at a higher level (and they wouldn't be getting into this "if (mapping) {" block anyway). Shmem (swapbacked) THPs: we are only allocating shmem in 0-order or HPAGE_PMD_ORDER at present. I can imagine that in a few months or a year-or-so's time, we shall want to follow Matthew's folio readahead, and generalize to other orders in shmem; but right now I'd really prefer not to have truncation or debugfs introducing the surprise of other orders there. Maybe there's little that needs to be fixed, only the THP_SWPOUT and THP_SWPOUT_FALLBACK statistics have come to mind so far (would need to be limited to folio_test_pmd_mappable()); though I've no idea how well intermediate orders will work with or against THP swapout. CONFIG_READ_ONLY_THP_FOR_FS=y file THPs: those need special care, and their filemap_nr_thps_dec(mapping) above may not be good enough. So long as it's working as intended, it does exclude the possibility of truncation splitting here; but if you allow splitting via debugfs to reach them, then the accounting needs to be changed - for them, any order higher than 0 has to be counted in nr_thps - so splitting one HPAGE_PMD_ORDER THP into multiple large folios will need to add to that count, not decrement it. Otherwise, a filesystem unprepared for large folios or compound pages is in danger of meeting them by surprise. Better just disable that possibility, along with shmem. mapping_large_folio_support() file THPs: this category is the one you're really trying to address with this series, they can already come in various orders, and it's fair for truncation to make a different choice of orders - but is what it's doing worth doing? I'll say more on 6/7. Hugh
On 16 Apr 2023, at 15:25, Hugh Dickins wrote: > On Mon, 3 Apr 2023, Zi Yan wrote: > >> From: Zi Yan <ziy@nvidia.com> >> >> To split a THP to any lower order pages, we need to reform THPs on >> subpages at given order and add page refcount based on the new page >> order. Also we need to reinitialize page_deferred_list after removing >> the page from the split_queue, otherwise a subsequent split will see >> list corruption when checking the page_deferred_list again. >> >> It has many uses, like minimizing the number of pages after >> truncating a huge pagecache page. For anonymous THPs, we can only split >> them to order-0 like before until we add support for any size anonymous >> THPs. >> >> Signed-off-by: Zi Yan <ziy@nvidia.com> >> --- > ... >> @@ -2754,14 +2798,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) >> if (folio_test_swapbacked(folio)) { >> __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, >> -nr); >> - } else { >> + } else if (!new_order) { >> + /* >> + * Decrease THP stats only if split to normal >> + * pages >> + */ >> __lruvec_stat_mod_folio(folio, NR_FILE_THPS, >> -nr); >> filemap_nr_thps_dec(mapping); >> } >> } > > This part is wrong. The problem I've had is /proc/sys/vm/stat_refresh > warning of negative nr_shmem_hugepages (which then gets shown as 0 in > vmstat or meminfo, even though there actually are shmem hugepages). > > At first I thought that the fix needed (which I'm running with) is: > > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2797,17 +2797,16 @@ int split_huge_page_to_list_to_order(str > int nr = folio_nr_pages(folio); > > xas_split(&xas, folio, folio_order(folio)); > - if (folio_test_swapbacked(folio)) { > - __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, > - -nr); > - } else if (!new_order) { > - /* > - * Decrease THP stats only if split to normal > - * pages > - */ > - __lruvec_stat_mod_folio(folio, NR_FILE_THPS, > - -nr); > - filemap_nr_thps_dec(mapping); > + if (folio_test_pmd_mappable(folio) && > + new_order < HPAGE_PMD_ORDER) { > + if (folio_test_swapbacked(folio)) { > + __lruvec_stat_mod_folio(folio, > + NR_SHMEM_THPS, -nr); > + } else { > + __lruvec_stat_mod_folio(folio, > + NR_FILE_THPS, -nr); > + filemap_nr_thps_dec(mapping); > + } > } > } > > because elsewhere the maintenance of NR_SHMEM_THPS or NR_FILE_THPS > is rightly careful to be dependent on folio_test_pmd_mappable() (and, > so far as I know, we shall not be seeing folios of order higher than > HPAGE_PMD_ORDER yet in mm/huge_memory.c - those would need more thought). > > But it may be more complicated than that, given that patch 7/7 appears > (I haven't tried) to allow splitting to other orders on a file opened > for reading - that might be a bug. > > The complication here is that we now have four kinds of large folio > in mm/huge_memory.c, and the rules are a bit different for each. > > Anonymous THPs: okay, I think I've seen you exclude those with -EINVAL > at a higher level (and they wouldn't be getting into this "if (mapping) {" > block anyway). > > Shmem (swapbacked) THPs: we are only allocating shmem in 0-order or > HPAGE_PMD_ORDER at present. I can imagine that in a few months or a > year-or-so's time, we shall want to follow Matthew's folio readahead, > and generalize to other orders in shmem; but right now I'd really > prefer not to have truncation or debugfs introducing the surprise > of other orders there. Maybe there's little that needs to be fixed, > only the THP_SWPOUT and THP_SWPOUT_FALLBACK statistics have come to > mind so far (would need to be limited to folio_test_pmd_mappable()); > though I've no idea how well intermediate orders will work with or > against THP swapout. > > CONFIG_READ_ONLY_THP_FOR_FS=y file THPs: those need special care, > and their filemap_nr_thps_dec(mapping) above may not be good enough. > So long as it's working as intended, it does exclude the possibility > of truncation splitting here; but if you allow splitting via debugfs > to reach them, then the accounting needs to be changed - for them, > any order higher than 0 has to be counted in nr_thps - so splitting > one HPAGE_PMD_ORDER THP into multiple large folios will need to add > to that count, not decrement it. Otherwise, a filesystem unprepared > for large folios or compound pages is in danger of meeting them by > surprise. Better just disable that possibility, along with shmem. OK. I will disable for these two cases. I will come back to them once I figure the details out. > > mapping_large_folio_support() file THPs: this category is the one > you're really trying to address with this series, they can already > come in various orders, and it's fair for truncation to make a > different choice of orders - but is what it's doing worth doing? > I'll say more on 6/7. > > Hugh -- Best Regards, Yan, Zi
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 20284387b841..32c91e1b59cd 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -147,10 +147,11 @@ void prep_transhuge_page(struct page *page); void free_transhuge_page(struct page *page); bool can_split_folio(struct folio *folio, int *pextra_pins); -int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order); static inline int split_huge_page(struct page *page) { - return split_huge_page_to_list(page, NULL); + return split_huge_page_to_list_to_order(page, NULL, 0); } void deferred_split_folio(struct folio *folio); @@ -297,7 +298,8 @@ can_split_folio(struct folio *folio, int *pextra_pins) return false; } static inline int -split_huge_page_to_list(struct page *page, struct list_head *list) +split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order) { return 0; } @@ -397,7 +399,7 @@ static inline bool thp_migration_supported(void) static inline int split_folio_to_list(struct folio *folio, struct list_head *list) { - return split_huge_page_to_list(&folio->page, list); + return split_huge_page_to_list_to_order(&folio->page, list, 0); } static inline int split_folio(struct folio *folio) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f8a8a72b207d..619d25278340 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2359,11 +2359,13 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_folio(struct folio *folio) { - enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | - TTU_SYNC; + enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC; VM_BUG_ON_FOLIO(!folio_test_large(folio), folio); + if (folio_test_pmd_mappable(folio)) + ttu_flags |= TTU_SPLIT_HUGE_PMD; + /* * Anon pages need migration entries to preserve them, but file * pages can simply be left unmapped, then faulted back on demand. @@ -2395,7 +2397,6 @@ static void lru_add_page_tail(struct page *head, struct page *tail, struct lruvec *lruvec, struct list_head *list) { VM_BUG_ON_PAGE(!PageHead(head), head); - VM_BUG_ON_PAGE(PageCompound(tail), head); VM_BUG_ON_PAGE(PageLRU(tail), head); lockdep_assert_held(&lruvec->lru_lock); @@ -2416,7 +2417,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail, } static void __split_huge_page_tail(struct page *head, int tail, - struct lruvec *lruvec, struct list_head *list) + struct lruvec *lruvec, struct list_head *list, unsigned int new_order) { struct page *page_tail = head + tail; @@ -2483,10 +2484,15 @@ static void __split_huge_page_tail(struct page *head, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail); + if (new_order) { + prep_compound_page(page_tail, new_order); + prep_transhuge_page(page_tail); + } /* Finally unfreeze refcount. Additional reference from page cache. */ - page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) || - PageSwapCache(head))); + page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) || + PageSwapCache(head)) ? + thp_nr_pages(page_tail) : 0)); if (page_is_young(head)) set_page_young(page_tail); @@ -2504,7 +2510,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list, - pgoff_t end) + pgoff_t end, unsigned int new_order) { struct folio *folio = page_folio(page); struct page *head = &folio->page; @@ -2512,11 +2518,12 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head); + unsigned int new_nr = 1 << new_order; int order = folio_order(folio); int i; /* complete memcg works before add pages to LRU */ - split_page_memcg(head, order, 0); + split_page_memcg(head, order, new_order); if (PageAnon(head) && PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2531,14 +2538,14 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageHasHWPoisoned(head); - for (i = nr - 1; i >= 1; i--) { - __split_huge_page_tail(head, i, lruvec, list); + for (i = nr - new_nr; i >= new_nr; i -= new_nr) { + __split_huge_page_tail(head, i, lruvec, list, new_order); /* Some pages can be beyond EOF: drop them from page cache */ if (head[i].index >= end) { struct folio *tail = page_folio(head + i); if (shmem_mapping(head->mapping)) - shmem_uncharge(head->mapping->host, 1); + shmem_uncharge(head->mapping->host, new_nr); else if (folio_test_clear_dirty(tail)) folio_account_cleaned(tail, inode_to_wb(folio->mapping->host)); @@ -2548,29 +2555,38 @@ static void __split_huge_page(struct page *page, struct list_head *list, __xa_store(&head->mapping->i_pages, head[i].index, head + i, 0); } else if (swap_cache) { + /* + * split anonymous THPs (including swapped out ones) to + * non-zero order not supported + */ + VM_WARN_ONCE(new_order, + "Split swap-cached anon folio to non-0 order not supported"); __xa_store(&swap_cache->i_pages, offset + i, head + i, 0); } } - ClearPageCompound(head); + if (!new_order) + ClearPageCompound(head); + else + set_compound_order(head, new_order); unlock_page_lruvec(lruvec); /* Caller disabled irqs, so they are still disabled here */ - split_page_owner(head, order, 0); + split_page_owner(head, order, new_order); /* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) { - page_ref_add(head, 2); + page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages); } else { page_ref_inc(head); } } else { /* Additional pin to page cache */ - page_ref_add(head, 2); + page_ref_add(head, 1 + new_nr); xa_unlock(&head->mapping->i_pages); } local_irq_enable(); @@ -2583,7 +2599,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_swap_cluster(entry); } - for (i = 0; i < nr; i++) { + /* + * set page to its compound_head when split to non order-0 pages, so + * we can skip unlocking it below, since PG_locked is transferred to + * the compound_head of the page and the caller will unlock it. + */ + if (new_order) + page = compound_head(page); + + for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue; @@ -2617,29 +2641,31 @@ bool can_split_folio(struct folio *folio, int *pextra_pins) } /* - * This function splits huge page into normal pages. @page can point to any - * subpage of huge page to split. Split doesn't change the position of @page. + * This function splits huge page into pages in @new_order. @page can point to + * any subpage of huge page to split. Split doesn't change the position of + * @page. * * Only caller must hold pin on the @page, otherwise split fails with -EBUSY. * The huge page must be locked. * * If @list is null, tail pages will be added to LRU list, otherwise, to @list. * - * Both head page and tail pages will inherit mapping, flags, and so on from - * the hugepage. + * Pages in new_order will inherit mapping, flags, and so on from the hugepage. * - * GUP pin and PG_locked transferred to @page. Rest subpages can be freed if - * they are not mapped. + * GUP pin and PG_locked transferred to @page or the compound page @page belongs + * to. Rest subpages can be freed if they are not mapped. * * Returns 0 if the hugepage is split successfully. * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under * us. */ -int split_huge_page_to_list(struct page *page, struct list_head *list) +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order) { struct folio *folio = page_folio(page); struct deferred_split *ds_queue = get_deferred_split_queue(folio); - XA_STATE(xas, &folio->mapping->i_pages, folio->index); + /* reset xarray order to new order after split */ + XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int extra_pins, ret; @@ -2649,6 +2675,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_large(folio), folio); + /* Cannot split THP to order-1 (no order-1 THPs) */ + if (new_order == 1) { + VM_WARN_ONCE(1, "Cannot split to order-1 folio"); + return -EINVAL; + } + + /* Split anonymous folio to non-zero order not support */ + if (folio_test_anon(folio) && new_order) { + VM_WARN_ONCE(1, "Split anon folio to non-0 order not support"); + return -EINVAL; + } + is_hzp = is_huge_zero_page(&folio->page); VM_WARN_ON_ONCE_FOLIO(is_hzp, folio); if (is_hzp) @@ -2744,7 +2782,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (folio_ref_freeze(folio, 1 + extra_pins)) { if (!list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; - list_del(&folio->_deferred_list); + /* + * Reinitialize page_deferred_list after removing the + * page from the split_queue, otherwise a subsequent + * split will see list corruption when checking the + * page_deferred_list. + */ + list_del_init(&folio->_deferred_list); } spin_unlock(&ds_queue->split_queue_lock); if (mapping) { @@ -2754,14 +2798,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (folio_test_swapbacked(folio)) { __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr); - } else { + } else if (!new_order) { + /* + * Decrease THP stats only if split to normal + * pages + */ __lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr); filemap_nr_thps_dec(mapping); } } - __split_huge_page(page, list, end); + __split_huge_page(page, list, end, new_order); ret = 0; } else { spin_unlock(&ds_queue->split_queue_lock);