Message ID | 20210414133931.4555-1-mgorman@techsingularity.net
---|---
Series | Use local_lock for pcp protection and reduce stat overhead
On 4/14/21 3:39 PM, Mel Gorman wrote: > Now that the zone_statistics are simple counters that do not require > special protection, the bulk allocator accounting updates can be batch > updated without adding too much complexity with protected RMW updates or > using xchg. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > include/linux/vmstat.h | 8 ++++++++ > mm/page_alloc.c | 30 +++++++++++++----------------- > 2 files changed, 21 insertions(+), 17 deletions(-) > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h > index dde4dec4e7dd..8473b8fa9756 100644 > --- a/include/linux/vmstat.h > +++ b/include/linux/vmstat.h > @@ -246,6 +246,14 @@ __count_numa_event(struct zone *zone, enum numa_stat_item item) > raw_cpu_inc(pzstats->vm_numa_event[item]); > } > > +static inline void > +__count_numa_events(struct zone *zone, enum numa_stat_item item, long delta) > +{ > + struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats; > + > + raw_cpu_add(pzstats->vm_numa_event[item], delta); > +} > + > extern void __count_numa_event(struct zone *zone, enum numa_stat_item item); > extern unsigned long sum_zone_node_page_state(int node, > enum zone_stat_item item); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9d0f047647e3..cff0f1c98b28 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3411,7 +3411,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt) > * > * Must be called with interrupts disabled. > */ > -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) > +static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, > + long nr_account) > { > #ifdef CONFIG_NUMA > enum numa_stat_item local_stat = NUMA_LOCAL; > @@ -3424,12 +3425,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) > local_stat = NUMA_OTHER; > > if (zone_to_nid(z) == zone_to_nid(preferred_zone)) > - __count_numa_event(z, NUMA_HIT); > + __count_numa_events(z, NUMA_HIT, nr_account); > else { > - __count_numa_event(z, NUMA_MISS); > - __count_numa_event(preferred_zone, NUMA_FOREIGN); > + __count_numa_events(z, NUMA_MISS, nr_account); > + __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account); > } > - __count_numa_event(z, local_stat); > + __count_numa_events(z, local_stat, nr_account); > #endif > } > > @@ -3475,7 +3476,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, > page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list); > if (page) { > __count_zid_vm_events(PGALLOC, page_zonenum(page), 1); > - zone_statistics(preferred_zone, zone); > + zone_statistics(preferred_zone, zone, 1); > } > local_unlock_irqrestore(&pagesets.lock, flags); > return page; > @@ -3536,7 +3537,7 @@ struct page *rmqueue(struct zone *preferred_zone, > get_pcppage_migratetype(page)); > > __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); > - zone_statistics(preferred_zone, zone); > + zone_statistics(preferred_zone, zone, 1); > local_irq_restore(flags); > > out: > @@ -5019,7 +5020,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > struct alloc_context ac; > gfp_t alloc_gfp; > unsigned int alloc_flags = ALLOC_WMARK_LOW; > - int nr_populated = 0; > + int nr_populated = 0, nr_account = 0; > > if (unlikely(nr_pages <= 0)) > return 0; > @@ -5092,15 +5093,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > goto failed_irq; > break; > } > - > - /* > - * Ideally this would be batched but 
the best way to do > - * that cheaply is to first convert zone_statistics to > - * be inaccurate per-cpu counter like vm_events to avoid > - * a RMW cycle then do the accounting with IRQs enabled. > - */ > - __count_zid_vm_events(PGALLOC, zone_idx(zone), 1); > - zone_statistics(ac.preferred_zoneref->zone, zone); > + nr_account++; > > prep_new_page(page, 0, gfp, 0); > if (page_list) > @@ -5110,6 +5103,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > nr_populated++; > } > > + __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); > + zone_statistics(ac.preferred_zoneref->zone, zone, nr_account); > + > local_unlock_irqrestore(&pagesets.lock, flags); > > return nr_populated; >
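To make the batching idea in this patch concrete, here is a minimal userspace C sketch, not kernel code; `alloc_one_page()`, `stat_numa_hit` and `alloc_bulk()` are invented stand-ins. The loop only counts successful allocations in `nr_account`, and the statistics counter is bumped once after the loop instead of once per page, which is what lets the real series drop the per-iteration `__count_zid_vm_events()`/`zone_statistics()` calls from `__alloc_pages_bulk()`.

```c
/*
 * Userspace sketch of the batched accounting pattern (assumed names,
 * not the kernel implementation).
 */
#include <stddef.h>
#include <stdio.h>

static long stat_numa_hit;                    /* stands in for a raw_cpu_add() target */

static int alloc_one_page(void) { return 1; } /* pretend allocation always succeeds */

static size_t alloc_bulk(size_t nr_pages)
{
	size_t nr_populated = 0;
	long nr_account = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (!alloc_one_page())
			break;
		nr_account++;                 /* was: one counter update per page */
		nr_populated++;
	}

	stat_numa_hit += nr_account;          /* one batched update, like __count_numa_events() */
	return nr_populated;
}

int main(void)
{
	printf("allocated %zu pages, NUMA_HIT now %ld\n",
	       alloc_bulk(16), stat_numa_hit);
	return 0;
}
```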
On 4/14/21 3:39 PM, Mel Gorman wrote: > Both free_pcppages_bulk() and free_one_page() have very similar > checks about whether a page's migratetype has changed under the > zone lock. Use a common helper. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Seems like for free_pcppages_bulk() this patch makes it check for each page on the pcplist - zone->nr_isolate_pageblock != 0 instead of local bool (the performance might be the same I guess on modern cpu though) - is_migrate_isolate(migratetype) for a migratetype obtained by get_pcppage_migratetype() which cannot be migrate_isolate so the check is useless. As such it doesn't seem a worthwhile cleanup to me considering all the other microoptimisations? > --- > mm/page_alloc.c | 32 ++++++++++++++++++++++---------- > 1 file changed, 22 insertions(+), 10 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 295624fe293b..1ed370668e7f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1354,6 +1354,23 @@ static inline void prefetch_buddy(struct page *page) > prefetch(buddy); > } > > +/* > + * The migratetype of a page may have changed due to isolation so check. > + * Assumes the caller holds the zone->lock to serialise against page > + * isolation. > + */ > +static inline int > +check_migratetype_isolated(struct zone *zone, struct page *page, unsigned long pfn, int migratetype) > +{ > + /* If isolating, check if the migratetype has changed */ > + if (unlikely(has_isolate_pageblock(zone) || > + is_migrate_isolate(migratetype))) { > + migratetype = get_pfnblock_migratetype(page, pfn); > + } > + > + return migratetype; > +} > + > /* > * Frees a number of pages from the PCP lists > * Assumes all pages on list are in same zone, and of same order. > @@ -1371,7 +1388,6 @@ static void free_pcppages_bulk(struct zone *zone, int count, > int migratetype = 0; > int batch_free = 0; > int prefetch_nr = READ_ONCE(pcp->batch); > - bool isolated_pageblocks; > struct page *page, *tmp; > LIST_HEAD(head); > > @@ -1433,21 +1449,20 @@ static void free_pcppages_bulk(struct zone *zone, int count, > * both PREEMPT_RT and non-PREEMPT_RT configurations. > */ > spin_lock(&zone->lock); > - isolated_pageblocks = has_isolate_pageblock(zone); > > /* > * Use safe version since after __free_one_page(), > * page->lru.next will not point to original list. > */ > list_for_each_entry_safe(page, tmp, &head, lru) { > + unsigned long pfn = page_to_pfn(page); > int mt = get_pcppage_migratetype(page); > + > /* MIGRATE_ISOLATE page should not go to pcplists */ > VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); > - /* Pageblock could have been isolated meanwhile */ > - if (unlikely(isolated_pageblocks)) > - mt = get_pageblock_migratetype(page); > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE); > + mt = check_migratetype_isolated(zone, page, pfn, mt); > + __free_one_page(page, pfn, zone, 0, mt, FPI_NONE); > trace_mm_page_pcpu_drain(page, 0, mt); > } > spin_unlock(&zone->lock); > @@ -1459,10 +1474,7 @@ static void free_one_page(struct zone *zone, > int migratetype, fpi_t fpi_flags) > { > spin_lock(&zone->lock); > - if (unlikely(has_isolate_pageblock(zone) || > - is_migrate_isolate(migratetype))) { > - migratetype = get_pfnblock_migratetype(page, pfn); > - } > + migratetype = check_migratetype_isolated(zone, page, pfn, migratetype); > __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); > spin_unlock(&zone->lock); > } >
On Wed, Apr 14, 2021 at 07:21:42PM +0200, Vlastimil Babka wrote: > On 4/14/21 3:39 PM, Mel Gorman wrote: > > Both free_pcppages_bulk() and free_one_page() have very similar > > checks about whether a page's migratetype has changed under the > > zone lock. Use a common helper. > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > Seems like for free_pcppages_bulk() this patch makes it check for each page on > the pcplist > - zone->nr_isolate_pageblock != 0 instead of local bool (the performance might > be the same I guess on modern cpu though) > - is_migrate_isolate(migratetype) for a migratetype obtained by > get_pcppage_migratetype() which cannot be migrate_isolate so the check is useless. > > As such it doesn't seem a worthwhile cleanup to me considering all the other > microoptimisations? > The patch was a preparation patch for the rest of the series to avoid code duplication and to consolidate checks together in one place to determine if they are even correct. Until zone_pcp_disable() came along, it was possible to have isolated PCP pages in the lists even though zone->nr_isolate_pageblock could be 0 during memory hot-remove so the split in free_pcppages_bulk was not necessarily correct at all times. The remaining problem is alloc_contig_pages, it does not disable PCPs so both checks are necessary. If that also disabled PCPs then check_migratetype_isolated could be deleted but the cost to alloc_contig_pages might be too high. I'll delete this patch for now because it's relatively minor and there should be other ways of keeping the code duplication down. -- Mel Gorman SUSE Labs
On 4/15/21 11:33 AM, Mel Gorman wrote: > On Wed, Apr 14, 2021 at 07:21:42PM +0200, Vlastimil Babka wrote: >> On 4/14/21 3:39 PM, Mel Gorman wrote: >> > Both free_pcppages_bulk() and free_one_page() have very similar >> > checks about whether a page's migratetype has changed under the >> > zone lock. Use a common helper. >> > >> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> >> >> Seems like for free_pcppages_bulk() this patch makes it check for each page on >> the pcplist >> - zone->nr_isolate_pageblock != 0 instead of local bool (the performance might >> be the same I guess on modern cpu though) >> - is_migrate_isolate(migratetype) for a migratetype obtained by >> get_pcppage_migratetype() which cannot be migrate_isolate so the check is useless. >> >> As such it doesn't seem a worthwhile cleanup to me considering all the other >> microoptimisations? >> > > The patch was a preparation patch for the rest of the series to avoid code > duplication and to consolidate checks together in one place to determine > if they are even correct. > > Until zone_pcp_disable() came along, it was possible to have isolated PCP > pages in the lists even though zone->nr_isolate_pageblock could be 0 during > memory hot-remove so the split in free_pcppages_bulk was not necessarily > correct at all times. > > The remaining problem is alloc_contig_pages, it does not disable > PCPs so both checks are necessary. If that also disabled PCPs > then check_migratetype_isolated could be deleted but the cost to > alloc_contig_pages might be too high. I see. Well that's unfortunate if checking zone->nr_isolate_pageblock is not sufficient, as it was supposed to be :( But I don't think the check_migratetype_isolated() check was helping in such scenario as it was, anyway. It's testing this: + if (unlikely(has_isolate_pageblock(zone) || + is_migrate_isolate(migratetype))) { In the context of free_one_page(), the 'migratetype' variable holds a value that's read from pageblock in one of the callers of free_one_page(): migratetype = get_pcppage_migratetype(page); and because it's read outside of zone lock, it might be a MIGRATE_ISOLATE even though after we obtain the zone lock, we might find out it's not anymore. This is explained in commit ad53f92eb416 ("mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype") as scenario 1. However, in the context of free_pcppages_bulk(), the migratetype we are checking in check_migratetype_isolated() is this one: int mt = get_pcppage_migratetype(page); That was the one determined while adding the page to pcplist, and is stored in the struct page and we know it's not MIGRATE_ISOLATE otherwise the page would not go to pcplist. But by rechecking this stored value, we would not be finding the case where the underlying pageblock's migratetype would change to MIGRATE_ISOLATE, anyway... > I'll delete this patch for now because it's relatively minor and there > should be other ways of keeping the code duplication down. >
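A small userspace sketch of the point above, using invented names (`recheck_migratetype()`, `pageblock_mt`, `zone_has_isolated`) rather than the real helpers: the migratetype snapshot stored with a pcplist page can never be ISOLATE, so in the free_pcppages_bulk() context only the zone-wide isolation flag can ever trigger a reread of the pageblock value.

```c
/*
 * Sketch only: a page carries a migratetype snapshot taken when it was
 * queued on the pcplist; the authoritative value lives in the pageblock
 * and may switch to ISOLATE under zone->lock. For pcplist pages the
 * snapshot is never ISOLATE, so testing it for ISOLATE is dead code.
 */
#include <stdbool.h>
#include <stdio.h>

enum mt { MT_MOVABLE, MT_ISOLATE };

static enum mt pageblock_mt = MT_MOVABLE; /* authoritative, changed under zone->lock */
static bool zone_has_isolated;            /* ~ has_isolate_pageblock(zone) */

/* Mirrors the shape of the proposed check_migratetype_isolated() helper. */
static enum mt recheck_migratetype(enum mt cached)
{
	if (zone_has_isolated || cached == MT_ISOLATE)
		return pageblock_mt;      /* reread the pageblock */
	return cached;                    /* trust the snapshot */
}

int main(void)
{
	enum mt cached = MT_MOVABLE;      /* snapshot at pcplist-add time */

	/* Isolation happens after the snapshot was taken... */
	pageblock_mt = MT_ISOLATE;
	zone_has_isolated = true;

	/* ...so only the zone-wide flag, never the snapshot, catches it. */
	printf("rechecked migratetype: %s\n",
	       recheck_migratetype(cached) == MT_ISOLATE ? "isolate" : "movable");
	return 0;
}
```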
On 4/14/21 3:39 PM, Mel Gorman wrote: > Historically when freeing pages, free_one_page() assumed that callers > had IRQs disabled and the zone->lock could be acquired with spin_lock(). > This confuses the scope of what local_lock_irq is protecting and what > zone->lock is protecting in free_unref_page_list in particular. > > This patch uses spin_lock_irqsave() for the zone->lock in > free_one_page() instead of relying on callers to have disabled > IRQs. free_unref_page_commit() is changed to only deal with PCP pages > protected by the local lock. free_unref_page_list() then first frees > isolated pages to the buddy lists with free_one_page() and frees the rest > of the pages to the PCP via free_unref_page_commit(). The end result > is that free_one_page() is no longer depending on side-effects of > local_lock to be correct. > > Note that this may incur a performance penalty while memory hot-remove > is running but that is not a common operation. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> A nit below: > @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list) > struct page *page, *next; > unsigned long flags, pfn; > int batch_count = 0; > + int migratetype; > > /* Prepare pages for freeing */ > list_for_each_entry_safe(page, next, list, lru) { > @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list) > if (!free_unref_page_prepare(page, pfn)) > list_del(&page->lru); > set_page_private(page, pfn); Should probably move this below so we don't set private for pages that then go through free_one_page()? Doesn't seem to be a bug, just unneccessary. > + > + /* > + * Free isolated pages directly to the allocator, see > + * comment in free_unref_page. > + */ > + migratetype = get_pcppage_migratetype(page); > + if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { > + if (unlikely(is_migrate_isolate(migratetype))) { > + free_one_page(page_zone(page), page, pfn, 0, > + migratetype, FPI_NONE); > + list_del(&page->lru); > + } > + } > } > > local_lock_irqsave(&pagesets.lock, flags); > list_for_each_entry_safe(page, next, list, lru) { > - unsigned long pfn = page_private(page); > - > + pfn = page_private(page); > set_page_private(page, 0); > + migratetype = get_pcppage_migratetype(page); > trace_mm_page_free_batched(page); > - free_unref_page_commit(page, pfn); > + free_unref_page_commit(page, pfn, migratetype); > > /* > * Guard against excessive IRQ disabled times when we get >
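As a rough illustration of the structure this patch gives free_unref_page_list() -- not the kernel implementation, and using pthread mutexes plus invented names as stand-ins for zone->lock and the pagesets local lock -- the first pass hands isolated pages to a slow path that takes its own lock, and the second pass batches the remaining pages under the local-lock-style section only:

```c
/* Userspace sketch of the two-pass free path (assumed names). */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
	bool isolated;
	struct fake_page *next;
};

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;  /* ~ zone->lock */
static pthread_mutex_t local_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ pagesets.lock */

/* Slow path: takes the zone lock itself, callers need no special context. */
static void free_one_page_sketch(struct fake_page *page)
{
	pthread_mutex_lock(&zone_lock);
	printf("freed isolated page %p to buddy\n", (void *)page);
	pthread_mutex_unlock(&zone_lock);
}

static void free_list_sketch(struct fake_page *head)
{
	/* Pass 1: divert isolated pages; they never reach the PCP section. */
	for (struct fake_page *p = head; p; p = p->next)
		if (p->isolated)
			free_one_page_sketch(p);

	/* Pass 2: everything else is batched under the local lock only. */
	pthread_mutex_lock(&local_lock);
	for (struct fake_page *p = head; p; p = p->next)
		if (!p->isolated)
			printf("freed page %p to pcp\n", (void *)p);
	pthread_mutex_unlock(&local_lock);
}

int main(void)
{
	struct fake_page c = { .isolated = false, .next = NULL };
	struct fake_page b = { .isolated = true,  .next = &c };
	struct fake_page a = { .isolated = false, .next = &b };

	free_list_sketch(&a);
	return 0;
}
```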
On Thu, Apr 15, 2021 at 02:25:36PM +0200, Vlastimil Babka wrote: > > @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list) > > struct page *page, *next; > > unsigned long flags, pfn; > > int batch_count = 0; > > + int migratetype; > > > > /* Prepare pages for freeing */ > > list_for_each_entry_safe(page, next, list, lru) { > > @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list) > > if (!free_unref_page_prepare(page, pfn)) > > list_del(&page->lru); > > set_page_private(page, pfn); > > Should probably move this below so we don't set private for pages that then go > through free_one_page()? Doesn't seem to be a bug, just unneccessary. > Sure. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1d87ca364680..a9c1282d9c7b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3293,7 +3293,6 @@ void free_unref_page_list(struct list_head *list) pfn = page_to_pfn(page); if (!free_unref_page_prepare(page, pfn)) list_del(&page->lru); - set_page_private(page, pfn); /* * Free isolated pages directly to the allocator, see @@ -3307,6 +3306,8 @@ void free_unref_page_list(struct list_head *list) list_del(&page->lru); } } + + set_page_private(page, pfn); } local_lock_irqsave(&pagesets.lock, flags); -- Mel Gorman SUSE Labs
On 4/14/21 3:39 PM, Mel Gorman wrote: > struct per_cpu_pages is protected by the pagesets lock but it can be > embedded within struct per_cpu_pages at a minor cost. This is possible > because per-cpu lookups are based on offsets. Paraphrasing an explanation > from Peter Ziljstra > > The whole thing relies on: > > &per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu) > > Which is true because the lhs: > > (local_lock_t *)((zone->per_cpu_pages + per_cpu_offset(cpu)) + offsetof(struct per_cpu_pages, lock)) > > and the rhs: > > (local_lock_t *)((zone->per_cpu_pages + offsetof(struct per_cpu_pages, lock)) + per_cpu_offset(cpu)) > > are identical, because addition is associative. > > More details are included in mmzone.h. This embedding is not completely > free for three reasons. > > 1. As local_lock does not return a per-cpu structure, the PCP has to > be looked up twice -- first to acquire the lock and again to get the > PCP pointer. > > 2. For PREEMPT_RT and CONFIG_DEBUG_LOCK_ALLOC, local_lock is potentially > a spinlock or has lock-specific tracking. In both cases, it becomes > necessary to release/acquire different locks when freeing a list of > pages in free_unref_page_list. Looks like this pattern could benefit from a local_lock API helper that would do the right thing? It probably couldn't optimize much the CONFIG_PREEMPT_RT case which would need to be unlock/lock in any case, but CONFIG_DEBUG_LOCK_ALLOC could perhaps just keep the IRQ's disabled and just note the change of what's acquired? > 3. For most kernel configurations, local_lock_t is empty and no storage is > required. By embedding the lock, the memory consumption on PREEMPT_RT > and CONFIG_DEBUG_LOCK_ALLOC is higher. But I wonder, is there really a benefit to this increased complexity? Before the patch we had "pagesets" - a local_lock that protects all zones' pcplists. Now each zone's pcplists have own local_lock. On !PREEMPT_RT we will never take the locks of multiple zones from the same CPU in parallel, because we use local_lock_irqsave(). Can that parallelism happen on PREEMPT_RT, because that could perhaps justify the change? > Suggested-by: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > ---
On Thu, Apr 15, 2021 at 04:53:46PM +0200, Vlastimil Babka wrote: > On 4/14/21 3:39 PM, Mel Gorman wrote: > > struct per_cpu_pages is protected by the pagesets lock but it can be > > embedded within struct per_cpu_pages at a minor cost. This is possible > > because per-cpu lookups are based on offsets. Paraphrasing an explanation > > from Peter Ziljstra > > > > The whole thing relies on: > > > > &per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu) > > > > Which is true because the lhs: > > > > (local_lock_t *)((zone->per_cpu_pages + per_cpu_offset(cpu)) + offsetof(struct per_cpu_pages, lock)) > > > > and the rhs: > > > > (local_lock_t *)((zone->per_cpu_pages + offsetof(struct per_cpu_pages, lock)) + per_cpu_offset(cpu)) > > > > are identical, because addition is associative. > > > > More details are included in mmzone.h. This embedding is not completely > > free for three reasons. > > > > 1. As local_lock does not return a per-cpu structure, the PCP has to > > be looked up twice -- first to acquire the lock and again to get the > > PCP pointer. > > > > 2. For PREEMPT_RT and CONFIG_DEBUG_LOCK_ALLOC, local_lock is potentially > > a spinlock or has lock-specific tracking. In both cases, it becomes > > necessary to release/acquire different locks when freeing a list of > > pages in free_unref_page_list. > > Looks like this pattern could benefit from a local_lock API helper that would do > the right thing? It probably couldn't optimize much the CONFIG_PREEMPT_RT case > which would need to be unlock/lock in any case, but CONFIG_DEBUG_LOCK_ALLOC > could perhaps just keep the IRQ's disabled and just note the change of what's > acquired? > A helper could potentially be used but right now, there is only one call-site that needs this type of care so it may be overkill. A helper was proposed that can lookup and lock a per-cpu structure which is generally useful but does not suit the case where different locks need to be acquired. > > 3. For most kernel configurations, local_lock_t is empty and no storage is > > required. By embedding the lock, the memory consumption on PREEMPT_RT > > and CONFIG_DEBUG_LOCK_ALLOC is higher. > > But I wonder, is there really a benefit to this increased complexity? Before the > patch we had "pagesets" - a local_lock that protects all zones' pcplists. Now > each zone's pcplists have own local_lock. On !PREEMPT_RT we will never take the > locks of multiple zones from the same CPU in parallel, because we use > local_lock_irqsave(). Can that parallelism happen on PREEMPT_RT, because that > could perhaps justify the change? > I don't think PREEMPT_RT gets additional parallelism because it's still a per-cpu structure that is being protected. The difference is whether we are protecting the CPU-N index for all per_cpu_pages or just one. The patch exists because it was asked why the lock was not embedded within the structure it's protecting. I initially thought that was unsafe and I was wrong as explained in the changelog. But now that I find it *can* be done but it's a bit ugly so I put it at the end of the series so it can be dropped if necessary. -- Mel Gorman SUSE Labs
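The address identity quoted above can be demonstrated with a short userspace C program; `fake_per_cpu_ptr()` and `fake_per_cpu_offset()` are invented models of the real per-cpu accessors, which resolve a CPU's copy by adding a CPU-specific offset to a base address. Taking the CPU offset first and the member offset second, or the other way around, lands on the same lock address, which is why the local_lock_t can be embedded in struct per_cpu_pages.

```c
/* Sketch of &per_cpu_ptr(p, cpu)->lock == per_cpu_ptr(&p->lock, cpu). */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct per_cpu_pages_sketch {
	int count;
	long lock;                        /* stands in for local_lock_t */
};

#define NR_CPUS 4

/* One copy of the structure per CPU, laid out contiguously. */
static struct per_cpu_pages_sketch pcp_area[NR_CPUS];

static size_t fake_per_cpu_offset(int cpu)
{
	return (size_t)cpu * sizeof(struct per_cpu_pages_sketch);
}

/* Model of per_cpu_ptr(ptr, cpu): shift any address by the CPU offset. */
#define fake_per_cpu_ptr(ptr, cpu) \
	((void *)((char *)(ptr) + fake_per_cpu_offset(cpu)))

int main(void)
{
	struct per_cpu_pages_sketch *base = &pcp_area[0];

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* lhs: find this CPU's structure, then take its ->lock */
		long *lhs = &((struct per_cpu_pages_sketch *)fake_per_cpu_ptr(base, cpu))->lock;
		/* rhs: take ->lock of the base structure, then shift by the CPU offset */
		long *rhs = fake_per_cpu_ptr(&base->lock, cpu);
		assert(lhs == rhs);
	}
	printf("lhs == rhs for every CPU: the lock can be embedded in the per-cpu struct\n");
	return 0;
}
```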