[v4] Weighted Interleave Auto-tuning

Message ID	20250128222332.3835931-1-joshua.hahnjy@gmail.com
State	Superseded
Headers	show Received: from mail-yb1-f172.google.com (mail-yb1-f172.google.com [209.85.219.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1818319755B; Tue, 28 Jan 2025 22:23:34 +0000 (UTC) From: Joshua Hahn <joshua.hahnjy@gmail.com> To: gourry@gourry.net, hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com Cc: rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org, akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com, dan.j.williams@intel.com, Jonathan.Cameron@huawei.com, dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org, linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com Subject: [PATCH v4] Weighted Interleave Auto-tuning Date: Tue, 28 Jan 2025 14:23:31 -0800 Message-ID: <20250128222332.3835931-1-joshua.hahnjy@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v4] Weighted Interleave Auto-tuning \| expand [v4] Weighted Interleave Auto-tuning

Joshua Hahn Jan. 28, 2025, 10:23 p.m. UTC

On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

At the same time, we want these weights to be as small as possible.
Having ratios that involve large co-prime numbers like 7639:1345:7 leads
to awkward and inefficient allocations, since the node with weight 7
will remain mostly unused (and despite being proportional to bandwidth,
will not aid in relieving the bandwidth pressure in the other two nodes).

This patch introduces an auto-configuration mode for the interleave
weights that aims to balance the two goals of setting node weights to be
proportional to their bandwidths and keeping the weight values low.
In order to perform the weight re-scaling, we use an internal
"weightiness" value (fixed to 32) that defines interleave aggression.

In this auto configuration mode, node weights are dynamically updated
every time there is a hotplug event that introduces new bandwidth.

Users can also enter manual mode by writing "N" or "0" to the new "auto"
sysfs interface. When a user enters manual mode, the system stops
dynamically updating any of the node weights, even during hotplug events
that can shift the optimal weight distribution. The system also enters
manual mode any time a user sets a node's weight directly by using the
nodeN interface introduced in [1]. On the other hand, auto mode is
only entered by explicitly writing "Y" or "1" to the auto interface.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights were 1, which meant that writing
0 and 1 were functionally equivalent.

This patch introduces "real" defaults, but moves away from letting users
use 0 as a "set to default" interface. Rather, users who want to use
system defaults should use auto mode. This patch seems to be the
appropriate place to make this change, since we would like to remove
this usage before users begin to rely on the feature in userspace.
Moreover, users will not be losing any functionality; they can still
write 1 into a node if they want a weight of 1. Thus, we deprecate the
"write zero to reset" feature in favor of returning an error, the same
way we would return an error when the user writes any other invalid
weight to the interface.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Changelog
v4:
- Renamed the mode inerface to the "auto" interface, which now only
  emits either 'Y' or 'N'. Users can now interact with it by
  writing 'Y', '1', 'N', or '0' to it.
- Added additional documentation to the nodeN sysfs interface.
- Makes sure iw_table locks are properly held.
- Removed unlikely() call in reduce_interleave_weights.
- Wordsmithing

v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  caes differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!

...fs-kernel-mm-mempolicy-weighted-interleave |  34 ++-
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 200 ++++++++++++++++--
 5 files changed, 229 insertions(+), 17 deletions(-)

Honggyu Kim Jan. 31, 2025, 2:23 p.m. UTC | #1

Hi Joshua,

Thank you for the v4 update!

It took a long time to give you a feedback due to Luna new year's
holiday.  Happy new year anyway!

On 1/29/25 07:23, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the bandwidth pressure in the other two nodes).
> 
> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.
> 
> In this auto configuration mode, node weights are dynamically updated
> every time there is a hotplug event that introduces new bandwidth.
> 
> Users can also enter manual mode by writing "N" or "0" to the new "auto"
> sysfs interface. When a user enters manual mode, the system stops
> dynamically updating any of the node weights, even during hotplug events
> that can shift the optimal weight distribution. The system also enters
> manual mode any time a user sets a node's weight directly by using the
> nodeN interface introduced in [1]. On the other hand, auto mode is
> only entered by explicitly writing "Y" or "1" to the auto interface.
> 
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent.
> 
> This patch introduces "real" defaults, but moves away from letting users
> use 0 as a "set to default" interface. Rather, users who want to use
> system defaults should use auto mode. This patch seems to be the
> appropriate place to make this change, since we would like to remove
> this usage before users begin to rely on the feature in userspace.
> Moreover, users will not be losing any functionality; they can still
> write 1 into a node if they want a weight of 1. Thus, we deprecate the
> "write zero to reset" feature in favor of returning an error, the same
> way we would return an error when the user writes any other invalid
> weight to the interface.
> 
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> Changelog
> v4:
> - Renamed the mode inerface to the "auto" interface, which now only

[nit] inerface -> interface

So we're going back to similar to "use_default" now.  I prefer "mode"
with string interface with "auto" and "manual", but I also agree with
Ying that "mode" won't have more options in the future so this kind of
binary interface with "auto" is also fine.

>    emits either 'Y' or 'N'. Users can now interact with it by
>    writing 'Y', '1', 'N', or '0' to it.
> - Added additional documentation to the nodeN sysfs interface.
> - Makes sure iw_table locks are properly held.
> - Removed unlikely() call in reduce_interleave_weights.
> - Wordsmithing
> 
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>    can either be "auto" or "manual".
>    - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>    specifies that 0 is invalid.
>    - Thank you Gregory and Ying for the discussion on how best to
>      handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>    caes differently now. bw is always updated, iw updates depend on

[nit] caes -> case

>    what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
> 
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>    interleave weight table is initialized with the defaults, if bandwidth
>    information is available.
>    - In addition, all sections that handle iw_table have been changed
>      to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>    support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>    behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
> 
> ...fs-kernel-mm-mempolicy-weighted-interleave |  34 ++-
>   drivers/acpi/numa/hmat.c                      |   1 +
>   drivers/base/node.c                           |   7 +
>   include/linux/mempolicy.h                     |   4 +
>   mm/mempolicy.c                                | 200 ++++++++++++++++--
>   5 files changed, 229 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..c26879f59d5d 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN
>   		Minimum weight: 1
>   		Maximum weight: 255
>   
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		January 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. A 'Y' indicates
> +		that the system is in auto mode, and a 'N' indicates that
> +		the system is in manual mode. All other values are invalid.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		If a node is hotplugged while the user is in manual mode,
> +		the node will have a default weight of 1.

Sorry for jumping into this discussion a bit late, but IMHO, setting a
newly onlined node's weight to 1 can possibly make the entire weight
ratio unbalanced.

We can avoid this default weight by not choosing the new node in
nodemask as Ying suggested, but some APIs of numactl uses all available
nodes in the system and the following API is one of the example.

   void *numa_alloc_weighted_interleaved(size_t size);
   (Add by Hyeonggon at https://github.com/numactl/numactl/pull/238)

The description of this API is as follows.

   numa_alloc_weighted_interleaved() allocates size bytes of memory page
   interleaved on _all nodes_ according to the weights in
   /sys/kernel/mm/mempolicy/weighted_interleave/node*.

Let's say there are 3 nodes and there bandwidth ratio is as follows.

   node0 - 1000
   node1 - 500
   node2 - 100

In this case, the weight ratio will be 10:5:1 in auto mode.  I worry
about the following problematic scenario.

   1. the auto mode set the weights as 10:5:1 for node{0-2}.
   2. node2 is offlined, then recalculation makes it as 2:1 for node{0,1}.
   3. the auto sysfs interface is set to 0 to make it manual mode.
   4. node2 is onlined again, then the weights is now 2:1:1.
   5. calling numa_alloc_weighted_interleaved() API works in bad weights.

IMHO, the weight of onlined node is better to be 0 in manual mode and
give responsibility of setting the proper weight to users.

Could anyone please correct me if I miss something for this?

> +
> +		Modes can be changed by writing Y, N, 1, or 0 to the interface.
> +		All other strings will be ignored, and -EINVAL will be returned.
> +		If Y or 1 is written to the interface but the recalculation or
> +		updates fail at any point (-ENOMEM or -ENODEV), then the mode
> +		will remain in manual mode.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 80a3481c0470..cc94cba112dd 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -20,6 +20,7 @@
>   #include <linux/list_sort.h>
>   #include <linux/memregion.h>
>   #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>   #include <linux/mutex.h>
>   #include <linux/node.h>
>   #include <linux/sysfs.h>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..16e7a5a8ebe7 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>   #include <linux/init.h>
>   #include <linux/mm.h>
>   #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>   #include <linux/vmstat.h>
>   #include <linux/notifier.h>
>   #include <linux/node.h>
> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>   			break;
>   		}
>   	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord))
> +			pr_info("failed to set node%d mempolicy attrs\n", nid);
> +	}
>   }
>   EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>   
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..0fe96f3ab3ef 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>   #include <linux/slab.h>
>   #include <linux/rbtree.h>
>   #include <linux/spinlock.h>
> +#include <linux/node.h>
>   #include <linux/nodemask.h>
>   #include <linux/pagemap.h>
>   #include <uapi/linux/mempolicy.h>
> @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>   
>   extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>   
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>   #else
>   
>   struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 04f35659717a..e41dbb3e9185 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>   #include <linux/mmu_notifier.h>
>   #include <linux/printk.h>
>   #include <linux/swapops.h>
> +#include <linux/gcd.h>
>   
>   #include <asm/tlbflush.h>
>   #include <asm/tlb.h>
> @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
>   
>   static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>   
> +static uint64_t *node_bw_table;
> +
>   /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> + * iw_table is the interleave weight table.
> + * If bandwiddth data is available and the user is in auto mode, the table

[nit] bandwiddth -> bandwidth

> + * is populated with default values in [1,255].
>    * iw_table is RCU protected
>    */
>   static u8 __rcu *iw_table;
>   static DEFINE_MUTEX(iw_table_lock);
> +static const int weightiness = 32;
> +static bool weighted_interleave_auto = true;
>   
>   static u8 get_il_weight(int node)
>   {
> @@ -156,14 +159,113 @@ static u8 get_il_weight(int node)
>   
>   	rcu_read_lock();
>   	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */
>   	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
>   	rcu_read_unlock();
>   	return weight;
>   }
>   
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with iw_table_lock.
> + */
> +static void reduce_interleave_weights(uint64_t *bw, u8 *new_iw)
> +{
> +	uint64_t ttl_bw = 0, ttl_iw = 0, scaling_factor = 1, iw_gcd = 1;

[nit] Just curiosity but what does ttl mean? If this means "total", then
what about using sum_bw or sum_iw?

It's also a nitpick, but this single line defines too many variables so
how about this?

	uint64_t sum_bw = 0, sum_iw = 0;
	uint64_t scaling_factor = 1, iw_gcd = 1;

> +	unsigned int i = 0;
> +
> +	/* Recalculate the bandwidth distribution given the new info */
> +	for (i = 0; i < nr_node_ids; i++)
> +		ttl_bw += bw[i];
> +
> +	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (bw[i]) {
> +			scaling_factor = 100 * bw[i];
> +			new_iw[i] = max(scaling_factor / ttl_bw, 1);
> +		} else {
> +			new_iw[i] = 1;
> +		}
> +		ttl_iw += new_iw[i];
> +	}
> +
> +	/*
> +	 * Scale each node's share of the total bandwidth from percentages
> +	 * to whole numbers in the range [1, weightiness]
> +	 */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		scaling_factor = weightiness * new_iw[i];
> +		new_iw[i] = max(scaling_factor / ttl_iw, 1);
> +		if (i == 0)
> +			iw_gcd = new_iw[0];
> +		iw_gcd = gcd(iw_gcd, new_iw[i]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_iw[i] /= iw_gcd;
> +}
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	uint64_t *old_bw, *new_bw;
> +	uint64_t bw_val;
> +	u8 *old_iw, *new_iw;
> +
> +	/*
> +	 * Bandwidths above this limit cause rounding errors when reducing
> +	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
> +	 */
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	if (bw_val > (U64_MAX / 10))
> +		return -EINVAL;
> +
> +	new_bw = kcalloc(nr_node_ids, sizeof(uint64_t), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
> +	if (!new_iw) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&iw_table_lock);
> +	old_bw = node_bw_table;
> +	old_iw = rcu_dereference_protected(iw_table,
> +					   lockdep_is_held(&iw_table_lock));
> +
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids*sizeof(uint64_t));

[nit] nr_node_ids*sizeof(uint64_t) -> nr_node_ids * sizeof(uint64_t)

Thanks,
Honggyu

Gregory Price Feb. 1, 2025, 4:49 p.m. UTC | #2

Hi Honggyu,

On Fri, Jan 31, 2025 at 11:23:34PM +0900, Honggyu Kim wrote:
> Sorry for jumping into this discussion a bit late, but IMHO, setting a
> newly onlined node's weight to 1 can possibly make the entire weight
> ratio unbalanced.
> 

This is only the default for either:
a) Manual Mode
b) Auto Mode - when HMAT/CDAT data is missing

In manual mode, the node weight in /sys/kernel/.../nodeN is present
regardless of the online/offline state of the node - so it can be
set prior to that node being hotplugged.  Since it's in manual mode,
it's expected the administrator knows when things are coming online
and offline, and they have a clear chance to change weights such that
a 1 is never perceived.

In auto mode, the node goes "online" after HMAT/CDAT data is reported
so users will generally not perceive this default weight.

>   1. the auto mode set the weights as 10:5:1 for node{0-2}.
>   2. node2 is offlined, then recalculation makes it as 2:1 for node{0,1}.
>   3. the auto sysfs interface is set to 0 to make it manual mode.

If the user sets manual mode, they get manual mode.  If they don't
want to manually adjust weights, don't set manual mode.

We can't stop users from doing silly things

~Gregory

Gregory Price Feb. 1, 2025, 4:53 p.m. UTC | #3

On Sat, Feb 01, 2025 at 11:49:31AM -0500, Gregory Price wrote:
> >   1. the auto mode set the weights as 10:5:1 for node{0-2}.
> >   2. node2 is offlined, then recalculation makes it as 2:1 for node{0,1}.

Point of clarification here:  a hot-unplug event won't cause
recalculation.

What actually causes re-weight is hot-plug reporting new capacity.

So in this scenario, the weight will remain the same for node2.

> >   3. the auto sysfs interface is set to 0 to make it manual mode.

Honggyu Kim Feb. 2, 2025, 1:44 p.m. UTC | #4

Hi Gregory,

On 2/2/25 01:49, Gregory Price wrote:
> Hi Honggyu,
> 
> On Fri, Jan 31, 2025 at 11:23:34PM +0900, Honggyu Kim wrote:
>> Sorry for jumping into this discussion a bit late, but IMHO, setting a
>> newly onlined node's weight to 1 can possibly make the entire weight
>> ratio unbalanced.
>>
> 
> This is only the default for either:
> a) Manual Mode
> b) Auto Mode - when HMAT/CDAT data is missing
> 
> In manual mode, the node weight in /sys/kernel/.../nodeN is present
> regardless of the online/offline state of the node - so it can be
> set prior to that node being hotplugged.  Since it's in manual mode,
> it's expected the administrator knows when things are coming online
> and offline, and they have a clear chance to change weights such that
> a 1 is never perceived.

This may be true, but system admins can hot-plug some nodes without
considering about weighted interleaving.  If weighted interleave is not
their major use cases, then they may not consider weight values
seriously, but some workloads might use some weighted interleave APIs
that use all nodes available.

This might be my imaginary scenario so if you think it's not realistic,
then you can just ignore.

Thanks,
Honggyu

> 
> In auto mode, the node goes "online" after HMAT/CDAT data is reported
> so users will generally not perceive this default weight.
> 
>>    1. the auto mode set the weights as 10:5:1 for node{0-2}.
>>    2. node2 is offlined, then recalculation makes it as 2:1 for node{0,1}.
>>    3. the auto sysfs interface is set to 0 to make it manual mode.
> 
> If the user sets manual mode, they get manual mode.  If they don't
> want to manually adjust weights, don't set manual mode.
> 
> We can't stop users from doing silly things
> 
> ~Gregory

Honggyu Kim Feb. 2, 2025, 1:51 p.m. UTC | #5

On 2/2/25 01:53, Gregory Price wrote:
> On Sat, Feb 01, 2025 at 11:49:31AM -0500, Gregory Price wrote:
>>>    1. the auto mode set the weights as 10:5:1 for node{0-2}.
>>>    2. node2 is offlined, then recalculation makes it as 2:1 for node{0,1}.
> 
> Point of clarification here:  a hot-unplug event won't cause
> recalculation.
> 
> What actually causes re-weight is hot-plug reporting new capacity.

So do you mean re-weight is done only when a new node is onlined while
offline doesn't trigger re-weight?

I see node_set_perf_attrs() does recalculation by calling
mempolicy_set_node_perf(), then reduce_interleave_weights().

But I'm not sure if the re-weight is done via node_set_perf_attrs() only
when a new node is onlined.

Could you please explain where I can find it?

> 
> So in this scenario, the weight will remain the same for node2.

If it's true, my scenario is wrong.

Thanks,
Honggyu

> 
>>>    3. the auto sysfs interface is set to 0 to make it manual mode.

Harry (Hyeonggon) Yoo Feb. 4, 2025, 7:50 a.m. UTC | #6

On Tue, Jan 28, 2025 at 02:23:31PM -0800, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the bandwidth pressure in the other two nodes).
>
> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.
>
> In this auto configuration mode, node weights are dynamically updated
> every time there is a hotplug event that introduces new bandwidth.
>
> Users can also enter manual mode by writing "N" or "0" to the new "auto"
> sysfs interface. When a user enters manual mode, the system stops
> dynamically updating any of the node weights, even during hotplug events
> that can shift the optimal weight distribution. The system also enters
> manual mode any time a user sets a node's weight directly by using the
> nodeN interface introduced in [1]. On the other hand, auto mode is
> only entered by explicitly writing "Y" or "1" to the auto interface.
>
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent.
>
> This patch introduces "real" defaults, but moves away from letting users
> use 0 as a "set to default" interface. Rather, users who want to use
> system defaults should use auto mode. This patch seems to be the
> appropriate place to make this change, since we would like to remove
> this usage before users begin to rely on the feature in userspace.
> Moreover, users will not be losing any functionality; they can still
> write 1 into a node if they want a weight of 1. Thus, we deprecate the
> "write zero to reset" feature in favor of returning an error, the same
> way we would return an error when the user writes any other invalid
> weight to the interface.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---

Hi Joshua,
I'm glad we're close to finalizing the interface.

I believe the author has successfully addressed major concerns
through the revisions. The interface and the code now look good to me.

Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

With a few nits:

> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..c26879f59d5d 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN

[...snip...]

> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		January 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. A 'Y' indicates
> +		that the system is in auto mode, and a 'N' indicates that
> +		the system is in manual mode. All other values are invalid.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		If a node is hotplugged while the user is in manual mode,
> +		the node will have a default weight of 1.
> +
> +		Modes can be changed by writing Y, N, 1, or 0 to the interface.
> +		All other strings will be ignored, and -EINVAL will be returned.
> +		If Y or 1 is written to the interface but the recalculation or
> +		updates fail at any point (-ENOMEM or -ENODEV), then the mode
> +		will remain in manual mode.

nit: the commit log describes that writing 'N' or '0' means
switching to manual mode and writing 1 means switching to auto mode,
but the Documentation does not explicitly states what '0' and '1' does?

> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.


> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 80a3481c0470..cc94cba112dd 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -20,6 +20,7 @@
>  #include <linux/list_sort.h>
>  #include <linux/memregion.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>

nit: is this #include directive necessary?

--
Harry

Gregory Price Feb. 4, 2025, 4:30 p.m. UTC | #7

On Tue, Feb 04, 2025 at 04:50:26PM +0900, Harry (Hyeonggon) Yoo wrote:
> On Tue, Jan 28, 2025 at 02:23:31PM -0800, Joshua Hahn wrote:
> 
> I believe the author has successfully addressed major concerns
> through the revisions. The interface and the code now look good to me.
> 
> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 

Thank you!

> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > index 80a3481c0470..cc94cba112dd 100644
> > --- a/drivers/acpi/numa/hmat.c
> > +++ b/drivers/acpi/numa/hmat.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/list_sort.h>
> >  #include <linux/memregion.h>
> >  #include <linux/memory.h>
> > +#include <linux/mempolicy.h>
> 
> nit: is this #include directive necessary?

yes because hmat.c now calls
	mempolicy_set_node_perf(nid, coord))

and best practice is to "include what you use" to avoid hidden
dependency issues should another include be removed later.

~Gregory

Gregory Price Feb. 12, 2025, 5:27 p.m. UTC | #8

On Tue, Feb 04, 2025 at 11:30:01AM -0500, Gregory Price wrote:
> > > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > > index 80a3481c0470..cc94cba112dd 100644
> > > --- a/drivers/acpi/numa/hmat.c
> > > +++ b/drivers/acpi/numa/hmat.c
> > > @@ -20,6 +20,7 @@
> > >  #include <linux/list_sort.h>
> > >  #include <linux/memregion.h>
> > >  #include <linux/memory.h>
> > > +#include <linux/mempolicy.h>
> > 
> > nit: is this #include directive necessary?
> 
> yes because hmat.c now calls
> 	mempolicy_set_node_perf(nid, coord))
> 

apologies, i missed that there were two of these, we'll drop this.

~Gregory

[v4] Weighted Interleave Auto-tuning

Commit Message

Comments

Patch