[2/2,v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

Message ID 20250226213518.767670-2-joshua.hahnjy@gmail.com
State New
Series [1/2,v6] mm/mempolicy: Weighted Interleave Auto-tuning

Commit Message

Joshua Hahn Feb. 26, 2025, 9:35 p.m. UTC
We should never try to allocate memory from a memoryless node. Creating a
sysfs knob to control its weighted interleave weight does not make sense,
and can be unsafe.

Only create weighted interleave weight knobs for nodes with memory.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Honggyu Kim Feb. 27, 2025, 2:32 a.m. UTC | #1
Hi Joshua,

On 2/27/2025 6:35 AM, Joshua Hahn wrote:
> We should never try to allocate memory from a memoryless node. Creating a
> sysfs knob to control its weighted interleave weight does not make sense,
> and can be unsafe.
> 
> Only create weighted interleave weight knobs for nodes with memory.
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>   mm/mempolicy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4cc04ff8f12c..50cbb7c047fa 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>   		return err;
>   	}
>   
> -	for_each_node_state(nid, N_POSSIBLE) {

Actually, we're aware of this issue and are currently trying to fix it.
In our system, we've attached 4 channels of CXL memory to each socket as
follows.

         node0             node1
       +-------+   UPI   +-------+
       | CPU 0 |-+-----+-| CPU 1 |
       +-------+         +-------+
       | DRAM0 |         | DRAM1 |
       +---+---+         +---+---+
           |                 |
       +---+---+         +---+---+
       | CXL 0 |         | CXL 4 |
       +---+---+         +---+---+
       | CXL 1 |         | CXL 5 |
       +---+---+         +---+---+
       | CXL 2 |         | CXL 6 |
       +---+---+         +---+---+
       | CXL 3 |         | CXL 7 |
       +---+---+         +---+---+
         node2             node3

The 4 channels of CXL memory are detected as a single NUMA node in each
socket, but with the current N_POSSIBLE loop the sysfs directory shows
the following:

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1 node2 node3 node4 node5
node6 node7 node8 node9 node10 node11

> +	for_each_node_state(nid, N_MEMORY) {

But using N_MEMORY doesn't fix this problem; it hides all of the CXL
memory nodes in our system, because the CXL memory hasn't been detected
yet at the point where the node* entries are created.  Maybe there is
some difference when multiple CXL devices are detected as a single node.

We have to create more nodes when CXL memory is detected later.  In 
addition, this part can be changed to "for_each_online_node(nid)"
although N_MEMORY is also fine here.
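
For reference, a minimal sketch of that alternative loop, mirroring the
existing add_weighted_interleave_group() body (for_each_online_node()
walks N_ONLINE, i.e. nodes that are online whether or not they
currently have memory):

for_each_online_node(nid) {
	err = add_weight_node(nid, wi_kobj);
	if (err) {
		pr_err("failed to add sysfs [node%d]\n", nid);
		break;
	}
}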

We've fixed this internally using a memory hotplugging callback, so we
can upload another working version later.

Do you mind if we continue working on this fix?

Thanks,
Honggyu

>   		err = add_weight_node(nid, wi_kobj);
>   		if (err) {
>   			pr_err("failed to add sysfs [node%d]\n", nid);
Gregory Price March 3, 2025, 4:19 p.m. UTC | #2
On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
> 
> But using N_MEMORY doesn't fix this problem; it hides all of the CXL
> memory nodes in our system, because the CXL memory hasn't been detected
> yet at the point where the node* entries are created.  Maybe there is
> some difference when multiple CXL devices are detected as a single node.
> 

Hm, well, the nodes are "created" during early boot when the ACPI tables
are read and the CFMWs are discovered - but they aren't necessarily
"online" at the time they're created.

There is no true concept of a "Hotplug NUMA Node" - as the node must be
created at boot time. (tl;dr: N_POSSIBLE will never change).
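
For context, a minimal sketch of how these node-state loops expand,
paraphrased from include/linux/nodemask.h (the exact form varies by
kernel version):

/* Each node state is a global nodemask; the iterators simply walk
 * the corresponding bitmap. */
extern nodemask_t node_states[NR_NODE_STATES];

#define for_each_node_state(node, __state) \
	for_each_node_mask((node), node_states[__state])

/* node_possible_map is an alias for node_states[N_POSSIBLE]; it is
 * finalized during boot and never changes, while N_MEMORY is toggled
 * as memory is populated or hot(un)plugged. */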

This patch may have been a bit overzealous on our part; I forgot to ask
whether N_MEMORY is set for nodes created but not onlined at boot. So
this is a good observation.

It also doesn't help that this may introduce a subtle race condition.

If a node exists (N_POSSIBLE) but hasn't been onlined (!N_MEMORY) and
bandwidth information is reported - then we store the bandwidth info
but don't include the node in the reduction.  Then if the node comes
online later, we don't re-trigger reduction.

Joshua, we should just drop this patch for now and work with Honggyu and
friends separately on this issue.  In the meantime we can stick with
N_POSSIBLE.

There are more problems in this space - namely how to handle a system
where 8 CXL nodes are "possible" but the user only configures 2 (as
described by Honggyu here).  We will probably need to introduce
hotplug/node on/offline callbacks to re-configure weights.
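
A minimal sketch of one such callback, assuming the generic
memory-hotplug notifier is used (wi_resync_nodes() is a hypothetical
helper that would re-create the node knobs and re-run the weight
reduction):

#include <linux/memory.h>
#include <linux/notifier.h>

/* Hypothetical helper: re-create node knobs and re-run reduction. */
static void wi_resync_nodes(void);

static int wi_memory_callback(struct notifier_block *nb,
			      unsigned long action, void *arg)
{
	switch (action) {
	case MEM_ONLINE:	/* delivered after N_MEMORY is set */
	case MEM_OFFLINE:	/* delivered after the memory is gone */
		wi_resync_nodes();
		break;
	}
	return NOTIFY_OK;
}

static int __init wi_register_notifier(void)
{
	hotplug_memory_notifier(wi_memory_callback, DEFAULT_CALLBACK_PRI);
	return 0;
}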

~Gregory
Gregory Price March 4, 2025, 4:16 p.m. UTC | #3
On Tue, Mar 04, 2025 at 10:03:22PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> > This patch may have been a bit overzealous on our part; I forgot to ask
> > whether N_MEMORY is set for nodes created but not onlined at boot. So
> > this is a good observation.
> 
> I didn't want to make more noise, but we found many issues again after
> getting a new machine and starting to use it with multiple CXL memory
> devices.
> 

I spent yesterday looking into how nodes are created and marked N_MEMORY,
and I now think this patch is just not correct.

N_MEMORY for a given nid is toggled:
  1) during mm_init if any page is associated with that node (DRAM)
  2) memory_hotplug when a memory block is onlined/offlined  (CXL)

This means a CXL node which is deferred to the driver will come up as
memoryless at boot (mm_init) but has N_MEMORY toggled on when the first
hotplug memory block is onlined.  However, its access_coordinate data is
reported during cxl driver probe - well prior to memory hotplug.

This means we must expose a node entry for every possible node, always,
because we can't predict what nodes will have hotplug memory.

We COULD try to react to hotplug memory blocks, but this increase in
complexity just doesn't seem worth the hassle - the hotplug callback has
timing restrictions (callback must occur AFTER N_MEMORY is toggled).

It seems better to include all nodes with reported data in the reduction.

This has two downsides:
  1) stale data may be used if hotplug occurs and the new device does
     not have CDAT/HMAT/access_coordinate data.
  2) any device without CDAT/HMAT/access_coordinate data will not be
     included in the reduction by default.

I think we can work around #2 by detecting this (on reduction, if data
is missing but N_MEMORY is set, fire a warning).  We can't do much about
#1 unless we field physical device hot-un-plug callbacks - and that
seems like a bit much.
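
A minimal sketch of the warning proposed for #2 (node_has_coordinates()
is a hypothetical helper standing in for however the reported bandwidth
data ends up being stored):

/* Hypothetical helper: was access_coordinate data reported for nid? */
static bool node_has_coordinates(int nid);

/* At reduction time, flag onlined nodes that never reported
 * access_coordinate data and so fall back to the default weight. */
static void wi_warn_missing_data(void)
{
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		if (!node_has_coordinates(nid))
			pr_warn("node%d is online with no access_coordinate data\n",
				nid);
	}
}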

~Gregory
Gregory Price March 4, 2025, 4:29 p.m. UTC | #4
On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
> Actually, we're aware of this issue and are currently trying to fix it.
> In our system, we've attached 4 channels of CXL memory to each socket as
> follows.
> 
>         node0             node1
>       +-------+   UPI   +-------+
>       | CPU 0 |-+-----+-| CPU 1 |
>       +-------+         +-------+
>       | DRAM0 |         | DRAM1 |
>       +---+---+         +---+---+
>           |                 |
>       +---+---+         +---+---+
>       | CXL 0 |         | CXL 4 |
>       +---+---+         +---+---+
>       | CXL 1 |         | CXL 5 |
>       +---+---+         +---+---+
>       | CXL 2 |         | CXL 6 |
>       +---+---+         +---+---+
>       | CXL 3 |         | CXL 7 |
>       +---+---+         +---+---+
>         node2             node3
> 
> The 4 channels of CXL memory are detected as a single NUMA node in each
> socket, but with the current N_POSSIBLE loop the sysfs directory shows
> the following:
> 
> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
> node0 node1 node2 node3 node4 node5
> node6 node7 node8 node9 node10 node11

This is insufficient information for me to assess the correctness of the
configuration. Can you please show the contents of your CEDT/CFMWS and
SRAT/Memory Affinity structures?

mkdir acpi_data && cd acpi_data
acpidump -b
iasl -d *
cat cedt.dsl  <- find all CFMWS entries
cat srat.dsl  <- find all Memory Affinity entries

Basically I need to know:
1) Is each CXL device on a dedicated Host Bridge?
2) Is inter-host-bridge interleaving configured?
3) Is intra-host-bridge interleaving configured?
4) Do SRAT entries exist for all nodes?
5) Why are there 12 nodes but only 10 sources? Are there additional
   devices left out of your diagram? Are there 2 CFMWS and 8 Memory
   Affinity records - resulting in 10 nodes? This is strange.

By default, Linux creates a node for each proximity domain ("PXM")
detected in the SRAT Memory Affinity tables. If SRAT entries for a
memory region described in a CFMWS are absent, it will also create a
node for that CFMWS.
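
For reference, a simplified paraphrase of that CFMWS fallback (modeled
on acpi_parse_cfmws() in drivers/acpi/numa/srat.c; the exact code and
its error handling vary by kernel version):

/* Simplified sketch, not the exact upstream code. */
static int __init cfmws_to_node(u64 start, u64 end, int *fake_pxm)
{
	int node;

	/* If the SRAT already described any part of this HPA range,
	 * extend those memblks to cover the whole window and reuse
	 * the existing node(s). */
	if (!numa_fill_memblks(start, end))
		return 0;

	/* No SRAT description: fabricate a proximity domain and mark
	 * a brand-new node as parsed - hence "possible". */
	node = acpi_map_pxm_to_node((*fake_pxm)++);
	if (node == NUMA_NO_NODE)
		return -EINVAL;
	if (numa_add_memblk(node, start, end) < 0)
		return -EINVAL;

	node_set(node, numa_nodes_parsed);
	return 0;
}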

Your reported configuration and results lead me to believe you have
a combination of CFMWS/SRAT configurations that are unexpected.

~Gregory
Honggyu Kim March 6, 2025, 12:39 p.m. UTC | #5
Hi Gregory,

On 3/5/2025 1:29 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>> Actually, we're aware of this issue and are currently trying to fix it.
>> In our system, we've attached 4 channels of CXL memory to each socket as
>> follows.
>>
>>          node0             node1
>>        +-------+   UPI   +-------+
>>        | CPU 0 |-+-----+-| CPU 1 |
>>        +-------+         +-------+
>>        | DRAM0 |         | DRAM1 |
>>        +---+---+         +---+---+
>>            |                 |
>>        +---+---+         +---+---+
>>        | CXL 0 |         | CXL 4 |
>>        +---+---+         +---+---+
>>        | CXL 1 |         | CXL 5 |
>>        +---+---+         +---+---+
>>        | CXL 2 |         | CXL 6 |
>>        +---+---+         +---+---+
>>        | CXL 3 |         | CXL 7 |
>>        +---+---+         +---+---+
>>          node2             node3
>>
>> The 4 channels of CXL memory are detected as a single NUMA node in each
>> socket, but with the current N_POSSIBLE loop the sysfs directory shows
>> the following:
>>
>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>> node0 node1 node2 node3 node4 node5
>> node6 node7 node8 node9 node10 node11
> 
> This is insufficient information for me to assess the correctness of the
> configuration. Can you please show the contents of your CEDT/CFMWS and
> SRAT/Memory Affinity structures?
> 
> mkdir acpi_data && cd acpi_data
> acpidump -b
> iasl -d *
> cat cedt.dsl  <- find all CFMWS entries
> cat srat.dsl  <- find all Memory Affinity entries

I'm not able to provide all the details as srat.dsl has too much info.

   $ wc -l srat.dsl
   25229 srat.dsl

Instead, I can show you that there are 4 different proximity domains
with "Enabled : 1", per the following filtered output from srat.dsl.

   $ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | sed 'N;s/\n//' | sort | uniq
          Enabled : 0       Enabled : 0
   Proximity Domain : 00000000       Enabled : 0
   Proximity Domain : 00000000       Enabled : 1
   Proximity Domain : 00000001       Enabled : 1
   Proximity Domain : 00000006       Enabled : 1
   Proximity Domain : 00000007       Enabled : 1

We don't actually have to use those complicated commands to check this,
as dmesg clearly prints the SRAT and node numbers as follows.

   [    0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
   [    0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
   [    0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
   [    0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
   [    0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug

The memoryless nodes are printed as follows, after those "ACPI: SRAT:
Node N PXM M" messages.

   [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
   [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
   [    0.010992] Initmem setup node 2 as memoryless
   [    0.011055] Initmem setup node 3 as memoryless
   [    0.011115] Initmem setup node 4 as memoryless
   [    0.011177] Initmem setup node 5 as memoryless
   [    0.011238] Initmem setup node 6 as memoryless
   [    0.011299] Initmem setup node 7 as memoryless
   [    0.011361] Initmem setup node 8 as memoryless
   [    0.011422] Initmem setup node 9 as memoryless
   [    0.011484] Initmem setup node 10 as memoryless
   [    0.011544] Initmem setup node 11 as memoryless

This is related to why 12 node knobs are provided in sysfs with the
current N_POSSIBLE loop.

> 
> Basically I need to know:
> 1) Is each CXL device on a dedicated Host Bridge?
> 2) Is inter-host-bridge interleaving configured?
> 3) Is intra-host-bridge interleaving configured?
> 4) Do SRAT entries exist for all nodes?

Are there some simple commands I can use to get that info?

> 5) Why are there 12 nodes but only 10 sources? Are there additional
> >     devices left out of your diagram? Are there 2 CFMWS and 8 Memory
>     Affinity records - resulting in 10 nodes? This is strange.

My blind guess is that there could be a logical node that combines the 4
channels of CXL memory, so there are 5 nodes per socket.  Adding 2 nodes
for local CPU/DRAM makes 12 nodes in total.

> 
> By default, Linux creates a node for each proximity domain ("PXM")
> detected in the SRAT Memory Affinity tables. If SRAT entries for a
> memory region described in a CFMWS are absent, it will also create a
> node for that CFMWS.
> 
> Your reported configuration and results lead me to believe you have
> a combination of CFMWS/SRAT configurations that are unexpected.
> 
> ~Gregory

Not sure about this part but our approach with hotplug_memory_notifier()
resolves this problem.  Rakie will submit an initial working patchset 
soonish.

Thanks,
Honggyu
Gregory Price March 6, 2025, 5:32 p.m. UTC | #6
On Thu, Mar 06, 2025 at 09:39:26PM +0900, Honggyu Kim wrote:
> 
> The memoryless nodes are printed as follows, after those "ACPI: SRAT:
> Node N PXM M" messages.
> 
>   [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
>   [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
>   [    0.010992] Initmem setup node 2 as memoryless
>   [    0.011055] Initmem setup node 3 as memoryless
>   [    0.011115] Initmem setup node 4 as memoryless
>   [    0.011177] Initmem setup node 5 as memoryless
>   [    0.011238] Initmem setup node 6 as memoryless
>   [    0.011299] Initmem setup node 7 as memoryless
>   [    0.011361] Initmem setup node 8 as memoryless
>   [    0.011422] Initmem setup node 9 as memoryless
>   [    0.011484] Initmem setup node 10 as memoryless
>   [    0.011544] Initmem setup node 11 as memoryless
> 
> This is related to why 12 node knobs are provided in sysfs with the
> current N_POSSIBLE loop.
> 

This isn't actually why; this is another symptom.  This gets printed
because something is marking nodes 4-11 as possible and setup_nr_node_ids
reports 12 total nodes:

void __init setup_nr_node_ids(void)
{
        unsigned int highest;

        highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
        nr_node_ids = highest + 1;
}

Given your configuration data so far, we may have a bug somewhere (or
I'm missing a configuration piece).
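
For what it's worth, the possible map on x86 is derived wholesale from
the nodes recorded during ACPI parsing; a rough paraphrase of the
relevant step (modeled on numa_register_memblks(); the exact location
varies by kernel version):

/* Every node recorded by ACPI table parsing - SRAT Memory Affinity
 * entries plus CFMWS windows without SRAT coverage - lands in
 * numa_nodes_parsed, and all of them become "possible". */
node_possible_map = numa_nodes_parsed;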

> > Basically I need to know:
> > 1) Is each CXL device on a dedicated Host Bridge?
> > 2) Is inter-host-bridge interleaving configured?
> > 3) Is intra-host-bridge interleaving configured?
> > 4) Do SRAT entries exist for all nodes?
> 
> Are there some simple commands I can use to get that info?
> 

The content of the CEDT would be sufficient - that will show us the
number of CXL host bridges.

> > 5) Why are there 12 nodes but only 10 sources? Are there additional
> >     devices left out of your diagram? Are there 2 CFMWS and 8 Memory
> >     Affinity records - resulting in 10 nodes? This is strange.
> 
> My blind guess is that there could be a logical node that combines the 4
> channels of CXL memory, so there are 5 nodes per socket.  Adding 2 nodes
> for local CPU/DRAM makes 12 nodes in total.
>

The issue is that nodes have associated memory regions.  If there are
multiple nodes with overlapping memory regions, that seems problematic.

If there are "possible nodes" without memory and no real use case
(because the memory is associated with the aggregate node) then those
nodes probably shouldn't be reported as possible.

the tl;dr here is we should figure out what is marking those nodes as
possible.

> Not sure about this part but our approach with hotplug_memory_notifier()
> resolves this problem.  Rakie will submit an initial working patchset
> soonish.

This may just be a bandaid on the issue.  We should get our node
configuration correct from the get-go.

~Gregory
Honggyu Kim March 7, 2025, 11:46 a.m. UTC | #7
On 3/7/2025 2:32 AM, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 09:39:26PM +0900, Honggyu Kim wrote:
>>
>> The memoryless nodes are printed as follows, after those "ACPI: SRAT:
>> Node N PXM M" messages.
>>
>>    [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
>>    [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
>>    [    0.010992] Initmem setup node 2 as memoryless
>>    [    0.011055] Initmem setup node 3 as memoryless
>>    [    0.011115] Initmem setup node 4 as memoryless
>>    [    0.011177] Initmem setup node 5 as memoryless
>>    [    0.011238] Initmem setup node 6 as memoryless
>>    [    0.011299] Initmem setup node 7 as memoryless
>>    [    0.011361] Initmem setup node 8 as memoryless
>>    [    0.011422] Initmem setup node 9 as memoryless
>>    [    0.011484] Initmem setup node 10 as memoryless
>>    [    0.011544] Initmem setup node 11 as memoryless
>>
>> This is related to why 12 node knobs are provided in sysfs with the
>> current N_POSSIBLE loop.
>>
> 
> This isn't actually why; this is another symptom.  This gets printed
> because something is marking nodes 4-11 as possible and setup_nr_node_ids
> reports 12 total nodes:
> 
> void __init setup_nr_node_ids(void)
> {
>          unsigned int highest;
> 
>          highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
>          nr_node_ids = highest + 1;
> }
> 
> Given your configuration data so far, we may have a bug somewhere (or
> I'm missing a configuration piece).

Maybe there is some misunderstanding on this issue.
This isn't a problem of NUMA detection for CXL memory, but just a problem
with the number of "node" knobs for weighted interleave.

'numactl -H' shows the correct set of nodes even without our fix.

   $ numactl -H
   available: 4 nodes (0-3)
   node 0 cpus: 0 1 2 3 ...
   node 0 size: 128504 MB
   node 0 free: 118563 MB
   node 1 cpus: 144 145 146 147 ...
   node 1 size: 257961 MB
   node 1 free: 242628 MB
   node 2 cpus:
   node 2 size: 393216 MB
   node 2 free: 393216 MB
   node 3 cpus:
   node 3 size: 524288 MB
   node 3 free: 524288 MB
   node distances:
   node     0    1    2    3
      0:   10   21   14   24
      1:   21   10   24   14
      2:   14   24   10   26
      3:   24   14   26   10

You can see more info below.

   $ cd /sys/devices/system/node

   $ ls -d node*
   node0  node1  node2  node3

   $ cat possible
   0-11

   $ cat online
   0-3

   $ cat has_memory
   0-3

   $ cat has_normal_memory
   0-1

   $ cat has_cpu
   0-1

>>> Basically I need to know:
>>> 1) Is each CXL device on a dedicated Host Bridge?
>>> 2) Is inter-host-bridge interleaving configured?
>>> 3) Is intra-host-bridge interleaving configured?
>>> 4) Do SRAT entries exist for all nodes?
>>
>> Are there some simple commands I can use to get that info?
>>
> 
> The content of the CEDT would be sufficient - that will show us the
> number of CXL host bridges.

Which command do we need for this info specifically?  My output doesn't
provide anything useful for that.

   $ acpidump -b
   $ iasl -d *
   $ cat cedt.dsl
       ...
   **** Unknown ACPI table signature [CEDT]

> 
>>> 5) Why are there 12 nodes but only 10 sources? Are there additional
>>>      devices left out of your diagram? Are there 2 CFMWS and 8 Memory
>>>      Affinity records - resulting in 10 nodes? This is strange.
>>
>> My blind guess is that there could be a logical node that combines the 4
>> channels of CXL memory, so there are 5 nodes per socket.  Adding 2 nodes
>> for local CPU/DRAM makes 12 nodes in total.
>>
> 
> The issue is that nodes have associated memory regions.  If there are
> multiple nodes with overlapping memory regions, that seems problematic.
> 
> If there are "possible nodes" without memory and no real use case
> (because the memory is associated with the aggregate node) then those
> nodes probably shouldn't be reported as possible.
> 
> the tl;dr here is we should figure out what is marking those nodes as
> possible.
> 
>> Not sure about this part but our approach with hotplug_memory_notifier()
>> resolves this problem.  Rakie will submit an initial working patchset
>> soonish.
> 
> This may just be a bandaid on the issue.  We should get our node
> configuration correct from the get-go.

I'm not sure about that.  This must be fixed ASAP because the current
kernel is broken on this issue, and the fix should go into the hotfix
tree first.

You may think this is just a bandaid, but leaving it bleeding as-is is
not the right approach.

Our fix was posted a few hours ago.  Please have a look, then think
about the approach again.
https://lore.kernel.org/linux-mm/20250307063534.540-1-rakie.kim@sk.com

Thanks,
Honggyu

> 
> ~Gregory
Gregory Price March 7, 2025, 5:51 p.m. UTC | #8
On Fri, Mar 07, 2025 at 08:46:46PM +0900, Honggyu Kim wrote:
> You can see more info below.
> 
>   $ cd /sys/devices/system/node
> 
>   $ ls -d node*
>   node0  node1  node2  node3
> 
>   $ cat possible
>   0-11

We're split across two threads now, but I'll add this context.

I'm basically asking whether there should be 12 nodes possible. It seems
like there should only be 4 nodes possible - 2 for sockets, 2 for host
bridges.

Unless I'm misunderstanding, it should be the case that a given physical
address can only be hosted by 1 numa node (proximity domain).

So it *should* be the case that you either have 4 nodes possible or 10
nodes possible, not 12.  But I could be missing a piece of context.

> Which command do we need for this info specifically?  My output doesn't
> provide some useful info for that.
> 
>   $ acpidump -b
>   $ iasl -d *
>   $ cat cedt.dsl
>       ...
>   **** Unknown ACPI table signature [CEDT]
>

You probably have an old version of acpidump here; you might need to get
a newer version that knows about the CEDT.

You'll also want to get all the Memory Affinity entries from srat.dsl

> I'm not sure about that.  This must be fixed ASAP because the current
> kernel is broken on this issue, and the fix should go into the hotfix
> tree first.
>

I agree something is broken, I'm not convinced what is broken.

> You may think this is just a bandaid, but leaving it bleeding as-is is
> not the right approach.
>

This affects userland; we shouldn't thrash on this. Let's get it right.

~Gregory
Honggyu Kim March 10, 2025, 12:26 p.m. UTC | #9
Hi Gregory,

On 3/8/2025 2:51 AM, Gregory Price wrote:
> On Fri, Mar 07, 2025 at 08:46:46PM +0900, Honggyu Kim wrote:
>> You can see more info below.
>>
>>    $ cd /sys/devices/system/node
>>
>>    $ ls -d node*
>>    node0  node1  node2  node3
>>
>>    $ cat possible
>>    0-11
> 
> We're split across two threads now, but i'll add this context
> 
> I'm basically asking whether there should be 12 nodes possible. It seems
> like there should only be 4 nodes possible - 2 for sockets, 2 for host
> bridges.

Ack. If N_POSSIBLE itself becomes 4, then I agree this problem can
simply be resolved.

> 
> Unless I'm misunderstanding, it should be the case that a given physical
> address can only be hosted by 1 numa node (proximity domain).

Yeah, the proximity domains map to the correct nodes, as follows in dmesg:

  [  0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
  [  0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
  [  0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug

It is printed even before CXL detection.

> 
> So it *should* be the case that you either have 4 nodes possible or 10
> nodes possible, not 12.  But I could be missing a piece of context.
> 
>> Which command do we need for this info specifically?  My output doesn't
>> provide some useful info for that.
>>
>>    $ acpidump -b
>>    $ iasl -d *
>>    $ cat cedt.dsl
>>        ...
>>    **** Unknown ACPI table signature [CEDT]
>>
> 
> You probably have an old version of acpidump here; you might need to get
> a newer version that knows about the CEDT.

I just used the newest ACPICA and was able to dump the CEDT properly.  But its
output is also very long, so it'd be helpful if you could tell us which specific
info you need.

> 
> You'll also want to get all the Memory Affinity entries from srat.dsl
> 
>> I'm not sure about that.  This must be fixed ASAP because the current
>> kernel is broken on this issue, and the fix should go into the hotfix
>> tree first.
>>
> 
> I agree something is broken, I'm not convinced what is broken.

Yeah, we should hopefully fix this breakage before the v6.14 release.

Thanks,
Honggyu

> 
>> You may think this is just a bandaid, but leaving it bleeding as-is is
>> not the right approach.
>>
> 
> This affects userland; we shouldn't thrash on this. Let's get it right.
> 
> ~Gregory
Gregory Price March 10, 2025, 2:22 p.m. UTC | #10
On Mon, Mar 10, 2025 at 09:26:48PM +0900, Honggyu Kim wrote:
> Yeah, the proximity domains map to the correct nodes, as follows in dmesg:
> 
>  [  0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
>  [  0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
>  [  0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
>  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
>  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
> 
> It is printed even before CXL detection.
> 

I wrote some documentation on example configurations last Friday:

https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205

This isn't exhaustive, but as I said to Rakie, I think your
configuration is probably OK - if slightly confusing.

What I'm going to guess is happening is that you have 1 CFMWS per device,
without matching SRAT entries, and then the CFMWS covering these:

>  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
>  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug

have interleave set up.
Yunjeong Mun March 11, 2025, 2:07 a.m. UTC | #11
Hi Gregory,

On Mon, 10 Mar 2025 10:22:11 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 10, 2025 at 09:26:48PM +0900, Honggyu Kim wrote:
> > Yeah, the proximity domains map to the correct nodes, as follows in dmesg:
> > 
> >  [  0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> >  [  0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
> >  [  0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
> >  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
> >  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
> > 
> > It is printed even before CXL detection.
> > 
> 
> I wrote some documentation on example configurations last Friday:
> 
> https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205
> 

I've checked our CFMWS configurations using the information from this link and 
wrote them below.

> This isn't exhaustive, but as I said to Rakie, I think your
> configuration is probably OK - if slightly confusing.
> 

I wonder if the data below is sufficient for the review you requested in
Rakie's email [1]. Please let me know.

> What I'm going to guess is happening is that you have 1 CFMWS per device,
> without matching SRAT entries, and then the CFMWS covering these:
> 
> >  [  0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
> >  [  0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
> 
> have interleave set up.
> 

In my understanding, both the SRAT and CFMWS describe the above devices and interleave setup.

Below are the SRAT entries (with some unnecessary lines removed):

[A128h 41256 001h]               Subtable Type : 01 [Memory Affinity]  
[A12Ah 41258 004h]            Proximity Domain : 00000006  
[A130h 41264 008h]                Base Address : 0000002080000000
[A138h 41272 008h]              Address Length : 0000006000000000  
                                     Enabled : 1  
                               Hot Pluggable : 1  
...
  
[A150h 41296 001h]               Subtable Type : 01 [Memory Affinity]  
[A152h 41298 004h]            Proximity Domain : 00000007  
[A158h 41304 008h]                Base Address : 0000064F80000000
[A160h 41312 008h]              Address Length : 0000008000000000  
                                     Enabled : 1  
                               Hot Pluggable : 1  

and below are the CFMWS configurations (with some unnecessary lines removed):

[0A4h 0164 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0ACh 0172 008h]         Window base address : 0000002080000000 <- Memory region
[0B4h 0180 008h]                 Window size : 0000032780000000
[0BCh 0188 001h]          Interleave Members : 01           <-- 2-way interleave
[0BDh 0189 001h]       Interleave Arithmetic : 01
[0C8h 0200 004h]                First Target : 00000043     <-- host bridge id  
[0CCh 0204 004h]                 Next Target : 00000053     <-- host bridge id 

...

[170h 0368 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
[178h 0376 008h]         Window base address : 0000064F80000000
[180h 0384 008h]                 Window size : 0000033000000000
[188h 0392 001h]          Interleave Members : 01          <-- 2-way interleave
[189h 0393 001h]       Interleave Arithmetic : 01
[194h 0404 004h]                First Target : 00000143    <-- host bridge id
[198h 0408 004h]                 Next Target : 00000153    <-- host bridge id

[1] https://lore.kernel.org/linux-mm/Z87zpg3TLRReikgu@gourry-fedora-PF4VCD3F/

Best regards,
Yunjeong
Gregory Price March 11, 2025, 2:42 a.m. UTC | #12
On Tue, Mar 11, 2025 at 11:07:27AM +0900, Yunjeong Mun wrote:
> Hi Gregory,
> 
> In my understanding, both the SRAT and CFMWS describe the above devices and interleave setup.
> 
> and below are the CFMWS configurations (with some unnecessary lines removed):
> 
> [0A4h 0164 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
> [0ACh 0172 008h]         Window base address : 0000002080000000 <- Memory region
> [0B4h 0180 008h]                 Window size : 0000032780000000
> [0BCh 0188 001h]          Interleave Members : 01           <-- 2-way interleave
> [0BDh 0189 001h]       Interleave Arithmetic : 01
> [0C8h 0200 004h]                First Target : 00000043     <-- host bridge id  
> [0CCh 0204 004h]                 Next Target : 00000053     <-- host bridge id 
> 
> ...
> 
> [170h 0368 001h]               Subtable Type : 01 [CXL Fixed Memory Window Structure]
> [178h 0376 008h]         Window base address : 0000064F80000000
> [180h 0384 008h]                 Window size : 0000033000000000
> [188h 0392 001h]          Interleave Members : 01          <-- 2-way interleave
> [189h 0393 001h]       Interleave Arithmetic : 01
> [194h 0404 004h]                First Target : 00000143    <-- host bridge id
> [198h 0408 004h]                 Next Target : 00000153    <-- host bridge id
>

Are you able to share all CXL Fixed Memory Window Structures in the
CEDT?  I just want to confirm some suspicions here about why we're seeing
12 NUMA nodes.  This explains 2 of them and suggests there are at least 4
host bridges - but not the other 8.

~Gregory

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4cc04ff8f12c..50cbb7c047fa 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
-	for_each_node_state(nid, N_POSSIBLE) {
+	for_each_node_state(nid, N_MEMORY) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {
 			pr_err("failed to add sysfs [node%d]\n", nid);