
[RFC] topology: Represent clusters of CPUs within a die.

Message ID 20201016152702.1513592-1-Jonathan.Cameron@huawei.com
State New

Commit Message

Jonathan Cameron Oct. 16, 2020, 3:27 p.m. UTC
Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Node Structure representing
a higher level of topology.

For example Kunpeng 920 has clusters of 4 CPUs.  These do not share
any cache resources, but the interconnect topology is such that
the cost to transfer ownership of a cacheline between CPUs within
a cluster is lower than between CPUs in different clusters on the same
die.   Hence, it can make sense to deliberately schedule threads
sharing data to a single cluster.

This patch simply exposes this information to userspace libraries
like hwloc by providing cluster_cpus and related sysfs attributes.
PoC of HWLOC support at [2].
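
For illustration, a minimal userspace sketch of how a library might read the
new attribute for CPU 0 (this assumes a cluster_cpus_list variant alongside
cluster_cpus; adjust the name to whatever the final patch exposes):

#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");

	if (!f) {
		perror("cluster_cpus_list");	/* attribute not present */
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("CPUs in the same cluster as CPU0: %s", buf);
	fclose(f);
	return 0;
}

hwloc would of course consume this via its own discovery code; the above is
only meant to show the shape of the interface.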

Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Currently the ID provided is the offset of the Processor
Hierarchy Node Structure within PPTT.  Whilst this is unique,
it is not terribly elegant, so alternative suggestions are welcome.

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology, but that may be unrelated to the cluster
level.

RFC questions:
1) Naming
2) Related to naming, do we want to represent all potential levels,
   or is this enough?  On Kunpeng 920, the next level up from cluster happens
   to be covered by LLC sharing, but in theory more than one
   level of cluster description might be needed by some future system.
3) Do we need DT code in place? I'm not sure any DT based ARM64
   systems would have enough complexity for this to be useful.
4) Other architectures?  Is this useful on x86 for example?

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

---

 Documentation/admin-guide/cputopology.rst | 26 ++++++++--
 arch/arm64/kernel/topology.c              |  2 +
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++
 drivers/base/topology.c                   | 10 ++++
 include/linux/acpi.h                      |  5 ++
 include/linux/arch_topology.h             |  5 ++
 include/linux/topology.h                  |  6 +++
 8 files changed, 124 insertions(+), 4 deletions(-)

-- 
2.19.1

Comments

Greg KH Oct. 17, 2020, 6:44 a.m. UTC | #1
On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:
> Both ACPI and DT provide the ability to describe additional layers of

> topology between that of individual cores and higher level constructs

> such as the level at which the last level cache is shared.

> In ACPI this can be represented in PPTT as a Processor Hierarchy

> Node Structure [1] that is the parent of the CPU cores and in turn

> has a parent Processor Hierarchy Nodes Structure representing

> a higher level of topology.

> 

> For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> any cache resources, but the interconnect topology is such that

> the cost to transfer ownership of a cacheline between CPUs within

> a cluster is lower than between CPUs in different clusters on the same

> die.   Hence, it can make sense to deliberately schedule threads

> sharing data to a single cluster.

> 

> This patch simply exposes this information to userspace libraries

> like hwloc by providing cluster_cpus and related sysfs attributes.

> PoC of HWLOC support at [2].

> 

> Note this patch only handle the ACPI case.

> 

> Special consideration is needed for SMT processors, where it is

> necessary to move 2 levels up the hierarchy from the leaf nodes

> (thus skipping the processor core level).

> 

> Currently the ID provided is the offset of the Processor

> Hierarchy Nodes Structure within PPTT.  Whilst this is unique

> it is not terribly elegant so alternative suggestions welcome.

> 

> Note that arm64 / ACPI does not provide any means of identifying

> a die level in the topology but that may be unrelate to the cluster

> level.

> 

> RFC questions:

> 1) Naming

> 2) Related to naming, do we want to represent all potential levels,

>    or this enough?  On Kunpeng920, the next level up from cluster happens

>    to be covered by llc cache sharing, but in theory more than one

>    level of cluster description might be needed by some future system.

> 3) Do we need DT code in place? I'm not sure any DT based ARM64

>    systems would have enough complexity for this to be useful.

> 4) Other architectures?  Is this useful on x86 for example?

> 

> [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node

>     structure (Type 0)

> [2] https://github.com/hisilicon/hwloc/tree/linux-cluster

> 

> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---

> 

>  Documentation/admin-guide/cputopology.rst | 26 ++++++++--


You are adding new sysfs files here, but not adding Documentation/ABI/
entries as well?  This cputopology document is nice, but no one knows to
look there for sysfs stuff :)

thanks,

greg k-h
Jonathan Cameron Oct. 19, 2020, 8:08 a.m. UTC | #2
On Sat, 17 Oct 2020 08:44:25 +0200
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:

> > Both ACPI and DT provide the ability to describe additional layers of

> > topology between that of individual cores and higher level constructs

> > such as the level at which the last level cache is shared.

> > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > Node Structure [1] that is the parent of the CPU cores and in turn

> > has a parent Processor Hierarchy Nodes Structure representing

> > a higher level of topology.

> > 

> > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > any cache resources, but the interconnect topology is such that

> > the cost to transfer ownership of a cacheline between CPUs within

> > a cluster is lower than between CPUs in different clusters on the same

> > die.   Hence, it can make sense to deliberately schedule threads

> > sharing data to a single cluster.

> > 

> > This patch simply exposes this information to userspace libraries

> > like hwloc by providing cluster_cpus and related sysfs attributes.

> > PoC of HWLOC support at [2].

> > 

> > Note this patch only handle the ACPI case.

> > 

> > Special consideration is needed for SMT processors, where it is

> > necessary to move 2 levels up the hierarchy from the leaf nodes

> > (thus skipping the processor core level).

> > 

> > Currently the ID provided is the offset of the Processor

> > Hierarchy Nodes Structure within PPTT.  Whilst this is unique

> > it is not terribly elegant so alternative suggestions welcome.

> > 

> > Note that arm64 / ACPI does not provide any means of identifying

> > a die level in the topology but that may be unrelate to the cluster

> > level.

> > 

> > RFC questions:

> > 1) Naming

> > 2) Related to naming, do we want to represent all potential levels,

> >    or this enough?  On Kunpeng920, the next level up from cluster happens

> >    to be covered by llc cache sharing, but in theory more than one

> >    level of cluster description might be needed by some future system.

> > 3) Do we need DT code in place? I'm not sure any DT based ARM64

> >    systems would have enough complexity for this to be useful.

> > 4) Other architectures?  Is this useful on x86 for example?

> > 

> > [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node

> >     structure (Type 0)

> > [2] https://github.com/hisilicon/hwloc/tree/linux-cluster

> > 

> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> > ---

> > 

> >  Documentation/admin-guide/cputopology.rst | 26 ++++++++--  

> 

> You are adding new sysfs files here, but not adding Documentation/ABI/

> entries as well?  This cputopology document is nice, but no one knows to

> look there for sysfs stuff :)

Hi Greg,

Ah.  I'd assumed there wasn't a current doc as the patch adding
die description didn't touch it.   Turns out it was just missing from
that patch. (Documentation/ABI/testing/sysfs-devices-system-cpu)
Seems those docs are missing quite a bit of more recent stuff such as
die and more package related parts.  I'll bring it up to date as a
precursor to v2 of this series.

Thanks,

Jonathan

> 

> thanks,

> 

> greg k-h
Brice Goglin Oct. 19, 2020, 10 a.m. UTC | #3
Le 16/10/2020 à 17:27, Jonathan Cameron a écrit :
> Both ACPI and DT provide the ability to describe additional layers of

> topology between that of individual cores and higher level constructs

> such as the level at which the last level cache is shared.

> In ACPI this can be represented in PPTT as a Processor Hierarchy

> Node Structure [1] that is the parent of the CPU cores and in turn

> has a parent Processor Hierarchy Nodes Structure representing

> a higher level of topology.

>

> For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> any cache resources, but the interconnect topology is such that

> the cost to transfer ownership of a cacheline between CPUs within

> a cluster is lower than between CPUs in different clusters on the same

> die.   Hence, it can make sense to deliberately schedule threads

> sharing data to a single cluster.

>

> This patch simply exposes this information to userspace libraries

> like hwloc by providing cluster_cpus and related sysfs attributes.

> PoC of HWLOC support at [2].

>

> Note this patch only handle the ACPI case.

>

> Special consideration is needed for SMT processors, where it is

> necessary to move 2 levels up the hierarchy from the leaf nodes

> (thus skipping the processor core level).

>

> Currently the ID provided is the offset of the Processor

> Hierarchy Nodes Structure within PPTT.  Whilst this is unique

> it is not terribly elegant so alternative suggestions welcome.

>

> Note that arm64 / ACPI does not provide any means of identifying

> a die level in the topology but that may be unrelate to the cluster

> level.

>

> RFC questions:

> 1) Naming

> 2) Related to naming, do we want to represent all potential levels,

>    or this enough?  On Kunpeng920, the next level up from cluster happens

>    to be covered by llc cache sharing, but in theory more than one

>    level of cluster description might be needed by some future system.

> 3) Do we need DT code in place? I'm not sure any DT based ARM64

>    systems would have enough complexity for this to be useful.

> 4) Other architectures?  Is this useful on x86 for example?



Hello Jonathan

Intel has CPUID registers to describe "tiles" and "modules" too (not
used yet as far as I know). The list of levels could become quite long
if any processor ever exposes those. If having multiple cluster levels
is possible, maybe it's time to think about introducing some sort of
generic levels:

cluster0_id = your cluster_id
cluster0_cpus/cpulist = your cluster_cpus/cpulist
cluster0_type = would optionally contain hardware-specific info such as
"module" or "tile" on x86
cluster_levels = 1
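
As a rough sketch of how a consumer might handle a dynamic number of levels
under this proposal (the cluster%d_cpus naming is hypothetical, mirroring the
attributes suggested above):

#include <stdio.h>

int main(void)
{
	char path[128], buf[256];
	int level;

	for (level = 0; ; level++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu0/topology/cluster%d_cpus",
			 level);
		f = fopen(path, "r");
		if (!f)
			break;			/* no more levels exposed */
		if (fgets(buf, sizeof(buf), f))
			printf("level %d cpus: %s", level, buf);
		fclose(f);
	}
	printf("cluster_levels = %d\n", level);
	return 0;
}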

hwloc already does something like this for some "rare" levels such as
s390 book/drawers (by the way, thanks a lot for the hwloc PoC, very good
job), we call them "Groups" instead of "cluster" above.

However I don't know if the Linux scheduler would like that. Is it
better to have 10+ levels with static names, or a dynamic number of levels?

Brice
Sudeep Holla Oct. 19, 2020, 10:01 a.m. UTC | #4
+Morten

On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:
> Both ACPI and DT provide the ability to describe additional layers of

> topology between that of individual cores and higher level constructs

> such as the level at which the last level cache is shared.

> In ACPI this can be represented in PPTT as a Processor Hierarchy

> Node Structure [1] that is the parent of the CPU cores and in turn

> has a parent Processor Hierarchy Nodes Structure representing

> a higher level of topology.

>

> For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> any cache resources, but the interconnect topology is such that

> the cost to transfer ownership of a cacheline between CPUs within

> a cluster is lower than between CPUs in different clusters on the same

> die.   Hence, it can make sense to deliberately schedule threads

> sharing data to a single cluster.

>


This is very SoC specific and hard to generalise, hence LLC is chosen
as one of the main factors to decide.

Are there any scheduler topology related changes to achieve the same?
If so, we need their opinion on that.

> This patch simply exposes this information to userspace libraries

> like hwloc by providing cluster_cpus and related sysfs attributes.

> PoC of HWLOC support at [2].

>


OK, cluster is too Arm specific, with no definition for it whatsoever.
How do you plan to support clusters of clusters or higher levels of
hierarchy present in PPTT?

> Note this patch only handle the ACPI case.

>


If we decide to add this to sysfs, I prefer to keep the DT implementation
also in sync. The bindings are in sync; it is just a matter of implementation.

> Special consideration is needed for SMT processors, where it is

> necessary to move 2 levels up the hierarchy from the leaf nodes

> (thus skipping the processor core level).

>

> Currently the ID provided is the offset of the Processor

> Hierarchy Nodes Structure within PPTT.  Whilst this is unique

> it is not terribly elegant so alternative suggestions welcome.

>


That is a firmware choice. Maybe your firmware just fills in the mandatory
fields and doesn't care about anything optional. We do
check for a valid UID field and, if that is not set, the offset is used as
the unique ID in the kernel implementation. So if you enhance the firmware,
the kernel sysfs will become as elegant as you expect 😉
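
For reference, a simplified sketch of that selection (the real code lives in
drivers/acpi/pptt.c and uses the ACPICA types; the struct and macro below are
cut-down stand-ins, not the actual ACPI definitions):

#include <stdint.h>

/* Bit 1 of the PPTT processor node flags: "ACPI Processor ID valid" */
#define PPTT_ACPI_PROCESSOR_ID_VALID	(1 << 1)

struct pptt_node {
	uint32_t flags;
	uint32_t acpi_processor_id;	/* firmware-provided UID */
	uint32_t table_offset;		/* offset of this node within PPTT */
};

uint32_t pptt_node_tag(const struct pptt_node *node)
{
	/* Prefer the UID when firmware marks it valid... */
	if (node->flags & PPTT_ACPI_PROCESSOR_ID_VALID)
		return node->acpi_processor_id;
	/* ...otherwise fall back to the unique (but inelegant) table offset. */
	return node->table_offset;
}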

> Note that arm64 / ACPI does not provide any means of identifying

> a die level in the topology but that may be unrelate to the cluster

> level.

>


Indeed. It can be a cluster of clusters on some platforms. If we need that
info, we should add it. My assumption was that generally each die forms
a NUMA node and hence the information is available there. I may be wrong.

> RFC questions:

> 1) Naming

> 2) Related to naming, do we want to represent all potential levels,

>    or this enough?  On Kunpeng920, the next level up from cluster happens

>    to be covered by llc cache sharing, but in theory more than one

>    level of cluster description might be needed by some future system.


That is my question above. I can't recall the terminology used in ACPI
PPTT, but IIRC "cluster" is not used, to keep it generic. Maybe we need
to do the same here, as the term "cluster" is ill-defined on Arm and I
would avoid using it if possible.

> 3) Do we need DT code in place? I'm not sure any DT based ARM64

>    systems would have enough complexity for this to be useful.


I prefer to keep the implementation in sync

> 4) Other architectures?  Is this useful on x86 for example?

>


AMD had multi-die within a socket IIRC. IIUC, CPUID will provide the info
and nothing more is needed from ACPI? I may be wrong, just guessing/asking.

--
Regards,
Sudeep
Peter Zijlstra Oct. 19, 2020, 10:35 a.m. UTC | #5
On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:
> Both ACPI and DT provide the ability to describe additional layers of

> topology between that of individual cores and higher level constructs

> such as the level at which the last level cache is shared.

> In ACPI this can be represented in PPTT as a Processor Hierarchy

> Node Structure [1] that is the parent of the CPU cores and in turn

> has a parent Processor Hierarchy Nodes Structure representing

> a higher level of topology.

> 

> For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> any cache resources, but the interconnect topology is such that

> the cost to transfer ownership of a cacheline between CPUs within

> a cluster is lower than between CPUs in different clusters on the same

> die.   Hence, it can make sense to deliberately schedule threads

> sharing data to a single cluster.

> 

> This patch simply exposes this information to userspace libraries

> like hwloc by providing cluster_cpus and related sysfs attributes.

> PoC of HWLOC support at [2].

> 

> Note this patch only handle the ACPI case.

> 

> Special consideration is needed for SMT processors, where it is

> necessary to move 2 levels up the hierarchy from the leaf nodes

> (thus skipping the processor core level).


I'm confused by all of this. The core level is exactly what you seem to
want.
Jonathan Cameron Oct. 19, 2020, 12:32 p.m. UTC | #6
On Mon, 19 Oct 2020 12:35:22 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:

> > Both ACPI and DT provide the ability to describe additional layers of

> > topology between that of individual cores and higher level constructs

> > such as the level at which the last level cache is shared.

> > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > Node Structure [1] that is the parent of the CPU cores and in turn

> > has a parent Processor Hierarchy Nodes Structure representing

> > a higher level of topology.

> > 

> > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > any cache resources, but the interconnect topology is such that

> > the cost to transfer ownership of a cacheline between CPUs within

> > a cluster is lower than between CPUs in different clusters on the same

> > die.   Hence, it can make sense to deliberately schedule threads

> > sharing data to a single cluster.

> > 

> > This patch simply exposes this information to userspace libraries

> > like hwloc by providing cluster_cpus and related sysfs attributes.

> > PoC of HWLOC support at [2].

> > 

> > Note this patch only handle the ACPI case.

> > 

> > Special consideration is needed for SMT processors, where it is

> > necessary to move 2 levels up the hierarchy from the leaf nodes

> > (thus skipping the processor core level).  


Hi Peter,

> 

> I'm confused by all of this. The core level is exactly what you seem to

> want.


It's the level above the core, whether in a multi-threaded core
or a single-threaded core.   This may correspond to the level
at which caches are shared (typically L3).  Cores are already well
represented via thread_siblings and similar.  Extra confusion is that
the current core_siblings (deprecated) sysfs interface actually reflects
the package level and ignores anything in between core and
package (such as die on x86).

So in a typical system with a hierarchical interconnect you would have

thread
core
cluster (possibly multiple layers as mentioned in Brice's reply).
die
package

Unfortunately as pointed out in other branches of this thread, there is
no consistent generic name.  I'm open to suggestions!

Both ACPI PPTT and DT provide generic structures to represent layers of
topology.   They don't name them as such, but in ACPI there are flags to
indicate package, core and thread.

For example, in zen2 this would correspond to a 'core complex' consisting of
4 CPU cores (each with 2 threads) sharing some local L3 cache.
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
In zen3 it looks like this level will be the same as that for the die.

Given they used the name in Knights Landing (and, as pointed out in
another branch of this thread, it's in the CPUID description) I think Intel
calls these 'tiles' (can anyone confirm that?)

A similar concept exists for some ARM processors. 
https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110
CCLs in the diagram on that page.

Centriq 2400 had 2 core 'duplexes' which shared l2.
https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3

From the info released at Hot Chips, it looks like the ThunderX3 deploys
a similar ring interconnect with groups of cores, each with 4 threads.
Not sure what they plan to call them yet though, or whether they will choose
to represent that layer of the topology in their firmware tables.

Arm's CMN-600 interconnect also supports such 'clusters', though I have no idea
if anyone has used it in this form yet.  In that case, they are called
"processor compute clusters".
https://developer.arm.com/documentation/100180/0103/

Xuantie-910 is cluster-based as well (shares L2).

So in many cases the cluster level corresponds to something we already have
visibility of due to cache sharing etc., but that isn't true in Kunpeng 920.

Thanks,

Jonathan
Jonathan Cameron Oct. 19, 2020, 12:38 p.m. UTC | #7
On Mon, 19 Oct 2020 12:00:15 +0200
Brice Goglin <Brice.Goglin@inria.fr> wrote:

> Le 16/10/2020 à 17:27, Jonathan Cameron a écrit :

> > Both ACPI and DT provide the ability to describe additional layers of

> > topology between that of individual cores and higher level constructs

> > such as the level at which the last level cache is shared.

> > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > Node Structure [1] that is the parent of the CPU cores and in turn

> > has a parent Processor Hierarchy Nodes Structure representing

> > a higher level of topology.

> >

> > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > any cache resources, but the interconnect topology is such that

> > the cost to transfer ownership of a cacheline between CPUs within

> > a cluster is lower than between CPUs in different clusters on the same

> > die.   Hence, it can make sense to deliberately schedule threads

> > sharing data to a single cluster.

> >

> > This patch simply exposes this information to userspace libraries

> > like hwloc by providing cluster_cpus and related sysfs attributes.

> > PoC of HWLOC support at [2].

> >

> > Note this patch only handle the ACPI case.

> >

> > Special consideration is needed for SMT processors, where it is

> > necessary to move 2 levels up the hierarchy from the leaf nodes

> > (thus skipping the processor core level).

> >

> > Currently the ID provided is the offset of the Processor

> > Hierarchy Nodes Structure within PPTT.  Whilst this is unique

> > it is not terribly elegant so alternative suggestions welcome.

> >

> > Note that arm64 / ACPI does not provide any means of identifying

> > a die level in the topology but that may be unrelate to the cluster

> > level.

> >

> > RFC questions:

> > 1) Naming

> > 2) Related to naming, do we want to represent all potential levels,

> >    or this enough?  On Kunpeng920, the next level up from cluster happens

> >    to be covered by llc cache sharing, but in theory more than one

> >    level of cluster description might be needed by some future system.

> > 3) Do we need DT code in place? I'm not sure any DT based ARM64

> >    systems would have enough complexity for this to be useful.

> > 4) Other architectures?  Is this useful on x86 for example?  

> 

> 

> Hello Jonathan


Hi Brice,

> 

> Intel has CPUID registers to describe "tiles" and "modules" too (not

> used yet as far as I know). The list of levels could become quite long

> if any processor ever exposes those. If having multiple cluster levels

> is possible, maybe it's time to think about introducing some sort of

> generic levels:


I've been wondering what tiles and modules are... Looking back at
naming over time, I'm going to guess tiles are the closest to the
particular case I was focusing on here.

> 

> cluster0_id = your cluster_id

> cluster0_cpus/cpulist = your cluster_cpus/cpulis

> cluster0_type = would optionally contain hardware-specific info such as

> "module" or "tile" on x86

> cluster_levels = 1


I wondered exactly the same thing.  At this point, perhaps we just
statically introduce a 0 index, but with the assumption that we would extend
it as / when necessary in future.

> 

> hwloc already does something like this for some "rare" levels such as

> s390 book/drawers (by the way, thanks a lot for the hwloc PoC, very good

> job), we call them "Groups" instead of "cluster" above.


Given we definitely have a 'naming' issue here, perhaps group0 etc is a good
generic choice? 

> 

> However I don't know if the Linux scheduler would like that. Is it

> better to have 10+ levels with static names, or a dynamic number of levels?


So far our 'general' experiments with adding clusters into the kernel
scheduler have given mixed results.  Hence the proposal
to just expose the structure to userspace, where we should at least know
what the workload is.   Hopefully we'll gain more experience with using
it and use that to drive possible kernel scheduler changes.

> 

> Brice


Thanks,

Jonathan
Peter Zijlstra Oct. 19, 2020, 12:50 p.m. UTC | #8
On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 12:35:22 +0200

> Peter Zijlstra <peterz@infradead.org> wrote:


> > I'm confused by all of this. The core level is exactly what you seem to

> > want.

> 

> It's the level above the core, whether in an multi-threaded core

> or a single threaded core.   This may correspond to the level

> at which caches are shared (typically L3).  Cores are already well

> represented via thread_siblings and similar.  Extra confusion is that

> the current core_siblings (deprecated) sysfs interface, actually reflects

> the package level and ignores anything in between core and

> package (such as die on x86)


That seems wrong. core-mask should be whatever cores share L3. So on an
Intel Core2-Quad (just to pick an example) you should have 4 CPUs in a
package, but only 2 CPUs for the core-mask.

It just so happens that L3 and package were the same for a long while in
x86 land, although recent chips started breaking that trend.

And I know nothing about the core-mask being deprecated; it's what the
scheduler uses. It's not going anywhere.

So if your 'cluster' is a group of single cores (possibly with SMT) that
do not share cache but have a faster cache connection and you want them
to behave as-if they were a multi-core group that did share cache, then
core-mask it is.
Morten Rasmussen Oct. 19, 2020, 1:10 p.m. UTC | #9
Hi Jonathan,

On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 12:35:22 +0200

> Peter Zijlstra <peterz@infradead.org> wrote:

> 

> > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:

> > > Both ACPI and DT provide the ability to describe additional layers of

> > > topology between that of individual cores and higher level constructs

> > > such as the level at which the last level cache is shared.

> > > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > > Node Structure [1] that is the parent of the CPU cores and in turn

> > > has a parent Processor Hierarchy Nodes Structure representing

> > > a higher level of topology.

> > > 

> > > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > > any cache resources, but the interconnect topology is such that

> > > the cost to transfer ownership of a cacheline between CPUs within

> > > a cluster is lower than between CPUs in different clusters on the same

> > > die.   Hence, it can make sense to deliberately schedule threads

> > > sharing data to a single cluster.

> > > 

> > > This patch simply exposes this information to userspace libraries

> > > like hwloc by providing cluster_cpus and related sysfs attributes.

> > > PoC of HWLOC support at [2].

> > > 

> > > Note this patch only handle the ACPI case.

> > > 

> > > Special consideration is needed for SMT processors, where it is

> > > necessary to move 2 levels up the hierarchy from the leaf nodes

> > > (thus skipping the processor core level).  

> 

> Hi Peter,

> 

> > 

> > I'm confused by all of this. The core level is exactly what you seem to

> > want.

> 

> It's the level above the core, whether in an multi-threaded core

> or a single threaded core.   This may correspond to the level

> at which caches are shared (typically L3).  Cores are already well

> represented via thread_siblings and similar.  Extra confusion is that

> the current core_siblings (deprecated) sysfs interface, actually reflects

> the package level and ignores anything in between core and

> package (such as die on x86)

> 

> So in a typical system with a hierarchical interconnect you would have

> 

> thread

> core

> cluster (possibly multiple layers as mentioned in Brice's reply).

> die

> package

> 

> Unfortunately as pointed out in other branches of this thread, there is

> no consistent generic name.  I'm open to suggestions!


IIUC, you are actually proposing another "die" level? I'm not sure if we
can actually come up with a generic name since interconnects are highly
implementation dependent.

How is you memory distributed? Do you already have NUMA nodes? If you
want to keep tasks together, it might make sense to define the clusters
(in your case) as NUMA nodes.

> Both ACPI PPTT and DT provide generic structures to represent layers of

> topology.   They don't name as such, but in ACPI there are flags to indicate

> package, core, thread.


I think that is because those are the only ones that are fairly generic
:-) They are also the only ones the scheduler cares about (plus NUMA).

> 

> For example, in zen2 this would correspond to a 'core complex' consisting

> 4 CPU cores (each one 2 threads) sharing some local L3 cache.

> https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

> In zen3 it looks like this level will be the same as that for the die.

> 

> Given they used the name in knights landing (and as is pointed out in

> another branch of this thread, it's the CPUID description) I think Intel

> calls these 'tiles' (anyone confirm that?) 

> 

> A similar concept exists for some ARM processors. 

> https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110

> CCLs in the diagram on that page.

> 

> Centriq 2400 had 2 core 'duplexes' which shared l2.

> https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3

> 

> From the info release at hotchips, it looks like the thunderx3 deploys

> a similar ring interconnect with groups of cores, each with 4 threads.

> Not sure what they plan to call them yet though or whether they will chose

> to represent that layer of the topology in their firmware tables.

> 

> Arms CMN600 interconnect also support such 'clusters' though I have no idea

> if anyone has used it in this form yet.  In that case, they are called

> "processor compute clusters"

> https://developer.arm.com/documentation/100180/0103/

> 

> Xuantie-910 is cluster based as well (shares l2).

> 

> So in many cases the cluster level corresponds to something we already have

> visibility of due to cache sharing etc, but that isn't true in kunpeng 920.


The problem I see is that the benefit of keeping tasks together due to
the interconnect layout might vary significantly between systems. So if
we introduce a new cpumask for cluster, it has to represent roughly
the same system properties; otherwise generic software consuming this
information could be tricked.

If there is a provable benefit of having interconnect grouping
information, I think it would be better represented by a distance matrix
like we have for NUMA.

Morten
Brice Goglin Oct. 19, 2020, 1:12 p.m. UTC | #10
Le 19/10/2020 à 14:50, Peter Zijlstra a écrit :
> On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:

>> On Mon, 19 Oct 2020 12:35:22 +0200

>> Peter Zijlstra <peterz@infradead.org> wrote:

>>> I'm confused by all of this. The core level is exactly what you seem to

>>> want.

>> It's the level above the core, whether in an multi-threaded core

>> or a single threaded core.   This may correspond to the level

>> at which caches are shared (typically L3).  Cores are already well

>> represented via thread_siblings and similar.  Extra confusion is that

>> the current core_siblings (deprecated) sysfs interface, actually reflects

>> the package level and ignores anything in between core and

>> package (such as die on x86)

> That seems wrong. core-mask should be whatever cores share L3. So on a

> Intel Core2-Quad (just to pick an example) you should have 4 CPU in a

> package, but only 2 CPUs for the core-mask.

>

> It just so happens that L3 and package were the same for a long while in

> x86 land, although recent chips started breaking that trend.

>

> And I know nothing about the core-mask being depricated; it's what the

> scheduler uses. It's not going anywhere.



Only the sysfs filenames are deprecated:

thread_siblings -> core_cpus

core_siblings -> package_cpus

The new names better reflect what has been implemented/documented in the
past (core_siblings = package_cpus are processors with the same physical
package id). And that's indeed different from the core-mask you are
talking about above with Core2-Quad (that one has never been exposed
anywhere in sysfs, except in the L3 cpumap).

Brice
Morten Rasmussen Oct. 19, 2020, 1:13 p.m. UTC | #11
On Mon, Oct 19, 2020 at 02:50:53PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:

> > On Mon, 19 Oct 2020 12:35:22 +0200

> > Peter Zijlstra <peterz@infradead.org> wrote:

> 

> > > I'm confused by all of this. The core level is exactly what you seem to

> > > want.

> > 

> > It's the level above the core, whether in an multi-threaded core

> > or a single threaded core.   This may correspond to the level

> > at which caches are shared (typically L3).  Cores are already well

> > represented via thread_siblings and similar.  Extra confusion is that

> > the current core_siblings (deprecated) sysfs interface, actually reflects

> > the package level and ignores anything in between core and

> > package (such as die on x86)

> 

> That seems wrong. core-mask should be whatever cores share L3. So on a

> Intel Core2-Quad (just to pick an example) you should have 4 CPU in a

> package, but only 2 CPUs for the core-mask.

> 

> It just so happens that L3 and package were the same for a long while in

> x86 land, although recent chips started breaking that trend.

> 

> And I know nothing about the core-mask being depricated; it's what the

> scheduler uses. It's not going anywhere.


Don't get confused over the user-space topology and the scheduler
topology; they are _not_ the same despite having similar names for some
things :-)

> So if your 'cluster' is a group of single cores (possibly with SMT) that

> do not share cache but have a faster cache connection and you want them

> to behave as-if they were a multi-core group that did share cache, then

> core-mask it is.


In the scheduler, yes. There is no core-mask exposed to user-space.

We have to be clear about whether we discuss scheduler or user-space
topology :-)
Jonathan Cameron Oct. 19, 2020, 1:14 p.m. UTC | #12
On Mon, 19 Oct 2020 11:01:00 +0100
Sudeep Holla <sudeep.holla@arm.com> wrote:

> +Morten


Hi Sudeep,
> 

> On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:

> > Both ACPI and DT provide the ability to describe additional layers of

> > topology between that of individual cores and higher level constructs

> > such as the level at which the last level cache is shared.

> > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > Node Structure [1] that is the parent of the CPU cores and in turn

> > has a parent Processor Hierarchy Nodes Structure representing

> > a higher level of topology.

> >

> > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > any cache resources, but the interconnect topology is such that

> > the cost to transfer ownership of a cacheline between CPUs within

> > a cluster is lower than between CPUs in different clusters on the same

> > die.   Hence, it can make sense to deliberately schedule threads

> > sharing data to a single cluster.

> >  

> 

> This is very SoC specific and hard to generalise, hence LLC is chosen

> as one of the main factor to decide.


I absolutely agree. It is very hard to modify the kernel scheduler to
take into account this information.  Chances are we'll help some workloads
and hurt others.  However, it is good to at least expose that it exists.

> 

> Are there any scheduler topology related changes to achieve the same ?

> If so, we need their opinion on that.


So far no.  We have done a few experiments with adding this knowledge
into the scheduler but it is still very much a work in progress.

However, we do have micro-benchmarks showing clearly that it can make a
substantial difference.  Our intent was to first expose this information to
user space and get experience of using it in applications via HWLOC
before we tried to propose any scheduler changes.

> 

> > This patch simply exposes this information to userspace libraries

> > like hwloc by providing cluster_cpus and related sysfs attributes.

> > PoC of HWLOC support at [2].

> >  

> 

> OK, cluster is too Arm specific with no definition for it whatsoever.


It does get used by others (ACPI for example), but I agree there may be
a naming issue here.  Brice observed that hwloc uses the term group to cover
the somewhat obscure drawer and book levels that powerpc uses.
Perhaps that is a better name?

> How do you plan to support clusters of clusters or higher level of

> hierarchy present in PPTT.


Right now I don't.   I was hoping that question would come up,
but didn't want to complicate the first version of this proposal.
Guess I got that wrong ;)

As Brice suggested, probably adding an index makes sense.
I'm not sure if we should bother implementing more than one layer for now, but
we definitely want a naming scheme that allows for it. If anyone has
a CPU that already uses another layer let us know!

On another note, we are thinking of an ACPI PPTT addition to add a flag
for die; if anyone has other suggestions on how to match the x86
CPUID part for that, it would be great to align.  At least die is
fairly well defined...  Tile and Module not so much...

> 

> > Note this patch only handle the ACPI case.

> >  

> 

> If we decide to add this to sysfs, I prefer to keep DT implementation

> also in sync. The bindings are in sync, just matter of implementation.


OK, happy to look at DT if we seem to be moving forwards with this
in general.

> 

> > Special consideration is needed for SMT processors, where it is

> > necessary to move 2 levels up the hierarchy from the leaf nodes

> > (thus skipping the processor core level).

> >

> > Currently the ID provided is the offset of the Processor

> > Hierarchy Nodes Structure within PPTT.  Whilst this is unique

> > it is not terribly elegant so alternative suggestions welcome.

> >  

> 

> That is firmware choice. May be your firmware just fills in mandatory

> fields and doesn't care about anything and everything optional. We do

> check for valid UID fields and if that is not set, offset is used as

> unique ID in kernel implementation. So if you enhance the firmware, the

> kernel sysfs will become elegant as you expect 😉 


Fair enough that there is a way to improve it, but it would be nice if
the default solution was also elegant!  I'll add the processor container
DSDT stuff to one or two of my test cases and make sure we can get
that working.  There are a lot of combinations that need testing and
I get bored quickly writing firmware tables by hand.

> 

> > Note that arm64 / ACPI does not provide any means of identifying

> > a die level in the topology but that may be unrelate to the cluster

> > level.

> >  

> 

> Indeed. It can be cluster of clusters on some platform. If we need that

> info, we should add it. My assumption was that generally each die forms

> a NUMA node and hence the information is available there. I may be wrong.


There are architectures that have exposed multiple NUMA nodes on one die.
Given the steady increase in the number of cores, I'd be surprised if it didn't happen
again.

An example would be the Haswell cluster-on-die. This page suggests they
were exposed as NUMA nodes via SRAT proximity domains.

https://www.dell.com/support/article/en-uk/sln315049/intel-cluster-on-die-cod-technology-on-vmware-esxi?lang=en

So that isn't sufficient information.  If we want to expose it, it needs
to be explicit.

> 

> > RFC questions:

> > 1) Naming

> > 2) Related to naming, do we want to represent all potential levels,

> >    or this enough?  On Kunpeng920, the next level up from cluster happens

> >    to be covered by llc cache sharing, but in theory more than one

> >    level of cluster description might be needed by some future system.  

> 

> That is my question above. Can't recall the terminology used in ACPI

> PPTT, but IIRC "cluster" is not used to keep it generic. May be we need

> to do the same here as the term "cluster" is ill-defined on Arm and I

> would avoid using it if possible.


In example text, ACPI uses the term cluster. Of course that may
reflect that the author was an ARM person (I have no idea);
e.g. the Cache Type Cluster Example, Fig 5.13.  Various other places in
the description of power management use cluster to mean similar things.

> 

> > 3) Do we need DT code in place? I'm not sure any DT based ARM64

> >    systems would have enough complexity for this to be useful.  

> 

> I prefer to keep the implementation in sync


Fair enough.  Once we are sure this is going somewhere I'll get that
implemented (mostly writing test cases *sigh*)

> 

> > 4) Other architectures?  Is this useful on x86 for example?

> >  

> 

> AMD had multi-die within socket IIRC. IIUC, cpuid will provide the info

> and nothing more is needed from ACPI ? I may be wrong, just guessing/asking.


I'm not sure if any x86 platforms use PPTT.   However, the CPUID value,
if provided, would still need to be hooked up to the sysfs files
etc.  As mentioned in another branch of this thread, my guess is the
equivalent is 'tile'.

Thanks,

Jonathan

> 

> --

> Regards,

> Sudeep
Jonathan Cameron Oct. 19, 2020, 1:41 p.m. UTC | #13
On Mon, 19 Oct 2020 15:10:52 +0200
Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> Hi Jonathan,

> 

> On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:

> > On Mon, 19 Oct 2020 12:35:22 +0200

> > Peter Zijlstra <peterz@infradead.org> wrote:

> >   

> > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:  

> > > > Both ACPI and DT provide the ability to describe additional layers of

> > > > topology between that of individual cores and higher level constructs

> > > > such as the level at which the last level cache is shared.

> > > > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > > > Node Structure [1] that is the parent of the CPU cores and in turn

> > > > has a parent Processor Hierarchy Nodes Structure representing

> > > > a higher level of topology.

> > > > 

> > > > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > > > any cache resources, but the interconnect topology is such that

> > > > the cost to transfer ownership of a cacheline between CPUs within

> > > > a cluster is lower than between CPUs in different clusters on the same

> > > > die.   Hence, it can make sense to deliberately schedule threads

> > > > sharing data to a single cluster.

> > > > 

> > > > This patch simply exposes this information to userspace libraries

> > > > like hwloc by providing cluster_cpus and related sysfs attributes.

> > > > PoC of HWLOC support at [2].

> > > > 

> > > > Note this patch only handle the ACPI case.

> > > > 

> > > > Special consideration is needed for SMT processors, where it is

> > > > necessary to move 2 levels up the hierarchy from the leaf nodes

> > > > (thus skipping the processor core level).    

> > 

> > Hi Peter,

> >   

> > > 

> > > I'm confused by all of this. The core level is exactly what you seem to

> > > want.  

> > 

> > It's the level above the core, whether in an multi-threaded core

> > or a single threaded core.   This may correspond to the level

> > at which caches are shared (typically L3).  Cores are already well

> > represented via thread_siblings and similar.  Extra confusion is that

> > the current core_siblings (deprecated) sysfs interface, actually reflects

> > the package level and ignores anything in between core and

> > package (such as die on x86)

> > 

> > So in a typical system with a hierarchical interconnect you would have

> > 

> > thread

> > core

> > cluster (possibly multiple layers as mentioned in Brice's reply).

> > die

> > package

> > 

> > Unfortunately as pointed out in other branches of this thread, there is

> > no consistent generic name.  I'm open to suggestions!  

> 

> IIUC, you are actually proposing another "die" level? I'm not sure if we

> can actually come up with a generic name since interconnects are highly

> implementation dependent.


Brice mentioned hwloc is using 'group'.  That seems generic enough perhaps.

> 

> How is you memory distributed? Do you already have NUMA nodes? If you

> want to keep tasks together, it might make sense to define the clusters

> (in your case) as NUMA nodes.


We already have all of the standard levels.  We need at least one more.
On a near-future platform we'll have the full set (Kunpeng 920 is single-threaded).

So on Kunpeng 920 we have
cores
(clusters)
die / LLC shared at this level
package (multiple NUMA nodes in each package)
system, multiple packages.


> 

> > Both ACPI PPTT and DT provide generic structures to represent layers of

> > topology.   They don't name as such, but in ACPI there are flags to indicate

> > package, core, thread.  

> 

> I think that is because those are the only ones that a fairly generic

> :-) It is also the only ones that scheduler cares about (plus NUMA).


Agreed, I'm not proposing we add these to the kernel scheduler (at least
for now).  Another layer just means another layer of complexity.

> 

> > 

> > For example, in zen2 this would correspond to a 'core complex' consisting

> > 4 CPU cores (each one 2 threads) sharing some local L3 cache.

> > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

> > In zen3 it looks like this level will be the same as that for the die.

> > 

> > Given they used the name in knights landing (and as is pointed out in

> > another branch of this thread, it's the CPUID description) I think Intel

> > calls these 'tiles' (anyone confirm that?) 

> > 

> > A similar concept exists for some ARM processors. 

> > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110

> > CCLs in the diagram on that page.

> > 

> > Centriq 2400 had 2 core 'duplexes' which shared l2.

> > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3

> > 

> > From the info release at hotchips, it looks like the thunderx3 deploys

> > a similar ring interconnect with groups of cores, each with 4 threads.

> > Not sure what they plan to call them yet though or whether they will chose

> > to represent that layer of the topology in their firmware tables.

> > 

> > Arms CMN600 interconnect also support such 'clusters' though I have no idea

> > if anyone has used it in this form yet.  In that case, they are called

> > "processor compute clusters"

> > https://developer.arm.com/documentation/100180/0103/

> > 

> > Xuantie-910 is cluster based as well (shares l2).

> > 

> > So in many cases the cluster level corresponds to something we already have

> > visibility of due to cache sharing etc, but that isn't true in kunpeng 920.  

> 

> The problem I see is that the benefit of keeping tasks together due to

> the interconnect layout might vary significantly between systems. So if

> we introduce a new cpumask for cluster it has to have represent roughly

> the same system properties otherwise generic software consuming this

> information could be tricked.


Agreed.  Any software would currently have to do its own benchmarking
to work out how to use the presented information.  It would imply that
you 'want to look at this group of CPUs' rather than providing any hard
rules.  The same is true of die, which we already have.  What
that means will vary enormously between different designs in a fashion
that may or may not be independent of NUMA topology. Note,
there are people who do extensive benchmarking of NUMA topology as
the information provided is either inaccurate / missing, or not of
sufficient detail to do their scheduling.  It's not a big load to
do that sort of stuff on startup of software on an HPC system.

> 

> If there is a provable benefit of having interconnect grouping

> information, I think it would be better represented by a distance matrix

> like we have for NUMA.


There have been some discussions in various forums about how to
describe the complexity of interconnects well enough to actually be
useful.  Those have mostly floundered on the immense complexity of
designing such a description in a fashion any normal software would actually
use.  +cc Jerome who raised some of this in the kernel a while back.

Adding this cluster / group layer seems moderately safe as it just says
'something here you should consider', rather than making particular
statements on expected performance etc.

So I fully agree it would be good to have that info, if we can figure
out how to do it!  However, that is never going to be a short term thing.

Thanks,

Jonathan

> 

> Morten
Valentin Schneider Oct. 19, 2020, 1:48 p.m. UTC | #14
+Cc Jeremy

On 19/10/20 14:10, Morten Rasmussen wrote:
> Hi Jonathan,

> The problem I see is that the benefit of keeping tasks together due to

> the interconnect layout might vary significantly between systems. So if

> we introduce a new cpumask for cluster it has to have represent roughly

> the same system properties otherwise generic software consuming this

> information could be tricked.

>

> If there is a provable benefit of having interconnect grouping

> information, I think it would be better represented by a distance matrix

> like we have for NUMA.

>

> Morten


That's my cue to paste some of that stuff I've been rambling on and off
about!

With regards to cache / interconnect layout, I do believe that if we
want to support it in the scheduler itself then we should leverage some
distance table rather than create X extra scheduler topology levels.

I had a chat with Jeremy on the ACPI side of that some time ago. IIRC, given
that SLIT gives us a distance value between any two PXMs, we could directly
express core-to-core distance in that table. With that (and if that still
lets us properly discover NUMA node spans), we could let the scheduler
build dynamic NUMA-like topology levels representing the inner quirks of
the cache / interconnect layout.

It's mostly pipe dreams for now, but there seems to be more and more
hardware where that would make sense; somewhat recently the PowerPC guys
added something to their arch-specific code in that regard.
Morten Rasmussen Oct. 19, 2020, 2:16 p.m. UTC | #15
On Mon, Oct 19, 2020 at 02:41:57PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 15:10:52 +0200

> Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> 

> > Hi Jonathan,

> > 

> > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:

> > > On Mon, 19 Oct 2020 12:35:22 +0200

> > > Peter Zijlstra <peterz@infradead.org> wrote:

> > >   

> > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:  

> > > > > Both ACPI and DT provide the ability to describe additional layers of

> > > > > topology between that of individual cores and higher level constructs

> > > > > such as the level at which the last level cache is shared.

> > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > > > > Node Structure [1] that is the parent of the CPU cores and in turn

> > > > > has a parent Processor Hierarchy Nodes Structure representing

> > > > > a higher level of topology.

> > > > > 

> > > > > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > > > > any cache resources, but the interconnect topology is such that

> > > > > the cost to transfer ownership of a cacheline between CPUs within

> > > > > a cluster is lower than between CPUs in different clusters on the same

> > > > > die.   Hence, it can make sense to deliberately schedule threads

> > > > > sharing data to a single cluster.

> > > > > 

> > > > > This patch simply exposes this information to userspace libraries

> > > > > like hwloc by providing cluster_cpus and related sysfs attributes.

> > > > > PoC of HWLOC support at [2].

> > > > > 

> > > > > Note this patch only handle the ACPI case.

> > > > > 

> > > > > Special consideration is needed for SMT processors, where it is

> > > > > necessary to move 2 levels up the hierarchy from the leaf nodes

> > > > > (thus skipping the processor core level).    

> > > 

> > > Hi Peter,

> > >   

> > > > 

> > > > I'm confused by all of this. The core level is exactly what you seem to

> > > > want.  

> > > 

> > > It's the level above the core, whether in an multi-threaded core

> > > or a single threaded core.   This may correspond to the level

> > > at which caches are shared (typically L3).  Cores are already well

> > > represented via thread_siblings and similar.  Extra confusion is that

> > > the current core_siblings (deprecated) sysfs interface, actually reflects

> > > the package level and ignores anything in between core and

> > > package (such as die on x86)

> > > 

> > > So in a typical system with a hierarchical interconnect you would have

> > > 

> > > thread

> > > core

> > > cluster (possibly multiple layers as mentioned in Brice's reply).

> > > die

> > > package

> > > 

> > > Unfortunately as pointed out in other branches of this thread, there is

> > > no consistent generic name.  I'm open to suggestions!  

> > 

> > IIUC, you are actually proposing another "die" level? I'm not sure if we

> > can actually come up with a generic name since interconnects are highly

> > implementation dependent.

> 

> Brice mentioned hwloc is using 'group'.  That seems generic enough perhaps.

> 

> > 

> > How is you memory distributed? Do you already have NUMA nodes? If you

> > want to keep tasks together, it might make sense to define the clusters

> > (in your case) as NUMA nodes.

> 

> We already have all of the standard levels.  We need at least one more.

> On a near future platform we'll have full set (kunpeng920 is single thread)

> 

> So on kunpeng 920 we have

> cores

> (clusters)

> die / llc shared at this level


IIRC, LLC sharing isn't tied to a specific level in the user-space
topology description. On some Arm systems LLC is per cluster while the
package has a single die with two clusters.

I'm slightly confused about the cache sharing. You said above that your
clusters don't share cache resources? This list says LLC is at die
level, which is above cluster level?

> package (multiple NUMA nodes in each package) 


What is your NUMA node span? Couldn't you just make it equivalent to
your clusters?

> > > For example, in zen2 this would correspond to a 'core complex' consisting

> > > 4 CPU cores (each one 2 threads) sharing some local L3 cache.

> > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

> > > In zen3 it looks like this level will be the same as that for the die.

> > > 

> > > Given they used the name in knights landing (and as is pointed out in

> > > another branch of this thread, it's the CPUID description) I think Intel

> > > calls these 'tiles' (anyone confirm that?) 

> > > 

> > > A similar concept exists for some ARM processors. 

> > > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110

> > > CCLs in the diagram on that page.

> > > 

> > > Centriq 2400 had 2 core 'duplexes' which shared l2.

> > > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3

> > > 

> > > From the info release at hotchips, it looks like the thunderx3 deploys

> > > a similar ring interconnect with groups of cores, each with 4 threads.

> > > Not sure what they plan to call them yet though or whether they will chose

> > > to represent that layer of the topology in their firmware tables.

> > > 

> > > Arms CMN600 interconnect also support such 'clusters' though I have no idea

> > > if anyone has used it in this form yet.  In that case, they are called

> > > "processor compute clusters"

> > > https://developer.arm.com/documentation/100180/0103/

> > > 

> > > Xuantie-910 is cluster based as well (shares l2).

> > > 

> > > So in many cases the cluster level corresponds to something we already have

> > > visibility of due to cache sharing etc, but that isn't true in kunpeng 920.  

> > 

> > The problem I see is that the benefit of keeping tasks together due to

> > the interconnect layout might vary significantly between systems. So if

> > we introduce a new cpumask for cluster it has to have represent roughly

> > the same system properties otherwise generic software consuming this

> > information could be tricked.

> 

> Agreed.  Any software would currently have to do it's own benchmarking

> to work out how to use the presented information.  It would imply that

> you 'want to look at this group of CPUs' rather than providing any hard

> rules.  The same is true of die, which we already have.  What

> that means will vary enormously between different designs in a fashion

> that may or may not be independent of NUMA topology. Note,

> there are people who do extensive benchmarking of NUMA topology as

> the information provided is either inaccurate / missing, or not of

> sufficient detail to do their scheduling.  It's not a big load to

> do that sort of stuff on startup of software on an HPC system.


With an undefined CPU grouping mask, you would still have to do some
benchmarking to figure out if it is useful for managing your particular
workload. It would certainly make the profiling simpler though.

> 

> > 

> > If there is a provable benefit of having interconnect grouping

> > information, I think it would be better represented by a distance matrix

> > like we have for NUMA.

> 

> There have been some discussions in various forums about how to

> describe the complexity of interconnects well enough to actually be

> useful.  Those have mostly floundered on the immense complexity of

> designing such a description in a fashion any normal software would actually

> use.  +cc Jerome who raised some of this in the kernel a while back.


I agree that representing interconnect details is hard. I had hoped that
a distance matrix would be better than nothing and more generic than
inserting extra group masks.

> 

> Add this cluster / group layer seems moderately safe as it just says

> 'something here you should consider', rather than making particular

> statements on expected performance etc.

> 

> So I fully agree it would be good to have that info, if we can figure

> out how to do it!  However, that is never going to be a short term thing.


A new sysfs entry could live for a long time :-)

Morten
Jonathan Cameron Oct. 19, 2020, 2:27 p.m. UTC | #16
On Mon, 19 Oct 2020 14:48:02 +0100
Valentin Schneider <valentin.schneider@arm.com> wrote:

> +Cc Jeremy

> 

> On 19/10/20 14:10, Morten Rasmussen wrote:

> > Hi Jonathan,

> > The problem I see is that the benefit of keeping tasks together due to

> > the interconnect layout might vary significantly between systems. So if

> > we introduce a new cpumask for cluster it has to have represent roughly

> > the same system properties otherwise generic software consuming this

> > information could be tricked.

> >

> > If there is a provable benefit of having interconnect grouping

> > information, I think it would be better represented by a distance matrix

> > like we have for NUMA.

> >

> > Morten  

> 

> That's my queue to paste some of that stuff I've been rambling on and off

> about!

> 

> With regards to cache / interconnect layout, I do believe that if we

> want to support in the scheduler itself then we should leverage some

> distance table rather than to create X extra scheduler topology levels.

> 

> I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given

> that SLIT gives us a distance value between any two PXM, we could directly

> express core-to-core distance in that table. With that (and if that still

> lets us properly discover NUMA node spans), we could let the scheduler

> build dynamic NUMA-like topology levels representing the inner quirks of

> the cache / interconnect layout.


You would rapidly run into the problem SLIT had for NUMA node description.
There is no consistent definition of distance, so beyond the vaguest
sense of 'nearer' it wasn't any use for anything.  That is why HMAT
came along.  It's far from perfect but it is a step up.

I can't see how you'd generalize those particular tables to do anything
for intercore comms without breaking their use for NUMA, but something
a bit similar might work.

A lot of thought (and meeting time) has gone into trying to improve the
situation for complex topology around NUMA.  Whilst there are differences
in representing the internal interconnects and caches, it seems like a somewhat
similar problem.  The issue there is that it is really hard to describe
this stuff with enough detail to be useful, but simply enough to be usable.

https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/

> 

> It's mostly pipe dreams for now, but there seems to be more and more

> hardware where that would make sense; somewhat recently the PowerPC guys

> added something to their arch-specific code in that regards.


Pipe dream == something to work on ;)

ACPI has a nice code-first model of updating the spec now, so we can discuss
this one in public, and propose spec changes only once we have a proven
implementation.

Note I'm not proposing we put the cluster stuff in the scheduler, just
provide it as a hint to userspace.
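
To make that concrete, here is a minimal (hypothetical, untested) sketch of what
a userspace consumer of the proposed cluster_cpus_list attribute might do.  The
path is the one this patch would create and the cpulist parsing is deliberately
simplified to a single "A-B" range, so treat it purely as illustration:

/*
 * Hypothetical illustration only: pin the calling thread to the CPUs
 * listed in the proposed cluster_cpus_list attribute.  Parsing of the
 * cpulist format is simplified to a single "A-B" range or single CPU.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	char list[256];
	unsigned int first, last, cpu;
	cpu_set_t set;
	FILE *f;

	f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");
	if (!f || !fgets(list, sizeof(list), f))
		return 1;
	fclose(f);

	if (sscanf(list, "%u-%u", &first, &last) == 1)
		last = first;	/* single CPU cluster */

	CPU_ZERO(&set);
	for (cpu = first; cpu <= last; cpu++)
		CPU_SET(cpu, &set);

	/* Keep data-sharing threads within one cluster, as discussed above. */
	return sched_setaffinity(0, sizeof(set), &set);
}

In practice the consumer would more likely be a runtime or a library like hwloc
rather than the application itself.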

Jonathan
Brice Goglin Oct. 19, 2020, 2:42 p.m. UTC | #17
On 19/10/2020 at 16:16, Morten Rasmussen wrote:
>

>>> If there is a provable benefit of having interconnect grouping

>>> information, I think it would be better represented by a distance matrix

>>> like we have for NUMA.

>> There have been some discussions in various forums about how to

>> describe the complexity of interconnects well enough to actually be

>> useful.  Those have mostly floundered on the immense complexity of

>> designing such a description in a fashion any normal software would actually

>> use.  +cc Jerome who raised some of this in the kernel a while back.

> I agree that representing interconnect details is hard. I had hoped that

> a distance matrix would be better than nothing and more generic than

> inserting extra group masks.

>


The distance matrix is indeed more precise, but would it scale to
tens/hundreds of cores? When ACPI HMAT latency/bandwidth was added, there
were concerns that exposing the full matrix would be an issue for the
kernel (that's why only the local latency/bandwidth is exposed in sysfs).
That was only for NUMA nodes/targets/initiators; you would have
significantly more cores than that.

Brice
Jonathan Cameron Oct. 19, 2020, 3:30 p.m. UTC | #18
On Mon, 19 Oct 2020 16:16:53 +0200
Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> On Mon, Oct 19, 2020 at 02:41:57PM +0100, Jonathan Cameron wrote:

> > On Mon, 19 Oct 2020 15:10:52 +0200

> > Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> >   

> > > Hi Jonathan,

> > > 

> > > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:  

> > > > On Mon, 19 Oct 2020 12:35:22 +0200

> > > > Peter Zijlstra <peterz@infradead.org> wrote:

> > > >     

> > > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:    

> > > > > > Both ACPI and DT provide the ability to describe additional layers of

> > > > > > topology between that of individual cores and higher level constructs

> > > > > > such as the level at which the last level cache is shared.

> > > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy

> > > > > > Node Structure [1] that is the parent of the CPU cores and in turn

> > > > > > has a parent Processor Hierarchy Nodes Structure representing

> > > > > > a higher level of topology.

> > > > > > 

> > > > > > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share

> > > > > > any cache resources, but the interconnect topology is such that

> > > > > > the cost to transfer ownership of a cacheline between CPUs within

> > > > > > a cluster is lower than between CPUs in different clusters on the same

> > > > > > die.   Hence, it can make sense to deliberately schedule threads

> > > > > > sharing data to a single cluster.

> > > > > > 

> > > > > > This patch simply exposes this information to userspace libraries

> > > > > > like hwloc by providing cluster_cpus and related sysfs attributes.

> > > > > > PoC of HWLOC support at [2].

> > > > > > 

> > > > > > Note this patch only handle the ACPI case.

> > > > > > 

> > > > > > Special consideration is needed for SMT processors, where it is

> > > > > > necessary to move 2 levels up the hierarchy from the leaf nodes

> > > > > > (thus skipping the processor core level).      

> > > > 

> > > > Hi Peter,

> > > >     

> > > > > 

> > > > > I'm confused by all of this. The core level is exactly what you seem to

> > > > > want.    

> > > > 

> > > > It's the level above the core, whether in an multi-threaded core

> > > > or a single threaded core.   This may correspond to the level

> > > > at which caches are shared (typically L3).  Cores are already well

> > > > represented via thread_siblings and similar.  Extra confusion is that

> > > > the current core_siblings (deprecated) sysfs interface, actually reflects

> > > > the package level and ignores anything in between core and

> > > > package (such as die on x86)

> > > > 

> > > > So in a typical system with a hierarchical interconnect you would have

> > > > 

> > > > thread

> > > > core

> > > > cluster (possibly multiple layers as mentioned in Brice's reply).

> > > > die

> > > > package

> > > > 

> > > > Unfortunately as pointed out in other branches of this thread, there is

> > > > no consistent generic name.  I'm open to suggestions!    

> > > 

> > > IIUC, you are actually proposing another "die" level? I'm not sure if we

> > > can actually come up with a generic name since interconnects are highly

> > > implementation dependent.  

> > 

> > Brice mentioned hwloc is using 'group'.  That seems generic enough perhaps.

> >   

> > > 

> > > How is you memory distributed? Do you already have NUMA nodes? If you

> > > want to keep tasks together, it might make sense to define the clusters

> > > (in your case) as NUMA nodes.  

> > 

> > We already have all of the standard levels.  We need at least one more.

> > On a near future platform we'll have full set (kunpeng920 is single thread)

> > 

> > So on kunpeng 920 we have

> > cores

> > (clusters)

> > die / llc shared at this level  

> 

> IIRC, LLC sharing isn't tied to a specific level in the user-space

> topology description. On some Arm systems LLC is per cluster while the

> package has a single die with two clusters.


That was just an example of this system, where the LLC happens to sit at that level.
Nothing stops a per-core private LLC with the right coherence protocol.
Basically the LLC can sit at any level.

It is indirectly exposed and used by userspace - though which level it attaches
to, or whether it defines its own, varies.
So in cpuX/topology there is no mention of caches, but if you are using hwloc
it also uses
cpuX/cache/indexX
and from that it will happily add more layers of topology to 
reflect cache sharing.  It will merge the cache levels with existing levels
from cpuX/topology if they happen to match.

So it may not have been the intent of the sysfs design, but a commonly
used library is figuring it out anyway.
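
For reference, the sort of thing such a library ends up reading is just a handful
of sysfs files; the following is a rough sketch of that discovery, not hwloc's
actual code.  The cache/indexN attributes already exist today; cluster_cpus_list
is the name proposed in this patch and is otherwise an assumption:

/*
 * Rough sketch of hwloc-style discovery: read the sibling lists under
 * cpuX/topology and the shared CPU lists under cpuX/cache/indexN.
 * A library would then merge levels whose CPU sets are identical.
 */
#include <stdio.h>

static void show(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;		/* attribute not present on this kernel */
	if (fgets(buf, sizeof(buf), f))
		printf("%-60s %s", path, buf);
	fclose(f);
}

int main(void)
{
	char path[128];
	int i;

	show("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
	show("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list");
	show("/sys/devices/system/cpu/cpu0/topology/package_cpus_list");

	for (i = 0; i < 4; i++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", i);
		show(path);
	}
	return 0;
}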

> 

> I'm slightly confused about the cache sharing. You said above that your

> clusters don't share cache resources? This list says LLC is at die

> level, which is above cluster level?


Yes.  The cluster is below the level at which the l3 is shared.
It's more complex than that in reality, but at the coarse scale
we can describe at all, the l3 is at the die level.

From lstopo (heavily cropped obviously, it's a 128 core 2 socket system)

Machine (1006GB total)
  Package L#0
    L3 L#0 (32MB)
      NUMANode L#0 (P#0 252GB)
      Cluster L#0
        L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
        L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
        L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      Cluster L#1
        L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
        L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
        L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
        L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
...
    L3 L#1 (32MB)
      NUMANode L#1 (P#1 252GB)
      Cluster L#8
        L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
        L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
        L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
        L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
...

  Package L#1                                                                                                                                                    
    L3 L#2 (32MB)                                                                                                                                                  
      NUMANode L#2 (P#2 252GB)                                                                                                                                     
      Cluster L#16
        L2 L#64 (512KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 + PU L#64 (P#64)
        L2 L#65 (512KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 + PU L#65 (P#65)
        L2 L#66 (512KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 + PU L#66 (P#66)
        L2 L#67 (512KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 + PU L#67 (P#67)
etc


> > package (multiple NUMA nodes in each package)   

> 

> What is your NUMA node span? Couldn't you just make it equivalent to

> your clusters?


Nope, can't do that.  The system has 2 NUMA nodes per package, multiple packages per
system, and the interconnect hangs off one end of each package.  It's one of the
platforms that has been causing Valentin issues with span (span 3 I think?).
A 2 socket system basically looks like

[a]-[b]-----[c]-[d] 

where dash length reflects distance and there isn't a direct [a] to [d] path.

We have tried messing around with dropping the die level of NUMA description
in favour of representing the cluster in the past, but it worked worse than going
for NUMA at the die level (which is where the memory controllers are anyway).
The cluster effects are generally more minor than the NUMA ones.  Hence I'm not
proposing putting them anywhere they'd currently affect the kernel scheduler.

> 

> > > > For example, in zen2 this would correspond to a 'core complex' consisting

> > > > 4 CPU cores (each one 2 threads) sharing some local L3 cache.

> > > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

> > > > In zen3 it looks like this level will be the same as that for the die.

> > > > 

> > > > Given they used the name in knights landing (and as is pointed out in

> > > > another branch of this thread, it's the CPUID description) I think Intel

> > > > calls these 'tiles' (anyone confirm that?) 

> > > > 

> > > > A similar concept exists for some ARM processors. 

> > > > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110

> > > > CCLs in the diagram on that page.

> > > > 

> > > > Centriq 2400 had 2 core 'duplexes' which shared l2.

> > > > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3

> > > > 

> > > > From the info release at hotchips, it looks like the thunderx3 deploys

> > > > a similar ring interconnect with groups of cores, each with 4 threads.

> > > > Not sure what they plan to call them yet though or whether they will chose

> > > > to represent that layer of the topology in their firmware tables.

> > > > 

> > > > Arms CMN600 interconnect also support such 'clusters' though I have no idea

> > > > if anyone has used it in this form yet.  In that case, they are called

> > > > "processor compute clusters"

> > > > https://developer.arm.com/documentation/100180/0103/

> > > > 

> > > > Xuantie-910 is cluster based as well (shares l2).

> > > > 

> > > > So in many cases the cluster level corresponds to something we already have

> > > > visibility of due to cache sharing etc, but that isn't true in kunpeng 920.    

> > > 

> > > The problem I see is that the benefit of keeping tasks together due to

> > > the interconnect layout might vary significantly between systems. So if

> > > we introduce a new cpumask for cluster it has to have represent roughly

> > > the same system properties otherwise generic software consuming this

> > > information could be tricked.  

> > 

> > Agreed.  Any software would currently have to do it's own benchmarking

> > to work out how to use the presented information.  It would imply that

> > you 'want to look at this group of CPUs' rather than providing any hard

> > rules.  The same is true of die, which we already have.  What

> > that means will vary enormously between different designs in a fashion

> > that may or may not be independent of NUMA topology. Note,

> > there are people who do extensive benchmarking of NUMA topology as

> > the information provided is either inaccurate / missing, or not of

> > sufficient detail to do their scheduling.  It's not a big load to

> > do that sort of stuff on startup of software on an HPC system.  

> 

> With an undefined CPU grouping mask, you would still have to do some

> benchmarking to figure out if it is useful for managing your particular

> workload. It would certainly make the profiling simpler though.


Absolutely.  Guessing the scale of clusters is not going to be
pretty.  I suppose you'd probably catch most systems if you brute-forced
sizes of 2, 4 and 8 and hoped your CPUs were in a sane order (optimistic).
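
For illustration, the probing you end up doing without any mask is roughly the
following (untested sketch; the CPU pairs and iteration count are arbitrary):
time a cacheline ping-pong between pairs of CPUs and look for a step in the cost.

/*
 * Rough sketch (illustrative only): estimate whether two CPUs are "near"
 * by timing a cacheline ping-pong between them - the sort of probing
 * userspace falls back on when no cluster mask is exposed.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000

static _Atomic int token;

struct arg { int cpu; int parity; };

static void *bouncer(void *p)
{
	struct arg *a = p;
	cpu_set_t set;
	int i;

	CPU_ZERO(&set);
	CPU_SET(a->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (i = 0; i < ITERS; i++) {
		while ((atomic_load(&token) & 1) != a->parity)
			;			/* spin until it is our turn */
		atomic_fetch_add(&token, 1);	/* bounce the line to the peer */
	}
	return NULL;
}

static double pingpong_ns(int cpu_a, int cpu_b)
{
	struct arg a = { cpu_a, 0 }, b = { cpu_b, 1 };
	pthread_t ta, tb;
	struct timespec t0, t1;

	atomic_store(&token, 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&ta, NULL, bouncer, &a);
	pthread_create(&tb, NULL, bouncer, &b);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / (2.0 * ITERS);
}

int main(void)
{
	/* Placeholder pairs: within vs. across a guessed cluster of 4. */
	printf("cpu0<->cpu1: %.0f ns/hop\n", pingpong_ns(0, 1));
	printf("cpu0<->cpu4: %.0f ns/hop\n", pingpong_ns(0, 4));
	return 0;
}

Doing that for every candidate grouping at application start is exactly the sort
of thing the proposed sysfs hint would let people avoid.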

> 

> >   

> > > 

> > > If there is a provable benefit of having interconnect grouping

> > > information, I think it would be better represented by a distance matrix

> > > like we have for NUMA.  

> > 

> > There have been some discussions in various forums about how to

> > describe the complexity of interconnects well enough to actually be

> > useful.  Those have mostly floundered on the immense complexity of

> > designing such a description in a fashion any normal software would actually

> > use.  +cc Jerome who raised some of this in the kernel a while back.  

> 

> I agree that representing interconnect details is hard. I had hoped that

> a distance matrix would be better than nothing and more generic than

> inserting extra group masks.


Short of effectively mapping it to groups (which is basically what SLIT ends
up being used for) I'm not sure how on earth we'd define it.

Distance for what operation?
Shared access to a cacheline?
Get exclusive? (with all the fun of chatting to directories and possible DCT)

There is a whole set of possibilities, and we might potentially want bandwidth.
All come with a bunch of questions around access patterns etc. (probably
- I've not really thought about this that much for this case!)

Basically we'd be trying to wrap up all the complex stuff those interconnect
and cache specialists work on in a single number.  It's going to just become
an ordered list of who is nearer.  So it ends up as another way of representing
the groupings we can already describe in firmware.

> 

> > 

> > Add this cluster / group layer seems moderately safe as it just says

> > 'something here you should consider', rather than making particular

> > statements on expected performance etc.

> > 

> > So I fully agree it would be good to have that info, if we can figure

> > out how to do it!  However, that is never going to be a short term thing.  

> 

> A new sysfs entry could live for a long time :-)


Agreed.  But if it is generic and easy to support from the existing description
(which this is) then that shouldn't be a problem.  Definitely good to get
it as good as we can at the start though.

If something better comes along that is more useful, people can transition to
that, and it doesn't matter, as the burden of maintaining the old route is trivial.

I'm absolutely in favour of a better description and happy to work with people on
that, but it's a multi-year job at the very least.  We all love standards bodies :)

The biggest job here would be proving the info was useful and generic across
many 'interesting' system architectures, many of which also have more than one
mode in which you can run the caches.

> 

> Morten


Jonathan
Valentin Schneider Oct. 19, 2020, 3:51 p.m. UTC | #19
On 19/10/20 15:27, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 14:48:02 +0100

> Valentin Schneider <valentin.schneider@arm.com> wrote:

>>

>> That's my queue to paste some of that stuff I've been rambling on and off

>> about!

>>

>> With regards to cache / interconnect layout, I do believe that if we

>> want to support in the scheduler itself then we should leverage some

>> distance table rather than to create X extra scheduler topology levels.

>>

>> I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given

>> that SLIT gives us a distance value between any two PXM, we could directly

>> express core-to-core distance in that table. With that (and if that still

>> lets us properly discover NUMA node spans), we could let the scheduler

>> build dynamic NUMA-like topology levels representing the inner quirks of

>> the cache / interconnect layout.

>

> You would rapidly run into the problem SLIT had for numa node description.

> There is no consistent description of distance and except in the vaguest

> sense or 'nearer' it wasn't any use for anything.   That is why HMAT

> came along. It's far from perfect but it is a step up.

>


I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT
/ PPTT, so thanks for pointing this out.

> I can't see how you'd generalize those particular tables to do anything

> for intercore comms without breaking their use for NUMA, but something

> a bit similar might work.

>


Right, there's the issue of still being able to determine NUMA node
boundaries.

> A lot of thought has gone in (and meeting time) to try an improve the

> situation for complex topology around NUMA.  Whilst there are differences

> in representing the internal interconnects and caches it seems like a somewhat

> similar problem.  The issue there is it is really really hard to describe

> this stuff with enough detail to be useful, but simple enough to be usable.

>

> https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/

>


Thanks for the link!

>>

>> It's mostly pipe dreams for now, but there seems to be more and more

>> hardware where that would make sense; somewhat recently the PowerPC guys

>> added something to their arch-specific code in that regards.

>

> Pipe dream == something to work on ;)

>

> ACPI has a nice code first model of updating the spec now, so we can discuss

> this one in public, and propose spec changes only once we have an implementation

> proven.

>


FWIW I blabbered about a "generalization" of NUMA domains & distances
within the scheduler at LPC19 (and have been pasting that occasionally,
apologies for the broken record):

https://linuxplumbersconf.org/event/4/contributions/484/

I've only pondered the implementation, but if (big if; also I really
despise advertising "the one solution that will solve all your issues",
which this is starting to sound like) it would help, I could cobble together
an RFC leveraging a separate distance table.

It doesn't solve the "funneling cache properties into a single number"
issue, which as you just pointed out in a parallel email is a separate
discussion altogether.
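
For the sake of discussion, the shape I have in mind is nothing more elaborate
than the sketch below.  It is entirely hypothetical - none of these names exist
in the kernel - and only meant to give the "separate distance table" idea a form:

/*
 * Hypothetical sketch of a separate core-to-core distance table; the
 * names and sizes are invented purely to illustrate the idea.
 */
#include <stdbool.h>

#define EXAMPLE_NR_CPUS 8

struct cpu_distance_table {
	/* relative cost of bouncing a cacheline from CPU i to CPU j */
	unsigned char dist[EXAMPLE_NR_CPUS][EXAMPLE_NR_CPUS];
};

/* CPUs closer than 'threshold' would end up in the same dynamic level. */
static inline bool cpus_in_same_group(const struct cpu_distance_table *t,
				      int i, int j, unsigned char threshold)
{
	return t->dist[i][j] <= threshold;
}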

> Note I'm not proposing we put the cluster stuff in the scheduler, just

> provide it as a hint to userspace.

>


The goal being to tweak tasks' affinities, right? Other than CPU pinning
and rare cases, IMO if userspace has to mess around with affinities it
is due to the failings of the underlying scheduler. Restricted CPU
affinities are also something the load-balancer struggles with; I have
been fighting such issues where just a single per-CPU kworker
waking up at the wrong time can mess up load-balance for quite some time. I
tend to phrase it as: "if you're rude to the scheduler, it can and will
respond in kind".

Now yes, it's not the same timescale nor amount of work, but this is
something the scheduler itself should leverage, not userspace.

> Jonathan
Jonathan Cameron Oct. 19, 2020, 4 p.m. UTC | #20
On Mon, 19 Oct 2020 16:51:06 +0100
Valentin Schneider <valentin.schneider@arm.com> wrote:

> On 19/10/20 15:27, Jonathan Cameron wrote:

> > On Mon, 19 Oct 2020 14:48:02 +0100

> > Valentin Schneider <valentin.schneider@arm.com> wrote:  

> >>

> >> That's my queue to paste some of that stuff I've been rambling on and off

> >> about!

> >>

> >> With regards to cache / interconnect layout, I do believe that if we

> >> want to support in the scheduler itself then we should leverage some

> >> distance table rather than to create X extra scheduler topology levels.

> >>

> >> I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given

> >> that SLIT gives us a distance value between any two PXM, we could directly

> >> express core-to-core distance in that table. With that (and if that still

> >> lets us properly discover NUMA node spans), we could let the scheduler

> >> build dynamic NUMA-like topology levels representing the inner quirks of

> >> the cache / interconnect layout.  

> >

> > You would rapidly run into the problem SLIT had for numa node description.

> > There is no consistent description of distance and except in the vaguest

> > sense or 'nearer' it wasn't any use for anything.   That is why HMAT

> > came along. It's far from perfect but it is a step up.

> >  

> 

> I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT

> / PPTT, so thanks for pointing this out.

> 

> > I can't see how you'd generalize those particular tables to do anything

> > for intercore comms without breaking their use for NUMA, but something

> > a bit similar might work.

> >  

> 

> Right, there's the issue of still being able to determine NUMA node

> boundaries.


Backwards compatibility will break you there.  I'd definitely look at a separate
table.  The problem with SLIT etc. is that, as static tables, we can't play games
with OSC bits to negotiate what the OS and the firmware both understand.

> 

> > A lot of thought has gone in (and meeting time) to try an improve the

> > situation for complex topology around NUMA.  Whilst there are differences

> > in representing the internal interconnects and caches it seems like a somewhat

> > similar problem.  The issue there is it is really really hard to describe

> > this stuff with enough detail to be useful, but simple enough to be usable.

> >

> > https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/

> >  

> 

> Thanks for the link!

> 

> >>

> >> It's mostly pipe dreams for now, but there seems to be more and more

> >> hardware where that would make sense; somewhat recently the PowerPC guys

> >> added something to their arch-specific code in that regards.  

> >

> > Pipe dream == something to work on ;)

> >

> > ACPI has a nice code first model of updating the spec now, so we can discuss

> > this one in public, and propose spec changes only once we have an implementation

> > proven.

> >  

> 

> FWIW I blabbered about a "generalization" of NUMA domains & distances

> within the scheduler at LPC19 (and have been pasting that occasionally,

> apologies for the broken record):

> 

> https://linuxplumbersconf.org/event/4/contributions/484/

> 

> I've only pondered about the implementation, but if (big if; also I really

> despise advertising "the one solution that will solve all your issues"

> which this is starting to sound like) it would help I could cobble together

> an RFC leveraging a separate distance table.


It would certainly be interesting.

> 

> It doesn't solve the "funneling cache properties into a single number"

> issue, which as you just pointed out in a parallel email is a separate

> discussion altogether.

> 

> > Note I'm not proposing we put the cluster stuff in the scheduler, just

> > provide it as a hint to userspace.

> >  

> 

> The goal being to tweak tasks' affinities, right? Other than CPU pinning

> and rare cases, IMO if the userspace has to mess around with affinities it

> is due to the failings of the underlying scheduler. Restricted CPU

> affinities is also something the load-balancer struggles with; I have and

> have been fighting over such issues where just a single per-CPU kworker

> waking up at the wrong time can mess up load-balance for quite some time. I

> tend to phrase it as: "if you're rude to the scheduler, it can and will

> respond in kind".

> 

> Now yes, it's not the same timescale nor amount of work, but this is

> something the scheduler itself should leverage, not userspace.


Ideally I absolutely agree, but then we get into the game of trying to
classify the types of workload which would benefit.  Much like with
NUMA spreading, it is going to be hard to come up with one true
solution (nice though that would be!).

Not getting regressions with anything in this area is going to be
really tricky.

J


> 

> > Jonathan
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b90dafcc8237..f9d374560047 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -24,6 +24,12 @@  core_id:
 	identifier (rather than the kernel's).  The actual value is
 	architecture and platform dependent.
 
+cluster_id:
+
+	the Cluster ID of cpuX.  Typically it is the hardware platform's
+	identifier (rather than the kernel's).  The actual value is
+	architecture and platform dependent.
+
 book_id:
 
 	the book ID of cpuX. Typically it is the hardware platform's
@@ -56,6 +62,14 @@  package_cpus_list:
 	human-readable list of CPUs sharing the same physical_package_id.
 	(deprecated name: "core_siblings_list")
 
+cluster_cpus:
+
+	internal kernel map of CPUs within the same cluster.
+
+cluster_cpus_list:
+
+	human-readable list of CPUs within the same cluster.
+
 die_cpus:
 
 	internal kernel map of CPUs within the same die.
@@ -96,11 +110,13 @@  these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -116,10 +132,12 @@  not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 0801a0f3c156..4c40240fdf6a 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -101,6 +101,8 @@  int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index 4ae93350b70d..8646a93e4348 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -736,6 +736,66 @@  int find_acpi_cpu_topology_package(unsigned int cpu)
 					  ACPI_PPTT_PHYSICAL_PACKAGE);
 }
 
+/**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching ids.
+ *
+ * The cluster, if present, is the level of topology above CPUs. In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems. In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
+ * or there is no topology level above the CPU.
+ * Otherwise returns a value which represents the cluster for this CPU.
+ */
+
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+	cpu_node = acpi_find_processor_node(table, cpu);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	retval = ACPI_PTR_DIFF(cluster_node, table);
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
 /**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 75f72d684294..e2ca8f39131e 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -497,6 +497,11 @@  const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -514,6 +519,11 @@  void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;
 
+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
+			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
+			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
+		}
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
 
@@ -532,6 +542,9 @@  static void clear_cpu_topology(int cpu)
 	cpumask_clear(&cpu_topo->llc_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
 
+	cpumask_clear(&cpu_topo->cluster_sibling);
+	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
+
 	cpumask_clear(&cpu_topo->core_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
 	cpumask_clear(&cpu_topo->thread_sibling);
@@ -562,6 +575,7 @@  void remove_cpu_topology(unsigned int cpu)
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
 	for_each_cpu(sibling, topology_sibling_cpumask(cpu))
 		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+
 	for_each_cpu(sibling, topology_llc_cpumask(cpu))
 		cpumask_clear_cpu(cpu, topology_llc_cpumask(sibling));
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index ad8d33c6077b..f72ac9acc112 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -46,6 +46,9 @@  static DEVICE_ATTR_RO(physical_package_id);
 define_id_show_func(die_id);
 static DEVICE_ATTR_RO(die_id);
 
+define_id_show_func(cluster_id);
+static DEVICE_ATTR_RO(cluster_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -61,6 +64,10 @@  define_siblings_show_func(core_siblings, core_cpumask);
 static DEVICE_ATTR_RO(core_siblings);
 static DEVICE_ATTR_RO(core_siblings_list);
 
+define_siblings_show_func(cluster_cpus, cluster_cpumask);
+static DEVICE_ATTR_RO(cluster_cpus);
+static DEVICE_ATTR_RO(cluster_cpus_list);
+
 define_siblings_show_func(die_cpus, die_cpumask);
 static DEVICE_ATTR_RO(die_cpus);
 static DEVICE_ATTR_RO(die_cpus_list);
@@ -88,6 +95,7 @@  static DEVICE_ATTR_RO(drawer_siblings_list);
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_die_id.attr,
+	&dev_attr_cluster_id.attr,
 	&dev_attr_core_id.attr,
 	&dev_attr_thread_siblings.attr,
 	&dev_attr_thread_siblings_list.attr,
@@ -95,6 +103,8 @@  static struct attribute *default_attrs[] = {
 	&dev_attr_core_cpus_list.attr,
 	&dev_attr_core_siblings.attr,
 	&dev_attr_core_siblings_list.attr,
+	&dev_attr_cluster_cpus.attr,
+	&dev_attr_cluster_cpus_list.attr,
 	&dev_attr_die_cpus.attr,
 	&dev_attr_die_cpus_list.attr,
 	&dev_attr_package_cpus.attr,
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 159a461d8524..05b1e30f43fc 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1333,6 +1333,7 @@  static inline int lpit_read_residency_count_address(u64 *address)
 #ifdef CONFIG_ACPI_PPTT
 int acpi_pptt_cpu_is_thread(unsigned int cpu);
 int find_acpi_cpu_topology(unsigned int cpu, int level);
+int find_acpi_cpu_topology_cluster(unsigned int cpu);
 int find_acpi_cpu_topology_package(unsigned int cpu);
 int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
 int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@@ -1345,6 +1346,10 @@  static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
 {
 	return -EINVAL;
 }
+static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	return -EINVAL;
+}
 static inline int find_acpi_cpu_topology_package(unsigned int cpu)
 {
 	return -EINVAL;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 69b1dabe39dc..dba06864eab5 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -45,10 +45,12 @@  void topology_set_thermal_pressure(const struct cpumask *cpus,
 struct cpu_topology {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int package_id;
 	int llc_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	cpumask_t cluster_sibling;
 	cpumask_t llc_sibling;
 };
 
@@ -56,13 +58,16 @@  struct cpu_topology {
 extern struct cpu_topology cpu_topology[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 608fa4aadf0e..5f666488f8f1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -185,6 +185,9 @@  static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)			((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)		((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)			((void)(cpu), 0)
 #endif
@@ -194,6 +197,9 @@  static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)		cpumask_of(cpu)
 #endif