Message ID | 20200924195058.362984-7-danielhb413@gmail.com |
---|---|
State | Superseded |
Headers | show |
Series | pseries NUMA distance calculation | expand |
On Thu, Sep 24, 2020 at 04:50:58PM -0300, Daniel Henrique Barboza wrote: > This update provides more in depth information about the > choices and drawbacks of the new NUMA support for the > spapr machine. > > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> > --- > docs/specs/ppc-spapr-numa.rst | 213 ++++++++++++++++++++++++++++++++++ > 1 file changed, 213 insertions(+) > > diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst > index e762038022..994bfb996f 100644 > --- a/docs/specs/ppc-spapr-numa.rst > +++ b/docs/specs/ppc-spapr-numa.rst > @@ -189,3 +189,216 @@ QEMU up to 5.1, as follows: > > This also means that user input in QEMU command line does not change the > NUMA distancing inside the guest for the pseries machine. > + > +New NUMA mechanics for pseries in QEMU 5.2 > +========================================== I'd suggest putting the new mechanics - which we intend to become the normal mechanis - at the top of the document, with the old version later. > + > +Starting in QEMU 5.2, the pseries machine now considers user input when > +setting NUMA topology of the guest. The following changes were made: > + > +* ibm,associativity-reference-points was changed to {0x4, 0x3, 0x2, 0x1}, allowing > + for 4 distinct NUMA distance values based on the NUMA levels > + > +* ibm,max-associativity-domains was changed to support multiple associativity > + domains in all NUMA levels. This is needed to ensure user flexibility > + > +* ibm,associativity for all resources now varies with user input > + > +These changes are only effective for pseries-5.2 and newer machines that are > +created with more than one NUMA node (disconsidering NUMA nodes created by > +the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been > +around for such a long time, with users seeing NUMA distances 10 and 40 > +(and 80 if using NVLink2 GPUs), and there is no need to disrupt the > +existing experience of those guests. > + > +To bring the user experience x86 users have when tuning up NUMA, we had > +to operate under the current pseries Linux kernel logic described in > +`How the pseries Linux guest calculates NUMA distances`_. The result > +is that we needed to translate NUMA distance user input to pseries > +Linux kernel input. > + > +Translating user distance to kernel distance > +-------------------------------------------- > + > +User input for NUMA distance can vary from 10 to 254. We need to translate > +that to the values that the Linux kernel operates on (10, 20, 40, 80, 160). > +This is how it is being done: > + > +* user distance 11 to 30 will be interpreted as 20 > +* user distance 31 to 60 will be interpreted as 40 > +* user distance 61 to 120 will be interpreted as 80 > +* user distance 121 and beyond will be interpreted as 160 > +* user distance 10 stays 10 > + > +The reasoning behind this aproximation is to avoid any round up to the local > +distance (10), keeping it exclusive to the 4th NUMA level (which is still > +exclusive to the node_id). All other ranges were chosen under the developer > +discretion of what would be (somewhat) sensible considering the user input. > +Any other strategy can be used here, but in the end the reality is that we'll > +have to accept that a large array of values will be translated to the same > +NUMA topology in the guest, e.g. this user input: > + > +:: > + > + 0 1 2 > + 0 10 31 120 > + 1 31 10 30 > + 2 120 30 10 > + > +And this other user input: > + > +:: > + > + 0 1 2 > + 0 10 60 61 > + 1 60 10 11 > + 2 61 11 10 > + > +Will both be translated to the same values internally: > + > +:: > + > + 0 1 2 > + 0 10 40 80 > + 1 40 10 20 > + 2 80 20 10 > + > +Users are encouraged to use only the kernel values in the NUMA definition to > +avoid being taken by surprise with that the guest is actually seeing in the > +topology. There are enough potential surprises that are inherent to the > +associativity domain assignment process, discussed below. > + > + > +How associativity domains are assigned > +-------------------------------------- > + > +LOPAPR allows more than one associativity array (or 'string') per allocated > +resource. This would be used to represent that the resource has multiple > +connections with the board, and then the operational system, when deciding > +NUMA distancing, should consider the associativity information that provides > +the shortest distance. > + > +The spapr implementation does not support multiple associativity arrays per > +resource, neither does the pseries Linux kernel. We'll have to represent the > +NUMA topology using one associativity per resource, which means that choices > +and compromises are going to be made. > + > +Consider the following NUMA topology entered by user input: > + > +:: > + > + 0 1 2 3 > + 0 10 20 20 40 > + 1 20 10 80 40 > + 2 20 80 10 20 > + 3 40 40 20 10 > + > +Honoring just the relative distances of node 0 to every other node, one possible > +value for all associativity arrays would be: > + > +* node 0: 0 B A 0 > +* node 1: 0 0 A 1 > +* node 2: 0 0 A 2 > +* node 3: 0 B 0 3 > + > +With the reference points {0x4, 0x3, 0x2, 0x1}, for node 0: > + > +* distance from 0 to 1 is 20 (no match at 0x4, will match at 0x3) > +* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3) > +* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match > + at 0x2) > + > +The distances related to node 0 are well represented. Doing for node 1, and keeping > +in mind that we don't need to revisit node 0 again, the distance from node 1 to > +2 is 80, matching at 0x4: > + > +* node 1: C 0 A 1 > +* node 2: C 0 A 2 > + > +Over here we already have the first conflict. Even if we assign a new associativity > +domain at 0x4 for 1 and 2, and we do that in the code, the kernel will define > +the distance between 1 and 2 as 20, not 80, because both 1 and 2 have the "A" > +associativity domain from the previous step. If we decide to discard the > +associativity with "A" then the node 0 distances are compromised. > + > +Following up with the distance from 1 to 3 being 40 (a match in 0x2) we have another > +decision to make. These are the current associativity domains of each: > + > +* node 1: C 0 A 1 > +* node 3: 0 B 0 3 > + > +There is already an associativity domain at 0x2 in node 3, "B", which was assigned > +by the node 0 distances. If we define a new associativity domain at this level > +for 1 and 3 we will overwrite the existing associativity between 0 and 3. What > +the code is doing in this case is to assign the existing domain to the > +current associativity, in this case, "B" is now assigned to the 0x2 of node 1, > +resulting in the following associativity arrays: > + > +* node 0: 0 B A 0 > +* node 1: C 0 A 1 > +* node 2: C B A 2 > +* node 3: 0 B 0 3 > + > +In the last step we will analyze just nodes 2 and 3. The desired distance between > +2 and 3 is 20, i.e. a match in 0x3. Node 2 already has a domain assigned in 0x3, > +A, so we do the same as we did in the previous case and assign it to node 3 > +at 0x3. This is the end result for the associativity arrays: > + > +* node 0: 0 B A 0 > +* node 1: C 0 A 1 > +* node 2: C B A 2 > +* node 3: 0 B A 3 > + > +The kernel will read these arrays and will calculate the following NUMA topology for > +the guest: > + > +:: > + > + 0 1 2 3 > + 0 10 20 20 20 > + 1 20 10 20 20 > + 2 20 20 10 20 > + 3 20 20 20 10 > + > +Which is not what the user wanted, but it is what the current logic and implementation > +constraints of the kernel and QEMU will provide inside the LOPAPR specification. > + > +Changing a single value, specially a low distance value, makes for drastic changes > +in the result. For example, with the same user input from above, but changing the > +node distance from 0 to 1 to 40: > + > +:: > + > + 0 1 2 3 > + 0 10 40 20 40 > + 1 40 10 80 40 > + 2 20 80 10 20 > + 3 40 40 20 10 > + > +This is the result inside the guest, applying the same heuristics: > + > +:: > + > + $ numactl -H > + available: 4 nodes (0-3) > + (...) > + node distances: > + node 0 1 2 3 > + 0: 10 40 20 20 > + 1: 40 10 80 40 > + 2: 20 80 10 20 > + 3: 20 40 20 10 > + > +This result is much closer to the user input and only a single distance was changed > +from the original. > + > +The kernel will always match with the shortest associativity domain possible, and we're > +attempting to retain the previous established relations between the nodes. This means > +that a distance equal to 20 between nodes A and B and the same distance 20 between nodes > +A and F will cause the distance between B and F to also be 20. The same will happen to > +other distances, but shorter distances has precedent over it to the distance calculation. > + > +Users are welcome to use this knowledge and experiment with the input to get the > +NUMA topology they want, or as closer as they want. The important thing is to keep > +expectations up to par with what we are capable of provide at this moment: an > +approximation. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst index e762038022..994bfb996f 100644 --- a/docs/specs/ppc-spapr-numa.rst +++ b/docs/specs/ppc-spapr-numa.rst @@ -189,3 +189,216 @@ QEMU up to 5.1, as follows: This also means that user input in QEMU command line does not change the NUMA distancing inside the guest for the pseries machine. + +New NUMA mechanics for pseries in QEMU 5.2 +========================================== + +Starting in QEMU 5.2, the pseries machine now considers user input when +setting NUMA topology of the guest. The following changes were made: + +* ibm,associativity-reference-points was changed to {0x4, 0x3, 0x2, 0x1}, allowing + for 4 distinct NUMA distance values based on the NUMA levels + +* ibm,max-associativity-domains was changed to support multiple associativity + domains in all NUMA levels. This is needed to ensure user flexibility + +* ibm,associativity for all resources now varies with user input + +These changes are only effective for pseries-5.2 and newer machines that are +created with more than one NUMA node (disconsidering NUMA nodes created by +the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been +around for such a long time, with users seeing NUMA distances 10 and 40 +(and 80 if using NVLink2 GPUs), and there is no need to disrupt the +existing experience of those guests. + +To bring the user experience x86 users have when tuning up NUMA, we had +to operate under the current pseries Linux kernel logic described in +`How the pseries Linux guest calculates NUMA distances`_. The result +is that we needed to translate NUMA distance user input to pseries +Linux kernel input. + +Translating user distance to kernel distance +-------------------------------------------- + +User input for NUMA distance can vary from 10 to 254. We need to translate +that to the values that the Linux kernel operates on (10, 20, 40, 80, 160). +This is how it is being done: + +* user distance 11 to 30 will be interpreted as 20 +* user distance 31 to 60 will be interpreted as 40 +* user distance 61 to 120 will be interpreted as 80 +* user distance 121 and beyond will be interpreted as 160 +* user distance 10 stays 10 + +The reasoning behind this aproximation is to avoid any round up to the local +distance (10), keeping it exclusive to the 4th NUMA level (which is still +exclusive to the node_id). All other ranges were chosen under the developer +discretion of what would be (somewhat) sensible considering the user input. +Any other strategy can be used here, but in the end the reality is that we'll +have to accept that a large array of values will be translated to the same +NUMA topology in the guest, e.g. this user input: + +:: + + 0 1 2 + 0 10 31 120 + 1 31 10 30 + 2 120 30 10 + +And this other user input: + +:: + + 0 1 2 + 0 10 60 61 + 1 60 10 11 + 2 61 11 10 + +Will both be translated to the same values internally: + +:: + + 0 1 2 + 0 10 40 80 + 1 40 10 20 + 2 80 20 10 + +Users are encouraged to use only the kernel values in the NUMA definition to +avoid being taken by surprise with that the guest is actually seeing in the +topology. There are enough potential surprises that are inherent to the +associativity domain assignment process, discussed below. + + +How associativity domains are assigned +-------------------------------------- + +LOPAPR allows more than one associativity array (or 'string') per allocated +resource. This would be used to represent that the resource has multiple +connections with the board, and then the operational system, when deciding +NUMA distancing, should consider the associativity information that provides +the shortest distance. + +The spapr implementation does not support multiple associativity arrays per +resource, neither does the pseries Linux kernel. We'll have to represent the +NUMA topology using one associativity per resource, which means that choices +and compromises are going to be made. + +Consider the following NUMA topology entered by user input: + +:: + + 0 1 2 3 + 0 10 20 20 40 + 1 20 10 80 40 + 2 20 80 10 20 + 3 40 40 20 10 + +Honoring just the relative distances of node 0 to every other node, one possible +value for all associativity arrays would be: + +* node 0: 0 B A 0 +* node 1: 0 0 A 1 +* node 2: 0 0 A 2 +* node 3: 0 B 0 3 + +With the reference points {0x4, 0x3, 0x2, 0x1}, for node 0: + +* distance from 0 to 1 is 20 (no match at 0x4, will match at 0x3) +* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3) +* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match + at 0x2) + +The distances related to node 0 are well represented. Doing for node 1, and keeping +in mind that we don't need to revisit node 0 again, the distance from node 1 to +2 is 80, matching at 0x4: + +* node 1: C 0 A 1 +* node 2: C 0 A 2 + +Over here we already have the first conflict. Even if we assign a new associativity +domain at 0x4 for 1 and 2, and we do that in the code, the kernel will define +the distance between 1 and 2 as 20, not 80, because both 1 and 2 have the "A" +associativity domain from the previous step. If we decide to discard the +associativity with "A" then the node 0 distances are compromised. + +Following up with the distance from 1 to 3 being 40 (a match in 0x2) we have another +decision to make. These are the current associativity domains of each: + +* node 1: C 0 A 1 +* node 3: 0 B 0 3 + +There is already an associativity domain at 0x2 in node 3, "B", which was assigned +by the node 0 distances. If we define a new associativity domain at this level +for 1 and 3 we will overwrite the existing associativity between 0 and 3. What +the code is doing in this case is to assign the existing domain to the +current associativity, in this case, "B" is now assigned to the 0x2 of node 1, +resulting in the following associativity arrays: + +* node 0: 0 B A 0 +* node 1: C 0 A 1 +* node 2: C B A 2 +* node 3: 0 B 0 3 + +In the last step we will analyze just nodes 2 and 3. The desired distance between +2 and 3 is 20, i.e. a match in 0x3. Node 2 already has a domain assigned in 0x3, +A, so we do the same as we did in the previous case and assign it to node 3 +at 0x3. This is the end result for the associativity arrays: + +* node 0: 0 B A 0 +* node 1: C 0 A 1 +* node 2: C B A 2 +* node 3: 0 B A 3 + +The kernel will read these arrays and will calculate the following NUMA topology for +the guest: + +:: + + 0 1 2 3 + 0 10 20 20 20 + 1 20 10 20 20 + 2 20 20 10 20 + 3 20 20 20 10 + +Which is not what the user wanted, but it is what the current logic and implementation +constraints of the kernel and QEMU will provide inside the LOPAPR specification. + +Changing a single value, specially a low distance value, makes for drastic changes +in the result. For example, with the same user input from above, but changing the +node distance from 0 to 1 to 40: + +:: + + 0 1 2 3 + 0 10 40 20 40 + 1 40 10 80 40 + 2 20 80 10 20 + 3 40 40 20 10 + +This is the result inside the guest, applying the same heuristics: + +:: + + $ numactl -H + available: 4 nodes (0-3) + (...) + node distances: + node 0 1 2 3 + 0: 10 40 20 20 + 1: 40 10 80 40 + 2: 20 80 10 20 + 3: 20 40 20 10 + +This result is much closer to the user input and only a single distance was changed +from the original. + +The kernel will always match with the shortest associativity domain possible, and we're +attempting to retain the previous established relations between the nodes. This means +that a distance equal to 20 between nodes A and B and the same distance 20 between nodes +A and F will cause the distance between B and F to also be 20. The same will happen to +other distances, but shorter distances has precedent over it to the distance calculation. + +Users are welcome to use this knowledge and experiment with the input to get the +NUMA topology they want, or as closer as they want. The important thing is to keep +expectations up to par with what we are capable of provide at this moment: an +approximation.
This update provides more in depth information about the choices and drawbacks of the new NUMA support for the spapr machine. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- docs/specs/ppc-spapr-numa.rst | 213 ++++++++++++++++++++++++++++++++++ 1 file changed, 213 insertions(+)