Message ID | cover.1611836479.git.petrm@nvidia.com |
---|---|
Series | nexthop: Preparations for resilient next-hop groups |
On 1/28/21 5:49 AM, Petr Machata wrote:
> From: David Ahern <dsahern@kernel.org>
>
> nexthop_free_mpath really should be nexthop_free_group. Rename it.
>
> Signed-off-by: David Ahern <dsahern@kernel.org>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>

On 1/28/21 5:49 AM, Petr Machata wrote:
> The logic for selecting path depends on the next-hop group type. Adapt the
> nexthop_select_path() to dispatch according to the group type.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 1deb9e4df1de..43bb5f451343 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -680,16 +680,11 @@ static bool ipv4_good_nh(const struct fib_nh *nh)
>  	return !!(state & NUD_VALID);
>  }
>
> -struct nexthop *nexthop_select_path(struct nexthop *nh, int hash)
> +static struct nexthop *nexthop_select_path_mp(struct nh_group *nhg, int hash)

FYI: you can use nh as an abbreviation for nexthop for all static
functions in nexthop.c. Helps keep name lengths in check.

Reviewed-by: David Ahern <dsahern@kernel.org>

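For readers following along, the dispatch described in the commit message above can be pictured roughly as below. This is only a userspace sketch: the sketch_* types and functions are hypothetical stand-ins, not the kernel's struct nh_group or the code from the patch, and the multipath selector is a placeholder that indexes by hash rather than implementing hash-threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical group types; the kernel encodes the type differently. */
enum sketch_grp_type {
	SKETCH_GRP_MPATH,
	SKETCH_GRP_OTHER,	/* stands in for a type with no selector yet */
};

/* Hypothetical, heavily simplified next-hop group. */
struct sketch_nh_group {
	enum sketch_grp_type type;
	size_t num_nh;
	const int *nh_ids;
};

/* Per-type selector for multipath groups (placeholder policy). */
static int sketch_select_path_mp(const struct sketch_nh_group *nhg,
				 unsigned int hash)
{
	assert(nhg->num_nh > 0);
	return nhg->nh_ids[hash % nhg->num_nh];
}

/* Entry point: validates the group, then only dispatches by type.
 * Returns a next-hop ID, or -1 when nothing can be selected.
 */
static int sketch_select_path(const struct sketch_nh_group *nhg,
			      unsigned int hash)
{
	if (!nhg || nhg->num_nh == 0)
		return -1;

	switch (nhg->type) {
	case SKETCH_GRP_MPATH:
		return sketch_select_path_mp(nhg, hash);
	default:
		return -1;
	}
}
```

Adding a new group type under this shape means adding one selector and one case, leaving the existing multipath selector untouched.
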
On 1/28/21 5:49 AM, Petr Machata wrote:
> From: Ido Schimmel <idosch@nvidia.com>
>
> Currently there are only two types of in-kernel nexthop notification.
> The two are distinguished by the 'is_grp' boolean field in 'struct
> nh_notifier_info'.
>
> As more notification types are introduced for more next-hop group types, a
> boolean is not an easily extensible interface. Instead, convert it to an
> enum.
>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  .../ethernet/mellanox/mlxsw/spectrum_router.c | 54 ++++++++++++++-----
>  drivers/net/netdevsim/fib.c                   | 23 ++++----
>  include/net/nexthop.h                         |  7 ++-
>  net/ipv4/nexthop.c                            | 14 ++---
>  4 files changed, 69 insertions(+), 29 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>

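To illustrate why an enum extends more easily than the 'is_grp' boolean, here is a small sketch of a notifier consumer. The sketch_* names and enumerators are made up for illustration and are not the ones added to include/net/nexthop.h; the point is only that a listener can switch on the type and reject types it does not know.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical notification types; a bool could only express the first two. */
enum sketch_info_type {
	SKETCH_INFO_SINGLE,
	SKETCH_INFO_GRP,
	SKETCH_INFO_FUTURE,	/* a later addition this listener predates */
};

/* Hypothetical, simplified notifier payload. */
struct sketch_notifier_info {
	enum sketch_info_type type;
};

/* Listener that supports only the types it was written for.
 * Returns 0 when handled, -EOPNOTSUPP for any other type.
 */
static int sketch_handle_notification(const struct sketch_notifier_info *info)
{
	assert(info != NULL);

	switch (info->type) {
	case SKETCH_INFO_SINGLE:
	case SKETCH_INFO_GRP:
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```
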
On 1/28/21 5:49 AM, Petr Machata wrote:
> Requests to dump nexthops have many attributes in common with those that
> requests to dump buckets of resilient NH groups will have. In order to make
> reuse of this code simpler, convert the code to use a single structure with
> filtering configuration instead of passing around the parameters one by
> one.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 44 ++++++++++++++++++++++++--------------------
>  1 file changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 7149b12c4703..ad48e5d71bf9 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -1971,16 +1971,23 @@ static int rtm_get_nexthop(struct sk_buff *in_skb, struct nlmsghdr *nlh,
>  	goto out;
>  }
>
> -static bool nh_dump_filtered(struct nexthop *nh, int dev_idx, int master_idx,
> -			     bool group_filter, u8 family)
> +struct nh_dump_filter {
> +	int dev_idx;
> +	int master_idx;
> +	bool group_filter;
> +	bool fdb_filter;
> +};
> +

I should have made that a struct from the beginning.

Reviewed-by: David Ahern <dsahern@kernel.org>

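As a rough picture of what the conversion buys, the sketch below passes the filter structure from the quoted hunk to a single predicate instead of four loose parameters. struct sketch_nexthop and the matching rules are simplified assumptions for illustration; they do not reproduce the logic of nh_dump_filtered() in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same fields as in the quoted hunk. */
struct nh_dump_filter {
	int dev_idx;
	int master_idx;
	bool group_filter;
	bool fdb_filter;
};

/* Hypothetical, flattened view of a next hop for this sketch only. */
struct sketch_nexthop {
	bool is_group;
	bool is_fdb;
	int dev_idx;
	int master_idx;
};

/* Returns true when the next hop should be skipped by the dump.
 * A zero index or a false flag means "do not filter on this field".
 */
static bool sketch_dump_filtered(const struct sketch_nexthop *nh,
				 const struct nh_dump_filter *filter)
{
	assert(nh != NULL && filter != NULL);

	if (filter->group_filter && !nh->is_group)
		return true;
	if (filter->fdb_filter && !nh->is_fdb)
		return true;
	if (filter->dev_idx && nh->dev_idx != filter->dev_idx)
		return true;
	if (filter->master_idx && nh->master_idx != filter->master_idx)
		return true;
	return false;
}
```

A second dumper with extra criteria could embed or reuse the same structure without changing the signature of every helper along the way.
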
On 1/28/21 5:49 AM, Petr Machata wrote:
> Requests to dump nexthops have many attributes in common with those that
> requests to dump buckets of resilient NH groups will have. However, they
> have different policies. To allow reuse of this code, extract a
> policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
> function into a thin wrapper around it.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>

On 1/28/21 5:49 AM, Petr Machata wrote:
> In order to allow different handling for next-hop tree dumper and for
> bucket dumper, parameterize the next-hop tree walker with a callback. Add
> rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
> dumping.
>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
>

Reviewed-by: David Ahern <dsahern@kernel.org>

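The callback parameterization described above follows a common pattern, sketched here in userspace. Everything named sketch_* is hypothetical; the kernel walker iterates a tree of next hops, whereas this sketch assumes a plain array of IDs and a made-up "budget" to model a dump that runs out of room and resumes later.

```c
#include <assert.h>
#include <stddef.h>

/* Per-item callback; a nonzero return stops the walk. */
typedef int (*sketch_walk_cb)(int nh_id, void *data);

/* Generic walker: knows how to iterate and where to resume, nothing else.
 * On early stop, *pos is left at the item that was not consumed.
 */
static int sketch_walk_nexthops(const int *ids, size_t n, size_t *pos,
				sketch_walk_cb cb, void *data)
{
	assert(cb != NULL);

	for (size_t i = *pos; i < n; i++) {
		int err = cb(ids[i], data);

		if (err) {
			*pos = i;
			return err;
		}
	}
	*pos = n;
	return 0;
}

/* Hypothetical dumper state: how many items still fit, how many were dumped. */
struct sketch_dump_ctx {
	int budget;
	int dumped;
};

/* One possible callback: "dumps" an item while the budget lasts. */
static int sketch_dump_cb(int nh_id, void *data)
{
	struct sketch_dump_ctx *ctx = data;

	(void)nh_id;
	if (ctx->budget == 0)
		return -1;
	ctx->budget--;
	ctx->dumped++;
	return 0;
}
```

A bucket dumper would supply a different callback and context while reusing the same walker and resume position.
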
On 1/28/21 5:49 AM, Petr Machata wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
>
> To select a next hop, hash-threshold algorithm first assigns a range of
> hashes to each next hop in the group, and then selects the next hop by
> comparing the SKB hash with the individual ranges. When a next hop is
> removed from the group, the ranges are recomputed, which leads to
> reassignment of parts of hash space from one next hop to another. RFC 2992
> illustrates it thus:
>
>              +-------+-------+-------+-------+-------+
>              |   1   |   2   |   3   |   4   |   5   |
>              +-------+-+-----+---+---+-----+-+-------+
>              |    1    |    2    |    4    |    5    |
>              +---------+---------+---------+---------+
>
>               Before and after deletion of next hop 3
>               under the hash-threshold algorithm.
>
> Note how next hop 2 gave up part of the hash space in favor of next hop 1,
> and 4 in favor of 5. While there will usually be some overlap between the
> previous and the new distribution, some traffic flows change the next hop
> that they resolve to.
>
> If a multipath group is used for load-balancing between multiple servers,
> this hash space reassignment causes an issue that packets from a single
> flow suddenly end up arriving at a server that does not expect them, which
> may lead to TCP reset.
>
> If a multipath group is used for load-balancing among available paths to
> the same server, the issue is that different latencies and reordering along
> the way causes the packets to arrive in wrong order.
>
> Resilient hashing is a technique to address the above problem. Resilient
> next-hop group has another layer of indirection between the group itself
> and its constituent next hops: a hash table. The selection algorithm uses a
> straightforward modulo operation to choose a hash bucket, and then reads
> the next hop that this bucket contains, and forwards traffic there.
>
> This indirection brings an important feature. In the hash-threshold
> algorithm, the range of hashes associated with a next hop must be
> continuous. With a hash table, mapping between the hash table buckets and
> the individual next hops is arbitrary. Therefore when a next hop is deleted
> the buckets that held it are simply reassigned to other next hops:
>
>              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>              |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
>              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>                               v v v v
>              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>              |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
>              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
>               Before and after deletion of next hop 3
>               under the resilient hashing algorithm.
>
> When weights of next hops in a group are altered, it may be possible to
> choose a subset of buckets that are currently not used for forwarding
> traffic, and use those to satisfy the new next-hop distribution demands,
> keeping the "busy" buckets intact. This way, established flows are ideally
> kept being forwarded to the same endpoints through the same paths as before
> the next-hop group change.
>
> This patchset prepares the next-hop code for eventual introduction of
> resilient hashing groups.
>
> - Patches #1-#4 carry otherwise disjoint changes that just remove certain
>   assumptions in the next-hop code.
>
> - Patches #5-#6 extend the in-kernel next-hop notifiers to support more
>   next-hop group types.
>
> - Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
>   will introduce a new logical object, a hash table bucket. It turns out
>   that handling bucket-related messages is similar to how next-hop messages
>   are handled. These patches extract the commonalities into reusable
>   components.
>
> The plan is to contribute approximately the following patchsets:
>
> 1) Nexthop policy refactoring (already pushed)
> 2) Preparations for resilient next hop groups (this patchset)
> 3) Implementation of resilient next hop group
> 4) Netdevsim offload plus a suite of selftests
> 5) Preparations for mlxsw offload of resilient next-hop groups
> 6) mlxsw offload including selftests
>
> Interested parties can look at the current state of the code at [2] and
> [3].
>
> [1] https://tools.ietf.org/html/rfc2992
> [2] https://github.com/idosch/linux/commits/submit/res_integ_v1
> [3] https://github.com/idosch/iproute2/commits/submit/res_v1
>

Very easy to review patchset. Thank you for that and for this cover
letter with the end goal and progress.

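The difference between the two schemes in the cover letter can be tried out with a small userspace sketch. It makes several assumptions that are not in the patchset: equal weights, a 32-bit hash, a fixed table of 20 buckets as in the second diagram, and a round-robin policy for refilling freed buckets. None of the function names below exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Table size taken from the diagram, not from any kernel default. */
#define NUM_BUCKETS 20

/* Hash-threshold, unweighted: live next hop i of n owns the contiguous
 * range [i * 2^32 / n, (i + 1) * 2^32 / n). Returns the index i.
 */
static size_t select_hash_threshold(uint32_t hash, size_t n)
{
	assert(n > 0);
	return (size_t)(((uint64_t)hash * n) >> 32);
}

/* Resilient: a modulo picks the bucket, the bucket names the next hop. */
static int select_resilient(const int *buckets, uint32_t hash)
{
	return buckets[hash % NUM_BUCKETS];
}

/* Reassign only the buckets that held the removed next hop, handing them
 * to the survivors round-robin (an assumed policy). All other buckets,
 * and hence the flows hashing to them, are left alone.
 */
static void remove_nexthop(int *buckets, int removed,
			   const int *survivors, size_t n_survivors)
{
	size_t next = 0;

	assert(n_survivors > 0);
	for (size_t i = 0; i < NUM_BUCKETS; i++) {
		if (buckets[i] == removed) {
			buckets[i] = survivors[next % n_survivors];
			next++;
		}
	}
}
```

With five live next hops, hash 0x38000000 lands on index 1 (next hop 2); after next hop 3 is removed and the ranges are recomputed for four next hops, the same hash lands on index 0 (next hop 1), which is the kind of flow migration the first diagram shows. With the bucket table from the second diagram, removing next hop 3 rewrites only buckets 8 through 11 (to 1, 2, 4, 5 under the round-robin assumption), so a flow hashing to any other bucket keeps its next hop.
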
Hello:

This series was applied to netdev/net-next.git (refs/heads/master):

On Thu, 28 Jan 2021 13:49:12 +0100 you wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
>
> To select a next hop, hash-threshold algorithm first assigns a range of
> hashes to each next hop in the group, and then selects the next hop by
> comparing the SKB hash with the individual ranges. When a next hop is
> removed from the group, the ranges are recomputed, which leads to
> reassignment of parts of hash space from one next hop to another. RFC 2992
> illustrates it thus:
>
> [...]

Here is the summary with links:
  - [net-next,01/12] nexthop: Rename nexthop_free_mpath
    https://git.kernel.org/netdev/net-next/c/5d1f0f09b5f0
  - [net-next,02/12] nexthop: Dispatch nexthop_select_path() by group type
    https://git.kernel.org/netdev/net-next/c/79bc55e3fee9
  - [net-next,03/12] nexthop: Introduce to struct nh_grp_entry a per-type union
    https://git.kernel.org/netdev/net-next/c/b9bae61be466
  - [net-next,04/12] nexthop: Assert the invariant that a NH group is of only one type
    https://git.kernel.org/netdev/net-next/c/720ccd9a7285
  - [net-next,05/12] nexthop: Use enum to encode notification type
    https://git.kernel.org/netdev/net-next/c/09ad6becf535
  - [net-next,06/12] nexthop: Dispatch notifier init()/fini() by group type
    https://git.kernel.org/netdev/net-next/c/da230501f2c9
  - [net-next,07/12] nexthop: Extract dump filtering parameters into a single structure
    https://git.kernel.org/netdev/net-next/c/56450ec6b7fc
  - [net-next,08/12] nexthop: Extract a common helper for parsing dump attributes
    https://git.kernel.org/netdev/net-next/c/b9ebea127661
  - [net-next,09/12] nexthop: Strongly-type context of rtm_dump_nexthop()
    https://git.kernel.org/netdev/net-next/c/a6fbbaa64c3b
  - [net-next,10/12] nexthop: Extract a helper for walking the next-hop tree
    https://git.kernel.org/netdev/net-next/c/cbee18071e72
  - [net-next,11/12] nexthop: Add a callback parameter to rtm_dump_walk_nexthops()
    https://git.kernel.org/netdev/net-next/c/e948217d258f
  - [net-next,12/12] nexthop: Extract a helper for validation of get/del RTNL requests
    https://git.kernel.org/netdev/net-next/c/0bccf8ed8aa6

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html